Q & A PARSER

Submitted By

Name of the Student      Roll No.
PRANAY B. MHATRE         30
SUDHARMA S. PATIL        37
SWAPNIL B. PRADHAN       41
In partial fulfillment for the award of
Bachelor of Engineering (Computer Engineering)
Guided by Ms. Deepti Vijay Chandran
Department of Computer Engineering
Smt. Indira Gandhi College of Engineering
Affiliated to Mumbai University, Mumbai (M.S.)
(2010 - 2011)
Q & A PARSER

Submitted by

Name of the Student      Roll No.
PRANAY B. MHATRE         30
SUDHARMA S. PATIL        37
SWAPNIL B. PRADHAN       41
In partial fulfillment of
Bachelor of Engineering (Computer Engineering)
Name of the Guide: Ms. Deepti Vijay Chandran

Department of Computer Engineering
Smt. Indira Gandhi College of Engineering
Koparkharine, Navi Mumbai.
(2010 - 2011)
CERTIFICATE
This is to certify that the project "Q & A PARSER" submitted by

Name of the Student      Roll No.
PRANAY B. MHATRE         30
SUDHARMA S. PATIL        37
SWAPNIL B. PRADHAN       41
is a bonafide work completed under my supervision and guidance in partial fulfillment for the award of the Bachelor of Engineering (Computer Engineering) Degree of Mumbai University, Mumbai.
Place: Koparkharine                    Date:
(Ms. Deepti Vijay Chandran) Guide
(Prof. K.T. Patil) Head of the Department
Examiner
(Dr. S.K. Narayankhedkar)
Principal
Smt. Indira Gandhi College of Engineering, Koparkharine, Navi Mumbai.
INDEX

CONTENTS

1. INTRODUCTION
   1.1 Problem Definition
   1.2 Q & A Parser (NLP)
   1.3 Scope of the Project
   1.4 Methodology Used

2. LITERATURE SURVEY

3. PROJECT SCHEDULE & TIMELINE CHART
   3.1 Milestones and Timeline

4. REQUIREMENT GATHERING
   4.1 Hardware Requirement
   4.2 Software Requirement
   4.3 Feasibility Study
   4.4 Language Used

5. DESIGN
   5.1 Software Development Model
   5.2 UML Diagrams

6. IMPLEMENTATION
   6.1 Basic Algorithm
   6.2 Front End: JAVA
   6.3 Functional Components of the Project
   6.4 Screen Layout
   6.5 More Results
   6.6 Software Results

7. CONCLUSION
   7.1 Advantages
   7.2 Disadvantages
   7.3 Future Aspects

REFERENCES
List of Figures:

Fig. No.   Diagram
1          Spiral Model
2          Sequence Diagram
4          Use Case Diagram
5          Class Diagram
6          Activity Diagram
7          State Chart Diagram
List of Tables:

Table      Name
Table 1    Milestones and Timelines
Table 2    Input Output Table

List of Graphs:

Graph      Name
Graph 1    Timelines
Chapter 1 INTRODUCTION
1.1 Problem Definition: Why Natural Language Processing is a Critically Needed Technology
Anyone who has used a search engine to perform market, consulting, or financial research can tell you the pain of spending hours looking for the answer to a seemingly simple question. Add up all the questions a researcher must ask and the hours really rack up. Just how big is the search problem? According to International Data Group, the average knowledge worker makes $60,000 per year, of which $14,000 is spent on search, and knowledge workers spend 24% of their time searching. Here is a quote from Network World: "A company that employs 1,000 information workers can expect more than $5 million in annual salary costs to go down the drain because of the time wasted looking for information and not finding it, IDC research found last year." Furthermore, an Accenture study found that 50% of the information retrieved in search by middle managers is useless. In the document-heavy financial services sector, researchers are frequently forced to give up looking for answers, or cannot check the accuracy of answers against multiple sources because it would be time-prohibitive.

Senior risk management is made up of a firm's most senior executives, whose job is to evaluate whether you are doing your job correctly to mitigate risk at the uppermost levels of the firm. Now imagine you are on the phone with your firm's senior risk managers (your boss's boss's boss) and you are asked a question you don't know the answer to. Imagine if you could type a short question into a search box and come up with an answer in time to provide an intelligent and correct response. That is the power of natural language processing: you type a question in natural language and are provided with an instant result containing the answer that saves the day.

Our biggest challenge was being flexible enough to revisit our designs when it turned out that our original ideas were not as viable as we had hoped. During initial planning, we felt that we would need to apply an array of different techniques to bring the system together, including named entity recognition, part-of-speech tagging, reference resolution, and text classification. However, once we started building the system we were surprised to find that our test results did not always align with our ideas. For instance, using named entity recognition to generate questions did not always generate fluent or even answerable questions. In the case of our answer program, however, the initial simple solution turned out to be quite effective by itself.

More concretely, the area where we struggled the most was question asking. When we first began considering options, we hadn't learned much about the necessary tools. Parsing in particular wasn't something we had spent much time on in class at that point, so we didn't realize how useful it could be. Instead, we focused on using named entity recognition, an approach that didn't pan out. In addition, while we planned to ask questions of all difficulty levels, during testing we found that our "medium" and "hard" questions were too often either disfluent or unanswerable given the article. This motivated us to restrict our question asking to "easy"-difficulty questions.
1.2 Q & A parser (NLP)
Natural Language Processing (NLP) is the technology that evaluates the relationships of words, such as actions, entities, or events, contained within unstructured text, meaning sentences within paragraphs found in a variety of text-based documents. Question Answering Natural Language Processing Search is the NLP technology that specifically solves the problem of finding answers to a question, which can be asked by simply entering it into a search interface using natural human language, for example, "Who is Barack Obama?"

Unlike keyword search in Google or Yahoo, for example, Natural Language Processing Question Answering Search specifically allows users to ask questions in their natural language and then retrieves the most relevant answers within seconds. The standard search process requires the execution of multiple keyword combinations that force the searcher to click on links, only too frequently to find no answer, and the process of searching and clicking continues until the user finds something or gives up. With Natural Language Processing Search there is no extra work and no need to search multiple links, resulting in immense time savings. Entering a question is simple for the user even though the technology behind the scenes is highly complex.
Fig. 1 - Comparison of search
Natural Language Processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. In theory, natural-language processing is a very attractive method of human-computer interaction. Natural-language understanding is sometimes referred to as an AI-complete problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it.
NLP has significant overlap with the field of computational linguistics, and is often considered a sub-field of artificial intelligence. Modern NLP algorithms are grounded in machine learning, especially statistical machine learning. Research into modern statistical NLP algorithms requires an understanding of a number of disparate fields, including linguistics, computer science, statistics (particularly Bayesian statistics), linear algebra and optimization theory.

Text-to-Speech Intro:

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. [1]
Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.[2] The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.
1.2 Objective of Project
The goal of the Question & Answer parser (NLP) is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person. This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way.

1.3 Scope of Project
In this project, we developed the Q & A system, an open-domain question answering system based on parsing and the Reed-Kellogg algorithm. It does not try to understand the semantics of a question or answer; instead it uses statistical methods that rely on data redundancy. In addition, some linguistic transformations of the question are performed. It can handle factoid questions and definition questions. For most types of questions, it tries to return accurate answers. If this is not possible, it returns short text passages, which are single sentences or sentence fragments that are assumed to contain the answer.
1.4 Methodology Used
The model for the system development life cycle that I have used is the Spiral Lifecycle Model. This model of development combines the features of the prototyping model and the waterfall model.
Fig 1. Spiral Model
Chapter 2 LITERATURE SURVEY
2. Literature Survey
[Figure 2 shows a text-based image retrieval flow: the user enters a query ("Who are the staffs of PSG?"), keywords ("who", "staffs", "PSG") are extracted and passed to a search engine, which looks up a conventional database of images with descriptions and returns a ranked result set.]
Fig. 2 - Text-based image retrieval

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example responding to "My head hurts" with "Why do you say your head hurts?"

During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.

Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's Law and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in their success. As a result, a great deal of research has gone into methods of learning more effectively from limited amounts of data.
Review of Literature

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
Project Review:
First, the question (from the upper field) is parsed to get a Reed-Kellogg tree syntax graph. Then the graph is transformed into its direct answer form; for example, a Question Clause syntax node is replaced with a Clause syntax node. The resulting graph is used as a syntax-lexical pattern. The algorithm then scans the text in the second field and tries to find the utterances most similar to the pattern. First, it compares a syntax node from the pattern with a syntax node from the target text. If the syntax nodes match, it compares the meanings of the words on the nodes; to compare word meanings it simply compares the Lexemes. If both syntax and meanings match, the algorithm goes down the syntax trees and builds the syntax fragment common to both utterances. The more syntax nodes that have been matched, the higher the matching score. The best answers are shown as the result. If the question has a question word, the tool ensures that the question word is always matched. The node in the answer graph which matches the question word is the short answer (possibly with all underlying words in the syntax tree).
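The matching and scoring step can be pictured with a short sketch. The fragment below is a simplified, self-contained illustration of the idea only, not the NlpLib implementation: the Node type, its Role and Lexeme fields, and the Score function are hypothetical stand-ins for the real Reed-Kellogg syntax nodes and Word/Lexeme comparisons, and the score simply counts matched nodes.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, much-simplified stand-in for a Reed-Kellogg syntax node:
// a syntactic role label, the lexeme at the node, and the child nodes.
class Node
{
    public string Role;
    public string Lexeme;
    public List<Node> Children = new List<Node>();
}

static class Matcher
{
    // A node contributes to the score only if both its syntactic role and its
    // lexeme match; matching then continues recursively into the children,
    // taking the best-matching candidate child for each pattern child.
    public static int Score(Node pattern, Node candidate)
    {
        if (pattern == null || candidate == null) return 0;
        if (pattern.Role != candidate.Role) return 0;
        if (!string.Equals(pattern.Lexeme, candidate.Lexeme,
                           StringComparison.OrdinalIgnoreCase)) return 0;

        int score = 1; // this node matched
        foreach (Node pc in pattern.Children)
            score += candidate.Children.Select(cc => Score(pc, cc))
                                       .DefaultIfEmpty(0).Max();
        return score;
    }
}

class Demo
{
    static void Main()
    {
        // Pattern built from the question, candidate built from a target utterance.
        var pattern = new Node { Role = "Subject", Lexeme = "river",
            Children = { new Node { Role = "Modifier", Lexeme = "longest" } } };
        var candidate = new Node { Role = "Subject", Lexeme = "river",
            Children = { new Node { Role = "Modifier", Lexeme = "longest" },
                         new Node { Role = "Modifier", Lexeme = "second" } } };
        Console.WriteLine(Matcher.Score(pattern, candidate)); // prints 2
    }
}

The candidate utterance with the highest score is reported as the answer; rejecting candidates whose node for the question word fails to match corresponds to the rule above that the question word must always be matched.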
Review of Text to Speech

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end, often referred to as the synthesizer, then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech.
Fig.3 TTS synthesis
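On Windows, a ready-made front-end/back-end pipeline of this kind is exposed by the System.Speech library, which is one way the final answer could be read out in voice form, as the later flowchart suggests. A minimal sketch, assuming a .NET Framework project with a reference to System.Speech.dll; the answer string is just an example.

using System.Speech.Synthesis; // requires a reference to System.Speech.dll

class SpeakAnswer
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            // Text normalization and grapheme-to-phoneme conversion happen
            // inside the synthesizer before the waveform is produced.
            synth.SetOutputToDefaultAudioDevice();
            synth.Speak("The second longest river in China is Huanghe.");
        }
    }
}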
Chapter 3 PROJECT SCHEDULE & TIMELINE CHART
3.1 Milestones and Timeline
No. | Milestone Name | Milestone Description | Timeline | Remarks
1 | Requirement Specification | A requirement specification document should be delivered. | 3 weeks | An attempt should be made to identify additional features which can be incorporated at a later point of time. Brainstorming involving all the members should be done.
2 | Technology Familiarization | Understanding of the technology. Each person should make themselves an expert in each of the technologies and should arrange a half-day session to share the information and come up with a document for reference. | 5 weeks |
3 | System Setup | Set up a test environment. | 1 week |
- | Design | A high-level architecture diagram and detailed design of all the modules. A data dictionary document should also be delivered. | 2 weeks |
4 | Implementation of 1st phase | Working code for the 1st module should be developed. This should bring up the Server Program and admin functionality. | 6 weeks |
5 | Testing and rework for 1st phase | Testing and fixing the bugs. | 1.5 weeks |
6 | Implementation of 2nd phase | The working code for the 1st and 2nd phase. | 4 weeks |
7 | Testing and rework for 2nd phase | Testing and fixing the bugs. | 1.5 weeks |
8 | Implementation of 3rd phase | The working code for the 1st and 2nd phase. | 5 weeks |
9 | Testing and rework of the entire application | Testing and fixing the bugs. | 3 weeks |
10 | Deployment of the application | Deploy the application. | 1 week |

Table 1. Milestones and Timeline
Graph 1. Timeline

[Graph 1 is a week-by-week Gantt chart running from August through April. The tasks tracked are:]

1. Identify need of the project
2. Research: (a) existing software, (b) what improvement we can provide
3. Requirement analysis
4. Conduct feasibility study
5. Designing
6. Designing of GUI
7. Coding of GUI
8. Algorithm development
9. Algorithm implementation
10. Designing and coding of software modules
11. Complete software testing
12. Analysis of different conditions
13. Final release of the software
Chapter 4 REQUIREMENT ANALYSIS
4.1 System Requirement
HARDWARE CONFIGURATION

The hardware used for the development of the project is:

PROCESSOR      : PENTIUM III 866 MHz
RAM            : 128 MB SDRAM
MONITOR        : 15" COLOR
HARD DISK      : 20 GB
FLOPPY DRIVE   : 1.44 MB
CD DRIVE       : LG 52X
KEYBOARD       : STANDARD 102 KEYS
MOUSE          : LOGITECH MOUSE

SOFTWARE CONFIGURATION

The software used for the development of the project is:

OPERATING SYSTEM : Windows XP Professional
ENVIRONMENT      : Visual Studio 2008
4.2 Feasibility Study

Feasibility analysis is performed to choose the system that meets the performance requirements at least cost. The most essential tasks performed by a feasibility analysis are the identification and description of candidate systems and the selection of the best of the candidate systems. The best system means the system that meets the performance requirements at the least cost. The most difficult part of a feasibility analysis is the identification of the candidate systems and the evaluation of their performance and cost. The new system has no additional expenses to implement. It has advantages such as: files can easily be accessed from any client in the network, accurate output is produced for accurate input, and the application is more user-friendly. We can use this application not only in this organization but also in other firms, so the problem is worth solving. Analysts should concentrate on providing the answers to four key questions:
How much? The cost of the new system
What? The objectives of the new system
When? The delivery timescale
How? The means and procedures used to produce the new system.
4.2.1 Economical Feasibility:
Economic analysis is the most frequently used method for evaluating the effectiveness of a candidate system. More commonly known as cost/benefit analysis, the procedure is to determine the benefits and savings that are expected from a candidate system and compare them with the costs. This analysis phase determines how much it will cost to produce the proposed system. This system is economically feasible since it does not require any initial set-up cost, as the organization already has the required machines and supporting programs for the application to run. It does not need additional staffing.
4.2.2 Technical Feasibility:
Technical feasibility analysis is performed to check whether the proposed system is technically feasible or not. Technical feasibility centers on the existing computer system (hardware, software, etc.) and to what extent it can support the proposed addition. This involves financial consideration to accommodate technical enhancements. This project is technically feasible. The input can be given through a mobile device, and the interface is both interactive and user-friendly. A normal user can also operate the system.
4.2.3 Operational Feasibility:
Operational feasibility analysis is performed to check whether the system is operationally feasible or not. Using command buttons throughout the program enhances operational feasibility, so maintenance and modification are easier. Will the system be used if it is developed and implemented? Will there be resistance from users that will undermine the possible application benefits? The feasibility study is carried out by a small group of people who are familiar with information systems techniques, understand the part of the business or organization that will be involved or affected by the project, and are skilled in the systems analysis and design process. The project is feasible in all respects, since it uses existing resources; there is no need to spend more money to buy an entirely new system. It enhances the performance of the network, so it is operationally feasible.
4.3 Language Used: C# (C Sharp)

1. C# is a simple, modern, object-oriented language derived from C++ and Java.
2. It aims to combine the high productivity of Visual Basic and the raw power of C++.
3. It is a part of Microsoft Visual Studio 7.0.
4. Visual Studio supports VB, VC++, C++, VBScript and JScript. All of these languages provide access to the Microsoft .NET platform.
5. .NET includes a common execution engine and a rich class library.
6. Microsoft's equivalent of the JVM is the Common Language Runtime (CLR).
7. The CLR accommodates more than one language, such as C#, VB.NET, JScript, ASP.NET and C++.
8. Source code ---> Intermediate Language code (IL) ---> (JIT compiler) ---> native code.
9. The classes and data types are common to all of the .NET languages.
10. We may develop console applications, Windows applications and web applications using C#.
11. In C#, Microsoft has taken care of C++ problems such as memory management, pointers, etc.
12. It supports garbage collection, automatic memory management and more.
MAIN FEATURES OF C#
Simple
Modern
Object Oriented
Type Safe
Interoperability
Scalable & Updateable
SIMPLE:
1. Pointers are missing in C#.
2. Unsafe operations such as direct memory manipulation are not allowed.
3. In C# there is no usage of the "::" or "->" operators.
4. Since it is built on .NET, it inherits the features of automatic memory management and garbage collection.
5. Varying ranges of the primitive types like Integer, Float, etc.
6. Integer values of 0 and 1 are no longer accepted as Boolean values. Boolean values are pure true or false values in C#, so there are no more errors from mixing up the "=" operator and the "==" operator: "==" is used for comparison and "=" is used for assignment.
MODERN
1. C# has been designed according to current trends and is very powerful and simple for building interoperable, scalable, robust applications.
2. C# includes built-in support to turn any component into a web service that can be invoked over the Internet from any application running on any platform.
OBJECT ORIENTED
1. C# supports data encapsulation, inheritance, polymorphism and interfaces.
2. Primitive types (int, float, double) are not objects in Java, but C# has introduced structures (structs) which enable the primitive types to become objects:

   int i = 1;
   string a = i.ToString();   // conversion (or boxing)
TYPE SAFE
1. In C# we cannot perform unsafe casts, such as converting a double to a Boolean.
2. Value types (primitive types) are initialized to zeros and reference types (objects and classes) are initialized to null by the compiler automatically.
3. Arrays are zero-base indexed and are bound checked.
4. Overflow of types can be checked (see the sketch below).
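A small, self-contained illustration of points 3 and 4 (bounds checking and checked overflow):

using System;

class TypeSafetyDemo
{
    static void Main()
    {
        int[] a = new int[3];
        try
        {
            a[3] = 1; // arrays are bound checked: index 3 is out of range
        }
        catch (IndexOutOfRangeException e)
        {
            Console.WriteLine(e.Message);
        }

        try
        {
            int big = int.MaxValue;
            int overflowed = checked(big + 1); // overflow of types can be checked
            Console.WriteLine(overflowed);
        }
        catch (OverflowException e)
        {
            Console.WriteLine(e.Message);
        }
    }
}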
INTEROPERABILITY
1. C# includes native support for COM and Windows-based applications.
2. Restricted use of native pointers is allowed.
3. Users no longer have to explicitly implement IUnknown and the other COM interfaces; those features are built in.
4. C# allows the use of pointers in unsafe code blocks to manipulate old code.
5. Components from VB.NET and other managed-code languages can directly be used in C#.
SCALABLE AND UPDATEABLE
1. .NET has introduced assemblies, which are self-describing by means of their manifest. The manifest establishes the assembly identity, version, culture, digital signature, etc. (see the sketch below). Assemblies do not need to be registered anywhere.
2. To scale the application we delete the old files and replace them with new ones; no registering of dynamic link libraries is required.
3. Updating software components is an error-prone task, since revisions made to the code can affect existing programs. C# supports versioning in the language, and native support for interfaces and method overriding enables complex frameworks to be developed and evolved over time.
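As a small illustration of point 1, an assembly's identity, version and culture are declared with assembly-level attributes (typically in an AssemblyInfo.cs file) and recorded in the manifest. The values shown here are placeholders, not the project's actual metadata:

using System.Reflection;

// Assembly-level metadata recorded in the manifest.
[assembly: AssemblyTitle("QAParser")]   // placeholder title
[assembly: AssemblyVersion("1.0.0.0")]  // identity/version used by the loader
[assembly: AssemblyCulture("")]         // neutral culture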
4.4 Dataflow Diagrams:
LEVEL 0:

[Fig. 4.4.1 - DFD Level 0: the USER supplies the input question and text (with an optional voice mode) to process 0.0 QUERY PROCESSING, whose output is displayed to the user.]
LEVEL 1:

[Fig. 4.4.2 - DFD Level 1: the USER's input text and question pass through 1.0 SYNTAX ANALYSER, 2.0 PART OF SPEECH TAGGING (producing tokens), 3.0 PARSE TREE GENERATION and 4.0 KEYWORD EXTRACTION. The extracted keywords are fed to the COMPARER, all possible answers are scored by 5.0 MATCHING SCORE, and the main (syntactically right) answer goes to the DISPLAY OF ANSWER.]
LEVEL 2:
Fig.4.4.3 DFD Level 2
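Read as code, the Level 1 flow amounts to a pipeline of five stages. The skeleton below only mirrors the DFD bubbles; the class, the method names and the placeholder bodies are illustrative and are not the project's actual implementation:

using System.Collections.Generic;

// Illustrative skeleton of the DFD Level 1 flow; every body is a placeholder.
class QaPipeline
{
    public string Answer(string inputText, string question)
    {
        List<string> tokens   = SyntaxAnalyser(question);     // 1.0 syntax analysis
        List<string> tagged   = PartOfSpeechTagging(tokens);  // 2.0 POS tagging
        object parseTree      = ParseTreeGeneration(tagged);  // 3.0 parse tree generation
        List<string> keywords = KeywordExtraction(parseTree); // 4.0 keyword extraction
        return Comparer(inputText, keywords);                 // 5.0 matching score -> main answer
    }

    List<string> SyntaxAnalyser(string s) => new List<string>(s.Split(' '));
    List<string> PartOfSpeechTagging(List<string> tokens) => tokens;                 // placeholder
    object ParseTreeGeneration(List<string> tagged) => tagged;                       // placeholder
    List<string> KeywordExtraction(object parseTree) => new List<string>();          // placeholder
    string Comparer(string text, List<string> keywords) => "best-scoring sentence";  // placeholder
}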
Chapter 5 DESIGN
5.1 SYSTEM ARCHITECTURE:

5.2 INTERFACE DESIGN:

[The interface mock-up is titled "Q & A PARSER USING NLP" and provides fields for the input sentences, the question, the answer, the sentence parse tree and the question parse tree.]
5.3 CLASSES USED FOR PROCESSING AN APPLICATION:

1. Lexeme class description:

Namespace: Nlp4Net.NlpLib
Assembly: NlpLib.dll

public class Lexeme : IUserData, ICloneable

A Lexeme is a string of characters. There are three types (Lexeme.LexType) of Lexemes. Lexemes with syntax and semantic information contain Words.

There may be several syntactically different Words associated with the same Lexeme. For example, the lexeme "code" has two Words, a noun and a verb; it plays different syntax roles and carries different semantics in the following utterances: "We code the project. The code is complex." Which Word is used can be determined only during higher levels of processing. Words may belong to different languages; currently NlpLib supports only the en-US language. You can exploit lexical ambiguity in OCR or speech recognition when a Lexeme is not clearly recognized: instead of processing different lexemes, overload the same Lexeme with the possible Words and let the syntax parser make the choice.
2. NLParser class description:
Namespace: Nlp4Net.NlpLib
Assembly: NlpLib.dll

public class NLParser : IUserData

NLParser is a natural language parser that allows lexical and syntax parsing of English text, converting plain text into Lexemes and Utterances. The easiest way is to use the NLParser.Text() enumerators:

// Requires: using System; using System.Text; and a reference to NlpLib.dll.
NLParser parser = new NLParser();
foreach (Lexeme lexeme in parser.Text(@"c:\test.txt", Encoding.UTF8))
{
    // Print word-type lexemes that carry no dictionary Words (unknown words).
    if ((Lexeme.LexType.word == lexeme.LexemeType) && !lexeme.HasWords)
        Console.WriteLine(lexeme.Text);
}

Alternatively, you can use the Parse() method and subscribe to the NLParser.OnLexeme or NLParser.OnUtterance events. When parsing is complete, call the Flush() method, otherwise some portion of the text may remain in internal buffers. You can continue parsing after calling Flush(). Flush() may also be used if you want to force the end of an utterance. Note: NLParser accepts plain text or Lexemes as input.
3. SyntaxNode class description:
Namespace: Nlp4Net.NlpLib
Assembly: NlpLib.dll

public class SyntaxNode : IUserData, ICloneable

SyntaxNode is used to build a Reed-Kellogg tree graph. The root of a tree graph represents the syntax diagram of an Utterance. An Utterance allows multiple diagrams when the syntax is ambiguous. A SyntaxNode may have associated Words. SyntaxNode has methods allowing sorting and search operations. SyntaxNode also allows syntax graph transformation: for example, an Utterance can be simplified by cutting less important parts. Another typical transformation is the conversion between question and answer forms in natural language queries.
4. Utterance class description:
Namespace: Nlp4Net.NlpLib
Assembly: NlpLib.dll

public class Utterance : IUserData, ICloneable
Utterance represents the smallest piece of exchangeable semantic information. You can analyze syntax relations between Words to convert Utterance.Syntaxes into your semantic representation. Utterance is the result of applying Natural Language processor to a text.
5.4 FLOWCHART:
Answer in Voice Display Answer
Form
Example:
Display Answer in Voice Form
5.5 UML Diagrams:

1.0 Use Case Diagram:
1.1:
1.2:
2.0 Class Diagram:
3.0 Sequence Diagram:
4.0 Activity Diagram:
Chapter 6 IMPLEMENTATION
6.1 ALGORITHM DEVELOPMENT:

Text file:
The longest river in the world is Nile. The second longest river in China is Huanghe. The Yangtze River is the longest river in Asia. Strawberries contain no fat. There is nearly no fat in strawberries.

Step 1: The input question is entered, for example: What is the second longest river in China?

Step 2: Once the input question has been given, the program checks that it is a valid question using the Reed-Kellogg syntax function, i.e. it checks whether we have a proper question. To find the utterances in the text above, we use the code below:
NLParser parser = new NLParser();
foreach (Utterance utterance in parser.Text(@"c:\test.txt", Encoding.UTF8))
{
    // Print the first syntax diagram of every utterance found in the text file.
    if ((null != utterance.Syntaxes) && (0 != utterance.Syntaxes.Length))
        Console.WriteLine(utterance.Syntaxes[0].ToString());
}

And this is the output of the code:
This is the syntax diagram output if the phrase is "second longest river in China":
Step 3: If no question is found in the syntax tree, the message below is returned:

if (null == question)
{
    Console.WriteLine("Cannot parse question: " + szQuestion);
    return;
}

If there are no utterances in the sentence, it will display "no answer found".

Step 4: The syntax graphs are matched. This is the syntax graph for the question "What is the second longest river in China?":

And this is the syntax graph for the input text, i.e. "the second longest river in China is Huanghe":

Step 5: If the syntax nodes match, then the meanings of the words associated with the syntax nodes are compared. Here the meanings of the words are their Lexemes; for example, if "river" is a noun in the question and it is also a noun in the input text, then the "river" lexeme is matched.

Step 6: If both the syntax and the meanings are equal, the utterances are considered to match and the matching score is incremented. The more lexemes that match, the higher the score, and the highest-scoring match becomes the output answer.

Once the matches have been found, we have the output.
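To make steps 4 to 6 concrete, here is a deliberately simplified, self-contained surrogate: instead of matching Reed-Kellogg graphs and Lexemes, it ranks the candidate sentences by counting how many question words each one contains, but the pick-the-highest-score idea is the same as in the steps above.

using System;
using System.Linq;

class SimpleRanker
{
    static void Main()
    {
        string text = "The longest river in the world is Nile. " +
                      "The second longest river in China is Huanghe. " +
                      "The Yangtze River is the longest river in Asia. " +
                      "Strawberries contain no fat.";
        string question = "What is the second longest river in China?";

        char[] seps = { ' ', '.', '?', ',' };
        string[] qWords = question.ToLower()
                                  .Split(seps, StringSplitOptions.RemoveEmptyEntries);

        // Score each sentence by how many question words it contains,
        // then report the best-scoring sentence as the answer.
        string best = text.Split('.')
            .Select(s => s.Trim())
            .Where(s => s.Length > 0)
            .OrderByDescending(s => qWords.Count(w =>
                s.ToLower().Split(seps, StringSplitOptions.RemoveEmptyEntries).Contains(w)))
            .First();

        Console.WriteLine(best); // "The second longest river in China is Huanghe"
    }
}

In the real system the score comes from the matched syntax nodes and Lexemes rather than raw word overlap, which is what lets it distinguish sentences with similar vocabulary but different syntax.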
6.2 APPLICATION:
6.3 HIPO CHART:
Fig 6.3 HIPO chart
6.4 IPO
Fig.6.4 IPO
Chapter 7 CONCLUSION
7.1 CONCLUSION:
Q & A systems have been extended in recent years to explore critical new scientific and practical dimensions. For example, systems have been developed to automatically answer temporal and geospatial questions, definitional questions, biographical questions, multilingual questions, and questions about multimedia (e.g., audio, imagery, video). Additional aspects such as interactivity (often required for clarification of questions or answers), answer reuse, and knowledge representation and reasoning to support question answering have been explored. Future research may explore what kinds of questions can be asked and answered about social media, including sentiment analysis.

Some problems in word segmentation and POS tagging need to be handled in a more general way in order to apply this model to a wider domain of applications. In our model, these problems are solved by a technical solution: defining a dictionary for the system and assigning POS labels to words. This is only suitable in the case of a specific application with some clear information about the structure of the data and predictable searching scenarios.
7.2 ADVANTAGES:-
1. Quick response time
2. Customized processing
3. Small memory footprint
4. No database needed
7.3 DISADVANTAGES:-
1. Cannot decode complex sentences.
2. Since it is the first version, it will have some bugs.
3. Cannot take more than 15 sentences.
7.4 FUTURE MODIFICATIONS:

1) Increasing sentence capacity
2) Adding spell check features
3) Voice based browser
4) Text Summarization
5) Language Recognizer & Translator
REFERENCES:

IEEE paper: Nguyen Tuan Dang and Do Thi Thanh Tuyen, "Natural Language Question Answering Model Applied to Document Retrieval System".

[1] Enrique Alfonseca, Marco De Boni, José-Luis Jara-Valencia, Suresh Manandhar, "A prototype Question Answering system using syntactic and semantic information for answer retrieval", Proceedings of the 10th Text Retrieval Conference, 2002.
[2] Carlos Amaral, Dominique Laurent, "Implementation of a QA system in a real context", Workshop TellMeMore, November 24, 2006.
[3] Eric Brill, Susan Dumais, Michele Banko, "An Analysis of the AskMSR Question-Answering System", Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002.
[4] Boris Katz, Jimmy Lin, "Selectively Using Relations to Improve Precision in Question Answering", Proceedings of the EACL 2003 Workshop on Natural Language, 2003.
[5] Boris Katz, Beth Levin, "Exploiting Lexical Regularities in Designing Natural Language Systems", Proceedings of the 12th International Conference on Computational Linguistics.
[6] Chris Callison-Burch, "A computer model of a grammar for English questions", undergraduate honors thesis, Stanford University, 2000.
[7] Nguyen Kim Anh, "Translating the logical queries into SQL queries in natural language query systems", Proceedings of the ICT.rda'06, Hanoi, 2006.
[8] Nguyen Tuan Dang, Do Thi Thanh Tuyen, "E-Library Searching by Natural Language Question-Answering System", Proceedings of the Fifth International Conference on Information Technology in Education and Training (IT@EDU2008), pages 71-76, Ho Chi Minh City and Vung Tau, Vietnam, December 15-16, 2008.
[9] Nguyen Tuan Dang, Do Thi Thanh Tuyen, "Document Retrieval Based on Question Answering System", accepted paper, The Second International Conference on Information and Computing Science, Manchester, UK, May 21-22, 2009.
[10] Riloff, Mann, Phillips, "Reverse-Engineering Question/Answer Collections from Ordinary Text", in Advances in Open Domain Question Answering, Springer Series: Text, Speech and Language Technology, Vol. 32, 2006.