You can apply Natural language parser to look
up a direct answer for a question. Similar technique can be applied in information retrieval and automatic document
clustering and summarization.
The sample code demonstrates how to convert a question into an answer
pattern and how to search for this pattern in a text.
Question-Answer online shows a web version.
The algorithm takes an input question and converts question form into an answer form. The transformation is relatively
straightforward,
although for some kind of questions you may need a specific transformation. For example for question with subject
"you"
(What are you doing?) the "you" word should be matched with "I", "we", or parallel construction (like
"Tom and myself").
In a real application you may need to add such logic.
Then algorithm reads utterances from a text file, matches them with the pattern
and calculates matching score.
Syntax graphs are matched first. If syntax nodes match, then meanings of words
associated with syntax nodes are compared. If both syntax and meanings are equal, the
syntax node form the pattern and from the
utterance are considered to be equal and the matching score is incremented.
When all text is parsed, the answers with the best score are suggested.
Algorithm makes sure that for questions with a question word the question word is always matched. For example for the question
"who moved my cheese" the answer must contain the word matching the question word
"who".
Syntax-lexical pattern vs. syntax-semantic pattern.
The code uses syntax-lexical pattern. When comparing word meanings it compares the strings
representing a lexeme. It is not quite accurate because word form may vary.
For example verb form depends on verb time in English.
If you want to use syntax-lexical pattern in a real application, check all word inflections as well as synonyms.
If you have a database with word meanings you can map words to the corresponding meanings.
Then you can search for syntax-semantic pattern and compare actual meanings
instead of lexemes. You don't have to build
meaning mapping for all words, but only for words in the application knowledge
domain and for few question words (how, what,
when, how many, etc).
Syntax-semantic pattern can match meaning of a question word, while syntax-lexical pattern lacks this ability.
Direct answer means no inference.
The application doesn't make any inference. A direct answer should appear in the
searched text. For example for the question "how much fat is in strawberries"
the "Strawberries contain no fat" is a correct answer but it is not a direct
answer. Matching this answer requires inference.
Direct answer is not necessary the same sequence of lexemes, because search algorithm relies on syntax
graphs. For example the code finds the correct answer "There is nearly no fat in
strawberries" to the question "how much fat is in strawberries".
Advantages of natural language query.
Comparing with keyword-based searching algorithms natural language query can
find an exact answer because it understands syntax information in a user
question.
Ability to give a short answer when possible is a unique feature, which makes it
more user-friendly and may save time. For example user can read only through a
list of short answers rather than their full versions.
Natural language query allows asking a question to a computer as if talking to a
human.
Syntax-lexical or syntax-semantic comparing algorithm is not question
specific. It can be applied to compare assertion clauses in the same way.
To compare texts you can use multiple
utterances in the search pattern.
C# code
You can copy-paste, compile and execute the code in Visual Studio.
Here is the content of text.txt file. File contains text with a direct answer:
The longest river in the world is Nile.
The second longest river in China is Huanghe.
The Yangtze River is the longest river in Asia.
Strawberries contain no fat.
There is nearly no fat in strawberries.
Here is the output from question.exe :
C:\question>question.exe "What is the second longest river in China?" text.txt
What is the second longest river in China?
Score 10. Huanghe. The second longest river in China is Huanghe.
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
// reference to NlpLib.dll is required
using Nlp4Net.NlpLib;
namespace question
{
internal class CQuestion
{
internal Utterance m_QuestionUtterance;
private bool m_blnHasQuestionWord;
private SyntaxNode[] m_Questions;
internal List<CAnswer> m_lstAnswers;
// prepares the question from Utterance
private CQuestion(Utterance question)
{
m_QuestionUtterance = question;
m_blnHasQuestionWord = false;
List<SyntaxNode> lstQuestions = new List<SyntaxNode>(question.Syntaxes);
for(int i = lstQuestions.Count -1; i > -1; i--)
{
SyntaxNode root = lstQuestions[i];
if (RKStruct.QuestionClause != root.NodeType)
{
// if node is not a question, remove it
lstQuestions.RemoveAt(i);
}
else
{
// check if at least one syntax graph contains a question word
if (null != root.Find(delegate(SyntaxNode child)
{
return (0 != (child.Flags & SyntaxNode.Flag.questionWord));
}))
{
m_blnHasQuestionWord = true;
break;
}
}
}
m_Questions = lstQuestions.ToArray();
// transform question into answer
foreach (SyntaxNode root in m_Questions)
root.NodeType = RKStruct.Clause;
// prepare the storage for answers
m_lstAnswers = new List<CAnswer>();
}
// class factory, checks if Utterances is a questionClause
// and creates a CQuestion
internal static CQuestion create(Utterance question)
{
CQuestion ret = null;
// utterance must have at least one question clause
if (null != question.Syntaxes)
{
foreach (SyntaxNode root in question.Syntaxes)
{
if (RKStruct.QuestionClause == root.NodeType)
{
ret = new CQuestion(question);
break;
}
}
}
return ret;
}
internal bool addAnswer(Utterance utterance)
{
bool ret = false;
int intCurrentScore = (0 == m_lstAnswers.Count) ? 0 : m_lstAnswers[0].m_intMatchScore;
// check if there are any better answers then in lstAnswers
if ((null != utterance.Syntaxes) && (0 != utterance.Syntaxes.Length))
{
foreach (SyntaxNode questionNode in m_Questions)
{
foreach (SyntaxNode answerNode in utterance.Syntaxes)
{
SyntaxNode shortAnswer = null;
int intScore = match(questionNode, answerNode, ref shortAnswer);
// Answer must contain answer word
if (m_blnHasQuestionWord && (null == shortAnswer))
continue;
// answer score must be GE to be added
bool blnAdd = intScore >= intCurrentScore;
// don't want to have same utterance twice
if (ret && (intScore == intCurrentScore))
blnAdd = false;
if (intScore > intCurrentScore)
{
m_lstAnswers.Clear();
intCurrentScore = intScore;
}
if (blnAdd)
{
m_lstAnswers.Add(new CAnswer(intScore, this.m_QuestionUtterance, questionNode
, utterance, answerNode, shortAnswer));
ret = true;
}
}
}
//List<Synt>
}
return ret;
}
private static int charsMatched(string szLeft, string szRight, out int intLen)
{
int ret = 0;
intLen = szLeft.Length;
if (szRight.Length < intLen)
intLen = szRight.Length;
for (int i = 0; i < intLen; i++)
{
if (char.ToLower(szLeft[i]) != char.ToLower(szRight[i]))
break;
ret = i;
}
return ret;
}
private static int sameMeaning(SyntaxNode left, SyntaxNode right)
{
int ret = 0;
if (0 == left.Words.Count)
{
if (0 == right.Words.Count)
ret = 1;
}
else
{
if (0 != right.Words.Count)
{
List<Word> lstRight = new List<Word>(right.Words);
foreach (Word leftWord in left.Words)
{
for (int i = 0; i < lstRight.Count; i++)
{
Word rightWord = lstRight[i];
if (leftWord.POS != rightWord.POS)
continue;
// compare meanings here
string szLeftMeaning = leftWord.Text;
string szRightMeaning = rightWord.Text;
bool blnMeaningsMatch = false;
if (PartOfSpeech.verb == leftWord.POS)
{
// remove verb form suffix -ed,-ing, -s,es,
int intLen = 0;
int intCharsMatched = charsMatched(szLeftMeaning, szRightMeaning, out intLen);
blnMeaningsMatch = intLen - intCharsMatched < 3;
}
else
{
blnMeaningsMatch = 0 == StringComparer.InvariantCultureIgnoreCase.Compare(szLeftMeaning, szRightMeaning);
}
if (blnMeaningsMatch)
{
lstRight.RemoveAt(i);
ret++;
break;
}
}
}
}
}
return ret;
}
// calcualtes if a word from question matches the word from syntax
private static int match(SyntaxNode questionNode, SyntaxNode answerNode, ref SyntaxNode shortAnswer)
{
int ret = 0;
if ((questionNode.NodeRole == answerNode.NodeRole) && (questionNode.NodeType == answerNode.NodeType))
{
// assign shortAnswer
// no processing for multy-questions like "who and when did it?"
if (0 != (questionNode.Flags & SyntaxNode.Flag.questionWord))
{
// question word must be satisfied
if (null == shortAnswer)
{
shortAnswer = answerNode;
ret += 1;
}
}
else
{
// other node, compare meanings
// in this sample meaning is equal if there are words
// in real application meanings must be compared.
int intMeaningMatch = sameMeaning(questionNode, answerNode);
if (0 != intMeaningMatch)
{
ret += intMeaningMatch;
}
}
}
// if core meanings match, try to match additional meanings if any
if ((0 != ret) && (0 != questionNode.Count) && (0 != answerNode.Count))
{
// make a copy
List<SyntaxNode> lstAnswerNodes = new List<SyntaxNode>(answerNode);
foreach (SyntaxNode questionChild in questionNode)
{
int intBestMatchScore = 0;
int intBestMatchIndex = -1;
SyntaxNode bestShortAnswer = null;
// remember if question word is already satisfied
bool blnAlreadyHasAnswer = (null != shortAnswer);
for (int i = 0; i < lstAnswerNodes.Count; i++)
{
SyntaxNode childShortAnswer = shortAnswer;
int intMatchScore = match(questionChild, lstAnswerNodes[i], ref childShortAnswer);
// remember the best match
if (intMatchScore > intBestMatchScore)
{
intBestMatchScore = intMatchScore;
intBestMatchIndex = i;
bestShortAnswer = childShortAnswer;
}
}
// if matching node is found, remove it from the list
if (-1 != intBestMatchIndex)
{
ret += intBestMatchScore;
lstAnswerNodes.RemoveAt(intBestMatchIndex);
if ((null == shortAnswer) && (null != bestShortAnswer))
shortAnswer = bestShortAnswer;
if (0 == lstAnswerNodes.Count)
break; // no more nodes to match
}
}
}
return ret;
}
}
internal class CAnswer
{
internal Utterance m_QuestionUtterance;
internal SyntaxNode m_FullQuestion;
internal Utterance m_AnswerUtterance;
internal SyntaxNode m_FullAnswer; // matches FullQuestion
internal SyntaxNode m_ShortAnswer; // matches question word if any
internal int m_intMatchScore; // match metric
internal CAnswer(int intMatchScore, Utterance questionUtterance, SyntaxNode fullQuestion
, Utterance answerUtterance, SyntaxNode fullAnswer, SyntaxNode shortAnswer
)
{
m_QuestionUtterance = questionUtterance;
m_FullQuestion = fullQuestion;
m_AnswerUtterance = answerUtterance;
m_FullAnswer = fullAnswer;
m_ShortAnswer = shortAnswer;
m_intMatchScore = intMatchScore;
}
}
class Program
{
static void Main(string[] args)
{
if (args.Length < 2)
{
Console.WriteLine("usage: question.exe \"your question\" <fileName> [encoding:utf-8(default),asci,utf-16]");
Console.WriteLine("examples:\r\n");
Console.WriteLine("question.exe \"What is the second longest river in China?\" c:\text.txt");
Console.WriteLine("question.exe \"How much fat is in strawberries?\" c:\text.txt ASCI");
return;
}
string szQuestion = args[0];
// szQuestion = "What is the second longest river in China?";
// szQuestion = "how much fat is in strawberries?";
string szFile = args[1];
if (!File.Exists(szFile))
{
Console.WriteLine("Cannot find file: " + szFile);
return;
}
// is encoding specified explicitly?
Encoding encoding = (args.Length > 2) ? Encoding.GetEncoding(args[2]) : Encoding.UTF8;
CQuestion question = null;
NLParser parser = new NLParser();
// find first question if any
foreach (Utterance utterance in parser.Text<Utterance>(szQuestion))
{
question = CQuestion.create(utterance);
if (null != question)
break;
}
if (null == question)
{
Console.WriteLine("Cannot parse question: " + szQuestion);
return;
}
long lngUtterances = 0;
// score words in a text
foreach (Utterance utterance in parser.Text<Utterance>(szFile, encoding))
{
lngUtterances++;
if ((null != utterance.Syntaxes) && (0 != utterance.Syntaxes.Length))
{
question.addAnswer(utterance);
}
}
Console.WriteLine(question.m_QuestionUtterance.ToString());
if (0 == question.m_lstAnswers.Count)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine("Cannot find an answer.");
}
else
{
Console.ForegroundColor = ConsoleColor.Green;
foreach (CAnswer answer in question.m_lstAnswers)
{
Console.Write("Score " + answer.m_intMatchScore + ". ");
if (null != answer.m_ShortAnswer)
{
bool blnFirst = true;
foreach (Word w in answer.m_ShortAnswer.Words)
{
if (blnFirst)
{
blnFirst = false;
}
else
{
Console.Write(" ");
}
Console.Write(w.Text);
}
Console.Write(". ");
}
Console.WriteLine(answer.m_AnswerUtterance.ToString().Trim());
}
}
}
}
}