Natural Language Question-Answer.

Article explains how to search a direct answer for a natural language question.

You can apply Natural language parser to look up a direct answer for a question. Similar technique can be applied in information retrieval and automatic document clustering and summarization.

The sample code demonstrates how to convert a question into an answer pattern and how to search for this pattern in a text. Question-Answer online shows a web version.

The algorithm takes an input question and converts question form into an answer form. The transformation is relatively straightforward, although for some kind of questions you may need a specific transformation. For example for question with subject "you" (What are you doing?) the "you" word should be matched with "I", "we", or parallel construction (like "Tom and myself"). In a real application you may need to add such logic.

Then algorithm reads utterances from a text file, matches them with the pattern and calculates matching score. Syntax graphs are matched first. If syntax nodes match, then meanings of words associated with syntax nodes are compared. If both syntax and meanings are equal, the syntax node form the pattern and from the utterance are considered to be equal and the matching score is incremented. When all text is parsed, the answers with the best score are suggested.

Algorithm makes sure that for questions with a question word the question word is always matched. For example for the question "who moved my cheese" the answer must contain the word matching the question word "who".

Syntax-lexical pattern vs. syntax-semantic pattern.

The code uses syntax-lexical pattern. When comparing word meanings it compares the strings representing a lexeme. It is not quite accurate because word form may vary. For example verb form depends on verb time in English. If you want to use syntax-lexical pattern in a real application, check all word inflections as well as synonyms.

If you have a database with word meanings you can map words to the corresponding meanings. Then you can search for syntax-semantic pattern and compare actual meanings instead of lexemes. You don't have to build meaning mapping for all words, but only for words in the application knowledge domain and for few question words (how, what, when, how many, etc).

Syntax-semantic pattern can match meaning of a question word, while syntax-lexical pattern lacks this ability.

Direct answer means no inference.

The application doesn't make any inference. A direct answer should appear in the searched text. For example for the question "how much fat is in strawberries" the "Strawberries contain no fat" is a correct answer but it is not a direct answer. Matching this answer requires inference.

Direct answer is not necessary the same sequence of lexemes, because search algorithm relies on syntax graphs. For example the code finds the correct answer "There is nearly no fat in strawberries" to the question "how much fat is in strawberries".

Advantages of natural language query.

Comparing with keyword-based searching algorithms natural language query can find an exact answer because it understands syntax information in a user question.

Ability to give a short answer when possible is a unique feature, which makes it more user-friendly and may save time. For example user can read only through a list of short answers rather than their full versions.

Natural language query allows asking a question to a computer as if talking to a human.

Syntax-lexical or syntax-semantic comparing algorithm is not question specific. It can be applied to compare assertion clauses in the same way.

To compare texts you can use multiple utterances in the search pattern.

C# code

You can copy-paste, compile and execute the code in Visual Studio.

Here is the content of text.txt file. File contains text with a direct answer:

The longest river in the world is Nile.
The second longest river in China is Huanghe.
The Yangtze River is the longest river in Asia.
Strawberries contain no fat.
There is nearly no fat in strawberries.

Here is the output from question.exe :

C:\question>question.exe "What is the second longest river in China?" text.txt
What is the second longest river in China?
Score 10. Huanghe. The second longest river in China is Huanghe.
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

// reference to NlpLib.dll is required
using Nlp4Net.NlpLib;

namespace question
{
    internal class CQuestion
    {
        internal Utterance m_QuestionUtterance;
        private bool m_blnHasQuestionWord;
        private SyntaxNode[] m_Questions;
        internal List<CAnswer> m_lstAnswers;

        // prepares the question from Utterance 
        private CQuestion(Utterance question)
        {
            m_QuestionUtterance = question;
            m_blnHasQuestionWord = false;
            List<SyntaxNode> lstQuestions = new List<SyntaxNode>(question.Syntaxes);
            for(int i = lstQuestions.Count -1; i > -1; i--)
            {
                SyntaxNode root = lstQuestions[i];
                if (RKStruct.QuestionClause != root.NodeType)
                {
                    // if node is not a question, remove it
                    lstQuestions.RemoveAt(i);
                }
                else
                {
                    // check if at least one syntax graph contains a question word
                    if (null != root.Find(delegate(SyntaxNode child)
                    {
                        return (0 != (child.Flags & SyntaxNode.Flag.questionWord));
                    }))
                    {
                        m_blnHasQuestionWord = true;
                        break;
                    }
                }
            }
            m_Questions = lstQuestions.ToArray();

            // transform question into answer
            foreach (SyntaxNode root in m_Questions)
                root.NodeType = RKStruct.Clause;

            // prepare the storage for answers
            m_lstAnswers = new List<CAnswer>();
        }

        // class factory, checks if Utterances is a questionClause
        // and creates a CQuestion
        internal static CQuestion create(Utterance question)
        {
            CQuestion ret = null;
            // utterance must have at least one question clause
            if (null != question.Syntaxes)
            {
                foreach (SyntaxNode root in question.Syntaxes)
                {
                    if (RKStruct.QuestionClause == root.NodeType)
                    {
                        ret = new CQuestion(question);
                        break;
                    }
                }
            }

            return ret;
        }

        internal bool addAnswer(Utterance utterance)
        {
            bool ret = false;
            int intCurrentScore = (0 == m_lstAnswers.Count) ? 0 : m_lstAnswers[0].m_intMatchScore;
            // check if there are any better answers then in lstAnswers
            if ((null != utterance.Syntaxes) && (0 != utterance.Syntaxes.Length))
            {
                foreach (SyntaxNode questionNode in m_Questions)
                {
                    foreach (SyntaxNode answerNode in utterance.Syntaxes)
                    {
                        SyntaxNode shortAnswer = null;
                        int intScore = match(questionNode, answerNode, ref shortAnswer);

                        // Answer must contain answer word
                        if (m_blnHasQuestionWord && (null == shortAnswer))
                            continue;

                        // answer score must be GE to be added
                        bool blnAdd = intScore >= intCurrentScore;

                        // don't want to have same utterance twice
                        if (ret && (intScore == intCurrentScore))
                            blnAdd = false;

                        if (intScore > intCurrentScore)
                        {
                            m_lstAnswers.Clear();
                            intCurrentScore = intScore;
                        }

                        if (blnAdd)
                        {
                            m_lstAnswers.Add(new CAnswer(intScore, this.m_QuestionUtterance, questionNode
                                , utterance, answerNode, shortAnswer));
                            ret = true;
                        }
                    }
                }
                //List<Synt>
            }
            return ret;
        }

        private static int charsMatched(string szLeft, string szRight, out int intLen)
        {
            int ret = 0;
            intLen = szLeft.Length;
            if (szRight.Length < intLen)
                intLen = szRight.Length;

            for (int i = 0; i < intLen; i++)
            {
                if (char.ToLower(szLeft[i]) != char.ToLower(szRight[i]))
                    break;
                ret = i;
            }
            return ret;
        }

        private static int sameMeaning(SyntaxNode left, SyntaxNode right)
        {
            int ret = 0;
            if (0 == left.Words.Count)
            {
                if (0 == right.Words.Count)
                    ret = 1;
            }
            else
            {
                
                if (0 != right.Words.Count)
                {
                    List<Word> lstRight = new List<Word>(right.Words);
                    foreach (Word leftWord in left.Words)
                    {
                        for (int i = 0; i < lstRight.Count; i++)
                        {
                            Word rightWord = lstRight[i];
                            
                            if (leftWord.POS != rightWord.POS)
                                continue;

                            // compare meanings here
                            string szLeftMeaning = leftWord.Text;
                            string szRightMeaning = rightWord.Text;

                            bool blnMeaningsMatch = false;

                            if (PartOfSpeech.verb == leftWord.POS)
                            {
                                // remove verb form suffix -ed,-ing, -s,es,
                                int intLen = 0;
                                int intCharsMatched = charsMatched(szLeftMeaning, szRightMeaning, out intLen);
                                blnMeaningsMatch = intLen - intCharsMatched < 3;
                            }
                            else
                            {
                                blnMeaningsMatch = 0 == StringComparer.InvariantCultureIgnoreCase.Compare(szLeftMeaning, szRightMeaning);
                            }
                            
                            if (blnMeaningsMatch)
                            {
                                lstRight.RemoveAt(i);
                                ret++;
                                break;
                            }
                        }
                    }
                }
            }
            return ret;
        }

        // calcualtes if a word from question matches the word from syntax
        private static int match(SyntaxNode questionNode, SyntaxNode answerNode, ref SyntaxNode shortAnswer)
        { 
            int ret = 0;
            if ((questionNode.NodeRole == answerNode.NodeRole) && (questionNode.NodeType == answerNode.NodeType))
            {
                // assign shortAnswer
                // no processing for multy-questions like "who and when did it?"
                if (0 != (questionNode.Flags & SyntaxNode.Flag.questionWord))
                {
                    // question word must be satisfied
                    if (null == shortAnswer)
                    {
                        shortAnswer = answerNode;
                        ret += 1;
                    }
                }
                else
                { 
                    // other node, compare meanings
                    // in this sample meaning is equal if there are words 
                    // in real application meanings must be compared.
                    int intMeaningMatch = sameMeaning(questionNode, answerNode);
                    if (0 != intMeaningMatch)
                    {
                        ret += intMeaningMatch;
                    } 
                }                
            }

            // if core meanings match, try to match additional meanings if any
            if ((0 != ret) && (0 != questionNode.Count) && (0 != answerNode.Count))
            {
                // make a copy
                List<SyntaxNode> lstAnswerNodes = new List<SyntaxNode>(answerNode);
                foreach (SyntaxNode questionChild in questionNode)
                {

                    int intBestMatchScore = 0;
                    int intBestMatchIndex = -1;
                    SyntaxNode bestShortAnswer = null;
                    // remember if question word is already satisfied
                    bool blnAlreadyHasAnswer = (null != shortAnswer);
                    for (int i = 0; i < lstAnswerNodes.Count; i++)
                    {
                        SyntaxNode childShortAnswer = shortAnswer;
                        int intMatchScore = match(questionChild, lstAnswerNodes[i], ref childShortAnswer);
                        
                        // remember the best match
                        if (intMatchScore > intBestMatchScore)
                        {
                            intBestMatchScore = intMatchScore;
                            intBestMatchIndex = i;
                            bestShortAnswer = childShortAnswer;
                        }
                    }

                    // if matching node is found, remove it from the list
                    if (-1 != intBestMatchIndex)
                    {
                        ret += intBestMatchScore;
                        lstAnswerNodes.RemoveAt(intBestMatchIndex);

                        if ((null == shortAnswer) && (null != bestShortAnswer))
                            shortAnswer = bestShortAnswer;

                        if (0 == lstAnswerNodes.Count)
                            break; // no more nodes to match
                    }

                }
            }
            return ret;
        }
    }

    internal class CAnswer
    {
        internal Utterance  m_QuestionUtterance;
        internal SyntaxNode m_FullQuestion;

        internal Utterance m_AnswerUtterance;
        internal SyntaxNode m_FullAnswer; // matches FullQuestion
        internal SyntaxNode m_ShortAnswer; // matches question word if any

        internal int m_intMatchScore; // match metric

        internal CAnswer(int intMatchScore, Utterance questionUtterance, SyntaxNode fullQuestion
            , Utterance answerUtterance, SyntaxNode fullAnswer, SyntaxNode shortAnswer
            )
        {
            m_QuestionUtterance = questionUtterance;
            m_FullQuestion = fullQuestion;
            m_AnswerUtterance = answerUtterance;
            m_FullAnswer = fullAnswer;
            m_ShortAnswer = shortAnswer;
            m_intMatchScore = intMatchScore;
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length < 2)
            {
                Console.WriteLine("usage: question.exe \"your question\" <fileName>  [encoding:utf-8(default),asci,utf-16]");
                Console.WriteLine("examples:\r\n");
                Console.WriteLine("question.exe \"What is the second longest river in China?\" c:\text.txt");
                Console.WriteLine("question.exe \"How much fat is in strawberries?\" c:\text.txt ASCI");
                return;
            }

            string szQuestion = args[0];
            // szQuestion = "What is the second longest river in China?";
            // szQuestion = "how much fat is in strawberries?";
            string szFile = args[1];

            if (!File.Exists(szFile))
            {
                Console.WriteLine("Cannot find file: " + szFile);
                return;
            }

            // is encoding specified explicitly?
            Encoding encoding = (args.Length > 2) ? Encoding.GetEncoding(args[2]) : Encoding.UTF8;

            CQuestion question = null;
            NLParser parser = new NLParser();

            // find first question if any
            foreach (Utterance utterance in parser.Text<Utterance>(szQuestion))
            {
                question = CQuestion.create(utterance);
                if (null != question)
                    break;
            }

            if (null == question)
            {
                Console.WriteLine("Cannot parse question: " + szQuestion);
                return;
            }

            long lngUtterances = 0;
            // score words in a text
            foreach (Utterance utterance in parser.Text<Utterance>(szFile, encoding))
            {
                lngUtterances++;
                if ((null != utterance.Syntaxes) && (0 != utterance.Syntaxes.Length))
                {
                    question.addAnswer(utterance);
                }
            }

            Console.WriteLine(question.m_QuestionUtterance.ToString());
            if (0 == question.m_lstAnswers.Count)
            {
                Console.ForegroundColor = ConsoleColor.Red;
                Console.WriteLine("Cannot find an answer.");
            }
            else
            {
                Console.ForegroundColor = ConsoleColor.Green;
                foreach (CAnswer answer in question.m_lstAnswers)
                {
                    Console.Write("Score " + answer.m_intMatchScore + ". ");
                    if (null != answer.m_ShortAnswer)
                    {
                        bool blnFirst = true;
                        foreach (Word w in answer.m_ShortAnswer.Words)
                        {
                            if (blnFirst)
                            {
                                blnFirst = false;
                            }
                            else
                            {
                                Console.Write(" ");
                            }
                            Console.Write(w.Text);
                        }
                        Console.Write(". ");
                    }
                    Console.WriteLine(answer.m_AnswerUtterance.ToString().Trim());
                }
            }
        }
    }
}