NLP for .NET overview

This paper briefly describes the NLP for .NET library.

NLP for .NET is natural language processing software for .NET developers.
Its natural language parser performs lexical and syntactic parsing of English text.
The library also includes components that are not language-specific, such as SpellChecker and WordDictionary.

The goal of NLP for .NET

The main goal is to let a computer understand language the way human beings do. Humans are quite flexible creatures and can adapt their language all the way down to punched cards, but that does not make life easier. It would be much better if a computer were able to understand natural language. The major goal of NLP for .NET is to help adapt the computer to the human world. As the Microsoft Natural Language Processing group says, "This goal is not easy to reach." But it is an interesting and challenging task.

Some tasks where you can apply NLP for .NET

Natural Language document processing

  1. automatic text summarization and document simplification
  2. breaking text into syntax fragments like: subject -- verb, subject -- verb -- object
  3. convert plain text into syntax graphs
  4. build a list of key words for the text
  5. spell-checking, based on your own dictionaries
  6. autocomplete for your lists
  7. language detection

Natural Language Search

  1. natural language query: look up a direct answer for the question
  2. search a phrase by "syntax + semantic" pattern
  3. search a phrase by "syntax + key words" template
  4. find documents with the same theme

Human -- Software interaction

  1. consume input in natural language

How it works

The communication process between software and a human is asymmetric for the computer program. While producing human-readable output is a relatively simple task, consuming natural language is orders of magnitude more difficult. A lot has to happen before a string of characters makes sense to a program. For a computer, it is a task of immense complexity.

NLP for .NET comes with an English syntax parser based on Reed-Kellogg grammar. The process is wrapped in the NLParser class. The natural language parser takes a text stream as input and produces Utterances as output. Internally, it is implemented as a pipeline of two parsers.
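The two-stage pipeline described above can be pictured with a minimal Python sketch. It is not the NLParser API: the names `Lexeme`, `Utterance`, `lexical_stage`, `syntax_stage` and `parse` are hypothetical, and both stages are toy placeholders (whitespace splitting and sentence grouping). The only point it illustrates is that the output of the first parser becomes the input of the second.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lexeme:
    """Hypothetical stand-in for one item produced by the lexical stage."""
    text: str


@dataclass
class Utterance:
    """Hypothetical stand-in for one item produced by the whole pipeline."""
    lexemes: List[Lexeme] = field(default_factory=list)


def lexical_stage(text: str) -> List[Lexeme]:
    # Toy stage 1: split on whitespace. A real lexical parser does far more.
    return [Lexeme(token) for token in text.split()]


def syntax_stage(lexemes: List[Lexeme]) -> List[Utterance]:
    # Toy stage 2: close an utterance after each sentence-final token.
    utterances: List[Utterance] = []
    current: List[Lexeme] = []
    for lexeme in lexemes:
        current.append(lexeme)
        if lexeme.text.endswith((".", "?", "!")):
            utterances.append(Utterance(current))
            current = []
    if current:
        utterances.append(Utterance(current))
    return utterances


def parse(text: str) -> List[Utterance]:
    # The pipeline: stage 1 output feeds stage 2.
    return syntax_stage(lexical_stage(text))
```

Calling `parse("Dogs bark. Cats sleep")` in this sketch yields two utterances, one per sentence.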

[Figure: nlp4net architecture]

Lexical parsing

The lexical parser is the first one in the pipeline. It takes a string of characters as input and produces Lexemes as output.
The parser has a built-in English dictionary and can be used out of the box.

It supports compound words, hyphenation, carrions and more.

An application can supply its own words when it needs a specialized dictionary or when a word is unknown to the parser.

The parser allows non-lexical information, such as text formatting, DTMF input or arbitrary user data, to be incorporated into the processed stream.

One of the features of the lexical parser is support for word ambiguity. It is useful for integration with OCR or speech recognition, when the exact word cannot be recognized.

The lexical parser produces Lexemes. A Lexeme is an unbreakable string of characters; it may be a run of whitespace, or it may have associated Words.
A Word carries syntactic information such as its part of speech and additional syntax tags. This information is used by the second parser in the NLParser pipeline, the syntax parser.
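The relationship between Lexemes and Words can be sketched as a small data structure. This is an illustration in Python under stated assumptions, not the library's types: the field names, the three-entry dictionary and the regular expression are all hypothetical. It shows a whitespace lexeme with no Words and an ambiguous lexeme ("saw") that carries two candidate Words.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Word:
    """One dictionary reading of a lexeme (hypothetical fields)."""
    lemma: str
    part_of_speech: str


@dataclass
class Lexeme:
    """An unbreakable run of characters with zero or more candidate Words."""
    text: str
    words: List[Word] = field(default_factory=list)


# Tiny illustrative dictionary; "saw" is deliberately ambiguous.
TOY_DICTIONARY: Dict[str, List[Word]] = {
    "i": [Word("I", "pronoun")],
    "saw": [Word("see", "verb"), Word("saw", "noun")],
    "her": [Word("she", "pronoun")],
}

# A run of whitespace, a run of word characters, or one other character.
TOKEN_PATTERN = re.compile(r"\s+|\w+|[^\w\s]")


def lex(text: str) -> List[Lexeme]:
    lexemes = []
    for chunk in TOKEN_PATTERN.findall(text):
        # Whitespace, punctuation and unknown words get no associated Words.
        words = list(TOY_DICTIONARY.get(chunk.lower(), []))
        lexemes.append(Lexeme(chunk, words))
    return lexemes
```

Because every character ends up in exactly one lexeme, joining the lexeme texts reproduces the input string.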

Syntax parsing

The syntax parser takes a sequence of Lexemes as input and combines Words into a syntax graph.
The graph is a tree of Syntax Nodes.

The links in the tree correspond to the syntax roles in a standard Reed-Kellogg diagram, such as subject--verb, verb--complement, subject--adjective, verb--adverb and so on. Sometimes a link implies a certain sub-tree, such as a Clause, Gerund or Participle. Such Reed-Kellogg structures typically appear on a pedestal or tower in a classical Reed-Kellogg diagram.

Words are the leaves of a syntax tree.
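A tree with role-labeled links and words at the leaves might be modeled as follows. This is a hedged Python sketch: `SyntaxNode`, its fields and the role labels ("clause", "head", "modifier" and so on) are hypothetical choices for illustration and are not taken from the library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SyntaxNode:
    """One node of a toy syntax tree (hypothetical, not the library's type)."""
    role: str                      # syntax role of the link to the parent
    word: Optional[str] = None     # set only on leaves
    children: List["SyntaxNode"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Collect the words at the leaves, left to right."""
        if not self.children:
            return [self.word] if self.word is not None else []
        collected: List[str] = []
        for child in self.children:
            collected.extend(child.leaves())
        return collected

    def find(self, role: str) -> List["SyntaxNode"]:
        """Return every node in this subtree whose link carries the role."""
        matches = [self] if self.role == role else []
        for child in self.children:
            matches.extend(child.find(role))
        return matches


# "The dog barks loudly", with made-up role labels.
tree = SyntaxNode("clause", children=[
    SyntaxNode("subject", children=[
        SyntaxNode("modifier", "the"),
        SyntaxNode("head", "dog"),
    ]),
    SyntaxNode("verb", children=[
        SyntaxNode("head", "barks"),
        SyntaxNode("adverb", "loudly"),
    ]),
])
```

Walking the leaves recovers the original word order, while `find` lets a later layer pick out sub-trees by role.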

The graph is called a Reed-Kellogg tree graph because it is essentially a classical Reed-Kellogg diagram with an enforced tree structure.
A classical Reed-Kellogg diagram is perfect for human understanding, but it is less attractive than tree-based graphs for computer processing. The Reed-Kellogg tree representation gives a computer program all the advantages of a classical Reed-Kellogg diagram combined with the simplicity of tree graphs.

Reed-Kellogg tree grammar belongs to Type-0 in the Chomsky hierarchy and has the scalability of human syntax, which context-free grammars cannot reach.

A sequence of Words linked into a syntax graph forms an Utterance.

Usually the same words can be joined syntactically in many ways, which results in different meanings. That is why an Utterance may be associated with multiple Reed-Kellogg trees. Syntactic ambiguity is an essential feature of NLP for .NET, because a syntax parser that produces a single syntax graph would ultimately be incorrect. Syntactic ambiguity lets the subsequent semantic and pragmatic layers decide which interpretation is meaningful.
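The idea of keeping several syntactic readings and letting a later layer choose can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Utterance` shape, the link triples, and the toy "plausibility" scorer that stands in for a semantic layer. None of it reflects how NLP for .NET represents or ranks readings.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Link = Tuple[str, str, str]      # (head, role, dependent), hypothetical
Reading = FrozenSet[Link]        # one syntactic analysis of the words


@dataclass
class Utterance:
    words: List[str]
    readings: List[Reading]      # every reading the syntax stage produced


def choose_reading(utterance: Utterance,
                   score: Callable[[Reading], float]) -> Reading:
    # A later layer supplies the scoring function; the parser does not decide.
    return max(utterance.readings, key=score)


# "I saw the man with the telescope": who has the telescope?
INSTRUMENT = frozenset({
    ("saw", "subject", "I"),
    ("saw", "object", "man"),
    ("saw", "modifier", "telescope"),
})
ATTRIBUTE = frozenset({
    ("saw", "subject", "I"),
    ("saw", "object", "man"),
    ("man", "modifier", "telescope"),
})
utterance = Utterance("I saw the man with the telescope".split(),
                      [ATTRIBUTE, INSTRUMENT])

# Toy world knowledge standing in for a semantic layer.
PLAUSIBLE_PAIRS = {("saw", "telescope")}


def plausibility(reading: Reading) -> int:
    return sum(1 for head, _role, dependent in reading
               if (head, dependent) in PLAUSIBLE_PAIRS)
```

With this scorer, `choose_reading(utterance, plausibility)` selects the instrument reading; a different scorer could just as well select the other one, which is the reason for keeping both.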

Utterances are the output of NLParser and can be used for further semantic analysis.

Reed-Kellogg syntax

NLP for .NET is built on Reed-Kellogg grammar but can easily be adapted to other kinds of grammars. A syntax graph is an important but still intermediate step in natural language processing. It is only as good as the help it gives the subsequent semantic layer in acquiring information.

A Reed-Kellogg tree graph is easy to convert into another syntax, as long as the target is also a tree-based graph.
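Converting between tree-shaped representations usually comes down to a recursive traversal. The Python sketch below rewrites a toy labeled tree as labeled bracket notation, a common way to write tree-structured syntax as text. The node encoding and the role labels are hypothetical, and this is not a faithful mapping from Reed-Kellogg trees to any particular grammar formalism; it only shows why a tree source makes such conversions mechanical.

```python
from typing import List, Tuple, Union

# A node is (label, word) for a leaf, or (label, [child, ...]) otherwise.
Node = Tuple[str, Union[str, List["Node"]]]


def to_brackets(node: Node) -> str:
    """Render a labeled tree as nested brackets, depth first."""
    label, content = node
    if isinstance(content, str):
        return f"[{label} {content}]"
    inner = " ".join(to_brackets(child) for child in content)
    return f"[{label} {inner}]"


# "Dogs bark loudly", with made-up role labels.
toy_tree: Node = ("clause", [
    ("subject", "dogs"),
    ("verb", [("head", "bark"), ("adverb", "loudly")]),
])

print(to_brackets(toy_tree))
# prints: [clause [subject dogs] [verb [head bark] [adverb loudly]]]
```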

Variations of the Chomsky phrase marker are popular ways to describe syntax.
Another excellent but perhaps less well-known grammar is Tesnière's dependency grammar.
There are other types of grammars as well.

More information: Why Reed-Kellogg diagrams?

Other components

SpellChecker helps to add auto-complete and spell-checking features to your application.

You have full control over the dictionary. It can be a list of your products, names in an HR database, e-mail addresses, URLs or a language dictionary.

The speller has high suggestion quality. In most cases it suggests a single word, which makes it usable in automatic text processing, where no human is available to make corrections.
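The text does not say how SpellChecker ranks its suggestions. As a generic illustration of dictionary-based suggestion, the Python sketch below uses Levenshtein edit distance, a common technique that is swapped in here purely as an assumption. The function names and the `max_distance` parameter are hypothetical.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word: str, dictionary: Iterable[str],
            max_distance: int = 2) -> List[str]:
    """Return dictionary entries near a misspelled word, nearest first."""
    entries = list(dictionary)
    if word in entries:
        return []  # the word is correct, nothing to suggest
    scored = sorted((edit_distance(word, entry), entry) for entry in entries)
    return [entry for distance, entry in scored if distance <= max_distance]
```

With a custom dictionary such as `["keyboard", "monitor", "mouse"]`, the misspelling `"mosue"` produces a single suggestion, `"mouse"`, which is the situation where unattended correction becomes practical.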

SpellChecker is a high-performance speller designed for multithreaded applications.

Force suggestion is a unique feature of SpellChecker. It lets you get the nearest suggestions even if the word is correct. For example, a user who has found a product may also be interested in products with similar spelling.
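The behavior described above, returning near neighbors even for a correctly spelled word, can be imitated with Python's standard `difflib` module. This is only a sketch of the concept: the `force` flag, the function name and the similarity cutoff are assumptions, and `difflib` is a stand-in rather than whatever SpellChecker uses internally.

```python
import difflib
from typing import Iterable, List


def suggest(word: str, dictionary: Iterable[str],
            force: bool = False, limit: int = 3) -> List[str]:
    """Suggest similar entries; with force=True, do so even for known words."""
    entries = list(dictionary)
    if word in entries and not force:
        return []  # ordinary mode: a correct word needs no suggestions
    # Never suggest the word itself, only its neighbors.
    candidates = [entry for entry in entries if entry != word]
    return difflib.get_close_matches(word, candidates, n=limit, cutoff=0.6)
```

For a product list like `["canon", "canyon", "cannon", "tripod"]`, `suggest("canon", products)` returns nothing because the word is known, while `suggest("canon", products, force=True)` returns the similarly spelled entries.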

You have full control over word normalization. You can have a case-sensitive or case-insensitive speller, or implement your own normalization algorithm.
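One way to picture pluggable normalization is a lookup that applies a caller-supplied function to both the dictionary entries and the queried word. The Python sketch below is a hypothetical design, not the SpellChecker interface; `NormalizedLookup`, `identity` and `strip_accents` are names invented for the example.

```python
import unicodedata
from typing import Callable, Iterable


def identity(text: str) -> str:
    """Case-sensitive behavior: compare words exactly as written."""
    return text


def strip_accents(text: str) -> str:
    """A custom normalizer: drop combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed
                            if not unicodedata.combining(c))
    return without_marks.casefold()


class NormalizedLookup:
    """Dictionary membership test under a chosen normalization."""

    def __init__(self, words: Iterable[str],
                 normalize: Callable[[str], str] = str.casefold) -> None:
        self._normalize = normalize
        # Normalize once at load time so lookups stay cheap.
        self._entries = {normalize(word) for word in words}

    def contains(self, word: str) -> bool:
        return self._normalize(word) in self._entries
```

Passing `identity` gives a case-sensitive lookup, the default `str.casefold` gives a case-insensitive one, and `strip_accents` shows how an application-specific rule can be dropped in without changing the lookup itself.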