Text-proofing is one of the basic tasks in text processing. Searching for typos
rarely requires cognitive activity. Moreover, such activity is somewhat
undesirable, because humans tend to correct mistakes unconsciously and so fail to
catch them. Computers can be more efficient than humans in this area; at the
very least, a draft proofreading pass with a software program saves a great deal
of time. Draft proofreading is relatively simple and is done at the lexical level.
Suggesting the correct spelling, finding a typo that is lexically a correct word
but logically a mistake, or locating atypical phrases ideally requires text
understanding and common knowledge. That is why the complete spell-checking task
may be considered part of artificial intelligence. Complete proofreading requires
processing at the semantic (or even pragmatic) level.
You can use NLP for .NET to create a draft spell checker or integrate proofing functionality into your
application. The smallest sample requires 6 lines of code:
NLParser parser = new NLParser();
foreach (Lexeme lexeme in parser.Text<Lexeme>(@"c:\test.txt", Encoding.UTF8))
{
    if ((Lexeme.LexType.word == lexeme.LexemeType) && !lexeme.HasWords)
        Console.WriteLine(lexeme.Text);
}
The spell-checking tool demonstrates the online implementation.
Similar code may be used to build a lexicon of the words used by an
author or found in a corporate document store.
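As a rough illustration of the lexicon idea, the sketch below counts how often each word occurs across a set of texts. It is deliberately self-contained: a simple regular expression stands in for the NLP for .NET lexeme loop, and the names LexiconSketch, BuildLexicon, and WordPattern are hypothetical, not part of the library. In a real application the words would come from the lexemes of type Lexeme.LexType.word, as in the sample above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LexiconSketch
{
    // Stand-in tokenizer (an assumption, not the NLP for .NET lexer):
    // runs of letters, optionally joined by apostrophes or hyphens.
    private static readonly Regex WordPattern = new Regex(@"\p{L}+(?:['\-]\p{L}+)*");

    // Returns a case-insensitive, alphabetically sorted word -> frequency map.
    public static SortedDictionary<string, int> BuildLexicon(IEnumerable<string> texts)
    {
        var lexicon = new SortedDictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string text in texts)
        {
            foreach (Match match in WordPattern.Matches(text))
            {
                int count;
                // count stays 0 when the word has not been seen yet
                lexicon.TryGetValue(match.Value, out count);
                lexicon[match.Value] = count + 1;
            }
        }
        return lexicon;
    }
}
```

Words from such a lexicon that the built-in dictionary does not know (product names, abbreviations, jargon) are natural candidates for a custom dictionary.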
Below is a proofreading console application written in C#.
You can copy and paste the code into Visual Studio, then compile and run it.
In the same way you can process a batch of documents or entire web sites.
Suggestions
If you need suggestions for a misspelled word, or you need your own dictionary, please use SpellChecker.
SpellChecker is designed specifically for proofreading.
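The SpellChecker API is not shown on this page, so the following is not its implementation. Purely to illustrate what producing suggestions for a misspelled word can involve, this sketch ranks the entries of a word list by Levenshtein edit distance. The names SuggestionSketch, EditDistance, and Suggest, the maxDistance parameter, and the word list itself are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SuggestionSketch
{
    // Dynamic-programming Levenshtein distance (insert, delete, substitute),
    // keeping only two rows of the table in memory.
    public static int EditDistance(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Returns the words within maxDistance edits of the misspelled word,
    // closest first; ties are ordered alphabetically (ordinal comparison).
    public static List<string> Suggest(string misspelled, IEnumerable<string> wordList, int maxDistance)
    {
        string needle = misspelled.ToLowerInvariant();
        return wordList
            .Select(w => new { Word = w, Distance = EditDistance(needle, w.ToLowerInvariant()) })
            .Where(x => x.Distance <= maxDistance)
            .OrderBy(x => x.Distance)
            .ThenBy(x => x.Word, StringComparer.Ordinal)
            .Select(x => x.Word)
            .ToList();
    }
}
```

Comparing against every word is acceptable for a small custom dictionary. A large dictionary would likely need an index or a length-based pre-filter, which this sketch leaves out.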
The proofreading console application:

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
// reference to NlpLib.dll is required
using Nlp4Net.NlpLib;

namespace spellcheck
{
    class Program
    {
        // Return value:
        // 'true' if lexeme is found in custom dictionary,
        // otherwise 'false'.
        private static bool IsInCustomDictionary(string szLexeme)
        {
            // Look up a lexeme in your own dictionary here.
            return false;
        }

        static void Main(string[] args)
        {
            if (0 == args.Length)
            {
                Console.WriteLine("usage: spellcheck.exe <fileName> [encoding] (default encoding is utf-8)");
                Console.WriteLine("examples:\r\n");
                Console.WriteLine("spellcheck.exe text.txt");
                Console.WriteLine("spellcheck.exe text.txt us-ascii");
                return;
            }

            string szFile = args[0];
            // make sure the file exists
            if (!File.Exists(szFile))
            {
                Console.WriteLine("Cannot find file: " + szFile);
                return;
            }

            Encoding encoding = Encoding.UTF8; // default encoding is UTF-8
            // is the encoding specified explicitly?
            if (args.Length > 1)
            {
                try
                {
                    encoding = Encoding.GetEncoding(args[1]);
                }
                catch (ArgumentException)
                {
                    // GetEncoding throws on an unknown encoding name
                    Console.WriteLine("Unknown encoding: " + args[1]);
                    return;
                }
            }

            NLParser parser = new NLParser();

            // counters (word occurrences, not distinct words)
            long lngKnownWords = 0;
            long lngMisspelledWords = 0;

            // misspelled word -> number of occurrences, sorted, case-insensitive
            SortedList<string, int> lstMisspelledWords = new SortedList<string, int>(StringComparer.InvariantCultureIgnoreCase);

            using (StreamReader reader = new StreamReader(szFile, encoding))
            {
                // enumerate all lexemes in the text stream
                foreach (Lexeme lexeme in parser.Text<Lexeme>(reader))
                {
                    // Only words are relevant for spell checking.
                    // Skip spaces (LexType.space) and format lexemes (LexType.format).
                    if (Lexeme.LexType.word != lexeme.LexemeType)
                        continue;

                    // If the lexeme has words, it was found in the built-in dictionary.
                    if (lexeme.HasWords)
                    {
                        lngKnownWords++;
                        continue;
                    }

                    // Not in the built-in dictionary. Last check: the user dictionary.
                    if (IsInCustomDictionary(lexeme.Text))
                    {
                        lngKnownWords++;
                        continue;
                    }

                    // Misspelled word: count the occurrence.
                    lngMisspelledWords++;
                    int count;
                    lstMisspelledWords.TryGetValue(lexeme.Text, out count); // count is 0 if absent
                    lstMisspelledWords[lexeme.Text] = count + 1;
                }
            }

            // show results
            long lngTotalWords = lngMisspelledWords + lngKnownWords;
            if (0 == lngTotalWords)
            {
                Console.WriteLine("No words found in the file: " + szFile);
                return;
            }

            Console.WriteLine(string.Format("{0} words in file: {1}", lngTotalWords, szFile));
            if (0 == lngMisspelledWords)
            {
                Console.ForegroundColor = ConsoleColor.Green;
                Console.WriteLine("All words are correct.");
            }
            else
            {
                // Floating-point division, so the percentage is rounded rather than truncated.
                Console.WriteLine(string.Format("Misspelled words: {0} distinct, {1} occurrences ({2}%)"
                    , lstMisspelledWords.Count
                    , lngMisspelledWords
                    , Math.Round(lngMisspelledWords * 100.0 / lngTotalWords)
                    ));
                Console.WriteLine("List of misspelled words (with occurrence counts):\r\n");
                Console.ForegroundColor = ConsoleColor.Red;
                foreach (KeyValuePair<string, int> entry in lstMisspelledWords)
                {
                    Console.WriteLine(string.Format("{0} ({1})", entry.Key, entry.Value));
                }
            }
            // restore the console's original colors
            Console.ResetColor();
        }
    } // class
} // namespace
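
The IsInCustomDictionary method in the listing above is only a stub that always returns false. One possible way to fill it in is sketched below, assuming the custom dictionary is a plain text source with one word per line. That file format, the '#' comment convention, and the CustomDictionary class name are assumptions made for this illustration, not part of NLP for .NET.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

sealed class CustomDictionary
{
    // Case-insensitive set, matching the comparer used for the misspelled-word list.
    private readonly HashSet<string> words =
        new HashSet<string>(StringComparer.InvariantCultureIgnoreCase);

    // Reads one word per line; blank lines and lines starting with '#' are skipped.
    public CustomDictionary(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string word = line.Trim();
            if (word.Length == 0 || word.StartsWith("#", StringComparison.Ordinal))
                continue;
            words.Add(word);
        }
    }

    public int Count
    {
        get { return words.Count; }
    }

    public bool Contains(string lexeme)
    {
        return words.Contains(lexeme);
    }
}
```

In the console application, such a dictionary could be loaded once before the lexeme loop (for example from a StreamReader over a word file) and IsInCustomDictionary could simply return the result of Contains for the given lexeme.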