Generating A Spelling Correction Benchmark

The problem with current spelling correction datasets is that they are not freely available, lack any context, or are completely artificial. This led me to generating a new, freely available dataset that, although artificial, contains the most typical spelling errors. By identifying the most common misspelling types, we can generate a dataset that is purely artificial yet still guided by how real misspellings are introduced into documents.

Common Spelling Errors

Before going into the details of generating an artificial error corpus, we should first look at the most common error types in spelling. As I am not the only one working on this small corner of NLP, others have already done the numbers on "Which error type is most common?" and "What is the most common (Levenshtein) edit distance of misspelled tokens?". The results are shown in the subsequent tables: the first covers the most common misspelling types found in the ETS spelling corpus, the second the most common edit distances of misspellings. Unsurprisingly, the most frequent case is a single-character typo, leading to an edit distance of one. The most common error type is the generation of a non-word, which is to be expected: when pressing a key other than the intended one, the likelihood of producing another real word is significantly lower than that of producing a non-word.

Classification of annotated misspellings in the ETS spelling corpus

Table gathered by Flor for "Four types of context for automatic spelling correction". Source: [0]

| Type | Description | Amount |
|------|-------------|--------|
| 1 | Single-token non-word misspelling (e.g. "businees" instead of "business") | 87.04% |
| 2 | Non-word misspelling for which no plausible correction was found | 0.21% |
| 3 | Multi-token non-word misspelling (e.g. "mor efun" instead of "more fun") | 1.58% |
| 4 | Single-token real-word misspelling (e.g. "they" instead of "then") | 9.42% |
| 5 | Multi-token real-word misspelling (e.g. "with out" instead of "without") | 1.75% |

Distribution of errors by edit distance to correct form, in TOEFL-Spell

Table gathered by Flor et al. for "A Benchmark Corpus of English Misspellings and a Minimally-supervised Model for Spelling Correction". Source: [1]

| Edit Distance | Count | Percentage |
|---------------|-------|------------|
| 1 | 5066 | 82.76% |
| 2 | 769 | 12.56% |
| 3 | 198 | 3.23% |
| >= 4 | 88 | 1.45% |

More Natural Spelling Error Generation

Using the information above, I built an error model that is (hopefully) capable of generating close-to-natural misspellings. To this end, a palette of different error patterns and data sources was created.

Near-Neighbour Dictionary

Beforehand, I parsed the word information from https://oed.com and generated the so-called near-neighbour dictionary. This dictionary contains all previously extracted words in lexicographic order and, for each word, an index list of other real words within edit distance 1 to n as well as an index list of phonetically similar words determined with the Double Metaphone algorithm.
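
To make this concrete, here is a minimal sketch of how such a near-neighbour dictionary could be built. It is a sketch under assumptions: it takes any plain word list (the post parses its words from https://oed.com), uses a brute-force O(n²) neighbour search, and relies on the `Metaphone` package for the Double Metaphone codes; the original implementation may differ.

```python
# A minimal near-neighbour dictionary sketch (assumptions: plain lowercase
# word list, brute-force neighbour search, "Metaphone" package installed).
from metaphone import doublemetaphone  # pip install Metaphone


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def build_near_neighbour_dict(word_list, max_dist=2):
    """For every word, store index lists of real words within edit
    distance 1..max_dist and of phonetically similar words."""
    words = sorted(word_list)           # lexically sorted, as in the post
    index = {w: i for i, w in enumerate(words)}
    # Bucket words by their primary Double Metaphone code.
    phonetic = {}
    for w in words:
        phonetic.setdefault(doublemetaphone(w)[0], []).append(index[w])
    entries = {}
    for w in words:
        entries[w] = {
            "edit": {d: [index[o] for o in words
                         if o != w and levenshtein(w, o) == d]
                     for d in range(1, max_dist + 1)},
            "phonetic": [i for i in phonetic[doublemetaphone(w)[0]]
                         if i != index[w]],
        }
    return words, entries
```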

Error Models

Based on the previous information, multiple error generation models were created. I picked out the ones that correlate most with the previously mentioned error types, although there are more models; a code sketch of these generators follows the list.

  • Insertion Errors: Inserts a random character at any position within a given token.
  • Deletion Errors: Deletes a random character at any position within a given token.
  • Swap Errors: Swaps two characters within a given word. The capitalization is kept intact.
  • Real-Word Errors: Another real word from the near-neighbour dictionary is picked as replacement.
  • Gaussian Keyboard: A special kind of error generator that can (partially) reproduce the error types of the models above and, in addition, simulates the natural typing flow in which one misses the intended key and hits a surrounding one instead.
    [Figure: Gaussian keyboard error model]
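
The following sketch shows one plausible implementation of these generators. Everything here is an assumption on my part: a plain QWERTY grid with the row stagger ignored, adjacent-character swaps, and a tunable `sigma` for the Gaussian keyboard; the post does not state the exact parameters.

```python
import random
import string

# Rough QWERTY coordinates per key (assumption: row stagger ignored).
KEY_POS = {c: (x, y)
           for y, row in enumerate(["qwertyuiop", "asdfghjkl", "zxcvbnm"])
           for x, c in enumerate(row)}


def insertion_error(token):
    """Insert a random character at any position within the token."""
    i = random.randint(0, len(token))
    return token[:i] + random.choice(string.ascii_lowercase) + token[i:]


def deletion_error(token):
    """Delete a random character from the token."""
    i = random.randrange(len(token))
    return token[:i] + token[i + 1:]


def swap_error(token):
    """Swap two adjacent characters, re-applying the original
    capitalization pattern afterwards."""
    if len(token) < 2:
        return token
    i = random.randrange(len(token) - 1)
    chars = list(token.lower())
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(c.upper() if o.isupper() else c
                   for c, o in zip(chars, token))


def gaussian_keyboard_error(token, sigma=0.7):
    """Replace one character by the key closest to a Gaussian-perturbed
    position of the intended key, simulating an imprecise keystroke."""
    i = random.randrange(len(token))
    c = token[i].lower()
    if c not in KEY_POS:
        return token
    x, y = KEY_POS[c]
    tx, ty = random.gauss(x, sigma), random.gauss(y, sigma)
    hit = min(KEY_POS, key=lambda k: (KEY_POS[k][0] - tx) ** 2
                                     + (KEY_POS[k][1] - ty) ** 2)
    return token[:i] + hit + token[i + 1:]
```

A real-word error can then be drawn directly from the near-neighbour dictionary sketched above, e.g. by picking a random entry from a token's edit-distance-1 index list.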

From the previously mentioned publications we get that approximately 2.22% of all words contain a misspelling of some kind (Flor et al.: 2.36%, Flor: 2.07%; averaged), which is taken into account when noising a given text.

One may note that each of the error generation models can accidentally generate real-word errors, so each generated error has to be cross-checked against the previously generated language dictionary. A sketch of this noising step follows.
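
As a sketch of how the error rate and the cross-check could interact (the `dictionary` is assumed to be a set of lowercase real words, and the retry bound is my own choice):

```python
import random

ERROR_RATE = 0.0222  # ~2.22% of all tokens, averaged from Flor et al. and Flor


def noise_tokens(tokens, dictionary, generators):
    """Corrupt roughly ERROR_RATE of the tokens; re-roll whenever a
    non-word generator accidentally produces a real word."""
    noisy, corrections = [], {}
    for idx, tok in enumerate(tokens):
        if len(tok) > 1 and random.random() < ERROR_RATE:
            for _ in range(10):                      # bounded retries
                bad = random.choice(generators)(tok)
                if bad != tok and bad.lower() not in dictionary:
                    corrections[idx] = {"erroneous": bad, "correction": tok}
                    tok = bad
                    break
        noisy.append(tok)
    return noisy, corrections
```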

Datasets

Now that all information about errors and their frequencies is in place, we need a data source for text. Due to its easy accessibility, I decided to use the English Wikipedia as raw text, ignoring small pages. Ignored pages include those which redirect to another page or merely represent a collection of links to other pages. The remaining text is structured such that each article resides in its own text file, and for each article a parallel JSON representation is generated containing the erroneous tokens and their corrections.
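
The post does not spell out the exact schema of the parallel JSON file, so the following layout is purely illustrative: one record per article, listing each erroneous token together with its position and correction.

```python
import json

# Hypothetical parallel representation for one article (illustrative schema).
record = {
    "article": "Example_Article.txt",
    "errors": [
        {"position": 17, "erroneous": "businees", "correction": "business"},
        {"position": 42, "erroneous": "with out", "correction": "without"},
    ],
}
with open("Example_Article.json", "w", encoding="utf-8") as f:
    json.dump(record, f, indent=2)
```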

Benchmarking

Now that we have both the erroneous and the correct representation of a text, they can be used to evaluate the correctness of any tool. For this, each test set of the benchmark is provided as JSON containing a randomly selected set of articles. Each tool has to return its results as JSON, through which it can add additional information, such as the detected error type (if supported) and additional correction suggestions.
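
A tool's result file could then look roughly like the following; the field names are hypothetical, since the post only states that error type and suggestions are optional extras.

```python
# Hypothetical result record produced by a spelling correction tool.
prediction = {
    "article": "Example_Article.json",
    "corrections": [
        {"position": 17,
         "prediction": "business",
         "error_type": "non-word",                # optional, if supported
         "suggestions": ["business", "busies"]},  # optional alternatives
    ],
}
```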

As an example, consider the following, where S denotes the source (the noisy input sentence), P the prediction, and G the ground truth.

  • First of all, P is tokenised. Through the previous generation step, we already have the tokenisation information of S and G.
    [Figure: tokenised sequences S, P, and G]
  • Next, the longest common subsequences between the pairs S<->P and G<->P are identified. One can see the regions A and B, which cannot be aligned in any way. The tokens connected by the green arrows can be aligned automatically using the alignment knowledge between S and G.
    [Figure: LCS alignment between S<->P and G<->P with unaligned regions A and B]
  • For the last step, I used a simple heuristic that links the remaining tokens using the Levenshtein edit distance, the Jaro-Winkler similarity, and, depending on the token position within the block, comparisons of suffixes and prefixes. This example also shows the case of a token that is still not assigned to any token of S or G. Such tokens are then matched via the connections of their neighbouring tokens to S and G (see the sketch after this list).
    [Figure: remaining tokens linked by the similarity heuristic]
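
As a sketch of the matching step, Python's `difflib.SequenceMatcher` can serve as a stand-in for an exact longest-common-subsequence computation on token lists (the post does not specify the actual implementation):

```python
from difflib import SequenceMatcher


def aligned_blocks(left_tokens, right_tokens):
    """Return matching token blocks (i, j, size) between two token lists,
    leaving gaps such as the regions A and B for the heuristic step."""
    sm = SequenceMatcher(a=left_tokens, b=right_tokens, autojunk=False)
    return [m for m in sm.get_matching_blocks() if m.size > 0]

# Tokens inside the remaining gaps can then be linked with the Levenshtein
# edit distance and a Jaro-Winkler similarity (e.g. from the "jellyfish"
# package), as described in the last step above.
```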

Source

The source code for the generator and the benchmark itself is available on GitHub.

References

[0] Flor, M. (2012). Four types of context for automatic spelling correction. TAL, 53(3), 61-99.

[1] Flor, M., Fried, M., & Rozovskaya, A. (2019, August). A benchmark corpus of English misspellings and a minimally-supervised model for spelling correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 76-86).
