The Basic Understanding of Algorithms Used in Spell Checkers
One of the first consumer spell checkers, WordCheck, was released in 1981 for Commodore computer systems. Forty years later, Grammarly is taking chunks out of the competition with its range of features, but even this tool is looking over its shoulder now as AI-generated text looks to take humans out of the equation entirely.
Any essay writer will tell you that a spell checker is an invaluable tool for their craft. Machines are great at spotting simple spelling mistakes, and as computing power has grown, it has become possible for spell checkers to evolve into grammar and even tone checkers.
Algorithms are sets of instructions carried out by computers. So if you’re using one, you’re saving time and effort, but a great deal of effort was exerted to produce a fine set of algorithms that can deal with the vagaries of human language. Here’s a basic view of what algorithms are up to when checking your texts.
A level of sophistication
A legitimate spell-checker needs to have some sophistication; a poem by Jerrold H. Zar written in 1992 illustrates this rather nicely.
Eye have a spelling chequer,
It came with my Pea Sea.
It plane lee marks four my revue
Miss Steaks I can knot sea.
A spell-checker using only an algorithm that examines each word in isolation will find no problem with that stanza, even though a human reader gets the joke immediately. Sophisticated spell-checkers need to include a language model that can check each word not only for letter order but for the context in which it appears.
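To make that concrete, here is a minimal sketch of such a word-by-word checker. The tiny dictionary and function name are our own assumptions for illustration; the point is that it finds nothing wrong with Zar's stanza, because every word, taken in isolation, is a real word.

```python
# A naive word-by-word checker: it only asks "is each word in the
# dictionary?", so every word in Zar's stanza passes unflagged.
DICTIONARY = {
    "eye", "have", "a", "spelling", "chequer", "it", "came", "with",
    "my", "pea", "sea", "plane", "lee", "marks", "four", "revue",
    "miss", "steaks", "i", "can", "knot",
}

def naive_check(text):
    """Return the list of words not found in the dictionary."""
    words = text.lower().replace(",", "").replace(".", "").split()
    return [w for w in words if w not in DICTIONARY]

stanza = ("Eye have a spelling chequer, It came with my Pea Sea. "
          "It plane lee marks four my revue Miss Steaks I can knot sea.")
print(naive_check(stanza))  # [] -- every word is 'correct' in isolation
```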
Today, spell-checking programs are available for a wide range of languages – almost all of them, in fact – and each language brings its own set of quirks that require algorithmic tweaks. Highly synthetic languages – like German – need heavy morphological analysis, that is, analysis of the way words are formed.
In German, words are built up from root words through inflection and compounding. All languages use these processes to an extent, but some much more than others. That’s how you can end up with very long words like ‘Rechtsschutzversicherungsgesellschaften’. Language variations like American or British English are much easier to deal with than these kinds of problems.
Spell-checkers developed by native speakers are at an advantage when the language in question is complex, as the people building the program better understand how it will actually be used. Some spell-checkers use a purely data-driven approach instead, which can save development time but can increase the number of errors that must be ironed out before the final version is rolled out and ready to use.
Some key algorithms
Certain algorithms are so successful that they have names of their own, rather than being lumped in as ‘general purpose’. For instance, Levenshtein distance is a string metric that measures the difference between two strings: the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. For those not in the know, a string is the computer science term for a sequence of characters.
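For readers who want to see it in action, here is a minimal sketch of the standard dynamic-programming computation of Levenshtein distance (the function name is ours):

```python
def levenshtein(a, b):
    """Edit distance between two strings: the minimum number of
    insertions, deletions, and substitutions to turn a into b,
    computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The classic example: ‘kitten’ becomes ‘sitting’ in three edits (two substitutions and one insertion).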
BK-trees are a later development: a data structure built on top of Levenshtein distance. A BK-tree organizes a dictionary by the number of changes a word has to go through to turn into another word, so that all near-matches for a misspelling can be found without comparing it against every entry. You may recognize the underlying idea from a popular puzzle in the games section of newspapers, where you have to gradually turn one word into another. Yes, just like an algorithm!
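A minimal sketch of a BK-tree over Levenshtein distance might look like the following. The node layout, names, and tiny word list are illustrative assumptions, not any particular library’s API; a compact copy of the distance function is included so the example stands alone.

```python
def levenshtein(a, b):
    """Minimal edit distance, used as the tree's metric."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    def __init__(self):
        self.root = None  # each node is (word, {distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Find stored words within max_dist edits of word, pruning
        subtrees that cannot contain a match (triangle inequality)."""
        matches, stack = [], ([self.root] if self.root else [])
        while stack:
            node_word, children = stack.pop()
            d = levenshtein(word, node_word)
            if d <= max_dist:
                matches.append(node_word)
            for child_d, child in children.items():
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return matches

tree = BKTree()
for w in ["book", "books", "boo", "cake", "cape"]:
    tree.add(w)
print(sorted(tree.search("bok", 1)))  # ['boo', 'book']
```

The pruning step is the whole point: subtrees whose edge distance falls outside the allowed range are skipped entirely, which is what makes lookups faster than a linear scan of the dictionary.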
Markov chains are another popular technique used in spell-checking software. A Markov chain built from corpus analysis predicts which character is likely to come next in a sequence. The same idea is used not only in spell-checkers but in stock market analysis and other time-series problems, such as predicting football scores.
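As a toy illustration (the training text and function names are our own), a first-order character-level Markov chain can be trained simply by counting which character follows which:

```python
from collections import Counter, defaultdict

def train_char_model(corpus):
    """Count, for each character, how often each other character
    follows it -- a first-order Markov chain over characters."""
    model = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        model[a][b] += 1
    return model

def most_likely_next(model, char):
    """Predict the most frequent successor of char in the training text."""
    followers = model.get(char)
    return followers.most_common(1)[0][0] if followers else None

model = train_char_model("the theory of the thing")
print(most_likely_next(model, "t"))  # 'h' -- 't' is usually followed by 'h'
```

A checker can use such a model to flag improbable character sequences (say, ‘tq’ in English) even when the word is not in its dictionary.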
Checking your tone
In essence, spell-checkers seem simple. They break down your text into words, and those words are compared against a dictionary. However, it is the ability to suggest accurate changes that sets apart a good spell-checker from a mere ‘verifier’.
The development and advancement of spell-checkers also opens up the question of where the boundary of ‘correct’ spelling lies. With spell-checkers that grade your tone and voice, the software is assessing whether or not you are writing in a way that fits the purpose of the paper.
All spell-checkers need a corpus to be highly successful. We’re adding to these corpora all the time, making it possible to create tools that blur the line between checking the formation of words and actually writing content.
N-gram models, which are closely related to Markov chains, are essential in the field of computational linguistics; they deal with probability in sequences and are very powerful tools. The main algorithm in question, however, is still Levenshtein distance. If you can get to grips with the approach behind this algorithm, you will be well on the way to building your own spelling, and even grammar, checker.