Unix pipelines for basic spelling error detection

Jason BaldridgeFebruary 4th, 2013Last Updated: February 3rd, 2013

0 145 6 minutes read

Introduction

We can of course write programs to do most anything we want, but often the Unix command line has everything we need to perform a series of useful operations without writing a line of code. In my Applied NLP class today, I show how one can get a high-confidence dictionary out of a body of raw text with a series of Unix pipes, and I’m posting that here so students can refer back to it later and see some pointers to other useful Unix resources.

Note: for help with any of the commands, just type “man <command>” at the Unix prompt.

Checking for spelling errors

We are working on automated spelling correction as an in-class exercise, with a particular emphasis on the following sentence: This Facebook app shows that she is there favorite acress in tonw So, this has a contextual spelling error (there), an error that could be a valid English word but isn’t (acress) and an error that violates English sound patterns (tonw).

One of the key ingredients for spelling correction is a dictionary of words known to be valid in the language. Let’s assume we are working with English here. On most Unix systems, you can pick up an English dictionary in /usr/share/dict/words, though the words you find may vary from one platform to another. If you can’t find anything there, there are many word lists available online, e.g. check out the Wordlist project for downloads and links. We can easily use the dictionary and Unix to check for words in the above sentence that don’t occur in the dictionary. First, save the sentence to a file.

$ echo 'This Facebook app shows that she is there favorite acress in tonw' > sentence.txt

Next, we need to get the unique word types (rather than tokens) is sorted lexicographic order. The following Unix pipeline accomplishes this.

$ cat sentence.txt | tr ' ' '\n' | sort | uniq > words.txt

To break it down:

The cat command spills the file to standard output.
The tr command “translates” all spaces to new lines. So, this gives us one word per line.
The sort command sorts the lines lexicographically.
The uniq command makes those lines uniq by making adjacent duplicates disappear. (This doesn’t do anything for this particular sentence, but I’m putting it in there in case you try other sentences that have multiple tokens of the type “the”, for example.)

You can see these effects by doing each in turn, building up the pipeline incrementally.

$ cat sentence.txt
This Facebook app shows that she is there favorite acress in tonw
$ cat sentence.txt | tr ' ' '\n'
This
Facebook
app
shows
that
she
is
there
favorite
acress
in
tonw
$ cat sentence.txt | tr ' ' '\n' | sort
Facebook
This
acress
app
favorite
in
is
she
shows
that
there
tonw

We can now use the comm command to compare the file words.txt and the dictionary. It produces three columns of output: the first gives the lines only in the first file, the second are lines only in the second file, and the third are those in common. So, the first column has what we need, because those are words in our sentence that are not found in the dictionary. Here’s the command to get that.

$ comm -23 words.txt /usr/share/dict/words
Facebook
This
acress
app
shows
tonw

The -23 options indicate we should suppress columns 2 and 3 and show only column 1. If we just use -2, we get the words in the sentence with the non-dictionary words on the left and the dictionary words on the right (try it).

The problem of course is that any word list will have gaps. This dictionary doesn’t have more recent terms like Facebook and app. It also doesn’t have upper-case This. You can ignore case with comm using the -i option and this goes away. It doesn’t have shows, which is not in the dictionary since it is an inflected form of the verb stem show. We could fix this with some morphological analysis, but instead of that, let’s go the lazy route and just grab a larger list of words.

Extracting a high-confidence dictionary from a corpus

Raw text often contains spelling errors, but errors don’t tend to happen with very high frequency, so we can often get pretty good expanded word lists by computing frequencies of word types on lots of text and then applying reasonable cutoffs. (There are much more refined methods, but this will suffice for current purposes.)

First, let’s get some data. The Open American National Corpus has just released v3.0.0 of its Manually Annotated Sub-Corpus (MASC), which you can get from this link.

– http://www.anc.org/masc/MASC-3.0.0.tgz

Do the following to get it and set things up for further processing:

$ mkdir masc
$ cd masc
$ wget http://www.anc.org/masc/MASC-3.0.0.tgz
$ tar xzf MASC-3.0.0.tgz

(If you don’t have wget, you can just download the MASC file in your browser and then move it over.)

Next, we want all the text from the data/written directory. The find command is very handy for this.

$ find data/written -name '*.txt' -exec cat {} \; > all-written.txt

To see how much is there, use the wc command.

$ wc all-written.txt
   43061  400169 2557685 all-written.txt

So, there are 43k lines, and 400k tokens. That’s a bit small for what we are trying to do, but it will suffice for the example.

Again, I’ll build up a Unix pipeline to extract the high-confidence word types from this corpus. I’ll use the head command to show just part of the output at each stage.

Here are the raw contents.

$ cat all-written.txt | head

I can't believe I wrote all that last year.
Acephalous

Friday, 07 May 2010

Now, get one word per line.

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | head

I
can
t
believe
I
wrote
all
that
last

The tr translator is used very crudely: basically, anything that is not an ASCII letter character is turned into a new line. The -cs options indicate to take the complement (opposite) of the ‘A-Za-z’ argument and to squeeze duplicates (e.g. A42, becomes A with a single new line rather than three).

Next, we sort and uniq, as above, except that we use the -c option to uniq so that it produces counts.

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | head
   1
 737 A
  22 AA
   1 AAA
   1 AAF
   1 AAPs
  21 AB
   3 ABC
   1 ABDULWAHAB
   1 ABLE

Because the MASC corpus includes tweets and blogs and other unedited text, we don’t trust words that have low counts, e.g. four or fewer tokens of that type. We can use awk to filter those out.

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' | head
A
AA
AB
ACR
ADDRESS
ADP
ADPNP
AER
AIG
ALAN

Awk makes it easy to process lines of files, and gives you indexes into the first column ($1), second ($2), and so on. There’s much more you can do, but this shows how you can conditionally output some information from each line using awk.

You can of course change the threshold. You can also turn all words to lower-case by inserting another tr call into the pipe, e.g.:

$ cat all-written.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | awk '{ if($1>8) print $2 }' | head
a
aa
ab
abandoned
abbey
ability
able
abnormal
abnormalities
aboard

It all comes down to what you need out of the text.

Combining and using the dictionaries

Let’s do the check on the sentence above, but using both the standard dictionary and the one derived from MASC. Run the following command first.

$ cat all-written.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | awk '{ if($1>4) print $2 }' > /tmp/masc_vocab.txt

Then in the directory where you saved words.txt, do the following.

$ cat /usr/share/dict/words /tmp/masc_vocab.txt | sort | uniq > big_vocab.txt
$ comm -23 words.txt big_vocab.txt
acress
tonw

Ta-da! The MASC corpus provided us with enough examples of other words that This, Facebook, app, and shows are no longer detected as errors. Of course, detecting there as an error is much more difficult and requires language models and more.