Sunday 11 September 2016

Lexicon 1

For a long time, I've been thinking about implementing a word puzzle dictionary to help solve (or cheat at, if you prefer) crosswords, codewords, Scrabble, and the rest. You know the sort of thing:
"One across, nine letters: C-blank-O-blank-S-blank-blank-R-blank" *
I've always assumed that running this on the client-side in JavaScript would be far too slow. It turns out I was very wrong; if you're careful with your coding, you can get it running in a browser on a low-end tablet or phone with memory and CPU cycles to spare.

My first problem was getting hold of a good lexicon or word list. The obvious avenue to explore was the "official" Scrabble word list used in competitions, but it quickly became apparent that there are at least two such lists:
These are simply lists of words that are considered "valid"; the former is over 178,000 words and the latter over 267,000 words. They do not contain definitions nor any metadata about the words such as frequency.

In contrast, SCOWL (Spell Checker Oriented Word Lists) contains over 423,000 words with information such as relative frequency and regional variations (American/Canadian/British). Some of the words are suspect (domain-specific abbreviations and bad optical character recognition are two likely sources of suspicious "words") but the frequency information makes it easy to cull outliers.

Instead of choosing just one word list, I decided to go for all three and divided the superset of words into five levels of "rarity" based on the SCOWL frequency metadata. It turns out that SOWPODS is a superset of TWL06, but contains some words not found in SCOWL.

After a lot of curation (in Excel, of all tools!) I ended up with a list of 432,852 words split into five levels of increasing "rarity". Each word is implicitly tagged with the word lists in which it is found. I limited the words to those between two and fifteen letters (inclusive) where each "letter" is one of the twenty-six Roman letters "A" to "Z" (no digits, apostrophes, hyphens or accents are allowed).

I could have stored the lexicon as a simple text file or as a JSON array of strings, but this would have taken up several megabytes, so I decided to have a go at trivial compression. I know, I know ... most HTTP servers will serve up compressed files out of the box and there exist any number of JavaScript libraries that do client-side decompression of ZIP files, but I felt like re-inventing a wheel. Give me a break!

The main constraint I gave myself was to try to get the lexicon to the client-side in a usable form as quickly as possible. This meant a trade-off between file size and time taken to decompress and store in a suitable data structure. One thing you don't want to do is sort half a million words into alphabetical order on the client. That means the lexicon (or parts thereof) needs to be pre-sorted.

Sorted word lists have a lot of redundancy at the beginning of each entry. For example, imagine a list starting like this:
  • AARDVARK
  • AARDVARKS
  • AARDWOLF
  • AARDWOLVES
  • ...
If we replace the beginning of each word with a digit ('0' to '9') specifying how many letters from the previous word we should replicate, the list becomes:
  • 0AARDVARK
  • 8S
  • 4WOLF
  • 7VES
  • ...
Another technique is to substitute common letter sequences with ASCII characters outside the ranges '0' to '9' and 'A' to 'Z'. It turns out that the most common four-letter sequences near the end of the words are:
NESS TION ABLE ICAL MENT INGS LESS
For three-letter sequences, the list is:
ING OUS IES ATE TIC TER ISM IST SES INE ION IVE 
While the most common near-end two-letter sequences are:
ED ER ES AL LY AN IC EN IT OR IS ON AR LE IN IA ST ID RA PH UL LA OM IL CH TE CE UUR AT OL
Using these two simple techniques, the 432,852 words can be compressed into a single ASCII string of just over 1.5 million characters. This string can be efficiently converted into a high-performance in-memory data structure within a second on a typical PC.

The full demo can be found here, whilst the JavaScript file that decompresses the word list is lexicond.js.

* "CLOISTERS" obviously

No comments:

Post a Comment