"One across, nine letters: C-blank-O-blank-S-blank-blank-R-blank" *I've always assumed that running this on the client-side in JavaScript would be far too slow. It turns out I was very wrong; if you're careful with your coding, you can get it running in a browser on a low-end tablet or phone with memory and CPU cycles to spare.
My first problem was getting hold of a good lexicon or word list. The obvious avenue to explore was the "official" Scrabble word list used in competitions, but it quickly became apparent that there are at least two such lists:
- The Official Tournament and Club Word List (or, more specifically, "TWL06" for the 2006 version) used in the USA, Canada and Thailand; and
- SOWPODS used just about everywhere else.
Both of those are bare lists of valid words. In contrast, SCOWL (Spell Checker Oriented Word Lists) contains over 423,000 words annotated with information such as relative frequency and regional variation (American/Canadian/British). Some of the words are suspect (domain-specific abbreviations and bad optical character recognition are two likely sources of suspicious "words") but the frequency information makes it easy to cull outliers.
Instead of choosing just one word list, I decided to go for all three and divided the superset of words into five levels of "rarity" based on the SCOWL frequency metadata. It turns out that SOWPODS is a superset of TWL06, but contains some words not found in SCOWL.
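SCOWL tags each word with a "size" category (from 10 for everyday words up to 95 for the most obscure), so deriving the five levels might look something like the sketch below. The thresholds here are hypothetical; the ones actually used aren't recorded in this post.

```javascript
// Hypothetical mapping from a SCOWL "size" category to the five rarity
// levels; the real thresholds used during curation may well differ.
// Words found only in SOWPODS carry no SCOWL size and get the rarest level.
function rarityLevel(scowlSize) {
  if (scowlSize === undefined) return 5; // SOWPODS-only word
  if (scowlSize <= 35) return 1;         // everyday vocabulary
  if (scowlSize <= 50) return 2;
  if (scowlSize <= 60) return 3;
  if (scowlSize <= 80) return 4;
  return 5;                              // very rare
}
```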
After a lot of curation (in Excel, of all tools!) I ended up with a list of 432,852 words split into five levels of increasing "rarity". Each word is implicitly tagged with the word lists in which it is found. I limited the words to those between two and fifteen letters (inclusive) where each "letter" is one of the twenty-six Roman letters "A" to "Z" (no digits, apostrophes, hyphens or accents are allowed).
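The letter constraint is easy to express as a regular expression. A hypothetical filter (the real curation happened in Excel, as noted above) would be:

```javascript
// Keep only words of two to fifteen letters drawn from A-Z: no digits,
// apostrophes, hyphens or accented characters survive this filter.
const VALID_WORD = /^[A-Z]{2,15}$/;

const rawWords = ["aardvark", "it's", "co-op", "café", "a", "zymurgy"];
const words = rawWords
  .map(w => w.toUpperCase())
  .filter(w => VALID_WORD.test(w));

console.log(words); // ["AARDVARK", "ZYMURGY"]
```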
I could have stored the lexicon as a simple text file or as a JSON array of strings, but this would have taken up several megabytes, so I decided to have a go at trivial compression. I know, I know ... most HTTP servers will serve up compressed files out of the box and there exist any number of JavaScript libraries that do client-side decompression of ZIP files, but I felt like re-inventing a wheel. Give me a break!
The main constraint I gave myself was to get the lexicon to the client in a usable form as quickly as possible. This meant a trade-off between file size and the time taken to decompress the data into a suitable data structure. One thing you don't want to do is sort half a million words into alphabetical order on the client, which means the lexicon (or parts thereof) needs to be pre-sorted.
Sorted word lists have a lot of redundancy at the beginning of each entry. For example, imagine a list starting like this:
- AARDVARK
- AARDVARKS
- AARDWOLF
- AARDWOLVES
- ...

This redundancy can be squeezed out with simple front coding: each entry stores the number of leading letters it shares with the previous word, followed by the letters that differ (a decoding sketch follows the list). Encoded this way, the list becomes:

- 0AARDVARK
- 8S
- 4WOLF
- 7VES
- ...
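Decoding a front-coded list is cheap, and the words emerge already in sorted order, so nothing needs re-sorting on the client. A minimal decoder sketch (the actual format used by lexicond.js is more involved):

```javascript
// Decode a front-coded word list: each entry is the number of leading
// letters shared with the previous word, followed by the new suffix.
function decodeFrontCoded(entries) {
  const words = [];
  let previous = "";
  for (const entry of entries) {
    const shared = parseInt(entry, 10);            // leading shared-prefix count
    const suffix = entry.slice(String(shared).length);
    previous = previous.slice(0, shared) + suffix; // rebuild the full word
    words.push(previous);
  }
  return words;
}

console.log(decodeFrontCoded(["0AARDVARK", "8S", "4WOLF", "7VES"]));
// => ["AARDVARK", "AARDVARKS", "AARDWOLF", "AARDWOLVES"]
```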
There is similar redundancy at the ends of words, which suggests a second technique: substituting the most common endings with single characters (sketched below). The most common four-letter endings in the lexicon are:

NESS TION ABLE ICAL MENT INGS LESS

For three-letter sequences, the list is:

ING OUS IES ATE TIC TER ISM IST SES INE ION IVE

While the most common near-end two-letter sequences are:

ED ER ES AL LY AN IC EN IT OR IS ON AR LE IN IA ST ID RA PH UL LA OM IL CH TE CE US UR AT OL

Using these two simple techniques, the 432,852 words can be compressed into a single ASCII string of just over 1.5 million characters. This string can be efficiently converted into a high-performance in-memory data structure within a second on a typical PC.
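For concreteness, the ending substitution might work something like this, mapping each common ending to a single lowercase character that can never appear in an upper-case A-Z word. This is a sketch covering only a few endings; the real encoding in lexicond.js may differ, and handling "near-end" sequences (rather than exact endings) would need a little more machinery.

```javascript
// Sketch of ending substitution: each common ending maps to one lowercase
// character (safe, since lexicon words only use A-Z). Longer endings are
// listed first so they take precedence over their own tails (NESS vs ES).
const ENDINGS = ["NESS", "TION", "ABLE", "ING", "OUS", "ED", "ER", "ES"];

function encodeEnding(word) {
  for (let i = 0; i < ENDINGS.length; i++) {
    if (word.endsWith(ENDINGS[i])) {
      return word.slice(0, word.length - ENDINGS[i].length) +
             String.fromCharCode(97 + i); // 'a' + i
    }
  }
  return word;
}

function decodeEnding(encoded) {
  const code = encoded.charCodeAt(encoded.length - 1);
  if (code >= 97 && code < 97 + ENDINGS.length) {
    return encoded.slice(0, -1) + ENDINGS[code - 97];
  }
  return encoded;
}

console.log(encodeEnding("DARKNESS")); // "DARKa"
console.log(decodeEnding("DARKa"));    // "DARKNESS"
```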
The full demo can be found here, whilst the JavaScript file that decompresses the word list is lexicond.js.
* "CLOISTERS" obviously