Sunday, 27 November 2016

Monsters of the Movies - Achievement Unlocked

After almost forty years, I've finally managed to see all the films listed in Monsters of the Movies. I've been tracking my progress in a spreadsheet and a letterboxd list and over the last few months, I've been slowly filling my gaps by downloading the films from archive.org or watching the films online. The exception was Mad Love (1935), which I had to get through Amazon as a Spanish-subtitled DVD and then watch with the subtitles turned off!

Amongst the recently-watched films, the two stand-out gems for me were The Manster (1959) a.k.a. The Split:


and The Abominable Dr. Phibes (1971):


I kept the latter until the very end ... and I wasn't disappointed.

Lexicon 2

To finish off the discussion of my Lexicon demo, I'll outline the in-memory data structure used to store the dictionary.

In the first part, I described how the lexicon is stored trivially-compressed for transmission "over the wire". The words are then expanded in memory into a form that makes searching simple and efficient.

The main data structure is a four-dimensional array of JavaScript objects named "Lexicon.byRaritySubsetLength":

  1. The first index is an integer rarity between one and five inclusive: one means "not very rare", whereas five means "esoteric".
  2. The second index is an integer bit-field (one to seven inclusive) which encodes which dictionaries the corresponding list is part of: TWL06, SOWPODS and SCOWL.
  3. The third index is the integer length of the word (two to fifteen inclusive).
  4. The fourth index is an integer index into a list sorted alphabetically (starting from zero).
By default, the demo looks at all rarities and word lengths for the SOWPODS lexicon. This means that without changing the criteria, when matching words, the client must look through 5 * 4 * 14 = 70 lists of words.

Each word entry is a JavaScript object of the following form:

    word : string
    rarity : integer
    subset : integer
    mask : integer

The first three fields contain the uppercase word, rarity index and lexicon subset mask. The last field is used when searching for anagrams. It is a bit-field that encodes which letters are contained within the word: 1 for "A", 2 for "B", 4 for "C", 8 for "D", etc.

For instance, to find anagrams for "WORDS" in SOWPODS, the client needs to search through only 20 lists of words: five rarities (1 to 5), four lists for SOWPODS and one word length (5). These lists contain 12,478 entries which can be scanned quickly by looking for those that have a "mask" with the bits for "D", "O", "R", "S" and "W" all set. It will typically come up with the solution ("SWORD" and "DROWS") within a couple of milliseconds.