Monday, November 20, 2006

Cryptogram solver

Newspaper puzzles beware. I have been working on soving newspaper cryptograms for the past few days in my free time. It is a bit more interesting of a problem to solve than sudoku I think. The algorithm I've implementing is nothing alltogether impressive and it relies on a good dictionary file to get solutions.

The algorithm works like so:
First you need an index which is a map of abstract words to lists of words.
An abstract word is taking a word and turning it into a pattern, for instance, 'cat' has the pattern '123', as does 'hat', and 'fat'. 'mom' has the pattern '121'.

Then it takes the sentence you have given it, splits it on spaces.
Popping the first word off the list, finds the list in the index of possible words it could be, iterates over it generating a map of letters to letters and recurses on the rest of the sentence, once it has reached the end of the sentence it puts the map it has generated into a list of possible solutions. If a generated map for a word conflicts with teh current map it is not a valid solution and moves onto the next possible solution.
The solve function returns a list of solution maps that can eb applied to any sentence to get the result.

Little things that make it helpful include being able to give an intial map, if you are sure some letters map to other letters you can give that as a hint. It can also remove any words whos pattern does not appear in the map.

Solve only those subsets of words that, if solved, will result in the entire alphabet being solved. It should do this for all possible combinations of words in teh sentence that will result in this. This is not an optimizations, it should probably make it take longer to solve actually, however it allows it to solve sentences with words that might not exist in the dictionary but could come about as a result of solving the other words. I have a few ideas of how to do this but havn't had time to implement it yet.

The current code can be downloaded here.


  1. I did a similar project in ocaml. I didn't do the ``abstract word'' thing until later on in the evolution of the development, but instead started by counting the letter frequencies and sorting the words by cost to isolate the most valuable word to solve first and then figuring out the additional mappings from there.

    For example, given the input:

    Wjko ko f weow oerwerce wh grocxfsbme

    It'd decide to process ``oerwerce'' first because there were five unique letters, many of them existing in other words. There were also only 5 matches given my word classification (abcdbceb) once I added that.

    Then, of course, for the rest of the words, I just have to fill in whatever letters are left and check all the remaining words against the dictionaries I loaded by classification.

  2. Of course, the traditional method without brute force lookups is to use letter frequencies. In English the most common letters are roughly ETAONIRSHDLU (there are variants and you can build your own frequencies easily from open source text).

    The manual technique also involved recognizing patterns. For example in Dustin's phrase the two letter word "ko" is also the ending for "Wjko". "This is" could be a good guess to get things started.

    The method was generally:

    1) Count letter frequencies
    2) Count two-letter frequencies
    3) Note any subsequences that are also words
    4) Apply any constraints of #2 and #3 that are unambiguous or only have 2 or 3 choices.
    5) Try letters using the frequency expected vs the frequency observed, replacing all occurrences
    6) Fill in words with 1 or 2 or 3 letters missing.
    7) Backtrack if you get stuck.

  3. The anonymous poster makes a good point on letter frequency analysis. Unfortunately, I'd solved the one cryptogram I set out to solve before getting into that, but I did at least have the frequency lists ready (please excuse me for posting in ocaml):

    let single_letters = [
    'e'; 't'; 'o'; 'a'; 'n'; 'i'; 'r'; 's'; 'h'; 'd'; 'l'; 'c'; 'w';
    'u'; 'm'; 'f'; 'y'; 'g'; 'p'; 'b'; 'v'; 'k'; 'x'; 'q'; 'j'; 'z';

    let digraphs = [
    "th"; "er"; "on"; "an"; "re"; "he"; "in"; "ed"; "nd"; "ha"; "at"; "en";
    "es"; "of"; "or"; "nt"; "ea"; "ti"; "to"; "it"; "st"; "io"; "le"; "is";
    "ou"; "ar"; "as"; "de"; "rt"; "ve";

    let trigraphs = [
    "the"; "and"; "tha"; "ent"; "ion"; "tio"; "for"; "nde"; "has"; "nce";
    "edt"; "tis"; "oft"; "sth"; "men";

    let doubles = [
    "ss"; "ee"; "tt"; "ff"; "ll"; "mm"; "oo";

    let initial_letters = [
    't'; 'o'; 'a'; 'w'; 'b'; 'c'; 'd'; 's'; 'f'; 'm'; 'r'; 'h'; 'i'; 'y'; 'e';
    'g'; 'l'; 'n'; 'p'; 'u'; 'j'; 'k';

    let final_letters = [
    'e'; 's'; 't'; 'd'; 'n'; 'r'; 'y'; 'f'; 'l'; 'o'; 'g'; 'h'; 'a'; 'r'; 'm';
    'p'; 'u'; 'w';

    I forgot *where* I got these, but I would assume it had something to do with Bruce Schneier.

    The thing I used the most was a complete ordered dictionary of words by frequency I got from, I believe, here:

    The biggest problem I've got at this point is that my dictionary is a bit too big (tg doesn't look like a word to me), but without the program understanding grammar, I still have to read the possibilities myself to figure out what the right answer is.

  4. The target text that I'm trying to solve are the cryptograms that appear in most Sunday American newspapers. I'm not sure a statistical model would work so well for those since the input is generally only a sentence or two.

    My solver has a similar problem in terms of outputting a lot of things that have 100% valid words but they make no sense. Perhaps instead of understanding the grammar one could have a heirarchy of what matches seem better than others. For instance if you solve something via the large words that have fewer matches in your dictionary perhaps those would be better matches than those that have lots of choices. SO if solving those words leads to a full dictionary perhaps that is a better match, but then that runs into the issue of using some odd word that isn't in the dictionary.

  5. How would a statistical model work with a high number of irregular words?

  6. I think the statistical model breaks down a bit in that situation. But it depends on what you mean by 'irregular'. If you define an irregular series of words as those that contain letter counts that do not fit what you would statistically expect, then sure there is an issue. If you define 'irregular' as a series of words that people just don't use very often, the letter frequency there might still be in the range you expect.