On placing letters

Work continues on my fork of the QXW crossword filler, which by now is about 50% rewritten by me and will ultimately end up looking nothing like its predecessor. Already it is a command-line, Monte Carlo automatic filler instead of a user-directed GTK program. Both have their uses, I think.

In a previous post, I wished that QXW supported scored dictionaries. It turns out that it does, though in my tests it gave unexpected results in many cases. I’ve reworked most of that code and am pretty happy with the results I get now. Just to cherry-pick a before-and-after example:

input:

...#...
...#...
.......
##.X.##
.......
...#...
...#...

before:

SBA#STA
BJS#AIX
WASHMEN
##AXA##
ICGARDS
IUE#AWK
PFS#HHM

after:

THE#BHE
FOR#EAT
COUNCIL
##PXA##
LETTUCE
ARE#SEE
DID#EAR

The latter grid looks a lot more like actual words, though it does still have some garbage entries.

QXW’s algorithm could be roughly stated as “find the hardest-to-fill cell, fill it with the highest-scoring letter, repeat.” It is a naturally recursive algorithm (implemented iteratively with stacks) that either terminates when the grid is filled, or unwinds and tries the next best letter.
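
In Python-flavored pseudocode, that strategy looks something like the sketch below. This is my paraphrase, not QXW’s actual C code; `pick_most_constrained_cell` and `candidate_letters` are hypothetical helpers:

    # Sketch of the letter-at-a-time fill described above; helpers are hypothetical.
    def fill_by_letter(grid):
        cell = pick_most_constrained_cell(grid)      # the "hardest to fill" cell
        if cell is None:                             # no empty cells left
            return grid
        # Try letters from highest score down; on failure, unwind and try the next one.
        for letter in candidate_letters(grid, cell, order="score"):
            grid[cell] = letter
            result = fill_by_letter(grid)            # QXW does this iteratively with stacks
            if result is not None:
                return result
            grid[cell] = None                        # undo and try the next best letter
        return None                                  # dead end; caller backtracks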

I have some issues with this approach, the main one being that maximizing letter scores seems unlikely to maximize the scores of whole words. I’ve been thinking about this problem some, including the form of optimality that I would like to achieve: maximize the sum of the scores of every word in the grid.
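
In code, that objective is just the following (a sketch; `words_in` is a stand-in for whatever enumerates the across and down words, and `scores` is the scored dictionary):

    def grid_score(grid, scores):
        # Sum of dictionary scores over every across and down word in the grid.
        return sum(scores.get(word, 0) for word in words_in(grid))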

A simple brute force algorithm implementing this could be “fill each entry with a word, compute the score, and repeat until all words in all entries have been tried, taking the grid with maximum score.” It’s fairly easy to implement this exponential algorithm, though you may be waiting a looooooooooong time for it to finish.
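
A sketch of that exhaustive search, using the `grid_score` above plus hypothetical `candidates`, `place`, and `is_filled` helpers:

    def brute_force(grid, entries, scores):
        # Exponential: tries every fitting word in every entry, keeps the best grid.
        if not entries:
            return (grid_score(grid, scores), grid) if is_filled(grid) else (float("-inf"), None)
        entry, rest = entries[0], entries[1:]
        best = (float("-inf"), None)
        for word in candidates(grid, entry):          # every dictionary word that fits
            result = brute_force(place(grid, entry, word), rest, scores)
            if result[0] > best[0]:
                best = result
        return best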

What I have now is something like the following greedy algorithm (sketched in code after the list):

  1. Base case (no more unfilled entries):
    1. if the grid is filled, return the grid
    2. otherwise, return None
  2. Choose the longest unfilled entry crossing a “critical” cell
  3. Fill it with the next highest-ranked word for that entry
  4. Recurse. Repeat from step 2 until the grid is filled or there are no more words to try
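
Here is a rough sketch of that loop, again with hypothetical helpers: `pick_entry` implements step 2, `ranked_words` yields fills for an entry from highest score down, and `place` returns a copy of the grid with the word written in:

    def greedy_fill(grid):
        entry = pick_entry(grid)                 # longest unfilled entry through a critical cell
        if entry is None:                        # base case: no more unfilled entries
            return grid if is_filled(grid) else None
        for word in ranked_words(grid, entry):   # step 3: next highest-ranked word that fits
            result = greedy_fill(place(grid, entry, word))
            if result is not None:
                return result
        return None                              # out of words for this entry; unwind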

In practice, with a large enough dictionary, this usually terminates without searching forever, but you get some junk like PXA. I have added an arbitrary upper limit, at which point I just give up the search.

An important thing to note is that the problem does not exhibit optimal substructure, so dynamic programming is not applicable. For example, take this template:

    ..E
    IPA
    PAT

We can fill it any number of ways. My algorithm will fill it like this:

    THE
    IPA
    PAT

…which is terrible (HPA — [Millibar alt.?]). Much better would be:

    SEE
    IPA
    PAT

…which is what you get if you fill the child entry .PA (as “EPA”) first.

There are a couple of things at work here: first, I pick the longest entry, so .IP and .PA don’t even get a chance. Does this make sense? Should I instead pick the shortest entry? The entry with the fewest completions? Any of these might make better puzzles, or at least move the junk around.

Second, THE is really really really common. The commonest three-letter word ever. The previous sentence had one and so does this one. That’s a lot of THEs. Also, it is a terrible word for crosswords. [Well, looking back through my NYT database, it has been used over a hundred times: {It’s definite} or {Genuine article?} being decent clues.] Unfortunately, THE is so very common and so highly ranked that practically every high-scoring puzzle will have one. So my optimality metric may be sub-optimal in that respect. Flattening the word probability curve somewhat could partially address this issue.
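
One way to do that flattening would be to score words on a compressed scale, say log frequency instead of raw frequency. This is just an illustration of the idea, not what the fork currently does:

    import math

    def flattened_score(frequency, floor=1):
        # Log scaling keeps the ranking but shrinks the gap between
        # an ultra-common word like THE and a merely common one like SEE.
        return math.log(max(frequency, floor) + 1)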

It is clear that we could pick worse words at some levels to get a better overall result. So I implemented a very simple Monte Carlo search that looks like this (sketched in code after the list):

  1. Run the greedy algorithm, saving the decision stack
  2. Pick a random location in the stack and, from there, pick a random word (instead of the best word). Restart the algorithm from that point.
  3. Repeat this zillions of times, keeping track of the best N grids as seeds for future iterations
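
A sketch of that loop, assuming a hypothetical `greedy_fill_with_stack` that returns a (score, grid, decision stack) triple and accepts a `first_choice` switch, plus a `replay` that re-applies a stack prefix to an empty grid:

    import random

    def monte_carlo(seed_grid, iterations=100000, keep=10):
        # Perturb the greedy search at a random decision point; keep the best N grids as seeds.
        best = [greedy_fill_with_stack(seed_grid)]     # list of (score, grid, stack)
        for _ in range(iterations):
            _, _, stack = random.choice(best)          # seed from a good previous run
            i = random.randrange(len(stack))           # random location in the stack
            restarted = greedy_fill_with_stack(
                replay(seed_grid, stack[:i]),          # keep the decisions before i...
                first_choice="random")                 # ...then pick a random word, not the best
            best.append(restarted)
            best.sort(key=lambda t: t[0], reverse=True)
            best = best[:keep]                         # best N grids survive as future seeds
        return best[0]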

Lots of variations on the above remain to be explored. Also, this approach opens the door for taking a completed grid, manufacturing a search tree that generated it, and then optimizing it from there. Don’t like the fill in a published puzzle? Let the computer regenerate it for you.