Pages

Sunday 15 January 2017

Creating crosswords with python-constraint [part 2]

In the previous post I was generating the patterns of white and black squares. In this post I'd like to actually fill the patterns with words so that everything matches up. This sort of problem is very well suited to a constraint satisfaction solver, as it can try many possibilities until something is found that works. The trick is specifying the constraints in an efficient manner. My first solution was to implement it the same way as the magic word squares, but this was really inefficient and there is a much better way.

I put the code for this on github, it's a bit messy but works: https://github.com/jameslyons/crossword_gen.

The trick is you don't need to represent all the different squares in the crossword as separate variables and put the constraints on those, your actual variables will be indexes into the word list, and the constraints will be whether e.g. character 4 of word 1 is the same as character 2 of word 7. These constraints will obviously be different for every crossword and will have to be generated each time, but this isn't too hard.

The inputs from the previous step are hex strings e.g.

3EF9C51EBFAFAAF8FE37577BFB9D5CEFEF75763F8FAAFAFEBC51CFBEg

This can be converted to a binary string by replacing e.g. 3 with '0011', same as decoding hex anywhere. The function I use is down the bottom of the previous post. This gives a binary string that can be represented as the white and black squares of the crossword.

Example run

The code takes as argument the hex representation of the pattern. It first prints a representation of the crossword with 'X's where the letters go, then once it gets solved it outputs the solved crossword, along with the list of words that got used. The dictionary I am using has some weird words in it, so you need to pay some attention to that if you want nice crosswords (I recommend using something like the 3of6game.txt from the 12dicts wordlist, it is specifically for games). The final thing that gets printed is the crossword represented as a string, this is passed to some typesetting code I have that produces nice pdfs of the crosswords.

>pypy cw_fill.py BEFB4516BFAFAAF8FE27573BFBDD5DEFEE75723F8FAAFAFEB4516FBEh
building word lists...
X X X X X X X X X X X X
X X X X X X
X X X X X X X X X X X
X X X X X X X X X X X
X X X X X X X X X
X X X X X X X X
X X X X X X X X X X X X X
X X X X X X X X X X
X X X X X X X X X X X X X
X X X X X X X X
X X X X X X X X X
X X X X X X X X X X X
X X X X X X X X X X X
X X X X X X
X X X X X X X X X X X X
starting...
count: 0
s . c r u m b . r e m a n . b
w . r . . . r . i . . . o . r
a . a . p h o e b u s . s . i
m o w n . e . l . r . g y m s
i . . . c r i m i n e . . . k
. . p o o . n . n . v e g . .
a v o . i n f a u n a . i l k
s . s u n . l . t . d i g . i
p h i . a l a n i n e . o f t
. . t a g . m . l . r a t . .
i . . . e y e l e s s . . . o
m o t s . e . o . p . w a t t
i . o . s t o p g a p . m . t
d . r . . . a . e . . . i . a
e . c l a c k . l u l l s . r

['swami', 'asp', 'imide', 'craw', 'posit', 'torc', 'coinage', 'her', 'yet', 'bro',
'inflame', 'oak', 'elm', 'lop', 'rib', 'inutile', 'gel', 'urn', 'spa', 'evaders',
'nosy', 'gigot', 'amis', 'brisk', 'kit', 'ottar', 'crumb', 'reman', 'phoebus',
'mown', 'gyms', 'crimine', 'poo', 'veg', 'avo', 'infauna', 'ilk', 'sun', 'dig',
'phi', 'alanine', 'oft', 'tag', 'rat', 'eyeless', 'mots', 'watt', 'stopgap',
'clack', 'lulls']
solution time = 1.05939997405
s.crumb.reman.bw.r...r.i...o.ra.a.phoebus.s.imown.e.l.r.gymsi...crimine...k..p
oo.n.n.veg..avo.infauna.ilks.sun.l.t.dig.iphi.alanine.oft..tag.m.l.rat..i...ey
eless...omots.e.o.p.watti.o.stopgap.m.td.r...a.e...i.ae.clack.lulls.r

The code

This takes the hex string output from the pattern generation step as an argument. It uses a wordlist called 'fewerwords.txt'. Each time it is run it should give different results because the word indexes are shuffled each time. Sometimes it will take a long time to find a solution, I suggest just killing it and running it again. Most of the time this will get a new starting point that allows it to be solved very quickly, though some patterns are just harder to solve than others.


from __future__ import print_function
from constraint import *
from timeit import default_timer as timer
from itertools import combinations
from math import sqrt
from test_bin2hex import hex2bin
import sys

''' to run from pypy
pypy cw_fill.py 3EF9C51EBFAFAAF8FE37577BFB9D5CEFEF75763F8FAAFAFEBC51CFBEg
'''
# this program will take a crossword pattern and try to fill it with words

if len(sys.argv) < 2: print('supply pattern as argument')
else:
pattern = hex2bin(sys.argv[1])
pattern = [int(i) for i in pattern]

s = int(sqrt(len(pattern)))

print('building word lists...')
# extract the word positions from the pattern
wordlocs = []
hwordlocs = []
vwordlocs = []
for i in range(s):
hword = []
vword = []
for j in range(s):
if pattern[i*s + j] == 0: print(' ',end=' ')
else: print('X',end=' ')
if pattern[i*s + j] == 1:
hword.append(i*s + j)
else:
if len(hword) > 1: hwordlocs.append(hword)
hword = []

if pattern[j*s + i] == 1:
vword.append(j*s + i)
else:
if len(vword) > 1: vwordlocs.append(vword)
vword = []

if j == s-1:
if len(vword) > 1 and vword not in vwordlocs: vwordlocs.append(vword)
if len(hword) > 1 and hword not in hwordlocs: hwordlocs.append(hword)

print('')
wordlocs.extend(vwordlocs)
wordlocs.extend(hwordlocs)

# now that we have the words, add constraints so each word is in the dictionary
maxL = 10

wordlist = []
f = open("fewwords.txt")
for line in f:
w = line.split()[0].lower()
wordlist.append(w)

wordlen = []
olen = 0
for c,word in enumerate(wordlist):
if olen < len(word):
wordlen.append(c)
olen = len(word)

problem = Problem()
from random import shuffle

for c,w in enumerate(wordlocs):
L = len(w)-3
r = list(range(wordlen[L],wordlen[L+1]))
shuffle(r)
problem.addVariables([c], list(r))

problem.addConstraint(AllDifferentConstraint())

def eq(a,b):
def temp(c1,c2):
word1 = wordlist[c1]
word2 = wordlist[c2]
if word1[a]==word2[b]: return True
return False
return temp

for c1,word1 in enumerate(wordlocs):
for c2,word2 in enumerate(wordlocs):
if word1==word2: continue
intsc = list(set(word1).intersection(set(word2)))
if len(intsc) > 0:
a = word1.index(intsc[0])
b = word2.index(intsc[0])
problem.addConstraint(eq(a,b),[c1,c2])

print('starting...')

def getwords(solution):
listq = []
for locs in wordlocs:
word = ''.join([solution[loc] for loc in locs])
listq.append(word)
return listq

# pretty print the output
start = timer()
b = ''
for count,a in enumerate(problem.getSolutionIter()):
print('count: '+str(count))
sol = ['.' for i in range(s*s)]
for key,val in a.iteritems():
for c,char in enumerate(wordlist[val]):
sol[wordlocs[key][c]] = char
b=sol
for i in range(s):
for j in range(s):
print(sol[i*s+j],end=' ')
print('')
print('')
# get the words we have used
zz = getwords(sol)
print(zz)
break
end = timer()
print("solution time = " + str(end-start))
print(''.join([str(b[i]) for i in range(s*s)]))

2 comments:

  1. I want a solution for some problem can you help ?

    ReplyDelete
  2. James, I am learning CSP from your code. Why is it that if I switch your game2.txt word list to another extensive word list, the code errors out or does not produce a solution?
    Also, if I have existing words as part of the grid, how does that affect the constraints and variables? Thanks!

    ReplyDelete