Pages

Sunday, 27 May 2018

Getting Apache Drill working on Windows 10

Apache Drill is a program that allows you to query many data sources from a single query using SQL. It supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files.

I basically want to be able to query CSV files, sqlite files, parquet files and other RDBMS using SQL with a unified interface from python. I am looking into Apache Drill, hopefully it will solve all my problems, I'll just be using it in local mode.

Download apache drill from here: https://drill.apache.org/download/. I did the direct file download, I got version 1.13.0.

To get drill working, you'll also need Oracle JDK v8 from here: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. I got version jdk-8u171-windows-x64.exe

Use 7zip to unzip the tar.gz file to some directory. From the command prompt, cd to the directory, it should have the following contents:

bin
conf
git.properties
jars
KEYS
LICENSE
NOTICE
README.md
sample-data
winutils

cd to the bin folder and try to run

sqlline -u jdbc:drill:zk=local
The first error I got was: "WARN: JAVA_HOME not found in your environment. Please set the JAVA_HOME variable in your environment to match the location of your Java installation". This is pretty easy to fix, make sure java JDK is installed and you have an environment variable called JAVA_HOME (see here for how to create one) that has the value "C:\Program Files\Java\jdk1.8.0_171" (assuming that is where you installed it).

After doing that, the next error I got when running sqlline -u jdbc:drill:zk=local was: Error: Failure in connecting to Drill: org.apache.drill.exec.rpc.RpcException: Failure setting up ZK for client. (state=,code=0) plus a whole lot more java error stuff. Then no matter what I typed it simply said No current connection. To fix this one, you have to put quotes around the argument passed to sqlline like this:

sqlline -u "jdbc:drill:zk=local"

Note that to exit sqlline I found Ctrl d to be the easiest, this simulates an end of file character and the program stops gracefully. Now I can run commands like:

USE cp; 
SELECT  employee_id, first_name FROM `employee.json`; 

Connecting to sqlite databases from apache drill

Of course you can connect to sqlite databases directly using the sqlite3 console program, but I would like to connect to sqlite dbs, csvs, parquet files and other RDBMS all from the same query. To do that I need the JDBC drivers for sqlite, which are available here: https://bitbucket.org/xerial/sqlite-jdbc/downloads/. I got version sqlite-jdbc-3.21.0.jar. This has to be copied to the 3rdparty folder in the apache drill install directory: C:\Users\james\Downloads\apache-drill-1.13.0\jars\3rdparty on my system. Make sure you close and reopen sqlline/drill before trying to use the driver.

Unfortunately, we can't access sqlite dbs in the same way as files, we have to create a new storage plugin for each sqlite database we have. I downloaded the sample database chinook.db from here: http://www.sqlitetutorial.net/sqlite-sample-database/. Extract the database chinook.db somewhere and remember where it is. I got the list of tables by opening the db in sqlitestudio.

Now we need to open the drill web console by navigating to http://localhost:8047/storage/ (make sure sqlline or drill is running in the background). We need to create a new storage plugin, we'll call it chinook. Down the bottom of the page there is a text entry, put chinook in there and and press the create new storage plugin button. You'll be met with a text entry field that is expecting JSON. Enter the following, but replace the path that I have entered with the path to your database file:

{
  "type": "jdbc",
  "driver": "org.sqlite.JDBC",
  "url": "jdbc:sqlite:C:\\Users\\james\\Downloads\\chinook.db",
  "username": null,
  "password": null,
  "enabled": true
}

Now we can query the sqlite database using drill (you may need to close and restart sqlline/drill):

SELECT * FROM chinook.customers WHERE lastname like 'g%';
+-------------+------------+------------+---------------------------------------------------+--------------------------+
| CustomerId  | FirstName  |  LastName  |                      Company                      |                Address   |
+-------------+------------+------------+---------------------------------------------------+--------------------------+
| 1           | Luφs       | Gonτalves  | Embraer - Empresa Brasileira de Aeronßutica S.A.  | Av. Brigadeiro Faria Lim |
| 7           | Astrid     | Gruber     | null                                              | Rotenturmstra▀e 4, 1010  |
| 19          | Tim        | Goyer      | Apple Inc.                                        | 1 Infinite Loop          |
| 23          | John       | Gordon     | null                                              | 69 Salem Street          |
| 27          | Patrick    | Gray       | null                                              | 1033 N Park Ave          |
| 42          | Wyatt      | Girard     | null                                              | 9, Place Louis Barthou   |
| 56          | Diego      | GutiΘrrez  | null                                              | 307 Macacha Gⁿemes       |
+-------------+------------+------------+---------------------------------------------------+--------------------------+
7 rows selected (0.185 seconds)

Unfortunately, not all tables could be read. If I try to open the employees table, i get:

 SELECT * FROM chinook.employees;
Error: DATA_READ ERROR: Failure while attempting to read from database.

sql SELECT *
FROM employees
plugin sqlite
Fragment 0:0

[Error Id: ac42e2eb-a8ac-478e-b5bd-de663f7bc93e on DESKTOP:31010] (state=,code=0)

The sqlite JDBC driver doesn't seem to like the datetime fields, use something like sqlitestudio to change the datatype of the fields to strings, then everything will work. You'll just have to convert the string dates to actual date objects in code later on.

Using Apache drill from python

I think the easiest way of doing this is to use pydrill (run pip install pydrill to get it). This however uses the rest api, which is likely not the most efficient. Using the ODBC drivers and pyodbc will likely be better, but I have not tested it yet.

from pydrill.client import PyDrill

drill = PyDrill(host='localhost', port=8047)

if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')

res= drill.query('SELECT * FROM chinook.customers LIMIT 5')

for result in res:
    print(result)

# pandas dataframe
df = res.to_dataframe()
print(df)

Tuesday, 31 January 2017

Generating Gray codes with python

Gray codes are often used in positioning systems, see the wiki article for more details. I want to generate gray codes with python, and I'd like to generate all possible gray codes for a specific number of positions and number of bits. The maximum number of positions for a gray code is 2^numbits i.e. a 4 bit gray code can have at most 16 positions. It can also have fewer. If you want the gray code to be circular i.e. the first and last code have hamming distance 1, then there needs to be an even number of positions. Odd is fine if you don't care if it is circular.

There are of course algorithms for generating gray codes, but a constraint solver works really well, and also allows you to generate all possible gray codes for a specific number of positions. The python code below is for generating gray codes using python-constraint:

from constraint import *
from itertools import product

numbits = 4
numpos = 16

# this generates a list of all binary vectors of numbits bits
vecs = list(product(*(((1,0),)*numbits)))

problem = Problem()
# we will have numpos variables, i.e. one var for each code,
# each var is an index into vecs, has 2^numbits possible vals
problem.addVariables(list(range(numpos)), list(range(len(vecs))))

# vecs holds our binary vectors. 
# we want True if hamming distance between vecs[a] and vecs[b] is 1
# False otherwise. This will be our constraint. 
def hammingDist1(a,b):
    hamming = sum([vecs[a][i]^vecs[b][i] for i in range(numbits)])
    if hamming == 1: return True
    return False

# add the important constraints
for i in range(numpos):
    j = (i+1)%numpos
    problem.addConstraint(hammingDist1,[i,j])

# make sure all the vectors are different    
problem.addConstraint(AllDifferentConstraint())

# print out the first solution we come across
sol = problem.getSolution()
for i,j in enumerate(sol):
    print(vecs[sol[i]])
    

The main thing to note is that we add constraints so that each code has a hamming distance of 1 with its neighbour, and also that all the codes are diferent.

An example run looks like this (for a 4-bit, 16 position code):

C:\Users\james\Desktop>python gray.py
(0, 0, 0, 0)
(0, 0, 0, 1)
(0, 0, 1, 1)
(0, 0, 1, 0)
(0, 1, 1, 0)
(0, 1, 1, 1)
(0, 1, 0, 1)
(1, 1, 0, 1)
(1, 0, 0, 1)
(1, 0, 0, 0)
(1, 0, 1, 0)
(1, 0, 1, 1)
(1, 1, 1, 1)
(1, 1, 1, 0)
(1, 1, 0, 0)
(0, 1, 0, 0)

Monday, 16 January 2017

Latex Dictionary Template

I recently needed a latex dictionary template, there are a couple around on the internet, but they are all under non-commercial licences so I couldn't use them. Fortunately it is pretty easy to make one yourself, I have provided one here. Consider it to be under the wtfpl licence.

A short example of what the definitions look like:

%Copyright 2017 James Lyons
%This work is free. You can redistribute it and/or modify it under the
%terms of the Do What The Fuck You Want To Public License, Version 2,
%as published by Sam Hocevar. See http://www.wtfpl.net/ for more details.
\documentclass[twoside]{book}
\usepackage[utf8]{inputenc}
\usepackage{multicol}
\usepackage{fancyhdr} 
\usepackage[bindingoffset=0.2in,margin=0.75in,paperwidth=5.25in,paperheight=8in]{geometry}

% set the headers so that first mark is always on left, second on right
% this is so that each word is put in the header correctly 
\pagestyle{fancy} 
\fancyhead[LO,LE]{\rightmark}
\fancyhead[RE,RO]{\leftmark}
% set the font
\renewcommand*\rmdefault{bch}
\newcommand{\comment}[1]{}
% remove chapter numbering
\usepackage{titlesec}
\titleformat{\chapter}
  {\Large\bfseries} % format
  {}                % label
  {0pt}             % sep
  {\huge}           % before-code
% remove section numbering
\setcounter{secnumdepth}{0}
% this is how we define the actual dictionary items. markboth is for 
% setting the page headers so that first and last words appear there.
\newcommand{\ditem}[1]{\item[#1] \markboth{\footnotesize \textbf{#1}}{\footnotesize \textbf{#1}}}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%% title page
\begin{center}

\thispagestyle{empty}
\vbox{\vspace{5cm}
{\huge The Dictionary Title}}
\vfill
{\large Author}
\vfill

\end{center}
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%% table of contents
\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%% Preface
\chapter{Preface}

This is the preface.

\chapter{The Dictionary}
% we'd like the text size to be a bit smaller for the definitions
\footnotesize
% we'd like two columns of definitions
\begin{multicols}{2}

\begin{description}
\section{A}
\ditem{Abandon} \textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abandoned} \textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2
\ditem{Abide} \textit{-v.} definition 1 \textit{-n.} definition 2
\end{description}

\end{multicols}
 
\end{document}

Sunday, 15 January 2017

Creating crosswords with python-constraint [part 2]

In the previous post I was generating the patterns of white and black squares. In this post I'd like to actually fill the patterns with words so that everything matches up. This sort of problem is very well suited to a constraint satisfaction solver, as it can try many possibilities until something is found that works. The trick is specifying the constraints in an efficient manner. My first solution was to implement it the same way as the magic word squares, but this was really inefficient and there is a much better way.

I put the code for this on github, it's a bit messy but works: https://github.com/jameslyons/crossword_gen.

The trick is you don't need to represent all the different squares in the crossword as separate variables and put the constraints on those, your actual variables will be indexes into the word list, and the constraints will be whether e.g. character 4 of word 1 is the same as character 2 of word 7. These constraints will obviously be different for every crossword and will have to be generated each time, but this isn't too hard.

The inputs from the previous step are hex strings e.g.

3EF9C51EBFAFAAF8FE37577BFB9D5CEFEF75763F8FAAFAFEBC51CFBEg

This can be converted to a binary string by replacing e.g. 3 with '0011', same as decoding hex anywhere. The function I use is down the bottom of the previous post. This gives a binary string that can be represented as the white and black squares of the crossword.

Example run

The code takes as argument the hex representation of the pattern. It first prints a representation of the crossword with 'X's where the letters go, then once it gets solved it outputs the solved crossword, along with the list of words that got used. The dictionary I am using has some weird words in it, so you need to pay some attention to that if you want nice crosswords (I recommend using something like the 3of6game.txt from the 12dicts wordlist, it is specifically for games). The final thing that gets printed is the crossword represented as a string, this is passed to some typesetting code I have that produces nice pdfs of the crosswords.

>pypy cw_fill.py BEFB4516BFAFAAF8FE27573BFBDD5DEFEE75723F8FAAFAFEB4516FBEh
building word lists...
X   X X X X X   X X X X X   X
X   X       X   X       X   X
X   X   X X X X X X X   X   X
X X X X   X   X   X   X X X X
X       X X X X X X X       X
    X X X   X   X   X X X
X X X   X X X X X X X   X X X
X   X X X   X   X   X X X   X
X X X   X X X X X X X   X X X
    X X X   X   X   X X X
X       X X X X X X X       X
X X X X   X   X   X   X X X X
X   X   X X X X X X X   X   X
X   X       X   X       X   X
X   X X X X X   X X X X X   X
starting...
count: 0
s . c r u m b . r e m a n . b
w . r . . . r . i . . . o . r
a . a . p h o e b u s . s . i
m o w n . e . l . r . g y m s
i . . . c r i m i n e . . . k
. . p o o . n . n . v e g . .
a v o . i n f a u n a . i l k
s . s u n . l . t . d i g . i
p h i . a l a n i n e . o f t
. . t a g . m . l . r a t . .
i . . . e y e l e s s . . . o
m o t s . e . o . p . w a t t
i . o . s t o p g a p . m . t
d . r . . . a . e . . . i . a
e . c l a c k . l u l l s . r

['swami', 'asp', 'imide', 'craw', 'posit', 'torc', 'coinage', 'her', 'yet', 'bro',
 'inflame', 'oak', 'elm', 'lop', 'rib', 'inutile', 'gel', 'urn', 'spa', 'evaders',
 'nosy', 'gigot', 'amis', 'brisk', 'kit', 'ottar', 'crumb', 'reman', 'phoebus',
 'mown', 'gyms', 'crimine', 'poo', 'veg', 'avo', 'infauna', 'ilk', 'sun', 'dig',
 'phi', 'alanine', 'oft', 'tag', 'rat', 'eyeless', 'mots', 'watt', 'stopgap',
 'clack', 'lulls']
solution time = 1.05939997405
s.crumb.reman.bw.r...r.i...o.ra.a.phoebus.s.imown.e.l.r.gymsi...crimine...k..p
oo.n.n.veg..avo.infauna.ilks.sun.l.t.dig.iphi.alanine.oft..tag.m.l.rat..i...ey
eless...omots.e.o.p.watti.o.stopgap.m.td.r...a.e...i.ae.clack.lulls.r

The code

This takes the hex string output from the pattern generation step as an argument. It uses a wordlist called 'fewerwords.txt'. Each time it is run it should give different results because the word indexes are shuffled each time. Sometimes it will take a long time to find a solution, I suggest just killing it and running it again. Most of the time this will get a new starting point that allows it to be solved very quickly, though some patterns are just harder to solve than others.

from __future__ import print_function
from constraint import *
from timeit import default_timer as timer
from itertools import combinations
from math import sqrt
from test_bin2hex import hex2bin
import sys

''' to run from pypy
pypy cw_fill.py 3EF9C51EBFAFAAF8FE37577BFB9D5CEFEF75763F8FAAFAFEBC51CFBEg
'''
# this program will take a crossword pattern and try to fill it with words

if len(sys.argv) < 2: print('supply pattern as argument')
else:
    pattern = hex2bin(sys.argv[1]) 
    pattern = [int(i) for i in pattern]

s = int(sqrt(len(pattern)))

print('building word lists...')
# extract the word positions from the pattern
wordlocs = []
hwordlocs = []
vwordlocs = []
for i in range(s):
    hword =  []
    vword = []
    for j in range(s):
        if pattern[i*s + j] == 0: print(' ',end=' ')
        else: print('X',end=' ')
        if pattern[i*s + j] == 1:
            hword.append(i*s + j)
        else:
            if len(hword) > 1: hwordlocs.append(hword)
            hword = []         
        
        if pattern[j*s + i] == 1:
            vword.append(j*s + i)
        else:
            if len(vword) > 1: vwordlocs.append(vword)
            vword = []
        
        if j == s-1:
            if len(vword) > 1 and vword not in vwordlocs: vwordlocs.append(vword)
            if len(hword) > 1 and hword not in hwordlocs: hwordlocs.append(hword)
            
    print('')
wordlocs.extend(vwordlocs)
wordlocs.extend(hwordlocs)

# now that we have the words, add constraints so each word is in the dictionary
maxL = 10

wordlist = []
f = open("fewwords.txt")
for line in f:
    w = line.split()[0].lower()
    wordlist.append(w)

wordlen = []
olen = 0
for c,word in enumerate(wordlist):
    if olen < len(word):
        wordlen.append(c)
    olen = len(word)

problem = Problem()
from random import shuffle

for c,w in enumerate(wordlocs):
    L = len(w)-3
    r = list(range(wordlen[L],wordlen[L+1]))
    shuffle(r)
    problem.addVariables([c], list(r))

problem.addConstraint(AllDifferentConstraint())

def eq(a,b):
    def temp(c1,c2):
        word1 = wordlist[c1]
        word2 = wordlist[c2]
        if word1[a]==word2[b]: return True
        return False
    return temp
        
for c1,word1 in enumerate(wordlocs):
    for c2,word2 in enumerate(wordlocs):
        if word1==word2: continue
        intsc = list(set(word1).intersection(set(word2)))
        if len(intsc) > 0:
            a = word1.index(intsc[0])
            b = word2.index(intsc[0])
            problem.addConstraint(eq(a,b),[c1,c2])    
        
print('starting...')
        
def getwords(solution):
    listq = []
    for locs in wordlocs:
        word = ''.join([solution[loc] for loc in locs])
        listq.append(word)        
    return listq
              
# pretty print the output
start = timer()
b = ''
for count,a in enumerate(problem.getSolutionIter()):  
    print('count: '+str(count))
    sol = ['.' for i in range(s*s)]
    for key,val in a.iteritems():
        for c,char in enumerate(wordlist[val]):
            sol[wordlocs[key][c]] = char      
    b=sol
    for i in range(s):
        for j in range(s): 
            print(sol[i*s+j],end=' ')
        print('')
    print('')  
    # get the words we have used
    zz = getwords(sol)
    print(zz)
    break
end = timer()
print("solution time = " + str(end-start))
print(''.join([str(b[i]) for i in range(s*s)]))

Creating Crosswords with python-constraint [part 1]

Intro

I put the code for this on github, it's a bit messy but works: https://github.com/jameslyons/crossword_gen.

In previous posts I have been playing with python-constraint, and building crosswords is something that interests me for some reason. I decided to do it in a two step process: step 1 generate the patterns of white and black squares, step 2 fill the patterns with letters. This post will deal with step 1 only, the next post will deal with step 2.

Generating the patterns is well suited to using a constraint solver since you can just specify the things you'd like to be true of the final result, then ask the constraint solver to do it for you. In this case we have e.g. 15 by 15 binary variables, we choose the constraints so that the final pattern looks like a crossword puzzle.

What do pretty patterns look like

I am interested in automatically generating patterns that look like nice crossword patterns e.g.

Some of the things I'd like to be true of the patterns I generate:

  1. I'd like 4-fold symmetry, because I think it looks good.
  2. No 2x2 blocks of white squares.
  3. No 2x2 blocks of black squares.
  4. No length 1 words.
  5. No length 2 words.
  6. No words longer than MAXLEN.
  7. No sequences of black squares longer than 3.
  8. White squares must be fully connected.

Some of these conditions are pretty easy to implement with python-constraint, the fully connected constraint I can just apply to the final solutions and discard them at that point. By fully connected I mean you can get from one square to every other square in the pattern, so you can't get disjoint islands.

The implementation

Examples of what the program outputs:

3EF9C51EBFAFAAF8FE37577BFB9D5CEFEF75763F8FAAFAFEBC51CFBEg
X X . . . . . X . . . . . X X
. . . X X X . X . X X X . . .
. X . X . . . . . . . X . X .
. . . . X . X . X . X . . . .
. X X X . . . . . . . X X X .
. X . . . X . X . X . . . X .
. . . X . . . . . . . X . . .
X X . . . X . X . X . . . X X
. . . X . . . . . . . X . . .
. X . . . X . X . X . . . X .
. X X X . . . . . . . X X X .
. . . . X . X . X . X . . . .
. X . X . . . . . . . X . X .
. . . X X X . X . X X X . . .
X X . . . . . X . . . . . X X

The hex string is the binary to hex encoding of the pattern, 'X's are zeros, '.'s are ones. The hex string will be passed to step 2 which will fill in the words using a wordlist.

It's pretty easy to generate a few million example patterns with this program (if you are willing to wait a day), which is what I suggest doing. Once you have a heap of patterns you can further filter them by whatever criteria you want e.g. no more than X 3-letter words, no more than X% black squares etc. Once you have the filtered list, you can try filling them with words.

from __future__ import print_function
from constraint import *
from itertools import combinations
from math import sqrt,floor

# this program will create crossword patterns, but not fill it with words
# i.e. just patterns of zeros and ones for the black and white squares

# tests if a solution is fully connected i.e. can every point be reached from
# any random point. True if fullyconnected, false if disjoint parts.
# Could easily be made more efficient, but this works fine.
def isfullyconnected(solution):
    L = int(sqrt(len(solution)))
    # get list of all non-zero points
    plist = []
    for i in range(L):
        for j in range(L):
            if solution[i*L + j]: plist.append((i,j))
    
    # start top right corner, try to get to every other point
    reached = {plist[0]:1}
    rlen = len(reached)
    oldrlen = 0
    while rlen != oldrlen:
      temp = list(reached.keys())
      for i,j in temp:
        if (i+1,j) in plist: reached[(i+1,j)]=1
        if (i-1,j) in plist: reached[(i-1,j)]=1
        if (i,j+1) in plist: reached[(i,j+1)]=1
        if (i,j-1) in plist: reached[(i,j-1)]=1
      oldrlen = rlen
      rlen = len(reached)
    # if rlen is not eq number of all points, we couldn't reach some points
    if rlen < len(plist): return False
    return True    
    
from random import randint
    
def getPattern(sidelen=15,maxlen=9):
    problem = Problem()
    s = sidelen
    maxWordLen = maxlen
    problem.addVariables(list(range(s*s)), [0,1])
  
    # ensure there are no 2x2 blocks of ones or zeros
    def no4x4block(*x):
        if sum(x)==0: return False
        if sum(x)==len(x): return False
        else: return True

    for i in range(s-1):
        for j in range(s-1):
            problem.addConstraint(no4x4block,[i*s+j,(i*s+j)+1,(i*s+j)+s,(i*s+j)+s+1])
    
    #ensure matrix is symmetric  
    def eq(a,b):  
        return a==b

    for i in range(s):
        for j in range(s):
            problem.addConstraint(eq,[i*s+j,(s-i-1)*s+j])
            problem.addConstraint(eq,[i*s+j,i*s+(s-j-1)])
            problem.addConstraint(eq,[i*s+j,(s-i-1)*s+(s-j-1)])
       
    #ensure no words longer than maxWordLen
    def maxlen(*x):
        if sum(x) == len(x): return False
        else: return True

    seq = list(range(maxWordLen+1))
    for i in range(s - maxWordLen):
        for j in range(s):
            problem.addConstraint(maxlen,[(k+i)+j*s for k in seq])
            problem.addConstraint(maxlen,[(k+i)*s+j for k in seq])
    
    # no two letter words
    def no2letter(*x):
        if x == (0,1,1,0): return False
        else: return True
    
    seq = list(range(4))
    for i in range(s - 3):
        for j in range(s):
            problem.addConstraint(no2letter,[(k+i)+j*s for k in seq])
            problem.addConstraint(no2letter,[(k+i)*s+j for k in seq])

    # no two letter words around the edge either
    def no2letterEdge(*x):
        if x == (1,1,0): return False
        else: return True
    
    seq = list(range(4))
    for i in range(s):
        problem.addConstraint(no2letterEdge,[i,i+s,i+2*s])
        problem.addConstraint(no2letterEdge,[i*s,i*s+1,i*s+2])
        problem.addConstraint(no2letterEdge,[i*s+s-1,i*s+s-2,i*s+s-3])
        problem.addConstraint(no2letterEdge,[i+s*(s-1),i+s*(s-2),i+s*(s-3)])
    
    # no one letter squares
    def no1letter(*x):    
        if sum(x) == 0: return True
        if sum(x[1:]) == 0: return False
        return True
    
    seq = list(range(4))
    for i in range(s):
      for j in range(s):
        vals = [i*s+j]
        if i-1 >= 0: vals.append((i-1)*s+j) 
        if i+1 < s: vals.append((i+1)*s+j) 
        if j-1 >= 0: vals.append(i*s+(j-1)) 
        if j+1 < s: vals.append(i*s+(j+1)) 
        problem.addConstraint(no1letter,vals)
 
    # no more than three 0's in a row
    def nothreeblank(*x):
        if sum(x) == 0: return False
        return True
    
    seq = list(range(4))
    for i in range(s - 3):
        for j in range(s):
            problem.addConstraint(nothreeblank,[(k+i)+j*s for k in seq])
            problem.addConstraint(nothreeblank,[(k+i)*s+j for k in seq])
   
    for sol in problem.getSolutionIter():  
        if not isfullyconnected(sol): continue
        yield(sol)
    
from test_bin2hex import bin2hex
import sys

print(len(sys.argv))
if len(sys.argv)<2: sidelen = 11
else: sidelen = int(sys.argv[1])
print('using sidelen',sidelen)

f = open('pattern'+str(sidelen)+'.txt','w')
for count,sol in enumerate(getPattern(sidelen=sidelen,maxlen=8)):
    f.write(bin2hex(''.join([str(sol[a]) for a in range(sidelen*sidelen)]))+'\n')
    if count % 100 == 0: 
      print('count: '+str(count))
      print(bin2hex(''.join([str(sol[a]) for a in range(sidelen*sidelen)])))
      for i in range(sidelen):
        for j in range(sidelen): 
            if sol[i*sidelen+j]==1: print('.',end=' ')
            else: print('X',end=' ')
        print('')
      print('')
    #break

The hex string is generated using the following python script (test_bin2hex.py). The extra letters are so we can encode binary strings that aren't necessarily divisible by 4.


bin = ['0000','0001','0010','0011','0100','0101','0110','0111','1000','1001','1010','1011','1100','1101','1110','1111',
       '0','1','00','01','10','11','000','001','010','011','100','101','110','111']
hex = "0123456789ABCDEFghijklmnopqrst"

def bin2hex(binstr):
    map = dict(zip(bin,hex))
    ret = ''
    for i in range(0,len(binstr),4):
        sub = binstr[i:i+4]
        ret += map[sub]
    return ret
    
def hex2bin(hexstr):
    map = dict(zip(hex,bin))
    ret = ''
    for i in hexstr:
        ret += map[i]
    return ret

Monday, 2 January 2017

Magic Word Squares with python-constraint

This post will deal with the seemingly boring topic of magic word squares - I define these to be squares of letters where each row is a dictionary word, and each column is also a dictionary word. In my wordlist, there are 3,641 4-letter words (I talk about how I generated the wordlist below). How many possible 4 by 4 word squares do you think there are? Would you believe 90.6 million? I was rather surprised to discover there were so many. What about 8 by 8 word squares (given there are 12,317 8 letter words in the wordlist)? would you believe 0? It turns out it gets much harder to find word squares as you increase the side length.

I'll display some of the word squares and other stats before I display the python code I used to generate all these things further down.

Some example word squares

a z o
n o w
t o n
f u z z
a l o e
g a r s    
s n i t
f r i z z
l a r e e
o d o r s
p i n o t
s i s s y
z y g o t e
y e l p e d
g l u i n g
o p i a t e
t e n t e r
e d g e r s
e s c a p e s
s e a l a n t
c a s e t t e
a l e r t e r
p a t t e r n
e n t e r a l
s t e r n l y

How many word squares are there?

I counted how many word squares there are for each side length. I was suprised how many there are for 4x4 squares, but then there are no 8x8 squares or higher. The No. of words is how many words of that particular length occur in the wordlist.

Side LengthNo. of wordsNo. of squares
31,0142,607,858
43,64190,669,454
56,68310,991,077
69,55968,518
711,96098
812,3170
911,1810
108,760didn't check
116,296didn't check
124,082didn't check
132,493didn't check
141,326didn't check
15720didn't check

Building the wordlist

I identified two wordlists which I might use: the scrabble dictionary, and googles billion word corpus. Unfortunately, the scrabble dictionary has so many weird words I've never heard of before, and google's dataset has a heap of misspellings and gibberish. To get my wordlist I found the intersection of the two, with the hope being that the result will be just reasonably common proper English words.

The code

I used python-constraint to solve the squares. A basic (very slow) implementation is shown below

from constraint import *
problem = Problem()
s = 6
problem.addVariables(range(s*s), list("abcdefghijklmnopqrstuvwxyz"))

# build the wordlist from the wordlist text file
wordlist = {}
f = open("fewer_words.txt")
for line in f:
    w = line.split()[0].lower()
    wordlist[w] = 1

# we need a function: true if word is in wordlist, false if not
# this is characterwise checking, so a sequence of chars is provided, 
# we need to join them into a string before checking if they are in the list
def inlist(*word): 
    w = ''.join(word)
    if w in wordlist: return True
    return False
   
# add row and col constraints
for i in range(0,s): problem.addConstraint(inlist,range(s*i,s*i+s))
for i in range(0,s): problem.addConstraint(inlist,range(i,s*s,s))

# pretty print the output
a = problem.getSolution()
for i in range(s):
    for j in range(s): print a[i*s+j],
    print('')

It can be sped up a lot by adding way more constraints on what bigrams, trigrams etc are possible. I apologise for the scruffy code, I was more concerned with getting it to work efficiently than making it look nice

from constraint import *
from timeit import default_timer as timer
from itertools import combinations

problem = Problem()
s = 3
problem.addVariables(list(range(s*s)), list("abcdefghijklmnopqrstuvwxyz"))

wordlist = {}
f = open("fewer_words.txt")
for line in f:
    w = line.split()[0].lower()
    if len(w) != s: continue 
    wordlist[w] = 1

# build a list of monograms in each position
mono = []
for j in range(s): mono.append({})
for i in wordlist.keys():
    for j in range(s):
        mono[j][i[j]] = 1
    
# build a list of bigrams at each pair of positions
bgpos = list(combinations(list(range(s)),2))
bg = {}
for j in bgpos: bg[j] = {}
for i in wordlist.keys():
    for j in bgpos:
        bg[j][i[j[0]]+i[j[1]]] = 1
        
tripos = list(combinations(list(range(s)),3))
tri = {}
for j in tripos: tri[j] = {}
for i in wordlist.keys():
    for j in tripos:
        tri[j][i[j[0]]+i[j[1]]+i[j[2]]] = 1

qpos = list(combinations(list(range(s)),4))
q = {}
for j in qpos: q[j] = {}
for i in wordlist.keys():
    for j in qpos:
        q[j][i[j[0]]+i[j[1]]+i[j[2]]+i[j[3]]] = 1

fpos = list(combinations(list(range(s)),5))
f = {}
for j in fpos: f[j] = {}
for i in wordlist.keys():
    for j in fpos:
        f[j][i[j[0]]+i[j[1]]+i[j[2]]+i[j[3]]+i[j[4]]] = 1

def opposite():
    return lambda x,y: x==y
    
# we need a function: true if word is in wordlist, false if not
def inlist(*word): 
    w = ''.join(word)
    if w in wordlist: return True
    return False
   
def inmono(i):
    return lambda x: x in mono[i]
    
def inbg(i,j):
    return lambda *x: ''.join(x) in bg[(i,j)]   
   
def intri(i,j,k):
    return lambda *x: ''.join(x) in tri[(i,j,k)]   

def inq(i,j,k,l):
    return lambda *x: ''.join(x) in q[(i,j,k,l)]   

def inf(i,j,k,l,m):
    return lambda *x: ''.join(x) in f[(i,j,k,l,m)]   
    
# add row and col constraints
for i in range(0,s): problem.addConstraint(inlist,list(range(s*i,s*i+s)))
for i in range(0,s): problem.addConstraint(inlist,list(range(i,s*s,s)))

# add row monogram constraints
for j in range(s):
    for i in range(s):
        problem.addConstraint(inmono(i),[j*s +i])
        problem.addConstraint(inmono(i),[i*s +j])

#print(bg.keys())
for j in range(s):
    for a,b in bgpos:
        problem.addConstraint(inbg(a,b),[j*s +a,j*s +b])
        problem.addConstraint(inbg(a,b),[a*s +j,b*s +j])

for j in range(s):
    for a,b,c in tripos:
        problem.addConstraint(intri(a,b,c),[j*s +a,j*s +b,j*s +c])
        problem.addConstraint(intri(a,b,c),[a*s +j,b*s +j,c*s +j])

for j in range(s):
    for a,b,c,d in qpos:
        problem.addConstraint(inq(a,b,c,d),[j*s +a,j*s +b,j*s +c,j*s +d])
        problem.addConstraint(inq(a,b,c,d),[a*s +j,b*s +j,c*s +j,d*s +j])

for j in range(s):
    for a,b,c,d,e in fpos:
        problem.addConstraint(inf(a,b,c,d,e),[j*s +a,j*s +b,j*s +c,j*s +d,j*s +e])
        problem.addConstraint(inf(a,b,c,d,e),[a*s +j,b*s +j,c*s +j,d*s +j,e*s +j])

# pretty print the output
start = timer()
for count,a in enumerate(problem.getSolutionIter()):
  if count % 1000 == 0: print 'count:',count   
  for i in range(s):
    for j in range(s): print a[i*s+j],
    print('')
  print('')  
end = timer()

print("solution time = " + str(end-start))
print count

for i in range(s):
    for j in range(s): print a[i*s+j],
    print ''
print ''  

Tuesday, 22 November 2016

Constraint Programming with python-constraint

I have recently been looking at constraint programming and I thought I would post some of my code as I go. This is using the python-constraint package.

Doing things with python-constraint is pretty easy. To constrain a variable to have a certain value, we can use InSetConstraint. To make sure all the variables in a set are different, we use the AllDifferentConstraint. These are enough to solve sudoku.

Sudoku

I don't think Sudoku needs any introduction. The constraints are fairly easy to specify: each row must be all different, each col must be all different and each 3x3 block must be all different. Also we have to ensure that the given starting state is used, for this we use equality constraints. We use 1 variable for each number in the sudoku square, so 81 variables all up.

from constraint import *
problem = Problem()
problem.addVariables(range(9*9), range(1,10))

# specify the given starting sudoku problem here, dots are blank squares
sudoku = '\
.42986.1.\
....2...5\
......82.\
4.1..793.\
.683.254.\
.758..1.2\
.39......\
7...4....\
.1.67325.'

# Make sure our solution matches the specified starting thing
for i,char in enumerate(sudoku):
    if char != '.':  problem.addConstraint(InSetConstraint([int(char)]),[i])
    
# add row and col constraints
for i in range(9): 
    problem.addConstraint(AllDifferentConstraint(),range(9*i,9*i+9))
for i in range(9): 
    problem.addConstraint(AllDifferentConstraint(),range(i,81,9))
# add block constraints
for i in [0,3,6,27,30,33,54,57,60]: 
    ind = []
    for j in range(3):
        for k in range(3):
            ind.append(i + j*9 + k)
    problem.addConstraint(AllDifferentConstraint(),ind)

# pretty print the output
a = problem.getSolution()
for i in range(9):
    for j in range(9): print a[i*9+j],
    print ''

The Golomb Ruler

A bit of info on Golomb Rulers: here. If you know how to improve the code below, leave a comment. Golomb rulers are described by their order, which is the number of marks on their edge. The length of a Golomb ruler is the distance between the outer two marks and represents the longest distance it can measure.

There is no requirement that a Golomb ruler measures all distances up to their length. However, if a ruler does measure all distances, it is classified as a perfect Golomb ruler. There are no perfect Golomb rulers above order 4.

Finally, a Golomb ruler is described as optimal if no shorter ruler of the same order exists. The code below first looks for perfect rulers, then increments length each time until an optimal ruler is found. It starts to slow down by order = 7, it takes a few minutes.

This problem is harder than the sudoku problem, we have to use a set of auxiliary variables to represent the pairwise differences, then apply the AllDifferentConstraint to the auxiliary variables.

from constraint import *
from itertools import combinations

order = 7 # the number of marks on the Golomb ruler
diffs = [] # the set of pairwise differences between marks
for i in combinations(range(order),2): diffs.append(i)
length = len(diffs)

def lessthan(a,b): return a < b
def diffeq(a,b,res): return (b-a)==res     

while True:
    print 'trying length: '+str(length)
    problem = Problem()
    problem.addVariables(range(order), range(0,length+1))
    problem.addVariables(diffs,    range(1,length+1))

    # make sure the first mark is 0 and last is length
    problem.addConstraint(InSetConstraint([0]),[0])
    problem.addConstraint(InSetConstraint([length]),[order-1])

    # ensure that the marks are in increasing order
    # make sure all possible pairwise relations are constrained
    for i in diffs: problem.addConstraint(lessthan,[i[0],i[1]])

    # ensure that the differences reflect the marks
    for i in diffs: problem.addConstraint(diffeq,[i[0],i[1],i])
    
    # all the differences must be different    
    problem.addConstraint(AllDifferentConstraint(),diffs)
   
    # pretty print the output
    solutions = problem.getSolutions()
    for ruler in solutions:
        for mark in range(order): print ruler[mark],
        print ''
    if len(solutions) > 0: break # if we find some solutions, quit
    length = length + 1

example output for order = 7, note the shortest ruler is length 25, though a perfect ruler would be length 21:

trying length: 21
trying length: 22
trying length: 23
trying length: 24
trying length: 25
0 4 9 15 22 23 25
0 3 4 12 18 23 25
0 2 3 10 16 21 25
0 2 5 14 18 24 25
0 2 6 9 14 24 25
0 2 7 13 21 22 25
0 2 7 15 21 24 25
0 1 4 10 18 23 25
0 1 7 11 20 23 25
0 1 11 16 19 23 25

Magic Squares

Magic Squares seem to be very common examples when doing constraint programming, so I may as well include the code for them. Info on Magic Squares can be found here. This code is quite slow, I thought it would be much faster. This may be because python-constraint is just kind of slow, or maybe the problem is harder than I thought.

from constraint import *

problem = Problem()
size = 4
problem.addVariables(range(size*size), range(1,size*size+1))
total = size*(size*size + 1)/2

# make sure rows add to total
for i in range(size): 
    problem.addConstraint(ExactSumConstraint(total),range(i*size,i*size+size))
for i in range(size): 
    problem.addConstraint(ExactSumConstraint(total),range(i,size*size,size))
# add the diagonal constraints
diag1,diag2 = [],[]
for i in range(size):
    diag1.append(i*size + i)
    diag2.append(i*size + (size-i-1))

problem.addConstraint(ExactSumConstraint(total),diag1)
problem.addConstraint(ExactSumConstraint(total),diag2)
problem.addConstraint(AllDifferentConstraint())
   
# pretty print the output
solution = problem.getSolution()
for i in range(size):
    for j in range(size):
        print solution[i*size + j],
    print ''