tag:blogger.com,1999:blog-11474263036354038962024-11-01T05:10:30.335-07:00jimblogperfectly cromulent ramblingJames lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.comBlogger45125tag:blogger.com,1999:blog-1147426303635403896.post-47286417641252877932018-05-27T03:24:00.000-07:002019-02-23T02:17:04.596-08:00Getting Apache Drill working on Windows 10<p>Apache Drill is a program that allows you to query many data sources from a single query using SQL. It supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files.</p> <p>I basically want to be able to query CSV files, sqlite files, parquet files and other RDBMS using SQL with a unified interface from python. I am looking into Apache Drill, hopefully it will solve all my problems, I'll just be using it in local mode.</p> <p>Download apache drill from here: <a href="https://drill.apache.org/download/">https://drill.apache.org/download/</a>. I did the direct file download, I got version 1.13.0.</p> <p>To get drill working, you'll also need Oracle JDK v8 from here: <a href="http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html">http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html</a>. I got version <tt>jdk-8u171-windows-x64.exe</tt></p> <p>Use 7zip to unzip the tar.gz file to some directory. From the command prompt, cd to the directory, it should have the following contents:</p><pre>bin<br />conf<br />git.properties<br />jars<br />KEYS<br />LICENSE<br />NOTICE<br />README.md<br />sample-data<br />winutils</pre> <p>cd to the bin folder and try to run <pre>sqlline -u jdbc:drill:zk=local</pre> The first error I got was: "WARN: JAVA_HOME not found in your environment. Please set the JAVA_HOME variable in your environment to match the location of your Java installation". This is pretty easy to fix, make sure java JDK is installed and you have an environment variable called JAVA_HOME (see <a href="https://www.java.com/en/download/help/path.xml">here</a> for how to create one) that has the value "C:\Program Files\Java\jdk1.8.0_171" (assuming that is where you installed it).</p> <p>After doing that, the next error I got when running <tt>sqlline -u jdbc:drill:zk=local</tt> was: <tt>Error: Failure in connecting to Drill: org.apache.drill.exec.rpc.RpcException: Failure setting up ZK for client. (state=,code=0)</tt> plus a whole lot more java error stuff. Then no matter what I typed it simply said <tt>No current connection</tt>. To fix this one, you have to put quotes around the argument passed to sqlline like this:</p><pre>sqlline -u "jdbc:drill:zk=local"</pre></p> <p>Note that to exit sqlline I found <tt>Ctrl d</tt> to be the easiest, this simulates an end of file character and the program stops gracefully. Now I can run commands like:</p><pre>USE cp; <br />SELECT employee_id, first_name FROM `employee.json`; <br /></pre> <h1>Connecting to sqlite databases from apache drill</h1><p>Of course you can connect to sqlite databases directly using the sqlite3 console program, but I would like to connect to sqlite dbs, csvs, parquet files and other RDBMS all from the same query. To do that I need the JDBC drivers for sqlite, which are available here: <a href="https://bitbucket.org/xerial/sqlite-jdbc/downloads/">https://bitbucket.org/xerial/sqlite-jdbc/downloads/</a>. I got version <tt>sqlite-jdbc-3.21.0.jar</tt>. This has to be copied to the 3rdparty folder in the apache drill install directory: <tt>C:\Users\james\Downloads\apache-drill-1.13.0\jars\3rdparty</tt> on my system. Make sure you close and reopen sqlline/drill before trying to use the driver.<p> <p>Unfortunately, we can't access sqlite dbs in the same way as files, we have to create a new storage plugin for each sqlite database we have. I downloaded the sample database <tt>chinook.db</tt> from here: <a href="http://www.sqlitetutorial.net/sqlite-sample-database/">http://www.sqlitetutorial.net/sqlite-sample-database/</a>. Extract the database chinook.db somewhere and remember where it is. I got the list of tables by opening the db in <a href="https://sqlitestudio.pl/index.rvt">sqlitestudio</a>.</p> <p>Now we need to open the drill web console by navigating to <tt>http://localhost:8047/storage/</tt> (make sure sqlline or drill is running in the background). We need to create a new storage plugin, we'll call it chinook. Down the bottom of the page there is a text entry, put chinook in there and and press the create new storage plugin button. You'll be met with a text entry field that is expecting JSON. Enter the following, but replace the path that I have entered with the path to your database file:</p> <pre>{<br /> "type": "jdbc",<br /> "driver": "org.sqlite.JDBC",<br /> "url": "jdbc:sqlite:C:\\Users\\james\\Downloads\\chinook.db",<br /> "username": null,<br /> "password": null,<br /> "enabled": true<br />}</pre> <p>Now we can query the sqlite database using drill (you may need to close and restart sqlline/drill):</p><pre>SELECT * FROM chinook.customers WHERE lastname like 'g%';<br />+-------------+------------+------------+---------------------------------------------------+--------------------------+<br />| CustomerId | FirstName | LastName | Company | Address |<br />+-------------+------------+------------+---------------------------------------------------+--------------------------+<br />| 1 | Luφs | Gonτalves | Embraer - Empresa Brasileira de Aeronßutica S.A. | Av. Brigadeiro Faria Lim |<br />| 7 | Astrid | Gruber | null | Rotenturmstra▀e 4, 1010 |<br />| 19 | Tim | Goyer | Apple Inc. | 1 Infinite Loop |<br />| 23 | John | Gordon | null | 69 Salem Street |<br />| 27 | Patrick | Gray | null | 1033 N Park Ave |<br />| 42 | Wyatt | Girard | null | 9, Place Louis Barthou |<br />| 56 | Diego | GutiΘrrez | null | 307 Macacha Gⁿemes |<br />+-------------+------------+------------+---------------------------------------------------+--------------------------+<br />7 rows selected (0.185 seconds)</pre> <p>Unfortunately, not all tables could be read. If I try to open the employees table, i get: </p><pre> SELECT * FROM chinook.employees;<br />Error: DATA_READ ERROR: Failure while attempting to read from database.<br /><br />sql SELECT *<br />FROM employees<br />plugin sqlite<br />Fragment 0:0<br /><br />[Error Id: ac42e2eb-a8ac-478e-b5bd-de663f7bc93e on DESKTOP:31010] (state=,code=0)</pre> <p>The sqlite JDBC driver doesn't seem to like the datetime fields, use something like <a href="https://sqlitestudio.pl/index.rvt">sqlitestudio</a> to change the datatype of the fields to strings, then everything will work. You'll just have to convert the string dates to actual date objects in code later on.</p> <h1>Using Apache drill from python</h1><p>I think the easiest way of doing this is to use pydrill (run <tt>pip install pydrill</tt> to get it). This however uses the rest api, which is likely not the most efficient. Using the ODBC drivers and pyodbc will likely be better, but I have not tested it yet. </p> <pre>from pydrill.client import PyDrill<br /><br />drill = PyDrill(host='localhost', port=8047)<br /><br />if not drill.is_active():<br /> raise ImproperlyConfigured('Please run Drill first')<br /><br />res= drill.query('SELECT * FROM chinook.customers LIMIT 5')<br /><br />for result in res:<br /> print(result)<br /><br /># pandas dataframe<br />df = res.to_dataframe()<br />print(df)</pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com2tag:blogger.com,1999:blog-1147426303635403896.post-65121179926037874562017-01-31T04:27:00.000-08:002019-02-23T02:17:04.944-08:00Generating Gray codes with python<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script> <p>Gray codes are often used in positioning systems, see <a href="https://en.wikipedia.org/wiki/Gray_code">the wiki article</a> for more details. I want to generate gray codes with python, and I'd like to generate all possible gray codes for a specific number of positions and number of bits. The maximum number of positions for a gray code is 2^numbits i.e. a 4 bit gray code can have at most 16 positions. It can also have fewer. If you want the gray code to be circular i.e. the first and last code have hamming distance 1, then there needs to be an even number of positions. Odd is fine if you don't care if it is circular.</p> <p>There are of course algorithms for generating gray codes, but a constraint solver works really well, and also allows you to generate all possible gray codes for a specific number of positions. The python code below is for generating gray codes using python-constraint:</p> <pre class="prettyprint"><br />from constraint import *<br />from itertools import product<br /><br />numbits = 4<br />numpos = 16<br /><br /># this generates a list of all binary vectors of numbits bits<br />vecs = list(product(*(((1,0),)*numbits)))<br /><br />problem = Problem()<br /># we will have numpos variables, i.e. one var for each code,<br /># each var is an index into vecs, has 2^numbits possible vals<br />problem.addVariables(list(range(numpos)), list(range(len(vecs))))<br /><br /># vecs holds our binary vectors. <br /># we want True if hamming distance between vecs[a] and vecs[b] is 1<br /># False otherwise. This will be our constraint. <br />def hammingDist1(a,b):<br /> hamming = sum([vecs[a][i]^vecs[b][i] for i in range(numbits)])<br /> if hamming == 1: return True<br /> return False<br /><br /># add the important constraints<br />for i in range(numpos):<br /> j = (i+1)%numpos<br /> problem.addConstraint(hammingDist1,[i,j])<br /><br /># make sure all the vectors are different <br />problem.addConstraint(AllDifferentConstraint())<br /><br /># print out the first solution we come across<br />sol = problem.getSolution()<br />for i,j in enumerate(sol):<br /> print(vecs[sol[i]])<br /> <br /></pre> <p>The main thing to note is that we add constraints so that each code has a hamming distance of 1 with its neighbour, and also that all the codes are diferent.</p> <p>An example run looks like this (for a 4-bit, 16 position code):</p><pre><br />C:\Users\james\Desktop>python gray.py<br />(0, 0, 0, 0)<br />(0, 0, 0, 1)<br />(0, 0, 1, 1)<br />(0, 0, 1, 0)<br />(0, 1, 1, 0)<br />(0, 1, 1, 1)<br />(0, 1, 0, 1)<br />(1, 1, 0, 1)<br />(1, 0, 0, 1)<br />(1, 0, 0, 0)<br />(1, 0, 1, 0)<br />(1, 0, 1, 1)<br />(1, 1, 1, 1)<br />(1, 1, 1, 0)<br />(1, 1, 0, 0)<br />(0, 1, 0, 0)<br /></pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com4tag:blogger.com,1999:blog-1147426303635403896.post-75828828608071794792017-01-16T23:26:00.000-08:002019-02-23T02:17:05.185-08:00Latex Dictionary Template<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script> <p>I recently needed a latex dictionary template, there are a couple around on the internet, but they are all under non-commercial licences so I couldn't use them. Fortunately it is pretty easy to make one yourself, I have provided one here. Consider it to be under the <a href="http://www.wtfpl.net/">wtfpl</a> licence.</p> <p>A short example of what the definitions look like:</p> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgm0PB1uMwW84J39u3wZfvoWa-vCCKbsxaVjGhigfdTeDPjX65TtMmK442iWLZjwNzBimi3YEX_so-2lrrwU0niL1DuglXCWq-J08DJp1uY0JqIcAMLBqsL7kJTHKMW_FdiRpmDS5im1ptT/s1600/example.png" imageanchor="1" ><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgm0PB1uMwW84J39u3wZfvoWa-vCCKbsxaVjGhigfdTeDPjX65TtMmK442iWLZjwNzBimi3YEX_so-2lrrwU0niL1DuglXCWq-J08DJp1uY0JqIcAMLBqsL7kJTHKMW_FdiRpmDS5im1ptT/s320/example.png" width="320" height="272" /></a> <pre class="prettyprint"><br />%Copyright 2017 James Lyons<br />%This work is free. You can redistribute it and/or modify it under the<br />%terms of the Do What The Fuck You Want To Public License, Version 2,<br />%as published by Sam Hocevar. See http://www.wtfpl.net/ for more details.<br />\documentclass[twoside]{book}<br />\usepackage[utf8]{inputenc}<br />\usepackage{multicol}<br />\usepackage{fancyhdr} <br />\usepackage[bindingoffset=0.2in,margin=0.75in,paperwidth=5.25in,paperheight=8in]{geometry}<br /><br />% set the headers so that first mark is always on left, second on right<br />% this is so that each word is put in the header correctly <br />\pagestyle{fancy} <br />\fancyhead[LO,LE]{\rightmark}<br />\fancyhead[RE,RO]{\leftmark}<br />% set the font<br />\renewcommand*\rmdefault{bch}<br />\newcommand{\comment}[1]{}<br />% remove chapter numbering<br />\usepackage{titlesec}<br />\titleformat{\chapter}<br /> {\Large\bfseries} % format<br /> {} % label<br /> {0pt} % sep<br /> {\huge} % before-code<br />% remove section numbering<br />\setcounter{secnumdepth}{0}<br />% this is how we define the actual dictionary items. markboth is for <br />% setting the page headers so that first and last words appear there.<br />\newcommand{\ditem}[1]{\item[#1] \markboth{\footnotesize \textbf{#1}}{\footnotesize \textbf{#1}}}<br />\begin{document}<br />%%%%%%%%%%%%%%%%%%%%%%%% title page<br />\begin{center}<br /><br />\thispagestyle{empty}<br />\vbox{\vspace{5cm}<br />{\huge The Dictionary Title}}<br />\vfill<br />{\large Author}<br />\vfill<br /><br />\end{center}<br />\clearpage<br />%%%%%%%%%%%%%%%%%%%%%%%% table of contents<br />\tableofcontents<br /><br />%%%%%%%%%%%%%%%%%%%%%%%% Preface<br />\chapter{Preface}<br /><br />This is the preface.<br /><br />\chapter{The Dictionary}<br />% we'd like the text size to be a bit smaller for the definitions<br />\footnotesize<br />% we'd like two columns of definitions<br />\begin{multicols}{2}<br /><br />\begin{description}<br />\section{A}<br />\ditem{Abandon} \textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abandoned} \textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abbey}\textit{-v.} definition 1 \textit{-n.} definition 2<br />\ditem{Abide} \textit{-v.} definition 1 \textit{-n.} definition 2<br />\end{description}<br /><br />\end{multicols}<br /> <br />\end{document}<br /></pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com0tag:blogger.com,1999:blog-1147426303635403896.post-9497513212433158302017-01-15T23:37:00.000-08:002019-02-23T02:17:05.361-08:00Creating crosswords with python-constraint [part 2]<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script> <p>In the <a href="http://www.cromulentrambling.com/2017/01/creating-crosswords-with-python.html">previous post</a> I was generating the patterns of white and black squares. In this post I'd like to actually fill the patterns with words so that everything matches up. This sort of problem is very well suited to a constraint satisfaction solver, as it can try many possibilities until something is found that works. The trick is specifying the constraints in an efficient manner. My first solution was to implement it the same way as the <a href="http://www.cromulentrambling.com/2017/01/magic-word-squares-with-python.html">magic word squares</a>, but this was really inefficient and there is a much better way.</p> <p>I put the code for this on github, it's a bit messy but works: <A href="https://github.com/jameslyons/crossword_gen">https://github.com/jameslyons/crossword_gen</a>.</p> <p>The trick is you don't need to represent all the different squares in the crossword as separate variables and put the constraints on those, your actual variables will be indexes into the word list, and the constraints will be whether e.g. character 4 of word 1 is the same as character 2 of word 7. These constraints will obviously be different for every crossword and will have to be generated each time, but this isn't too hard.</p> <P>The inputs from the previous step are hex strings e.g.</p><pre>3EF9C51EBFAFAAF8FE37577BFB9D5CEFEF75763F8FAAFAFEBC51CFBEg</pre> <p>This can be converted to a binary string by replacing e.g. 3 with '0011', same as decoding hex anywhere. The function I use is down the bottom of the <a href="http://www.cromulentrambling.com/2017/01/creating-crosswords-with-python.html">previous post</a>. This gives a binary string that can be represented as the white and black squares of the crossword.</p> <h1>Example run</h1><p>The code takes as argument the hex representation of the pattern. It first prints a representation of the crossword with 'X's where the letters go, then once it gets solved it outputs the solved crossword, along with the list of words that got used. The dictionary I am using has some weird words in it, so you need to pay some attention to that if you want nice crosswords (I recommend using something like the <tt>3of6game.txt</tt> from the <a href="http://wordlist.aspell.net/12dicts/">12dicts wordlist</a>, it is specifically for games). The final thing that gets printed is the crossword represented as a string, this is passed to some typesetting code I have that produces nice pdfs of the crosswords.</p> <pre>>pypy cw_fill.py BEFB4516BFAFAAF8FE27573BFBDD5DEFEE75723F8FAAFAFEB4516FBEh<br />building word lists...<br />X X X X X X X X X X X X<br />X X X X X X<br />X X X X X X X X X X X<br />X X X X X X X X X X X<br />X X X X X X X X X<br /> X X X X X X X X<br />X X X X X X X X X X X X X<br />X X X X X X X X X X<br />X X X X X X X X X X X X X<br /> X X X X X X X X<br />X X X X X X X X X<br />X X X X X X X X X X X<br />X X X X X X X X X X X<br />X X X X X X<br />X X X X X X X X X X X X<br />starting...<br />count: 0<br />s . c r u m b . r e m a n . b<br />w . r . . . r . i . . . o . r<br />a . a . p h o e b u s . s . i<br />m o w n . e . l . r . g y m s<br />i . . . c r i m i n e . . . k<br />. . p o o . n . n . v e g . .<br />a v o . i n f a u n a . i l k<br />s . s u n . l . t . d i g . i<br />p h i . a l a n i n e . o f t<br />. . t a g . m . l . r a t . .<br />i . . . e y e l e s s . . . o<br />m o t s . e . o . p . w a t t<br />i . o . s t o p g a p . m . t<br />d . r . . . a . e . . . i . a<br />e . c l a c k . l u l l s . r<br /><br />['swami', 'asp', 'imide', 'craw', 'posit', 'torc', 'coinage', 'her', 'yet', 'bro',<br /> 'inflame', 'oak', 'elm', 'lop', 'rib', 'inutile', 'gel', 'urn', 'spa', 'evaders',<br /> 'nosy', 'gigot', 'amis', 'brisk', 'kit', 'ottar', 'crumb', 'reman', 'phoebus',<br /> 'mown', 'gyms', 'crimine', 'poo', 'veg', 'avo', 'infauna', 'ilk', 'sun', 'dig',<br /> 'phi', 'alanine', 'oft', 'tag', 'rat', 'eyeless', 'mots', 'watt', 'stopgap',<br /> 'clack', 'lulls']<br />solution time = 1.05939997405<br />s.crumb.reman.bw.r...r.i...o.ra.a.phoebus.s.imown.e.l.r.gymsi...crimine...k..p<br />oo.n.n.veg..avo.infauna.ilks.sun.l.t.dig.iphi.alanine.oft..tag.m.l.rat..i...ey<br />eless...omots.e.o.p.watti.o.stopgap.m.td.r...a.e...i.ae.clack.lulls.r</pre> <h1>The code</h1><p>This takes the hex string output from the pattern generation step as an argument. It uses a wordlist called 'fewerwords.txt'. Each time it is run it should give different results because the word indexes are shuffled each time. Sometimes it will take a long time to find a solution, I suggest just killing it and running it again. Most of the time this will get a new starting point that allows it to be solved very quickly, though some patterns are just harder to solve than others.</p> <pre class="prettyprint"><br />from __future__ import print_function<br />from constraint import *<br />from timeit import default_timer as timer<br />from itertools import combinations<br />from math import sqrt<br />from test_bin2hex import hex2bin<br />import sys<br /><br />''' to run from pypy<br />pypy cw_fill.py 3EF9C51EBFAFAAF8FE37577BFB9D5CEFEF75763F8FAAFAFEBC51CFBEg<br />'''<br /># this program will take a crossword pattern and try to fill it with words<br /><br />if len(sys.argv) < 2: print('supply pattern as argument')<br />else:<br /> pattern = hex2bin(sys.argv[1]) <br /> pattern = [int(i) for i in pattern]<br /><br />s = int(sqrt(len(pattern)))<br /><br />print('building word lists...')<br /># extract the word positions from the pattern<br />wordlocs = []<br />hwordlocs = []<br />vwordlocs = []<br />for i in range(s):<br /> hword = []<br /> vword = []<br /> for j in range(s):<br /> if pattern[i*s + j] == 0: print(' ',end=' ')<br /> else: print('X',end=' ')<br /> if pattern[i*s + j] == 1:<br /> hword.append(i*s + j)<br /> else:<br /> if len(hword) > 1: hwordlocs.append(hword)<br /> hword = [] <br /> <br /> if pattern[j*s + i] == 1:<br /> vword.append(j*s + i)<br /> else:<br /> if len(vword) > 1: vwordlocs.append(vword)<br /> vword = []<br /> <br /> if j == s-1:<br /> if len(vword) > 1 and vword not in vwordlocs: vwordlocs.append(vword)<br /> if len(hword) > 1 and hword not in hwordlocs: hwordlocs.append(hword)<br /> <br /> print('')<br />wordlocs.extend(vwordlocs)<br />wordlocs.extend(hwordlocs)<br /><br /># now that we have the words, add constraints so each word is in the dictionary<br />maxL = 10<br /><br />wordlist = []<br />f = open("fewwords.txt")<br />for line in f:<br /> w = line.split()[0].lower()<br /> wordlist.append(w)<br /><br />wordlen = []<br />olen = 0<br />for c,word in enumerate(wordlist):<br /> if olen < len(word):<br /> wordlen.append(c)<br /> olen = len(word)<br /><br />problem = Problem()<br />from random import shuffle<br /><br />for c,w in enumerate(wordlocs):<br /> L = len(w)-3<br /> r = list(range(wordlen[L],wordlen[L+1]))<br /> shuffle(r)<br /> problem.addVariables([c], list(r))<br /><br />problem.addConstraint(AllDifferentConstraint())<br /><br />def eq(a,b):<br /> def temp(c1,c2):<br /> word1 = wordlist[c1]<br /> word2 = wordlist[c2]<br /> if word1[a]==word2[b]: return True<br /> return False<br /> return temp<br /> <br />for c1,word1 in enumerate(wordlocs):<br /> for c2,word2 in enumerate(wordlocs):<br /> if word1==word2: continue<br /> intsc = list(set(word1).intersection(set(word2)))<br /> if len(intsc) > 0:<br /> a = word1.index(intsc[0])<br /> b = word2.index(intsc[0])<br /> problem.addConstraint(eq(a,b),[c1,c2]) <br /> <br />print('starting...')<br /> <br />def getwords(solution):<br /> listq = []<br /> for locs in wordlocs:<br /> word = ''.join([solution[loc] for loc in locs])<br /> listq.append(word) <br /> return listq<br /> <br /># pretty print the output<br />start = timer()<br />b = ''<br />for count,a in enumerate(problem.getSolutionIter()): <br /> print('count: '+str(count))<br /> sol = ['.' for i in range(s*s)]<br /> for key,val in a.iteritems():<br /> for c,char in enumerate(wordlist[val]):<br /> sol[wordlocs[key][c]] = char <br /> b=sol<br /> for i in range(s):<br /> for j in range(s): <br /> print(sol[i*s+j],end=' ')<br /> print('')<br /> print('') <br /> # get the words we have used<br /> zz = getwords(sol)<br /> print(zz)<br /> break<br />end = timer()<br />print("solution time = " + str(end-start))<br />print(''.join([str(b[i]) for i in range(s*s)]))<br /></pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com2tag:blogger.com,1999:blog-1147426303635403896.post-49215539438502722492017-01-15T22:38:00.000-08:002019-02-23T02:17:05.538-08:00Creating Crosswords with python-constraint [part 1]<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script> <h1>Intro</h1><p>I put the code for this on github, it's a bit messy but works: <A href="https://github.com/jameslyons/crossword_gen">https://github.com/jameslyons/crossword_gen</a>.</p> <p>In <a href="http://www.cromulentrambling.com/2017/01/magic-word-squares-with-python.html">previous posts</a> I have been playing with python-constraint, and building crosswords is something that interests me for some reason. I decided to do it in a two step process: step 1 generate the patterns of white and black squares, step 2 fill the patterns with letters. This post will deal with step 1 only, the <a href="http://www.cromulentrambling.com/2017/01/creating-crosswords-with-python_15.html">next post</a> will deal with step 2.</p> <p>Generating the patterns is well suited to using a constraint solver since you can just specify the things you'd like to be true of the final result, then ask the constraint solver to do it for you. In this case we have e.g. 15 by 15 binary variables, we choose the constraints so that the final pattern looks like a crossword puzzle.</p> <h1>What do pretty patterns look like</h1><p>I am interested in automatically generating patterns that look like nice crossword patterns e.g.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8TcUHwRD6HLE_7fT_esuXuuqTt52h4OLLrhYI29fSOwiNnqLwFjb9ksxD89RzibMBu56cSxW3BtTggtNh7TwGpXcxBNI3UX1aqJ1dqj6APEliAc-b2q9TlPDDT6anj0Nh5t-0aLrZlMRm/s1600/1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8TcUHwRD6HLE_7fT_esuXuuqTt52h4OLLrhYI29fSOwiNnqLwFjb9ksxD89RzibMBu56cSxW3BtTggtNh7TwGpXcxBNI3UX1aqJ1dqj6APEliAc-b2q9TlPDDT6anj0Nh5t-0aLrZlMRm/s320/1.jpg" width="320" height="320" /></a></div> <p>Some of the things I'd like to be true of the patterns I generate:</p><ol><li>I'd like 4-fold symmetry, because I think it looks good.</li><li>No 2x2 blocks of white squares.</li><li>No 2x2 blocks of black squares.</li><li>No length 1 words.</li><li>No length 2 words.</li><li>No words longer than MAXLEN.</li><li>No sequences of black squares longer than 3.</li><li>White squares must be fully connected.</li></ol> <p>Some of these conditions are pretty easy to implement with python-constraint, the fully connected constraint I can just apply to the final solutions and discard them at that point. By fully connected I mean you can get from one square to every other square in the pattern, so you can't get disjoint islands.</p> <h1>The implementation</h1><P>Examples of what the program outputs:</p><pre><br />3EF9C51EBFAFAAF8FE37577BFB9D5CEFEF75763F8FAAFAFEBC51CFBEg<br />X X . . . . . X . . . . . X X<br />. . . X X X . X . X X X . . .<br />. X . X . . . . . . . X . X .<br />. . . . X . X . X . X . . . .<br />. X X X . . . . . . . X X X .<br />. X . . . X . X . X . . . X .<br />. . . X . . . . . . . X . . .<br />X X . . . X . X . X . . . X X<br />. . . X . . . . . . . X . . .<br />. X . . . X . X . X . . . X .<br />. X X X . . . . . . . X X X .<br />. . . . X . X . X . X . . . .<br />. X . X . . . . . . . X . X .<br />. . . X X X . X . X X X . . .<br />X X . . . . . X . . . . . X X<br /></pre> <P>The hex string is the binary to hex encoding of the pattern, 'X's are zeros, '.'s are ones. The hex string will be passed to <a href="http://www.cromulentrambling.com/2017/01/creating-crosswords-with-python_15.html">step 2</a> which will fill in the words using a wordlist.</p> <p>It's pretty easy to generate a few million example patterns with this program (if you are willing to wait a day), which is what I suggest doing. Once you have a heap of patterns you can further filter them by whatever criteria you want e.g. no more than X 3-letter words, no more than X% black squares etc. Once you have the filtered list, you can try filling them with words.</p> <pre class="prettyprint"><br />from __future__ import print_function<br />from constraint import *<br />from itertools import combinations<br />from math import sqrt,floor<br /><br /># this program will create crossword patterns, but not fill it with words<br /># i.e. just patterns of zeros and ones for the black and white squares<br /><br /># tests if a solution is fully connected i.e. can every point be reached from<br /># any random point. True if fullyconnected, false if disjoint parts.<br /># Could easily be made more efficient, but this works fine.<br />def isfullyconnected(solution):<br /> L = int(sqrt(len(solution)))<br /> # get list of all non-zero points<br /> plist = []<br /> for i in range(L):<br /> for j in range(L):<br /> if solution[i*L + j]: plist.append((i,j))<br /> <br /> # start top right corner, try to get to every other point<br /> reached = {plist[0]:1}<br /> rlen = len(reached)<br /> oldrlen = 0<br /> while rlen != oldrlen:<br /> temp = list(reached.keys())<br /> for i,j in temp:<br /> if (i+1,j) in plist: reached[(i+1,j)]=1<br /> if (i-1,j) in plist: reached[(i-1,j)]=1<br /> if (i,j+1) in plist: reached[(i,j+1)]=1<br /> if (i,j-1) in plist: reached[(i,j-1)]=1<br /> oldrlen = rlen<br /> rlen = len(reached)<br /> # if rlen is not eq number of all points, we couldn't reach some points<br /> if rlen < len(plist): return False<br /> return True <br /> <br />from random import randint<br /> <br />def getPattern(sidelen=15,maxlen=9):<br /> problem = Problem()<br /> s = sidelen<br /> maxWordLen = maxlen<br /> problem.addVariables(list(range(s*s)), [0,1])<br /> <br /> # ensure there are no 2x2 blocks of ones or zeros<br /> def no4x4block(*x):<br /> if sum(x)==0: return False<br /> if sum(x)==len(x): return False<br /> else: return True<br /><br /> for i in range(s-1):<br /> for j in range(s-1):<br /> problem.addConstraint(no4x4block,[i*s+j,(i*s+j)+1,(i*s+j)+s,(i*s+j)+s+1])<br /> <br /> #ensure matrix is symmetric <br /> def eq(a,b): <br /> return a==b<br /><br /> for i in range(s):<br /> for j in range(s):<br /> problem.addConstraint(eq,[i*s+j,(s-i-1)*s+j])<br /> problem.addConstraint(eq,[i*s+j,i*s+(s-j-1)])<br /> problem.addConstraint(eq,[i*s+j,(s-i-1)*s+(s-j-1)])<br /> <br /> #ensure no words longer than maxWordLen<br /> def maxlen(*x):<br /> if sum(x) == len(x): return False<br /> else: return True<br /><br /> seq = list(range(maxWordLen+1))<br /> for i in range(s - maxWordLen):<br /> for j in range(s):<br /> problem.addConstraint(maxlen,[(k+i)+j*s for k in seq])<br /> problem.addConstraint(maxlen,[(k+i)*s+j for k in seq])<br /> <br /> # no two letter words<br /> def no2letter(*x):<br /> if x == (0,1,1,0): return False<br /> else: return True<br /> <br /> seq = list(range(4))<br /> for i in range(s - 3):<br /> for j in range(s):<br /> problem.addConstraint(no2letter,[(k+i)+j*s for k in seq])<br /> problem.addConstraint(no2letter,[(k+i)*s+j for k in seq])<br /><br /> # no two letter words around the edge either<br /> def no2letterEdge(*x):<br /> if x == (1,1,0): return False<br /> else: return True<br /> <br /> seq = list(range(4))<br /> for i in range(s):<br /> problem.addConstraint(no2letterEdge,[i,i+s,i+2*s])<br /> problem.addConstraint(no2letterEdge,[i*s,i*s+1,i*s+2])<br /> problem.addConstraint(no2letterEdge,[i*s+s-1,i*s+s-2,i*s+s-3])<br /> problem.addConstraint(no2letterEdge,[i+s*(s-1),i+s*(s-2),i+s*(s-3)])<br /> <br /> # no one letter squares<br /> def no1letter(*x): <br /> if sum(x) == 0: return True<br /> if sum(x[1:]) == 0: return False<br /> return True<br /> <br /> seq = list(range(4))<br /> for i in range(s):<br /> for j in range(s):<br /> vals = [i*s+j]<br /> if i-1 >= 0: vals.append((i-1)*s+j) <br /> if i+1 < s: vals.append((i+1)*s+j) <br /> if j-1 >= 0: vals.append(i*s+(j-1)) <br /> if j+1 < s: vals.append(i*s+(j+1)) <br /> problem.addConstraint(no1letter,vals)<br /> <br /> # no more than three 0's in a row<br /> def nothreeblank(*x):<br /> if sum(x) == 0: return False<br /> return True<br /> <br /> seq = list(range(4))<br /> for i in range(s - 3):<br /> for j in range(s):<br /> problem.addConstraint(nothreeblank,[(k+i)+j*s for k in seq])<br /> problem.addConstraint(nothreeblank,[(k+i)*s+j for k in seq])<br /> <br /> for sol in problem.getSolutionIter(): <br /> if not isfullyconnected(sol): continue<br /> yield(sol)<br /> <br />from test_bin2hex import bin2hex<br />import sys<br /><br />print(len(sys.argv))<br />if len(sys.argv)<2: sidelen = 11<br />else: sidelen = int(sys.argv[1])<br />print('using sidelen',sidelen)<br /><br />f = open('pattern'+str(sidelen)+'.txt','w')<br />for count,sol in enumerate(getPattern(sidelen=sidelen,maxlen=8)):<br /> f.write(bin2hex(''.join([str(sol[a]) for a in range(sidelen*sidelen)]))+'\n')<br /> if count % 100 == 0: <br /> print('count: '+str(count))<br /> print(bin2hex(''.join([str(sol[a]) for a in range(sidelen*sidelen)])))<br /> for i in range(sidelen):<br /> for j in range(sidelen): <br /> if sol[i*sidelen+j]==1: print('.',end=' ')<br /> else: print('X',end=' ')<br /> print('')<br /> print('')<br /> #break<br /></pre> <p>The hex string is generated using the following python script (test_bin2hex.py). The extra letters are so we can encode binary strings that aren't necessarily divisible by 4.</p><pre class="prettyprint"><br /><br />bin = ['0000','0001','0010','0011','0100','0101','0110','0111','1000','1001','1010','1011','1100','1101','1110','1111',<br /> '0','1','00','01','10','11','000','001','010','011','100','101','110','111']<br />hex = "0123456789ABCDEFghijklmnopqrst"<br /><br />def bin2hex(binstr):<br /> map = dict(zip(bin,hex))<br /> ret = ''<br /> for i in range(0,len(binstr),4):<br /> sub = binstr[i:i+4]<br /> ret += map[sub]<br /> return ret<br /> <br />def hex2bin(hexstr):<br /> map = dict(zip(hex,bin))<br /> ret = ''<br /> for i in hexstr:<br /> ret += map[i]<br /> return ret<br /></pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com0tag:blogger.com,1999:blog-1147426303635403896.post-88598672057437162672017-01-02T19:40:00.000-08:002019-02-23T02:17:05.714-08:00Magic Word Squares with python-constraint<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script> <p>This post will deal with the seemingly boring topic of magic word squares - I define these to be squares of letters where each row is a dictionary word, and each column is also a dictionary word. In my wordlist, there are 3,641 4-letter words (I talk about how I generated the wordlist <a href="#wordlist">below</a>). How many possible 4 by 4 word squares do you think there are? Would you believe 90.6 million? I was rather surprised to discover there were so many. What about 8 by 8 word squares (given there are 12,317 8 letter words in the wordlist)? would you believe 0? It turns out it gets much harder to find word squares as you increase the side length.</p> <p>I'll display some of the word squares and other stats before I display the python code I used to generate all these things further down.</p> <h1>Some example word squares</h1> <table><tr><td><pre><br />a z o<br />n o w<br />t o n<br /></pre></td><td><pre><br />f u z z<br />a l o e<br />g a r s <br />s n i t<br /></pre></td><td><pre><br />f r i z z<br />l a r e e<br />o d o r s<br />p i n o t<br />s i s s y<br /></pre></td></tr><tr><td><pre><br />z y g o t e<br />y e l p e d<br />g l u i n g<br />o p i a t e<br />t e n t e r<br />e d g e r s<br /></pre></td><td></td><td><pre><br />e s c a p e s<br />s e a l a n t<br />c a s e t t e<br />a l e r t e r<br />p a t t e r n<br />e n t e r a l<br />s t e r n l y<br /></pre></td></tr></table> <h1>How many word squares are there?</h1><p>I counted how many word squares there are for each side length. I was suprised how many there are for 4x4 squares, but then there are no 8x8 squares or higher. The No. of words is how many words of that particular length occur in the wordlist.</p> <p><table><tr><td>Side Length</td><td>No. of words</td><td>No. of squares</td></tr><tr><td>3</td><td>1,014</td><td>2,607,858</td></tr><tr><td>4</td><td>3,641</td><td>90,669,454</td></tr><tr><td>5</td><td>6,683</td><td>10,991,077</td></tr><tr><td>6</td><td>9,559</td><td>68,518</td></tr><tr><td>7</td><td>11,960</td><td>98</td></tr><tr><td>8</td><td>12,317</td><td>0</td></tr><tr><td>9</td><td>11,181</td><td>0</td></tr><tr><td>10</td><td>8,760</td><td>didn't check</td></tr><tr><td>11</td><td>6,296</td><td>didn't check</td></tr><tr><td>12</td><td>4,082</td><td>didn't check</td></tr><tr><td>13</td><td>2,493</td><td>didn't check</td></tr><tr><td>14</td><td>1,326</td><td>didn't check</td></tr><tr><td>15</td><td>720</td><td>didn't check</td></tr></table></p> <a name="wordlist"></a><h1>Building the wordlist</h1><p>I identified two wordlists which I might use: the <A href="https://raw.githubusercontent.com/jonbcard/scrabble-bot/master/src/dictionary.txt">scrabble dictionary</a>, and <a href="http://norvig.com/ngrams/count_1w.txt">googles billion word corpus</a>. Unfortunately, the scrabble dictionary has so many weird words I've never heard of before, and google's dataset has a heap of misspellings and gibberish. To get my wordlist I found the intersection of the two, with the hope being that the result will be just reasonably common proper English words.</p> <h1>The code</h1><p>I used python-constraint to solve the squares. A basic (very slow) implementation is shown below</p> <pre class="prettyprint"><br />from constraint import *<br />problem = Problem()<br />s = 6<br />problem.addVariables(range(s*s), list("abcdefghijklmnopqrstuvwxyz"))<br /><br /># build the wordlist from the wordlist text file<br />wordlist = {}<br />f = open("fewer_words.txt")<br />for line in f:<br /> w = line.split()[0].lower()<br /> wordlist[w] = 1<br /><br /># we need a function: true if word is in wordlist, false if not<br /># this is characterwise checking, so a sequence of chars is provided, <br /># we need to join them into a string before checking if they are in the list<br />def inlist(*word): <br /> w = ''.join(word)<br /> if w in wordlist: return True<br /> return False<br /> <br /># add row and col constraints<br />for i in range(0,s): problem.addConstraint(inlist,range(s*i,s*i+s))<br />for i in range(0,s): problem.addConstraint(inlist,range(i,s*s,s))<br /><br /># pretty print the output<br />a = problem.getSolution()<br />for i in range(s):<br /> for j in range(s): print a[i*s+j],<br /> print('')<br /></pre> <p>It can be sped up a lot by adding way more constraints on what bigrams, trigrams etc are possible. I apologise for the scruffy code, I was more concerned with getting it to work efficiently than making it look nice</p><pre class="prettyprint"><br />from constraint import *<br />from timeit import default_timer as timer<br />from itertools import combinations<br /><br />problem = Problem()<br />s = 3<br />problem.addVariables(list(range(s*s)), list("abcdefghijklmnopqrstuvwxyz"))<br /><br />wordlist = {}<br />f = open("fewer_words.txt")<br />for line in f:<br /> w = line.split()[0].lower()<br /> if len(w) != s: continue <br /> wordlist[w] = 1<br /><br /># build a list of monograms in each position<br />mono = []<br />for j in range(s): mono.append({})<br />for i in wordlist.keys():<br /> for j in range(s):<br /> mono[j][i[j]] = 1<br /> <br /># build a list of bigrams at each pair of positions<br />bgpos = list(combinations(list(range(s)),2))<br />bg = {}<br />for j in bgpos: bg[j] = {}<br />for i in wordlist.keys():<br /> for j in bgpos:<br /> bg[j][i[j[0]]+i[j[1]]] = 1<br /> <br />tripos = list(combinations(list(range(s)),3))<br />tri = {}<br />for j in tripos: tri[j] = {}<br />for i in wordlist.keys():<br /> for j in tripos:<br /> tri[j][i[j[0]]+i[j[1]]+i[j[2]]] = 1<br /><br />qpos = list(combinations(list(range(s)),4))<br />q = {}<br />for j in qpos: q[j] = {}<br />for i in wordlist.keys():<br /> for j in qpos:<br /> q[j][i[j[0]]+i[j[1]]+i[j[2]]+i[j[3]]] = 1<br /><br />fpos = list(combinations(list(range(s)),5))<br />f = {}<br />for j in fpos: f[j] = {}<br />for i in wordlist.keys():<br /> for j in fpos:<br /> f[j][i[j[0]]+i[j[1]]+i[j[2]]+i[j[3]]+i[j[4]]] = 1<br /><br />def opposite():<br /> return lambda x,y: x==y<br /> <br /># we need a function: true if word is in wordlist, false if not<br />def inlist(*word): <br /> w = ''.join(word)<br /> if w in wordlist: return True<br /> return False<br /> <br />def inmono(i):<br /> return lambda x: x in mono[i]<br /> <br />def inbg(i,j):<br /> return lambda *x: ''.join(x) in bg[(i,j)] <br /> <br />def intri(i,j,k):<br /> return lambda *x: ''.join(x) in tri[(i,j,k)] <br /><br />def inq(i,j,k,l):<br /> return lambda *x: ''.join(x) in q[(i,j,k,l)] <br /><br />def inf(i,j,k,l,m):<br /> return lambda *x: ''.join(x) in f[(i,j,k,l,m)] <br /> <br /># add row and col constraints<br />for i in range(0,s): problem.addConstraint(inlist,list(range(s*i,s*i+s)))<br />for i in range(0,s): problem.addConstraint(inlist,list(range(i,s*s,s)))<br /><br /># add row monogram constraints<br />for j in range(s):<br /> for i in range(s):<br /> problem.addConstraint(inmono(i),[j*s +i])<br /> problem.addConstraint(inmono(i),[i*s +j])<br /><br />#print(bg.keys())<br />for j in range(s):<br /> for a,b in bgpos:<br /> problem.addConstraint(inbg(a,b),[j*s +a,j*s +b])<br /> problem.addConstraint(inbg(a,b),[a*s +j,b*s +j])<br /><br />for j in range(s):<br /> for a,b,c in tripos:<br /> problem.addConstraint(intri(a,b,c),[j*s +a,j*s +b,j*s +c])<br /> problem.addConstraint(intri(a,b,c),[a*s +j,b*s +j,c*s +j])<br /><br />for j in range(s):<br /> for a,b,c,d in qpos:<br /> problem.addConstraint(inq(a,b,c,d),[j*s +a,j*s +b,j*s +c,j*s +d])<br /> problem.addConstraint(inq(a,b,c,d),[a*s +j,b*s +j,c*s +j,d*s +j])<br /><br />for j in range(s):<br /> for a,b,c,d,e in fpos:<br /> problem.addConstraint(inf(a,b,c,d,e),[j*s +a,j*s +b,j*s +c,j*s +d,j*s +e])<br /> problem.addConstraint(inf(a,b,c,d,e),[a*s +j,b*s +j,c*s +j,d*s +j,e*s +j])<br /><br /># pretty print the output<br />start = timer()<br />for count,a in enumerate(problem.getSolutionIter()):<br /> if count % 1000 == 0: print 'count:',count <br /> for i in range(s):<br /> for j in range(s): print a[i*s+j],<br /> print('')<br /> print('') <br />end = timer()<br /><br />print("solution time = " + str(end-start))<br />print count<br /><br />for i in range(s):<br /> for j in range(s): print a[i*s+j],<br /> print ''<br />print '' <br /></pre> James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com2tag:blogger.com,1999:blog-1147426303635403896.post-42508134851499508152016-11-22T03:54:00.000-08:002019-02-23T02:17:05.890-08:00Constraint Programming with python-constraint<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script> <p>I have recently been looking at constraint programming and I thought I would post some of my code as I go. This is using the <a href="https://labix.org/python-constraint">python-constraint package</a>.</p> <p>Doing things with python-constraint is pretty easy. To constrain a variable to have a certain value, we can use InSetConstraint. To make sure all the variables in a set are different, we use the AllDifferentConstraint. These are enough to solve sudoku. <h1>Sudoku</h1><p>I don't think Sudoku needs any introduction. The constraints are fairly easy to specify: each row must be all different, each col must be all different and each 3x3 block must be all different. Also we have to ensure that the given starting state is used, for this we use equality constraints. We use 1 variable for each number in the sudoku square, so 81 variables all up.</p> <pre class="prettyprint">from constraint import *<br />problem = Problem()<br />problem.addVariables(range(9*9), range(1,10))<br /><br /># specify the given starting sudoku problem here, dots are blank squares<br />sudoku = '\<br />.42986.1.\<br />....2...5\<br />......82.\<br />4.1..793.\<br />.683.254.\<br />.758..1.2\<br />.39......\<br />7...4....\<br />.1.67325.'<br /><br /># Make sure our solution matches the specified starting thing<br />for i,char in enumerate(sudoku):<br /> if char != '.': problem.addConstraint(InSetConstraint([int(char)]),[i])<br /> <br /># add row and col constraints<br />for i in range(9): <br /> problem.addConstraint(AllDifferentConstraint(),range(9*i,9*i+9))<br />for i in range(9): <br /> problem.addConstraint(AllDifferentConstraint(),range(i,81,9))<br /># add block constraints<br />for i in [0,3,6,27,30,33,54,57,60]: <br /> ind = []<br /> for j in range(3):<br /> for k in range(3):<br /> ind.append(i + j*9 + k)<br /> problem.addConstraint(AllDifferentConstraint(),ind)<br /><br /># pretty print the output<br />a = problem.getSolution()<br />for i in range(9):<br /> for j in range(9): print a[i*9+j],<br /> print ''<br /></pre> <h1>The Golomb Ruler</h1><p>A bit of info on Golomb Rulers: <a href="https://en.wikipedia.org/wiki/Golomb_ruler">here</a>. If you know how to improve the code below, leave a comment. Golomb rulers are described by their order, which is the number of marks on their edge. The length of a Golomb ruler is the distance between the outer two marks and represents the longest distance it can measure.</p> <p>There is no requirement that a Golomb ruler measures all distances up to their length. However, if a ruler does measure all distances, it is classified as a perfect Golomb ruler. There are no perfect Golomb rulers above order 4.</p> <p>Finally, a Golomb ruler is described as optimal if no shorter ruler of the same order exists. The code below first looks for perfect rulers, then increments length each time until an optimal ruler is found. It starts to slow down by order = 7, it takes a few minutes.</p> <p> This problem is harder than the sudoku problem, we have to use a set of auxiliary variables to represent the pairwise differences, then apply the AllDifferentConstraint to the auxiliary variables.</p> <pre class="prettyprint">from constraint import *<br />from itertools import combinations<br /><br />order = 7 # the number of marks on the Golomb ruler<br />diffs = [] # the set of pairwise differences between marks<br />for i in combinations(range(order),2): diffs.append(i)<br />length = len(diffs)<br /><br />def lessthan(a,b): return a < b<br />def diffeq(a,b,res): return (b-a)==res <br /><br />while True:<br /> print 'trying length: '+str(length)<br /> problem = Problem()<br /> problem.addVariables(range(order), range(0,length+1))<br /> problem.addVariables(diffs, range(1,length+1))<br /><br /> # make sure the first mark is 0 and last is length<br /> problem.addConstraint(InSetConstraint([0]),[0])<br /> problem.addConstraint(InSetConstraint([length]),[order-1])<br /><br /> # ensure that the marks are in increasing order<br /> # make sure all possible pairwise relations are constrained<br /> for i in diffs: problem.addConstraint(lessthan,[i[0],i[1]])<br /><br /> # ensure that the differences reflect the marks<br /> for i in diffs: problem.addConstraint(diffeq,[i[0],i[1],i])<br /> <br /> # all the differences must be different <br /> problem.addConstraint(AllDifferentConstraint(),diffs)<br /> <br /> # pretty print the output<br /> solutions = problem.getSolutions()<br /> for ruler in solutions:<br /> for mark in range(order): print ruler[mark],<br /> print ''<br /> if len(solutions) > 0: break # if we find some solutions, quit<br /> length = length + 1<br /></pre> <p>example output for order = 7, note the shortest ruler is length 25, though a perfect ruler would be length 21:</p><pre>trying length: 21<br />trying length: 22<br />trying length: 23<br />trying length: 24<br />trying length: 25<br />0 4 9 15 22 23 25<br />0 3 4 12 18 23 25<br />0 2 3 10 16 21 25<br />0 2 5 14 18 24 25<br />0 2 6 9 14 24 25<br />0 2 7 13 21 22 25<br />0 2 7 15 21 24 25<br />0 1 4 10 18 23 25<br />0 1 7 11 20 23 25<br />0 1 11 16 19 23 25</pre> <h1>Magic Squares</h1><p>Magic Squares seem to be very common examples when doing constraint programming, so I may as well include the code for them. Info on Magic Squares can be found <a href="https://en.wikipedia.org/wiki/Magic_square">here</a>. This code is quite slow, I thought it would be much faster. This may be because python-constraint is just kind of slow, or maybe the problem is harder than I thought.</p> <pre class="prettyprint">from constraint import *<br /><br />problem = Problem()<br />size = 4<br />problem.addVariables(range(size*size), range(1,size*size+1))<br />total = size*(size*size + 1)/2<br /><br /># make sure rows add to total<br />for i in range(size): <br /> problem.addConstraint(ExactSumConstraint(total),range(i*size,i*size+size))<br />for i in range(size): <br /> problem.addConstraint(ExactSumConstraint(total),range(i,size*size,size))<br /># add the diagonal constraints<br />diag1,diag2 = [],[]<br />for i in range(size):<br /> diag1.append(i*size + i)<br /> diag2.append(i*size + (size-i-1))<br /><br />problem.addConstraint(ExactSumConstraint(total),diag1)<br />problem.addConstraint(ExactSumConstraint(total),diag2)<br />problem.addConstraint(AllDifferentConstraint())<br /> <br /># pretty print the output<br />solution = problem.getSolution()<br />for i in range(size):<br /> for j in range(size):<br /> print solution[i*size + j],<br /> print ''<br /></pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com5tag:blogger.com,1999:blog-1147426303635403896.post-69410411423165108882016-11-14T02:28:00.000-08:002019-02-23T02:17:06.068-08:00Terraforming Venus<p>I thought I would look at some of the figures involved in terraforming Venus, discussions of which can be found all over the place e.g. <a href="http://terraforming.wikia.com/wiki/Venus">here</a>. The numbers from this page come from <a href="https://www.reddit.com/r/space/comments/58x005/the_surface_of_venus/d943oa6/">here</a>. Venus is the planet next to earth as you go towards the Sun. It has a year of 224.7 Earth days, and a day length of 243 Earth days (longer than the year!). It also rotates in the opposite direction to most other planets, likely due to some large collision early in its life.</p> <p>I'll quote a little bit from <a href="https://en.wikipedia.org/wiki/Venus">wikipedia</a>: Venus has an extremely dense atmosphere composed of 96.5% carbon dioxide, 3.5% nitrogen, and traces of other gases, most notably sulfur dioxide. The mass of its atmosphere is 93 times that of Earth's, whereas the pressure at its surface is about 92 times that at Earth's—a pressure equivalent to that at a depth of nearly 1 kilometre under Earth's oceans. The density at the surface is 65 kg/m3, 6.5% that of water or 50 times as dense as Earth's atmosphere at 20 °C at sea level. The CO2-rich atmosphere generates the strongest greenhouse effect in the Solar System, creating surface temperatures of at least 735 K (462 °C).</p> <p>Terraforming Venus would involve removing most of the sulfuric acid and carbon dioxide, needless to say this is a monumental task that we don't currently have the technology for. Some of the truly staggering figures:</p> <table><tr><td>Total Mass of Venus Atmosphere </td><td> = 4.80E+20 kg</td></tr><tr><td>Percent Atmosphere CO2 </td><td> = 96.50% </td></tr><tr><td>Total Mass of CO2 </td><td> = 4.63E+20 kg</td></tr></table>For comparison: <table><tr><td>Total Mass of Earth's Atmosphere </td><td> = 5.10E+18 kg</td></tr><tr><td>Mass Ratio Venus to Earth </td><td> = 94.118 </td></tr></table><p>Utilizing the Bosch reaction (CO2 + 2H2 -> C + 2H2O) , combining hydrogen with carbon dioxide to make carbon graphite and water could be used to remove Venus' carbon dioxide. This reaction requires the introduction of iron, cobalt or nickel as a catalyst and requires a temperature level of 530-730 degrees Celsius. Since the reaction releases some heat, it could be self sustaining if you could supply enough hydrogen. A problem with this is that that the production of elemental carbon tends to foul the catalyst's surface, which is detrimental to the reaction's efficiency [<a href="https://en.wikipedia.org/wiki/Bosch_reaction">ref</a>].</p> <table><tr><td>Molecular Weight of CO2 </td><td> = 44 </td></tr><tr><td>Molecular Weight of 2*H2 </td><td> = 4 </td></tr><tr><td>Total Initial Molecular Weight </td><td> = 48 </td></tr><tr><td>Molecular Weight of C (graphite) </td><td> = 12 </td></tr><tr><td>Molecular Weight of 2*H2O </td><td> = 36</td></tr><tr><td>Total Final Molecular Weight </td><td> = 48 </td></tr></table><p>You could use water from e.g. comets to get hydrogen for the reaction. You would have to use Electrolysis to split the water and initiate the reaction. Alternatively you could reuse the water output from the Bosch reaction , recover the Hydrogen and continue. But this way you wouldn't have oceans on Venus at the end. </p> <p>If you were to provide hydrogen for the reaction (4E19 kg), this massive reaction would create an ocean nearly as large as 1/4 of the Earth’s ocean:</p><table><tr><td>Ratio of 2*H2O to CO2 (36 / 44) </td><td> = 0.818 </td></tr><tr><td>Resultant Mass of H2O </td><td> = 3.79E+20 kg</td></tr><tr><td>Density of H2O </td><td> = 1.000 g / cm^3</td></tr><tr><td> </td><td> = 1,000,000.000 g / M^3</td></tr><tr><td> </td><td> = 1,000.000 kg / M^3</td></tr><tr><td>Volume of Resultant H2O </td><td> = 3.79E+17 M^3</td></tr><tr><td> </td><td> = 3.79E+08 kM^3</td></tr><tr><td>Area of Venus Surface </td><td> = 4.602E+08 kM^2</td></tr><tr><td>Average Depth of H2O </td><td> = 0.824 kM </td></tr><tr><td>Total Volume of Earth's Oceans </td><td> = 1.300E+09 kM^3</td></tr><tr><td>Average Depth of Earth's Oceans </td><td> = 3.682 kM</td></tr></table> <p>It would also result in the deposition of a layer of graphite with an average thickness over the entire surface of Venus roughly equal to a 40 story building:</p> <table><tr><td>Ratio of graphite to CO2 (12 / 44) </td><td> = 0.273 </td></tr><tr><td>Resultant Mass of graphite </td><td> = 1.26E+20 kg</td></tr><tr><td>Density of C (graphite) </td><td> = 2.230 g / cm^3</td></tr><tr><td> </td><td> = 2,230,000.000 g / M^3</td></tr><tr><td> </td><td> = 2,230.000 kg / M^3</td></tr><tr><td>Volume of Resultant C (graphite) </td><td> = 5.66E+16 M^3</td></tr><tr><td> </td><td> = 5.66E+07 kM^3</td></tr><tr><td>Area of Venus Surface </td><td> = 4.602E+08 kM^2</td></tr><tr><td>Average Depth of C (graphite) </td><td> = 0.123 kM </td></tr></table> <p>So after converting all the carbon dioxide to graphite, you would be left with ~120 meter thick blanket of carbon over the planet. Though this may not be terrible, carbon makes a pretty good base for soils. There would still be a great deal of atmospheric pressure due to all the oxygen though, and a spark would result in a mighty conflagration that would undo all your work. A better idea may be to trap the carbon dioxide as carbonate rocks, but for this you would need large quantities of calcium and/or magnesium (About 8×10^20 kg of calcium or 5×10^20 kg of magnesium would be required). This would have the advantage of removing a lot of the oxygen, which would lower the atmospheric pressure. Unfortunately Magnesium carbonate begins to decompose at 350 degrees C, so Calcium might be a better bet. You would also need to remove all the sulphuric acid since it would dissolve all your carbonate rocks, using up the sulphuric acid but producing more carbon dioxide. </p> <p>Of course you have to power all this as well, electrolysis doesn't come cheap. Solar panels can work well on Venus because the sunlight is 4 times stronger. Perhaps some genetically engineered plant...</p>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com2tag:blogger.com,1999:blog-1147426303635403896.post-539555291306713782016-10-12T23:05:00.000-07:002019-02-23T02:17:06.244-08:00CODZ ciphers, testing RC4 in python<p>Some of the <a href="http://www.callofdutyzombies.com/topic/183529-all-revelations-ciphers-texture-files/#comment-1759652">CODZ ciphers</a> consist of binary data, and it is not clear what sort of cipher has created them. Notably 2,3,7 and 10 are base64 converted binary data. Ciphers 4,5,11, and 14 are hex data. I will be dealing with 2,3,7 and 10 in this post.</p> <p>Cipher 2 decrypts to: dd3ed3b56ff21a7751ebbbaf6ffd1086f190a6b29b49c2a47e656bcc91768346df0321db23505874cb9f8a71d978490e and 3a411e6479e27ee31d601d486c7c807dbd20d85273ad599f90a7126caae9406be1434fdbbeeefa45d9c6cbf5ad4fb4f7 depending on wether you read the base64 data forwards or backwards. Either way you get exactly 384 bits of binary, which is suspiciously similar to the SHA384 hash length. I tried running john the ripper on the two hashes for a day or so, but nothing turned up. Either it is not SHA384 or we're gonna need some better word lists.</p> <p>Ciphers 3,7 and 10 are longer. I have tested their entropy after breaking them up into different length segments, and the entropy is pretty much always equal to random, this means they are highly unlikely to have been generated by a classical cipher. More likely something modern like RC4. I have tried decrypting the 3 ciphers as RC4 using a wordlist for the keys, but no luck. </p> <p>Code for testing entropy:</p><pre><br />from base64 import decodestring<br />from bitarray import bitarray<br />from binascii import b2a_hex<br />from math import log<br /><br />ctext = 'iW9cXmzOU7ZuZBtW40b3ng...' # I cut the rest of the cipher off to save space<br />data = decodestring(ctext)<br /><br />a = bitarray(endian='little')<br />a.frombytes(data)<br /><br />for bitlen in range(1,16):<br /> code = {}<br /> for i in range(2**bitlen):<br /> string = format(i,'0'+str(bitlen)+'b')<br /> code[i] = bitarray(string)<br /> num = a.decode(code)<br /> freq = {}<br /> for i in num:<br /> if i in freq: freq[i] += 1.<br /> else: freq[i] = 1.<br /> N = len(num)<br /> en = 0<br /> for i in freq.values():<br /> p = i/N<br /> en -= p*log(p)<br /> print bitlen,' entropy: ',en,' if random: ',-log(1./2**bitlen)<br /></pre> <p>For decrypting RC4, we keep the key corresponding to the decrypt with the lowest entropy, if the wrong key is used something high entropy will result, but if it decrypts to text we should get something easily identifiable. Python code for decrypting RC4 with a word list: </p><pre><br />data = base64.b64decode(ctext)<br /><br />def rc4(data,key):<br /> S = range(256)<br /> j = 0<br /> out = []<br /><br /> #KSA Phase<br /> for i in range(256):<br /> j = (j + S[i] + ord( key[i % len(key)] )) & 0xFF<br /> S[i] , S[j] = S[j] , S[i]<br /><br /> #PRGA Phase<br /> i = j = 0<br /> for char in data:<br /> i = ( i + 1 ) & 0xFF<br /> j = ( j + S[i] ) & 0xFF<br /> S[i] , S[j] = S[j] , S[i]<br /> out.append(chr(ord(char) ^ S[(S[i] + S[j]) & 0xFF]))<br /> return out<br /> <br /># find a good word list and use it here<br />keylist = open("C:\Users\james\Desktop\cipher_stuff\simplesub_word\\count_1w.txt")<br /><br />besten = 10e10<br />bestkey = ""<br />bestout = ""<br />for count,key in enumerate(keylist):<br /> key = key.split()[0].strip()<br /> for i in range(3):<br /> if i == 0: key = key[0].upper() + key[1:].lower()<br /> elif i == 1: key = key.upper()<br /> else: key = key.lower()<br /> out = rc4(data,key)<br /> # compute entropy of the decoded text<br /> freq = {}<br /> for i in out:<br /> if i in freq: freq[i] += 1.<br /> else: freq[i] = 1.<br /> N = len(out)<br /> en = 0<br /> for i in freq.values():<br /> p = i/N<br /> en -= p*log(p)<br /> if en < besten:<br /> besten = en<br /> bestkey = key<br /> bestout = ''.join(out[:20])<br /> if count % 500 == 0: print count, key, bestkey,besten,bestout<br />print count, key, bestkey,besten,bestout<br /></pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com0tag:blogger.com,1999:blog-1147426303635403896.post-65465687555702712532016-10-09T01:44:00.000-07:002019-02-23T02:17:06.418-08:00CODZ 8. sha_paper <p>I havn't actually played the game, I just like breaking codes.</p> <p>Apparently there are some new CODZ ciphers, the transcripts can be seen here: <a href="http://www.callofdutyzombies.com/topic/183529-all-revelations-ciphers-texture-files/#comment-1759652">http://www.callofdutyzombies.com/topic/183529-all-revelations-ciphers-texture-files/#comment-1759652</a>.</p> <p>This post will look at the following cipher:</p> <pre><br />bx re yh zy bf lm kt ut yg se tb sx ky co jh km aq we tx wx<br />cy ji ut vt kn vc gx aw ij av qn lg ef fj uq bd kn sv cx fn<br />je wr rk kn cg aw xq vn zf li fh vz wt ta ia ij zf eh uf tj <br />qm yg hl yq cx ij vw ig de qz tg nj rs er vk tm sa yv tw hr <br />hs lt vy kr qc tv gh hb jn yb qh er ut gk et cs wv jl rh xo <br />wr ex hr xt zi kc xs qs fd wd cm ku ah fh fj lf ui ly sh vf <br />au xm hx qw dl gi cx vb dh wt xm kv un ej kt kt ye cq jd ef <br />eh zv xt he uz tg cl jw nr tw ur vo jt jo ru iq iy rz ey ho <br />gd nq yn bq ul ai fh bu ji ho nw qg yg vj if yv zu id jc gh <br />ke xr qf cq ra it gw dl fc gq ti iu qu ny vr gy sj rh iu hi <br />wr mv ym zi lk re vk xu ry uq gs ve gd yn bq ch ky er qh jr <br />ho ya el ky zj ei hz cb if dk<br /></pre> <p>There are 460 characters total, 25 distinct characters, IC is 0.0418. Counts:</p><pre>a: 12, c: 18, b: 11, e: 23, d: 12, g: 20,<br />f: 19, i: 24, h: 29, k: 21, j: 23, m: 9,<br />l: 14, o: 7, n: 15, q: 23, s: 12, r: 24,<br />u: 20, t: 26, w: 18, v: 23, y: 25, x: 19,<br />z: 13</pre> <p>Since this cipher uses pairs of characters, and there are 25 distinct characters, it is highly likely to be foursquare, bifid, playfair or some other digraphic substitution cipher based on a 5 by 5 grid. Usually I and J are combined, but it looks like in this cipher P is missing, so It was combined with something else.</p> <p>Other ciphers that use a 25 letter alphabet include BAZERIES, BIFID, CADENUS, CHECKERBOARD, CM BIFID, FOURSQUARE, PHILLIPS, PHILLIPS-C, PHILLIPS-RC, PLAYFAIR, SERIATED PLAYFAIR, TRI-SQUARE, TWIN BIFID and TWO-SQUARE. Unfortunately this is quite a lot, and trying them all will be time consuming.</p> <p>I am trying 4 variants of the above the cipher, the original cipher: bkreyh..., the reversed cipher: kdfibc..., forward but with each pair reversed: xberhy..., and each pair forwards, but the pairs from last to first: dkifcb... .</p> <p>BAZERIES is a substitution + transposition, which means its IC would be that of normal text. Since the IC of this text is ~0.04, I am confident in saying it is not BAZERIES . It also can't be CADENUS because the message length is not a multiple of 25. CHECKERBOARD uses a 5x5 key square, but can at maximum use 20 characters in the ciphertext. </p> <P>Possibilities after thinking a bit: BIFID, CM BIFID, TWIN BIFID, FOURSQUARE, PHILLIPS, PLAYFAIR, SERIATED PLAYFAIR, TRI-SQUARE, and TWO-SQUARE. The next piece of information is that none of the letter pairs in the ciphertext contain a doubled letter e.g. AA, BB, CC etc. This is highly unlikely to occur with most ciphers, but it is a peculiarity of PLAYFAIR. So this puts playfair near the top of the list. </p> <p>I have tried the <a href="http://practicalcryptography.com/cryptanalysis/stochastic-searching/cryptanalysis-playfair/"> playfair cryptanalysis program</a> over here on all four variants and it hasn't worked. I did a test sentence of the same length which it cracked in just a few iterations, so I am fairly confident in saying it is not a vanilla playfair, or at least if it is then something else is going on as well (such as some sort of route cipher).</p> <p>For bifid, there is no obvious period identifiable (<a href="http://practicalcryptography.com/cryptanalysis/stochastic-searching/cryptanalysis-bifid-cipher/">see here for how to compute it</a>), and some quick simulated annealing didn't turn anything up.</p> <p>Regarding the possibility of routes, 460 has a lot of factors: 2x230,4x115,5x92,10x45,20x23. This means lots of routes around the rectangles when the characters are lined up right. Exploring this would add a lot of extra time to the cracking effort.</p> <P>The quest continues...</p>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com0tag:blogger.com,1999:blog-1147426303635403896.post-26626896054942854692016-10-08T07:01:00.000-07:002019-02-23T02:17:06.592-08:00CODZ cipher 13. kin_paper_torn straddle checkerboard?<p>I haven't actually played the game, I just like breaking codes.</p> <p>Apparently there are some new CODZ ciphers, the transcripts can be seen here: <a href="http://www.callofdutyzombies.com/topic/183529-all-revelations-ciphers-texture-files/#comment-1759652">http://www.callofdutyzombies.com/topic/183529-all-revelations-ciphers-texture-files/#comment-1759652</a>.</p> <p>This post will look at the following cipher: </p><pre>@CB?>?>C@F?A?>CCCG??@C>?@C@C>???G=?C?=C@CGBC@CB?G@A<br />?=CB?C?G<?>CB@C=CB?<FG<?>C>C>CA?B?AGA?F?C?@CA?G=?B?>C@B?</pre> <h1>TLDR</h1><p>EDIT: I'm beginning to think that this is not the final decryption, that they have used a slightly modified method which results in some extra characters. There is too much decrypted for me to think that it is totally wrong, but I believe there is an extra step I'm missing somewhere in the middle. </EDIT></p> The ciphertext is a straddle checkerboard. Converting the 10 symbols to digits, and reversing the string, we get (using @=0, C=1, B=2, ?=3, >=4, F=5, A=6, G=7, = is 8, <=9, Note that this mapping is arbitrary, and any other mapping would work just as well):</p><tt>320143238736103135367632361414143975932181021439731321 83607321012710183138733341010341033711143635014343210</tt> <p>See the link for a description of the <a href="http://practicalcryptography.com/ciphers/straddle-checkerboard-cipher/">straddle checkerboard</a>. Using this key:</p> <pre> 0123456789<br /> C.D.JYPEKW <br />1 NXV.T.QFUL <br />3 ZSOGIARVBH</pre> <p>The message decrypts to <tt>OCTOBERNSAREFORTTTHEYFOUNDTHESOURCEONVENUSBEGINNINGEXTRACTION</tt>. with spaces: <tt>OCTOBER NSA REPORT T THEY FOUND THE SOURCE ON VENUS BEGINNING EXTRACTION</tt>. There are likely a couple characters wrong still, or the slight mistakes are intentional.</p> <h1>My approach</h1><p>There are 10 different characters, 107 in total. The counts for each character are as follows: <tt>A: 6, @: 11, C: 24, B: 9, G: 8, F: 3, =: 5, <: 3, ?: 28, >: 10</tt>. It so happens that 10 characters can be converted to digits, so we can look at existing ciphers that use digits e.g. the straddle checkerboard, GRANDPRE, MONOME-DINOME, MORBIT, NIHILIST SUBSTITUTION, POLLUX or TRIDIGITAL. There are probably also more.</p> <p>Converted to digits it is: <tt>0123434105363411173301430101433378313810172101237063812313 7934120181239579341414163236763531301637832341023</tt>. If you reverse it, you get <tt>320143238736103135367632361414143975932181021439731321 83607321012710183138733341010341033711143635014343210</tt>. I'll be talking about the non reversed ciphertext, but I'll be applying all the steps to both just in case. Using the above ciphertext and <a href="http://web.archive.org/web/20130726135932/http://home.comcast.net/~acabion/refscore.html">this</a> calculator, we can compute the most likely cipher algorithms, and it has concluded that it is most likely a nihilist substitution. Unfortunately, it can't be nihilist because it has an odd number of characters, and nihilist requires even. It can't be grandpre for the same reason. CryptoCrack (from <a href="https://sites.google.com/site/cryptocrackprogram/home">here</a>) can't break it as GRANDPRE, MORBIT, NIHILIST SUBSTITUTION, POLLUX or TRIDIGITAL.</p> <p>That really only leaves <a href="http://practicalcryptography.com/ciphers/straddle-checkerboard-cipher/">straddle checkerboard</a> as the last candidate from the ciphers I can think of. Assuming it is a straddle checkerboard, we first have to find to location of the blanks in the key. If we can find those locations, we can decrypt in with any letters and we will have a plain substitution cipher. There are 10 choose 2 = 45 ways of putting the first 2 blanks, and 20 choose 2 = 190 ways of putting the second. This means there are 45*190 = 8550 possible configurations of putting the blanks. Some of these will be invalid though, e.g. the key:</p><pre> 0 1 2 3 4 5 6 7 8 9<br /> f k m c p d y e<br />3: h b i g q r o s a z<br />7: l u t j n w v x </pre> <p>can't have 38, 39, 78 or 79 in the ciphertext. If these numbers do exist in the ciphertext, then this is not a possible set of blanks. After trying to decrypt our ciphertext with all 8550 keys, only 3340 are actually valid blank positions. In addition to this, most of the blank positions result in identical outputs to other blank positions, so we can discard all the duplicate outputs, leaving only 45 substitution ciphers that we have to try.</p> <p>After trying to decrypt the resulting 45 substitution ciphers from the forward cipher and another 45 from the reversed cipher, one of the decrypts, namely <tt>SALSYFWIRVWFESWLLLZFDHSOIBLZFRSOWAFSIKFIORYFTUIIUITFJLWVALUSI</tt>, can be decrypted to <tt>OCTOBERNSAREFORTTTHEYWOUNDTHESOURCEONMENUSBEGINNINGEXTRACTION</tt>. There are likely a couple characters wrong still, or the slight mistakes are intentional.</p> <p>This is a bit of python code that will spit out the possible substitution ciphers for a given straddle checkerboard ciphertext (Please excuse the rough code, I wrote everything pretty quick):</p><p> I have actually since discovered there is a much shorter way of getting to the final candidates, but I'll leave this code here. To rank the final candidates, use index of coincidence, the correct one is almost always the one with the IC closest to 0.07.</p> <pre>import random<br />import sys<br />from itertools import combinations<br /><br />ctext = '6909746723099383772753870703607230943837727093872638757438333832743772974928387272384175943874720383270'<br /><br />''' decrypt a straddle checkerboard cipher ctext given a key<br /> key should look like e.g. 'fkm.cpd.yehbigqrosazlutjnwvx..' <br /> it should be 30 chars in length, and have 4 dots. <br /> 2 dots in the first 10 chars, and 2 in the last 20.<br /> - if it returns 0 in the first result, the key was invalid<br />def scdecrypt(ctext,key):<br /> dotpos = []<br /> for i,k in enumerate(key): <br /> if k == '.': dotpos.append(i)<br /> output = "" <br /> if dotpos[0] > 9: return 0,output<br /> if dotpos[1] > 9: return 0,output<br /> if dotpos[2] < 10: return 0,output<br /> if dotpos[3] < 10: return 0,output<br /> flag = 0<br /> for cc in ctext:<br /> c = int(cc)<br /> if key[c] != '.' and flag == 0: <br /> output += key[c]<br /> elif flag == 1: <br /> if key[10+c] == '.': return 0,""<br /> output += key[10+c]<br /> flag = 0<br /> elif flag == 2: <br /> if key[20+c] == '.': return 0,"" <br /> output += key[20+c]<br /> flag = 0<br /> elif c == dotpos[0]: flag = 1 <br /> elif c == dotpos[1]: flag = 2<br /> else: <br /> return 0,output<br /> return 1,output<br /><br />''' see if word and ct are possibly the same substitution cipher <br />returns 0 in the first part of the result if they are not ''' <br />def canstart(ct,word):<br /> key = {}<br /> if len(word) > len(ct): return 0,key<br /> for i,char in enumerate(word):<br /> if ct[i] not in key: <br /> if char not in key.values(): <br /> key[ct[i]] = char<br /> else: return 0,key<br /> elif key[ct[i]] == char: pass<br /> else: return 0,key<br /> return 1,key<br /><br /># get a list of all the possible blank spot permutations<br />dp = []<br />a = combinations(range(10),2)<br />for i in a:<br /> b = combinations(range(20),2)<br /> for j in b:<br /> dotpos = list(i)<br /> dotpos.append(10+j[0])<br /> dotpos.append(10+j[1])<br /> dp.append(dotpos)<br /><br /># we have all possible blank spot positions, try decrypting with each and see if we can discard any<br />count = 0<br />texts = []<br />for d in dp:<br /> key = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')<br /> for pos in d: key.insert(pos,'.')<br /> <br /> bool,ptext = scdecrypt(ctext,key)<br /> if bool == 0: continue<br /> texts.append(ptext)<br /> count += 1<br /><br /># now we have a heap of valid decrypts, discard duplicates<br />lst = []<br />for i in range(len(texts)):<br /> cs = False<br /> for j in range(len(lst)):<br /> if canstart(lst[j],texts[i])[0]: cs = True<br /> if not cs: lst.append(texts[i])<br /><br /># print the unique possible decrypts, they must be solved as substitution ciphers<br />for i,j in enumerate(range(len(lst))):<br /> print lst[j]<br /></pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com1tag:blogger.com,1999:blog-1147426303635403896.post-35080706561526687572016-09-29T21:44:00.000-07:002019-02-23T02:17:06.768-08:00The CODZ Mob of the dead ADFGX cipher<p>The ADFGX cipher from CODZ has been unsolved for quite a while, even with many people trying to break it. I thought I would document some of my progress on this page. Other Discussions:</p> <p><ul><li><a href="https://www.reddit.com/r/codes/comments/5542it/looking_to_hire_or_work_with_a_cryptographer_to/">https://www.reddit.com/r/codes/comments/5542it/looking_to_hire_or_work_with_a_cryptographer_to/</a></li><li><a href="http://www.callofdutyzombies.com/topic/154218-adfgx-cipher/">http://www.callofdutyzombies.com/topic/154218-adfgx-cipher/</a></li><li><a href="https://www.reddit.com/r/CODZombies/comments/4vxy7f/adfgx_key_word_mega_thread/">https://www.reddit.com/r/CODZombies/comments/4vxy7f/adfgx_key_word_mega_thread/</a></li><li><a href="https://www.reddit.com/r/CODZombies/comments/3wfhht/adfgx_cipher_still_unsolved_and_treyarch_employee/">https://www.reddit.com/r/CODZombies/comments/3wfhht/adfgx_cipher_still_unsolved_and_treyarch_employee/</a></li><li><a href="https://www.reddit.com/r/CODZombies/comments/3d94uz/mob_of_the_dead_cipher/">https://www.reddit.com/r/CODZombies/comments/3d94uz/mob_of_the_dead_cipher/</a></li><li><a href="http://www.callofdutyzombies.com/topic/182194-mob-of-the-dead-loading-screen-code/">http://www.callofdutyzombies.com/topic/182194-mob-of-the-dead-loading-screen-code/</a></li><li><a href="http://s13.zetaboards.com/Crypto/topic/7047345/1/">http://s13.zetaboards.com/Crypto/topic/7047345/1/</a></li></ul></p><p>There are of course other discussions around, but I'm not sure how much useful information is in them. Apparently the developers of the game have said that this was one of the early ciphers they made, and it may be too difficult to break. But we will have a go anyway.</p> <h1>EDIT: later thoughts</h1><p>This paragraph has been added later, but I have decided that most of the analysis on this page is useless because I can't crack length 34 substitution ciphers using all the knowledge and code I have (which is mostly over <a href="http://practicalcryptography.com/cryptanalysis/stochastic-searching/cryptanalysis-simple-substitution-cipher/">here</a>). I have made a couple of my own length 34 substitution ciphers, and I can't get anywhere even close to cracking them (If it were e.g. 100 characters it would be totally possible). So I might think about how to solve that problem. But until then, there is little point in the analysis below, because even without the ADFGX keyword steps it is impossible to solve (for me at the moment).</p> <h1>Some Observations</h1><p>An <a href="http://practicalcryptography.com/ciphers/adfgx-cipher/">ADFGX</a> cipher with 68 characters means the plaintext has only 34 characters. Unfortunately this is close to the unicity distance for substitution ciphers, so it may be unbreakable from that perspective unless some extra information can be found. Most substitution cipher solvers can't break text this short as there are incorrect solutions that will score higher (using ngram scorers) than the true decryption. The ciphertext is as follows:</p> <pre>FFGXGD<br />GFFAGF<br />GGDDGF<br />FFXXFF<br />FDGFFG<br />FDGFFG<br />FDGGFF<br />FGFGAA<br />FXFXDX<br />XFXDGF<br />FAGGFF<br /> AF</pre> <p>There are two components to an <a href="http://practicalcryptography.com/ciphers/adfgx-cipher/#the-algorithm">ADFGX key</a>: a keyword e.g. 'GERMAN' and a substitution key for the <a href="http://practicalcryptography.com/ciphers/polybius-square-cipher/">Polybius square</a>. Much of the discussion on various forums is about finding the ADFGX keyword, however I think this is unneccessary. With a length 6 keyword there are only 6! = 720 possible keys, and only 48 possibilities if we assume the last two columns are the first 2 characters of the keyword (this is a reasonable assumption because they are dangling at the end like that).</p> <p>This means we can decrypt the cipher using each of the possible 48 keyword permutations (this gives us 48 Polybius square ciphers), then use an arbitrary Polybius square key e.g. "ABCDEFGHIKLMNOPQRSTUVWXYZ" to get a 48 ciphers which are just a substitution cipher. From here, ideally, it would just take a good substitution cipher solver applied to each of the 48 candidates to determine which can be broken resulting in English text. It turns out that, due to symmetry of the polybius square, there are actually only 24 possible substitution ciphers, which makes our job easier. Unfortunately 34 characters is just too short for most substitution cipher crackers, or alternatively one of the previous assumptions is wrong.</p> <p>Of course I am not the first to think of this, and many people have tried this exact procedure to no avail.</p> <h1>The Repeated Rows</h1><p>There are two identical rows in the ciphertext: <tt>FDGFFG FDGFFG</tt>. This means, regardless of the keyword, there will be 6 letters in the plaintext that consists of 3 letters repeated e.g. THETHE. How often does this sort of thing occur in English? Quite often actually. I have a corpus of around 50 million English sentences from news websites and wikipedia which can be used to determine the frequency of this sort of occurrence.<p> <p>It so happens that in every million english sentences, you would expect around 34,000 to have a repeated triple like this, consisting of around 1100 distinct sequences (i.e. many of the 34,000 are repeats of previously seen sequences). In total after 50 million sentences 6164 distinct 6-character sequences were found, occurring 1.89 million times in total. The most common are: THATHA (235167), ANDAND (176355), INGING (122567), THETHE (89908), REAREA (67514), NTONTO (42511), ETHETH (36691), SSESSE (31542), NDINDI (31259), ASSASS (29754), followed by many other rarer ones. </p> <p>To utilise this information, it may be possible to fix the letters in the substitution key so that the repeated 3-letter combo is e.g. fixed to THATHA. This might make it easier for an n-gram scorer to identify the rest of the letters if some are known. I haven't tried this yet.</p> <h1>A Dictionary attack on the Polybius bit</h1><p>Apparently past ciphers in the game have used keywords to generate keys for ciphers. This means we may be able to use a dictionary attack to generate polybius keys for decrypting the adfgx. This means we don't have to crack the substitution component, which would make things easier.</p> <p>After doing this, the best decryption I could get from the 48 keyword permutations was VOGANHEMOWIGHHEHHENHEKANZEWESINHAC with a 4-gram score of -166. So this means either the polybius key was not a dictionary word (my dictionary has about 500k words in it), the adfgx keword is not length 6, or something other assumption we have made is wrong.</p>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com2tag:blogger.com,1999:blog-1147426303635403896.post-13940976165240629992015-10-23T00:23:00.000-07:002019-02-23T02:17:06.943-08:00Installing RNNLIB on Ubuntu 14.04<p>I have previously looked at CURRENNT, but now I'll go through how to use RNNLIB, which is a different rnn library. RNNLIB doesn't use cuda, but it can do multi-dimensional rnns and also can use the CTC loss function which can be handy. Let's get right into it. Download RNNLIB from sourceforge <a href="http://sourceforge.net/projects/rnnl/">here</a>. </p> <p>You'll have to install the netcdf library:</p><pre>sudo apt-get install libnetcdf-dev</pre> <p>You should also install boost: <tt>sudo apt-get install libboost-all-dev</tt>. I have libboost1.55 installed already.</p> <p>There is an installation guide over <a href="">here</a> if you want some additional pointers. First run configure and make:</p><pre>configure<br />make</pre> <h2>Fixing the errors</h2><P>The first error I got was:</p><pre>Mdrnn.hpp:239:6: required from here<br />Helpers.hpp:311:41: error: ‘range_min_size’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]<br /> size_t size = range_min_size(r1, r2, r3);</pre> <p>To fix this one you need to put templates code above the code that uses them. Moving the 4 templates for range_min_size to the top of the file should fix all the issues in Helpers.hpp. I cut lines 532-547 and pasted the lines to line 163 of Helpers.hpp.</p> <p>Running make again I got: <tt>Container.hpp:157:24: error: ‘resize’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation</tt>. To fix this one change on line 157 of containers.hpp "resize" to "this->resize." </p> <p>That was all I needed to do to get this version installed.</p>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com7tag:blogger.com,1999:blog-1147426303635403896.post-67774951356973371372015-10-22T23:45:00.000-07:002019-02-23T02:17:07.117-08:00Running the CURRENNT rnn library<p>Previously I have made posts about <a href="http://www.cromulentrambling.com/2015/10/installing-currennt-on-ubuntu-1404.html">installing Currennt</a> and <a href="http://www.cromulentrambling.com/2015/10/running-currennt-and-writing-hdf5-files.html">writing data files for Currennt</a>. This post will be about actually running it assuming it is installed and the data files have been written.</p> <p>The first thing you'll need to specify is the config file, which I called <tt>config.cfg</tt>. Its contents look like this:</p><pre>max_epochs_no_best = 20<br />max_epochs = 100<br />learning_rate = 1e-5<br />network = network.jsn <br />train = true<br />train_file = full_train_currennt.nc<br />val_file = full_test_currennt.nc<br />weights_dist = normal<br />weights_normal_sigma = 0.1<br />weights_normal_mean = 0<br />stochastic = true<br />validate_every = 1<br />parallel_sequences = 20<br />input_noise_sigma = 0<br />shuffle_fractions = true<br />shuffle_sequences = false<br /></pre> <p>Many of the options are described on the <a href="http://sourceforge.net/p/currennt/wiki/Command%20line%20options/">currennt wiki</a>. The <tt>network.jsn</tt> is a file that I will describe down below and contains the network configuration e,g, number of layers, layer types etc. The files <tt>full_train_currennt.nc</tt> and <tt>full_test_currennt.nc</tt> were created using the <a href="http://www.cromulentrambling.com/2015/10/running-currennt-and-writing-hdf5-files.html">MATLAB script in a previous post</a>. If you set parallel sequences too high you'll run out of memory on your GPU, but higher means it will train faster. </p> <h2>The network configuration</h2> <p>Info on the network configuration can be found <a href="http://sourceforge.net/p/currennt/wiki/Neural%20network%20configuration/">over here</a>. The network I used looks like this:</p><pre>{<br /> "layers": [<br /> {<br /> "size": 20,<br /> "name": "input",<br /> "type": "input"<br /> },<br /> {<br /> "size": 200,<br /> "name": "level_1",<br /> "bias": 1.0,<br /> "type": "blstm"<br /> },<br /> {<br /> "size": 200,<br /> "name": "level_2",<br /> "bias": 1.0,<br /> "type": "blstm"<br /> },<br /> {<br /> "size": 4,<br /> "name": "output",<br /> "bias": 1.0,<br /> "type": "softmax"<br /> },<br /> {<br /> "size": 4,<br /> "name": "postoutput",<br /> "type": "multiclass_classification"<br /> }<br /> ]<br />}</pre><h2> actually running stuff</h2><p>Now to run things, just do:</p><pre>/path/to/currennt/build/current config.cfg</pre><p>and everything should work, just wait for it to finish training. This will output a file called <tt>trained_network.jsn</tt> which we will need during testing. For testing, you will need another HDF5 file just like the training one, except with the test sequences in it. For testing we have another config file which I called <tt>ff_config:</tt></p><pre><br />network = trained_network.jsn<br />cuda = 0<br />ff_output_format = csv<br />ff_output_file = test_results<br />ff_input_file = full_test_currennt.nc<br /></pre> <p>Notice that cuda is off, I found it ran faster in feed forward mode without cuda. To do testing run:</p><pre>/path/to/currennt/build/current ff_config.cfg</pre> <p>Now all you need to do is parse the output file <tt>test_results.csv</tt> if you need the individual predictions, it is in comma separated variable format so it is not hard to work out.</p>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com1tag:blogger.com,1999:blog-1147426303635403896.post-34653045228762048962015-10-22T21:40:00.000-07:002019-02-23T02:17:07.293-08:00Writing HDF5 files for Currennt<p>In a previous post I explained <a href="http://www.cromulentrambling.com/2015/10/installing-currennt-on-ubuntu-1404.html">how to install currennt</a>. This post will deal with writing the data files. The <a href="http://www.cromulentrambling.com/2015/10/running-currennt-rnn-library.html">next post</a> will be about actually running things. Currennt uses data written to HDF5 files, so whatever format your data is in you have to write it to a HDF5 file then use currennt to process it. This post will use matlab to generate the HDF5 files, but e.g. python could easily do it as well.</p> <p>Recurrent neural networks predict things about sequences, so for this example we'll be using protein sequences as the example. Predicting things like phonemes for speech files is exactly the same. For each amino acid in a protein, we'll be predicting the secondary structure, which can be C, E, H or X. The training information will be a L by 20 matrix, this means the protein is of length L and each amino acid has 20 features (we'll use PSSM features, of which there are 20, extracted using PSI-BLAST.). I will just be using 400 training proteins and 100 test proteins, it is often beneficial to try things out on small datasets because you get the results quicker and can figure out problems quicker.</p> <h2>HDF5 files</h2><p>Currennt needs the data written to HDF5 files, but it needs a heap of extra parameters written to the files as well as the data, this section will specify what you need to write to them, then I'll put some actual MATLAB code in the next section.</p> <p>The data we'll be using is protein data, there will be 400 training proteins, each of length L (the sequences can be anywhere from about 20 to 2000 in length). There are 20 features per amino acid, this means each protein is represented by a matrix of size 20 by L, and there are 400 matrices.</p> <p>These parameters are for classification, if you want something else like regression you may have to use different dimensions and variables e.g. numLabels only makes sense for classification. HDF5 files have dimensions and variables, and they need to be specified separately. The dimensions that currennt needs specified in the HDF5 files are as follows:</p><pre>numTimesteps: the total number of amino acids if you lined <br /> up all the proteins end to end<br />inputPattSize: the number of features per amino acid, in this case 20.<br />numSeqs: the number of proteins<br />numLabels: the number of classes we want to predict, in this case 4<br />maxLabelLength: the length of the longest string used to hold <br /> the class label name<br />maxTargStringLength: to be honest I'm not sure about this one, I <br /> just set it to be 5000 and things seem to work<br />maxSeqTagLength: this is another one I'm not sure about. I set <br /> it to 800 and it seems to work.<br /></pre> <p>Now we have specified the dimensions, now we want to specify the variables i.e. the actual data we will be using for features and labels. They are specified like so:</p> <pre><br />seqTags, size: maxSeqTagLength x numSeqs<br />numTargetClasses: size: 1<br />inputs, size: inputPattSize x numTimesteps<br />seqLengths, size: numSeqs<br />targetClasses, size: numTimesteps<br />labels, size: maxLabelLength x numLabels<br /></pre> <p>You then just have to write the relevant data to each of the variables and run currennt. Some MATLAB code that does all this stuff is provided below. Note that this matlab file will just create one file for e.g. training. If you want a separate dataset for e.g. testing (strongly recommended) then you will need to run the code a second time changing the file names and protein dataset.</p> <script src="https://gist.github.com/jameslyons/c203e0eba13c1ef7062c.js"></script>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com2tag:blogger.com,1999:blog-1147426303635403896.post-38056926808538531072015-10-20T23:01:00.000-07:002019-02-23T02:17:07.468-08:00Installing Currennt on ubuntu 14.04<p>This post is all about how to install Currennt on ubuntu 14.04. The next post will deal with <a href="http://www.cromulentrambling.com/2015/10/running-currennt-and-writing-hdf5-files.html">how to write the data files that Currennt needs</a> as well as <a href="http://www.cromulentrambling.com/2015/10/running-currennt-rnn-library.html">actually running currennt</a>. Currennt is a recurrent neural net library that is sped up using cuda. Installing it is not as simple as it could be.</p> <p>First step: grap the source code from here: <A href="http://sourceforge.net/projects/currennt/">currennt on sourceforge</a>. I am using current-0.2-rc1.zip.</p> <p>You'll first need to install cuda:</p><pre>sudo apt-get update<br />sudo apt-get install cuda</pre> <p>You'll also have to install the netcdf library:</p><pre>sudo apt-get install libnetcdf-dev</pre> <p>You should also install boost: <tt>sudo apt-get install libboost-all-dev</tt>. I have libboost1.55 installed already.</p> <p>Now you can start the installation. Extract the Currennt zip file to a directory and cd to that directory. Now, create a folder called build and cd into that, then run make.</p><pre>mkdir build<br />cd build<br />cmake ..<br />make</pre> <p>When I did this I got the error:<tt>nvcc fatal : Value 'compute_13' is not defined for option 'gpu-architecture'</tt>. This can be fixed by editing the CMakeLists.txt, change <tt>compute_13</tt> to <tt>compute_30</tt>, and rerun cmake and make.</p> <p>The next make got a little bit further, but I got the error: <tt>error: namespace "thrust" has no member "transform_reduce"</tt>. This one can be fixed by adding the line: <tt>#include <thrust/transform_reduce.h></tt>to the top of the file /currennt_lib/src/layers/Layer.hpp .</p> <p>That was all I had to do this time, previously I have had a heap of different propblems when installing currennt on older computers, but this time seemed relatively painless.</p>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com3tag:blogger.com,1999:blog-1147426303635403896.post-1377474296222440102015-10-05T23:07:00.000-07:002019-02-23T02:17:07.645-08:00Numerically Checking Neural Net Gradients<p>It is highly recommended by everyone that implements neural networks that you should numerically check your gradient calculations. This is because it is very easy to introduce small errors into the back-propagation equations that will make it seem like the network is working, but it doesn't do quite as well as it should. A simple numerical check can let you know that all your numbers are right.</p> <P>This tutorial will go through the process of how to numerically compute the derivatives for a simple 1 hidden layer neural network. The exact same process is used for checking the derivatives of e.g. LSTMs, which are much more complicated.</p> <p>The main thing to remember is the error measure you are using for your neural network. If you are using Mean Squared Error then the error \(E = \dfrac{1}{N}\sum^N_{n=1} \sum^K_{k=1} (y_k - x_k)^2 \) is what we want to minimise. In the previous equation, \(y_k\) is the desired label and \(x_k\) is the network output. \(N\) is the minibatch size, if you are just putting in a single input at a time then \(N\) is one. This single number \(E\) tells us how good our network is at predicting the output, the lower the value of \(E\) the better. Our aim is to adjust the weights so that \(E\) gets smaller. </p> <P>For a neural net with 1 hidden layer we have 2 sets of weights and 2 sets of biases. The output is computed as follows:</p>\( a = \sigma (i*W_1 + b_1) \) <br/>\( x = \sigma ( a*W_2 + b_2) \) <p>\(i\) is our input vector, \(a\) is the hidden layer activation, and \(x\) is the network output. <p>The way to numerically check the gradient is to pick one of the weights e.g. element (1,1) of \(W_1\), and to add and subtract a small number to/from it e.g. 0.0001. This way we have \(W_1^+\) and \(W_1^-\) (note that only element (1,1) is changed, all the other weights stay the same for now). We now have to compute \(x^+\) and \(x^-\) using the new slightly modified weights. Then we can compute \(E^+\) and \(E^-\) using \(x^+\) and \(x^-\). The gradient of E is then \((E^+ - E^-)/(2*0.0001)\). </p> <p> Now that we have the derivative of \(E\) with respect to weight (1,1) of \(W_1\), we have to do it for all the other weights as well. This follows the exact same procedure, we just add/subtract a small number from a different weight. The final matrix of derivatives should exactly match the gradient as calculated by back-propagation. This python code: <a href="https://github.com/jameslyons/python-neural-nets/blob/master/nn_2layer.py">nn_2layer.py</a> implements both back-propagation and the numerical check for a simple single hidden layer nn.</p> James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com1tag:blogger.com,1999:blog-1147426303635403896.post-58549176554646718282015-08-20T00:55:00.000-07:002019-02-23T02:17:07.818-08:00Caffe Tutorial - Part 1 - basic feedforward nets<p><a href="http://caffe.berkeleyvision.org/">Caffe</a> is a deep learning framework often used for image recognition projects, but I've found that the tutorials aren't that great, or at least I have trouble finding solutions to the problems I have. So I've decided to write a sequence of tutorials to lay out information that I can revise in the future and that others might use as well.</p> <p>The first tutorial (this one) will be a basic single hidden layer feed-forward neural net trained on the fisher iris dataset. I'll use MATLAB to write the feature files that caffe uses, for this tutorial we'll use layer type HDF5Data because HDF5 files are easy to write.</p> <p>Also try <a href="http://caffe.berkeleyvision.org/gathered/examples/mnist.html">this lenet tutorial</a> as it covers some things that this page does not.</p> <h2>Writing the Data Files</h2><p>This is a short matlab script for writing the train and test files, each file must contain both the features and labels. Other languages can be used e.g. python has a library for writing hdf5 files as well. Note that <tt>fisheriris</tt> is built in to matlab. When we load it the features are 150x4, when it gets written to the hdf5 files it has to be transposed (4x150). The labels come as a cell array of strings, we need to convert them to integers starting at 0. We randomly choose 100 samples to be in the train set and 50 to be in the test set.</p> <p>We use innerProduct layers as these are fully connected feedforward layers, and they are initalised using the 'xavier' algorithm which is explained <a href="http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization">in this other blog post</a>.</p> <script src="https://gist.github.com/jameslyons/503eb279435735a51a79.js"></script> <p>We now have 2 files: iris_train.hdf5 containing the train features and labels, and iris_test.hdf5 containing the test features and labels.</p> <h2>Specifying the Network</h2> <p>At this point I recommend reading a bit about <a href="http://caffe.berkeleyvision.org/tutorial/net_layer_blob.html">layers, nets and blobs</a>, you can also look at <a href="http://caffe.berkeleyvision.org/tutorial/layers.html">the types of layers available</a>. Our network will be a single hidden layer with 20 nodes. The iris dataset has 3 classes, so our output layer will have 3 nodes. We use the rectified linear unit (ReLU) activation, and softmax outputs.</p> <p>We are doing training and testing in this network, this is why there are 2 data layers, one with <tt>phase: TRAIN</tt> and one with <tt>phase: TEST</tt>. One annoying bit is that we can't directly specify the location of the HDF5 file, we have to specify the location of a txt file that contains as its first line the location of the HDF% file. So <tt>"./files/train_file_location.txt"</tt> that appears in the network specification actually has as its only contents:</p><pre>./files/iris_train.hdf5</pre><p>which is the location the matlab script saved the hdf5 files to.</p> <p>At the end of this script, we have <tt>SoftmaxWithLoss</tt> as well as <tt>softmax</tt> layers. This is because we want to see the accuracy, and the loss can't be used with the accuracy. We pass the <tt>softmax</tt> outputs to the <tt>Accuracy</tt> layer so we can see how well it performs. Note that accuracy is only computed during <tt>phase: TEST</tt>.</p> <script src="https://gist.github.com/jameslyons/5715c7f3c39730dbf5a8.js"></script> <h2>Specifying the Solver</h2> <p>The solver is pretty standard.</p> <script src="https://gist.github.com/jameslyons/2e4a7d1caec671e54ee5.js"></script> <h2>Running Everything</h2><p>Now you probably want to run it, just type the following into a terminal while being in the correct directory:</p> <pre><br />/path/to/install/caffe-master/build/tools/caffe train --solver=./iris_caffe_solver.prototxt<br /></pre>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com5tag:blogger.com,1999:blog-1147426303635403896.post-23756101237306248272015-02-03T20:58:00.000-08:002019-02-23T02:17:07.992-08:00A look at the second Feynman cipher - fractionated morse [part 4]<p>The Feynman ciphers are a set of 3 ciphers given to Richard Feynman, the first of which has been solved, but the second 2 remain unsolved. This is part 4 of a discussion about the second cipher.</p> <p>I have talked about this cipher three times previously in <a href="http://www.cromulentrambling.com/2014/01/a-look-at-second-feynman-cipher.html">Part 1</a>, <a href="http://www.cromulentrambling.com/2014/01/a-look-at-second-feyman-cipher-part-2.html">Part 2</a> and <A href="http://www.cromulentrambling.com/2014/12/a-look-at-second-feyman-cipher-part-3.html">Part 3</a>. In part 1 I was playing around with vigenere type ciphers, in part 2 I decided to step back a little bit a figure out what the ciphertext is not i.e. identify the cipher algorithms it can't be. Part 3 looked at the 3x3 Hill cipher and some of its variants.</p> <p>So far I have been working my way through the cipher types I think it may be. In part 2 I identified fractionated morse as a candidate cipher. It turns out breaking fractionated morse is not too difficult, the procedure is similar to breaking standard substitution ciphers. I will start with a short description of Fractionated morse, then we will look at how to break it.</p> <h1>Description of Fractionated Morse</h1><p>There are many descriptions of the algorithm around, the <a href="http://www.cryptogram.org/cdb/aca.info/aca.and.you/chap08.html#FRACTI">ACA has a good description</a>, I will reproduce a part of <A href="http://www.cs.ucf.edu/~gworley/files/pollux_and_frac_morse.txt">G. worley's</a> description. It assumes you know <a href="http://en.wikipedia.org/wiki/Morse_code">morse code</a>.</p><pre style="background-color: #DDDDDD;"><br />If Alice wanted to send Bob the<br />message "I love you." she would tap over the telegraph:<br /><br />Pt: I love you.<br />Et: ..xx.-..x---x...-x.xx-.--x---x..-xx.-.-.-<br /><br />After encoding the plaintext in Morse Code, the encoded text is broken into<br />n-graphs, usually trigraphs because there are 26 Morse Code trigraphs (you never<br />have three dividers in a row, so you don't count that one). Here's a sample<br />trigraph mapping (using a keyword):<br /><br />Keyword: LOOKINGGLASS<br /><br />L O K I N G A S B C D E F H J M P Q R T U V W X Y Z<br />. . . . . . . . . - - - - - - - - - x x x x x x x x<br />. . . - - - x x x . . . - - - x x x . . . - - - x x<br />. - x . - x . - x . - x . - x . - x . - x . - x . -<br /><br />So, breaking our message into trigraphs we get:<br /><br /> . x . - x . . - - - . x - .<br /> . . . - . - x . x - . x . -<br /> x - x - . x x - - x - . - x<br /> ~~~~~~~~~~~~~~~~~~~~~~~~~~~<br />Ct: K T K H R G B D P J O Y D G<br /><br />Cracking this is much harder because the dividers are obfuscated. The only way<br />to attack this is to do a statistical analysis for common Morse Code trigraphs.<br /></pre> <p>The cipher assumes 'x' is placed between each letter and 'xx' is placed between each word.</p> <h1>Cryptanalysis of Fractionated Morse</h1><p>Lets look at the following Fractionated Morse table:</p><pre><br />A B C D E F G H I J K L M N O P Q R S T U V W X Y Z<br />. . . . . . . . . - - - - - - - - - x x x x x x x x<br />. . . - - - x x x . . . - - - x x x . . . - - - x x<br />. - x . - x . - x . - x . - x . - x . - x . - x . -<br /></pre><p>If you look at the table above, you can see that certain ciphertext letter combinations are not possible e.g. RS, RT, ... RZ cannot occur because we can't have xxx in the plaintext Morse. Also IS, IT, ... IZ can't happen for the same reason. Other impossible combinations: CY, CZ, FY, FZ, OY, OZ etc. We also can't have e.g. ABD because there are no Morse letters that go '.....-.-.'. You should be able to come up with many other impossible combinations. We can also use frequency information, since THE occurs very often in English, we expect 'xx-x....x.xx'='ZSCI' to be quite common in the ciphertext (with this particular key, if you look at the counts provided down below, we see ZSCI is actually the most common quadgram). This means certain keys will be very unlikely if they lead to impossible transitions, others more likely if they lead to likely transitions.</p> <p>The trick to breaking Fractionated Morse is that finding the key can be done completely independently of the English->Morse translation. Knowing the statistics of a particular Fractionated Morse key means we can just find the key that translates to it, without ever checking to see if the putative plain text even decodes properly. The Idea is to start with a random key, then continuously swap characters in the key trying to make the ciphertext look like it has been enciphered with the key "ABCDEFGHIJKLMNOPQRSTUVWXYZ". Once this is done, we can just decrypt it. To do this we need quadgram statistics for a lot of English text that has been enciphered with the key "ABCDEFGHIJKLMNOPQRSTUVWXYZ". I did this for 40 million sentences, or around 1 billion characters. The results are linked below.</p> <p>Code for breaking Fractionated Morse ciphers is <a href="https://github.com/jameslyons/python_cryptanalysis/blob/master/break_fracmorse.py">here</a>, with the required statistics <a href="https://raw.githubusercontent.com/jameslyons/python_cryptanalysis/master/fmorse_quadgrams.txt">here</a>. I have run the code on many possible routes through the second Feynman cipher, and nothing resulted in any good looking plaintexts. I would expect it to work, since breaking length 100 Fractionated Morse ciphertexts is not too difficult using the code above, and F2 is 261 characters. This leads me to believe that F2 is not a Fractionated Morse cipher.</p> <p>From here I'll try to write an efficient running key solver for vigenere, beaufort, variant and porta rules and see how that goes.</p> James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com0tag:blogger.com,1999:blog-1147426303635403896.post-43249328272549241212014-12-18T01:20:00.000-08:002019-02-23T02:17:08.184-08:00A look at the second Feynman cipher [part 3]<p>The Feynman ciphers are a set of 3 ciphers given to Richard Feynman, the first of which has been solved, but the second 2 remain unsolved. This is part 3 of a discussion about the second cipher.</p> <p>I have talked about this cipher twice previously in <a href="http://www.cromulentrambling.com/2014/01/a-look-at-second-feynman-cipher.html">Part 1</a> and <a href="http://www.cromulentrambling.com/2014/01/a-look-at-second-feyman-cipher-part-2.html">Part 2</a>. In part 1 I was playing around with vigenere type ciphers, in part 2 I decided to step back a little bit a figure out what the ciphertext is not i.e. identify the cipher algorithms it can't be. One of the ciphers I mentioned there was the Hill cipher, but I only considered the 2 by 2 Hill cipher. It can't be a 2x2 because the ciphertext has an odd number of characters. However, it could be a 3 by 3 Hill cipher because the cipher length is divisible by 3. Fortunately cracking 3x3 Hill ciphers is fairly easy, so we'll consider various types of hill cipher as well as routes through the ciphertext (in case the hill cipher is coupled with a route cipher of some sort).</p> <p>One of the reasons the Hill cipher is a candidate is because it produces ciphertexts with 26 characters, something many ciphers do not (e.g. foursquare=25, playfair=25, trifid=27 etc. etc.)</p> <h1>The Basic Hill Cipher</h1><p>In the equation below, the 3x3 matrix is called the key, the matrix \(\left[0,19,19\right]\) represents the characters 'ATT' (A=0,B=1,...,Z=25). The equation shows the enciphering process, see <a href="http://www.practicalcryptography.com/ciphers/hill-cipher/">here for a larger tutorial</a>.</p>\( \left[ \begin{array}[pos]{ccc} 2 & 4 & 5\\ 9 & 2 & 1\\ 3 & 17 & 7\\ \end{array} \right] \left[ \begin{array}[pos]{c} 0\\ 19\\ 19\\ \end{array} \right] = \left[ \begin{array}[pos]{c} 171\\ 57\\ 456\\ \end{array} \right] \pmod{26} = \left[ \begin{array}[pos]{c} 15\\ 5\\ 14\\ \end{array} \right] =\) 'PFO' <p>The major thing to notice for breaking hill ciphers is that the first ciphertext letter ('P'=15) depends only on the top row of the key matrix, changing the other elements of the key wont affect the fact that 'P' comes out. To make use of this fact we search through all 26^3 possible top rows and look at every third letter in the candidate decryption. We rank the key row by how well the monogram frequencies of every third deciphered character match English. This means we don't have to search through all 26^9 possible keys, just 26^3 3 times, which is much easier. In <a href="http://www.cromulentrambling.com/2014/01/a-look-at-second-feynman-cipher.html">Part 1</a> I looked at 8 routes, this time I'll do 40 routes: <a href="https://gist.github.com/jameslyons/2dc53b49ec879b10da7b">f2routes.txt</a>.</p> <p>The code to break 3 by 3 hill ciphers is provided here: <a href="https://github.com/jameslyons/python_cryptanalysis/blob/master/break_hill3.py">break_hill3.py</a>. This code was run on each of the routes keeping the top 15 candidates from each route. Unfortunately nothing came of it, all the decryptions were gibberish. On to the next variant!</p> <h1>Extended Hill Cipher</h1><p>We can add an extra vector to the Hill enciphering process to make it a little more complicated, increasing the number of key elements from 9 to 12:</p>\( \left[ \begin{array}[pos]{ccc} 2 & 4 & 5\\ 9 & 2 & 1\\ 3 & 17 & 7\\ \end{array} \right] \left[ \begin{array}[pos]{c} 0\\ 19\\ 19\\ \end{array} \right] + \left[ \begin{array}[pos]{c} 15\\ 5\\ 14\\ \end{array} \right] = \left[ \begin{array}[pos]{c} 186\\ 62\\ 470\\ \end{array} \right] \pmod{26} = \left[ \begin{array}[pos]{c} 4\\ 10\\ 2\\ \end{array} \right] =\) 'EKC' <p>Note the extra vector \(\left[15,5,14\right]\) added to the first step. In this case the key consists of the square matrix \([2,4,5;9,2,1;3,17,7]\) as well as the vector \([15;5;14]\) (12 numbers to specify the key). This doesn't complicate things too much, we now have to search through 4 numbers 0-25 instead of three. The code for breaking these is here: <A HREF="https://github.com/jameslyons/python_cryptanalysis/blob/master/break_xhill3.py">break_xhill3.py</a>. We have to check a few more possibilities, which makes the program a little bit slower, but not too bad.</p> <p>This code was run on each of the routes keeping the top 15 candidates from each route. Unfortunately nothing came of it, all the decryptions were gibberish as well. On to the next variant!</p> <h1>Hill cipher with arbitrary substitutions</h1><p>The next more complicated hill cipher is specified like the standard hill cipher, except the numbers don't necessarily map to the letters in order. For example the standard Hill cipher uses A=0,B=1,...,Z=25. Now we relax that so that e.g. A=7,B=12,...Z=1. This is equivalent to enciphering the message with a substitution cipher prior to applying the hill cipher. This complicates everything a little more.</p> <p>To break this cipher we can't rank key rows using monogram frequencies because of the additional layer of encipherment, we have to use <a href="http://www.practicalcryptography.com/cryptanalysis/text-characterisation/index-coincidence/">Index of Coincidence</a> to rank them instead (I.C. is invariant to substitutions). Otherwise the cracking process is similar, except we have to keep more candidates because I.C. is not as effective as monogram frequencies at discriminating good keys from bad.</p> <p>Cracking this variant was much slower than planned, it meant I was unable to run it on all the different routes, I only ran it on the original cipher. My program output the top few hundred most likely Hill cipher keys, then I have to run a substitution cipher cracker on all the candidates. No success yet.</p> James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com6tag:blogger.com,1999:blog-1147426303635403896.post-1225056959384026722014-12-02T00:26:00.000-08:002019-02-23T02:17:08.358-08:00Converting PDB files to FASTA format with python<p>This is a simple script to convert PDB files (which contain information about protein structure) into FASTA format, which is just a list of the amino acids. This code handles one PDB file at a time, to do multiple ones put it in a loop e.g. for bash:</p><pre>for i in `ls *.pdb`; do python pdb2fasta.py $i > $i.fasta; done</pre> <script src="https://gist.github.com/jameslyons/00cad4f1ab6a49c2b3df.js"></script> <p>In operation, it just looks for lines with 'ATOM', then looks at the residue number to see if has changed. When the residue numbers change it outputs the current amino acid label. Feel free to use/modify this code for any purpose.</p>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com7tag:blogger.com,1999:blog-1147426303635403896.post-27414207733259774652014-11-30T23:03:00.000-08:002019-02-23T02:17:08.531-08:00Framing and reconstructing speech signals<p>This post will deal with framing and overlap-add resynthesis. This can also be known as AMS (Analysis-Modification-Synthesis) when doing things like speech enhancement. First of all, what is the point of framing? An audio signal is constantly changing, so we assume that on short time scales the audio signal doesn't change much (when we say it doesn't change, we mean statistically i.e. statistically stationary, obviously the samples are constantly changing on even short time scales). This is why we frame the signal into 20-40ms frames. If the frame is much shorter we don't have enough samples to get a reliable spectral estimate, if it is longer the signal changes too much throughout the frame, and the FFT will end up smearing the contents of the frame.</p> <p>What is involved: frame the speech signal into short, overlapping frames. Typically frames are taken to be about 20ms long. For a 16kHz sampled audio file, this corresponds to 0.020s * 16,000 samples/s = 400 samples in length. We then use an overlap of 50%, or about 200 samples. This means the first frame starts at sample 0, the second starts at sample 200, the third at 400 etc.</p> <p>MATLAB code for framing: <a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/frame_sig.m">frame_sig.m</a> and unframing:<a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/deframe_sig.m">deframe_sig.m</a>.</p> <p>Framing the signal is pretty simple, the only thing to note is that the signal is padded with zeros so that it makes an integer number of frames. A window function is also applied. The overlap-add process has a few things that make it tricky, as well as adding up the overlapped signal we also add up the window correction which is basically what our signal would be if every frame was just the window. This is important since the windowed frames won't necessarily add up to get the original signal back. You can see this by plotting the window_correction variable in deframe_sig.m and thinking about how it gets like that. We also have to add <tt>eps</tt> (this is just a very small constant i.e. <tt>eps</tt>ilon) to the window correction just in case it is ever zero, this prevents <tt>inf</tt>s appearing in our reconstructed signal.</tt></p> <p>To see how the AMS framework can be used for spectral subtraction, have a look at <a href="">this spectral subtraction tutorial</a>. The framing and deframing routines on this page can be used to implement the enhancement routines there. Some example code for the tutorial above would look something like this:</p> <script src="https://gist.github.com/jameslyons/554325efb2c05da15b31.js"></script>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com9tag:blogger.com,1999:blog-1147426303635403896.post-73121530734037511722014-11-30T04:20:00.000-08:002019-02-23T02:17:08.704-08:00Processing lists of files in MATLAB<p>It is often necessary to apply a certain operation to many files all at once e.g. adding noise to a database, creating feature files from a list of audio files, enhancing files, etc. This post will explain the method I use for processing lists of files, because I tend to use it quite often.</p> <p>The main points are to use the unix <tt>find</tt> command to generate the file list (of course this is not necessary, any method for generating file lists is fine), then use matlab's <tt>fgetl</tt> to read each file name and process it. The code for adding noise to a list of NIST files is below, you can find the <a href="http://www.cromulentrambling.com/2014/11/reading-and-writing-nist-raw-and-wav.html">read_NIST_file script here</a>, and the <a href="http://www.cromulentrambling.com/2014/11/adding-noise-of-certain-snr-to-audio.html">addnoise script can be found here</a>.</p> <script src="https://gist.github.com/jameslyons/e146d9178d9e7e1fe3a3.js"></script> <p>Note that we check for end of file by checking <tt>ischar(wavname)</tt>, once the end of file is reached, <tt>wavname</tt> (a line from the file list) will no longer contain character data.James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com0tag:blogger.com,1999:blog-1147426303635403896.post-17728267278123501472014-11-29T01:05:00.000-08:002019-02-23T02:17:08.879-08:00Adding Noise of a certain SNR to audio files<p>A common task when dealing with audio is to add noise to files, e.g. if you want to test the performance of a speech recognition system in the presence of noise. This is based on computing the Signal to Noise Ratio (SNR) of the speech vs. noise. To compute the energy in a speech file, just add up the sum of squares of all the samples:</p>\[ E_{Speech} = \sum_{i=0}^N s(i)^2 \] <p>where \(s(i)\) is the vector of speech samples you read with a function like <tt>wavread</tt>. We will also need some noise, which we can generate using a function like <tt>randn(N,1)</tt> where N is the length of the speech signal. Alternatively we can use a dedicated noise file containing e.g. babble noise and just truncate it at the correct length. When using a noise file, it is important to randomise the start position for each file so you don't always have e.g. a door banging or a guy laughing at the same point in every file. This can mess with classifiers. Anyway, now compute the energy of the noise:</p>\[ E_{Noise} = \sum_{i=0}^N n(i)^2 \] <p>where \(n(i)\) is the noise vector. To compute the SNR of a speech file compared to noise:</p> \[ SNR = 10\log_{10} \left( \dfrac{E_{Speech}}{E_{Noise}} \right) \] <p>If you don't have the pure noise, you just have a corrupted version of the original, you compute the noise as: \(n(i) = x(i) - s(i)\), where \(x(i)\) is the corrupted signal.</p> <p>Now we want to scale the noise by a certain amount and add it to the original speech signal so that the SNR is correct. This assumes we have a target SNR, for the sake of this post assume we want the noise to be at 20dB SNR. We now use the following formula (This formula assumes the noise signal has unit variance, you may need to normalise it before using this formula):</p>\[ K = \sqrt{ \dfrac{E_{Speech}}{10^{20\text{dB}/10}} } \] <p>Once we have done this we need to create \(\hat{n}(i) = K\times n(i)\) for our noise samples. Our noisy speech file is calculated as:</p> \[ x(i) = s(i) + \hat{n}(i) \] <p>for \(i = 1 \ldots N\). You should be able to compute the SNR between the new noisy signal and the original signal and it should come out to be very close to 20dB (it could be 19.9 or 20.1 or something). Function for computing snr in matlab: <a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/snr.m">snr.m</a>, function for adding white noise to files: <a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/addnoise.m">addnoise.m</a>. James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com1tag:blogger.com,1999:blog-1147426303635403896.post-6088547502280869302014-11-27T20:47:00.000-08:002019-02-23T02:17:09.052-08:00Reading and writing NIST, RAW and WAV files in MATLAB<p>To open files (NIST or WAV) when you are not sure which it could be, use <a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/audioread.m">audioread.m</a>, which depends on the <tt>read_X_file.m</tt> explained below.</p> <h1>NIST files</h1> <p>NIST files are very common when doing speech processing, for example the TIMIT and RM1 speech databases are in NIST format. The NIST format consists of 1024 bytes at the start of the file consisting of ASCII text, after this header the speech data follows. For TIMIT and RM1 the speech data is 16-bit signed integers which can be read directly from the file and treated as a signal.</p> <p>To get the sampling frequency, you'll have to parse the text that forms the header. In any case, the functions to read and write NIST files is supplied here: <a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/read_NIST_file.m">read_NIST_file.m</a> and <a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/write_NIST_file.m">write_NIST_file.m</a>.</p> <p>Note that writing a NIST file using the scripts above requires a header. The easiest way to get a header is to read another NIST file. So if you want to modify a NIST file then you would use something like:</p><pre>[signal, fs, header] = read_NIST_file('/path/to/file.wav');<br />write_NIST_file('/path/to/new/file.wav',fs,header);<br /></pre><p>This reuses the header from previously and works fine. If you want to create completely new files i.e. there is no header to copy, I recommend not creating NIST formatted files, create wav files instead as they are far better supported by everything.</p> <p>An example header from the timit file <tt>timit/test/dr3/mrtk0/sx283.wav</tt>. Note the magic numbers <tt>NIST_1A</tt> as the first 7 bytes of the file. The actual audio data starts at byte 1024, the rest of the space between the end of the header text and byte 1024 is just newline characters.</p> <pre>NIST_1A<br /> 1024<br />database_id -s5 TIMIT<br />database_version -s3 1.0<br />utterance_id -s10 rtk0_sx283<br />channel_count -i 1<br />sample_count -i 50791<br />sample_rate -i 16000<br />sample_min -i -2780<br />sample_max -i 4675<br />sample_n_bytes -i 2<br />sample_byte_format -s2 01<br />sample_sig_bits -i 16<br />end_head<br /></pre> <h1>RAW files</h1> <p>RAW files are just pure audio data without a header. This can make it a little difficult to figure out what is actually in them, often you may just have to try things until you get meaningful data coming out. Common settings would be 16-bit signed integer samples. You'll also have to take care of endianness, if the files were written on a windows machine all the integers will be byte-swapped and you'll have to swap them back.</p> <a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/read_RAW_file.m">read_RAW_file.m</a> and <a href="https://raw.githubusercontent.com/jameslyons/spl_featgen/master/write_RAW_file.m">write_RAW_file.m</a>. <h1>WAV files</h1><p>wav files are supported natively by matlab, so you can just use matlabs <tt>wavread</tt> and <tt>wavwrite</tt> functions.</p>James lyonshttp://www.blogger.com/profile/06070465204350279441noreply@blogger.com0