Friday 31 January 2014

Fast Calculation of Autocorrelation Coefficients in Julia

Autocorrelation is one of those things that gets used a lot in signal processing which has a very simple formula, but which is easy to speed up. In essence, autocorrelation is a measure of how similar a signal is to itself. Periodic signals are exactly those signals that are self-similar, so autocorrelation is good for locationg periodicities. The basic formula for autocorrelation of a signal \(x(i)\) is:

\[ R_{xx}(j) = \sum_{i=0}^N x(i)x(i-j) \]

In the above equation, \( R_{xx}(j) \) is the notation for the \(j^{th}\) autocorrelation coefficient of signal \(x\), where \(x\) is of length \(N\) samples. Note that the above equation is \(O(N^2)\) in complexity if we want all the autocorrelation coefficients (of which there are around \( 2N - 1\) ). To speed it up, we are going to use the Wiener–Khinchin theorem, which states that the power spectrum of a signal and the autocorrelation sequence of a signal are related to each other via the Fourier Transform (FFT). This means:

\[ S_{xx} = FFT(R_{xx})\]


\[ R_{xx} = IFFT(S_{xx}) = IFFT( |FFT(x)|^2 ) \]

where \(S_{xx}\) is the power spectrum of our signal \(x\). Note that the complexity of the FFT is \(O(n\log n)\), so this is a significant speedup over the original equation. This means our code for fast autocorrelation is simply:

Alternatively, Julia provides a xcorr function for computing cross correlation, to produce the same output as the function above, you would need: xcorr(x,x)[length(x):], this is the preferred way of achieving it.

Tuesday 28 January 2014

A quick look at the 3rd Feynman cipher

The Feynman ciphers are a set of 3 ciphertexts apparently given to Richard Feynman at Los Alamos, only the first of which has been solved. This post will be a quick look at some of the statistics of the 3rd cipher. The Cipher itself is:


It is 231 characters in length (which has factors 3, 7 and 11), and has an I.C. of 0.0429, which means it is unlikely to be a transposition or substitution cipher. If we look at the monogram frequencies, we see all 26 letters appear, which means it can't be a cipher based on a 5 by 5 keysquare. This cipher actually looks statistically very close to the previous (second feynman cipher) which I have talked about previously (1, 2).

Since this one is so similar to the second Feynman cipher, much of the analysis here also applies to this cipher. Using this cipher identifier, the top candidates are the following ciphers:

RunningKey 9
Beaufort 9
FracMorse 9
period 7 Vigenere 9
Quagmire3 10
Porta 10
Vigenere 10
Quagmire4 10
Gronsfeld 11
Quagmire2 11
Nicodemus 11
Vigslidefair 13

These are almost all Vigenere-type ciphers, I will try and run some cracking tools on them to see what happens.

Saturday 25 January 2014

A look at the second Feynman cipher [part 2]

The Feynman ciphers are a set of 3 ciphers given to Richard Feynman, the first of which has been solved, but the second 2 remain unsolved. This is part 2 of a discussion about the second cipher, which I refer to as F2.

In the previous part, I was playing around with vigenere type ciphers, in this part I have decided to step back a little bit a figure out what the ciphertext is not i.e. identify the cipher algorithms it can't be. To do this we need a comprehensive list of cipher algorithms, thankfully there are lists around.

Using the list of ciphers above plus a few more from here, I have created the following list of ciphers, ruling out the ones it can't possibly be:





Other: GRANDPRE (ct is numbers), GRILLE (ct length must be square number), GROMARK (ct starts with 5 digits), PERIODIC GROMARK (ct starts with digits), HEADLINES (ct has spaces), KEY PHRASE (word divisions retained), MONOME-DINOME (ct is numbers), MORBIT (ct is numbers), NIHILIST SUBSTITUTION (ct is numbers), POLLUX (ct is numbers), RAGBABY (word divisions are kept, 24 chars), TRIDIGITAL (ct is numbers), ADFGX/ADFGVX (only 5/6 characters used), POLYBIUS SQUARE (only 5/6 characters used), HILL (2*2)(requires even number of characters)

The above ciphers can be ruled out. The transposition ciphers because the frequency distribution does not match english, the substitution ciphers because the IC is 0.045, which is too low for English. The F2 cipher also has 26 characters in total, so it can't be anything based on a 5 by 5 key square. Some ciphers have too many characters (more than 26) so they can also be ruled out. Others can be ruled out because the ciphertext (ct) they produce consists of numbers.

The Possibilities


Assumptions: we all have to make assumptions, it is better to lay them out at the start. 1. The plaintext is in English, 2. There are no NULLs distributed through the ciphertext, 3. It is not a combination transposition + something else. 4. When making the ciphertext, no mistakes were made that prevent decryption.

I may have to weaken some of these assumptions later, but for now I think focusing on simpler stuff should be prioritised.

Narrowing Possibilities

In part 1, I tried solving it using Vigenere, Beaufort, Variant Beaufort, Gronsfeld, VigAutokey and Porta, and this was unsuccessful which makes me confident I can ignore these possibilities under the assumptions above.

I have used this site to get a list of candidate ciphers ranked by likelyhood. I have also removed the ones I know are not possible:

FracMorse 5
RunningKey 8
Quagmire3 8
Quagmire2 9
Quagmire4 9
Nicodemus 11
Vigslidefair 12
Portax 13
Swagman 22
Progkey beaufort 23
Progressivekey 24

This means I will concentrate my focus on FracMorse, Running Key and the Quagmires. I will update again in Part 3.

Thursday 23 January 2014

Removing Annotations from the Brown Corpus with Sed

The Brown corpus is a collection of texts on various topics amounting to around 1 million words, and each word is annotated with its Part Of Speech (POS). The following is an excerpt from it:

 The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at 
investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd
``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

You can find a copy of the Brown corpus on this page. To get a copy of the Brown corpus with tags removed, you can get it here.

Each word ends with a "/" character, followed by the part of speech, e.g. "The" is an article, which is abbreviated to "at". The full list of abbreviations can be seen here. I would like to strip off all the POS tags though, so I can use the text on its own. My main motivation is to get a wide selection of texts on several topics to use as a test set for computing the perplexity of language models. The Brown corpus seems to fit the bill.

Stripping the Tags

I am going to use sed to do all the work here, based on the observation that each word has a slash ("/") in the middle, followed by some more characters. To express this as a regular expression, we'll use the following:

[^ ]*/[^ ]*

This regular expression (RE) will match a sequence of characters (not including space), followed by a slash, followed by some other sequence of characters (not including space). Each time we match this, we want to keep only the first part of match i.e. the part before the slash. In sed this is done like so:

s_\([^ ]*\)/[^ ]*_\1_g

We have put the RE into sed's syntax inside the "s/a/b/g" construct, which tells sed to match "a" and substitute any instances matched with "b". Note that the forward slash characters are optional, any character can be used in that spot, in the example above "_" is used instead. The next sed feature we have used is bracketing "\( \)", which tells sed to record the part of the match that happened inside the brackets and put it in a variable called "\1". Note we used this variable in the second half of the example. This tells sed to match the whole RE, but replace it with only the first bit.

If we have a file called "temp.txt" with the following text in it:

The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at 
City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in
the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at
City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at
election/nn was/bedz conducted/vbn ./.

we'll want to strip out the tags using sed like this:

sed 's_\([^ ]*\)/[^ ]*_\1_g' temp.txt > newtemp.txt

and we are left with:

The jury further said in term-end presentments that the City Executive Committee 
, which had over-all charge of the election , `` deserves the praise and thanks of the
City of Atlanta '' for the manner in which the election was conducted .

Applying it to Multiple Files

The Brown corpus contains 500 files, and we want to apply our sed expression to all the files, to do this we'll use bash.

for fname in `ls c*`; do 
sed -i 's_\([^ ]*\)/[^ ]*_\1_g' $fname;

The script above loops over all the files in the brown corpus beginning with the letter c, then uses sed to apply the regular expression to each file. The '-i' flag tells sed to apply the regular expression to the file in-place, whereas usually it would print out the results.

To get a copy of the Brown corpus with tags removed, you can get it here.

Monday 20 January 2014

A look at the second Feynman cipher [part 1]

In 1987, someone posted a message to an internet cryptology list, saying that Caltech Physics Professor Richard Feynman was given three samples of code by a fellow scientist at Los Alamos. Only one of the three was ever solved (see

The second cipher is as follows (I'm going to refer to it as "F2", feynman cipher 2 throughout the post):


The first cipher was a plain route cipher, decrypting to some Chaucer. F2 has 261 characters, which factors into a rectangle 9*29, which means it could be a route cipher like the previous one, or a combination of route cipher and some other cipher. We will jump right into some analysis following this guide: Identifying unknown ciphers.

First of all, the monogram frequencies look nothing like English, so we can confidently rule out a plain transposition/route cipher. Next, the Index of coincidence is 0.045, which is too low to be a substitution cipher. From here I can reasonably confidently conclude it is not a transposition, monographic substitution or combination of these.

Another major signal is the presence of 26 characters; this means it can't be a cipher based on a 5 by 5 or 6 by 6 grid such as playfair, foursquare, phillips etc.

There are also all 26 characters appearing, so it is not ADFGX/ADFGVX or Polybius square. It is probably not trifid, since trifid uses 27 characters.

We can actually go further and rule out digraphic ciphers like bifid, hill, playfair and foursquare because the number of characters is odd. What does this leave? It leaves all the vigenere-type ciphers like gronsfeld, porta, autokey, beaufort etc.

Next steps

I will have to eventually make some assumptions and start cracking, but it is good to narrow the playing field a bit first. From here I am going to assume it is a Vigenere-type cipher with possibly some route-type transpositions going on. If we want to break route transpositions, we need to first identify the candidate routes we want to try. This will involve slicing up the 9*29 into various routes. Of course, the previous cipher was split into 75*5, so this one may actually be 3*87 instead of 9*29, but we can try that later.

If our block of text looks like:


Then we can read off letters in various directions e.g.









I am going to start with these 8 routes, and try to break them assuming they are any of the following: vigenere, vigenere-autokey, beaufort, variant beaufort, porta. I 'll also have to search all keys with lengths 2-20, and I'll be using the method described here.

Sources of Difficulty

While the code for the above tests is running, I thought I'd look at reasons why these experiments may not work. First is that it is not a cipher from the vigenere family, or perhaps it is e.g. a running key cipher, i.e. something I've not tried. Second is that the actual route is one I've not tried. The 8 routes from the 9*29 square are above, but i'll also have to try the same 8 routes from the 29*9, 3*87 and 87*3. There are many other routes I can't try. The last source of error I can see is mismatch between the plaintext language and the english model I am using to score fitness. The first cipher is Chaucer, which is old English and doesn't match up with modern english very well. There is a chance the cipher is in German or French or something too, in which case I probably won't break it.

The results of the above testing: I have run Vigenere, Vig-Autokey, Beaufort, Porta and variant Beaufort crackers on 64 possible routes for key lengths 3-20. I am fairly certain that if F2 was one of these ciphers it would have been broken. In the next post I'll broaden the search a bit.

Thursday 16 January 2014

Auto-generating an Index for Blogger posts

I am not a big fan of blog layouts, it is usually very difficult to find a certain page without using a search engine. To fix this, I wanted to create an Index page with headings and all the posts I have written. I don't want to do this manually, however, as that seems like a lot of ongoing work. There are solutions that use javascript to dynamically build an Index page from a feed, but this means that google won't index the page. So my solution was to go with a semi-manual process in which I use a python script to build the html Index page from the blog feed, then I just have to paste the html into the blog periodically.

The code below generates the following page: index.html

The headings in the page are generated from the labels for the page, e.g. this page is labeled as "Programming", so it appears under the Programming heading. Note that you will have to change the line with "" with your own blog url. To run the code, save the code into a file called e.g. "", then use python to run the code. This will output html, copy and paste it into a page on your blogger site.