Pages

Thursday 18 December 2014

A look at the second Feynman cipher [part 3]

The Feynman ciphers are a set of 3 ciphers given to Richard Feynman, the first of which has been solved, but the second 2 remain unsolved. This is part 3 of a discussion about the second cipher.

I have talked about this cipher twice previously in Part 1 and Part 2. In part 1 I was playing around with vigenere type ciphers, in part 2 I decided to step back a little bit a figure out what the ciphertext is not i.e. identify the cipher algorithms it can't be. One of the ciphers I mentioned there was the Hill cipher, but I only considered the 2 by 2 Hill cipher. It can't be a 2x2 because the ciphertext has an odd number of characters. However, it could be a 3 by 3 Hill cipher because the cipher length is divisible by 3. Fortunately cracking 3x3 Hill ciphers is fairly easy, so we'll consider various types of hill cipher as well as routes through the ciphertext (in case the hill cipher is coupled with a route cipher of some sort).

One of the reasons the Hill cipher is a candidate is because it produces ciphertexts with 26 characters, something many ciphers do not (e.g. foursquare=25, playfair=25, trifid=27 etc. etc.)

The Basic Hill Cipher

In the equation below, the 3x3 matrix is called the key, the matrix \(\left[0,19,19\right]\) represents the characters 'ATT' (A=0,B=1,...,Z=25). The equation shows the enciphering process, see here for a larger tutorial.

\( \left[ \begin{array}[pos]{ccc} 2 & 4 & 5\\ 9 & 2 & 1\\ 3 & 17 & 7\\ \end{array} \right] \left[ \begin{array}[pos]{c} 0\\ 19\\ 19\\ \end{array} \right] = \left[ \begin{array}[pos]{c} 171\\ 57\\ 456\\ \end{array} \right] \pmod{26} = \left[ \begin{array}[pos]{c} 15\\ 5\\ 14\\ \end{array} \right] =\) 'PFO'

The major thing to notice for breaking hill ciphers is that the first ciphertext letter ('P'=15) depends only on the top row of the key matrix, changing the other elements of the key wont affect the fact that 'P' comes out. To make use of this fact we search through all 26^3 possible top rows and look at every third letter in the candidate decryption. We rank the key row by how well the monogram frequencies of every third deciphered character match English. This means we don't have to search through all 26^9 possible keys, just 26^3 3 times, which is much easier. In Part 1 I looked at 8 routes, this time I'll do 40 routes: f2routes.txt.

The code to break 3 by 3 hill ciphers is provided here: break_hill3.py. This code was run on each of the routes keeping the top 15 candidates from each route. Unfortunately nothing came of it, all the decryptions were gibberish. On to the next variant!

Extended Hill Cipher

We can add an extra vector to the Hill enciphering process to make it a little more complicated, increasing the number of key elements from 9 to 12:

\( \left[ \begin{array}[pos]{ccc} 2 & 4 & 5\\ 9 & 2 & 1\\ 3 & 17 & 7\\ \end{array} \right] \left[ \begin{array}[pos]{c} 0\\ 19\\ 19\\ \end{array} \right] + \left[ \begin{array}[pos]{c} 15\\ 5\\ 14\\ \end{array} \right] = \left[ \begin{array}[pos]{c} 186\\ 62\\ 470\\ \end{array} \right] \pmod{26} = \left[ \begin{array}[pos]{c} 4\\ 10\\ 2\\ \end{array} \right] =\) 'EKC'

Note the extra vector \(\left[15,5,14\right]\) added to the first step. In this case the key consists of the square matrix \([2,4,5;9,2,1;3,17,7]\) as well as the vector \([15;5;14]\) (12 numbers to specify the key). This doesn't complicate things too much, we now have to search through 4 numbers 0-25 instead of three. The code for breaking these is here: break_xhill3.py. We have to check a few more possibilities, which makes the program a little bit slower, but not too bad.

This code was run on each of the routes keeping the top 15 candidates from each route. Unfortunately nothing came of it, all the decryptions were gibberish as well. On to the next variant!

Hill cipher with arbitrary substitutions

The next more complicated hill cipher is specified like the standard hill cipher, except the numbers don't necessarily map to the letters in order. For example the standard Hill cipher uses A=0,B=1,...,Z=25. Now we relax that so that e.g. A=7,B=12,...Z=1. This is equivalent to enciphering the message with a substitution cipher prior to applying the hill cipher. This complicates everything a little more.

To break this cipher we can't rank key rows using monogram frequencies because of the additional layer of encipherment, we have to use Index of Coincidence to rank them instead (I.C. is invariant to substitutions). Otherwise the cracking process is similar, except we have to keep more candidates because I.C. is not as effective as monogram frequencies at discriminating good keys from bad.

Cracking this variant was much slower than planned, it meant I was unable to run it on all the different routes, I only ran it on the original cipher. My program output the top few hundred most likely Hill cipher keys, then I have to run a substitution cipher cracker on all the candidates. No success yet.

Tuesday 2 December 2014

Converting PDB files to FASTA format with python

This is a simple script to convert PDB files (which contain information about protein structure) into FASTA format, which is just a list of the amino acids. This code handles one PDB file at a time, to do multiple ones put it in a loop e.g. for bash:

for i in `ls *.pdb`; do python pdb2fasta.py $i > $i.fasta; done

In operation, it just looks for lines with 'ATOM', then looks at the residue number to see if has changed. When the residue numbers change it outputs the current amino acid label. Feel free to use/modify this code for any purpose.