## Wednesday, 11 December 2013

### Breaking Substitution Ciphers Part 1

The 'Substitution cipher' is a fairly well known algorithm for enciphering text, a good description of which can be found here. It is also fairly well known that they can be fairly easily broken using hill climbing approaches.

In this post I'm going to look at a different method of breaking substitution ciphers which is not so dependent on trying lots of different cases until one works. The problem with that approach is that you can't be sure how long your cracker will need to run, perhaps by leaving it a bit longer it will eventually succeed? So I'll treat the sentence somewhat like a HMM, then try and find the most likely sequence of characters that make up the key. This will require relaxing some of the constraints, in this case we will relax the constraint that a ciphertext letter will always decrypt to the same plaintext letter. At the end we'll try and massage everything back into place so that we have the most likely decryption key.

## Building the Lattice

Our first task is to build the lattice over which we will be doing all our work, our test sentence will be DEFEND THE EAST enciphered to GIUIFGCEIIPRC. This is a very short sentence which will probably not break due to its length, but hopefully it will illustrate the technique which can be applied to longer sentences.

Our first step is to determine emission probabilities. e.g. G is the first letter of the ciphertext, what is the probability that ciphertext G decrypts to plaintext A? or B? This question will be answered using the multinomial distribution and the counts for each letter. In this case G occurs twice in 13 letters, and we know that A occurs with a frequency of 0.0855. This means the probability of ciphertext G corresponding to plaintext A can be computed as follows:

$p(G_{CT} = A_{PT}) = \dfrac{n_{TOT}!}{n_A!~n_{\lnot A}!}p(A)^{n_A}p(\lnot A)^{n_{\lnot A}}$ $= \dfrac{13!}{2!11!}0.0855^2 \times (1-0.0855)^{11} = 0.2133$

In the equation above I have simplified the formula a bit by assuming $$x_1$$ is the number of times A occurs, and $$x_2$$ is the number of times all other characters occur. We then have to repeat this computation for all plaintext characters, for example the probability that G is B is 0.0167, so it is more likely that G decrypts to A than B.

By applying the formula above for every plaintext letter and each ciphertext letter we get a 'lattice' of size $$n\times 26$$. It will look something like the following, note that each column has been normalised to sum to 1:

 G I U I F G C E I I P R C A 0.100 0.125 0.061 0.125 0.061 0.100 0.100 0.061 0.125 0.125 0.061 0.061 0.100 B 0.008 0.000 0.028 0.000 0.028 0.008 0.008 0.028 0.000 0.000 0.028 0.028 0.008 C 0.026 0.004 0.045 0.004 0.045 0.026 0.026 0.045 0.004 0.004 0.045 0.045 0.026 D 0.036 0.008 0.050 0.008 0.050 0.036 0.036 0.050 0.008 0.008 0.050 0.050 0.036 E 0.130 0.350 0.054 0.350 0.054 0.130 0.130 0.054 0.350 0.350 0.054 0.054 0.130 F 0.014 0.001 0.035 0.001 0.035 0.014 0.014 0.035 0.001 0.001 0.035 0.035 0.014 ... ... ... ... ... ... ... ... ... ... ... ... ... ... W 0.010 0.000 0.031 0.000 0.031 0.010 0.010 0.031 0.000 0.000 0.031 0.031 0.010 X 0.000 0.000 0.004 0.000 0.004 0.000 0.000 0.004 0.000 0.000 0.004 0.004 0.000 Y 0.009 0.000 0.029 0.000 0.029 0.009 0.009 0.029 0.000 0.000 0.029 0.029 0.009 Z 0.000 0.000 0.002 0.000 0.002 0.000 0.000 0.002 0.000 0.000 0.002 0.002 0.000

This post is ongoing, I will edit it some more when I get a chance.