Thursday 18 December 2014

A look at the second Feynman cipher [part 3]

The Feynman ciphers are a set of 3 ciphers given to Richard Feynman, the first of which has been solved, but the second 2 remain unsolved. This is part 3 of a discussion about the second cipher.

I have talked about this cipher twice previously in Part 1 and Part 2. In part 1 I was playing around with vigenere type ciphers, in part 2 I decided to step back a little bit a figure out what the ciphertext is not i.e. identify the cipher algorithms it can't be. One of the ciphers I mentioned there was the Hill cipher, but I only considered the 2 by 2 Hill cipher. It can't be a 2x2 because the ciphertext has an odd number of characters. However, it could be a 3 by 3 Hill cipher because the cipher length is divisible by 3. Fortunately cracking 3x3 Hill ciphers is fairly easy, so we'll consider various types of hill cipher as well as routes through the ciphertext (in case the hill cipher is coupled with a route cipher of some sort).

One of the reasons the Hill cipher is a candidate is because it produces ciphertexts with 26 characters, something many ciphers do not (e.g. foursquare=25, playfair=25, trifid=27 etc. etc.)

The Basic Hill Cipher

In the equation below, the 3x3 matrix is called the key, the matrix \(\left[0,19,19\right]\) represents the characters 'ATT' (A=0,B=1,...,Z=25). The equation shows the enciphering process, see here for a larger tutorial.

\( \left[ \begin{array}[pos]{ccc} 2 & 4 & 5\\ 9 & 2 & 1\\ 3 & 17 & 7\\ \end{array} \right] \left[ \begin{array}[pos]{c} 0\\ 19\\ 19\\ \end{array} \right] = \left[ \begin{array}[pos]{c} 171\\ 57\\ 456\\ \end{array} \right] \pmod{26} = \left[ \begin{array}[pos]{c} 15\\ 5\\ 14\\ \end{array} \right] =\) 'PFO'

The major thing to notice for breaking hill ciphers is that the first ciphertext letter ('P'=15) depends only on the top row of the key matrix, changing the other elements of the key wont affect the fact that 'P' comes out. To make use of this fact we search through all 26^3 possible top rows and look at every third letter in the candidate decryption. We rank the key row by how well the monogram frequencies of every third deciphered character match English. This means we don't have to search through all 26^9 possible keys, just 26^3 3 times, which is much easier. In Part 1 I looked at 8 routes, this time I'll do 40 routes: f2routes.txt.

The code to break 3 by 3 hill ciphers is provided here: break_hill3.py. This code was run on each of the routes keeping the top 15 candidates from each route. Unfortunately nothing came of it, all the decryptions were gibberish. On to the next variant!

Extended Hill Cipher

We can add an extra vector to the Hill enciphering process to make it a little more complicated, increasing the number of key elements from 9 to 12:

\( \left[ \begin{array}[pos]{ccc} 2 & 4 & 5\\ 9 & 2 & 1\\ 3 & 17 & 7\\ \end{array} \right] \left[ \begin{array}[pos]{c} 0\\ 19\\ 19\\ \end{array} \right] + \left[ \begin{array}[pos]{c} 15\\ 5\\ 14\\ \end{array} \right] = \left[ \begin{array}[pos]{c} 186\\ 62\\ 470\\ \end{array} \right] \pmod{26} = \left[ \begin{array}[pos]{c} 4\\ 10\\ 2\\ \end{array} \right] =\) 'EKC'

Note the extra vector \(\left[15,5,14\right]\) added to the first step. In this case the key consists of the square matrix \([2,4,5;9,2,1;3,17,7]\) as well as the vector \([15;5;14]\) (12 numbers to specify the key). This doesn't complicate things too much, we now have to search through 4 numbers 0-25 instead of three. The code for breaking these is here: break_xhill3.py. We have to check a few more possibilities, which makes the program a little bit slower, but not too bad.

This code was run on each of the routes keeping the top 15 candidates from each route. Unfortunately nothing came of it, all the decryptions were gibberish as well. On to the next variant!

Hill cipher with arbitrary substitutions

The next more complicated hill cipher is specified like the standard hill cipher, except the numbers don't necessarily map to the letters in order. For example the standard Hill cipher uses A=0,B=1,...,Z=25. Now we relax that so that e.g. A=7,B=12,...Z=1. This is equivalent to enciphering the message with a substitution cipher prior to applying the hill cipher. This complicates everything a little more.

To break this cipher we can't rank key rows using monogram frequencies because of the additional layer of encipherment, we have to use Index of Coincidence to rank them instead (I.C. is invariant to substitutions). Otherwise the cracking process is similar, except we have to keep more candidates because I.C. is not as effective as monogram frequencies at discriminating good keys from bad.

Cracking this variant was much slower than planned, it meant I was unable to run it on all the different routes, I only ran it on the original cipher. My program output the top few hundred most likely Hill cipher keys, then I have to run a substitution cipher cracker on all the candidates. No success yet.

Tuesday 2 December 2014

Converting PDB files to FASTA format with python

This is a simple script to convert PDB files (which contain information about protein structure) into FASTA format, which is just a list of the amino acids. This code handles one PDB file at a time, to do multiple ones put it in a loop e.g. for bash:

for i in `ls *.pdb`; do python pdb2fasta.py $i > $i.fasta; done

In operation, it just looks for lines with 'ATOM', then looks at the residue number to see if has changed. When the residue numbers change it outputs the current amino acid label. Feel free to use/modify this code for any purpose.

Sunday 30 November 2014

Framing and reconstructing speech signals

This post will deal with framing and overlap-add resynthesis. This can also be known as AMS (Analysis-Modification-Synthesis) when doing things like speech enhancement. First of all, what is the point of framing? An audio signal is constantly changing, so we assume that on short time scales the audio signal doesn't change much (when we say it doesn't change, we mean statistically i.e. statistically stationary, obviously the samples are constantly changing on even short time scales). This is why we frame the signal into 20-40ms frames. If the frame is much shorter we don't have enough samples to get a reliable spectral estimate, if it is longer the signal changes too much throughout the frame, and the FFT will end up smearing the contents of the frame.

What is involved: frame the speech signal into short, overlapping frames. Typically frames are taken to be about 20ms long. For a 16kHz sampled audio file, this corresponds to 0.020s * 16,000 samples/s = 400 samples in length. We then use an overlap of 50%, or about 200 samples. This means the first frame starts at sample 0, the second starts at sample 200, the third at 400 etc.

MATLAB code for framing: frame_sig.m and unframing:deframe_sig.m.

Framing the signal is pretty simple, the only thing to note is that the signal is padded with zeros so that it makes an integer number of frames. A window function is also applied. The overlap-add process has a few things that make it tricky, as well as adding up the overlapped signal we also add up the window correction which is basically what our signal would be if every frame was just the window. This is important since the windowed frames won't necessarily add up to get the original signal back. You can see this by plotting the window_correction variable in deframe_sig.m and thinking about how it gets like that. We also have to add eps (this is just a very small constant i.e. epsilon) to the window correction just in case it is ever zero, this prevents infs appearing in our reconstructed signal.

To see how the AMS framework can be used for spectral subtraction, have a look at this spectral subtraction tutorial. The framing and deframing routines on this page can be used to implement the enhancement routines there. Some example code for the tutorial above would look something like this:

Processing lists of files in MATLAB

It is often necessary to apply a certain operation to many files all at once e.g. adding noise to a database, creating feature files from a list of audio files, enhancing files, etc. This post will explain the method I use for processing lists of files, because I tend to use it quite often.

The main points are to use the unix find command to generate the file list (of course this is not necessary, any method for generating file lists is fine), then use matlab's fgetl to read each file name and process it. The code for adding noise to a list of NIST files is below, you can find the read_NIST_file script here, and the addnoise script can be found here.

Note that we check for end of file by checking ischar(wavname), once the end of file is reached, wavname (a line from the file list) will no longer contain character data.

Saturday 29 November 2014

Adding Noise of a certain SNR to audio files

A common task when dealing with audio is to add noise to files, e.g. if you want to test the performance of a speech recognition system in the presence of noise. This is based on computing the Signal to Noise Ratio (SNR) of the speech vs. noise. To compute the energy in a speech file, just add up the sum of squares of all the samples:

\[ E_{Speech} = \sum_{i=0}^N s(i)^2 \]

where \(s(i)\) is the vector of speech samples you read with a function like wavread. We will also need some noise, which we can generate using a function like randn(N,1) where N is the length of the speech signal. Alternatively we can use a dedicated noise file containing e.g. babble noise and just truncate it at the correct length. When using a noise file, it is important to randomise the start position for each file so you don't always have e.g. a door banging or a guy laughing at the same point in every file. This can mess with classifiers. Anyway, now compute the energy of the noise:

\[ E_{Noise} = \sum_{i=0}^N n(i)^2 \]

where \(n(i)\) is the noise vector. To compute the SNR of a speech file compared to noise:

\[ SNR = 10\log_{10} \left( \dfrac{E_{Speech}}{E_{Noise}} \right) \]

If you don't have the pure noise, you just have a corrupted version of the original, you compute the noise as: \(n(i) = x(i) - s(i)\), where \(x(i)\) is the corrupted signal.

Now we want to scale the noise by a certain amount and add it to the original speech signal so that the SNR is correct. This assumes we have a target SNR, for the sake of this post assume we want the noise to be at 20dB SNR. We now use the following formula (This formula assumes the noise signal has unit variance, you may need to normalise it before using this formula):

\[ K = \sqrt{ \dfrac{E_{Speech}}{10^{20\text{dB}/10}} } \]

Once we have done this we need to create \(\hat{n}(i) = K\times n(i)\) for our noise samples. Our noisy speech file is calculated as:

\[ x(i) = s(i) + \hat{n}(i) \]

for \(i = 1 \ldots N\). You should be able to compute the SNR between the new noisy signal and the original signal and it should come out to be very close to 20dB (it could be 19.9 or 20.1 or something). Function for computing snr in matlab: snr.m, function for adding white noise to files: addnoise.m.

Thursday 27 November 2014

Reading and writing NIST, RAW and WAV files in MATLAB

To open files (NIST or WAV) when you are not sure which it could be, use audioread.m, which depends on the read_X_file.m explained below.

NIST files

NIST files are very common when doing speech processing, for example the TIMIT and RM1 speech databases are in NIST format. The NIST format consists of 1024 bytes at the start of the file consisting of ASCII text, after this header the speech data follows. For TIMIT and RM1 the speech data is 16-bit signed integers which can be read directly from the file and treated as a signal.

To get the sampling frequency, you'll have to parse the text that forms the header. In any case, the functions to read and write NIST files is supplied here: read_NIST_file.m and write_NIST_file.m.

Note that writing a NIST file using the scripts above requires a header. The easiest way to get a header is to read another NIST file. So if you want to modify a NIST file then you would use something like:

[signal, fs, header] = read_NIST_file('/path/to/file.wav');
write_NIST_file('/path/to/new/file.wav',fs,header);

This reuses the header from previously and works fine. If you want to create completely new files i.e. there is no header to copy, I recommend not creating NIST formatted files, create wav files instead as they are far better supported by everything.

An example header from the timit file timit/test/dr3/mrtk0/sx283.wav. Note the magic numbers NIST_1A as the first 7 bytes of the file. The actual audio data starts at byte 1024, the rest of the space between the end of the header text and byte 1024 is just newline characters.

NIST_1A
   1024
database_id -s5 TIMIT
database_version -s3 1.0
utterance_id -s10 rtk0_sx283
channel_count -i 1
sample_count -i 50791
sample_rate -i 16000
sample_min -i -2780
sample_max -i 4675
sample_n_bytes -i 2
sample_byte_format -s2 01
sample_sig_bits -i 16
end_head

RAW files

RAW files are just pure audio data without a header. This can make it a little difficult to figure out what is actually in them, often you may just have to try things until you get meaningful data coming out. Common settings would be 16-bit signed integer samples. You'll also have to take care of endianness, if the files were written on a windows machine all the integers will be byte-swapped and you'll have to swap them back.

read_RAW_file.m and write_RAW_file.m.

WAV files

wav files are supported natively by matlab, so you can just use matlabs wavread and wavwrite functions.

Generating Pretty Spectrograms in MATLAB

Spectrograms are a time-frequency representation of speech (or any other) signals. It can be difficult to make them pretty, as there are a lot of settings that change various properties. This post will supply some code for generating spectrograms in MATLAB, along with an explanation of all the settings that affect the final spectrogram.

The file itself can be found here: myspectrogram.m.

An example of what it can do:

We will now look at some of the default settings. To call it all you need to do is: myspectrogram(signal,fs);. This assumes signal is a sequence of floats that represent the time domain sequence of an audio file, and fs is the sampling frequency. fs is used to display the time (in seconds) at the bottom of the spectrogram.

There are a lot more settings, the full call with everything included is:

[handle] = myspectrogram(s, fs, nfft, T, win, Slim, alpha, cmap, cbar);

s     - speech signal
fs    - sampling frequency
nfft  - fft analysis length, default 1024
T     - vector of frame width and frame shift (ms), i.e. [Tw, Ts], default [18,1] in ms
w     - analysis window handle, default @hamming
Slim  - vector of spectrogram limits (dB), i.e. [Smin Smax], default [-45, -2]
alpha - fir pre-emphasis filter coefficients, default false (no preemphasis)
cmap  - color map, default 'default'
cbar  - color bar (boolean), default false (no colorbar)

nfft is the number of points used in the FFT, larger values of nfft will have more detail, but there will be diminishing returns. 1024 is a good value.

The frame lengths and shifts can be important, shorter window lengths give better time resolution but poorer frequency resolution. Long window lengths give better frequency resolution but poorer time resolution. By playing with these numbers you can get a better idea of how they work.

The window function is specified as an inline function. @hamming is the MATLAB hamming function. If you want blackman you would use @blackman. For a parameterised window like the Chebyshev, you can use @(x)chebwin(x,30) using 30dB as the chebyshev parameter. For a rectangular window you can use @(x)ones(x,1).

The vector of spectrogram limits clips the spectrogram at these points. First the highest point of the log spectrogram is set to 0, the everything outside the limits is set to the limits. This can make spectrograms look much cleaner if there is noise below e.g. -50dB, which is not audible but makes the spectrogram look messy. You can remove it with the limits.

Thursday 16 October 2014

Getting HHBlits to produce profiles

HHblits was introduced in the paper HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment.

First download the HHsuite from here: ftp://toolkit.genzentrum.lmu.de/pub/HH-suite/. You will also need a database from here: ftp://toolkit.genzentrum.lmu.de/pub/HH-suite/databases/hhsuite_dbs/. I used uniprot20 as it was the most recent at the time of writing this post.

Extract HHsuite and the databases. To run hhblits, you'll need fasta files for each of the proteins you want. I happened to have individual files with the protein sequences in it, but hhblits needs proper fasta formatting, which basically just means adding a '>' to the first line. This can be done as follows:


for i in `ls /home/james/Desktop/new_pssm_and_ssp/pssm_new/EDD/infile/`; 
    do echo '>' > eddtrain/$i; 
    cat /home/james/Desktop/new_pssm_and_ssp/pssm_new/EDD/infile/$i >> eddtrain/$i; 
done

For the script above to work, you'll need a directory called 'eddtrain'. The script reads every file in ...EDD/infile/ and creates a corresponding file in eddtrain/ with > as the first line and the protein sequence as the second line. To run hhblits on all of these files, use the following:

for i in `ls eddtrain/*.txt | sed 's-eddtrain/--'`; do 
    echo $i; hhsuite-2.0.16-linux-x86_64/bin/hhblits -d uniprot20_2013_03/uniprot20_2013_03 -i eddtrain/$i -ohhm eddhmms/$i -n 4 -M first; 
done

This will output hmms for all the files in the eddhmm directory.

The hmms have a set of 20 numbers which are transformed probabilities. They can be transformed back into probabilities using the following formula (Note that '*' is the same as infinity i.e. p=0):

\[p = 2^{(-N/1000)}\]

Possible errors

Error in hhblits: input file missing!

You need to add the -i flag. e.g. -i train/file.fasta

Error in hhblits: database missing (see -d)

You need to add the database name e.g. -d uniprot20_2013_03/uniprot20_2013_03. You can get the database from here: ftp://toolkit.genzentrum.lmu.de/pub/HH-suite/databases/hhsuite_dbs/. Extract the database as you would a normal zipped file.

Error in hhblits: Could not open A3M database uniprot20_2013_03_a3m_db, No such file or directory (needed to construct result MSA)

In this case you have given the database directory, you need to provide the path to the database inside the directory. It has to be of the correct format, see the previous error.

Getting SPINEX to produce Protein Secondary Structure

SPINEX is a tool that takes PSSM files as input and predicts the secondary structure of the protein. It is roughly 80% accurate. It can be downloaded here: http://sparks-lab.org/pmwiki/download/index.php?Download=spineXpublic.tgz.

Extract the tgz file, then cd into the directory. There should be bin, code, test and weights directories. If you don't already, you'll need a fortran compiler, gfortran works fine. cd into the code directory and run the compile script. This should create all the executables required.

The next caveat is that SPINE-X only takes pssm files with the '.mat' extension. If your pssm files don't have this extension you'll need to change them. E.g. where I generate PSSM files, I created them with a '.txt' extension. To change them all at once, you can do the following:

for i in `ls`; do mv $i `echo $i | sed 's/txt/mat/'`; done

To run the script above, make sure you are in the directory with all the pssm files. It will change the extension of all the '.txt' files in that directory.

Now move up one directory and do the following ('pssm' is the directory containing the pssm files that we just renamed):

ls pssm | sed 's/.mat$//' > list.txt

This will create a list of all the pssm files without their extensions. SPINE-X automatically adds '.mat' to them. Now we can finally run SPINEX itself:

/home/james/Downloads/spineXpublic/spX.pl list.txt pssm/

The spX.pl file is a perl script from the spineXpublic.tgz we downloaded originally. 'list.txt' is the file list we just created, and 'pssm/' is the directory containing all the pssm files renamed to have a '.mat' extension. This procedure will create a new directory called 'spXout/' which will contain all the predicted secondary structure files.

Getting PSI-BLAST to create PSSMs

This post will be a reference on how to generate PSSM profiles for a list of proteins using NCBI's PSI-BLAST.

First of all, you'll need the PSI-BLAST tool itself from here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ and a database e.g. the nr dataset from here: ftp://ftp.ncbi.nlm.nih.gov/blast/db/, just download all the numbered files e.g. nr.00.tar.gz -> nr.26.tar.gz and extract them all to a folder somewhere.

Let's say you have a text file containing protein sequences called prots.txt containing something like:


SQETFSDLWKLLPEN
STSRHKKLMFK
SPLPSQAMDDLMLSPDDIEQWFTEDPGP
SHLKSKKGQSTSRHKKLMFK

PSI-BLAST needs these as separate input files, so we need to create a fasta file for each one. We will do this at the same time as running PSI-BLAST. I will assume we extracted the nr dataset into a directory called nr/. Now to run PSI-BLAST, I move in to the nr/ directory. This ensures PSI-BLAST can find the dataset. From here we run the following script (e.g. put it in a file called 'run.sh' and execute it):


c=1
for i in `cat prots.txt`
do
    echo $c $i
    echo $i > ~/lab_work/nr_db/train/train${c}.txt
    ../ncbi-blast-2.2.26+-src/c++/GCC461-Release64/bin/psiblast -query ~/lab_work/nr_db/train/train${c}.txt -db nr -out ~/lab_work/nr_db/out/out${c}.txt -num_iterations 3 -out_ascii_pssm ~/lab_work/nr_db/pssm/pssm${c}.txt -inclusion_ethresh 0.001
    c=`expr $c + 1`
done

The script above outputs each line of prots.txt into a file called trainN.txt, where N is a number that is incremented for each protein. You'll have to make sure the directory you want to put these train files in exists. There will also need to be directories called 'out' and 'pssm'. These will be filled with PSI-BLAST outputs and the pssm files respectively.

Monday 17 February 2014

Multivariate Wiener Filter Derivation

The Wiener filter is the optimal filter to use if you want to remove Gaussian noise from a Gaussian signal. Most useful applications though involve multiple (possibly correlated) signals with additive noise, so a multivariate Wiener filter is required. In this post I'll derive the multivariate Wiener filter as a Maximum A-Posteriori (MAP) estimator of the clean signal given the noisy signal. Because Gaussians are symmetric, the MAP estimator will be identical to the MMSE estimator, with the advantage that the MAP estimator is much easier to compute.

So our assumption is we have a noisy signal \(y(n)\) which is the result of adding gaussian noise \(d(n)\) to our clean signal \(x(n)\). Remember each of these are vectors, so \(y(n)\), \(d(n)\) and \(x(n)\) are vectors with \(k\) dimensions.

\[ y(n) = x(n) + d(n)\]

What we want to do is find the most likely value of \(x(n)\) given \(y(n)\). We never get to know \(d(n)\), but we do have an estimate of the covariance of \(d(n)\) which we will call \(\Sigma_d\). We assumed initially that \(x(n)\) and \(d(n)\) were distributed according to the multivariate Gaussian distribution, but because the pdf takes up so much space, I'll be using the following notation:

\[ a \sim \mathcal{N}(\boldsymbol\mu_a,\boldsymbol\Sigma_a) = (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_a|^{-\frac{1}{2}}\, e^{ -\frac{1}{2}(\mathbf{a}-\boldsymbol\mu_a)'\boldsymbol\Sigma_a^{-1}(\mathbf{a}-\boldsymbol\mu_a) } \]

This means \(a\) is distributed according to a multivariate Gaussian with mean \(\boldsymbol\mu_a\) and covariance \(\boldsymbol\Sigma_a\). So we have \(d(n) \sim \mathcal{N}(0,\boldsymbol\Sigma_d)\) and \(x(n) \sim \mathcal{N}(\boldsymbol\mu_x,\boldsymbol\Sigma_x)\).

From our intial assumtion that \(y(n) = x(n) + d(n)\), we can also determine the pdf of \(y(n)\) if we know \(x(n)\). This is simply \(p(y | x) = \mathcal{N}(x,\boldsymbol\Sigma_d)\), i.e. we know \(y\) will be centered around \(x\) with a variance due only to the noise. We now have the distribution of the noisy given the clean, unfortunately we won't know the clean signal. What we want is the distribution of the clean signal given the noisy signal. From here we can use Bayes Rule:

\[p(x | y) = \dfrac{p(y|x)p(x)}{p(y)} \]

This will give us the pdf of the clean signal given the noisy, then we just have to find the maximum of the distribution to find our estimate of \(x\). Note that \(p(y)\) in the above equation does not effect the maximisation (because it does not depend on \(x\)), so it will be left out. From here we write out everything we have, using the values for \(p(x)\) and \(p(y|x)\) we worked out above, hopefully fitting everything on the page:

\[ p(x | y) = p(y|x)p(x)\] \[=(2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_d|^{-\frac{1}{2}}\, e^{ -\frac{1}{2}(y - x)'\boldsymbol\Sigma_d^{-1}(y-x)} \times (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_x|^{-\frac{1}{2}}\, e^{ -\frac{1}{2}(x-\boldsymbol\mu_x)'\boldsymbol\Sigma_x^{-1}(x-\boldsymbol\mu_x)}\]

We want to maximise the equation above, unfortunately it is rather complicated. To get around this we take the log of the whole thing, since maximising a function and maximising its log will give the same point. The result is much easier to differentiate:

\[ \log(p(x | y)) = \log(C) - \frac{1}{2}(y - x)'\boldsymbol\Sigma_d^{-1}(y-x)- \frac{1}{2}(x-\boldsymbol\mu_x)'\boldsymbol\Sigma_x^{-1}(x-\boldsymbol\mu_x) \]

where \(C\) consists of all the bits that don't depend on \(x\).

\[C = (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_d|^{-\frac{1}{2}} (2\pi)^{-\frac{k}{2}}|\boldsymbol\Sigma_x|^{-\frac{1}{2}}\]

Now, to maximise \(\log(p(x|y))\), we differentiate it. This involves some vector calculus, the rule we use can be seen in The Matrix Cookbook (page 11, eqn 85 and 86).

\[ \dfrac{d\log(p(x | y))}{dx} = -\boldsymbol\Sigma_d^{-1}(y-x) + \boldsymbol\Sigma_x^{-1}(x-\boldsymbol\mu_x) \]

To find the maximum we set the left hand side to zero, then solve for \(x\):

\[ 0 = -\boldsymbol\Sigma_d^{-1}(y-x) + \boldsymbol\Sigma_x^{-1}(x-\boldsymbol\mu_x) \]

With a tiny bit of algebra, we can manipulate the above equation to isolate \(x\),

\[\hat{x'} = \boldsymbol\Sigma_x(\boldsymbol\Sigma_x + \boldsymbol\Sigma_d)^{-1}(y - \boldsymbol\mu_x) + \boldsymbol\mu_x \]

and that is it. To get an estimate for the clean signal, we just use the equation above which takes the noisy signal. This is optimal (i.e. the best that is theoretically possible) if \(x\) and \(d\) are both Gaussian. It can still be O.K. if they are not, but you can easily repeat the above steps for different distributions. If the distributions aren't symmetric, then the MMSE estimator will usually perform better than the MAP estimator.

Sunday 9 February 2014

LPCs using the Burg method in Julia

This post will discuss the Burg method of computing LPCs. Other methods include the Autocorrelation method, the Covariance method and the Modified Covariance method.

The burg method builds on the Modified Covariance (MCOV) method. The prediction error is defined in exactly the same way, however it may be that the MCOV method produces an unstable filter because it uses unconstrained optimisation to obtain the LPC coefficients. The burg method minimises the prediction error under the Levinson constraint:

\[ a_{p,k} = a_{p-1,k} + a_{p,p}a_{p-1,p-k}\]

Lets define forward and backward prediction error as:

\[f_p = \sum_{k=0}^p a_{p,k}x(n+p-k) \]

and

\[b_p = \sum_{k=0}^p a_{p,k}x(n+k) \]

Minimisation of the prediction error power with respect to \(a_{p,p}\) leads to

\[ a_{p,p} = -\dfrac{2 \sum_{n=0}^{N-p-1} b_{p-1}(n) f_{p-1}(n+1)}{\sum_{n=0}^{N-p-1} \left(b_{p-1}^2(n) + f_{p-1}^2(n+1)\right)} \]

This leads to a recursive equation described below.

Initialise:

\[ P_0 = \dfrac{1}{N}\sum_{n=0}^{N-1}x^2(n) \] \[ f_0(n) = b_0(n) = x(n), \qquad n = 0,1,\ldots,N-1\]

For \(p = 1,2,\ldots,L\):

\[ a_{p,p} = -\dfrac{2 \sum_{n=0}^{N-p-1} b_{p-1}(n) f_{p-1}(n+1)}{\sum_{n=0}^{N-p-1} \left(b_{p-1}^2(n) + f_{p-1}^2(n+1)\right)} \] \[ a_{p,k} = a_{p-1,k} + a_{p,p}a_{p-1,p-k}, \qquad k=1,2,\ldots,p-1\] \[ P_p = P_{p-1}(1-a_{p,p}^2) \] \[ f_p(n) = f_{p-1}(n+1) + a_{p,p}b_{p-1}(n)\] \[ b_p(n) = b_{p-1}(n) + a_{p,p}f_{p-1}(n+1)\]

Where \(G_L^2 = P_L \). A julia implementation of this algorithm:

Wednesday 5 February 2014

Log Area Ratios and Inverse Sine coefficients

Linear prediction coefficients are used for speech coding to compactly characterise the short-time power spectrum. Unfortunately, LPCs are not very suitable for quantisation due to their large dynamic range, also small quantisation errors in individual LPC coefficients can produce relatively large spectral errors and can also result in instability of the filter. To get around this, it is necessary to transform the LPCs into other representations which ensure the stability of the LPC filter after quantisation. The available representations include: reflection coefficients (RCs), log area ratios (LARs), inverse sine coefficients (ISs) (or arcsine reflection coefficients), and line spectral frequencies (LSFs).

Reflection coefficients are computed from LPCs using the levinson durbin recursion, and have 2 main advantages over LPCs: 1) they are less sensitive to quantisation than LPCs, and 2) the stability of the resulting filter can easily be ensured by making all reflections coefficients stay between -1 and 1.

Though RCs are less sensitive to quantisation than LPCs, they have the drawback that their spectral sensitivity is U-shaped, having large values whenever the magnitude of these coefficients is close to unity. This means the coefficients are very sensitive to quantisation distortion when they represent narrow bandwidth poles. However, this can be overcome by using a non-linear transformation of the RCs. Two such transformations are LARs and ISs.

More information on this topic can be sourced from Speech Coding and Synthesis.

Log Area Ratios

Log area ratios (LAR) are computed from reflection coefficients for transmission over a channel. They are not as good as line spectral pairs (LSFs), but they are much simpler to compute. If \(k\) is a a reflection coefficent, LARs can be computed using the following formula:

\[ \text{LAR} = \log \left( \dfrac{1+k}{1-k} \right) \]

The inverse operation, computing reflection coefficents from the LARs is:

\[ k = \dfrac{e^{\text{LAR}}-1}{e^{\text{LAR}}+1} \]

As an example, if you have the following reflection coefficents: -0.9870 0.9124 -0.2188 0.0547 -0.0383 -0.0366 -0.0638 -0.0876 -0.1178 -0.1541, then the corresponding LARs are -5.0304 3.0836 -0.4447 0.1095 -0.0766 -0.0732 -0.1279 -0.1757 -0.2367 -0.3106.

Inverse Sine Coefficients

IS parameters are another transformation of reflection coefficients, their formula is as follows:

\[ \text{IS} = \dfrac{2}{\pi} \sin^{-1} (k) \]

The inverse operation, computing reflection coefficents from the ISs is:

\[ k = \sin \left( \dfrac{\pi \text{IS}}{2} \right) \]

Using the same reflection coefficients as previously, the IS parameters are -0.8973 0.7316 -0.1404 0.0348 -0.0244 -0.0233 -0.0407 -0.0559 -0.0752 -0.0985

Tuesday 4 February 2014

LPCs using the Modified Covariance Method in Julia

This post will discuss the modified covariance method (a.k.a the forward-backward method) of computing LPCs. Other methods include the Autocorrelation method, the Covariance method and the Burg method.

Linear prediction coefficients (LPCs) are used a lot in signal processing, and the theory around it is covered in other places (see here for an intuitive explanation). In this post I will go over how to implement LPCs using the modified covariance method in Julia. Given a signal \(x\), the job of LPCs is to predict the current sample i.e. \(x(n)\) using a weighted sum of previous samples:

\[ x(n) = \sum_{i=1}^L a(i) x(n-i) + e(n)\]

Ideally, \(e(n)\), the error, will be zero but usually it is not. As a result we want values of \(a(i)\) that make \(e(n)\) as small as possible. The autocorrelation method resulted in the Yule-Walker equations, which have a nice structure (the resulting matrix is toeplitz). The modified covariance method results in a similar set of equations, but the resulting matrix is not toeplitz, so we can't use the Levinson-Durbin algorithm to solve it. This means we have to rely on good old matrix inversion. The covariance equations we have to solve look like this:

\[\sum_{i=1}^L \phi_{i,k} a(i) = -\phi_{k,0} \qquad 1 \leq k \leq L\]

where the \(\phi\)s are defined as:

\[ \phi_{i,k} = \dfrac{1}{2(N-L)} \sum_{n=0}^{N-L-1} x(n+k)x(n+i) + x(n+L-k)x(n+L-i)\]

where N is the length of the signal we are working with, and L is the order of the model we want. The matrix equation that results looks like this:

\[ \left[\begin{array}{cccc} \phi_{1,1} & \phi_{2,1} & \ldots & \phi_{L,1} \\ \phi_{1,2} & \phi_{2,2} & \ldots & \phi_{L,2} \\ \vdots & \vdots & \ddots & \vdots\\ \phi_{1,L} & \phi_{2,L} & \ldots & \phi_{L,L} \end{array}\right] \left[\begin{array}{c} a(1) \\ a(2) \\ \vdots \\ a(L) \end{array}\right] = -\left[\begin{array}{c} \phi_{1,0} \\ \phi_{2,0} \\ \vdots \\ \phi_{L,0} \end{array}\right] \]

This can also be written as:

\[Pa = -p\]

Note that the matrix \(P\) is symmetric but not toeplitz. To solve it we use:

\[a = P^{-1}({-p})\]

Once we have the values of \(a\) we are done. Julia code for this procedure follows.

LPCs using the Covariance Method in Julia

This post will discuss the covariance method of computing LPCs. Other methods include the Autocorrelation method, the modified covariance (Forward-Backward) method and the Burg method.

Linear prediction coefficients (LPCs) are used a lot in signal processing, and the theory around it is covered in other places (see here for an intuitive explanation). In this post I will go over how to implement LPCs using the covariance method in Julia. Given a signal \(x\), the job of LPCs is to predict the current sample i.e. \(x(n)\) using a weighted sum of previous samples:

\[ x(n) = \sum_{i=1}^L a(i) x(n-i) + e(n)\]

Ideally, \(e(n)\), the error, will be zero but usually it is not. As a result we want values of \(a(i)\) that make \(e(n)\) as small as possible. The autocorrelation method resulted in the Yule-Walker equations, which have a nice structure (the resulting matrix is toeplitz). The covariance method results in a similar set of equations, but the resulting matrix is not toeplitz, so we can't use the Levinson-Durbin algorithm to solve it. This means we have to rely on good old matrix inversion. The covariance equations we have to solve look like this:

\[\sum_{i=1}^L \phi_{i,k} a(i) = -\phi_{k,0} \qquad 1 \leq k \leq L\]

where the \(\phi\)s are defined as:

\[ \phi_{i,k} = \dfrac{1}{N-L} \sum_{n=L}^{N-1} x(n-k)x(n-i)\]

where N is the length of the signal we are working with, and L is the order of the model we want. The matrix equation that results looks like this:

This can also be written as:

\[Pa = -p\]

Note that the matrix \(P\) is symmetric but not toeplitz. To solve it we use:

\[a = P^{-1}({-p})\]

Once we have the values of \(a\) we are done. Julia code for this procedure follows.

Saturday 1 February 2014

Implementing Linear Prediction Coefficients in Julia

This post will discuss the autocorrelation or Yule-Walker method of computing LPCs. Other methods include the Covariance method, the modified Covariance Method (Forward-Backward method) and the Burg method.

Linear prediction coefficients (LPCs) are used a lot in signal processing, and the theory around it is covered in other places (see here for an intuitive explanation). In this post I will go over how to implement LPCs in Julia. Given a signal \(x\), the job of LPCs is to predict the current sample i.e. \(x(n)\) using a weighted sum of previous samples:

\[ x(n) = \sum_{i=1}^L a(i) x(n-i) + e(n)\]

Ideally, \(e(n)\), the error, will be zero but usually it is not. As a result we want values of \(a(i)\) that make \(e(n)\) as small as possible. Formally, we try to minimise the prediction error power:

\[P = \dfrac{1}{N}\sum_{n=-\infty}^{\infty} e(n)^2\] \[ = \dfrac{1}{N}\sum_{n=-\infty}^{\infty} \left[ x(n) + \sum_{k=1}^{L} a(k) x(n-k) \right] \]

What we want to do is minimise \(P\) with respect to \(a(i)\). When this is done, we get a set of equations that must be satisfied for \(P\) to be minimised, the well known Yule-Walker equations:

\[\sum_{i=1}^L R_{xx}(i-k) a(i) = R_{xx}(k) \qquad 1 \leq k \leq L\]

where \(R_{xx}\) is the autocorrelation sequence of our signal \(x\). We can write out the Yule-Walker equations as a matrix equation, so solving for the LPCs (values of \(a(i)\)) becomes easy:

\[ \left[\begin{array}{cccc} R(0) & R(1) & \ldots & R(L-1) \\ R(1) & R(0) & \ldots & R(L-2) \\ \vdots & \vdots & \ddots & \vdots\\ R(L-1) & R(L-2) & \ldots & R(0) \end{array}\right] \left[\begin{array}{c} a(1) \\ a(2) \\ \vdots \\ a(L) \end{array}\right] = \left[\begin{array}{c} R(1) \\ R(2) \\ \vdots \\ R(L) \end{array}\right] \]

where \(R(0)\) is the same \(R_{xx}(0)\) in the previous equations, and \(a(i)\) are the LPCs we want to compute. The naive way of solving the Yule Walker equations is to invert the square matrix of autocorrelation coefficients, however this has \( O(L^3)\) complexity. There is faster way that uses the special structure of the \(R\) matrix (it has toeplitz structure), the algorithm for solving it is called the Levinson-Durbin algorithm which can find the values of \(a(i)\) with \(O(L^2)\) complexity.

The Levinson-Durbin algorithm looks a little like the following:

For \(m = 1\):

\[a_{1,1} = -\dfrac{R(1)}{R(0)} \] \[P_1 = R(0)(1-a_{1,1}^2) \]

For \(m = 2,3,\ldots,L\):

\[a_{m,m} = -\dfrac{R(m) + \sum_{i=1}^{m-1} a_{m-1,i}R(m-i)}{P_{m-1}} \] \[a_{m,i} = a_{m-1,i} + a_{m,m}a_{m-1,m-i} \qquad i=1,2,\ldots,m-1\] \[P_m = P_{m-1}(1-a_{m,m}^2) \]

Our final LPCs are a(1:L) = \(a_{L,1:L}\), and the gain term is P = \( P_L\) where \(L\) is the number of LPCs we want. In Julia code, this will look like the following:

And that is it.

Friday 31 January 2014

Fast Calculation of Autocorrelation Coefficients in Julia

Autocorrelation is one of those things that gets used a lot in signal processing which has a very simple formula, but which is easy to speed up. In essence, autocorrelation is a measure of how similar a signal is to itself. Periodic signals are exactly those signals that are self-similar, so autocorrelation is good for locationg periodicities. The basic formula for autocorrelation of a signal \(x(i)\) is:

\[ R_{xx}(j) = \sum_{i=0}^N x(i)x(i-j) \]

In the above equation, \( R_{xx}(j) \) is the notation for the \(j^{th}\) autocorrelation coefficient of signal \(x\), where \(x\) is of length \(N\) samples. Note that the above equation is \(O(N^2)\) in complexity if we want all the autocorrelation coefficients (of which there are around \( 2N - 1\) ). To speed it up, we are going to use the Wiener–Khinchin theorem, which states that the power spectrum of a signal and the autocorrelation sequence of a signal are related to each other via the Fourier Transform (FFT). This means:

\[ S_{xx} = FFT(R_{xx})\]

and

\[ R_{xx} = IFFT(S_{xx}) = IFFT( |FFT(x)|^2 ) \]

where \(S_{xx}\) is the power spectrum of our signal \(x\). Note that the complexity of the FFT is \(O(n\log n)\), so this is a significant speedup over the original equation. This means our code for fast autocorrelation is simply:

Alternatively, Julia provides a xcorr function for computing cross correlation, to produce the same output as the function above, you would need: xcorr(x,x)[length(x):], this is the preferred way of achieving it.

Tuesday 28 January 2014

A quick look at the 3rd Feynman cipher

The Feynman ciphers are a set of 3 ciphertexts apparently given to Richard Feynman at Los Alamos, only the first of which has been solved. This post will be a quick look at some of the statistics of the 3rd cipher. The Cipher itself is:

WURVFXGJYTHEIZXSQXOBGSV
RUDOOJXATBKTARVIXPYTMYA
BMVUFXPXKUJVPLSDVTGNGOS
IGLWURPKFCVGELLRNNGLPYT
FVTPXAJOSCWRODORWNWSICL
FKEMOTGJYCRRAOJVNTODVMN
SQIVICRBICRUDCSKXYPDMDR
OJUZICRVFWXIFPXIVVIEPYT
DOIAVRBOOXWRAKPSZXTZKVR
OSWCRCFVEESOLWKTOBXAUXV
B

It is 231 characters in length (which has factors 3, 7 and 11), and has an I.C. of 0.0429, which means it is unlikely to be a transposition or substitution cipher. If we look at the monogram frequencies, we see all 26 letters appear, which means it can't be a cipher based on a 5 by 5 keysquare. This cipher actually looks statistically very close to the previous (second feynman cipher) which I have talked about previously (1, 2).

Since this one is so similar to the second Feynman cipher, much of the analysis here also applies to this cipher. Using this cipher identifier, the top candidates are the following ciphers:

RunningKey 9
Beaufort 9
FracMorse 9
period 7 Vigenere 9
Quagmire3 10
Porta 10
Vigenere 10
Quagmire4 10
Gronsfeld 11
Quagmire2 11
Nicodemus 11
Vigslidefair 13

These are almost all Vigenere-type ciphers, I will try and run some cracking tools on them to see what happens.

Saturday 25 January 2014

A look at the second Feynman cipher [part 2]

The Feynman ciphers are a set of 3 ciphers given to Richard Feynman, the first of which has been solved, but the second 2 remain unsolved. This is part 2 of a discussion about the second cipher, which I refer to as F2.

In the previous part, I was playing around with vigenere type ciphers, in this part I have decided to step back a little bit a figure out what the ciphertext is not i.e. identify the cipher algorithms it can't be. To do this we need a comprehensive list of cipher algorithms, thankfully there are lists around.

Using the list of ciphers above plus a few more from here, I have created the following list of ciphers, ruling out the ones it can't possibly be:

Transposition: AMSCO, COMPLETE COLUMNAR TRANSPOSITION, INCOMPLETE COLUMNAR, MYSZKOWSKI, NIHILIST TRANSPOSITION, RAILFENCE, ROUTE TRANSPOSITION, REDEFENCE, SWAGMAN

Substitution: SIMPLE SUBSTITUTION, BACONIAN, CAESAR, ROT13, AFFINE, ATBASH

Uses 25 letter alphabet: BAZERIES, BIFID, CADENUS, CHECKERBOARD, CM BIFID, FOURSQUARE, PHILLIPS, PHILLIPS-C, PHILLIPS-RC, PLAYFAIR, SERIATED PLAYFAIR, TRI-SQUARE, TWIN BIFID, TWO-SQUARE

More than 26 chars: DIGRAFID, HOMOPHONIC, TRIFID, TWIN TRIFID

Other: GRANDPRE (ct is numbers), GRILLE (ct length must be square number), GROMARK (ct starts with 5 digits), PERIODIC GROMARK (ct starts with digits), HEADLINES (ct has spaces), KEY PHRASE (word divisions retained), MONOME-DINOME (ct is numbers), MORBIT (ct is numbers), NIHILIST SUBSTITUTION (ct is numbers), POLLUX (ct is numbers), RAGBABY (word divisions are kept, 24 chars), TRIDIGITAL (ct is numbers), ADFGX/ADFGVX (only 5/6 characters used), POLYBIUS SQUARE (only 5/6 characters used), HILL (2*2)(requires even number of characters)

The above ciphers can be ruled out. The transposition ciphers because the frequency distribution does not match english, the substitution ciphers because the IC is 0.045, which is too low for English. The F2 cipher also has 26 characters in total, so it can't be anything based on a 5 by 5 key square. Some ciphers have too many characters (more than 26) so they can also be ruled out. Others can be ruled out because the ciphertext (ct) they produce consists of numbers.

The Possibilities

The following are the ciphers I can't rule out: FRACTIONATED MORSE, GRONSFELD, AUTOKEY, INTERRUPTED KEY, NICODEMUS, NULL, PORTA, QUAGMIRE I, QUAGMIRE II, QUAGMIRE III, QUAGMIRE IV, RUNNING KEY, VARIANT, VIGENÈRE, PORTAX, PROGRESSIVE KEY, SLIDEFAIR.

Assumptions: we all have to make assumptions, it is better to lay them out at the start. 1. The plaintext is in English, 2. There are no NULLs distributed through the ciphertext, 3. It is not a combination transposition + something else. 4. When making the ciphertext, no mistakes were made that prevent decryption.

I may have to weaken some of these assumptions later, but for now I think focusing on simpler stuff should be prioritised.

Narrowing Possibilities

In part 1, I tried solving it using Vigenere, Beaufort, Variant Beaufort, Gronsfeld, VigAutokey and Porta, and this was unsuccessful which makes me confident I can ignore these possibilities under the assumptions above.

I have used this site to get a list of candidate ciphers ranked by likelyhood. I have also removed the ones I know are not possible:


FracMorse 5
RunningKey 8
Quagmire3 8
Quagmire2 9
Quagmire4 9
Nicodemus 11
Vigslidefair 12
Portax 13
Swagman 22
Progkey beaufort 23
Progressivekey 24

This means I will concentrate my focus on FracMorse, Running Key and the Quagmires. I will update again in Part 3.

Thursday 23 January 2014

Removing Annotations from the Brown Corpus with Sed

The Brown corpus is a collection of texts on various topics amounting to around 1 million words, and each word is annotated with its Part Of Speech (POS). The following is an excerpt from it:

 The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at 
investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd 
``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

You can find a copy of the Brown corpus on this page. To get a copy of the Brown corpus with tags removed, you can get it here.

Each word ends with a "/" character, followed by the part of speech, e.g. "The" is an article, which is abbreviated to "at". The full list of abbreviations can be seen here. I would like to strip off all the POS tags though, so I can use the text on its own. My main motivation is to get a wide selection of texts on several topics to use as a test set for computing the perplexity of language models. The Brown corpus seems to fit the bill.

Stripping the Tags

I am going to use sed to do all the work here, based on the observation that each word has a slash ("/") in the middle, followed by some more characters. To express this as a regular expression, we'll use the following:

[^ ]*/[^ ]*

This regular expression (RE) will match a sequence of characters (not including space), followed by a slash, followed by some other sequence of characters (not including space). Each time we match this, we want to keep only the first part of match i.e. the part before the slash. In sed this is done like so:

s_\([^ ]*\)/[^ ]*_\1_g

We have put the RE into sed's syntax inside the "s/a/b/g" construct, which tells sed to match "a" and substitute any instances matched with "b". Note that the forward slash characters are optional, any character can be used in that spot, in the example above "_" is used instead. The next sed feature we have used is bracketing "\( \)", which tells sed to record the part of the match that happened inside the brackets and put it in a variable called "\1". Note we used this variable in the second half of the example. This tells sed to match the whole RE, but replace it with only the first bit.

If we have a file called "temp.txt" with the following text in it:

The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at 
City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in 
the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at 
City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at 
election/nn was/bedz conducted/vbn ./.

we'll want to strip out the tags using sed like this:

sed 's_\([^ ]*\)/[^ ]*_\1_g' temp.txt > newtemp.txt

and we are left with:

The jury further said in term-end presentments that the City Executive Committee 
, which had over-all charge of the election , `` deserves the praise and thanks of the 
City of Atlanta '' for the manner in which the election was conducted .

Applying it to Multiple Files

The Brown corpus contains 500 files, and we want to apply our sed expression to all the files, to do this we'll use bash.

for fname in `ls c*`; do 
    sed -i 's_\([^ ]*\)/[^ ]*_\1_g' $fname; 
done

The script above loops over all the files in the brown corpus beginning with the letter c, then uses sed to apply the regular expression to each file. The '-i' flag tells sed to apply the regular expression to the file in-place, whereas usually it would print out the results.

To get a copy of the Brown corpus with tags removed, you can get it here.

Monday 20 January 2014

A look at the second Feynman cipher [part 1]

In 1987, someone posted a message to an internet cryptology list, saying that Caltech Physics Professor Richard Feynman was given three samples of code by a fellow scientist at Los Alamos. Only one of the three was ever solved (see elonka.com).

The second cipher is as follows (I'm going to refer to it as "F2", feynman cipher 2 throughout the post):

XUKEXWSLZJUAXUNKIGWFSOZRAWURORKXAOSLHROBXBTKCMUWDVPTFBLMKEFVWMUXTVTWUIDDJVZKBRMCWOIWYDXMLUFPVSHAGSVWUFWORCWUIDUJCNVTTBERTUNOJUZHVTWKORSVRZSVVFSQXOCMUWPYTRLGBMCYPOJCLRIYTVFCCMUWUFPOXCNMCIWMSKPXEDLYIQKDJWIWCJUMVRCJUMVRKXWURKPSEEIWZVXULEIOETOOFWKBIUXPXUGOWLFPWUSCH

The first cipher was a plain route cipher, decrypting to some Chaucer. F2 has 261 characters, which factors into a rectangle 9*29, which means it could be a route cipher like the previous one, or a combination of route cipher and some other cipher. We will jump right into some analysis following this guide: Identifying unknown ciphers.

First of all, the monogram frequencies look nothing like English, so we can confidently rule out a plain transposition/route cipher. Next, the Index of coincidence is 0.045, which is too low to be a substitution cipher. From here I can reasonably confidently conclude it is not a transposition, monographic substitution or combination of these.

Another major signal is the presence of 26 characters; this means it can't be a cipher based on a 5 by 5 or 6 by 6 grid such as playfair, foursquare, phillips etc.

There are also all 26 characters appearing, so it is not ADFGX/ADFGVX or Polybius square. It is probably not trifid, since trifid uses 27 characters.

We can actually go further and rule out digraphic ciphers like bifid, hill, playfair and foursquare because the number of characters is odd. What does this leave? It leaves all the vigenere-type ciphers like gronsfeld, porta, autokey, beaufort etc.

Next steps

I will have to eventually make some assumptions and start cracking, but it is good to narrow the playing field a bit first. From here I am going to assume it is a Vigenere-type cipher with possibly some route-type transpositions going on. If we want to break route transpositions, we need to first identify the candidate routes we want to try. This will involve slicing up the 9*29 into various routes. Of course, the previous cipher was split into 75*5, so this one may actually be 3*87 instead of 9*29, but we can try that later.

If our block of text looks like:

XUKEXWSLZJUAXUNKIGWFSOZRAWURO
RKXAOSLHROBXBTKCMUWDVPTFBLMKE
FVWMUXTVTWUIDDJVZKBRMCWOIWYDX
MLUFPVSHAGSVWUFWORCWUIDUJCNVT
TBERTUNOJUZHVTWKORSVRZSVVFSQX
OCMUWPYTRLGBMCYPOJCLRIYTVFCCM
UWUFPOXCNMCIWMSKPXEDLYIQKDJWI
WCJUMVRCJUMVRKXWURKPSEEIWZVXU
LEIOETOOFWKBIUXPXUGOWLFPWUSCH

Then we can read off letters in various directions e.g.


XUKEXWSLZJUAXUNKIGWFSOZRAWURORKXAOSLHROBXBTKCMUWDVPTFBLMKEFVWMUXTVTWUIDDJVZKBRMCWOIWYDXMLUFPVSHAGSVWUFWORCWUIDUJCNVTTBERTUNOJUZHVTWKORSVRZSVVFSQXOCMUWPYTRLGBMCYPOJCLRIYTVFCCMUWUFPOXCNMCIWMSKPXEDLYIQKDJWIWCJUMVRCJUMVRKXWURKPSEEIWZVXULEIOETOOFWKBIUXPXUGOWLFPWUSCH

XRFMTOUWLUKVLBCWCEKXWUEMUJIEAMFRUFUOXOUPTWPMEWSXVUPOVTSLTSNYXROLHVHOTCCOZRTAJRNJFJOWGULMUWUBUSZGCMKAXIVHBIVBXBDWVMWRIUTDUTCMKUNKJFWYSXXKCVWKPKWPIMZOOOPUXGUKRRJXRUWWBCSCEKGFDRWVLDPOSVMURRLSWOPCIZIYELZTWDSYIEFRFOUVTQIPABIJVVKWWWLWCFFDZUUMYNSCJVSRKDVQCWXCOEXTXMIUH

HUIMXTXEOCXWCQVDKRSVJCSNYMUUZDFFCWLWWWKVVJIBAPIQTVUOFRFEIYSDWTZLEYIZICPOWSLRRUMVSOPDLVWRDFGKECSCBWWURXJRRKUGXUPOOOZMIPWKPKWVCKXXSYWFJKNUKMCTUDTUIRWMVWDBXBVIBHVIXAKMCGZSUBUWUMLUGWOJFJNRJATRZOCCTOHVHLORXYNSTLSTVOPUVXSWEMPWTPUOXOUFURFMAEIJUMEUWXKECWCBLVKULWUOTMFRX

HCSUWPFLWOGUXPXUIBKWFOOTEOIELUXVZWIEESPKRUWXKRVMUJCRVMUJCWIWJDKQIYLDEXPKSMWICMNCXOPFUWUMCCFVTYIRLCJOPYCMBGLRTYPWUMCOXQSFVVSZRVSROKWTVHZUJONUTREBTTVNCJUDIUWCROWFUWVSGAHSVPFULMXDYWIOWCMRBKZVJDDIUWTVTXUMWVFEKMLBFTPVDWUMCKTBXBORHLSOAXKRORUWARZOSFWGIKNUXAUJZLSWXEKUX

OEXTXMIUHRKDVQCWXCUMYNSCJVSWLWCFFDZUABIJVVKWWRFOUVTQIPZTWDSYIEFOPCIZIYELSVMURRLSWFDRWVLDPOWWBCSCEKGGUKRRJXRUIMZOOOPUXKCVWKPKWPNKJFWYSXXUTDUTCMKUXBDWVMWRIAXIVHBIVBUBUSZGCMKJOWGULMUWZRTAJRNJFLHVHOTCCOSLTSNYXROWSXVUPOVTXOUPTWPMEEAMFRUFUOKXWUEMUJIUKVLBCWCEXRFMTOUWL

ORUWARZOSFWGIKNUXAUJZLSWXEKUXEKMLBFTPVDWUMCKTBXBORHLSOAXKRXDYWIOWCMRBKZVJDDIUWTVTXUMWVFTVNCJUDIUWCROWFUWVSGAHSVPFULMXQSFVVSZRVSROKWTVHZUJONUTREBTMCCFVTYIRLCJOPYCMBGLRTYPWUMCOIWJDKQIYLDEXPKSMWICMNCXOPFUWUUXVZWIEESPKRUWXKRVMUJCRVMUJCWHCSUWPFLWOGUXPXUIBKWFOOTEOIEL

LWUOTMFRXECWCBLVKUIJUMEUWXKOUFURFMAEEMPWTPUOXTVOPUVXSWORXYNSTLSOCCTOHVHLFJNRJATRZWUMLUGWOJKMCGZSUBUBVIBHVIXAIRWMVWDBXUKMCTUDTUXXSYWFJKNPWKPKWVCKXUPOOOZMIURXJRRKUGGKECSCBWWOPDLVWRDFWSLRRUMVSLEYIZICPOFEIYSDWTZPIQTVUOFRWWKVVJIBAUZDFFCWLWSVJCSNYMUCXWCQVDKRHUIMXTXEO

LEIOETOOFWKBIUXPXUGOWLFPWUSCHWCJUMVRCJUMVRKXWURKPSEEIWZVXUUWUFPOXCNMCIWMSKPXEDLYIQKDJWIOCMUWPYTRLGBMCYPOJCLRIYTVFCCMTBERTUNOJUZHVTWKORSVRZSVVFSQXMLUFPVSHAGSVWUFWORCWUIDUJCNVTFVWMUXTVTWUIDDJVZKBRMCWOIWYDXRKXAOSLHROBXBTKCMUWDVPTFBLMKEXUKEXWSLZJUAXUNKIGWFSOZRAWURO

I am going to start with these 8 routes, and try to break them assuming they are any of the following: vigenere, vigenere-autokey, beaufort, variant beaufort, porta. I 'll also have to search all keys with lengths 2-20, and I'll be using the method described here.

Sources of Difficulty

While the code for the above tests is running, I thought I'd look at reasons why these experiments may not work. First is that it is not a cipher from the vigenere family, or perhaps it is e.g. a running key cipher, i.e. something I've not tried. Second is that the actual route is one I've not tried. The 8 routes from the 9*29 square are above, but i'll also have to try the same 8 routes from the 29*9, 3*87 and 87*3. There are many other routes I can't try. The last source of error I can see is mismatch between the plaintext language and the english model I am using to score fitness. The first cipher is Chaucer, which is old English and doesn't match up with modern english very well. There is a chance the cipher is in German or French or something too, in which case I probably won't break it.

The results of the above testing: I have run Vigenere, Vig-Autokey, Beaufort, Porta and variant Beaufort crackers on 64 possible routes for key lengths 3-20. I am fairly certain that if F2 was one of these ciphers it would have been broken. In the next post I'll broaden the search a bit.

Thursday 16 January 2014

Auto-generating an Index for Blogger posts

I am not a big fan of blog layouts, it is usually very difficult to find a certain page without using a search engine. To fix this, I wanted to create an Index page with headings and all the posts I have written. I don't want to do this manually, however, as that seems like a lot of ongoing work. There are solutions that use javascript to dynamically build an Index page from a feed, but this means that google won't index the page. So my solution was to go with a semi-manual process in which I use a python script to build the html Index page from the blog feed, then I just have to paste the html into the blog periodically.

The code below generates the following page: index.html

The headings in the page are generated from the labels for the page, e.g. this page is labeled as "Programming", so it appears under the Programming heading. Note that you will have to change the line with "jameslyons0.blogspot.com.au" with your own blog url. To run the code, save the code into a file called e.g. "index.py", then use python index.py to run the code. This will output html, copy and paste it into a page on your blogger site.

Pages

Thursday 18 December 2014

The Basic Hill Cipher

Extended Hill Cipher

Hill cipher with arbitrary substitutions

Tuesday 2 December 2014

Sunday 30 November 2014

Saturday 29 November 2014

Thursday 27 November 2014

NIST files

RAW files

WAV files

Thursday 16 October 2014

Possible errors

Monday 17 February 2014

Sunday 9 February 2014

Wednesday 5 February 2014

Log Area Ratios

Inverse Sine Coefficients

Tuesday 4 February 2014

Saturday 1 February 2014

Friday 31 January 2014

Tuesday 28 January 2014

Saturday 25 January 2014

The Possibilities

Narrowing Possibilities

Thursday 23 January 2014

Stripping the Tags

Applying it to Multiple Files

Monday 20 January 2014

Next steps

Sources of Difficulty

Thursday 16 January 2014