
Friday 23 October 2015

Installing RNNLIB on Ubuntu 14.04

I have previously looked at CURRENNT, but now I'll go through how to use RNNLIB, which is a different RNN library. RNNLIB doesn't use CUDA, but it can do multi-dimensional RNNs and can also use the CTC loss function, which can be handy. Let's get right into it. Download RNNLIB from SourceForge here.

You'll have to install the netcdf library:

sudo apt-get install libnetcdf-dev

You should also install boost: sudo apt-get install libboost-all-dev. I have libboost1.55 installed already.

There is an installation guide over here if you want some additional pointers. First run configure and make:

./configure
make

Fixing the errors

The first error I got was:

Mdrnn.hpp:239:6:   required from here
Helpers.hpp:311:41: error: ‘range_min_size’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
size_t size = range_min_size(r1, r2, r3);

To fix this one you need to put the template code above the code that uses it. Moving the four range_min_size templates to the top of the file should fix all the issues in Helpers.hpp. I cut lines 532-547 and pasted them at line 163 of Helpers.hpp.

Running make again I got: Container.hpp:157:24: error: ‘resize’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation. To fix this one, change "resize" to "this->resize" on line 157 of Container.hpp.

That was all I needed to do to get this version installed.

Thursday 22 October 2015

Running the CURRENNT rnn library

Previously I have made posts about installing Currennt and writing data files for Currennt. This post will be about actually running it assuming it is installed and the data files have been written.

The first thing you'll need to specify is the config file, which I called config.cfg. Its contents look like this:

max_epochs_no_best   = 20
max_epochs = 100
learning_rate = 1e-5
network = network.jsn
train = true
train_file = full_train_currennt.nc
val_file = full_test_currennt.nc
weights_dist = normal
weights_normal_sigma = 0.1
weights_normal_mean = 0
stochastic = true
validate_every = 1
parallel_sequences = 20
input_noise_sigma = 0
shuffle_fractions = true
shuffle_sequences = false

Many of the options are described on the CURRENNT wiki. network.jsn is a file that contains the network configuration, e.g. number of layers, layer types etc.; I describe it below. The files full_train_currennt.nc and full_test_currennt.nc were created using the MATLAB script in a previous post. If you set parallel_sequences too high you'll run out of memory on your GPU, but higher values mean it will train faster.

The network configuration

Info on the network configuration can be found over here. The network I used looks like this:

{
    "layers": [
        {
            "size": 20,
            "name": "input",
            "type": "input"
        },
        {
            "size": 200,
            "name": "level_1",
            "bias": 1.0,
            "type": "blstm"
        },
        {
            "size": 200,
            "name": "level_2",
            "bias": 1.0,
            "type": "blstm"
        },
        {
            "size": 4,
            "name": "output",
            "bias": 1.0,
            "type": "softmax"
        },
        {
            "size": 4,
            "name": "postoutput",
            "type": "multiclass_classification"
        }
    ]
}

Actually running stuff

Now to run things, just do:

/path/to/currennt/build/currennt config.cfg

and everything should work; just wait for it to finish training. This will output a file called trained_network.jsn, which we will need during testing. For testing, you will need another HDF5 file just like the training one, except with the test sequences in it. We also need another config file, which I called ff_config.cfg:


network = trained_network.jsn
cuda = 0
ff_output_format = csv
ff_output_file = test_results
ff_input_file = full_test_currennt.nc

Notice that CUDA is off; I found it ran faster in feed-forward mode without CUDA. To do the testing, run:

/path/to/currennt/build/currennt ff_config.cfg

Now all you need to do is parse the output file test_results.csv if you need the individual predictions; it is in comma-separated value format, so it is not hard to work out.
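
If you do want to pull the predictions out programmatically, here is a minimal Python sketch (not from the original post) of one way to do it. It assumes each row starts with the sequence tag followed by numLabels class probabilities per timestep; check that layout against your own test_results.csv before relying on it.

import csv
import numpy as np

NUM_LABELS = 4  # C, E, H, X in this example

def read_predictions(path, num_labels=NUM_LABELS):
    # Assumed row layout: tag, p(class0, t0), ..., p(class3, t0), p(class0, t1), ...
    predictions = {}
    with open(path) as f:
        for row in csv.reader(f):
            tag = row[0]
            probs = np.array([float(v) for v in row[1:]]).reshape(-1, num_labels)
            predictions[tag] = probs.argmax(axis=1)  # most probable class per timestep
    return predictions

preds = read_predictions("test_results.csv")
for tag, classes in list(preds.items())[:3]:
    print(tag, classes[:10])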

Writing HDF5 files for Currennt

In a previous post I explained how to install Currennt. This post will deal with writing the data files; the next post will be about actually running things. Currennt uses data written to HDF5 files, so whatever format your data is in, you have to write it to an HDF5 file and then use Currennt to process it. This post will use MATLAB to generate the HDF5 files, but other languages, e.g. Python, could easily do it as well.

Recurrent neural networks predict things about sequences, so for this example we'll use protein sequences. Predicting things like phonemes for speech files works exactly the same way. For each amino acid in a protein, we'll be predicting the secondary structure, which can be C, E, H or X. The training information for each protein will be an L by 20 matrix; this means the protein is of length L and each amino acid has 20 features (we'll use PSSM features, of which there are 20, extracted using PSI-BLAST). I will just be using 400 training proteins and 100 test proteins; it is often beneficial to try things out on small datasets because you get results quicker and can figure out problems quicker.

HDF5 files

Currennt needs the data written to HDF5 files, but it needs a heap of extra parameters written to the files as well as the data. This section will specify what you need to write to them; then I'll put some actual MATLAB code in the next section.

The data we'll be using is protein data: there will be 400 training proteins, each of length L (the sequences can be anywhere from about 20 to 2000 in length). There are 20 features per amino acid, which means each protein is represented by a matrix of size 20 by L, and there are 400 such matrices.

These parameters are for classification; if you want something else like regression you may have to use different dimensions and variables, e.g. numLabels only makes sense for classification. HDF5 files have dimensions and variables, and they need to be specified separately. The dimensions that Currennt needs specified in the HDF5 files are as follows:

numTimesteps: the total number of amino acids if you lined up all the proteins end to end
inputPattSize: the number of features per amino acid, in this case 20
numSeqs: the number of proteins
numLabels: the number of classes we want to predict, in this case 4
maxLabelLength: the length of the longest string used to hold the class label name
maxTargStringLength: to be honest I'm not sure about this one, I just set it to 5000 and things seem to work
maxSeqTagLength: this is another one I'm not sure about. I set it to 800 and it seems to work.

Now that we have specified the dimensions, we want to specify the variables, i.e. the actual data we will be using for features and labels. They are specified like so:


seqTags, size: maxSeqTagLength x numSeqs
numTargetClasses, size: 1
inputs, size: inputPattSize x numTimesteps
seqLengths, size: numSeqs
targetClasses, size: numTimesteps
labels, size: maxLabelLength x numLabels

You then just have to write the relevant data to each of the variables and run Currennt. Some MATLAB code that does all this is provided below. Note that the MATLAB file will just create one file, e.g. for training. If you want a separate dataset, e.g. for testing (strongly recommended), then you will need to run the code a second time, changing the file names and the protein dataset.
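
As a rough sketch of the same idea in Python (a stand-in for the MATLAB version, using the netCDF4 package and toy random data in place of real PSSM features and labels), the dimensions and variables above could be written like this. The dimension ordering of inputs is the reverse of the listing above because Python writes in C order rather than MATLAB's Fortran order; verify the ordering and the character variables against what your Currennt build accepts.

import numpy as np
from netCDF4 import Dataset, stringtochar

# Toy stand-ins for real data: 3 "proteins" with 20 PSSM-style features per residue.
num_labels, patt_size, max_tag_len, max_label_len = 4, 20, 800, 5
seqs = [np.random.randn(L, patt_size).astype("float32") for L in (30, 45, 25)]
targets = [np.random.randint(num_labels, size=len(s)) for s in seqs]
tags = ["prot_%d" % i for i in range(len(seqs))]
class_names = ["C", "E", "H", "X"]

with Dataset("full_train_currennt.nc", "w", format="NETCDF3_CLASSIC") as nc:
    # Dimensions listed above.
    nc.createDimension("numSeqs", len(seqs))
    nc.createDimension("numTimesteps", sum(len(s) for s in seqs))
    nc.createDimension("inputPattSize", patt_size)
    nc.createDimension("numLabels", num_labels)
    nc.createDimension("maxLabelLength", max_label_len)
    nc.createDimension("maxTargStringLength", 5000)
    nc.createDimension("maxSeqTagLength", max_tag_len)

    # Feature matrix, per-sequence lengths and per-timestep class targets.
    nc.createVariable("inputs", "f4", ("numTimesteps", "inputPattSize"))[:] = np.vstack(seqs)
    nc.createVariable("seqLengths", "i4", ("numSeqs",))[:] = [len(s) for s in seqs]
    nc.createVariable("targetClasses", "i4", ("numTimesteps",))[:] = np.concatenate(targets)
    nc.createVariable("numTargetClasses", "i4").assignValue(num_labels)

    # Character variables holding the sequence tags and the class label names.
    tag_var = nc.createVariable("seqTags", "S1", ("numSeqs", "maxSeqTagLength"))
    tag_var[:] = stringtochar(np.array(tags, dtype="S%d" % max_tag_len))
    lab_var = nc.createVariable("labels", "S1", ("numLabels", "maxLabelLength"))
    lab_var[:] = stringtochar(np.array(class_names, dtype="S%d" % max_label_len))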

Tuesday 20 October 2015

Installing Currennt on Ubuntu 14.04

This post is all about how to install Currennt on Ubuntu 14.04. The next post will deal with how to write the data files that Currennt needs, as well as actually running Currennt. Currennt is a recurrent neural net library that is sped up using CUDA. Installing it is not as simple as it could be.

First step: grab the source code from here: currennt on sourceforge. I am using currennt-0.2-rc1.zip.

You'll first need to install cuda:

sudo apt-get update
sudo apt-get install cuda

You'll also have to install the netcdf library:

sudo apt-get install libnetcdf-dev

You should also install boost: sudo apt-get install libboost-all-dev. I have libboost1.55 installed already.

Now you can start the installation. Extract the Currennt zip file to a directory and cd to that directory. Now create a folder called build, cd into it, run cmake, and then make.

mkdir build
cd build
cmake ..
make

When I did this I got the error: nvcc fatal : Value 'compute_13' is not defined for option 'gpu-architecture'. This can be fixed by editing CMakeLists.txt: change compute_13 to compute_30, then rerun cmake and make.

The next make got a little bit further, but I got the error: error: namespace "thrust" has no member "transform_reduce". This one can be fixed by adding the line #include <thrust/transform_reduce.h> to the top of the file /currennt_lib/src/layers/Layer.hpp.

That was all I had to do this time. Previously I have had a heap of different problems when installing Currennt on older computers, but this time seemed relatively painless.

Monday 5 October 2015

Numerically Checking Neural Net Gradients

It is highly recommended by everyone who implements neural networks that you numerically check your gradient calculations. This is because it is very easy to introduce small errors into the back-propagation equations that make it seem like the network is working, even though it doesn't do quite as well as it should. A simple numerical check can let you know that all your numbers are right.

This tutorial will go through the process of how to numerically compute the derivatives for a simple 1 hidden layer neural network. The exact same process is used for checking the derivatives of e.g. LSTMs, which are much more complicated.

The main thing to remember is the error measure you are using for your neural network. If you are using Mean Squared Error then the error \(E = \dfrac{1}{N}\sum^N_{n=1} \sum^K_{k=1} (y_{nk} - x_{nk})^2 \) is what we want to minimise. In the previous equation, \(y_{nk}\) is the desired label and \(x_{nk}\) is the network output for the \(n\)-th input. \(N\) is the minibatch size; if you are just putting in a single input at a time then \(N\) is one. This single number \(E\) tells us how good our network is at predicting the output: the lower the value of \(E\), the better. Our aim is to adjust the weights so that \(E\) gets smaller.

For a neural net with 1 hidden layer we have 2 sets of weights and 2 sets of biases. The output is computed as follows:

\( a = \sigma (i*W_1 + b_1) \)
\( x = \sigma ( a*W_2 + b_2) \)

\(i\) is our input vector, \(a\) is the hidden layer activation, and \(x\) is the network output.

The way to numerically check the gradient is to pick one of the weights, e.g. element (1,1) of \(W_1\), and to add and subtract a small number to/from it, e.g. 0.0001. This way we have \(W_1^+\) and \(W_1^-\) (note that only element (1,1) is changed; all the other weights stay the same for now). We now have to compute \(x^+\) and \(x^-\) using the new, slightly modified weights. Then we can compute \(E^+\) and \(E^-\) using \(x^+\) and \(x^-\). The derivative of \(E\) with respect to that weight is then approximately \((E^+ - E^-)/(2 \times 0.0001)\).

Now that we have the derivative of \(E\) with respect to weight (1,1) of \(W_1\), we have to do it for all the other weights as well. This follows the exact same procedure; we just add/subtract a small number from a different weight. The final matrix of derivatives should closely match the gradient as calculated by back-propagation. This python code: nn_2layer.py implements both back-propagation and the numerical check for a simple single hidden layer nn.
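
The linked nn_2layer.py is not reproduced here, but a minimal Python sketch of the central-difference check (assuming sigmoid activations and the mean squared error defined above) could look like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(i, W1, b1, W2, b2):
    a = sigmoid(i.dot(W1) + b1)      # hidden layer activation
    return sigmoid(a.dot(W2) + b2)   # network output x

def mse(y, x):
    return np.mean(np.sum((y - x) ** 2, axis=1))

def numerical_grad_W1(i, y, W1, b1, W2, b2, eps=1e-4):
    # Central-difference estimate of dE/dW1, one element at a time.
    grad = np.zeros_like(W1)
    for r in range(W1.shape[0]):
        for c in range(W1.shape[1]):
            W_plus, W_minus = W1.copy(), W1.copy()
            W_plus[r, c] += eps
            W_minus[r, c] -= eps
            E_plus = mse(y, forward(i, W_plus, b1, W2, b2))
            E_minus = mse(y, forward(i, W_minus, b1, W2, b2))
            grad[r, c] = (E_plus - E_minus) / (2 * eps)
    return grad

# Tiny example: 5 inputs with 3 features, 4 hidden units, 2 outputs.
rng = np.random.RandomState(0)
i, y = rng.randn(5, 3), rng.rand(5, 2)
W1, b1 = rng.randn(3, 4) * 0.1, np.zeros(4)
W2, b2 = rng.randn(4, 2) * 0.1, np.zeros(2)
print(numerical_grad_W1(i, y, W1, b1, W2, b2))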

Thursday 20 August 2015

Caffe Tutorial - Part 1 - basic feedforward nets

Caffe is a deep learning framework often used for image recognition projects, but I've found that the tutorials aren't that great, or at least I have trouble finding solutions to the problems I have. So I've decided to write a sequence of tutorials to lay out information that I can revise in the future and that others might use as well.

The first tutorial (this one) will be a basic single hidden layer feed-forward neural net trained on the Fisher iris dataset. I'll use MATLAB to write the feature files that Caffe uses; for this tutorial we'll use the HDF5Data layer type because HDF5 files are easy to write.

Also try this lenet tutorial as it covers some things that this page does not.

Writing the Data Files

A short MATLAB script writes the train and test files; each file must contain both the features and the labels. Other languages can be used, e.g. Python has a library for writing HDF5 files as well. Note that fisheriris is built in to MATLAB. When we load it the features are 150x4; when they get written to the HDF5 files they have to be transposed (4x150). The labels come as a cell array of strings; we need to convert them to integers starting at 0. We randomly choose 100 samples for the train set and 50 for the test set.
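
A rough Python equivalent of that script (my sketch, assuming h5py and scikit-learn are available) is below. The dataset names "data" and "label" are what Caffe's HDF5Data layer reads, and the MATLAB transpose issue does not arise here because h5py writes arrays in C order.

import os
import h5py
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
features = iris.data.astype("float32")   # 150 x 4
labels = iris.target.astype("float32")   # class indices already start at 0

# Random 100/50 train/test split.
rng = np.random.RandomState(0)
idx = rng.permutation(len(labels))
train_idx, test_idx = idx[:100], idx[100:]

def write_hdf5(path, X, y):
    # Caffe's HDF5Data layer expects datasets named "data" and "label".
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=X)
        f.create_dataset("label", data=y)

os.makedirs("./files", exist_ok=True)
write_hdf5("./files/iris_train.hdf5", features[train_idx], labels[train_idx])
write_hdf5("./files/iris_test.hdf5", features[test_idx], labels[test_idx])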

We use InnerProduct layers as these are fully connected feedforward layers, and they are initialised using the 'xavier' algorithm, which is explained in this other blog post.

We now have 2 files: iris_train.hdf5 containing the train features and labels, and iris_test.hdf5 containing the test features and labels.

Specifying the Network

At this point I recommend reading a bit about layers, nets and blobs; you can also look at the types of layers available. Our network will have a single hidden layer with 20 nodes. The iris dataset has 3 classes, so our output layer will have 3 nodes. We use the rectified linear unit (ReLU) activation, and softmax outputs.

We are doing training and testing in this network, which is why there are 2 data layers, one with phase: TRAIN and one with phase: TEST. One annoying bit is that we can't directly specify the location of the HDF5 file; we have to specify the location of a txt file that contains as its first line the location of the HDF5 file. So "./files/train_file_location.txt", which appears in the network specification, actually has as its only contents:

./files/iris_train.hdf5

which is the location the matlab script saved the hdf5 files to.

At the end of the network specification, we have a SoftmaxWithLoss layer as well as a Softmax layer. This is because we want to see the accuracy, and the loss layer can't be used with the accuracy layer. We pass the softmax outputs to the Accuracy layer so we can see how well the network performs. Note that accuracy is only computed during phase: TEST.
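
The original prototxt is not shown here, but a sketch consistent with the description above (two HDF5Data layers, an InnerProduct hidden layer of 20 units with ReLU, a 3-unit output layer, SoftmaxWithLoss for training, and Softmax feeding an Accuracy layer at test time) might look roughly like this; the batch sizes and the test txt file name are my guesses rather than the post's values.

name: "iris_net"
layer {
  name: "train_data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  hdf5_data_param { source: "./files/train_file_location.txt" batch_size: 10 }
}
layer {
  name: "test_data"
  type: "HDF5Data"
  top: "data"
  top: "label"
  include { phase: TEST }
  hdf5_data_param { source: "./files/test_file_location.txt" batch_size: 10 }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "data"
  top: "ip1"
  inner_product_param { num_output: 20 weight_filler { type: "xavier" } }
}
layer { name: "relu1" type: "ReLU" bottom: "ip1" top: "ip1" }
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  inner_product_param { num_output: 3 weight_filler { type: "xavier" } }
}
layer { name: "loss" type: "SoftmaxWithLoss" bottom: "ip2" bottom: "label" top: "loss" }
layer { name: "prob" type: "Softmax" bottom: "ip2" top: "prob" include { phase: TEST } }
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "prob"
  bottom: "label"
  top: "accuracy"
  include { phase: TEST }
}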

Specifying the Solver

The solver is pretty standard.
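
For completeness, a minimal solver along these lines would do; the net path and the numeric values here are illustrative guesses, not the post's exact settings.

net: "./iris_caffe_network.prototxt"
test_iter: 5          # 5 batches of 10 covers the 50 test samples
test_interval: 500
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
lr_policy: "fixed"
display: 500
max_iter: 10000
snapshot: 10000
snapshot_prefix: "iris"
solver_mode: CPU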

Running Everything

Now you probably want to run it; just type the following into a terminal from the correct directory:


/path/to/install/caffe-master/build/tools/caffe train --solver=./iris_caffe_solver.prototxt

Tuesday 3 February 2015

A look at the second Feynman cipher - fractionated morse [part 4]

The Feynman ciphers are a set of 3 ciphers given to Richard Feynman; the first has been solved, but the other two remain unsolved. This is part 4 of a discussion about the second cipher.

I have talked about this cipher three times previously, in Part 1, Part 2 and Part 3. In part 1 I was playing around with vigenere-type ciphers; in part 2 I decided to step back a little bit and figure out what the ciphertext is not, i.e. identify the cipher algorithms it can't be. Part 3 looked at the 3x3 Hill cipher and some of its variants.

So far I have been working my way through the cipher types I think it may be. In part 2 I identified Fractionated Morse as a candidate cipher. It turns out breaking Fractionated Morse is not too difficult; the procedure is similar to breaking standard substitution ciphers. I will start with a short description of Fractionated Morse, then we will look at how to break it.

Description of Fractionated Morse

There are many descriptions of the algorithm around; the ACA has a good description, and I will reproduce a part of G. Worley's description. It assumes you know Morse code.


If Alice wanted to send Bob the
message "I love you." she would tap over the telegraph:

Pt: I love you.
Et: ..xx.-..x---x...-x.xx-.--x---x..-xx.-.-.-

After encoding the plaintext in Morse Code, the encoded text is broken into
n-graphs, usually trigraphs because there are 26 Morse Code trigraphs (you never
have three dividers in a row, so you don't count that one). Here's a sample
trigraph mapping (using a keyword):

Keyword: LOOKINGGLASS

L O K I N G A S B C D E F H J M P Q R T U V W X Y Z
. . . . . . . . . - - - - - - - - - x x x x x x x x
. . . - - - x x x . . . - - - x x x . . . - - - x x
. - x . - x . - x . - x . - x . - x . - x . - x . -

So, breaking our message into trigraphs we get:

. x . - x . . - - - . x - .
. . . - . - x . x - . x . -
x - x - . x x - - x - . - x
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ct: K T K H R G B D P J O Y D G

Cracking this is much harder because the dividers are obfuscated. The only way
to attack this is to do a statistical analysis for common Morse Code trigraphs.

The cipher assumes 'x' is placed between each letter and 'xx' is placed between each word.
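
To make the mapping concrete, here is a small Python sketch (mine, not from the quoted description) of the encryption step. The key string is the LOOKINGGLASS mixed alphabet from the table above, and punctuation is simply dropped rather than encoded.

MORSE = {'A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.', 'F': '..-.',
         'G': '--.', 'H': '....', 'I': '..', 'J': '.---', 'K': '-.-', 'L': '.-..',
         'M': '--', 'N': '-.', 'O': '---', 'P': '.--.', 'Q': '--.-', 'R': '.-.',
         'S': '...', 'T': '-', 'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-',
         'Y': '-.--', 'Z': '--..'}

# The 26 possible trigraphs, in the column order shown above (xxx never occurs).
TRIGRAPHS = [a + b + c for a in '.-x' for b in '.-x' for c in '.-x' if a + b + c != 'xxx']

def encrypt(plaintext, key):
    # key is the 26-letter mixed alphabet, e.g. 'LOKINGASBCDEFHJMPQRTUVWXYZ'.
    words = [[c for c in w if c in MORSE] for w in plaintext.upper().split()]
    morse = 'xx'.join('x'.join(MORSE[c] for c in w) for w in words if w)
    morse += 'x' * (-len(morse) % 3)              # pad to a whole number of trigraphs
    table = dict(zip(TRIGRAPHS, key))
    return ''.join(table[morse[i:i + 3]] for i in range(0, len(morse), 3))

print(encrypt('I LOVE YOU', 'LOKINGASBCDEFHJMPQRTUVWXYZ'))  # KTKHRGBDPJO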

Cryptanalysis of Fractionated Morse

Let's look at the following Fractionated Morse table:


A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
. . . . . . . . . - - - - - - - - - x x x x x x x x
. . . - - - x x x . . . - - - x x x . . . - - - x x
. - x . - x . - x . - x . - x . - x . - x . - x . -

If you look at the table above, you can see that certain ciphertext letter combinations are not possible, e.g. RS, RT, ... RZ cannot occur because we can't have xxx in the plaintext Morse. Also IS, IT, ... IZ can't happen for the same reason. Other impossible combinations: CY, CZ, FY, FZ, OY, OZ etc. We also can't have e.g. ABD, because there are no Morse letters that go '.....-.-.'. You should be able to come up with many other impossible combinations. We can also use frequency information: since THE occurs very often in English, we expect 'xx-x....x.xx' = 'ZSCI' to be quite common in the ciphertext (with this particular key; if you look at the counts provided down below, we see ZSCI is actually the most common quadgram). This means certain keys will be very unlikely if they lead to impossible transitions, and others more likely if they lead to likely transitions.

The trick to breaking Fractionated Morse is that finding the key can be done completely independently of the English-to-Morse translation. Knowing the statistics of a particular Fractionated Morse key means we can just find the key that translates to it, without ever checking to see if the putative plaintext even decodes properly. The idea is to start with a random key, then continuously swap characters in the key, trying to make the ciphertext look like it has been enciphered with the key "ABCDEFGHIJKLMNOPQRSTUVWXYZ". Once this is done, we can just decrypt it. To do this we need quadgram statistics for a lot of English text that has been enciphered with the key "ABCDEFGHIJKLMNOPQRSTUVWXYZ". I did this for 40 million sentences, or around 1 billion characters. The results are linked below.
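
The linked solver is not reproduced here, but a stripped-down Python sketch of the hill climb (my illustration; QUADGRAM_LOG_PROBS stands in for the statistics file linked below, and the final Morse-to-English decode step is omitted) might look like this:

import random

# Stand-in for the real statistics: log-probabilities of quadgrams in a large amount
# of English enciphered with the key "ABCDEFGHIJKLMNOPQRSTUVWXYZ", e.g. {"ZSCI": -4.1, ...}.
QUADGRAM_LOG_PROBS = {}
FLOOR = -12.0          # score for quadgrams never seen in the training text

def score(text):
    # How much the text looks like identity-key Fractionated Morse output.
    return sum(QUADGRAM_LOG_PROBS.get(text[i:i + 4], FLOOR) for i in range(len(text) - 3))

def apply_key(ciphertext, key):
    # Re-map letters so that, if 'key' is right, the result matches the identity key.
    table = {k: chr(ord('A') + i) for i, k in enumerate(key)}
    return ''.join(table[c] for c in ciphertext)

def crack(ciphertext, restarts=20, iterations=10000):
    best_key, best_score = None, float('-inf')
    for _ in range(restarts):
        key = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
        random.shuffle(key)
        current = score(apply_key(ciphertext, key))
        for _ in range(iterations):
            a, b = random.sample(range(26), 2)
            key[a], key[b] = key[b], key[a]              # try swapping two key letters
            candidate = score(apply_key(ciphertext, key))
            if candidate >= current:
                current = candidate                      # keep the swap
            else:
                key[a], key[b] = key[b], key[a]          # undo it
        if current > best_score:
            best_key, best_score = ''.join(key), current
    return best_key, best_score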

Code for breaking Fractionated Morse ciphers is here, with the required statistics here. I have run the code on many possible routes through the second Feynman cipher, and nothing resulted in any good-looking plaintexts. I would expect it to work, since breaking length-100 Fractionated Morse ciphertexts is not too difficult using the code above, and F2 is 261 characters. This leads me to believe that F2 is not a Fractionated Morse cipher.

From here I'll try to write an efficient running key solver for vigenere, beaufort, variant and porta rules and see how that goes.