Pages

Thursday, 22 October 2015

Writing HDF5 files for Currennt

In a previous post I explained how to install currennt. This post will deal with writing the data files. The next post will be about actually running things. Currennt uses data written to HDF5 files, so whatever format your data is in you have to write it to a HDF5 file then use currennt to process it. This post will use matlab to generate the HDF5 files, but e.g. python could easily do it as well.

Recurrent neural networks predict things about sequences, so for this example we'll be using protein sequences as the example. Predicting things like phonemes for speech files is exactly the same. For each amino acid in a protein, we'll be predicting the secondary structure, which can be C, E, H or X. The training information will be a L by 20 matrix, this means the protein is of length L and each amino acid has 20 features (we'll use PSSM features, of which there are 20, extracted using PSI-BLAST.). I will just be using 400 training proteins and 100 test proteins, it is often beneficial to try things out on small datasets because you get the results quicker and can figure out problems quicker.

HDF5 files

Currennt needs the data written to HDF5 files, but it needs a heap of extra parameters written to the files as well as the data, this section will specify what you need to write to them, then I'll put some actual MATLAB code in the next section.

The data we'll be using is protein data, there will be 400 training proteins, each of length L (the sequences can be anywhere from about 20 to 2000 in length). There are 20 features per amino acid, this means each protein is represented by a matrix of size 20 by L, and there are 400 matrices.

These parameters are for classification, if you want something else like regression you may have to use different dimensions and variables e.g. numLabels only makes sense for classification. HDF5 files have dimensions and variables, and they need to be specified separately. The dimensions that currennt needs specified in the HDF5 files are as follows:

numTimesteps: the total number of amino acids if you lined 
      up all the proteins end to end
inputPattSize: the number of features per amino acid, in this case 20.
numSeqs: the number of proteins
numLabels: the number of classes we want to predict, in this case 4
maxLabelLength: the length of the longest string used to hold 
      the class label name
maxTargStringLength: to be honest I'm not sure about this one, I 
      just set it to be 5000 and things seem to work
maxSeqTagLength: this is another one I'm not sure about. I set 
      it to 800 and it seems to work.

Now we have specified the dimensions, now we want to specify the variables i.e. the actual data we will be using for features and labels. They are specified like so:

seqTags,          size: maxSeqTagLength x numSeqs
numTargetClasses: size: 1
inputs,           size: inputPattSize x numTimesteps
seqLengths,       size: numSeqs
targetClasses,    size: numTimesteps
labels,           size: maxLabelLength x numLabels

You then just have to write the relevant data to each of the variables and run currennt. Some MATLAB code that does all this stuff is provided below. Note that this matlab file will just create one file for e.g. training. If you want a separate dataset for e.g. testing (strongly recommended) then you will need to run the code a second time changing the file names and protein dataset.

2 comments:

  1. Hi.
    This post really helpful for me.
    However, I cannot understand about the structure of input file.
    I guess that your input file is applied with "protein_dataset".
    If I am right, can you share this file for me?

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete