jimblog: Writing HDF5 files for Currennt

In a previous post I explained how to install currennt. This post deals with writing the data files, and the next post will be about actually running things. Currennt uses data written to HDF5 files, so whatever format your data is in, you first have to write it to an HDF5 file and then use currennt to process it. This post uses MATLAB to generate the files (through its netcdf functions, which is why the output file has a .nc extension), but Python, for example, could do it just as easily.

Recurrent neural networks predict things about sequences, and for this example we'll be using protein sequences. Predicting things like phonemes for speech files works in exactly the same way. For each amino acid in a protein we'll be predicting the secondary structure, which can be C, E, H or X. The training information for one protein is an L by 20 matrix: the protein has length L and each amino acid has 20 features (we'll use PSSM features, of which there are 20, extracted using PSI-BLAST). I will just be using 400 training proteins and 100 test proteins. It is often beneficial to try things out on small datasets, because you get results sooner and can track down problems sooner.

HDF5 files

Currennt needs the data written to HDF5 files, but it also needs a heap of extra parameters written to the files alongside the data. This section specifies what you need to write, and the actual MATLAB code follows further down.

The data we'll be using is protein data. There will be 400 training proteins, each of some length L (the sequences can be anywhere from about 20 to 2000 in length). There are 20 features per amino acid, so once the L by 20 feature matrix is transposed, each protein is represented by a matrix of size 20 by L, and there are 400 such matrices.
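To make the layout concrete, here is a minimal Python sketch (Python rather than MATLAB, since either can do the job) of lining the proteins up end to end. The function name `stack_sequences` and the toy data are made up for illustration; each protein is assumed to be a list of L rows with one row of features per amino acid, i.e. the L by 20 orientation.

```python
def stack_sequences(proteins):
    """Concatenate per-protein feature matrices end to end.

    `proteins` is a list of matrices; each matrix is a list of L rows,
    one row of features per amino acid (hypothetical in-memory layout).
    Returns (rows, seq_lengths).
    """
    rows = []
    seq_lengths = []
    for matrix in proteins:
        seq_lengths.append(len(matrix))           # L for this protein
        rows.extend(list(row) for row in matrix)  # append its L rows
    return rows, seq_lengths


# two toy "proteins" of length 2 and 3, with 20 made-up features each
toy = [
    [[0.0] * 20 for _ in range(2)],
    [[1.0] * 20 for _ in range(3)],
]
rows, seq_lengths = stack_sequences(toy)
print(len(rows), len(rows[0]), seq_lengths)  # 5 20 [2, 3]
```

The sequence lengths are kept alongside the stacked rows because, once everything is concatenated, they are the only record of where one protein ends and the next begins.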

These parameters are for classification. If you want something else, like regression, you may have to use different dimensions and variables; numLabels, for example, only makes sense for classification. HDF5 files have dimensions and variables, and they need to be specified separately. The dimensions that currennt needs specified in the HDF5 files are as follows:

numTimesteps: the total number of amino acids if you lined 
      up all the proteins end to end
inputPattSize: the number of features per amino acid, in this case 20.
numSeqs: the number of proteins
numLabels: the number of classes we want to predict, in this case 4
maxLabelLength: the length of the longest string used to hold 
      the class label name
maxTargStringLength: to be honest I'm not sure about this one, I 
      just set it to be 5000 and things seem to work
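maxSeqTagLength: this is another one I'm not sure about. Going by 
      the name and the seqTags variable below, it is presumably the 
      longest sequence tag allowed. I set it to 800 and it seems to work.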
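As a rough illustration, the dimension values above can be derived from the data like this. It is a Python sketch under stated assumptions: the helper name `currennt_dimensions` is hypothetical, and the two dimensions I'm unsure about are simply passed through with the values used in this post (5000 and 800).

```python
def currennt_dimensions(seq_lengths, num_features, label_names,
                        max_targ_string_length=5000,
                        max_seq_tag_length=800):
    """Build the dimension table described above (illustrative helper)."""
    return {
        "numTimesteps": sum(seq_lengths),    # all proteins end to end
        "inputPattSize": num_features,       # features per amino acid
        "numSeqs": len(seq_lengths),         # number of proteins
        "numLabels": len(label_names),       # number of classes
        "maxLabelLength": max(len(name) for name in label_names),
        # the next two are not derived from the data, just the values
        # that happened to work in this post
        "maxTargStringLength": max_targ_string_length,
        "maxSeqTagLength": max_seq_tag_length,
    }


# toy example: two proteins of length 2 and 3, 20 features, 4 classes
dims = currennt_dimensions([2, 3], 20, ["C", "E", "H", "X"])
for name, value in dims.items():
    print(name, value)
```

With the real training set, `seq_lengths` would hold 400 entries and numTimesteps would be their sum.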

Now that we have specified the dimensions, we want to specify the variables, i.e. the actual data we will be using for features and labels. They are specified like so:


seqTags,          size: maxSeqTagLength x numSeqs
numTargetClasses, size: 1
inputs,           size: inputPattSize x numTimesteps
seqLengths,       size: numSeqs
targetClasses,    size: numTimesteps
labels,           size: maxLabelLength x numLabels
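Before writing anything to disk it can help to check that the data agrees with these sizes. The following Python sketch does that for in-memory lists. The function `check_variables` is hypothetical, and it assumes the inputs are held with one row per timestep and that class indices run from 0 to numLabels-1, as in this post.

```python
def check_variables(dims, inputs, seq_lengths, target_classes, labels):
    """Return a list of problems; an empty list means the sizes agree."""
    problems = []
    if len(seq_lengths) != dims["numSeqs"]:
        problems.append("seqLengths needs numSeqs entries")
    if sum(seq_lengths) != dims["numTimesteps"]:
        problems.append("seqLengths must sum to numTimesteps")
    if len(inputs) != dims["numTimesteps"]:
        problems.append("inputs needs one entry per timestep")
    if any(len(row) != dims["inputPattSize"] for row in inputs):
        problems.append("every input row needs inputPattSize features")
    if len(target_classes) != dims["numTimesteps"]:
        problems.append("targetClasses needs one entry per timestep")
    if any(not 0 <= t < dims["numLabels"] for t in target_classes):
        problems.append("targetClasses must lie in 0..numLabels-1")
    if len(labels) != dims["numLabels"]:
        problems.append("labels needs numLabels entries")
    if any(len(name) > dims["maxLabelLength"] for name in labels):
        problems.append("a label name is longer than maxLabelLength")
    return problems


# toy data: 2 proteins (lengths 2 and 3), 20 features, 4 classes
dims = {"numTimesteps": 5, "inputPattSize": 20, "numSeqs": 2,
        "numLabels": 4, "maxLabelLength": 1}
inputs = [[0.0] * 20 for _ in range(5)]
problems = check_variables(dims, inputs, [2, 3], [0, 1, 2, 3, 0],
                           ["C", "E", "H", "X"])
print(problems)  # [] when everything is consistent
```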

You then just have to write the relevant data to each of the variables and run currennt. Some MATLAB code that does all this is provided below. Note that this MATLAB file creates just one file, e.g. for training. If you want a separate dataset, e.g. for testing (strongly recommended), you will need to run the code a second time with different file names and a different protein dataset.

	close all;
	clear all;

	% load the protein dataset, this is an array of struct that has fields .pssm and .ss
	% which are our features and labels respectively. pssm is a L by 20 array of features.
	% ss is a L by 1 array of characters C, E, H or X which are our 4 classes.
	% the actual variable that appears is called 'combined'
	load protein_dataset;
	% this is the name of the HDF5 file we will be writing to
	fname = 'proteins_train_currennt.nc';

	% compute the total length of all the proteins combined
	N = 0;
	for i = 1:length(combined)
		N = N + size(combined(i).pssm,1);
	end

	D = 20; % the number of features per amino acid
	% named 'inputs' (not 'input') so it matches the use further down
	% and does not shadow MATLAB's built-in input function
	inputs = zeros(D,N); % our array of features (all proteins combined)
	target = zeros(N,1); % our array of labels
	upto = 1; % this keeps track of where we are in the inputs array
	seqlens = zeros(1,length(combined)); % array of sequence lengths
	for i = 1:length(combined)
		L = size(combined(i).pssm,1);
		inputs(:,upto:upto+L-1) = combined(i).pssm'; % L by 20 becomes 20 by L
		temp = combined(i).ss';
		% our labels are characters, they have to be converted to integers 0-3
		temp(temp=='C') = 0;
		temp(temp=='E') = 1;
		temp(temp=='H') = 2;
		temp(temp=='X') = 3;
		seqlens(i) = L;
		target(upto:upto+L-1) = temp;
		upto = upto + L;
	end

	% now we actually start writing the file using MATLAB's built in netcdf functionality
	ncid = netcdf.create(fname,'64BIT_OFFSET');
	% first the dimensions
	dimid1 = netcdf.defDim(ncid,'numTimesteps',N);
	dimid2 = netcdf.defDim(ncid,'inputPattSize',size(inputs,1));
	dimid3 = netcdf.defDim(ncid,'numSeqs',length(combined));
	dimid4 = netcdf.defDim(ncid,'numLabels',4);
	dimid5 = netcdf.defDim(ncid,'maxLabelLength',1);
	dimid6 = netcdf.defDim(ncid,'maxTargStringLength',5000);
	dimid7 = netcdf.defDim(ncid,'maxSeqTagLength',800);
	dimid8 = netcdf.defDim(ncid,'one',1);

	% then the variables; seqTags is defined here but no data is written to it
	varid_tags = netcdf.defVar(ncid,'seqTags','char',[dimid7 dimid3]);
	varid_ntc = netcdf.defVar(ncid,'numTargetClasses','int',dimid8);
	varid1 = netcdf.defVar(ncid,'inputs','double',[dimid2 dimid1]);
	varid2 = netcdf.defVar(ncid,'seqLengths','int',dimid3);
	varid3 = netcdf.defVar(ncid,'targetClasses','double',dimid1);
	varid4 = netcdf.defVar(ncid,'labels','char',[dimid5 dimid4]);

	% leave define mode, then write the data to each variable
	netcdf.endDef(ncid);
	netcdf.putVar(ncid,varid1,double(inputs));
	netcdf.putVar(ncid,varid2,seqlens);
	netcdf.putVar(ncid,varid_ntc,4); % numTargetClasses
	netcdf.putVar(ncid,varid3,int32(target));
	netcdf.putVar(ncid,varid4,['C';'E';'H';'X']);
	netcdf.close(ncid)
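The label conversion step in the MATLAB code above (characters C, E, H, X to integers 0-3) can be sketched in Python as follows. This is only an illustration of that one step, not a full port of the script; the names `LABEL_NAMES` and `encode_labels` are made up here, and the class order is assumed to match the order written to the labels variable.

```python
LABEL_NAMES = ["C", "E", "H", "X"]  # same order as the 'labels' variable


def encode_labels(ss, label_names=LABEL_NAMES):
    """Map a secondary-structure string to integer class indices.

    Raises KeyError if the string contains a character that is not
    one of the known classes.
    """
    index = {name: i for i, name in enumerate(label_names)}
    return [index[ch] for ch in ss]


print(encode_labels("CCEEHHX"))  # [0, 0, 1, 1, 2, 2, 3]
```

Using a lookup table instead of four separate assignments means an unexpected character fails loudly rather than slipping through as a stray value in targetClasses.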


2 comments:

WonGil Huh, 21 January 2016 at 23:35
Hi.
This post is really helpful for me.
However, I cannot understand the structure of the input file.
I guess that your input file is the one loaded as "protein_dataset".
If I am right, can you share this file with me?

Thursday, 22 October 2015
