Thursday 16 October 2014

Getting HHblits to produce profiles

HHblits was introduced in the paper "HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment".

First download the HHsuite. You will also need a database to search against; I used uniprot20, as it was the most recent at the time of writing this post.

Extract HHsuite and the databases. To run hhblits, you'll need a fasta file for each of the proteins you want. I happened to have individual files with the protein sequences in them, but hhblits needs proper fasta formatting, which basically just means adding a '>' line before the sequence. This can be done as follows:

for i in `ls /home/james/Desktop/new_pssm_and_ssp/pssm_new/EDD/infile/`;
do echo '>' > eddtrain/$i;
cat /home/james/Desktop/new_pssm_and_ssp/pssm_new/EDD/infile/$i >> eddtrain/$i;
done

For the script above to work, you'll need a directory called 'eddtrain'. The script reads every file in ...EDD/infile/ and creates a corresponding file in eddtrain/ with > as the first line and the protein sequence as the second line. To run hhblits on all of these files, use the following:

for i in `ls eddtrain/*.txt | sed 's-eddtrain/--'`; do 
echo $i; hhsuite-2.0.16-linux-x86_64/bin/hhblits -d uniprot20_2013_03/uniprot20_2013_03 -i eddtrain/$i -ohhm eddhmms/$i -n 4 -M first;
done

This will write an hmm for each input file into the eddhmms directory, which needs to exist before you run the loop.
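
Since hhblits expects a '>' header line, it can be worth checking the input files before starting a long run. The following is only a sketch: has_fasta_header is a hypothetical helper name of my own, not part of HHsuite, and it only looks at the first line of the file.

```shell
# Hypothetical helper (not part of HHsuite): succeed only if the first
# line of the given file starts with '>'.
has_fasta_header() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        ">"*) return 0 ;;   # header found
        *)    return 1 ;;   # empty file or no header
    esac
}

# Example use (commented out so that sourcing this file has no side effects):
# for f in eddtrain/*.txt; do
#     has_fasta_header "$f" || echo "no fasta header: $f"
# done
```

Any file reported by the loop still needs the '>' line added before it is passed to hhblits.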

For each position, the hmm files contain a set of 20 numbers (one per amino acid), which are transformed probabilities. They can be transformed back into probabilities using the following formula (note that '*' stands for infinity, i.e. p = 0):

\[p = 2^{(-N/1000)}\]
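
As a quick illustration of the formula, here is a small sketch that converts a single value with awk. The function name hhm_to_prob is my own, and the six-decimal output format is an arbitrary choice.

```shell
# Convert one value N from an hmm file into a probability, p = 2^(-N/1000).
# '*' stands for infinity, so it maps to a probability of 0.
# hhm_to_prob is a hypothetical name chosen for this example.
hhm_to_prob() {
    if [ "$1" = "*" ]; then
        echo "0.000000"
    else
        awk -v n="$1" 'BEGIN { printf "%.6f\n", 2 ^ (-n / 1000) }'
    fi
}
```

For example, hhm_to_prob 1000 prints 0.500000, and hhm_to_prob '*' prints 0.000000 (the '*' has to be quoted so the shell does not expand it).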

Possible errors

Error in hhblits: input file missing!

You need to add the -i flag, e.g. -i train/file.fasta

Error in hhblits: database missing (see -d)

You need to add the database name, e.g. -d uniprot20_2013_03/uniprot20_2013_03. Extract the downloaded database as you would a normal zipped file.

Error in hhblits: Could not open A3M database uniprot20_2013_03_a3m_db, No such file or directory (needed to construct result MSA)

In this case you have given the database directory itself. You need to provide the path to the database inside the directory, i.e. the name that the database files share, e.g. -d uniprot20_2013_03/uniprot20_2013_03. It has to be in the correct format; see the previous error.
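
The last two errors can be caught before hhblits is started. This is a rough sketch that assumes hhblits looks for a file named after the -d value plus '_a3m_db', which is what the error message above suggests for this version; other versions may name their files differently. check_hh_db is a hypothetical helper, not an HHsuite command.

```shell
# Hypothetical helper: check that the value meant for -d is a database
# prefix rather than the directory that holds the database.
# Assumption: hhblits opens "<prefix>_a3m_db" (taken from the error text).
check_hh_db() {
    prefix="$1"
    if [ -d "$prefix" ]; then
        echo "'$prefix' is a directory; pass the file prefix inside it" >&2
        return 1
    fi
    if [ ! -e "${prefix}_a3m_db" ]; then
        echo "missing ${prefix}_a3m_db" >&2
        return 1
    fi
    return 0
}
```

With the layout used in this post, check_hh_db uniprot20_2013_03 should fail and check_hh_db uniprot20_2013_03/uniprot20_2013_03 should succeed.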

Getting SPINE-X to produce Protein Secondary Structure

SPINE-X is a tool that takes PSSM files as input and predicts the secondary structure of the protein. It is roughly 80% accurate. It is distributed as spineXpublic.tgz.

Extract the tgz file, then cd into the directory. There should be bin, code, test and weights directories. If you don't already have one, you'll need a Fortran compiler; gfortran works fine. cd into the code directory and run the compile script. This should create all the executables required.

The next caveat is that SPINE-X only takes pssm files with the '.mat' extension. If your pssm files don't have this extension, you'll need to rename them. For example, when I generated my PSSM files, I created them with a '.txt' extension. To change them all at once, you can do the following:

for i in *.txt; do mv "$i" "${i%.txt}.mat"; done

To run the script above, make sure you are in the directory with all the pssm files. It will change the extension of all the '.txt' files in that directory.
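
If you need to do this for several directories, the same idea can be wrapped in a function. This is a sketch with a made-up name (rename_ext); it only replaces the trailing extension, so a 'txt' that appears elsewhere in a file name is left alone.

```shell
# Hypothetical helper: rename every DIR/*.OLD file to DIR/*.NEW.
# Usage: rename_ext DIR OLD NEW   (extensions given without the dot)
rename_ext() {
    dir="$1"; old="$2"; new="$3"
    for f in "$dir"/*."$old"; do
        [ -e "$f" ] || continue        # the glob matched nothing
        mv -- "$f" "${f%.$old}.$new"   # strip the old extension, add the new one
    done
}
```

For the case in this post that would be rename_ext pssm txt mat, run from the directory above pssm/.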

Now move up one directory and do the following ('pssm' is the directory containing the pssm files that we just renamed):

ls pssm | sed 's/\.mat$//' > list.txt

This will create a list of all the pssm files without their extensions. SPINE-X automatically adds '.mat' to them. Now we can finally run SPINE-X itself:

/home/james/Downloads/spineXpublic/ list.txt pssm/ 

The file being run is a Perl script from the spineXpublic.tgz we downloaded originally. 'list.txt' is the file list we just created, and 'pssm/' is the directory containing all the pssm files renamed to have a '.mat' extension. This procedure will create a new directory called 'spXout/' which will contain all the predicted secondary structure files.
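
Because SPINE-X adds '.mat' to every name in the list, a stray entry or a file that was not renamed is easy to miss. The sketch below checks the list against the directory; check_pssm_list is a hypothetical helper of mine and not part of SPINE-X.

```shell
# Hypothetical helper: report every name in LIST that has no DIR/NAME.mat file.
# Returns 0 if all files are present, 1 otherwise.
check_pssm_list() {
    list="$1"; dir="$2"; missing=0
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue     # skip blank lines
        if [ ! -f "$dir/$name.mat" ]; then
            echo "missing: $dir/$name.mat" >&2
            missing=1
        fi
    done < "$list"
    return "$missing"
}
```

With the names used above, the call would be check_pssm_list list.txt pssm.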

Getting PSI-BLAST to create PSSMs

This post will be a reference on how to generate PSSM profiles for a list of proteins using NCBI's PSI-BLAST.

First of all, you'll need the PSI-BLAST tool itself and a database, e.g. the nr dataset. For nr, just download all the numbered files (nr.00.tar.gz through nr.26.tar.gz) and extract them all to a folder somewhere.

Let's say you have a text file called prots.txt containing protein sequences, one per line.

PSI-BLAST needs these as separate input files, so we need to create a fasta file for each one. We will do this at the same time as running PSI-BLAST. I will assume we extracted the nr dataset into a directory called nr/. To run PSI-BLAST, I move into the nr/ directory, which ensures PSI-BLAST can find the dataset. From there we run the following script (e.g. put it in a shell script file and execute it):

c=1  # counter used to number the per-protein files
for i in `cat prots.txt`
do
echo $c $i
echo $i > ~/lab_work/nr_db/train/train${c}.txt
../ncbi-blast-2.2.26+-src/c++/GCC461-Release64/bin/psiblast -query ~/lab_work/nr_db/train/train${c}.txt -db nr -out ~/lab_work/nr_db/out/out${c}.txt -num_iterations 3 -out_ascii_pssm ~/lab_work/nr_db/pssm/pssm${c}.txt -inclusion_ethresh 0.001
c=`expr $c + 1`
done

The script above outputs each line of prots.txt into a file called trainN.txt, where N is a number that is incremented for each protein. You'll have to make sure the directory you want to put these train files in exists. There will also need to be directories called 'out' and 'pssm'. These will be filled with PSI-BLAST outputs and the pssm files respectively.
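
The splitting step can also be done on its own, before any PSI-BLAST run, which makes it easier to check that the input files look right. This is a sketch: split_sequences is a hypothetical name, and starting the numbering at 1 is my assumption.

```shell
# Hypothetical helper: write each non-empty line of INFILE to
# OUTDIR/train1.txt, OUTDIR/train2.txt, ... creating OUTDIR if needed.
split_sequences() {
    infile="$1"; outdir="$2"
    mkdir -p "$outdir"
    c=1                                # assumption: numbering starts at 1
    while IFS= read -r seq || [ -n "$seq" ]; do
        [ -n "$seq" ] || continue      # skip blank lines
        printf '%s\n' "$seq" > "$outdir/train${c}.txt"
        c=$((c + 1))
    done < "$infile"
}
```

For the layout in this post, split_sequences prots.txt ~/lab_work/nr_db/train creates the train files, and mkdir -p ~/lab_work/nr_db/out ~/lab_work/nr_db/pssm creates the two output directories.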