Thursday, 16 October 2014

Getting HHblits to produce profiles

HHblits was introduced in the paper "HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment".

First, download HHsuite. You will also need a database; I used uniprot20, as it was the most recent at the time of writing this post.

Extract HHsuite and the database. To run hhblits, you'll need a fasta file for each protein you want a profile for. I happened to have individual files with one protein sequence in each, but hhblits needs proper fasta formatting, which basically just means adding a header line starting with '>' before the sequence. This can be done as follows:

# Write a '>' header line followed by the sequence into eddtrain/<name>.
for f in /home/james/Desktop/new_pssm_and_ssp/pssm_new/EDD/infile/*; do
    i=$(basename "$f")
    echo '>' > "eddtrain/$i"
    cat "$f" >> "eddtrain/$i"
done

For the script above to work, you'll need a directory called 'eddtrain'. The script reads every file in ...EDD/infile/ and creates a corresponding file in eddtrain/ with > as the first line and the protein sequence as the second line. To run hhblits on all of these files, use the following:

for f in eddtrain/*.txt; do
    i=$(basename "$f"); echo "$i"
    hhsuite-2.0.16-linux-x86_64/bin/hhblits -d uniprot20_2013_03/uniprot20_2013_03 -i "$f" -ohhm "eddhmms/$i" -n 4 -M first
done

This will write one hmm per input file to the eddhmms directory. As with eddtrain, create that directory before running the loop.
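After a long run it can be handy to see which inputs did not produce an hmm. The following is a minimal sketch, not part of HHsuite: missing_hmms is a hypothetical helper name, and it assumes (as in the loop above) that each output file has the same name as its input file.

```shell
# List input files that have no non-empty hmm in the output directory.
# missing_hmms is a hypothetical helper: $1 = input dir, $2 = output dir.
missing_hmms() {
    in_dir=$1
    out_dir=$2
    for f in "$in_dir"/*.txt; do
        [ -e "$f" ] || continue            # the glob matched nothing
        i=$(basename "$f")
        # -s is true only if the file exists and is not empty
        [ -s "$out_dir/$i" ] || echo "$i"
    done
}
```

Calling missing_hmms eddtrain eddhmms prints one file name per line for every sequence that still needs to be run, and prints nothing when all outputs are present.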

For each position, the hmm files contain a set of 20 numbers, one per amino acid, which are transformed probabilities. Each number N can be transformed back into a probability p using the following formula (note that '*' stands for infinity, i.e. p = 0):

\[p = 2^{(-N/1000)}\]
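As a quick check of the formula, here is a small sketch that converts individual values on the command line. hhm_to_prob is a hypothetical helper name, not something HHsuite provides, and it only handles single values passed as arguments rather than parsing a whole hmm file.

```shell
# Convert hmm-encoded values back to probabilities: p = 2^(-N/1000).
# hhm_to_prob is a hypothetical helper; '*' is treated as p = 0.
hhm_to_prob() {
    for n in "$@"; do
        if [ "$n" = "*" ]; then
            echo "0"
        else
            # awk has no base-2 power function here, so use exp(x * ln 2)
            awk -v n="$n" 'BEGIN { printf "%.4f\n", exp(-n / 1000 * log(2)) }'
        fi
    done
}
```

For example, hhm_to_prob 1000 2000 '*' prints 0.5000, 0.2500 and 0 on separate lines, since 2^(-1) = 0.5 and 2^(-2) = 0.25. Quote the '*' so the shell does not expand it to file names.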

Possible errors

Error in hhblits: input file missing!

You need to add the -i flag, e.g. -i train/file.fasta

Error in hhblits: database missing (see -d)

You need to add the database name, e.g. -d uniprot20_2013_03/uniprot20_2013_03. Download the database if you have not already, and extract it as you would a normal zipped file.

Error in hhblits: Could not open A3M database uniprot20_2013_03_a3m_db, No such file or directory (needed to construct result MSA)

In this case you have given only the database directory. You need to provide the path to the database inside the directory, i.e. the directory followed by the database name, in the same format as the example for the previous error.
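To catch this before starting a long run, the value intended for -d can be checked first. This is only a sketch: check_db_prefix is a hypothetical helper, and the only file it looks for is the one named in the error message above (the prefix followed by _a3m_db). A real database consists of more files than that.

```shell
# Check that a value intended for -d is a database prefix, not a directory.
# check_db_prefix is a hypothetical helper; the _a3m_db suffix is taken
# from the hhblits error message.
check_db_prefix() {
    prefix=${1%/}                          # drop a trailing slash, if any
    if [ -d "$prefix" ]; then
        echo "$prefix is a directory; pass the database prefix inside it"
        return 1
    elif [ ! -e "${prefix}_a3m_db" ]; then
        echo "missing: ${prefix}_a3m_db"
        return 1
    fi
    echo "ok"
}
```

With the layout used in this post, check_db_prefix uniprot20_2013_03 reports a directory, while check_db_prefix uniprot20_2013_03/uniprot20_2013_03 should print ok if the database was extracted there.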
