Pages

Thursday, 16 October 2014

Getting PSI-BLAST to create PSSMs

This post will be a reference on how to generate PSSM profiles for a list of proteins using NCBI's PSI-BLAST.

First of all, you'll need the PSI-BLAST tool itself from here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ and a database e.g. the nr dataset from here: ftp://ftp.ncbi.nlm.nih.gov/blast/db/, just download all the numbered files e.g. nr.00.tar.gz -> nr.26.tar.gz and extract them all to a folder somewhere.

Let's say you have a text file containing protein sequences called prots.txt containing something like:

SQETFSDLWKLLPEN
STSRHKKLMFK
SPLPSQAMDDLMLSPDDIEQWFTEDPGP
SHLKSKKGQSTSRHKKLMFK

PSI-BLAST needs these as separate input files, so we need to create a fasta file for each one. We will do this at the same time as running PSI-BLAST. I will assume we extracted the nr dataset into a directory called nr/. Now to run PSI-BLAST, I move in to the nr/ directory. This ensures PSI-BLAST can find the dataset. From here we run the following script (e.g. put it in a file called 'run.sh' and execute it):

c=1
for i in `cat prots.txt`
do
    echo $c $i
    echo $i > ~/lab_work/nr_db/train/train${c}.txt
    ../ncbi-blast-2.2.26+-src/c++/GCC461-Release64/bin/psiblast -query ~/lab_work/nr_db/train/train${c}.txt -db nr -out ~/lab_work/nr_db/out/out${c}.txt -num_iterations 3 -out_ascii_pssm ~/lab_work/nr_db/pssm/pssm${c}.txt -inclusion_ethresh 0.001
    c=`expr $c + 1`
done

The script above outputs each line of prots.txt into a file called trainN.txt, where N is a number that is incremented for each protein. You'll have to make sure the directory you want to put these train files in exists. There will also need to be directories called 'out' and 'pssm'. These will be filled with PSI-BLAST outputs and the pssm files respectively.

1 comment: