Tutorial for simple Blastp search for finding close human homologs
Blastp is the most frequently used method for searching a protein sequence database for similar sequences. The Blast algorithm will be dealt with in detail in the course MOL204 (Applied Bioinformatics I). Here you are asked to run the program with your sequence and with the search parameters given in this tutorial, searching only among human protein sequences (in the next question, Question 8, you will search the whole database). Feel free, however, to explore the program with other sequences and search parameters.
Consult the NCBI Blast home page for more information, documentation and help.
- Go to the NCBI Blast search page (Make sure that the blastp tab is selected in the tab heading).
- Copy the FASTA version of your protein sequence into the window under "Enter Query Sequence".
- From the "Database" pull-down menu under "Choose Search Set", select "UniProtKB/Swiss-Prot".
- In the "Organisms" window, type "homo sapiens" and select Homo sapiens (taxid:9606) when it appears.
- Further down on the page under "Scoring Parameters", choose the "Matrix" option "Blosum80". (This scoring matrix is the most suitable one when we are searching for closely related sequences.)
- Then hit the blue Blast button.
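Before pasting the query into the search page, it can help to check that the text really is a single, well-formed FASTA record. The following is a minimal sketch of such a check, not part of Blast itself: the function name `parse_single_fasta`, the set of accepted residue letters, and the example record are all assumptions made for illustration (the example sequence is a made-up placeholder, not a real protein).

```python
# Standard 20 amino-acid letters plus codes often accepted in protein FASTA:
# B, Z, J (ambiguous), X (unknown), U (selenocysteine), O (pyrrolysine), * (stop).
# This accepted set is an assumption for the sketch.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" + "BZJXUO*")


def parse_single_fasta(text):
    """Return (header, sequence) for a string holding one protein FASTA record."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must start with a '>' header line")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("expected exactly one record, found several headers")

    header = lines[0][1:].strip()
    # Sequence lines may be wrapped; join them and normalise to upper case.
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("no sequence found after the header line")

    unexpected = sorted(set(sequence) - VALID_RESIDUES)
    if unexpected:
        raise ValueError("unexpected characters in sequence: " + "".join(unexpected))
    return header, sequence


# Hypothetical placeholder record, used only to exercise the function.
example = ">example_protein hypothetical record\nMKTAYIAKQR\nQISFVKSHFS\n"
header, sequence = parse_single_fasta(example)
print(header, len(sequence))
```

If the function raises a `ValueError`, the pasted text is probably missing its `>` header line, contains more than one record, or includes characters that are not amino-acid codes.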
The search might take a while, so be patient. Once the results page appears, inspect the different sections as you scroll down.
- At the top you find a brief summary of the search.
- Then follows a graphic summary. The upper part shows matches to known protein domains and superfamilies. This information can be very useful, and you can click on the icons to learn more about them.
- Next follows a graphical representation of the list of hits from the search. Red lines show the most closely related sequences (including the sequence you searched with). Then follow shorter hits with lower search scores.
- Scroll further down until you come to the list of Descriptions. Here you will see the protein names, together with Max score, Total score, Query cover, E-value, and percent identity.
- The E-value is a measure of the number of hits you would expect to get with this score purely by chance. Hence, an E-value of 0.0 means that the expected number of chance hits is too small to display, so the hit is essentially certain to be genuine. For the purpose of this exercise, we shall consider only hits with E-values lower than 1e-04 (0.0001) as significant.
- Scroll further down until you get to the "Alignments" section. The first hit is the query sequence itself and is 100% identical. Skip this one and move on to the next hits with E<1e-04. Note the percentage identity and inspect the alignment. Do you see a large number of matching residues? Note also the full name/description text for each entry and check that it relates to the sequence you searched with.
- For the report, list the entries that you judge to be true homologs of your enzyme.
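The selection steps above can be sketched in a few lines of Python: drop the self-hit, keep only hits with E-values below 1e-04, and compute percent identity as identical positions divided by alignment length. This is only an illustration under stated assumptions: the `Hit` record, the function names, and the accessions and numbers in the example are hypothetical, not output from a real search, and the judgement of whether a hit is a true homolog still requires reading the description and the alignment.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-4  # significance threshold used in this tutorial


@dataclass
class Hit:
    accession: str           # hypothetical identifier for the database entry
    e_value: float
    percent_identity: float


def percent_identity(aligned_query, aligned_subject):
    """Identical positions as a percentage of the alignment length.

    Both strings are rows of one pairwise alignment, with '-' marking gaps;
    gap columns count toward the length but never as matches.
    """
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def significant_hits(hits, query_accession, cutoff=E_VALUE_CUTOFF):
    """Return non-self hits with E-value strictly below the cutoff, best first."""
    kept = [
        h for h in hits
        if h.accession != query_accession and h.e_value < cutoff
    ]
    return sorted(kept, key=lambda h: h.e_value)


# Made-up values, for illustration only.
hits = [
    Hit("QUERY", 0.0, 100.0),   # the query matching itself
    Hit("HIT_A", 3e-50, 62.0),
    Hit("HIT_B", 1e-4, 30.0),   # exactly at the cutoff, so not below it
    Hit("HIT_C", 0.5, 25.0),
]
for hit in significant_hits(hits, query_accession="QUERY"):
    print(hit.accession, hit.e_value, hit.percent_identity)
```

The comparison is strict (`<`), matching the tutorial's wording that E-values must be lower than 1e-04.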
Return to the Questions page.