Sequence Searches

Rein Aasland

Based on a previous exercise by Rein Aasland, Hans-Petter Kleppen, Angèle Abboud, and Pål Puntervoll
This is the H2016 version for the PC lab sessions on 29-30 September.
Last updated: 29-SEP-2016 by Rein

In this exercise you will work on the cyclin protein family. Cyclins are regulatory proteins that bind to and regulate cyclin-dependent kinases. They play an important role during cell cycle progression.

Note1: There will be no report to hand in, but please keep
  written notes with your answers to all the questions; we will
  inspect them during the session.
Note2: It would be really good if you all started by yourselves
  on the tasks for Q1, with the Blast searches and collecting sequences.
Note3: Rein has prepared a brief (16 min) video tutorial for this
  PC lab session, which deals with the use of the Blast servers at ExPASy
  and NCBI. Please bear with me, as this is the first time I have done this;
  the video quality is low and the sound is not good (I need a
  better microphone). But check it out if you can BEFORE you come to
  the PC lab (beware: it takes a long time to load):
Tip: If you use the Firefox browser in this exercise, you can easily
highlight multiple occurrences of a word on a page by using the find
function (Ctrl-F) and the associated "Highlight all" button.


The purpose of this exercise is to get acquainted with standard Blast and its more sensitive variant, PSI-Blast. We will also do a pairwise alignment using the Smith-Waterman algorithm.

Search for Human Cyclin-A1 Paralogs Using Blast (blastp)

Use the ExPASy blast server:

Screenshot exapzy.png

  1. Enter the UniProt identifier for Cyclin-A1: CCNA1_HUMAN, or paste in the sequence.
  2. Restrict the search to human sequences only (use taxonomic subsets).
  3. Search only in the SwissProt part of UniProt.

Inspect the results.


Count the number of sequences that are annotated as cyclins. Do not count isoforms (isoforms have "Isoform" written in their description; you can highlight them with the "Highlight all" button of the find function, as described in the tip above).
Note the identifier, accession number and E-value for the poorest scoring cyclin (or cyclin-like) sequence.


Before we move on, save all the sequences from the initial search with an E-value lower than 1e-10 in FASTA format (suggested file name: cyclin_blast_human.fasta). To do this:

Screenshot expazy2.png

  1. Click on the SELECT UP TO... button. Then mark the last sequence you want to include.
  2. Remove all the 'Isoforms' from the selection and exclude the query sequence.
  3. Select 'Retrieve sequences (FASTA format)', and click on the SUBMIT QUERY button.

This file of sequences will be used in the next PC lab session: Sequence Searching and Multiple Sequence Alignments.
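If you want to double-check the saved file before the next session, a few lines of Python can help. The sketch below is only an illustration: it assumes the file is ordinary FASTA with UniProt-style header lines, the function names read_fasta and check_selection are made up for this example, and the two records in it are invented placeholders, not real search results.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_selection(records, query_id="CCNA1_HUMAN"):
    """Return the headers that should not be in the saved file:
    isoforms and the query sequence itself."""
    problems = []
    for header, _ in records:
        if "isoform" in header.lower() or query_id in header:
            problems.append(header)
    return problems


# Invented records for illustration only (hypothetical identifiers).
example = """>sp|X00001|HYPA_HUMAN Hypothetical protein A
MKTAYIAK
QRQISFVK
>sp|X00002-2|HYPB_HUMAN Isoform 2 of Hypothetical protein B
MSHHWGYG
"""

records = read_fasta(example)
print(len(records), "sequences in the file")
print("Entries to remove:", check_selection(records))
```

To check your own file, replace the example text with the contents of the file, e.g. read_fasta(open("cyclin_blast_human.fasta").read()).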

The Effect of Scoring Matrices

ExPASy Blast automatically chooses a scoring matrix based on the sequence length (see ExPASy help). In the case of CCNA1_HUMAN, BLOSUM62 was chosen.

Repeat the search using BLOSUM80 and BLOSUM45.


Count the number of cyclin sequences found in the new searches. Is it different from the initial search? Which scoring matrix do you think is the most appropriate to use here? Briefly justify your answer.


Inspect the alignment of the query sequence and the poorest scoring cyclin from the initial search (BLOSUM62). Make note of the start and end positions for each sequence. Now, investigate the two sequences using SMART or Pfam. Note the start and end positions of the domains that are reported.


What is the correspondence between the parts of the sequences that were aligned and the domains reported by SMART or Pfam?


Smith-Waterman Pairwise Alignment

Go to the server at EBI. Align the two sequences by performing a pairwise alignment using the Smith-Waterman algorithm (EMBOSS: Water). Remember to choose local alignment. Note the start and end positions for each sequence.
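To see what the Smith-Waterman algorithm does behind the scenes, here is a minimal Python sketch. It is a simplification and not what EMBOSS Water runs: it assumes plain match/mismatch scores and a single linear gap penalty, whereas Water uses a substitution matrix and separate gap-open and gap-extend penalties. The function name and the default scores are chosen for illustration only, so the positions it reports will not necessarily match those from the server.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b with a linear gap penalty.

    Returns (score, (start_a, end_a), (start_b, end_b), aligned_a, aligned_b).
    Positions are 1-based and inclusive.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best-scoring cell until a cell with score 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    aligned_a = "".join(reversed(out_a))
    aligned_b = "".join(reversed(out_b))
    return best, (i + 1, best_pos[0]), (j + 1, best_pos[1]), aligned_a, aligned_b


# Toy sequences (not real proteins) that share one local region.
result = smith_waterman("AAAWHEELBBB", "CCWHEELDD")
print(result)
```

Note that only the shared region is reported, with its start and end positions in each sequence; the unrelated flanking residues are left out of the alignment.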


What is the correspondence between the parts of the sequences that were aligned using Smith-Waterman and the domains reported by SMART or Pfam? How do you explain the difference from the results obtained with Blast?


Perform a Sensitive PSI-Blast Search to Identify Remote Homologues of Cyclins

Position-Specific Iterative Blast or PSI-Blast is a more sensitive version of Blast. The first round of a PSI-Blast search (iteration 1) is a normal Blast search. The results from this search are aligned and turned into a profile. The next search round (iteration 2) is performed using the generated profile. In an iterative manner new hits can be found and included in the profile for further searches.
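The idea of turning aligned hits into a profile can be illustrated with a toy position-specific scoring matrix in Python. This is a heavily simplified sketch, not the procedure PSI-Blast actually uses: it assumes ungapped hits of equal length, a uniform background frequency for all 20 amino acids, and a flat pseudocount, and it ignores sequence weighting. The function names and the short peptides are invented for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one column of log-odds scores per alignment position."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Start every residue at the pseudocount so unseen residues
        # get a low score instead of minus infinity.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores for a segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Invented "hits" from a first search round (toy peptides, not real data).
hits = ["MRAIL", "MRSIL", "MKAIL", "MRAVL"]
pssm = build_pssm(hits)

for candidate in ("MRAIL", "MKAVL", "GGGGG"):
    print(candidate, round(score(pssm, candidate), 2))
```

A residue that is common at a given position in the hits scores high at that position only, which is what makes a profile search more sensitive than scoring every position with the same general matrix.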

Perform the first iteration of a PSI-Blast search with CCNA1_HUMAN (for this, use the FASTA version of the UniProt entry), again restricting the search to SwissProt and human sequences:

Ncbi blast annotated.png

  1. Paste in the sequence or the accession number. (The identifier will not work here.)
  2. Choose the right database.
  3. Restrict to human sequences.
  4. Select the PSI-BLAST algorithm.


How do these results compare to the results obtained in the normal Blast search (see the section above, "Search for Human Cyclin-A1 Paralogs Using Blast (blastp)")?


Perform the second iteration. Include only the sequences selected by default (with an E-value better than threshold).


What is the E-value of the poorest-scoring sequence annotated as cyclin that you identified after the second iteration of PSI-Blast? Why is it (radically) different from the E-value you observed in the regular Blast search?
What is the best scoring apparent non-cyclin sequence? Note identifier, accession number and E-value.


Investigate the apparent non-cyclin sequence using SMART or Pfam. Does the domain analysis suggest why this sequence was picked up in the PSI-Blast search?

Often, to verify that sequences obtained with PSI-Blast searches really are related to the query sequence, reciprocal searches are performed. The purpose of reciprocal searches is to find back the query sequence by starting with one (or more) of the hit sequences.

Perform a new (reciprocal) PSI-Blast search using the apparent non-cyclin sequence from Q7. In this case, search in the nr database, but still restrict the search to human sequences.


How many iterations are needed to pick up the original query sequence (CCNA1_HUMAN)?


Note: You will continue to work on this case in the next PC lab session,
so we advise you to keep all files.