Text Queries in Databases 2012

Rein Aasland & Pål Puntervoll

Note that this is the H2011-version, and the material is subject to revision

Last updated: 7-SEP-2011

Homework before PC lab session - Watch the [ENSEMBL tutorial] on YouTube.


The learning outcome from this first practical exercise is two-fold:

  1. You will become familiar with real bioinformatical data as they appear in some of the most widely used databases
  2. You will be able to search and access such databases using the web-based search engines Entrez, SRS and ENSEMBL

When you use sequence analysis web tools, you must always keep an eye on what the tools ask for and what information they provide.

Some of the questions in this exercise require that you use your "molecular biological competence and common sense" to find the correct answer.

Before you start - make your own MOL204 directory

Once you have logged into your UiB computer account, create a new directory called mol204, in which you should store all the files that you generate while you work on the FakLab computers. If you don't know how to do that in Windows, please ask the instructors. Familiarise yourselves with the environment and with how to navigate to and from this directory.

Important: as you work through the tutorial, take notes (write and use copy/paste) of what you do. This will make it easier for you to answer the specific questions listed at the end of each section. It is the answers to these questions that you (and your partner) shall hand in as report for this exercise.

Finding sequences with SRS and Entrez

SRS - the Sequence Retrieval System

Open a web browser (e.g. Firefox or Internet Explorer). Go to the SRS webpage. SRS can be used in more or less advanced modes. The default mode is the Quick Search mode. The Quick Search page can be used for simple text searches in a selected set of databases. For more advanced searches, the Standard Query Form or the Extended Query Form can be used. The latter two require that one or more libraries (databases/databanks) are selected from the Library Page before entering the query form page. Note that SRS has excellent help functionality (blue help icon and blue 'i' - information icons). In the following you will perform some simple searches to get acquainted with the Standard Query Form of SRS.

Searching in the EMBL database using SRS

First of all - try to get an overview of what kind of data and databases are available through SRS:

  • Click on the Library Page tab. Note that a short description of each database is available by moving the mouse pointer over the name.

Next, try to find all the different sequences in the EMBL database that have Rune Male stated as the author:

  • First, select the EMBL nucleotide sequence database. Then, go to the Standard Query Form by clicking on the red button on the left hand side. Perform the search.

Does it matter how you write the name (e.g. full name or last name only), and in which field you search?

  • Click on the information icon next to the search fields to find out what format the (author) names should be in. Use this information to perform a correct search for Rune Male sequences.

Now, find out if Rune Male has co-published any sequences with Rein Aasland.

Finally, find out how many sequences Rein Aasland has published in total, and how many of these are from zebrafish.

Play with the settings in the window called 'Create a view' at the bottom of the query page. See if you can get a view that shows both ID, description, sequence length and the sequence itself.

The Standard Query Form is the easiest to use, but you might want to try the Extended Query Form to get an impression of the possibilities for doing really fine-grained searches. For instance, you can try to find out how many zebrafish genomic DNA sequences are shorter than 50 nucleotides.
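A fine-grained query of this kind can be pictured as a filter over several record fields at once. The following is a minimal Python sketch, assuming a handful of made-up records. The field names (organism, molecule, length) and the accessions are hypothetical and do not reflect the actual EMBL or SRS field names.

```python
from dataclasses import dataclass


@dataclass
class Record:
    accession: str  # made-up identifier, for illustration only
    organism: str
    molecule: str   # e.g. "genomic DNA" or "mRNA"
    length: int     # sequence length in nucleotides


def short_genomic(records, organism, max_length):
    """Return genomic DNA records of one organism shorter than max_length."""
    return [
        r for r in records
        if r.organism == organism
        and r.molecule == "genomic DNA"
        and r.length < max_length
    ]


# Hypothetical example data, not real database entries.
records = [
    Record("X0001", "Danio rerio", "genomic DNA", 42),
    Record("X0002", "Danio rerio", "mRNA", 30),
    Record("X0003", "Danio rerio", "genomic DNA", 1200),
    Record("X0004", "Homo sapiens", "genomic DNA", 25),
]

hits = short_genomic(records, "Danio rerio", 50)
print(len(hits), [r.accession for r in hits])
```

Each condition in the filter corresponds to one field restriction in the query form, and all of them must hold for a record to be counted.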

Entrez - the NCBI text-based search and retrieval system

Use Entrez to find the sequence entries in the Nucleotide sequence database that have Rune Male as author. Remember to choose the 'nucleotide database'.

To find out how to write the author name in the right format click on help (left hand menu), and search (in the help section).

Try searching with only 'Male' as query and see how many hits you get. Repeat the search restricting to authors using the [AU] field tag.

The tags are explained in the Entrez help files (e.g. the PubMed help). Some useful tags when you search for sequences are AU (author), AD (affiliation), TA (journal title abbreviation) and DP (date of publication); these are also useful when using PubMed to search for scientific literature.
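The effect of a field tag can also be seen in how a search is expressed programmatically. The sketch below builds (but does not send) request URLs for NCBI's E-utilities ESearch service, which is not part of this exercise. The helper names `tagged` and `esearch_url` are hypothetical, and the database name "nucleotide" is an assumption that should be checked against the current E-utilities documentation.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(text, tag):
    """Restrict a search term to one field, e.g. 'Male' + 'AU' gives 'Male[AU]'."""
    return f"{text}[{tag}]"


def esearch_url(db, term):
    """Build an ESearch request URL; nothing is sent over the network."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# The same word searched in all fields, and restricted to the author field.
untagged_url = esearch_url("nucleotide", "Male")
author_url = esearch_url("nucleotide", tagged("Male", "AU"))
print(untagged_url)
print(author_url)
```

The only difference between the two URLs is the `[AU]` suffix on the term (percent-encoded in the URL), which is what narrows the search from any field to the author field.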

Use the functionality under the Limits link (just below the search field) to fine-tune your searches. Also look at the History, which is a section under the Advanced link. Note that previous searches can be combined here.
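Combining searches amounts to joining terms with the Boolean operators AND, OR and NOT. The following is a minimal sketch of that idea. The helper `combine` is hypothetical, and the organism tag [ORGN] is an addition not mentioned above, so check the Entrez help before relying on it.

```python
def combine(operator, *terms):
    """Join search terms with a Boolean operator, parenthesising each term."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    if len(terms) < 2:
        raise ValueError("need at least two terms to combine")
    return f" {operator} ".join(f"({t})" for t in terms)


# An author restriction combined with an organism restriction.
author = "Aasland[AU]"
organism = "Danio rerio[ORGN]"
query = combine("AND", author, organism)
print(query)  # (Aasland[AU]) AND (Danio rerio[ORGN])
```

The parentheses keep each sub-search intact, so that a term that itself contains OR is not regrouped when it is combined with another term.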

When using SRS (Standard Query Form), how many sequence entries did you find in the EMBL database that have Rune Male as author?
How many sequence entries did you find in the NCBI nucleotide sequence database with Rune Male as author? Note that with Entrez, these sequences are of three different types (see upper right corner). List the number for each type as well as the total number.
Compare the number of entries found for Rune Male with Entrez and SRS, and comment on why the two systems give different numbers.
How many zebrafish sequences have Rein Aasland as author? List the GI numbers (GenInfo Identifiers) of the corresponding NCBI nucleotide entries (Use Entrez).
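When comparing the hits from the two systems, it can help to see which entries are shared and which are unique to one system. The following is a minimal sketch, assuming you have copied the accession numbers from each result list by hand. The accessions shown are made up, and the function name `compare_hits` is hypothetical.

```python
def compare_hits(srs_hits, entrez_hits):
    """Split two lists of accessions into shared and system-specific entries."""
    srs, entrez = set(srs_hits), set(entrez_hits)
    return {
        "both": sorted(srs & entrez),
        "srs_only": sorted(srs - entrez),
        "entrez_only": sorted(entrez - srs),
    }


# Made-up accession numbers, for illustration only.
srs_hits = ["A00001", "A00002", "A00003"]
entrez_hits = ["A00002", "A00003", "A00004", "A00005"]

result = compare_hits(srs_hits, entrez_hits)
for group, accessions in result.items():
    print(group, len(accessions), accessions)
```

Inspecting the entries that appear in only one list is a practical starting point for explaining why the totals differ.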

Using SRS to explore UniProt records

Use SRS to find UniProt/Swiss-Prot entries which have Rein Aasland as author (you must first choose the protein databank 'UniProtKB/Swiss-Prot' and then use the Standard Query Form). Locate the human NSD2 protein and inspect its contents. Note that many of the references have links to full-text articles. Locate the Database cross-references and find the links to the MIM database (MIM: Mendelian Inheritance in Man, a database of genes involved in inheritable diseases). Study the MIM entry carefully.

What kind of protein is NSD2 and what type of disease is it involved in? Using the cross-links to InterPro, Pfam and SMART, you can find information on known globular domains in NSD2. You can also search with 'NSD2_HUMAN' directly in the SMART and Pfam resources and get nice graphical views of the domain organisation of this protein (but with Pfam, you must use the actual sequence as query).
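In the text (flat-file) view of a UniProt entry, database cross-references appear on lines beginning with DR. The sketch below collects them per database. The excerpt uses placeholder identifiers, not the real NSD2 cross-references, and the function name `cross_references` is hypothetical.

```python
from collections import defaultdict

# Placeholder excerpt in Swiss-Prot flat-file style; all identifiers are made up.
EXCERPT = """\
ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
DR   MIM; 000000; gene.
DR   InterPro; IPR000000; Example_dom.
DR   Pfam; PF00000; Example; 1.
DR   SMART; SM00000; EXAMPLE; 1.
"""


def cross_references(flat_text):
    """Map each cross-referenced database name to the identifiers listed for it."""
    refs = defaultdict(list)
    for line in flat_text.splitlines():
        if not line.startswith("DR   "):
            continue  # only DR lines carry cross-references
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        if len(fields) >= 2:
            database, identifier = fields[0], fields[1]
            refs[database].append(identifier)
    return dict(refs)


refs = cross_references(EXCERPT)
for database in ("MIM", "InterPro", "Pfam", "SMART"):
    print(database, refs.get(database, []))
```

Grouping the identifiers by database makes it easy to see at a glance which domain resources an entry links to before following each link.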

You can also use Entrez to search directly in MIM (OMIM). Search e.g. with 'leukemia' and see how many entries you find. Inspect some of them, including the one for MLL.

Use SRS to search for NSD2. Which known globular domains does NSD2 have (combine information from InterPro, Pfam and SMART)?
What disease is NSD2 involved in?

Advanced SRS - linking to other databases

Both SRS and Entrez have many features allowing for sophisticated queries. It is far beyond the scope of this exercise to go into all the details. But you are encouraged to use the help and documentation to learn more about the use of these two tools. In the next task we shall use some of these advanced features of SRS to illustrate how you can take search results as input to new queries.

Linking to MIM

We shall illustrate this feature of SRS by exploring genes and proteins involved in diabetes. Use SRS again to locate the UniProt/Swiss-Prot entry for human glucokinase. Select it, read about its function, and then find the links to the MIM database (OMIM in this case). With the OMIM entries open, click the red 'link' button on the left side, which takes you back to the Library page. Now choose UniProtKB/Swiss-Prot again and search. You should find several entries. Play with the display options so you get more information on each of the entries.

For the linked SRS search with glucokinase, list the other proteins (with descriptions) that are involved in the disease Maturity-onset diabetes of the young (MODY).

Linking to GO (Gene Ontology)

Use SRS again to locate the fission yeast SET1 protein in UniProtKB. Select it and inspect the annotation to find out what kind of protein it is; in particular, does it have a catalytic activity? Click the red 'link' button on the left side, and select GO from the Library page (you must open the 'Gene Dictionaries and Ontologies' and choose 'GO'). Search and inspect the GO terms listed. One of the GO terms concerns the COMPASS protein complex. Choose this GO term and use the SRS link function to link back to UniProtKB and find other proteins from yeast (the species Schizosaccharomyces pombe) that are part of this complex.

You can also query the GO database using specialised tools like AmiGO. Use AmiGO to query for COMPASS. Inspect the information on the page. How many subunits do you find for COMPASS here? Follow the hierarchy of GO terms from the term 'chromatin remodelling complex' up to the top level 'cellular component'.
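Following the hierarchy upwards means repeatedly stepping from a term to its parent until the top level is reached. The following is a minimal sketch of that walk, using a toy hierarchy. The intermediate terms are placeholders rather than real GO terms, and real GO terms may have more than one parent, which this simplified single-parent version ignores.

```python
def path_to_root(term, parent_of):
    """Follow single-parent links from a term to the top and return the visited terms."""
    path = [term]
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in path:
            raise ValueError("cycle in hierarchy")
        path.append(parent)
    return path


# Toy hierarchy; the intermediate terms are placeholders, not actual GO terms.
parent_of = {
    "chromatin remodelling complex": "placeholder parent term 1",
    "placeholder parent term 1": "placeholder parent term 2",
    "placeholder parent term 2": "cellular component",
}

path = path_to_root("chromatin remodelling complex", parent_of)
print(" -> ".join(path))
```

When you do this in AmiGO, write down the real term at each step so that you can reproduce the path in your report.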

Another tool that gives information on protein interactions (and complexes) is MINT. Go there and search with SET1 in the 'Protein or gene name' field and choose species: 'Saccharomyces cerevisiae'. Use the MINT viewer (a graphical tool that requires that Java is properly installed) to explore the known interactions with SET1. Click on the 'plusses' to expand the interaction network and use the slide bars to alter the view and the score threshold.

From the searches with yeast SET1:

What kind of protein is SET1? Using information in the annotation and other places, give a brief description of the SET1 protein and its function.
SET1 is part of the COMPASS complex. How many subunits are in the yeast (Schizosaccharomyces pombe) COMPASS complex? - Give the protein names for each. How does this number compare with the information you find for the COMPASS complex when you search directly in GO with AmiGO?
When you query the MINT database for Saccharomyces cerevisiae SET1, how many interactions (associations) do you find (give a list of protein names)? Do they correspond to those found with the GO terms (when using AmiGO)?

Using the genome browser ENSEMBL

Use SRS to find the human SET1A entry in the UniProtKB. Check out its relationship to the yeast SET1 protein, i.e. do the two proteins have common features (e.g. SMART, Pfam) and functions?

Locate the database cross-reference to the genome browser ENSEMBL. Familiarize yourself with the information on the entry page for this protein (choose the ENSP...-entry). Click on the 'Location: ...' tab to obtain the chromosome view for this gene. Which chromosome does the SETD1A gene sit on? Inspect both the Overview window and the Detailed view window below. How many exons does the SETD1A gene have (orange and open boxes in the detailed view)? Use the zoom-out function (bars between the two windows) and look at the neighbouring genes. Which gene is upstream of SETD1A, and approximately how far upstream is it? Use the scale bars to estimate the distance between the 5' end of SETD1A and the 3' end of the upstream gene, and use the information in the pop-up windows that appear when you click on the gene icons to identify the chromosome coordinates for both genes, which will give you the exact distance.
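Once you have the chromosome coordinates of both genes, the exact distance is simple arithmetic. The sketch below counts the bases lying strictly between two genes, assuming 1-based inclusive start and end coordinates on the same chromosome. The coordinates are made up and the function name `gap_between` is hypothetical. Note that which neighbour counts as upstream depends on the strand the gene is on.

```python
def gap_between(gene_a, gene_b):
    """Number of bases strictly between two genes; 0 if they overlap or touch.

    Each gene is a (start, end) pair with start <= end, on the same chromosome.
    """
    # Order the genes by start coordinate so the argument order does not matter.
    (start_a, end_a), (start_b, end_b) = sorted([gene_a, gene_b])
    return max(0, start_b - end_a - 1)


# Made-up coordinates, for illustration only.
upstream_gene = (1000, 2500)
gene_of_interest = (4001, 9000)

print(gap_between(upstream_gene, gene_of_interest))  # 1500
```

If you instead want the plain coordinate difference (start of one gene minus end of the other), it is one larger than the gap computed here, so state in your report which definition you used.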

From the searches with ENSEMBL:

Which chromosome is the human SETD1A gene on? Give the precise chromosomal location (look at the Overview window).
How many exons does the SETD1A gene have? (Give the number for the longest transcript.)
Name the 5 nearest genes upstream and downstream of SETD1A.
What is the distance (in nucleotides) to the first gene upstream? (Use the information in the pop-up windows to identify the chromosome coordinates for both genes, which will give you the exact distance.)