Text Queries in Databases

Angèl Abboud, Rein Aasland & Pål Puntervoll

Last updated: 3-SEP-2014

Homework before PC lab session - Watch the [ENSEMBL tutorial] on YouTube.


The learning outcome from this first practical exercise is two-fold:

  1. You will become familiar with real bioinformatics data as they appear in some of the most widely used databases
  2. You will be able to search and access such databases using the web-based search engines Entrez, UniProt Search and ENSEMBL

When you use sequence analysis web tools, you must always keep an eye on what the tools ask for and what information they provide.

Some of the questions in this exercise require that you use your "molecular biological competence and common sense" to find the correct answer.

Before you start - make your own MOL204 directory

Once you have logged into your UiB computer account, create a new directory called mol204, in which you should store all the files that you generate while you work on the FakLab computers. If you don't know how to do that in Windows, please ask the instructors. Familiarise yourself with the environment and with how to navigate to and from this directory.

Important: as you work through the tutorial, take notes (write and use copy/paste) of what you do. This will make it easier for you to answer the specific questions listed at the end of each section.

Finding sequences with Entrez

Entrez -- the NCBI text-based search and retrieval system

  • Open a web browser (Firefox is recommended).
  • Use Entrez to find the sequence entries in the Nucleotide sequence database that have Rune Male as author. Remember to choose the 'nucleotide database'.

Does it matter how you write the name (e.g. full name or last name only)?

Entrez has excellent documentation, available at the bottom left under the "Getting started" heading. To find out how to write the author name, read the instructions in the Entrez Searching Options help page.

  • Try searching with only 'Male' as query and see how many hits you get. Repeat the search restricting to authors using the [AU] field tag (you may need to reselect the 'nucleotide' database).

The tags that you can use when you search for sequences are explained in the Entrez help files (e.g. the help for PubMed). Some useful tags are AU (author), AD (affiliation), TA (journal title abbreviation) and DP (date of publication); these are also useful when using PubMed to search for scientific literature.
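The difference between an unrestricted search and an author-restricted search can also be written down as a query string. The sketch below only builds search URLs for the NCBI E-utilities `esearch` service; it does not contact the server. The helper names `author_term` and `esearch_url` are made up for this illustration, and the "last name plus initials" author format is an assumption that you should check against the Entrez help page.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def author_term(last_name, initials=""):
    """Return a query term restricted to the author field with the [AU] tag.

    Assumption: authors are written as last name followed by initials.
    """
    name = f"{last_name} {initials}".strip()
    return f"{name}[AU]"


def esearch_url(db, term):
    """Build (but do not send) an esearch URL for the given database and term."""
    return f"{EUTILS_ESEARCH}?{urlencode({'db': db, 'term': term})}"


# 'Male' anywhere in the record versus 'Male R' as author only.
unrestricted = esearch_url("nucleotide", "Male")
restricted = esearch_url("nucleotide", author_term("Male", "R"))

print(unrestricted)
print(restricted)
```

Pasting either URL into a browser should return the hit count as XML, which you can compare with the numbers you see on the Entrez web page.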

  • Use the functionality under the Limits link (just below the search field) to fine-tune your searches. Also look at the History, which is a section under the Advanced link. Note that previous searches can be combined here.

How many sequence entries did you find in the NCBI nucleotide sequence database with Rune Male as author? Note that with Entrez, these sequences are of several different types (see upper left corner). List the number for each type as well as the total number.

Using Entrez, how many cyclin protein sequences have Rein Aasland as author? List the GI numbers (GenInfo Identifiers) of those cyclin sequences that do not come from Oikopleura.

Note: GenBank is phasing out the use of GI numbers in Sept. 2016. From now on, accession numbers with version will be used; e.g. NM_205761.2 instead of GI:985567502. For today, you can use either GI or accession.version.
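When you copy identifiers into your notes, it helps to keep track of which kind you have. The following is a minimal sketch, not an official NCBI parser: the function names `split_accession` and `is_gi_number` are invented for this example, and the rules (a GI is a plain integer, optionally prefixed with "GI:", and a version is the integer after the last part separated by a dot) are simplifying assumptions.

```python
def split_accession(identifier):
    """Split 'NM_205761.2' into ('NM_205761', 2).

    The version is None when the identifier has no '.version' suffix.
    A non-numeric version raises ValueError.
    """
    accession, dot, version = identifier.strip().partition(".")
    if not dot:
        return accession, None
    return accession, int(version)


def is_gi_number(identifier):
    """Return True for a plain integer identifier, with or without 'GI:'."""
    text = identifier.strip()
    if text.upper().startswith("GI:"):
        text = text[3:]
    return text.isdigit()


# The two identifiers mentioned in the note above.
print(split_accession("NM_205761.2"))
print(is_gi_number("GI:985567502"))
```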

Finding protein sequences using UniProt

At the bottom left, under “Getting started”, you will find a brief overview of the query syntax that UniProt supports.

During these sessions we will focus on the human cyclin protein family. Cyclins are involved in the control of the cell cycle by activating the cyclin-dependent kinases (Cdks).

  • Let’s start by finding out how many cyclins there are in the human proteome. Enter 'cyclin AND "homo sapiens"' in the search field.

How many results do you obtain?

You can see in the protein names column that there are proteins with names other than cyclin. There are also proteins from species other than human. Let’s try to restrict the query to cyclin and “homo sapiens” only.

  • To the left of the table, UniProt suggests filters for narrowing the search. Filter by homo sapiens as ‘organism’. How many results do you get now? To filter further, click "Advanced" in the search field, change the field for cyclin to "Family and domains" and select "protein family". How many results do you get now? As your search proceeds, more filtering options appear in the left margin, and you can choose "protein family" directly there.

You can choose to see more columns with additional information about your sequences in the table by clicking on the "Columns" button (above the table of results) or by clicking on the pen icon at the right end of the table of results (you may have to reload the page to see the result).

  • Get an overview of the information available. You can also sort every column by clicking on the arrows to the right of the column title. Find the 3D structures available and the protein families for the human cyclins.

UniProt combines the Swiss-Prot and TrEMBL databases. Swiss-Prot sequences are called ‘reviewed’ because they are manually annotated and non-redundant. TrEMBL sequences are called ‘unreviewed’ because they are annotated only by computational methods and large-scale functional characterization.

  • Let’s focus on the reviewed sequences here to avoid redundancy.

How many hits do you have now?
Can you find out whether any of the sequences have 3D structures available?
How many cyclin subfamilies are there in human?
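The step-by-step filtering in this section (free text, then organism, then protein family, then reviewed entries only) corresponds to adding clauses to a query string. The sketch below shows one way to assemble such a query. The field names `family`, `organism_id` and `reviewed`, and the search address, are assumptions based on the current UniProt query syntax, which may differ from what the web page shows you; `uniprot_query` and `search_url` are hypothetical helper names. 9606 is the NCBI taxonomy identifier for Homo sapiens.

```python
from urllib.parse import urlencode

# Assumed address of the UniProtKB search service.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_query(free_text=None, family=None, organism_id=None, reviewed=None):
    """Join the given filters into one query string with AND."""
    clauses = []
    if free_text:
        clauses.append(free_text)
    if family:
        clauses.append(f"family:{family}")
    if organism_id is not None:
        clauses.append(f"organism_id:{organism_id}")
    if reviewed is not None:
        clauses.append(f"reviewed:{'true' if reviewed else 'false'}")
    return " AND ".join(clauses)


def search_url(query):
    """Build (but do not send) a search URL for the query."""
    return f"{UNIPROT_SEARCH}?{urlencode({'query': query})}"


# Each step narrows the previous one, as in the exercise.
step1 = uniprot_query(free_text='cyclin AND "homo sapiens"')
step2 = uniprot_query(free_text="cyclin", organism_id=9606)
step3 = uniprot_query(family="cyclin", organism_id=9606)
step4 = uniprot_query(family="cyclin", organism_id=9606, reviewed=True)

for step in (step1, step2, step3, step4):
    print(step)
print(search_url(step4))
```

Writing the query out like this makes it easier to record in your notes exactly which filters produced which number of hits.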

UniProt -- linking to other databases

  • Locate the cyclin-D1 protein and click on the entry to inspect its contents.

Have a look at the function. What is the gene name of this protein?

  • Locate the Database cross-references and find the reference to the MIM database (MIM: Mendelian Inheritance in Man, a database of genes involved in inheritable diseases). Click on the accession number of the gene and phenotype. You can also use the "PATHOL/BIOTECH" tab on the left side of the entry to jump directly to information on diseases.
  • Go back to the UniProt entry for cyclin-D1. Have a look at the different links to other databases.

Clicking on the name of a database gives a brief description of it. KEGG tells you which pathways cyclin-D1 is involved in; information on diseases is also available (e.g. from the MIM database). STRING is a protein-protein interaction database with a nice graphical view of the possible partners of cyclin D1. InterPro, Pfam and SMART give information on the domains of cyclin-D1 (click on the graphical view for a graphical summary of the different domains).

Which protein partners interact with cyclin D1?
What type of disease(s) is cyclin D1 involved in?
Using the SMART database, find information about the cyclin domain and the types of proteins in which this domain appears.

Linking to GO (Gene Ontology)

Finally, let’s have a look at the GO (Gene Ontology) terms. First, click on the term “GO - Molecular Function” to learn more about the organization of the GO system (click on “More” in the pop-up window).

  • Search and inspect the GO terms listed. One of the GO terms is ‘positive regulation of cyclin-dependent protein serine/threonine kinase activity’. Click on this link. Have a look at the ancestor chart and the protein annotation on the QuickGO website.

How many cellular components does cyclin-D1 have a role in?
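The ancestor chart shows that GO is organised as a graph in which a term can have more than one parent, so the ancestors of a term are found by following every parent link upwards. The sketch below illustrates that traversal on a toy graph. The term names in `toy_parents` are placeholders invented for this example, not real GO terms, and `ancestors` is a hypothetical helper.

```python
def ancestors(term, parents):
    """Return the set of all terms reachable by following parent links.

    `parents` maps each term to a list of its direct parents.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Placeholder graph: one specific term with two parents sharing one root.
toy_parents = {
    "specific_term": ["intermediate_a", "intermediate_b"],
    "intermediate_a": ["root"],
    "intermediate_b": ["root"],
    "root": [],
}

print(sorted(ancestors("specific_term", toy_parents)))
```

Note that "root" is reached along two paths but counted once, which is why the ancestor chart is a graph rather than a simple tree.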

Using the genome browser ENSEMBL

  • From the human cyclin-D1 entry in UniProt, locate the database cross-reference to the genome browser ENSEMBL.

Familiarize yourself with the information on the entry page for this protein (choose the ENSP...-entry).

  • Click on the 'Location: ...' tab to obtain the chromosome view for this gene.

Which chromosome does the Cyclin-D1 gene (CCND1) sit on?
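The location shown in the browser can also be requested from the ENSEMBL REST service by gene symbol. The sketch below only builds the lookup address and formats a location string; it does not contact the server. The helper names are invented for this illustration, the record keys `seq_region_name`, `start` and `end` are assumed to match what the lookup service returns, and the values in `example_record` are made up and are not the CCND1 coordinates.

```python
from urllib.parse import quote

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_symbol_url(symbol, species="homo_sapiens"):
    """Build (but do not send) a gene lookup URL that asks for JSON."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def format_location(record):
    """Format a lookup record as 'chromosome:start-end'."""
    return f"{record['seq_region_name']}:{record['start']}-{record['end']}"


print(lookup_symbol_url("CCND1"))

# Made-up record, only to show the expected shape of the data.
example_record = {"seq_region_name": "X", "start": 100, "end": 200}
print(format_location(example_record))
```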

  • Inspect both the "Region in detail" view and the "locus-centered" window below.

How many exons does the Cyclin-D1 gene have (orange and open boxes in the detailed view)?

  • Use the zoom-out function (bars between the two windows) and look at the neighboring genes.

Which gene is upstream of Cyclin-D1, and approximately how far upstream is it? Use the scale bars to estimate the distance between the 5' end of Cyclin-D1 and the 3' end of the upstream gene. Then click on the gene icons and use the chromosome coordinates shown in the pop-up windows for both genes to calculate the exact distance.
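Once you have the coordinates of both genes, the exact distance is simple arithmetic. The sketch below assumes 1-based inclusive coordinates on the same chromosome and two genes that do not overlap; `intergenic_gap` is a hypothetical helper and the coordinates in the example are made up. For a gene on the forward strand, the upstream neighbour has the lower coordinates; for a gene on the reverse strand, the roles are swapped.

```python
def intergenic_gap(lower_gene, higher_gene):
    """Count the nucleotides strictly between two non-overlapping genes.

    Each gene is a (start, end) pair with start <= end, 1-based inclusive.
    `lower_gene` must end before `higher_gene` starts.
    """
    _, lower_end = lower_gene
    higher_start, _ = higher_gene
    if lower_end >= higher_start:
        raise ValueError("genes overlap or are given in the wrong order")
    return higher_start - lower_end - 1


# Made-up coordinates: positions 2501 to 3999 lie between the genes.
print(intergenic_gap((1000, 2500), (4000, 9000)))  # 1499
```

If you simply subtract the end of one gene from the start of the other, you get a value one higher (1500 here); state in your answer which convention you used.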

From the searches with ENSEMBL:

Which chromosome is the human Cyclin-D1 gene (CCND1) on? Give the precise chromosomal location (look at the Overview window).
How many exons does the Cyclin-D1 gene have (give the number for the longest transcript)?
Name the 5 nearest genes upstream and downstream of Cyclin-D1.
What is the distance (in nucleotides) to the first gene upstream? Use the chromosome coordinates in the pop-up windows for both genes to calculate the exact distance.