# Computer Exercise Spring 2016

By Professor Rein Aasland and Maxim Bril’kov & Joakim Brunet

Based on a concept developed by Rein Aasland and Øyvind Ødegaard, V2015.

Department of Molecular Biology - University of Bergen

## Background: Bioinformatics in molecular biology

While experiments have always been the foundation of molecular biology, computer-based analyses and bioinformatics have become increasingly important, in particular following the sequencing of the human genome (published in 2001) and the genomes of many other organisms, including all the major model organisms (such as yeast, fruit flies, a nematode, the zebrafish, and the mouse). Current DNA sequencing technologies allow complete, high-quality sequencing of new genomes in a relatively short time (weeks to months). Annotating a new genome, that is, identifying all the genes and regulatory elements, is still a very difficult and often daunting task. Moreover, a large fraction of the known and predicted proteins are poorly understood and will require a lot of experimental research in the years to come.

In addition to gene and genome sequences, we have also experienced tremendous developments in other molecular technologies that generate large amounts of many other types of data. Examples are: RNA sequencing: mRNAs, non-coding RNAs and transcription start sites (where promoters are); mass spectrometry: identification and quantitation of proteins, protein modifications, and small biomolecules of all types; crystallography and NMR: protein and nucleic acid structures. Other technologies, such as chromatin immunoprecipitation (ChIP), allow us to map where transcription factors bind in the genome, and immunofluorescence allows us to see and quantify where proteins and other biomolecules are located in cells and tissues.

Importantly, as many of the new technologies allow for fast, accurate, and medium- to large-scale measurements, it is now possible to obtain time series for many types of data and parameters, allowing for computational modelling of how living systems work. When this is done on a large scale, considering many types of biomolecules in, e.g., organelles, cells or tissues, we enter a new level of analysis known as systems biology. Last, but not least, the ability to search and compute on texts from the scientific literature allows us to combine experimental data with the insights and interpretations of scientists.

It may seem a daunting challenge to cope with the increasingly data-rich field of molecular biology. And it is. The course MOL204, Applied Bioinformatics, which is now compulsory in our bachelor programme in molecular biology, reflects the trends described above. We also offer an optional and more advanced course, MOL217 Applied Bioinformatics II. While MOL204 covers the basics of protein bioinformatics in both theory and tutorials, we will in this MOL221 exercise give a very brief introduction to bioinformatics and computational biology.

## Rationale and format of the exercise

This exercise is set in a scenario where we ask you to imagine that you have just joined a research group as a master's student. The supervisor tells you that they are just about to start working on a new human enzyme they have recently identified as crucial for the phenomenon they are studying. As the enzyme is little known to the supervisor and the research group, you are asked to explore the bioinformatical databases and other computational resources to find out as much as possible about the enzyme and present this to the research group in their next lab meeting. You are asked, in particular, to search for the following information:

1. What is the biochemical reaction catalysed by the enzyme? Does it require any cofactors? Does it have any particular requirements for optimal function (e.g. temperature, pH, salt concentrations etc)?
2. What is the molecular composition of the enzyme? Is it monomeric, oligomeric, or is it embedded in a larger protein complex?
3. Is the structure of the enzyme known? If so, does the structure reveal something about how the enzyme works? [this is a large question, and is not expected to be extensively covered]
4. Where is it expressed in the organism? Is it more abundant in some organs or tissues than in others?
5. Where in the cell is the enzyme localised? Is it cytoplasmic? Is it localised in a particular organelle (e.g. nucleus, mitochondria, plasma membrane), or is it secreted and/or associated with the surface of the cells? Are there isoforms of the enzyme with different localisations in the cells?
6. Is there any data suggesting that the enzyme is post-translationally modified? (e.g. phosphorylation, acetylation, glycosylation, etc.)
7. Are there closely related enzymes in the human proteome? [You may need to perform a sequence-based database search]
8. Is the enzyme conserved in evolution? I.e.: is it found in other organisms such as other vertebrates, chordates, insects, yeasts, and bacteria?
9. Are there interesting papers published about the enzyme recently (i.e. in 2014 or 2015)?

It might well take an experienced molecular biologist (or bioinformatician) several days or more to find comprehensive answers to all these questions. You only have two days (i.e. the time allotted in this course) to do the job. Hence, we do not expect you to find full details for all these questions. In the following we give you general instructions and hints in the form of a generic tutorial text. This will also be presented, with an example, in the first plenum session for this exercise, and we will guide you along the way.

Each pair of students gets one enzyme to work with, and the data found should be presented to the supervisor (i.e. the course instructors) in the form of a PowerPoint (or similar) ready to be presented to the research group.

## Core bioinformatical data and data formats

In bioinformatics, we deal with two classes of data. Primary data come directly from experiments and include sequences of DNA, RNA, and proteins as well as their structures (atomic coordinates x, y, z and atom connections). Secondary data are derived from the primary databases, or are integrated or computed from one or more primary databases. Note that we consider protein sequences as primary data even though we derive them from conceptual translation of DNA (or mRNA) sequences. Other primary data include information on protein-protein interactions, protein modifications, and protein complexes, all directly from experiments. Primary data are stored in primary databases.
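To make "conceptual translation" concrete, here is a minimal Python sketch that translates a DNA coding sequence into single-letter amino acid code using the standard genetic code. The function name `translate` and the short DNA string are made up for illustration; the DNA is not the real KDM6A coding sequence, and real tools handle many more cases (ambiguous bases, alternative genetic codes, reading frames).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Walk through the sequence one complete codon (three bases) at a time.
    for start in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example sequence (an assumption for illustration only).
example_protein = translate("ATGAAATCTTGTTAA")
print(example_protein)
```

The translated sequence is computed from the DNA, yet it is stored and treated as primary data because the underlying DNA sequence comes directly from experiments.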

Let’s consider a protein sequence. The raw data are stored as the linear chain of amino acids in single-letter code. The first 180 residues of the amino acid sequence for human histone demethylase KDM6A (also known as UTX) are shown here (in raw format):

 MKSCGVSLATAAAAAAAFGDEEKKMAAGKASGESEEASPSLTAEEREALGGLDSRLFGFV
RFHEDGARTKALLGKAVRCYESLILKAEGKVESDFFCQLGHFNLLLEDYPKALSAYQRYY
SLQSDYWKNAAFLYGLGLVYFHYNAFQWAIKAFQEVLYVDPSFCRAKEIHLRLGLMFKVN


To make these data useful, it is necessary to provide the raw sequence data with additional information, such as the name of the protein, the organism, etc. A very popular and compact format for sequence data is the fasta format (only the first 180 residues shown):

 >sp|O15550|KDM6A_HUMAN Lysine-specific demethylase 6A OS=Homo sapiens GN=KDM6A PE=1 SV=2
MKSCGVSLATAAAAAAAFGDEEKKMAAGKASGESEEASPSLTAEEREALGGLDSRLFGFV
RFHEDGARTKALLGKAVRCYESLILKAEGKVESDFFCQLGHFNLLLEDYPKALSAYQRYY
SLQSDYWKNAAFLYGLGLVYFHYNAFQWAIKAFQEVLYVDPSFCRAKEIHLRLGLMFKVN


The first line in the fasta format contains a lot of information. It starts with a greater-than sign (>) followed by the “fasta name” of the protein (sp|O15550|KDM6A_HUMAN; formally, all characters from the left up to the first space). In this case, the fasta name includes the database source (sp, SwissProt), the accession number (O15550), and then a short name called the identifier (KDM6A_HUMAN), where the letters following the underscore indicate the organism (HUMAN, in this case). This is the minimal information required in a fasta file. In the example above, there is further information: the full name of the protein (Lysine-specific demethylase 6A), the full species name (OS=Homo sapiens), and the gene name (GN=KDM6A). The remaining fields, PE=1 and SV=2, give the protein existence level and the sequence version.
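The structure of the header line can be illustrated with a short Python sketch that splits it into the parts described above. The function name `parse_uniprot_header` and the dictionary keys are our own choices for illustration, and the sketch assumes a UniProt-style header with exactly three `|`-separated parts in the fasta name; headers from other databases look different.

```python
import re


def parse_uniprot_header(header: str) -> dict:
    """Split a UniProt-style fasta header line into its parts."""
    if not header.startswith(">"):
        raise ValueError("a fasta header must start with '>'")
    # The fasta name is everything from the left up to the first space.
    fasta_name, _, description = header[1:].strip().partition(" ")
    database, accession, identifier = fasta_name.split("|")
    # Two-letter KEY=value fields (OS, GN, PE, SV) follow the protein name.
    fields = dict(
        re.findall(r"\b([A-Z]{2})=(.+?)(?=\s[A-Z]{2}=|$)", description)
    )
    # The free-text protein name is whatever precedes the first KEY= field.
    protein_name = re.split(r"\s[A-Z]{2}=", description, maxsplit=1)[0]
    return {
        "database": database,
        "accession": accession,
        "identifier": identifier,
        "organism_code": identifier.split("_", 1)[1],
        "protein_name": protein_name,
        "fields": fields,
    }


header = (
    ">sp|O15550|KDM6A_HUMAN Lysine-specific demethylase 6A "
    "OS=Homo sapiens GN=KDM6A PE=1 SV=2"
)
parsed = parse_uniprot_header(header)
print(parsed["accession"], parsed["organism_code"], parsed["fields"]["GN"])
```

In practice you would normally rely on an established library for reading sequence files; the point here is only to show how much structured information the single header line carries.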

The size of a file containing all 20199 human proteins in fasta format is ~13.4 MB (with a total of 11.3 million amino acid residues; average protein length: 561 residues).
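Numbers like these can be computed directly from a fasta file. The sketch below is a minimal example: `read_fasta` and `summarise` are hypothetical helper names, and the two toy records are invented so that the result is easy to check by hand. It counts the records, sums the residues, and divides to get the mean length (for the figures quoted above, 11.3 million / 20199 gives roughly 560 residues, which agrees with the stated average given the rounding of the total).

```python
from typing import Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from fasta-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        yield header, "".join(chunks)


def summarise(text: str) -> dict:
    """Count the proteins and residues in fasta text and the mean length."""
    lengths = [len(sequence) for _, sequence in read_fasta(text)]
    total = sum(lengths)
    mean = total / len(lengths) if lengths else 0.0
    return {"proteins": len(lengths), "residues": total, "mean_length": mean}


# Two invented toy records (not real database entries).
toy_fasta = """>toy1 hypothetical record
MKSCGV
SLAT
>toy2 hypothetical record
MAAGKA
"""
summary = summarise(toy_fasta)
print(summary)
```

To apply this to a real proteome file, you would read the file contents into a string (or adapt `read_fasta` to iterate over the file line by line) and pass it to `summarise`.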

The major database for protein sequences is called UniProt Knowledge Base (UniProtKB) and is located and maintained at the European Bioinformatics Institute (EBI) near Cambridge. A typical entry in UniProt contains much more information about the sequence than a fasta file. Besides the name, species, etc., you will find information on the function of the protein, where it is expressed, what features there are along the sequence, and so on, as well as literature references for each of these.
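UniProt entries can also be retrieved programmatically. The sketch below assumes the UniProt REST service at `rest.uniprot.org`, where an entry is addressed by its accession number plus a format suffix (`.fasta` for the sequence, `.txt` for the flat file). The address scheme may change over time, so treat the base URL as an assumption and check the current UniProt documentation. The function names `entry_url` and `fetch_entry` are our own.

```python
from urllib.request import urlopen

# Assumed base address of the UniProt REST service; verify before relying on it.
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession: str, fmt: str = "fasta") -> str:
    """Build the address of a UniProtKB entry in fasta or flat-file format."""
    if fmt not in ("fasta", "txt"):
        raise ValueError("fmt must be 'fasta' or 'txt'")
    return f"{UNIPROT_REST}/{accession}.{fmt}"


def fetch_entry(accession: str, fmt: str = "fasta", timeout: float = 30.0) -> str:
    """Download an entry as text (requires an internet connection)."""
    with urlopen(entry_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only the address is printed here; call fetch_entry("O15550") to download.
    print(entry_url("O15550"))
```

Calling `fetch_entry("O15550", "txt")` would return the flat-file version of the KDM6A entry as one string, which can then be saved or searched like any other text.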

The full entry for the example (the human KDM6A protein) can be found here as a UniProt flat file and here as a richly formatted UniProt web page. We will explain this example, and how to work with this exercise, in the introductory lecture.

## Questions for the exercise

On the questions page, you will find the 9 questions above and more details about which bioinformatical tools to use and what questions to address with the given test proteins. Guidelines for the report for this exercise are shown here.