Biomolecular databasespedagogix-tagc.univ-mrs.fr/courses/bioinfo_intro/... · Examples of biomolecular databases Sequence and structure databases Protein sequences (UniProt) DNA sequences

Biomolecular databases

Bioinformatics

Jacques van Helden

FORMER ADDRESS (1999-2011) Université Libre de Bruxelles, Belgium

Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/

NEW ADDRESS (since Nov 1st, 2011) [email protected]

Université d’Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics

(TAGC, INSERM Unit U1090) http://tagc.univ-mrs.fr/


Contents

  Examples of biological databases
    Nucleic sequences: GenBank, EMBL, and DDBJ
    Protein sequences: UniProt
    The Gene Ontology (GO) project
  Issues and perspectives for biological databases

Examples of biomolecular databases


  Sequence and structure databases
    Protein sequences (UniProt)
    DNA sequences (EMBL, GenBank, DDBJ)
    3D structures (PDB)
    Structural motifs (CATH)
    Sequence motifs (PROSITE, PRODOM)
  Genome sequences and annotations
    Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
    Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
  Molecular functions
    Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
    Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
    Transport (YTPdb)
  Biological processes
    Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
    Signal transduction pathways (CSNdb, Transpath)
    Protein-protein interactions (DIP, BIND, MINT)
    Gene networks (GeneNet, FlyNets)

Databases of databases

  There are hundreds of databases related to molecular biology and biochemistry. New databases are created every year.

  Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
    http://nar.oupjournals.org/
    2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
  The same journal maintains a database of databases: the Molecular Biology Database Collection.
    http://www.oxfordjournals.org/nar/database/c/
  Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
    http://srs.ebi.ac.uk/

Nucleic sequence databases: GenBank, EMBL, and DDBJ




Okubo et al. (2006) NAR 34: D6-D9

Nucleic sequence databases

  Before publishing an article that deals with a sequence, scientific journals require the sequence to have been deposited in a reference database.
  There are 3 main repositories for nucleic acid sequences.
  Sequences deposited in any one of these 3 databases are automatically synchronized to the other 2.

Adapted from Didier Gonze

The sequencing pace

  Nucleic sequences
    GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
      •  126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
      •  191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
  Entire genomes
    GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
    http://www.genomesonline.org/gold_statistics.htm
  Protein sequences
    Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
    UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
    http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

Size of the nucleotide database

EMBL Nucleotide Sequence Database: Release Notes - Release 113 September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                      entries      nucleotides
-------------------------------------------------------------------
CON:Constructed                          7,236,371  359,112,791,043
EST:Expressed Sequence Tag              73,715,376   40,997,082,803
GSS:Genome Sequence Scan                34,528,104   21,985,922,905
HTC:High Throughput CDNA sequencing        491,770      594,229,662
HTG:High Throughput Genome sequencing      152,599   25,159,746,658
PAT:Patents                             24,364,832   12,117,896,594
STD:Standard                            13,920,617   37,665,112,606
STS:Sequence Tagged Site                 1,322,570      636,037,867
TSA:Transcriptome Shotgun Assembly       8,085,693    5,663,938,279
WGS:Whole Genome Shotgun                88,288,431  305,661,696,545
                                       -----------  ---------------
Total                                  252,106,363  450,481,663,919

Division                                   entries      nucleotides
-------------------------------------------------------------------
ENV:Environmental Samples               30,908,230   14,420,391,278
FUN:Fungi                                6,522,586   11,614,472,226
HUM:Human                               32,094,500   38,072,362,804
INV:Invertebrates                       31,907,138   52,527,673,643
MAM:Other Mammals                       40,012,731  145,678,620,711
MUS:Mus musculus                        11,745,671   19,701,637,499
PHG:Bacteriophage                            8,511       85,549,111
PLN:Plants                              52,428,994   55,570,452,118
PRO:Prokaryotes                          2,808,489   28,807,572,238
ROD:Rodents                              6,554,012   33,326,106,733
SYN:Synthetic                            4,045,013      782,174,055
TGN:Transgenic                             285,307      849,743,891
UNC:Unclassified                         8,617,225    4,957,442,673
VRL:Viruses                              1,358,528    1,518,575,082
VRT:Other Vertebrates                   22,809,428   42,568,889,857
                                       -----------  ---------------
Total                                  252,106,363  450,481,663,919
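Release statistics of this kind are easy to check programmatically. The sketch below is only an illustration: it parses the class rows of Release 113 (copied from the release notes quoted above) and recomputes the totals. The function names `parse_row` and `mean_length` are hypothetical helpers, not part of any EMBL tool. With the numbers as listed, the reported entry total matches the sum over all classes, whereas the reported nucleotide total matches the sum over all classes except CON; a plausible reading (an assumption, not stated in the source) is that constructed entries are assembled from sequences already counted elsewhere.

```python
# Class rows copied from the EMBL Release 113 statistics:
# "<code>:<description> <entries> <nucleotides>"
CLASS_TABLE = """\
CON:Constructed 7,236,371 359,112,791,043
EST:Expressed Sequence Tag 73,715,376 40,997,082,803
GSS:Genome Sequence Scan 34,528,104 21,985,922,905
HTC:High Throughput CDNA sequencing 491,770 594,229,662
HTG:High Throughput Genome sequencing 152,599 25,159,746,658
PAT:Patents 24,364,832 12,117,896,594
STD:Standard 13,920,617 37,665,112,606
STS:Sequence Tagged Site 1,322,570 636,037,867
TSA:Transcriptome Shotgun Assembly 8,085,693 5,663,938,279
WGS:Whole Genome Shotgun 88,288,431 305,661,696,545
"""


def parse_row(line):
    """Split one statistics row into (code, description, entries, nucleotides)."""
    # The two numeric columns are the last two whitespace-separated fields;
    # everything before them is the label, which may itself contain spaces.
    label, entries, nucleotides = line.rsplit(None, 2)
    code, _, description = label.partition(":")
    return (code, description,
            int(entries.replace(",", "")),
            int(nucleotides.replace(",", "")))


rows = [parse_row(line) for line in CLASS_TABLE.splitlines() if line.strip()]

# Totals recomputed from the individual rows.
total_entries = sum(row[2] for row in rows)
nucleotides_without_con = sum(row[3] for row in rows if row[0] != "CON")


def mean_length(code):
    """Average number of nucleotides per entry for one data class."""
    for row_code, _, entries, nucleotides in rows:
        if row_code == code:
            return nucleotides / entries
    raise KeyError(code)


print("entries (all classes):", total_entries)
print("nucleotides (all classes except CON):", nucleotides_without_con)
print("mean EST length: %.0f nt" % mean_length("EST"))
```

The mean length per class gives a quick feel for the data: EST entries average a few hundred nucleotides, whereas HTG entries are several orders of magnitude longer.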

Genbank (NCBI - USA) http://www.ncbi.nlm.nih.gov/Genbank/

The EMBL Nucleotide Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/

DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/

Size of the nucleic sequence databases

  Summary of database contents for the 3 main databases of nucleic sequences.
  Source: NAR database issue, January 2006.

Database  URL                           Sequences  Bases (without shotgun)  Bases (including shotgun)  Organisms
DDBJ      http://www.ddbj.nig.ac.jp/    2.0E+06    1.7E+09                  -                          -
EMBL      http://www.ebi.ac.uk/embl/    -          -                        1.0E+11                    2.0E+05
GenBank   http://www.ncbi.nlm.nih.gov/  4.6E+07    5.1E+10                  1.0E+11                    2.1E+05

UniProt : protein sequences and functional annotations




UniProt - the Universal Protein Resource http://www.uniprot.org/

  Database content (Sept 2012)
    UniProtKB:
      •  24,532,088 entries
      •  Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
    UniProtKB/Swiss-Prot section (reviewed):
      •  537,505 entries
      •  annotation by experts
      •  high information content
      •  many references to the literature
      •  good reliability of the information
    The rest (about 98% of the entries)
      •  Automatic annotation by sequence similarity.
  Features
    The most comprehensive protein database in the world.
    A huge team: >100 annotators + developers.
    Annotation by experts: annotators are specialized for different types of proteins or organisms.
    Recognized worldwide as an essential resource.
  References
    Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
    The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.

Number of entries (polypeptides) in Swiss-Prot

http://www.expasy.org/sprot/relnotes/relstat.html

Taxonomic distribution of the sequences

Within Eukaryotes

UniProt example - Human Pax-6 protein Header : name and synonyms

UniProt example - Human Pax-6 protein Human-based annotation by specialists

UniProt example - Human Pax-6 protein Structured annotation : keywords and Gene Ontology terms

UniProt example - Human Pax-6 protein Protein interactions; Alternative products

UniProt example - Human Pax-6 protein Detailed description of regions, variations, and secondary structure

UniProt example - Human Pax-6 protein Peptidic sequence

UniProt example - Human Pax-6 protein References to original publications

UniProt example - Human Pax-6 protein Cross-references to many databases (fragment shown)
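Part of the structured information shown in these UniProt views is also available in machine-readable form. As a minimal sketch, the code below parses a UniProt FASTA header line into named fields. The example header is written by hand following the UniProt FASTA header layout (`>db|accession|entry_name protein_name OS=… OX=… GN=… PE=… SV=…`); its values, in particular the sequence version, are illustrative and should be checked against the live entry. The function name `parse_uniprot_header` is a hypothetical helper.

```python
import re

# Layout of a UniProtKB FASTA header. "sp" marks reviewed (Swiss-Prot)
# entries and "tr" marks unreviewed (TrEMBL) entries; GN= is optional.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxon_id>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?\s+PE=(?P<existence>\d)\s+SV=(?P<version>\d+)\s*$"
)


def parse_uniprot_header(line):
    """Return the fields of a UniProt FASTA header as a dictionary."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError("not a UniProt FASTA header: %r" % line)
    fields = match.groupdict()
    # Reviewed entries are those curated by Swiss-Prot annotators.
    fields["reviewed"] = fields["db"] == "sp"
    return fields


# Hand-written example header for the human Pax-6 protein (illustrative values).
EXAMPLE = (">sp|P26367|PAX6_HUMAN Paired box protein Pax-6 "
           "OS=Homo sapiens OX=9606 GN=PAX6 PE=1 SV=2")

fields = parse_uniprot_header(EXAMPLE)
print(fields["accession"], fields["gene"], fields["organism"], fields["reviewed"])
```

The `db` prefix alone distinguishes the expert-annotated Swiss-Prot section from the automatically annotated remainder, which is often the first filter applied when reliability matters.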

3D Structure of macromolecules



PDB - The Protein Data Bank http://www.rcsb.org/pdb/

Genome browsers



EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/

UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/

Human gene Pax6 aligned with Vertebrate genomes


Drosophila gene eyeless (homolog of Pax6) aligned with Insect genomes


Drosophila 120kb chromosomal region covering the Achaete-Scute Complex

ECR Browser http://ecrbrowser.dcode.org/

EnsEMBL - Example: Drosophila gene Pax6 http://www.ensembl.org/

Comparative genomics



Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/

Integr8 - genome summaries http://www.ebi.ac.uk/integr8/

Integr8 - clusters of orthologous genes (COGs) http://www.ebi.ac.uk/integr8/

Integr8 - clusters of paralogous genes http://www.ebi.ac.uk/integr8/

Databases of protein domains



Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/

Prosite - aligned sequences and logo http://www.expasy.ch/prosite/

  Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).

  The Sequence Logo (below) indicates the level of conservation of each residue in each column of the alignment.

  Note the 6 cysteines, characteristic of this domain.

Prosite - Example of profile matrix http://www.expasy.ch/prosite/

Prosite - Example of sequence logo http://www.expasy.ch/prosite/

Prosite - Example of domain signature http://www.expasy.ch/prosite/

  The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
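Because a signature is a string-based pattern, it can be translated into a regular expression and scanned against sequences. The sketch below is a simplified converter, not the official ScanProsite implementation: it handles `x`, residue sets `[...]`, exclusions `{...}`, repeat counts `(n)` and `(n,m)`, and the `<` / `>` anchors at the ends of a pattern, but not anchors placed inside brackets. The function name `prosite_to_regex` and the toy sequences are assumptions for illustration; the pattern used is the PROSITE N-glycosylation site signature `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":                  # any residue
            regex = "."
        elif core.startswith("[") and core.endswith("]"):
            regex = core                 # one of the listed residues
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            regex = re.escape(core)      # a single fixed residue
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)


n_glyc = prosite_to_regex("N-{P}-[ST]-{P}")
print(n_glyc)
print(re.search(n_glyc, "AANGSAA"))   # toy sequence containing a match
print(re.search(n_glyc, "AANPSAA"))   # proline after N: no match
```

Note that `re.search` and `re.finditer` report non-overlapping matches only, so overlapping occurrences of a signature would need a lookahead-based scan.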

PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/ Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)

CATH - Protein Structure Classification http://www.cathdb.info/

  CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
    Class (C)
    Architecture (A)
    Topology (T)
    Homologous superfamily (H)

  The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
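The four levels are reflected in the dotted numeric codes that CATH assigns to superfamilies, such as 3.40.50.300. As a minimal sketch, the code below splits such a code into its levels and finds the deepest level shared by two codes. The class names in `CLASS_NAMES` follow the usual CATH labels for classes 1 to 4; the helper names and the two example codes are chosen for illustration only.

```python
from dataclasses import dataclass

# Usual names of the four main CATH classes (first number of the code).
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


@dataclass(frozen=True)
class CathCode:
    c: int  # Class
    a: int  # Architecture
    t: int  # Topology
    h: int  # Homologous superfamily

    @classmethod
    def parse(cls, text):
        parts = text.strip().split(".")
        if len(parts) != 4:
            raise ValueError("expected a code of the form C.A.T.H: %r" % text)
        return cls(*(int(part) for part in parts))

    def lineage(self):
        """Codes of the C, A, T and H levels, from most general to most specific."""
        numbers = [self.c, self.a, self.t, self.h]
        return [".".join(str(n) for n in numbers[:i]) for i in range(1, 5)]


def deepest_shared_level(x, y):
    """Return 'C', 'A', 'T' or 'H' for the deepest common level, or None."""
    shared = None
    for level, code_x, code_y in zip("CATH", x.lineage(), y.lineage()):
        if code_x != code_y:
            break
        shared = level
    return shared


first = CathCode.parse("3.40.50.300")
second = CathCode.parse("3.40.50.720")
print(CLASS_NAMES[first.c], first.lineage())
print(deepest_shared_level(first, second))  # same topology, different superfamily
```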

  References
    Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
    Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.

CATH - Protein Structure Classification http://www.cathdb.info/

InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/

InterPro (EBI - UK) Antennapedia-like Homeobox (entry IPR001827)

The Gene Ontology (GO) database




Ontology definition

  Ontologie: partie de la métaphysique qui s'intéresse à l'être en tant qu'être, indépendamment de ses déterminations particulières.
  Ontology: the part of metaphysics that focuses on being as being, independently of its particular determinations.
  Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993

The "bio-ontologies"

  Answer to the problem of inconsistencies in the annotations
    Controlled vocabulary
    Hierarchical classification between the terms of the controlled vocabulary
  E.g.: The Gene Ontology
    molecular function ontology
    process ontology
    cellular component ontology
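The combination of a controlled vocabulary with hierarchical relationships can be pictured as a directed acyclic graph in which each term points to its more general parents. The sketch below uses a small hand-made hierarchy around "methionine biosynthetic process"; the term names resemble GO terms, but the parent links are a simplified illustration, not a copy of the actual ontology. The names `PARENTS`, `ancestors` and `annotated_to` are hypothetical.

```python
# Toy ontology: each term maps to its direct, more general parents.
# A term may have several parents, so this is a graph rather than a tree.
PARENTS = {
    "methionine biosynthetic process": [
        "sulfur amino acid biosynthetic process",
        "methionine metabolic process",
    ],
    "sulfur amino acid biosynthetic process": ["amino acid biosynthetic process"],
    "methionine metabolic process": ["amino acid metabolic process"],
    "amino acid biosynthetic process": [
        "amino acid metabolic process",
        "biosynthetic process",
    ],
    "amino acid metabolic process": ["metabolic process"],
    "biosynthetic process": ["metabolic process"],
    "metabolic process": ["biological_process"],
    "biological_process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term that is more general than the given term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotated_to(gene_terms, query):
    """A gene annotated to a term is implicitly annotated to all its ancestors."""
    return any(term == query or query in ancestors(term) for term in gene_terms)


gene_annotation = ["methionine biosynthetic process"]
print(sorted(ancestors("methionine biosynthetic process")))
print(annotated_to(gene_annotation, "amino acid metabolic process"))
```

This implicit propagation is what makes a hierarchy more useful than free text: a query for a general process also retrieves genes annotated only with its specific sub-processes.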

Gene ontology: processes

Gene ontology: molecular functions

Gene ontology: cellular components

Gene Ontology Database http://www.geneontology.org/

Gene Ontology Database (http://www.geneontology.org/)

Example: methionine biosynthetic process

Status of GO annotations (NAR DB issue 2006)

  Term definitions
    Biological process terms     9,805
    Molecular function terms     7,076
    Cellular component terms     1,574
    Sequence Ontology terms        963
  Genomes with annotation           30
    Excludes annotations from UniProt, which represent 261 annotated proteomes.
  Annotated gene products
    Total                    1,618,739
    Electronic only          1,460,632
    Manually curated           158,107

QuickGO (http://www.ebi.ac.uk/QuickGO/)

  Web site http://www.ebi.ac.uk/QuickGO/

  A user-friendly Web interface to the Gene Ontology.

  Graphical display of the hierarchical relationships between terms.

  Convenient browsing between classes.

Remarks on "bio-ontologies"

  Improvement compared to free text
    controlled vocabulary (choice among synonyms)
    hierarchical relationships between the concepts
  Nothing to do with the philosophical concept of ontology
    A "bio-ontology" is usually nothing more than a taxonomical classification of the terms of a controlled vocabulary
  Multiple possibilities of classification criteria
    e.g. compartment subtypes (plasma membrane is a membrane)
    e.g. compartment locations (nucleus is inside cytoplasm, which is inside plasma membrane)
  To be useful, should remain purpose-based
    each biologist might wish to define his/her own classification based on his/her needs and scope of interest
    impossible to define a unifying standard for all biologists
  No representation of molecular interactions
    relationships between objects are only hierarchical, not horizontal or cyclic
    e.g. does not describe which genes are the target of a given transcription factor

What is biological function ?

  A general definition
    Fonction: action, rôle caractéristique d’un élément, d’un organe, dans un ensemble (souvent opposé à structure).
    Function: characteristic action or role of an element (an organ) within a whole (often opposed to structure).
    Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982.
  Function and gene ontology
    Understanding function requires establishing the link between a molecular activity and the context in which it takes place (process).
    Multifunctionality
      •  The same activity can play different roles in different processes.
         Example: scute gene in Drosophila melanogaster: a transcription factor (activity) involved in sex determination, determination of neural precursors and Malpighian tubules (3 processes).
      •  Multiple activities of the same protein in a given process.
         Example: the PutA protein in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).

Small compounds, reactions and metabolic pathways




LIGAND - Small compounds and metabolic reactions

KEGG - Kyoto Encyclopedia of Genes and Genomes

EcoCyc, BioCyc and MetaCyc - Metabolic pathways

Protein interaction networks and transduction pathways




Microarray databases




Human genome resources



HapMap http://www.hapmap.org/

  The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.

  Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.

Issues for biomolecular databases




Issues for biological databases

  Dealing with biological complexity
  Data content
    Coverage
    Information content
  Data quality
    Data structure
    Consistency
  Query capabilities
  Interfaces
    User interfaces
    Programmatic interfaces
  Annotation
  Funding

Towards biological complexity

  The main databases currently available each focus on one type of molecular entity: nucleic sequences, proteins, compounds, …
  This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
  It becomes more difficult if we want to represent
    the interactions between biological objects,
    the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, …),
    complex concepts such as "biological function".

Data content

  Scope of the database
    types of biological objects represented
  Number of entries
    coverage of the current knowledge
  Information content
    Level of detail in the description of the biological objects
    References to the source of information

Data quality

  Data consistency
    always use the same name to indicate the same object (this seems trivial, but it is unfortunately still not always the case)
    even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
    spelling mistakes
  Data structure
    distinct fields for distinct attributes of the biological objects
  Reliability
    Evidence? Level of confidence?
    Assignment of function by similarity
      •  recursive process → propagation of errors
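The consistency recommendation above (one ID per object, retrievable by any of its synonyms) can be sketched as a small lookup structure. This is a minimal illustration under stated assumptions: the class `SynonymIndex` is hypothetical, the normalization rule (lower-casing and dropping punctuation) is one possible choice among many, and the synonyms listed for the Pax-6 entry are illustrative rather than an authoritative list.

```python
class SynonymIndex:
    """Map every known name of an object to a single stable identifier."""

    def __init__(self):
        self._names = {}    # identifier -> preferred name
        self._lookup = {}   # normalized name or synonym -> identifier

    @staticmethod
    def _normalize(name):
        # Ignore case, spaces and punctuation so that "Pax-6" and "PAX6" agree.
        return "".join(ch for ch in name.lower() if ch.isalnum())

    def add(self, identifier, preferred_name, synonyms=()):
        self._names[identifier] = preferred_name
        for name in (identifier, preferred_name, *synonyms):
            key = self._normalize(name)
            existing = self._lookup.setdefault(key, identifier)
            if existing != identifier:
                # The same name pointing to two objects is an inconsistency.
                raise ValueError("%r already refers to %s" % (name, existing))

    def resolve(self, name):
        """Return the identifier for any name or synonym, or None if unknown."""
        return self._lookup.get(self._normalize(name))

    def preferred_name(self, identifier):
        return self._names[identifier]


index = SynonymIndex()
index.add("P26367", "PAX6", ["Pax-6", "Paired box protein Pax-6", "Oculorhombin"])

print(index.resolve("pax-6"))
print(index.resolve("oculorhombin"))
print(index.resolve("unknown name"))
```

Raising an error on conflicting names makes ambiguities visible at load time instead of silently returning the wrong object at query time.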

Query capabilities

  Browsing (click and read)
  Simple search
    select records with some constraints
  More elaborate search
    select specific fields of some records with constraints on some fields (~SQL SELECT)
  Complex querying
    ability to return an answer that results from a "live" computation, and was not part of any record of the database
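The three levels of querying listed above can be illustrated with an in-memory SQLite database. This is only a sketch: the table layout, the accessions and the sequences are invented toy data, and `cys_fraction` is a hypothetical user-defined function standing in for any "live" computation whose result is not stored in a record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    "accession TEXT PRIMARY KEY, name TEXT, organism TEXT, sequence TEXT)"
)
# Invented toy records, not real database entries.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("X0001", "toy kinase", "Homo sapiens", "MKTAYIAKQR"),
        ("X0002", "toy transporter", "Escherichia coli", "MSCCLLKV"),
        ("X0003", "toy factor", "Homo sapiens", "MCCPGCCA"),
    ],
)

# Simple search: select whole records with a constraint.
human = conn.execute(
    "SELECT * FROM protein WHERE organism = ?", ("Homo sapiens",)
).fetchall()

# More elaborate search: specific fields, constraints on other fields.
short_human = conn.execute(
    "SELECT accession, name FROM protein "
    "WHERE organism = ? AND length(sequence) < ? ORDER BY accession",
    ("Homo sapiens", 10),
).fetchall()


# Complex querying: the answer comes from a computation done at query time.
def cysteine_fraction(sequence):
    return sequence.count("C") / len(sequence)


conn.create_function("cys_fraction", 1, cysteine_fraction)
cys_rich = conn.execute(
    "SELECT accession, cys_fraction(sequence) FROM protein "
    "WHERE cys_fraction(sequence) > 0.25 ORDER BY accession"
).fetchall()

print(len(human), short_human, cys_rich)
```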

Interfaces

  User interfaces
    user-friendly
    convenient browsing
    intuitive query forms
    visualization (graphical output)
  Programmatic interfaces
    communication with external programs:
      •  other databases (concept of distributed database)
      •  analysis tools

Annotation

  Problem
    The flow of available data is increasing exponentially
  Strategies
    internal curators
    selected external experts
    public submission
    computer-based extraction of information from biological texts

Funding

  Public funding
    Problem: easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
  Private funding
    Industrial companies are
      •  ready to invest in good data and good query capabilities
      •  interested in academic expertise
  Solutions
    All users pay (per query, for example)
      •  Note: academic users are funded by public money anyway
    Hybrid solution
      •  access is free for academic users, not for companies
      •  companies can buy the whole database and install it in-house (+ add their own private data)
      •  the academia-industry interface is often ensured by a spinoff company
