Biomolecular databases
Bioinformatics
Jacques van Helden
[email protected]
FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgium
Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/
NEW ADDRESS (since Nov 1st, 2011)
Université d’Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090) http://tagc.univ-mrs.fr/
Contents
• Examples of biological databases
  • Nucleic sequences: GenBank, EMBL, and DDBJ
  • Protein sequences: UniProt
  • The Gene Ontology (GO) project
• Issues and perspectives for biological databases
Examples of biomolecular databases
Sequence and structure databases
• Protein sequences (UniProt)
• DNA sequences (EMBL, GenBank, DDBJ)
• 3D structures (PDB)
• Structural motifs (CATH)
• Sequence motifs (PROSITE, PRODOM)
Genome sequences and annotations
• Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
• Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
Molecular functions
• Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
• Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
• Transport (YTPdb)
Biological processes
• Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
• Signal transduction pathways (CSNdb, Transpath)
• Protein-protein interactions (DIP, BIND, MINT)
• Gene networks (GeneNet, FlyNets)
Databases of databases
There are hundreds of databases related to molecular biology and biochemistry, and new ones are created every year.
Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
• http://nar.oupjournals.org/
• 2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
The same journal maintains a database of databases: the Molecular Biology Database Collection.
• http://www.oxfordjournals.org/nar/database/c/
Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
• http://srs.ebi.ac.uk/
Nucleic sequence databases: GenBank, EMBL, and DDBJ
Okubo et al. (2006) NAR 34: D6-D9
Nucleic sequence databases
To publish an article dealing with a sequence, scientific journals require that the sequence first be deposited in a reference database.
There are 3 main repositories for nucleic acid sequences. Sequences deposited in any of these 3 databases are automatically synchronized with the other 2.
Adapted from Didier Gonze
The sequencing pace
Nucleic sequences
• GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
  • 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
  • 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
Entire genomes
• GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
• http://www.genomesonline.org/gold_statistics.htm
Protein sequences
• Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
• UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
• http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
Size of the nucleotide database
EMBL Nucleotide Sequence Database: Release Notes - Release 113, September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html
Class                                        entries       nucleotides
----------------------------------------------------------------------
CON: Constructed                           7,236,371   359,112,791,043
EST: Expressed Sequence Tag               73,715,376    40,997,082,803
GSS: Genome Sequence Scan                 34,528,104    21,985,922,905
HTC: High Throughput CDNA sequencing         491,770       594,229,662
HTG: High Throughput Genome sequencing       152,599    25,159,746,658
PAT: Patents                              24,364,832    12,117,896,594
STD: Standard                             13,920,617    37,665,112,606
STS: Sequence Tagged Site                  1,322,570       636,037,867
TSA: Transcriptome Shotgun Assembly        8,085,693     5,663,938,279
WGS: Whole Genome Shotgun                 88,288,431   305,661,696,545
                                         -----------   ---------------
Total                                    252,106,363   450,481,663,919
Division                                     entries       nucleotides
----------------------------------------------------------------------
ENV: Environmental Samples                30,908,230    14,420,391,278
FUN: Fungi                                 6,522,586    11,614,472,226
HUM: Human                                32,094,500    38,072,362,804
INV: Invertebrates                        31,907,138    52,527,673,643
MAM: Other Mammals                        40,012,731   145,678,620,711
MUS: Mus musculus                         11,745,671    19,701,637,499
PHG: Bacteriophage                             8,511        85,549,111
PLN: Plants                               52,428,994    55,570,452,118
PRO: Prokaryotes                           2,808,489    28,807,572,238
ROD: Rodents                               6,554,012    33,326,106,733
SYN: Synthetic                             4,045,013       782,174,055
TGN: Transgenic                              285,307       849,743,891
UNC: Unclassified                          8,617,225     4,957,442,673
VRL: Viruses                               1,358,528     1,518,575,082
VRT: Other Vertebrates                    22,809,428    42,568,889,857
                                         -----------   ---------------
Total                                    252,106,363   450,481,663,919
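As a quick consistency check on the Release 113 figures, the per-division counts can be summed and compared with the reported totals. This is only an illustrative sketch: the dictionary copies the division figures from the release notes quoted above, and all variable names are assumptions made for the example.

```python
# Per-division figures copied from the EMBL Release 113 release notes:
# division code -> (entries, nucleotides).
DIVISIONS = {
    "ENV": (30_908_230, 14_420_391_278),
    "FUN": (6_522_586, 11_614_472_226),
    "HUM": (32_094_500, 38_072_362_804),
    "INV": (31_907_138, 52_527_673_643),
    "MAM": (40_012_731, 145_678_620_711),
    "MUS": (11_745_671, 19_701_637_499),
    "PHG": (8_511, 85_549_111),
    "PLN": (52_428_994, 55_570_452_118),
    "PRO": (2_808_489, 28_807_572_238),
    "ROD": (6_554_012, 33_326_106_733),
    "SYN": (4_045_013, 782_174_055),
    "TGN": (285_307, 849_743_891),
    "UNC": (8_617_225, 4_957_442_673),
    "VRL": (1_358_528, 1_518_575_082),
    "VRT": (22_809_428, 42_568_889_857),
}

# Totals as printed at the bottom of the division table.
REPORTED_TOTAL = (252_106_363, 450_481_663_919)

total_entries = sum(entries for entries, _ in DIVISIONS.values())
total_nucleotides = sum(nucleotides for _, nucleotides in DIVISIONS.values())

# Mean record length (nucleotides per entry) for each division.
mean_length = {code: n / e for code, (e, n) in DIVISIONS.items()}

# Division holding the largest number of nucleotides.
largest = max(DIVISIONS, key=lambda code: DIVISIONS[code][1])

print("entries match:", total_entries == REPORTED_TOTAL[0])
print("nucleotides match:", total_nucleotides == REPORTED_TOTAL[1])
print("largest division:", largest, "mean length:", round(mean_length[largest]))
```

The same pattern (recompute the totals, then compare with the published ones) is a cheap way to detect transcription errors when statistics are copied from release notes.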
Genbank (NCBI - USA) http://www.ncbi.nlm.nih.gov/Genbank/
The EMBL Nucleotide Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/
DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/
Database  URL                           Sequences  Bases (without shotgun)  Bases (including shotgun)  Organisms
DDBJ      http://www.ddbj.nig.ac.jp/    2.0E+06    1.7E+09                  -                          -
EMBL      http://www.ebi.ac.uk/embl/    -          -                        1.0E+11                    2.0E+05
GenBank   http://www.ncbi.nlm.nih.gov/  4.6E+07    5.1E+10                  1.0E+11                    2.1E+05
Size of the nucleic sequence databases
Summary of database contents for the 3 main databases of nucleic sequences. Source: NAR database issue January 2006.
UniProt : protein sequences and functional annotations
UniProt - the Universal Protein Resource http://www.uniprot.org/
Database content (Sept 2012)
UniProtKB:
• 24,532,088 entries
• Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
UniProtKB/Swiss-Prot section (reviewed):
• 537,505 entries
• annotation by experts
• high information content
• many references to the literature
• good reliability of the information
The rest (about 98% of the entries, given the counts above)
• Automatic annotation by sequence similarity.
Features
• The most comprehensive protein database in the world.
• A huge team: >100 annotators + developers.
• Annotation by experts: annotators are specialized for different types of proteins or organisms.
• Recognized worldwide as an essential resource.
References
• Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
• The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.
Number of entries (polypeptides) in Swiss-Prot
http://www.expasy.org/sprot/relnotes/relstat.html
Taxonomic distribution of the sequences
Within Eukaryotes
UniProt example - Human Pax-6 protein Header : name and synonyms
UniProt example - Human Pax-6 protein Human-based annotation by specialists
UniProt example - Human Pax-6 protein Structured annotation : keywords and Gene Ontology terms
UniProt example - Human Pax-6 protein Protein interactions; Alternative products
UniProt example - Human Pax-6 protein Detailed description of regions, variations, and secondary structure
UniProt example - Human Pax-6 protein Peptidic sequence
UniProt example - Human Pax-6 protein References to original publications
UniProt example - Human Pax-6 protein Cross-references to many databases (fragment shown)
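The structured fields of a UniProt entry are also visible in the header line of its FASTA export. The sketch below parses such a header into a dictionary. It is a minimal illustration, not an official parser: the header string is typed by hand in the style UniProt uses for FASTA downloads, the accession and field values are given for illustration and should be checked against the live entry, and the function name `parse_uniprot_header` is an assumption.

```python
import re

# >db|accession|entry_name followed by free text and KEY=value fields.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<rest>.*)$"
)
# Field keys used in UniProt-style FASTA headers (organism, taxon, gene, ...).
FIELD_RE = re.compile(r"\b(OS|OX|GN|PE|SV)=")


def parse_uniprot_header(line):
    """Split a UniProt-style FASTA header into its named fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProt-style FASTA header: {line!r}")
    record = {
        # 'sp' marks the reviewed Swiss-Prot section, 'tr' marks TrEMBL.
        "reviewed": match.group("db") == "sp",
        "accession": match.group("accession"),
        "entry_name": match.group("entry_name"),
    }
    # re.split with a capturing group yields [name, key1, value1, key2, ...].
    pieces = FIELD_RE.split(match.group("rest"))
    record["protein_name"] = pieces[0].strip()
    for key, value in zip(pieces[1::2], pieces[2::2]):
        record[key] = value.strip()
    return record


# Hand-typed illustrative header (values are assumptions, see above).
example = ">sp|P26367|PAX6_HUMAN Paired box protein Pax-6 OS=Homo sapiens OX=9606 GN=PAX6 PE=1 SV=2"
parsed = parse_uniprot_header(example)
print(parsed["accession"], parsed["protein_name"], parsed["OS"])
```

Keeping each attribute in its own key mirrors the point made by the slides: annotation is far more useful when it is structured than when it is free text.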
3D Structure of macromolecules
PDB - The Protein Data Bank http://www.rcsb.org/pdb/
Genome browsers
EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/
UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/
• Human gene Pax6 aligned with vertebrate genomes
• Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes
• Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex
ECR Browser http://ecrbrowser.dcode.org/
EnsEMBL - Example: Drosophila gene Pax6 http://www.ensembl.org/
Comparative genomics
Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/
Integr8 - genome summaries http://www.ebi.ac.uk/integr8/
Integr8 - clusters of orthologous genes (COGs) http://www.ebi.ac.uk/integr8/
Integr8 - clusters of paralogous genes http://www.ebi.ac.uk/integr8/
Databases of protein domains
Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/
Prosite - aligned sequences and logo http://www.expasy.ch/prosite/
Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
The Sequence Logo (below) indicates the level of conservation of each residue in each column of the alignment.
Note the 6 cysteines, characteristic of this domain.
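The conservation shown by a sequence logo is usually measured as information content per alignment column. A minimal sketch, assuming the simplest form of the measure (log2 of the alphabet size minus the Shannon entropy of the column, without small-sample correction or gap handling); the toy alignment and the function name are invented for illustration and are not taken from the Prosite entry.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) - H, where H is the Shannon entropy
    of the residue frequencies observed in the column.
    """
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


# Toy alignment: columns 1 and 4 are perfectly conserved cysteines.
alignment = ["CKAC", "CRSC", "CKTC", "CRGC"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
info = [column_information(col) for col in columns]

for position, (col, bits) in enumerate(zip(columns, info), start=1):
    print(position, col, f"{bits:.2f} bits")
```

A fully conserved column reaches the maximum of log2(20) bits, which is why invariant residues such as the cysteines of a zinc-binding domain appear as the tallest letters in a logo.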
Prosite - Example of profile matrix http://www.expasy.ch/prosite/
Prosite - Example of sequence logo http://www.expasy.ch/prosite/
Prosite - Example of domain signature http://www.expasy.ch/prosite/
The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
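Because a signature is a string-based pattern, it can be translated into a regular expression and used to scan sequences. The sketch below handles the common elements of the PROSITE pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). It is a simplified illustration: the pattern and the sequence are toy examples invented here, not a real Prosite entry, and rarer syntax such as `>` inside square brackets is not covered.

```python
import re

# One pattern element, optionally followed by a repeat count: x(2) or x(2,4).
ELEMENT_RE = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        core, low, high = ELEMENT_RE.fullmatch(element).groups()
        if core == "x":
            part = "."                      # any residue
        elif core.startswith("{"):
            part = "[^" + core[1:-1] + "]"  # forbidden residues
        else:
            part = core                     # single residue or [allowed set]
        if low is not None:
            part += "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + part + suffix)
    return "".join(regex_parts)


# Toy pattern: three cysteines with fixed spacing (hypothetical example).
toy_pattern = "C-x(2)-C-x(6)-C"
toy_regex = prosite_to_regex(toy_pattern)
hit = re.search(toy_regex, "AACKRCAAAAAACDD")
print(toy_regex, hit.start() if hit else None)
```

Patterns are easy to read and to search with, but they are all-or-nothing: a single unexpected residue prevents the match, which is one reason profiles and HMMs are used alongside them.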
PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/ Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)
CATH - Protein Structure Classification http://www.cathdb.info/
CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H).
The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
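The four levels can be read directly from a CATH classification code. The sketch below assumes the dotted numbering convention (four integers separated by dots, one per level) and the usual names of the first four classes; neither appears on the slides, so both should be treated as assumptions, and the code `1.10.10.60` is used only as an example of the format.

```python
# Names of the four main CATH classes (assumed here, not taken from the slides).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def cath_lineage(code):
    """Return the (level, code prefix) pairs for a dotted C.A.T.H code."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    return [(LEVELS[i], ".".join(parts[: i + 1])) for i in range(4)]


def cath_class_name(code):
    """Name of the class (first level) of a dotted C.A.T.H code."""
    return CATH_CLASSES.get(int(code.split(".")[0]), "unknown class")


example_code = "1.10.10.60"
for level, prefix in cath_lineage(example_code):
    print(f"{level:24s}{prefix}")
print(cath_class_name(example_code))
```

Each prefix of the code identifies one node of the hierarchy, so two domains sharing a longer prefix are grouped together at a finer level of the classification.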
References
• Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
• Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/
InterPro (EBI - UK) Antennapedia-like Homeobox (entry IPR001827)
The Gene Ontology (GO) database
Ontology definition
Ontology: the part of metaphysics that concerns being as being, independently of its particular determinations.
Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993
The "bio-ontologies"
Answer to the problem of inconsistencies in the annotations
• Controlled vocabulary
• Hierarchical classification between the terms of the controlled vocabulary
E.g.: The Gene Ontology
• molecular function ontology
• process ontology
• cellular component ontology
Gene ontology: processes
Gene ontology: molecular functions
Gene ontology: cellular components
Gene Ontology Database http://www.geneontology.org/
Gene Ontology Database (http://www.geneontology.org/)
Example: methionine biosynthetic process
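The hierarchy around a term can be represented as a graph of "is a" links, where a term may have several parents. The sketch below collects all ancestors of a term. The parent lists are a hand-written, simplified toy hierarchy chosen for illustration; they are not copied from the Gene Ontology files and real GO relationships may differ.

```python
# Toy "is a" hierarchy: term -> list of direct parents (illustrative only).
IS_A = {
    "methionine biosynthetic process": [
        "sulfur amino acid biosynthetic process",
        "methionine metabolic process",
    ],
    "sulfur amino acid biosynthetic process": ["amino acid biosynthetic process"],
    "methionine metabolic process": ["sulfur amino acid metabolic process"],
    "sulfur amino acid metabolic process": ["amino acid metabolic process"],
    "amino acid biosynthetic process": [
        "amino acid metabolic process",
        "biosynthetic process",
    ],
    "amino acid metabolic process": ["metabolic process"],
    "biosynthetic process": ["metabolic process"],
    "metabolic process": ["biological process"],
    "biological process": [],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


for name in sorted(ancestors("methionine biosynthetic process")):
    print(name)
```

Because a term can have more than one parent, the structure is a directed acyclic graph rather than a tree, and a gene annotated with a specific term is implicitly annotated with all of its ancestors.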
Status of GO annotations (NAR DB issue 2006)
Term definitions
  Biological process terms     9,805
  Molecular function terms     7,076
  Cellular component terms     1,574
  Sequence Ontology terms        963
Genomes with annotation           30
  Excludes annotations from UniProt, which represent 261 annotated proteomes.
Annotated gene products
  Total                    1,618,739
  Electronic only          1,460,632
  Manually curated           158,107
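The share of manual curation follows directly from these counts. A small worked example, using only the 2006 figures quoted above (variable names are arbitrary):

```python
# Annotated gene products, NAR database issue 2006 (figures from the slide).
total = 1_618_739
electronic_only = 1_460_632
manually_curated = 158_107

# The two categories partition the total.
assert electronic_only + manually_curated == total

manual_fraction = manually_curated / total
electronic_fraction = electronic_only / total

print(f"manually curated: {manual_fraction:.1%}")
print(f"electronic only:  {electronic_fraction:.1%}")
```

Roughly one annotated gene product in ten had been manually curated at that time; the rest relied on electronic annotation only.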
QuickGO (http://www.ebi.ac.uk/QuickGO/)
Web site http://www.ebi.ac.uk/QuickGO/
A user-friendly Web interface to the Gene Ontology.
Graphical display of the hierarchical relationships between terms.
Convenient browsing between classes.
Remarks on "bio-ontologies"
Improvement compared to free text
• controlled vocabulary (choice among synonyms)
• hierarchical relationships between the concepts
Nothing to do with the philosophical concept of ontology
• A "bio-ontology" is usually nothing more than a taxonomical classification of the terms of a controlled vocabulary
Multiple possibilities of classification criteria
• e.g. compartment subtypes (plasma membrane is a membrane)
• e.g. compartment locations (nucleus is inside cytoplasm, which is inside plasma membrane)
To be useful, should remain purpose-based
• each biologist might wish to define his/her own classification based on his/her needs and scope of interest
• impossible to define a unifying standard for all biologists
No representation of molecular interactions
• relationships between objects are only hierarchical, not horizontal or cyclic
• e.g. does not describe which genes are the target of a given transcription factor
What is biological function ?
A general definition
• Function: characteristic action or role of an element, of an organ, within a whole (often opposed to structure).
• Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982.
Function and gene ontology
• Understanding function requires establishing the link between a molecular activity and the context in which it takes place (process).
Multifunctionality
• The same activity can play different roles in different processes.
  Example: the scute gene in Drosophila melanogaster encodes a transcription factor (activity) involved in sex determination, determination of neural precursors and Malpighian tubules (3 processes).
• A single protein can have multiple activities in a given process.
  Example: the PutA protein in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
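One way to make the activity/process distinction concrete is to store each annotation as a (gene product, activity, process) record and count activities and processes separately. The sketch below encodes the two examples above; the record type, the function name and the generic labels given to the two enzymatic activities of PutA are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    gene_product: str
    activity: str   # what the molecule does
    process: str    # the context in which it does it


ANNOTATIONS = [
    # One activity, three processes.
    Annotation("scute", "transcription factor", "sex determination"),
    Annotation("scute", "transcription factor", "determination of neural precursors"),
    Annotation("scute", "transcription factor", "determination of Malpighian tubules"),
    # Three activities, one process (enzymatic activities left generic).
    Annotation("PutA", "enzymatic activity 1", "proline utilization"),
    Annotation("PutA", "enzymatic activity 2", "proline utilization"),
    Annotation("PutA", "DNA-binding transcription factor", "proline utilization"),
]


def summarize(annotations):
    """Map each gene product to (number of activities, number of processes)."""
    activities = defaultdict(set)
    processes = defaultdict(set)
    for ann in annotations:
        activities[ann.gene_product].add(ann.activity)
        processes[ann.gene_product].add(ann.process)
    return {g: (len(activities[g]), len(processes[g])) for g in activities}


summary = summarize(ANNOTATIONS)
print(summary)
```

Keeping activity and process as separate fields is what allows the two kinds of multifunctionality to be told apart; a single free-text "function" field would merge them.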
Small compounds, reactions and metabolic pathways
LIGAND - Small compounds and metabolic reactions
KEGG - Kyoto Encyclopedia of Genes and Genomes
EcoCyc, BioCyc and MetaCyc - Metabolic pathways
Protein interaction networks and transduction pathways
Microarray databases
Human genome resources
HapMap http://www.hapmap.org/
The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.
Issues for biomolecular databases
Issues for biological databases
• Dealing with biological complexity
• Data content
  • Coverage
  • Information content
• Data quality
  • Data structure
  • Consistency
• Query capabilities
• Interfaces
  • User interfaces
  • Programmatic interfaces
• Annotation
• Funding
Towards biological complexity
The main databases currently available are focused on one type of molecular entity: nucleic sequences, proteins, compounds, …
This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
It becomes more difficult if we want to represent
• the interactions between biological objects,
• the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, …),
• complex concepts such as "biological function".
Data content
Scope of the database
• types of biological objects represented
Number of entries
• coverage of the current knowledge
Information content
• Level of detail in the description of the biological objects
• References to the source of information
Data quality
Data consistency
• always use the same name to indicate the same object (this seems trivial, but it is unfortunately still not always the case)
• even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
• spelling mistakes
Data structuring
• distinct fields for distinct attributes of the biological objects
Reliability
• Evidence? Level of confidence?
• Assignment of function by similarity
  • recursive process → propagation of errors
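The "one ID per object, retrievable by any synonym" idea can be sketched as a small index that maps every normalized name to a stable identifier. This is a minimal illustration under stated assumptions: the class name, the identifiers (`OBJ:0001`, `OBJ:0002`) and the choice of synonyms are invented for the example and do not come from any real database.

```python
class SynonymIndex:
    """Map any known name of an object to a single stable identifier."""

    def __init__(self):
        self._id_by_name = {}
        self._names_by_id = {}

    @staticmethod
    def _normalize(name):
        # Case and surrounding whitespace are ignored when comparing names.
        return name.strip().lower()

    def add(self, object_id, *names):
        for name in names:
            key = self._normalize(name)
            existing = self._id_by_name.get(key)
            if existing is not None and existing != object_id:
                # The same name pointing to two objects is an inconsistency.
                raise ValueError(f"{name!r} already refers to {existing}")
            self._id_by_name[key] = object_id
            self._names_by_id.setdefault(object_id, []).append(name)

    def lookup(self, name):
        """Return the identifier for a name, or None if the name is unknown."""
        return self._id_by_name.get(self._normalize(name))

    def synonyms(self, object_id):
        return list(self._names_by_id.get(object_id, []))


index = SynonymIndex()
index.add("OBJ:0001", "PAX6", "Pax-6", "Oculorhombin")  # hypothetical ID
index.add("OBJ:0002", "eyeless", "ey")                   # hypothetical ID

print(index.lookup("pax-6"), index.lookup("EY"), index.lookup("unknown"))
```

Rejecting a name that already points to another object turns a silent inconsistency into an explicit error at data-entry time, which is where it is cheapest to fix.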
Query capabilities
Browsing (click and read)
Simple search
• select records with some constraints
More elaborate search
• select specific fields of some records with constraints on some fields (~SQL SELECT)
Complex querying
• ability to return an answer that results from a "live" computation, and was not part of any record of the database
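The three levels of querying can be illustrated with an in-memory SQLite table. Everything in this sketch is a toy: the table layout, the accessions (`TOY001`, ...) and the lengths are invented for illustration and carry no biological meaning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    "accession TEXT PRIMARY KEY, name TEXT, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("TOY001", "Pax-6", "Homo sapiens", 400),
        ("TOY002", "eyeless", "Drosophila melanogaster", 800),
        ("TOY003", "scute", "Drosophila melanogaster", 300),
        ("TOY004", "PutA", "Escherichia coli", 1300),
    ],
)

# Simple search: whole records matching a constraint.
simple = conn.execute(
    "SELECT * FROM protein WHERE organism = ?", ("Homo sapiens",)
).fetchall()

# More elaborate search: selected fields, constraint on another field.
elaborate = conn.execute(
    "SELECT accession, name FROM protein WHERE length > ? ORDER BY accession",
    (500,),
).fetchall()

# Complex querying: the answer is computed, it is not stored in any record.
computed = conn.execute(
    "SELECT organism, AVG(length) FROM protein GROUP BY organism ORDER BY organism"
).fetchall()

print(simple)
print(elaborate)
print(computed)
```

The last query returns average lengths per organism, values that exist in no individual record, which is the sense of a "live" computation in the list above.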
Interfaces
User interfaces
• user-friendly
• convenient browsing
• intuitive query forms
• visualization (graphical output)
Programmatic interfaces
• communication with external programs:
  • other databases (concept of distributed database)
  • analysis tools
Annotation
Problem
• The flow of available data is increasing exponentially
Strategies
• internal curators
• selected external experts
• public submission
• computer-based extraction of information from biological texts
Funding
Public funding
• Problem: it is easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
Private funding
• Industrial companies are
  • ready to invest in good data and good query capabilities
  • interested in academic expertise
Solutions
• All users pay (per query for example)
  • Note: academic users are funded by public funds anyway
• Hybrid solution
  • access is free for academic users, not for companies
  • companies can buy the whole database and install it in-house (+ add their own private data)
  • the academia-industry interface is often ensured by a spinoff company