
Biomolecular databases

Bioinformatics

Jacques van Helden

FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgium
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/

NEW ADDRESS (since Nov 1st, 2011)
Université d'Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/

Contents

  • Examples of biological databases
      • Nucleic sequences: GenBank, EMBL, and DDBJ
      • Protein sequences: UniProt
      • The Gene Ontology (GO) project
  • Issues and perspectives for biological databases


Examples of biomolecular databases

  • Sequence and structure databases
      • Protein sequences (UniProt)
      • DNA sequences (EMBL, GenBank, DDBJ)
      • 3D structures (PDB)
      • Structural motifs (CATH)
      • Sequence motifs (PROSITE, PRODOM)
  • Genome sequences and annotations
      • Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
      • Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
  • Molecular functions
      • Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
      • Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
      • Transport (YTPdb)
  • Biological processes
      • Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
      • Signal transduction pathways (CSNdb, Transpath)
      • Protein-protein interactions (DIP, BIND, MINT)
      • Gene networks (GeneNet, FlyNets)

Databases of databases

  • There are hundreds of databases related to molecular biology and biochemistry, and new ones are created every year.
  • Every year, the first issue of Nucleic Acids Research (NAR) is dedicated to biological databases.
      • http://nar.oupjournals.org/
      • 2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
  • The same journal maintains a database of databases: the Molecular Biology Database Collection.
      • http://www.oxfordjournals.org/nar/database/c/
  • Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
      • http://srs.ebi.ac.uk/

Nucleic sequence databases: GenBank, EMBL, and DDBJ


Okubo et al. (2006) NAR 34: D6-D9

Nucleic sequence databases

  • To publish an article dealing with a sequence, scientific journals require that the sequence first be deposited in a reference database.
  • There are 3 main repositories for nucleic acid sequences.
  • Sequences deposited in any of these 3 databases are automatically synchronized with the other 2.

Adapted from Didier Gonze

The sequencing pace

  • Nucleic sequences
      • GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
          • 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
          • 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
  • Entire genomes
      • GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
      • http://www.genomesonline.org/gold_statistics.htm
  • Protein sequences
      • Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
      • UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
      • http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

Size of the nucleotide database

EMBL Nucleotide Sequence Database: Release Notes - Release 113, September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                        entries       nucleotides
----------------------------------------------------------------------
CON:Constructed                            7,236,371   359,112,791,043
EST:Expressed Sequence Tag                73,715,376    40,997,082,803
GSS:Genome Sequence Scan                  34,528,104    21,985,922,905
HTC:High Throughput cDNA sequencing          491,770       594,229,662
HTG:High Throughput Genome sequencing        152,599    25,159,746,658
PAT:Patents                               24,364,832    12,117,896,594
STD:Standard                              13,920,617    37,665,112,606
STS:Sequence Tagged Site                   1,322,570       636,037,867
TSA:Transcriptome Shotgun Assembly         8,085,693     5,663,938,279
WGS:Whole Genome Shotgun                  88,288,431   305,661,696,545
                                         -----------   ---------------
Total                                    252,106,363   450,481,663,919

Division                                     entries       nucleotides
----------------------------------------------------------------------
ENV:Environmental Samples                 30,908,230    14,420,391,278
FUN:Fungi                                  6,522,586    11,614,472,226
HUM:Human                                 32,094,500    38,072,362,804
INV:Invertebrates                         31,907,138    52,527,673,643
MAM:Other Mammals                         40,012,731   145,678,620,711
MUS:Mus musculus                          11,745,671    19,701,637,499
PHG:Bacteriophage                              8,511        85,549,111
PLN:Plants                                52,428,994    55,570,452,118
PRO:Prokaryotes                            2,808,489    28,807,572,238
ROD:Rodents                                6,554,012    33,326,106,733
SYN:Synthetic                              4,045,013       782,174,055
TGN:Transgenic                               285,307       849,743,891
UNC:Unclassified                           8,617,225     4,957,442,673
VRL:Viruses                                1,358,528     1,518,575,082
VRT:Other Vertebrates                     22,809,428    42,568,889,857
                                         -----------   ---------------
Total                                    252,106,363   450,481,663,919
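As a quick consistency check on these release statistics, the per-class counts can be summed and compared with the reported totals. The sketch below is only illustrative (the variable names are hypothetical, the figures are copied from the class statistics above). It suggests that the entry total counts every class, whereas the nucleotide total matches the sum only when the CON (Constructed) class is left out.

```python
# Per-class statistics copied from the EMBL release 113 figures above:
# class code -> (entries, nucleotides)
CLASSES = {
    "CON": (7_236_371, 359_112_791_043),
    "EST": (73_715_376, 40_997_082_803),
    "GSS": (34_528_104, 21_985_922_905),
    "HTC": (491_770, 594_229_662),
    "HTG": (152_599, 25_159_746_658),
    "PAT": (24_364_832, 12_117_896_594),
    "STD": (13_920_617, 37_665_112_606),
    "STS": (1_322_570, 636_037_867),
    "TSA": (8_085_693, 5_663_938_279),
    "WGS": (88_288_431, 305_661_696_545),
}
REPORTED_ENTRIES = 252_106_363
REPORTED_NUCLEOTIDES = 450_481_663_919

# Sum each column over all classes.
total_entries = sum(entries for entries, _ in CLASSES.values())
total_nt_all = sum(nucleotides for _, nucleotides in CLASSES.values())
# Same nucleotide sum, but leaving out the CON class.
total_nt_without_con = total_nt_all - CLASSES["CON"][1]

print("entries match reported total:        ", total_entries == REPORTED_ENTRIES)
print("nucleotides match (all classes):     ", total_nt_all == REPORTED_NUCLEOTIDES)
print("nucleotides match (CON left out):    ", total_nt_without_con == REPORTED_NUCLEOTIDES)
```

One plausible reading, stated here as an assumption rather than taken from the release notes, is that constructed entries are assembled from sequences already counted in other classes, so adding their nucleotides would count the same bases twice.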

GenBank (NCBI - USA) http://www.ncbi.nlm.nih.gov/Genbank/

The EMBL Nucleotide Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/

DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/

Size of the nucleic sequence databases

  • Summary of database contents for the 3 main databases of nucleic sequences.
  • Source: NAR database issue, January 2006.

Columns: URL, Sequences, Bases (without shotgun), Bases (including shotgun), Organisms

DDBJ http://www.ddbj.nig.ac.jp/ 2.0E+06 1.7E+09
EMBL http://www.ebi.ac.uk/embl/ 1.0E+11 2.0E+05
GenBank http://www.ncbi.nlm.nih.gov/ 4.6E+07 5.1E+10 1.0E+11 2.1E+05

UniProt : protein sequences and functional annotations


UniProt - the Universal Protein Resource http://www.uniprot.org/

  • Database content (Sept 2012)
      • UniProtKB:
          • 24,532,088 entries
          • Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
      • UniProtKB/Swiss-Prot section (reviewed):
          • 537,505 entries
          • annotation by experts
          • high information content
          • many references to the literature
          • good reliability of the information
      • The rest (about 98% of the entries, given the counts above)
          • Automatic annotation by sequence similarity.
  • Features
      • The most comprehensive protein database in the world.
      • A huge team: >100 annotators + developers.
      • Annotation by experts: annotators are specialized for different types of proteins or organisms.
      • Recognized worldwide as an essential resource.
  • References
      • Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
      • The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.

Number of entries (polypeptides) in Swiss-Prot

http://www.expasy.org/sprot/relnotes/relstat.html

Taxonomic distribution of the sequences

Within Eukaryotes
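The two entry counts quoted for Sept 2012 fix the share of expert-reviewed entries. The sketch below works it out under both possible readings of the 24,532,088 figure (whether or not it already includes Swiss-Prot), since the slide does not say which applies. The function and constant names are illustrative.

```python
# Entry counts quoted for Sept 2012.
UNIPROTKB_ENTRIES = 24_532_088
SWISSPROT_ENTRIES = 537_505


def reviewed_share(reviewed: int, other: int, other_includes_reviewed: bool) -> float:
    """Fraction of reviewed entries, under either reading of the larger count."""
    total = other if other_includes_reviewed else other + reviewed
    return reviewed / total


# Reading 1 (assumption): the larger count already includes Swiss-Prot.
share_if_included = reviewed_share(SWISSPROT_ENTRIES, UNIPROTKB_ENTRIES, True)
# Reading 2 (assumption): the larger count excludes Swiss-Prot.
share_if_separate = reviewed_share(SWISSPROT_ENTRIES, UNIPROTKB_ENTRIES, False)

print(f"reviewed share, reading 1: {share_if_included:.1%}")
print(f"reviewed share, reading 2: {share_if_separate:.1%}")
```

Under either reading the reviewed section is close to 2% of the entries, so automatically annotated entries make up roughly 98%.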

UniProt example - Human Pax-6 protein

  • Header: name and synonyms
  • Human-based annotation by specialists
  • Structured annotation: keywords and Gene Ontology terms
  • Protein interactions; alternative products
  • Detailed description of regions, variations, and secondary structure
  • Peptide sequence
  • References to original publications
  • Cross-references to many databases (fragment shown)

3D Structure of macromolecules


PDB - The Protein Data Bank http://www.rcsb.org/pdb/


Genome browsers


EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/

UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/

  • Human gene Pax6 aligned with vertebrate genomes
  • Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes
  • Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex

ECR Browser http://ecrbrowser.dcode.org/


EnsEMBL - Example: Drosophila gene Pax6 http://www.ensembl.org/


Comparative genomics


Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/

  • Genome summaries
  • Clusters of orthologous genes (COGs)
  • Clusters of paralogous genes

Databases of protein domains


Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/


Prosite - aligned sequences and logo http://www.expasy.ch/prosite/

  • Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
  • The sequence logo (below) indicates the level of conservation of each residue in each column of the alignment.
  • Note the 6 cysteines, characteristic of this domain.
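The "level of conservation" shown by a sequence logo is commonly measured as the information content of each alignment column, i.e. the maximal entropy minus the observed entropy. The sketch below computes it for a small made-up alignment. The sequences are hypothetical (not the Prosite ones), and the small-sample correction and background frequencies that logo tools may apply are left out.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content (bits) of one alignment column: log2(N) - H."""
    counts = Counter(column)
    total = len(column)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


# Hypothetical toy alignment: columns 1 and 4 are fully conserved cysteines.
ALIGNMENT = [
    "CRKCA",
    "CKRCS",
    "CRRCT",
    "CKKCG",
]

# Transpose the alignment into columns.
columns = ["".join(seq[i] for seq in ALIGNMENT) for i in range(len(ALIGNMENT[0]))]
for position, column in enumerate(columns, start=1):
    print(position, column, round(column_information(column), 2))
```

A fully conserved column reaches the maximum of log2(20) bits, while a column with four equally frequent residues loses 2 bits, which is why conserved cysteines stand out as the tallest letters in a logo.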


Prosite - Example of profile matrix http://www.expasy.ch/prosite/


Prosite - Example of sequence logo http://www.expasy.ch/prosite/


Prosite - Example of domain signature http://www.expasy.ch/prosite/

  The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
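Because a signature is a string-based pattern, it can be turned into a regular expression and scanned against sequences. The sketch below handles the basic PROSITE pattern elements (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name, the example pattern and the test sequence are made up for illustration. It is not the actual Prosite scanning code.

```python
import re

# One pattern element: optional '<', a residue spec, optional repeat, optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Convert a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":                      # any residue
            core_re = "."
        elif core.startswith("{"):           # any residue except those listed
            core_re = "[^" + core[1:-1] + "]"
        else:                                # single residue or [allowed set]
            core_re = core
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core_re + repeat + ("$" if end else ""))
    return "".join(parts)


# Hypothetical pattern and sequence, for illustration only.
regex = prosite_to_regex("C-x(2)-C-x(6)-C")
hit = re.search(regex, "AACKKCAAAAAACGG")
print(regex, hit.start(), hit.group())
```

String patterns are simple to scan for, but they give a yes/no answer only, which is one reason the profile matrices shown earlier are used for more divergent domains.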


PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/ Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)


CATH - Protein Structure Classification http://www.cathdb.info/

  • CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
      • Class (C)
      • Architecture (A)
      • Topology (T)
      • Homologous superfamily (H)
  • The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
  • References
      • Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
      • Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
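The four levels lend themselves to a dotted numeric code, one number per level. The sketch below assumes such a four-part code (for example `3.40.50.300`). The code format and the two example codes are assumptions brought in for illustration and are not taken from the slide. Comparing two codes level by level shows how far down the hierarchy two domains are classified together.

```python
from typing import NamedTuple


class CathNode(NamedTuple):
    """One position in the hierarchy, one number per level (C, A, T, H)."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathNode:
    """Split an assumed dotted code 'C.A.T.H' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathNode(*(int(part) for part in parts))


def shared_levels(a: CathNode, b: CathNode) -> int:
    """Number of leading levels (C, then A, then T, then H) that two domains share."""
    shared = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        shared += 1
    return shared


# Illustrative codes: same class, architecture and topology, different superfamily.
first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")
print(first, shared_levels(first, second))
```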


InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/


InterPro (EBI - UK) Antennapedia-like Homeobox (entry IPR001827)


The Gene Ontology (GO) database


Ontology definition

  • Ontology: the part of metaphysics that deals with being as being, independently of its particular determinations.
  • Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993 (translated from French).

The "bio-ontologies"

  • Answer to the problem of inconsistencies in the annotations
      • Controlled vocabulary
      • Hierarchical classification between the terms of the controlled vocabulary
  • E.g.: The Gene Ontology
      • molecular function ontology
      • process ontology
      • cellular component ontology
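A controlled vocabulary with a hierarchical classification can be sketched as a table of terms, each pointing to its more general parent terms. The mini-vocabulary below is hypothetical (the term names are loosely modelled on process terms and are not copied from the Gene Ontology files). It shows how all the more general classes of a term can be collected, including the case where a term has more than one parent.

```python
# Hypothetical mini-vocabulary: each term lists its direct, more general parents.
PARENTS = {
    "biological process": [],
    "metabolic process": ["biological process"],
    "biosynthetic process": ["metabolic process"],
    "amino acid metabolic process": ["metabolic process"],
    "amino acid biosynthetic process": [
        "biosynthetic process",
        "amino acid metabolic process",
    ],
    "methionine biosynthetic process": ["amino acid biosynthetic process"],
}


def ancestors(term: str) -> set:
    """Collect every more general term reachable from `term`."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


for name in sorted(ancestors("methionine biosynthetic process")):
    print(name)
```

With such a structure, a gene annotated with a specific term is implicitly annotated with all of its ancestors, so a query on a general class also retrieves genes annotated at a finer level.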

Gene Ontology: processes

Gene Ontology: molecular functions

Gene Ontology: cellular components

Gene Ontology Database http://www.geneontology.org/

Example: methionine biosynthetic process

Status of GO annotations (NAR DB issue 2006)

  • Term definitions
      • Biological process terms: 9,805
      • Molecular function terms: 7,076
      • Cellular component terms: 1,574
      • Sequence Ontology terms: 963
  • Genomes with annotation: 30
      • Excludes annotations from UniProt, which represent 261 annotated proteomes.
  • Annotated gene products
      • Total: 1,618,739
      • Electronic only: 1,460,632
      • Manually curated: 158,107
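These figures can be checked and summarized with a few lines of arithmetic. The sketch below uses the numbers quoted above (the variable names are illustrative). It confirms that the two annotation categories add up to the total and gives the share of manually curated gene products.

```python
# Figures quoted above (NAR database issue 2006).
GO_TERMS = {
    "biological process": 9_805,
    "molecular function": 7_076,
    "cellular component": 1_574,
}
TOTAL_GENE_PRODUCTS = 1_618_739
ELECTRONIC_ONLY = 1_460_632
MANUALLY_CURATED = 158_107

# Total number of terms across the three GO ontologies.
total_go_terms = sum(GO_TERMS.values())
# Share of gene products with manually curated annotation.
manual_fraction = MANUALLY_CURATED / TOTAL_GENE_PRODUCTS

print("GO terms in the three ontologies:", total_go_terms)
print("categories add up:", ELECTRONIC_ONLY + MANUALLY_CURATED == TOTAL_GENE_PRODUCTS)
print(f"manually curated share: {manual_fraction:.1%}")
```

Fewer than one gene product in ten had manually curated annotation at that date, which ties in with the reliability concerns raised for annotation by similarity.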


QuickGO (http://www.ebi.ac.uk/QuickGO/)

  • A user-friendly Web interface to the Gene Ontology.
  • Graphical display of the hierarchical relationships between terms.
  • Convenient browsing between classes.

Remarks on "bio-ontologies"

  • Improvement compared to free text
      • controlled vocabulary (choice among synonyms)
      • hierarchical relationships between the concepts
  • Nothing to do with the philosophical concept of ontology
      • A "bio-ontology" is usually nothing more than a taxonomical classification of the terms of a controlled vocabulary
  • Multiple possibilities of classification criteria
      • e.g. compartment subtypes (plasma membrane is a membrane)
      • e.g. compartment locations (nucleus is inside cytoplasm, which is inside plasma membrane)
  • To be useful, should remain purpose-based
      • each biologist might wish to define his/her own classification based on his/her needs and scope of interest
      • impossible to define a unifying standard for all biologists
  • No representation of molecular interactions
      • relationships between objects are only hierarchical, not horizontal or cyclic
      • e.g. does not describe which genes are the target of a given transcription factor
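The two classification criteria and the missing interaction links can be contrasted in a small sketch. The relation tables below reuse the compartment examples given above. The regulatory links use invented names (`TF_A`, `TF_B`, `gene_1`) purely for illustration.

```python
# Same vocabulary, two classification criteria (examples from the text above).
IS_A = {"plasma membrane": "membrane"}
LOCATED_IN = {"nucleus": "cytoplasm", "cytoplasm": "plasma membrane"}


def chain(term: str, relation: dict) -> list:
    """Follow one hierarchical relation upward from a term."""
    path = []
    while term in relation:
        term = relation[term]
        path.append(term)
    return path


# Hypothetical regulatory links: (regulator, target) pairs. Two factors that
# regulate each other form a cycle, which a strict hierarchy cannot hold.
REGULATES = {("TF_A", "gene_1"), ("TF_A", "TF_B"), ("TF_B", "TF_A")}


def targets(regulator: str) -> list:
    """Genes directly targeted by a regulator."""
    return sorted(target for source, target in REGULATES if source == regulator)


print(chain("plasma membrane", IS_A))    # classification by subtype
print(chain("nucleus", LOCATED_IN))      # classification by location
print(targets("TF_A"))                   # interaction query, outside any hierarchy
```

The same term gets a different set of "parents" depending on the criterion, and the regulatory question needs a separate graph of pairwise links.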


What is biological function?

  • A general definition
      • Function: characteristic action or role of an element or organ within a whole (often opposed to structure).
      • Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982 (translated from French).
  • Function and Gene Ontology
      • Understanding function requires establishing the link between a molecular activity and the context in which it takes place (process).
      • Multifunctionality
          • The same activity can play different roles in different processes.
              • Example: scute gene in Drosophila melanogaster: a transcription factor (activity) involved in sex determination, determination of neural precursors, and malpighian tubules (3 processes).
          • Multiple activities of the same protein in a given process
              • Example: PutA in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
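One way to capture this link between activity and process is to annotate each gene product with (activity, process) pairs rather than with two independent lists. The sketch below encodes the two examples above in that form. The generic labels for the two PutA enzymatic activities are placeholders, since they are not named here.

```python
from collections import defaultdict

# (gene product, molecular activity, biological process) triples,
# taken from the scute and PutA examples above.
ANNOTATIONS = [
    ("scute", "transcription factor", "sex determination"),
    ("scute", "transcription factor", "determination of neural precursors"),
    ("scute", "transcription factor", "malpighian tubules"),
    ("PutA", "enzymatic activity 1", "proline utilization"),   # placeholder label
    ("PutA", "enzymatic activity 2", "proline utilization"),   # placeholder label
    ("PutA", "DNA-binding transcription factor", "proline utilization"),
]


def summarize(annotations):
    """Count distinct activities and distinct processes per gene product."""
    activities = defaultdict(set)
    processes = defaultdict(set)
    for gene, activity, process in annotations:
        activities[gene].add(activity)
        processes[gene].add(process)
    return {gene: (len(activities[gene]), len(processes[gene])) for gene in activities}


summary = summarize(ANNOTATIONS)
for gene, (n_activities, n_processes) in summary.items():
    print(f"{gene}: {n_activities} activities, {n_processes} processes")
```

Keeping the pairs together preserves which activity is exerted in which process, information that is lost when activities and processes are stored as separate lists.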


Small compounds, reactions and metabolic pathways


LIGAND - Small compounds and metabolic reactions

KEGG - Kyoto Encyclopedia of Genes and Genomes

EcoCyc, BioCyc and MetaCyc - Metabolic pathways

Protein interaction networks and transduction pathways


Microarray databases


Human genome resources


HapMap http://www.hapmap.org/

  • The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
  • Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.

Issues for biomolecular databases


Issues for biological databases

  • Dealing with biological complexity
  • Data content
      • Coverage
      • Information content
  • Data quality
      • Data structure
      • Consistency
  • Query capabilities
  • Interfaces
      • User interfaces
      • Programmatic interfaces
  • Annotation
  • Funding

Towards biological complexity

  • The main databases currently available focus on one type of molecular entity: nucleic sequences, proteins, compounds, …
  • This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
  • It becomes more difficult if we want to represent
      • the interactions between biological objects,
      • the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, …),
      • complex concepts such as "biological function".

Data content

  • Scope of the database
      • types of biological objects represented
  • Number of entries
      • coverage of the current knowledge
  • Information content
      • Level of detail in the description of the biological objects
      • References to the source of information

Data quality

  • Data consistency
      • always use the same name to indicate the same object (this seems trivial, but it is unfortunately still not always the case)
      • even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
      • spelling mistakes
  • Data structure
      • distinct fields for distinct attributes of the biological objects
  • Reliability
      • Evidence? Level of confidence?
      • Assignment of function by similarity
          • recursive process → propagation of errors
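The "one ID per object, retrievable by any synonym" idea can be sketched as a small registry. The class, the normalization rule and the IDs below are hypothetical. The names reuse spellings that appear elsewhere in this course (Pax-6, Pax6, eyeless).

```python
class Registry:
    """Maps every known name of an object to one stable identifier."""

    def __init__(self):
        self._id_by_name = {}    # normalized name -> object ID
        self._primary_name = {}  # object ID -> preferred name

    @staticmethod
    def _normalize(name: str) -> str:
        # Assumed rule: ignore case, surrounding spaces and hyphens.
        return name.strip().lower().replace("-", "")

    def add(self, object_id: str, primary_name: str, synonyms=()):
        """Register an object under its primary name and all its synonyms."""
        self._primary_name[object_id] = primary_name
        for name in (primary_name, *synonyms):
            key = self._normalize(name)
            existing = self._id_by_name.setdefault(key, object_id)
            if existing != object_id:
                raise ValueError(f"{name!r} already refers to {existing}")

    def resolve(self, name: str):
        """Return the ID for any known name or synonym, or None if unknown."""
        return self._id_by_name.get(self._normalize(name))


registry = Registry()
registry.add("OBJ0001", "PAX6", synonyms=["Pax-6", "Pax6"])  # hypothetical IDs
registry.add("OBJ0002", "eyeless")

print(registry.resolve("pax-6"), registry.resolve("Eyeless"), registry.resolve("unknown"))
```

Rejecting a name that already points to another object is a design choice made for this sketch. Real gene names are often ambiguous across organisms, so a production resource would need a richer rule.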


Query capabilities

  • Browsing (click and read)
  • Simple search
      • select records with some constraints
  • More elaborate search
      • select specific fields of some records with constraints on some fields (~SQL SELECT)
  • Complex querying
      • ability to return an answer that results from a "live" computation, and was not part of any record of the database
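The three levels of search can be illustrated with an in-memory SQLite database. The table layout and its three records are invented for the example. Only the query styles matter here.

```python
import sqlite3

# Hypothetical toy table, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("P1", "protein A", "organism X", 120),
        ("P2", "protein B", "organism X", 340),
        ("P3", "protein C", "organism Y", 560),
    ],
)

# Simple search: whole records matching a constraint.
records = conn.execute(
    "SELECT * FROM protein WHERE organism = ? ORDER BY id", ("organism X",)
).fetchall()

# More elaborate search: selected fields, with constraints on other fields.
names = conn.execute(
    "SELECT name FROM protein WHERE organism = ? AND length > ? ORDER BY name",
    ("organism X", 200),
).fetchall()

# Complex querying: a value computed at query time, stored in no record.
mean_lengths = conn.execute(
    "SELECT organism, AVG(length) FROM protein GROUP BY organism ORDER BY organism"
).fetchall()

print(records)
print(names)
print(mean_lengths)
```

The last query returns an average that exists in none of the stored records, which is the sense of a "live" computation above.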


Interfaces

  • User interfaces
      • user-friendly
      • convenient browsing
      • intuitive query forms
      • visualization (graphical output)
  • Programmatic interfaces
      • communication with external programs:
          • other databases (concept of distributed database)
          • analysis tools

Annotation

  • Problem
      • The flow of available data is increasing exponentially
  • Strategies
      • internal curators
      • selected external experts
      • public submission
      • computer-based extraction of information from biological texts

Funding

  • Public funding
      • Problem: it is easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
  • Private funding
      • Industrial companies are
          • ready to invest in good data and good query capabilities
          • interested in academic expertise
  • Solutions
      • All users pay (per query, for example)
          • Note: academic users are funded by public funds anyway
      • Hybrid solution
          • access is free for academic users, not for companies
          • companies can buy the whole database and install it in-house (+ add their own private data)
          • the academia-industry interface is often handled by a spin-off company