La place de la phylogénie en Bio-informatique
Jean-Michel CLAVERIE
Structural & Genetic Information LaboratoryCNRS UPR2589Luminy, France
http://www.igs.cnrs-mrs.fr
Journées RNG « Phylogénie »
Bio-Informatique des Séquences:Une progression naturelle
• Une séquence: trouver où sont les motifs et les gènes
• Deux séquences: les aligner
• Trois et plus: faire une arbre
Gene finding: the general principles behind all current methods
1. Gene-encoded peptide sequences have to lead to foldable, compact structure in a water environnement,
2. Real protein are thus made of a balanced mixture of hydrophilic, hydrophobic, rigid, flexible, neutral, or charged residues
3. This constraint induces a recognizable statistical bias in coding DNA sequences, that we use for detection.
4. Around the good looking regions, we look for facultative additional signals,
5. The optimal prediction is then the best scoring one according to a « standard » gene model.
64 codons, including 3 STOPs : TGA, TAA, TAGSimple situation like E.coli: (A=T=G=C=0.25)
Between two successive STOP codons: each of the 61 codons -> 1.6 % expected relative frequency
In contrast, codon distributions within bona fide genes have to results into ‘foldable’ proteins, hence balancing hydrophylic and hydrophobic, + and - charged, rigid and flexible residues.
Non-coding vs. Protein-coding DNA sequence:Why is there a bias?
Coding segments have to be compatible with the typical amino acid composition of
natural proteins
Amino Acid % in N codon % expected proteins at random
Ala 9.46 4 6.4 +Arg 5.84 6 9.6 -Asn 3.99 2 3.2 +Asp 5.10 2 3.2 +Cys 1.20 2 3.2 -Gln 4.57 2 3.2 +Glu 5.50 2 3.2 +Gly 7.41 4 6.4 +His 2.44 2 3.2 -Ile 5.95 3 4.8 +Leu 10.5 6 9.6 +Lys 4.14 2 3.2 +Met 2.72 1 1.6 +Phe 4.19 2 3.2 +Pro 4.39 4 6.4 -Ser 5.71 6 9.6 -Thr 5.50 4 6.4 -Trp 1.48 1 1.6Tyr 2.76 2 3.2Val 7.16 4 6.4
AA with
+
or
–
20% deviation
from random
A variety of methods take advantageof this bias
Sequence
SlidingWindow
Chi2
(obs-rand)
CodingRegion
The Caesar (Julius) Cipher
Plaintext : veni, vidi, viciCiphertext: YHQL, YLGL, YLFL
Plain alphabet: abcdefghijklmnopqrstuvwxyzCipher alphabet: DEFGHIJKLMNOPQRSTUVWXYZABC
The Caesar cipher is based on a cipher alphabet that is shifted a certain number of places, relative to the plain alphabet
Frequency of letters in
English
letters % letters %a 8.2 n 6.7
b 1.5 o 7.5
c 2.8 p 1.9
d 4.3 q 0.1
e 12.7 r 6.0
f 2.2 s 6.3
g 2.0 t 9.1
h 6.1 u 2.8
i 7.0 v 1.0
j 0.2 w 2.4
k 0.8 x 0.2
l 4.0 y 2.0
m 2.4 z 0.1
From 100,362 characters
(Beker & Piper, 1950)
Frequency analysis of an
enciphered message
letters % letters %A 0.9 N 0.9
B 7.4 O 11.2
C 8.0 P 9.2
D 4.1 Q 0.6
E 1.5 R 1.8
F 0.6 S 2.1
G 0.3 T 0.0
H 0.0 U 1.8
I 3.3 V 5.3
J 5.3 W 0.3
K 7.7 X 10.1
L 7.4 Y 5.6
M 3.3 Z 1.5
XZAVOIDBYGERSPCFHJKLMNQTUWabcdefghijklmnopqrstuvwxyz
e = 12.7
The foundingFather of
« Gene-finding » bioinformatics
‘Al-kin-dee-u’
Al-Kindī
Al-Kindī (0850)Decyphering Cryptographic MessagesPMID: 0000001
Homophonic substitution cipher
a b c d e f g h i j k l m n o p q r s t u v w x y
09 48 13 01 14 10 06 23 32 15 04 26 22 18 00 38 94 29 11 17 08 34 60 28 21
12 81 41 03 16 31 25 39 70 37 27 58 05 95 35 19 20 61 89 52
33 62 45 24 50 73 51 66 54 42 76 43
47 79 44 56 83 84 71 72 77 86 49
53 46 65 88 91 90 80 96 69
67 55 68 93 99 75
78 57 85
92 64 97
74
82
87
Frequent letters, many codes
Amino Acid % in N codon % expected proteins at random
Leu 10.5 6 9.6 +Arg 5.84 6 9.6 -Ser 5.71 6 9.6 -Ala 9.46 4 6.4 +Gly 7.41 4 6.4 +Val 7.16 4 6.4Thr 5.50 4 6.4 -Pro 4.39 4 6.4 -Ile 5.95 3 4.8 +Glu 5.50 2 3.2 +Asp 5.10 2 3.2 +Gln 4.57 2 3.2 +Phe 4.19 2 3.2 +Lys 4.14 2 3.2 +Asn 3.99 2 3.2 +Tyr 2.76 2 3.2His 2.44 2 3.2 -Cys 1.20 2 3.2 -Met 2.72 1 1.6 +Trp 1.48 1 1.6
« Homophonic » Substitution in the Genetic Code
From old to new methods
Methods:1. Codon usage (Staden & McLachlan, 1982) : profile2. Differential k-tuple frequency (1986) : profile
K-tuple concepts and methods
ATGCTAGCATAGCTGCATGACATGCATGCAATGC TGCA TGCT ATGC GCTA CATG CTAG GCAT TAGC TGCA AGCA ATGC
4-tuple(tetramer)
Pre-compute: Fcoding (ATGC) , Fncoding (ATGC)
……………. , ………………
……………. , ………………
Fcoding (°°°°) , Fncoding (°°°°)
K-tuple profile
Sequence
Fc
-------
Fc
+Fnc
CodingRegion
ATGC
Fcoding (ATGC)
------------------------------------------------
Fcoding (ATGC) + Fncoding (ATGC)
0.5
k-tuple frequency analysis was probably knownof the Arabs (1000) and rediscovered
during the Renaissance
Homophonic deciphering:
looking for 2-tuples that often, never or rarely occurs: je, jx, wa
Blaise de Vigenère(1586)
invented his undecipherable cipher
to resist k-tuple analysis
The Vigenère
square
Plain abcdefghijklmnopqrstuvwxyz 1 BCDEFGHIJKLMNOPQRSTUVWXYZA 2 CDEFGHIJKLMNOPQRSTUVWXYZAB 3 DEFGHIJKLMNOPQRSTUVWXYZABC 4 EFGHIJKLMNOPQRSTUVWXYZABCD 5 FGHIJKLMNOPQRSTUVWXYZABCDE 6 GHIJKLMNOPQRSTUVWXYZABCDEF 7 HIJKLMNOPQRSTUVWXYZABCDEFG 8 IJKLMNOPQRSTUVWXYZABCDEFGH 9 JKLMNOPQRSTUVWXYZABCDEFGHI10 KLMNOPQRSTUVWXYZABCDEFGHIJ11-20 ……………………………………………………………………21 VWXYZABCDEFGHIJKLMNOPQRSTU22 WXYZABCDEFGHIJKLMNOPQRSTUV23 XYZABCDEFGHIJKLMNOPQRSTUVW24 YZABCDEFGHIJKLMNOPQRSTUVWX25 ZABCDEFGHIJKLMNOPQRSTUVWXY26 ABCDEFGHIJKLMNOPQRSTUVWXYZ
The Vigenère square: example
Keyword : WHITE WHITEWHITEWHITEWHIPlaintext : diverttroopstoeastridgeCiphertext: ZPDXVPAZHSLZBHIWZBKMZNM
Plain abcdefghijklmnopqrstuvwxyz 4 EFGHIJKLMNOPQRSTUVWXYZABCD 7 HIJKLMNOPQRSTUVWXYZABCDEFG 8 IJKLMNOPQRSTUVWXYZABCDEFGH19 TUVWXYZABCDEFGHIJKLMNOPQRS22 WXYZABCDEFGHIJKLMNOPQRSTUV
WHITE
Charles Babbage:
Another pioneerIn bioinformatics
Cryptanalysis ofThe Vigenère cypher (1854)
PMID: 0000002
1. Look for repeats in the cipher text
2. Infer the keyword lenght from consistent repeat distance: L
3. Then analyze the character frequency at each position independently: 1, 2, 3 , …., L
4. Deduce (« call ») the corresponding Ceasar shift
The invention of inhomogeneous (and hidden) Markov Models
From old to new methods
Methods:1. Codon usage (Staden & McLachlan, 1982) : profile2. Differential k-tuple frequency (1986) : profile3. Differential in-phase k-tuples (1988) : profile4. Non-homogeous Markov chains (Borodvsky,1986;
Tavare & Song, 1989) : profile/call5. Neural-net (Mural & Uberbacher, 1991) : call6. Hidden Markov chain (Kulp & al., 1996) : call
Bioinformatics vs. Biology
• 1951 1st protein sequence (Insulin, Sanger)
• 1960 Sequence-structure relationship (Globins, Perutz)
• 1965 "Evolutionary divergence & convergence in Proteins" Zuckerkandl & Pauling
• 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash. • 1968 Atlas of Protein Sequences (M. Dayhoff, Georgetown)
• 1970 "A general method applicable to the search for similaries in amino-acid sequences of two proteins" Needleman & Wunsch
• 1973 Genetic engineering (Cohen, Boyer et al.)
• 1974 "Prediction of Protein Conformation" Chou & Fasman • 1977 ADN sequencing (Sanger, Maxam, Gilbert)
• 1977 1st bioinformatic "package" (Staden; DB/ assembly, analysis)
• 1978 Databases: ACNUC, PIR, EMBL, GenBank• 1980 Database access via telephone lines ( PIR )• 1981 Los Alamos-GenBank: 270 seqs, 370.000 nt
• 1965 First « industrial » computer IBM/360• "Evolutionary divergence and convergence in Proteins"
Zuckerkandl & Pauling• 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash. • 1968 First « mini » computer DEC PDP-8 (floor top)• Atlas of Protein Sequences (M. Dayhoff, Georgetown)• 1970 "A general method applicable to the search for similaries "
Needleman & Wunsch.• 1971 1st work on RNA folding (Ninio)• 1972 First « micro » processor Intel 8008• 1973 Genetic engineering (Cohen et al.)• 1974 "Prediction of Protein Conformation" Chou & Fasman • 1975 Intel 8080, kit Altair• 1977 1st bioinformatic package (Staden; PDP 8/11) • DEC-VAX Mini-computers• Micro-computer (Apple, Commodore, Radioshack)• 1978 Databases: ACNUC, PIR, EMBL, GenBank• 1980 Telephone access to the PIR database• 1981 Los Alamos-GenBank: 270 seqs, 370.000 nt• IBM-PC (8088), 16-32kb• 1983 IBM-XT HARD disk (10 Mbytes)• 1984 MacIntosh : graphic/mouse interface
Bioinformatics vs. Computers
Perspective historique (suite)
• 1981 Local Alignement (Smith-Waterman , JMB)
• 1985 "Fasta" (Pearson-Lipman, PNAS)
• 1989 ARPANET --> INTERNET
• 1990 "Blast" (Altschul et al., JMB)
• 1990 Positional cloning of NF-1
• 1991 "Grail", first practical gene «finder » (Mural et al., PNAS)
• 1991 "EST" (Venter et al., Matsubara et al.)
• 1992 Complete sequence of yeast chr. 3
• 1995 Complete sequence of H. influenza
• 1996 Complete sequence of S. cerevisiae
• 1997 "Gapped Blast" (Alschul et al., NAR)/Genscan (Burge & Karlin)
• 1997 11 complete bacterial genomes available
• 1998 2 Mbase/day of new public sequence data, C. elegans• 2000 Human Chr 22, 21, 90%draft, Drosophila, 30 bacterial
genomes
Bioinformatics & Genomics: more recent past
Bon workshop!
Genes & Genomes
DNA
RNA (transcripts)
transcription
Gene 1 Gene 2 Gene 3 Gene 4
translation+ folding
Proteins (or RNAs)Function 2 Function 3
Function 4
Function 1
Bio
Info
rmat
ics