View
16
Download
0
Category
Preview:
Citation preview
Genome annotation
Erwin Datema (2011) Sandra Smit (2012, 2013)
Genome annotation AGACAAAGATCCGCTAAATTAAATCTGGACTTCACATATTGAAGTGATATCACACGTTTCTCTAATAATCTCCTCACAATATTATGTTTGGGATGAACTTGTCGTGATTTGCCATTGTAGCAATCACTTGAAGCCAAAATGTAGCAGCTTGGTTATCACAATAGATAGTTAAAGGTGGTACTGACTTACTCCACAAAGGACTATCTATCAACATAGATTGCAACCAATCATCTTCCTCCGAAAGAAGACATAGCTATAAATTCTGATTCATGGTTGAATGAGTAATACATGTCTGCTTCTTTGATTTACAAGAGATCGATGCTCCAGCCAAAGTGAAAATTCAAGCAGTATTTAGACTCATTCGGATCAGAATTCCAAGTAGCATCGCTATAACCTACGAAACATTTGAAATGTAGAATAATGTAGGCCAACATGTATCATACCCTTCAGGCATCTTAAATCACGAATTATGACATTCCAATGTTCAACTCTGGGATTACTAGTAAACTTGCTCAAAGTTCCTAGTGCTTAACCATATCCGGCCTGGTACAATGCATGACATACATAAAACTCCCAACAATCTGAGAATATGTCAGTTGATCGAGAACATCACTAGAGTTTCGCATTAGTTCGATGCTAGGATCAAATGGTGTAGGAGCAAGTTTGTCTCAAAGTGACCAAACCTCTTTAAATTTAAATTTAAATTTAAATTTAAATTTAAACTCAATATAACTTGATTGAATAAGAGTTAGGCCATTCGTTGATCTTATAATTTTGATGCCCAAAATAAATTTATAATGTTATAATACATAAAGACATATTATAACACAGATGTGTTTTGAAATTTACTAAATATGCAAATATCATCACCATTGATTGAGTAGTCATTAGAAATCATTACTCATCTAAATTTTTCATTTCATTATTTTGGAGCTTGCTTTAATCCAAAAAGAGATTTAAAAAGCTTACAGACTTTGTGTTCTTACAGGTATGACAAATACTTCTGATTGTTTCATGTACACTTCTTCATCTAGATCACCATTTAGAAATGCAGTCTTCACATCCATTTGATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGGTACCATCGATTTTAAATTCTTTCTAAAAGTCCATCTACAATCAATAGATTTACATCCAGAAGGGAGGTCTGATAAAATTCATGTATTGTTTGACATAATAGAGTGCATTTCATCATTAATACCTTCACGCCAGAAAGGATCATCATGTGAAGCCGTTGCTTCAACAAAACTTTCAGGATCCCTTTCAACTAAATAAACTTGAAATTTTGGTCTAAAATCTCTTGGTTTGGCTGATCTTGCACTACGTCTTGATTGATTTTCTTCCAAAGGTTCAACTATTTTCTTTTAAGAGAAGATATA!
AGACAAAGATCCGCTAAATTAAATCTGGACTTCACATATTGAAGTGATATCACACGTTTCTCTAATAATCTCCTCACAATATTATGTTTGGGATGAACTTGTCGTGATTTGCCATTGTAGCAATCACTTGAAGCCAAAATGTAGCAGCTTGGTTATCACAATAGATAGTTAAAGGTGGTACTGACTTACTCCACAAAGGACTATCTATCAACATAGATTGCAACCAATCATCTTCCTCCGAAAGAAGACATAGCTATAAATTCTGATTCATGGTTGAATGAGTAATACATGTCTGCTTCTTTGATTTACAAGAGATCGATGCTCCAGCCAAAGTGAAAATTCAAGCAGTATTTAGACTCATTCGGATCAGAATTCCAAGTAGCATCGCTATAACCTACGAAACATTTGAAATGTAGAATAATGTAGGCCAACATGTATCATACCCTTCAGGCATCTTAAATCACGAATTATGACATTCCAATGTTCAACTCTGGGATTACTAGTAAACTTGCTCAAAGTTCCTAGTGCTTAACCATATCCGGCCTGGTACAATGCATGACATACATAAAACTCCCAACAATCTGAGAATATGTCAGTTGATCGAGAACATCACTAGAGTTTCGCATTAGTTCGATGCTAGGATCAAATGGTGTAGGAGCAAGTTTGTCTCAAAGTGACCAAACCTCTTTAAATTTAAATTTAAATTTAAATTTAAATTTAAACTCAATATAACTTGATTGAATAAGAGTTAGGCCATTCGTTGATCTTATAATTTTGATGCCCAAAATAAATTTATAATGTTATAATACATAAAGACATATTATAACACAGATGTGTTTTGAAATTTACTAAATATGCAAATATCATCACCATTGATTGAGTAGTCATTAGAAATCATTACTCATCTAAATTTTTCATTTCATTATTTTGGAGCTTGCTTTAATCCAAAAAGAGATTTAAAAAGCTTACAGACTTTGTGTTCTTACAGGTATGACAAATACTTCTGATTGTTTCATGTACACTTCTTCATCTAGATCACCATTTAGAAATGCAGTCTTCACATCCATTTGATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGGTACCATCGATTTTAAATTCTTTCTAAAAGTCCATCTACAATCAATAGATTTACATCCAGAAGGGAGGTCTGATAAAATTCATGTATTGTTTGACATAATAGAGTGCATTTCATCATTAATACCTTCACGCCAGAAAGGATCATCATGTGAAGCCGTTGCTTCAACAAAACTTTCAGGATCCCTTTCAACTAAATAAACTTGAAATTTTGGTCTAAAATCTCTTGGTTTGGCTGATCTTGCACTACGTCTTGATTGATTTTCTTCCAAAGGTTCAACTATTTTCTTTTAAGAGAAGATATA!
Gene: SL6G63120.1
Description: Putative disease resistance gene, Mi-homolog
Domains: NBS, LRR
Best Blast hit: SA004581
Gene prediction – the eukaryotic gene model
intergenic intergenic
initial exon internal exon terminal exon
intron intron
5’ UTR 3’ UTR
splice sites ATG *
promoter poly-A
protein coding region
Structural gene annotation – alignment–based (1)
Prediction of gene structures based on alignments Transcripts and proteins provide direct evidence Requires experimental data for each gene
ATG *
ATG *
Structural gene annotation – alignment–based (2)
Genome-to-genome alignment Requires annotated genome of closely related species
Example – tomato genome browser
* ATG!
Structural gene annotation – ab initio (1)
Prediction of gene structures based on ‘gene model’ Start, stop, splice sites Exon, intron, intergenic length distributions Triplet/hexamer frequencies (coding vs. non-coding)
ATG! ATG! ATG!ATG!* * *
Structural gene annotation – ab initio (2)
Different predictors produce different results Underlying models (HMM, SVM, …) Quality of training Lack of understanding of biology
Structural gene annotation – ab initio (3)
Requirements for ab initio gene predictors Training through verified transcript (and protein) alignments Sufficient sequence context in order to make accurate predictions
Some properties are common for all eukaryotes Start, stop, splice site consensus
Many properties differ, even between related species Intron and intergenic length distribution Codon usage
Example - differences in intron lengths
~200 nt ~300 nt
~1200 nt ~2200 nt
Bradnam and Korf, PLoS One. 2008
Generation of a consensus gene structure
GP1
GP2 GP3 EST prot
gene
ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGG!
Example – tomato genome browser
Functional gene annotation – alignment–based
Inferring function through sequence similarity Proteins with similar sequence often share function
Annotation quality of database sequences Many proteins with unknown function Propagation of erroneous annotation
GQPKSKITHVVFCCTSGVDMPGADYQLTKLLGLRPSVKRLMMYQQG!|||| |: ||||| ||||||||| ||||| |||||:| |||||||!GQPKEKLGHVVFCTTSGVDMPGA--QLTKLMGLRPSIKKLMMYQQG!
Functional gene annotation – domain–based
Inferring function through domain searches Domains are the functional parts of a protein
Global functional annotation of the protein E.g. kinase, ATP-binding Gene Ontology (GO) terms
NBS CC LRR LRR LRR
The annotated gene
model
blastx Unknown protein [Arabidopsis thaliana]
domains NBS! LRR! LRR! LRR!
Gene Ontology terms GO:0005524 ATP binding GO:0006915 apoptosis
blastn Putative disease resistance gene, Mi-homolog
ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCAGCTTTGTACTTATCTAAGG!
Genome annotation: more than gene finding AGACAAAGATCCGCTAAATTAAATCTGGACTTCACATATTGAAGTGATATCACACGTTTCTCTAATAATCTCCTCACAATATTATGTTTGGGATGAACTTGTCGTGATTTGCCATTGTAGCAATCACTTGAAGCCAAAATGTAGCAGCTTGGTTATCACAATAGATAGTTAAAGGTGGTACTGACTTACTCCACAAAGGACTATCTATCAACATAGATTGCAACCAATCATCTTCCTCCGAAAGAAGACATAGCTATAAATTCTGATTCATGGTTGAATGAGTAATACATGTCTGCTTCTTTGATTTACAAGAGATCGATGCTCCAGCCAAAGTGAAAATTCAAGCAGTATTTAGACTCATTCGGATCAGAATTCCAAGTAGCATCGCTATAACCTACGAAACATTTGAAATGTAGAATAATGTAGGCCAACATGTATCATACCCTTCAGGCATCTTAAATCACGAATTATGACATTCCAATGTTCAACTCTGGGATTACTAGTAAACTTGCTCAAAGTTCCTAGTGCTTAACCATATCCGGCCTGGTACAATGCATGACATACATAAAACTCCCAACAATCTGAGAATATGTCAGTTGATCGAGAACATCACTAGAGTTTCGCATTAGTTCGATGCTAGGATCAAATGGTGTAGGAGCAAGTTTGTCTCAAAGTGACCAAACCTCTTTAAATTTAAATTTAAATTTAAATTTAAATTTAAACTCAATATAACTTGATTGAATAAGAGTTAGGCCATTCGTTGATCTTATAATTTTGATGCCCAAAATAAATTTATAATGTTATAATACATAAAGACATATTATAACACAGATGTGTTTTGAAATTTACTAAATATGCAAATATCATCACCATTGATTGAGTAGTCATTAGAAATCATTACTCATCTAAATTTTTCATTTCATTATTTTGGAGCTTGCTTTAATCCAAAAAGAGATTTAAAAAGCTTACAGACTTTGTGTTCTTACAGGTATGACAAATACTTCTGATTGTTTCATGTACACTTCTTCATCTAGATCACCATTTAGAAATGCAGTCTTCACATCCATTTGATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGGTACCATCGATTTTAAATTCTTTCTAAAAGTCCATCTACAATCAATAGATTTACATCCAGAAGGGAGGTCTGATAAAATTCATGTATTGTTTGACATAATAGAGTGCATTTCATCATTAATACCTTCACGCCAGAAAGGATCATCATGTGAAGCCGTTGCTTCAACAAAACTTTCAGGATCCCTTTCAACTAAATAAACTTGAAATTTTGGTCTAAAATCTCTTGGTTTGGCTGATCTTGCACTACGTCTTGATTGATTTTCTTCCAAAGGTTCAACTATTTTCTTTTAAGAGAAGATATA!
AGACAAAGATCCGCTAAATTAAATCTGGACTTCACATATTGAAGTGATATCACACGTTTCTCTAATAATCTCCTCACAATATTATGTTTGGGATGAACTTGTCGTGATTTGCCATTGTAGCAATCACTTGAAGCCAAAATGTAGCAGCTTGGTTATCACAATAGATAGTTAAAGGTGGTACTGACTTACTCCACAAAGGACTATCTATCAACATAGATTGCAACCAATCATCTTCCTCCGAAAGAAGACATAGCTATAAATTCTGATTCATGGTTGAATGAGTAATACATGTCTGCTTCTTTGATTTACAAGAGATCGATGCTCCAGCCAAAGTGAAAATTCAAGCAGTATTTAGACTCATTCGGATCAGAATTCCAAGTAGCATCGCTATAACCTACGAAACATTTGAAATGTAGAATAATGTAGGCCAACATGTATCATACCCTTCAGGCATCTTAAATCACGAATTATGACATTCCAATGTTCAACTCTGGGATTACTAGTAAACTTGCTCAAAGTTCCTAGTGCTTAACCATATCCGGCCTGGTACAATGCATGACATACATAAAACTCCCAACAATCTGAGAATATGTCAGTTGATCGAGAACATCACTAGAGTTTCGCATTAGTTCGATGCTAGGATCAAATGGTGTAGGAGCAAGTTTGTCTCAAAGTGACCAAACCTCTTTAAATTTAAATTTAAATTTAAATTTAAATTTAAACTCAATATAACTTGATTGAATAAGAGTTAGGCCATTCGTTGATCTTATAATTTTGATGCCCAAAATAAATTTATAATGTTATAATACATAAAGACATATTATAACACAGATGTGTTTTGAAATTTACTAAATATGCAAATATCATCACCATTGATTGAGTAGTCATTAGAAATCATTACTCATCTAAATTTTTCATTTCATTATTTTGGAGCTTGCTTTAATCCAAAAAGAGATTTAAAAAGCTTACAGACTTTGTGTTCTTACAGGTATGACAAATACTTCTGATTGTTTCATGTACACTTCTTCATCTAGATCACCATTTAGAAATGCAGTCTTCACATCCATTTGATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCTACCAAGCGAGCTTTGTACTTATCTAAGGTACCATCGATTTTAAATTCTTTCTAAAAGTCCATCTACAATCAATAGATTTACATCCAGAAGGGAGGTCTGATAAAATTCATGTATTGTTTGACATAATAGAGTGCATTTCATCATTAATACCTTCACGCCAGAAAGGATCATCATGTGAAGCCGTTGCTTCAACAAAACTTTCAGGATCCCTTTCAACTAAATAAACTTGAAATTTTGGTCTAAAATCTCTTGGTTTGGCTGATCTTGCACTACGTCTTGATTGATTTTCTTCCAAAGGTTCAACTATTTTCTTTTAAGAGAAGATATA!
Non-coding RNAs tRNA genes rRNA miRNAs …
Repetitive sequences Interspersed repeats (e.g. transposons) Tandem repeats, SSRs
Repeat identification and masking
Repeats may contain coding elements E.g. reverse transcriptase in a retrotransposon
This may result in many ‘false’ gene predictions 56,797 genes predicted in rice 16,220 of these are repeat-related!
Prior to gene prediction, repeats should be masked sequence similarity: requires database of known repeats de novo: distinguish between gene families and repeats
Genome annotation pipeline genome sequence
BLAST
gene predictor
repeat masker
gene predictor gene predictor
integration
domain search
repeat database transcripts
miRNA tRNA
The annotated genome
ATGTGTTACCATACTATGAATTGCGGCTAAAGCAACAAGAGTTGAATATAAGTCATTCGAGCTATAGGTGCAAAGGTATCAAAATAATCAATATCTTTCAATTGAGTATAACCTTTTGCAC
genes
repeats
tRNAs
CGAGTCAGCTTCATATACTGCGCGCGATATATATTATCGCGTACGATCGATCGATCTGTACGGGTGACTTATTCGTGTATAGTCTATATCTTCGCTAGCTGATTATCGAGCGTACGTACGT
genes
repeats
tRNAs
blastx
blastx
ABC transporter
Cytochrome P450 MADS box transcription factor
Example – tomato genome browser
Beyond genome annotation
Automated annotation can provide candidate genes Similarity to known genes from other species Targets for crop improvement, treatment of (genetic)
diseases, etc.
Comparative genomics Study the evolution of species What makes a species unique? What makes an individual unique?
Activities
Read “A beginner's guide to eukaryotic genome annotation” Mark Yandell, Daniel Ence Nature Reviews Genetics 2012 vol. 13 (5) pp. 329-42
Explore the tomato genome annotation http://solgenomics.net/ Tomato Genome Project
Explore the Arabidopsis genome annotation http://www.arabidopsis.org/ Tools, gbrowse
The End
© Wageningen UR
Recommended