
What is Bioinformatics? Course objectivesasa/courses/cs646/spr11/pdfs/01... · 2011-01-17 · What is Bioinformatics? • The science of managing (collecting, storing, making available)


CS646: Machine learning in bioinformatics

An explosion of biological data

DNA Sequencing

…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

Interaction networks

What is Bioinformatics?
•  The science of managing (collecting, storing, making available) and analyzing biological data using computer science and statistics tools.

Course objectives
•  Present interesting and challenging biological machine learning problems
    •  Data: high dimensional, heterogeneous, non-vectorial
    •  Require non-standard machine learning solutions
•  Machine learning methods for solving them
    •  Our focus: kernel methods such as Support Vector Machines
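To make "kernel methods on non-vectorial data" concrete, here is a minimal sketch of one possible string kernel: a k-mer (spectrum-style) kernel that compares two sequences by the substrings of length k they share. The function names and the choice k=3 are assumptions made for illustration; the slides do not say which kernels the course uses.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of sequences x and y.

    The sequences are never converted to fixed-length vectors explicitly;
    only k-mers that occur in both sequences contribute to the sum.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[kmer] * cy[kmer] for kmer in cx.keys() & cy.keys())


if __name__ == "__main__":
    a = "ACGTGACTGAGGACCGTG"
    b = "CGACTGAGACTGACTGGGT"
    print(spectrum_kernel(a, b, k=3))
```

Because the value is an inner product of count vectors, it is a valid kernel and could in principle be passed to a Support Vector Machine as a precomputed kernel matrix.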

Course Overview
•  Homework [30%]
    •  Around 3 assignments
    •  Theoretical and practical
•  Projects [50%]
•  Paper presentation(s) [20%]

Some biology…


All living things are made of cells

A cell:
•  Is separated from its surroundings by a membrane.
•  Takes nutrients from the environment and converts them into other molecules and energy.
•  Replicates.

Prokaryotes vs. Eukaryotes

[Figure: a prokaryote (no nucleus) and a eukaryote. Source: http://www.bio.miami.edu/dana/104/]

What’s in a cell

                   Prokaryotes   Eukaryotes
Water              70%           70%
DNA                0.25%         1%
RNA                1%            6%
Proteins           18%           15%
Lipids (fat)       5%            2%
Polysaccharides    2%            2%
Metabolites        4%            4%

Prokaryotes and Eukaryotes

•  Three main branches to the tree of life.
•  Prokaryotes: mostly unicellular
•  Eukaryotes: plants, animals, fungi and certain algae.

Prokaryotes vs Eukaryotes (cont)

Prokaryotes         Eukaryotes
Single cell         Single or multi cell
No nucleus          Nucleus
No organelles       Organelles
One circular DNA    Chromosomes
Compact genome      DNA mostly non-coding, lots of repeats; exons/introns, splicing

•  Proteins: the workhorses of the cell
    •  Catalyze reactions
    •  Carry signals and transport molecules
    •  Form many of the cellular structures
    •  Regulate cellular processes


Proteins
•  Protein: a chain of amino acids that folds into a 3-D shape that gives it its function

[Figure: myoglobin]

DNA, RNA, protein, and the flow of genetic information

The Genetic Code

•  DNA, RNA: strings over the four-letter nucleotide alphabet (A, C, G, T/U)

•  Proteins: strings over the twenty-letter amino acid alphabet. Each amino acid is coded by 3 nucleotides (a codon).
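The codon-to-amino-acid mapping can be sketched in a few lines. The example below builds the standard genetic code as a lookup table and translates a coding sequence three nucleotides at a time. The function name `translate`, the one-letter amino acid codes, and the decision to stop at the first stop codon are assumptions made for this illustration.

```python
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a DNA or RNA coding sequence into one-letter amino acids.

    Reading starts at position 0 and stops at the first stop codon;
    trailing nucleotides that do not fill a codon are ignored.
    """
    dna = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

Since 4^3 = 64 codons map to only 20 amino acids plus stop, several codons encode the same amino acid.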

DNA: The Code of Life

•  The structure and the four genomic letters code for all living organisms

•  The four letters are Adenine, Guanine, Thymine, and Cytosine, which pair A-T and C-G on complementary strands.

Some Terminology
•  The genome is an organism’s genetic material
    •  Bacterial genomes: from about 600,000 base pairs (bp) up to several million bp
    •  Human and mouse genomes: around 3 billion bp
•  The human genome is organized into 24 distinct chromosomes (22 autosomes plus X and Y).
    •  Each chromosome contains many genes.
•  Genes
    •  Basic physical and functional units of heredity.
    •  Sequences of DNA that encode proteins.

A bit of history…
•  Chargaff and Vischer (1949)
    •  DNA consists of A, T, G, C; approximately equal frequencies of A and T, and of G and C
•  1953: James D. Watson and Francis H. C. Crick deduced the double helical structure of DNA (using Chargaff’s rule and the X-ray image from Rosalind Franklin)

Watson & Crick with DNA model

Rosalind Franklin with X-ray image of DNA


Genome Sequencing
•  1986 Leroy Hood: automatic sequencing
•  1990 Human Genome Project launched by Congress
•  1995 John Craig Venter: first bacterial genome sequenced
•  1996 First eukaryotic genome (yeast) sequenced
•  1998 Complete sequence of the Caenorhabditis elegans genome
•  1999 First human chromosome (number 22) sequenced

John Craig Venter

Leroy Hood

Sequencing the human genome

•  2001 First draft of the sequence of the human genome published

What’s next?

“The sequence of the genome is the key to all biological processes…. The ability to sequence genomes at low cost will revolutionize biological research, and will form the basis of the individualized medicine of the future.”
Prof. Hans Lehrach, Ph.D.
Head, Department of Vertebrate Genomics
Archon X PRIZE for Genomics Scientific Advisory Board

http://genomics.xprize.org/

The future is already here

Genome is just the beginning…
•  By itself, a raw genome sequence is practically useless
•  Need to infer the location of genes
•  Infer the biochemical functions of those genes
•  How those genes interact
•  How they are regulated

Why sequence?
•  “Nothing in biology makes sense except in the light of evolution” - Theodosius Dobzhansky.
•  The functionality of many genes is virtually the same among many organisms: we can understand biology in simpler organisms than ourselves (“model organisms”)


Genome sizes

Organism                          # of base pairs   # of chromosomes

Virus
  HIV                             9193              1
  SARS                            29751             1

Prokaryotic
  Haemophilus influenzae          1.8 x 10^6        1
  Escherichia coli (bacterium)    4.6 x 10^6        1

Eukaryotic
  S. cerevisiae (yeast)           1.35 x 10^7       17
  Drosophila melanogaster (fly)   1.65 x 10^8       4
  Homo sapiens (human)            2.9 x 10^9        23
  Zea mays (corn)                 5.0 x 10^9        10

More on DNA (Deoxyribonucleic acid)

•  DNA has a double helix structure; each nucleotide is composed of
    •  a sugar molecule
    •  a phosphate group
    •  and a base (A, C, G, T)

•  Double stranded with complementary strands: A-T, C-G (“base pairing”)

•  A DNA strand is always synthesized, and by convention written, from its 5’ end to its 3’ end in transcription and replication; the two strands run in opposite directions:
   5’ ATTTAGGCC 3’
   3’ TAAATCCGG 5’
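Base pairing and strand direction can be checked with a short sketch. The code below computes the complementary strand and the reverse complement (the partner strand written in its own 5’ to 3’ direction) for the example sequence on the slide. The function names are assumptions chosen for this illustration, and only the four unambiguous letters A, C, G, T are handled.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Base-by-base partner strand, in the same left-to-right order
    (so it reads 3' to 5' if the input reads 5' to 3')."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand):
    """Partner strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    top = "ATTTAGGCC"                  # 5' ATTTAGGCC 3'
    print(complement(top))             # 3' TAAATCCGG 5'
    print(reverse_complement(top))     # 5' GGCCTAAAT 3'
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on any implementation.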

http://www.bio.miami.edu/dana/104/

Prokaryotic genes

http://nitro.biosci.arizona.edu/courses/EEB600A-2003/lectures/lecture24/lecture24.html

Eukaryotic genes

splicing

Genes: some numbers
•  Regulatory regions: up to 50 kb upstream of the +1 site
•  Exons: 1 to 178 exons per gene (mean 8.8); 8 bp to 17 kb per exon (mean 145 bp)
•  Introns: 1 kb to 50 kb per intron (mean ~2,000 bp)
•  Gene size: largest 2.4 Mb (Dystrophin); mean 27 kb
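A quick back-of-the-envelope calculation shows how little of an average gene is exon sequence. The sketch below multiplies the mean exon count by the mean exon length and compares the result with the mean gene size. Multiplying two means only gives a rough estimate, so treat the result as an order-of-magnitude figure; the constant names are made up for this example.

```python
# Figures taken from the slide "Genes: some numbers".
MEAN_EXONS_PER_GENE = 8.8
MEAN_EXON_LENGTH_BP = 145
MEAN_GENE_SIZE_BP = 27_000

# Rough estimate of exon sequence per gene (product of the two means).
mean_exonic_bp = MEAN_EXONS_PER_GENE * MEAN_EXON_LENGTH_BP
exonic_fraction = mean_exonic_bp / MEAN_GENE_SIZE_BP

print(f"Exon sequence per gene: about {mean_exonic_bp:.0f} bp")
print(f"Share of the mean gene size: about {exonic_fraction:.1%}")
```

With these numbers the exons add up to roughly 1.3 kb, or about 5% of a mean-sized gene; the rest is mostly intron sequence.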


Computational problems

•  Genome assembly
•  Gene finding: finding the structure of a gene (caveat: alternative splicing)
•  Assigning function to proteins
•  Predicting interaction networks
•  Understanding gene regulation and expression
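As a toy illustration of the gene finding problem, the sketch below scans the forward strand for open reading frames (ORFs): stretches that begin with ATG and run to the first in-frame stop codon. This is a deliberately naive baseline and not the method taught in the course. It ignores the reverse strand, introns, and alternative splicing, so at best it applies to compact, intron-free genes. The function name and the `min_codons` parameter are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of forward-strand ORFs.

    Each ORF starts at an ATG and ends just after the first in-frame
    stop codon, so dna[start:end] includes both. min_codons is the
    minimum number of codons before the stop (the ATG counts).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATGA"
    for start, end in find_orfs(sequence):
        print(start, end, sequence[start:end])
```

Real gene finders must also model splice sites and score candidate structures statistically, which is where machine learning methods come in.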