Leonardo Mariño-Ramírez, PhD NCBI / NLM / NIH BIOL 7210 A – Computational Genomics 2/20/2018

El Proyecto Genoma Humano (The Human Genome Project)...More complex genome structures – (chromosomes, organelles, plasmids) Genome sequencing – NextGen sequencing. More complex genome assembly – (chromosomes,

The $1,000 genome is here!

http://www.illumina.com/systems/hiseq-x-sequencing-system.ilmn

Bioinformatics bottleneck

Bioinformatics challenges

• Methods: How do I analyze my data using procedures for various data types?

• Infrastructure: Where do I process my data? Large-scale compute accessibility; installing and maintaining software

• Standards: How do I ensure my results are useful? Common, shared formats using community-developed software and tools

High throughput sequencing map

http://omicsmaps.com/

The case for cloud computing in genome informatics

http://genomebiology.com/2010/11/5/207

The National Center for Biotechnology Information

Created in 1988 as a part of the National Library of Medicine at NIH

– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

Bethesda, MD

The NCBI microbial annotation pipeline

http://www.ncbi.nlm.nih.gov/genome/annotation_prok/process/

Bacterial Annotation

Released PGAP-2 replacement available for GenBank

Support large throughput: 1,000 assemblies/day

Current development tasks:
• Replace dataflow to create RefSeq assemblies
• Perform annotation via mapping from “close” assembly

Assembly

Protein Placement

(Core + Conserved + Universal)

Non-Coding Elements (rRNA, tRNA, ncRNA)

Mobile Elements

Resolve Conflicts GeneMark

Assign Names

Expand Evidence GeneMark II

Prepare Reports

Assembly

Non-Coding Elements (rRNA, tRNA, ncRNA)

Mobile Elements

Resolve Conflicts GeneMark

Assign Names

Expand Evidence GeneMark II

Prepare Reports

Reference Assembly

Align Assemblies

Project Protein Annotation

A lot depends on quality of the reference set!

We may optionally choose to project non-coding annotation as well
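
The projection idea on this slide can be illustrated with a minimal sketch: given aligned blocks between a reference assembly and a new assembly, a feature's coordinates are shifted from the reference onto the new assembly. The `AlignedBlock` type, the `project_interval` function, and the gapless-block simplification are all hypothetical and are not taken from the NCBI pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AlignedBlock:
    """A gapless aligned block (hypothetical; 0-based, half-open coordinates)."""
    ref_start: int   # start of the block on the reference assembly
    new_start: int   # start of the block on the new assembly
    length: int      # block length, identical on both assemblies


def project_interval(
    blocks: List[AlignedBlock], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Project a reference interval [start, end) onto the new assembly.

    Returns None when the interval is not fully contained in a single
    block, i.e. when the annotation cannot be transferred cleanly.
    """
    for block in blocks:
        block_end = block.ref_start + block.length
        if block.ref_start <= start and end <= block_end:
            offset = block.new_start - block.ref_start
            return (start + offset, end + offset)
    return None


# Made-up alignment: two blocks separated by an unaligned region.
blocks = [AlignedBlock(0, 100, 500), AlignedBlock(600, 750, 300)]
projected = project_interval(blocks, 650, 800)  # lies inside the second block
unplaced = project_interval(blocks, 450, 650)   # spans the unaligned region
```

A feature that falls outside the aligned blocks is left unplaced here, which is one way the quality of the reference set limits what can be projected.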

Pathogen Analysis

Rapid identification of species and strain

Rapid assembly and annotation of bacterial short read sequences

Rapid identification of key characteristics separating “outbreak” strains from background samples

Pathogen Analysis

https://www.ncbi.nlm.nih.gov/pathogens/

Pathogen Analysis Timeline

Pathogen Analysis 2017

Pathogen Analysis Timeline

Global Microbial Identifier (GMI)

Pathogen Analysis Pipeline

Pathogens of Interest

Food-borne illness
• Salmonella enterica
• Listeria monocytogenes
• Escherichia coli O157 (STEC)
• Campylobacter jejuni

Resistant / virulent hospital-acquired infections
• Resistant Staphylococcus aureus (MRSA)
• Resistant Klebsiella pneumoniae

Difficult to culture
• Mycobacterium tuberculosis

Targets

BioSample

K-mer classify

Assemble Annotate

Call Variations

Reclassify Place In Tree

Detect Signal

Raw Classify Commit To Assemble Assemblies Annotated

GenBank

All Trees

Reference Genomes

Combined Tree

Pathogen Analysis
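
The k-mer classification step named in this diagram can be illustrated with a toy sketch: a read is assigned to whichever reference shares the most k-mers with it. The function names, the tiny reference sequences, and the simple shared-k-mer count are assumptions for illustration only; the slide does not describe how the NCBI pipeline scores or stores k-mers.

```python
from typing import Dict, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def classify_read(read: str, references: Dict[str, str], k: int = 4) -> str:
    """Assign a read to the reference that shares the most k-mers with it."""
    read_kmers = kmers(read, k)
    scores = {
        name: len(read_kmers & kmers(seq, k))
        for name, seq in references.items()
    }
    return max(scores, key=scores.get)


# Toy sequences (made up); real references would be whole genomes.
references = {
    "species_A": "ACGTACGTTAGCTAGCTAGGATCC",
    "species_B": "TTTTGGGGCCCCAAAATTTTGGGG",
}
best = classify_read("GCTAGCTAGG", references)
```

In practice the reference k-mer sets would be precomputed once rather than rebuilt for every read, as this sketch does for brevity.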

Assembly Process

Reads

Reference Guided Assembler (ARGO)

MaSuRCA

SOAPdenovo

Newbler

Combine Assemblies

Produce and Load Reports

Pathogen Future

• Increase throughput: currently see throughput of 400 samples/day; goal is 1,000 samples/day

• Add automation tasks for signal detection

• Add automated submission and report back to submitter

• Predicting virulence and drug resistance

• Extending to more organisms

• Provide public browsing resources

NCBI BLAST in the Cloud!

Cloud Options for NCBI BLAST!

• Amazon Web Services

• Google Compute Engine

• Any cloud app with Galaxy project

Select the NCBI CloudBLAST AMI

Select a Reasonable* Instance

Launch the instance
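
Once an instance is running, a search is usually started from its command line. The sketch below assumes the BLAST+ `blastn` program is installed on the instance; the file names and the database name are hypothetical placeholders. It only builds the command, so it can be inspected before anything is run.

```python
import shlex
from typing import List


def build_blastn_command(
    query: str, db: str, out: str, threads: int = 4
) -> List[str]:
    """Build a blastn command line that writes tabular output (-outfmt 6)."""
    return [
        "blastn",
        "-query", query,        # FASTA file with the query sequences
        "-db", db,              # name of a formatted BLAST database
        "-out", out,            # file that receives the results
        "-outfmt", "6",         # tab-separated hit table
        "-num_threads", str(threads),
    ]


# Hypothetical file and database names.
command = build_blastn_command("query.fasta", "my_database", "hits.tsv")
printable = shlex.join(command)
# To execute on the instance: subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.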

How has the genome changed?

• More complex genome structures (chromosomes, organelles, plasmids)

• Genome sequencing: NextGen sequencing

• More complex genome assembly (chromosomes, scaffolds, contigs)

• Genome-scale projects (transcriptome, exome, epigenomics, proteomics)

• Multi-isolate genome sequencing (1001 Arabidopsis, 1000 human genomes)

• Metagenomes

• Now useful for drug development

New resources at NCBI

New genomic resources at NCBI

Genome

BioProject

Taxonomy

Nucleotide

Assembly

BioSample

Why do we need new databases?

BioProject, Genome, Assembly

• BioProject is an administrative object (defined by goal, target, funding, collaboration)

• Genome is a biological object defining an organism at molecular level

• Genome assembly is a complex data structure that defines the structure, relative position (scaffold), and chromosome placement of DNA sequences originating from a single sample
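
The three definitions above can be illustrated with a minimal sketch. All class and field names, and the way the objects are nested, are assumptions chosen to mirror the wording of the bullets; they do not reproduce NCBI's actual schemas, and the accessions and lengths are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlacedSequence:
    accession: str
    length: int
    chromosome: Optional[str] = None  # None for a sequence with no placement


@dataclass
class Assembly:
    """Sequences from a single sample, with their chromosome placement."""
    name: str
    sample: str
    sequences: List[PlacedSequence] = field(default_factory=list)

    def total_length(self) -> int:
        return sum(seq.length for seq in self.sequences)


@dataclass
class Genome:
    """Biological object: one organism, possibly several assemblies."""
    organism: str
    assemblies: List[Assembly] = field(default_factory=list)


@dataclass
class BioProject:
    """Administrative object: goal, target, funding, collaboration."""
    goal: str
    target: str
    funding: str
    genomes: List[Genome] = field(default_factory=list)


# Made-up example values for illustration only.
assembly = Assembly(
    name="example_assembly_v1",
    sample="example_sample_1",
    sequences=[
        PlacedSequence("EXAMPLE_0001", 4_000_000, chromosome="1"),
        PlacedSequence("EXAMPLE_0002", 50_000),
    ],
)
genome = Genome(organism="Example organism", assemblies=[assembly])
project = BioProject("genome sequencing", "single isolate", "example grant", [genome])
```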

What is a genome project?

• A genome project is a scientific endeavor that ultimately aims to determine the complete genome sequence of an organism and …

Aims to annotate protein-coding genes and other important genome-encoded features and …

Aims to understand the biology, physiology, and evolution of the organism.

Genome Project -> BioProject

Genome sequencing

Transcriptome sequencing

Targeted sequencing

Random survey

Assembly

Annotation

Epigenomics

Variant Discovery

Metagenome

Population genomics

Ecosystem genomics

Proteomics

BioProject data model

Diagram labels: Target, Material, Method, Scope, Objective, Capture

Values shown: Mono-isolate, Multi-isolate, Multi-species, Environmental; DNA, RNA, Protein; sequencing, array, proteomics
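
The labels on this slide suggest a small controlled vocabulary. The sketch below models three of them as enumerations. Which values belong to which field is an assumption inferred from the labels, since the diagram itself did not survive extraction, and `ProjectDescriptor` and `parse_descriptor` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    MONO_ISOLATE = "Mono-isolate"
    MULTI_ISOLATE = "Multi-isolate"
    MULTI_SPECIES = "Multi-species"
    ENVIRONMENTAL = "Environmental"


class Material(Enum):
    DNA = "DNA"
    RNA = "RNA"
    PROTEIN = "Protein"


class Method(Enum):
    SEQUENCING = "sequencing"
    ARRAY = "array"
    PROTEOMICS = "proteomics"


@dataclass(frozen=True)
class ProjectDescriptor:
    scope: Scope
    material: Material
    method: Method


def parse_descriptor(scope: str, material: str, method: str) -> ProjectDescriptor:
    """Build a descriptor; raises ValueError for a label outside the vocabulary."""
    return ProjectDescriptor(Scope(scope), Material(material), Method(method))


descriptor = parse_descriptor("Multi-isolate", "DNA", "sequencing")
```

Restricting each field to a fixed set of values is what lets projects be grouped and searched consistently.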

Why do we need a database of genome assemblies?

• We are in a period of extraordinary growth in genomics data.

• To get the full benefit from all this data, users must be able to integrate data from different sources. Integration only works if users know whether the different data were reported in the same coordinate system.
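
The coordinate-system point can be made concrete with a minimal sketch: two feature sets are merged only when every feature is reported on the same assembly. The `Feature` type, the `merge_features` function, and the assembly identifiers are hypothetical, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    name: str
    assembly_accession: str  # identifies the coordinate system
    start: int
    end: int


def merge_features(a: List[Feature], b: List[Feature]) -> List[Feature]:
    """Merge two feature sets only if all features share one assembly."""
    accessions = {f.assembly_accession for f in a + b}
    if len(accessions) > 1:
        raise ValueError(
            "features use different coordinate systems: "
            + ", ".join(sorted(accessions))
        )
    return sorted(a + b, key=lambda f: (f.start, f.end))


# Made-up features from two hypothetical data sources.
source_1 = [Feature("feature_2", "assembly_A", 900, 1200)]
source_2 = [Feature("feature_1", "assembly_A", 100, 400)]
merged = merge_features(source_1, source_2)
```

Without a recorded assembly identifier, the same start and end values could point at different bases in two assemblies of the same strain.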

TB H37Rv Sanger vs. Broad

Sanger assembly (NC_000962)

Broad assembly (NC_018143)

Mycobacterium genomes at NCBI

Mycobacterium tuberculosis genomes

Mycobacterium tuberculosis overview

Mycobacterium tuberculosis genome annotation

Mycobacterium tuberculosis H37Rv

Mycobacterium tuberculosis H37Rv browser

From the Gene record

Mycobacterium tuberculosis H37Rv GenePlot

BioProject, BioSample, Genome, Assembly, Nucleotide

[Diagram: links among BioProject (single isolate and multi isolate), BioSample, Genome, Assembly, Nucleotide, and SRA records]

BioProject

BioSample

Common Submission Interface

metadata

Sequence data

SRA

Contigs

Genome Collection

GenBank

NCBI genome submission dataflow