Searching Biological Sequence Databases

Using BLAST to Search Sequence Databases

Cédric Notredame

Outline

-Evolution and Sequence Similarity

-The inside of BLAST

-Using BLAST

-Adapting BLAST to your needs

-Searching Protein Domains with BLAST

-Digging Genomes

Two Minutes of the Evolutionary Clock…

An Alignment is a STORY

[Figure: an ancestral sequence evolves, through Mutations + Selection, into two present-day sequences]

Ancestor:     ADKPKRPLSAYMLWLN
Descendant 1: ADKPKRPKPRLSAYMLWLN
Descendant 2: ADKPRRPLSYMLWLN

Aligning the two descendants tells the story (Mutation, Insertion, Deletion):

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

How Do Sequences Evolve ?

In a structure, each Amino Acid plays a Special Role

[Figure: OmpR, C-terminal domain]

In the core, SIZE MATTERS

On the surface, CHARGE MATTERS

Why Does It Make Sense To Align Sequences ?

Same Sequence

Same Function

Same 3D Fold

Same Origin

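How Can We Compare Sequences ? The Twilight Zone

[Figure: % sequence identity (up to 100) plotted against alignment length. Above about 30% identity: similar sequence, similar structure (same 3D fold). Below about 30%: the Twilight Zone, where the sequences are different and the structure is uncertain.]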

Different molecular clocks for different proteins--another prediction

A few Basic Definitions

Query: Your sequence

Subject: A sequence from the database against which you search

Heuristic: Algorithm that does not guarantee the optimal solution

Other Important Definitions

Identity: Proportion of IDENTICAL residues between two sequences. Depends on the alignment. Unit: % identity

Homology: Sequences that are SIMILAR enough are sometimes HOMOLOGOUS. HOMOLOGY = COMMON ANCESTOR. Unit: Yes or No! DIFFERENT sequences can also be homologous.

Similarity: Proportion of SIMILAR residues. Two residues are similar if their substitution score is higher than 0. Depends on the matrix. Unit: % similarity
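As a rough illustration of how identity and similarity are read off an alignment, here is a minimal Python sketch. It uses the pairwise alignment from the "An Alignment is a STORY" slide. The similarity groups are a toy stand-in (an assumption, not a real substitution matrix such as BLOSUM62), and counting only the columns without gaps is one convention among several.

```python
# Toy groups of residues treated as "similar" (hypothetical; a real search
# would call two residues similar when their matrix score is above 0).
SIMILAR_GROUPS = [set("KRH"), set("DE"), set("ILVM"), set("FYW"), set("ST"), set("NQ")]


def is_similar(a: str, b: str) -> bool:
    """True if a and b are identical or share a toy similarity group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(row1: str, row2: str) -> tuple[float, float]:
    """Return (% identity, % similarity) over the aligned columns without gaps."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    columns = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    identical = sum(a == b for a, b in columns)
    similar = sum(is_similar(a, b) for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


ident, simil = identity_and_similarity("ADKPRRP---LS-YMLWLN", "ADKPKRPKPRLSAYMLWLN")
print(f"identity = {ident:.1f}%, similarity = {simil:.1f}%")
```

Both numbers change if the alignment or the similarity rule changes, which is why identity "depends on the alignment" and similarity "depends on the matrix".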

More Important Definitions

Hit: A sequence that matches your sequence and is reported by BLAST.

E-Value: Expectation value. How many times would you expect to find a hit by chance only?

Depends on the alignment. Depends on the matrix. Depends on the database. Sensitive to low complexity regions.

Unit: must be lower than 0.0001 to mean something

A Good Hit Is Something You Would Not Expect by Chance

What is BLAST ?

BLAST: Basic Local Alignment Search Tool

BLAST is a program designed for RAPIDLY comparing your sequence with every sequence in a database and REPORTING the most SIMILAR sequences.

Database Search

1-Query

2-Comparison Engine: LOCAL Alignment

3-Database

4-Statistical Evaluation (E-Value)

PROBLEM: LOCAL ALIGNMENT (SW) TOO SLOW
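To see why exact local alignment is slow, here is a minimal sketch of the Smith and Waterman score computation. The scoring values (match +2, mismatch -1, linear gap -2) are illustrative assumptions; real searches use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ADKPKRPLSAYMLWLN", "ADKPRRPLSYMLWLN"))
```

Every cell of a len(a) x len(b) table is filled for each database sequence, so the cost grows with the query length times the total database size. This is the work that a heuristic tries to avoid.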

[Figure: Database Search (labels: Q, SW, BLAST; per-sequence scores and E-values)]

BLAST is a Heuristic Smith and Waterman

BLAST = 3 STEPS

1-Decide who will be compared

This is where BLAST SAVES TIME

This is where it LOSES HITS

Most BLAST parameters refer to this step

2-Check the most promising Hits

3-Compute the E-value of the most interesting Hits

BLAST: A Bit of History

Heuristic Algorithms

Smith and Waterman
• Exact Local Dynamic Programming, 1981

FASTA
• Lipman and Pearson, 1985
• Looks for similar words (k-tup) on the same diagonal.
• Compares the sequences one by one…

BLAST
• Altschul et al., 1990
• The most widely cited tool in Biology
• www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html

The Inside of BLAST

Inside BLAST

Step 1: finding the worthy words

[Figure: each 3AA word of the query (e.g. REL, RSL) is compared with the list of all the 3AA words that can be found in the database (AAA, AAC, AAD, ..., YYY). Words with a score > T (e.g. ACT, RSL, TVF) are kept; words with a score < T (e.g. LKP) are discarded.]
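Step 1 can be sketched in a few lines of Python. The scoring function and the threshold value below are toy assumptions chosen for illustration; BLAST itself scores words with a substitution matrix such as BLOSUM62.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical toy scoring groups, standing in for a real substitution matrix.
GROUPS = [set("KRH"), set("DE"), set("ILVM"), set("FYW"), set("ST"), set("NQ")]


def toy_score(a: str, b: str) -> int:
    """Toy residue score: identical 5, same group 2, otherwise -2."""
    if a == b:
        return 5
    if any(a in g and b in g for g in GROUPS):
        return 2
    return -2


def word_score(w1: str, w2: str) -> int:
    """Score of two equal-length words compared position by position."""
    return sum(toy_score(a, b) for a, b in zip(w1, w2))


def neighbourhood(query: str, k: int = 3, threshold: int = 11) -> dict[int, list[str]]:
    """For each query position, list every k-letter word scoring > threshold."""
    all_words = ["".join(letters) for letters in product(AMINO_ACIDS, repeat=k)]
    result = {}
    for pos in range(len(query) - k + 1):
        qword = query[pos:pos + k]
        result[pos] = [w for w in all_words if word_score(qword, w) > threshold]
    return result


words = neighbourhood("RSL")
print(words[0])
```

Raising the threshold shrinks the word list, which makes the search faster but lets more true hits slip through. This is the trade-off behind "this is where BLAST saves time" and "this is where it loses hits".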

Inside BLAST

Step 2: Eliminate the database sequences that do not contain any interesting word

[Figure: the list of « interesting » words (score > T: ACT, RSL, TVF, ...) is looked up in the sequences within the database; only the sequences containing interesting words (Hits) are kept.]
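A minimal sketch of Step 2, assuming the database fits in a Python dictionary. The sequence names and contents are made up for the example, and the word list reuses ACT, RSL and TVF from the slide.

```python
def build_word_index(database: dict[str, str], k: int = 3) -> dict[str, set[str]]:
    """Map every k-letter word to the names of the sequences containing it."""
    index: dict[str, set[str]] = {}
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def sequences_with_words(index: dict[str, set[str]], interesting_words: list[str]) -> set[str]:
    """Names of the sequences that contain at least one interesting word."""
    hits: set[str] = set()
    for word in interesting_words:
        hits |= index.get(word, set())
    return hits


# Hypothetical three-sequence database.
database = {"seq1": "MKACTGG", "seq2": "PPRSLTVF", "seq3": "GGGGGGG"}
index = build_word_index(database)
kept = sequences_with_words(index, ["ACT", "RSL", "TVF"])
print(sorted(kept))
```

The index is built once per database, so each query only pays for a handful of dictionary lookups instead of a full comparison with every sequence.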

Inside BLAST: the end

Step 3: Extension of the Hits

• 2 "Hits" on the same diagonal, distant by less than X

[Figure: dot plot of the Query against a Database sequence, showing two hits on the same diagonal separated by less than X]

Extension by limited Dynamic Programming
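The two-hit rule can be sketched as follows. A diagonal is identified by the difference between the database position and the query position. This sketch only matches exact words, ignores overlapping hits, and uses made-up sequences; it shows the idea rather than BLAST's actual implementation.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """(query_pos, subject_pos) pairs where the same k-letter word occurs."""
    positions = defaultdict(list)
    for j in range(len(subject) - k + 1):
        positions[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in positions.get(query[i:i + k], [])]


def two_hit_seeds(hits: list[tuple[int, int]], max_distance: int, k: int = 3) -> list[tuple[int, int, int]]:
    """(diagonal, first_hit, second_hit) for hit pairs closer than max_distance."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)
    seeds = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        anchor = starts[0]
        for start in starts[1:]:
            if start - anchor < k:          # overlaps the anchor hit: ignore it
                continue
            if start - anchor < max_distance:
                seeds.append((diagonal, anchor, start))
            anchor = start
    return seeds


hits = word_hits("ACTGGGRSL", "MMACTPPPRSLMM")
print(two_hit_seeds(hits, max_distance=10))
```

Only the seeds returned here would go on to the extension stage, where limited dynamic programming is run around the diagonal.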

The Statistics in BLAST

BLAST Statistics: Raw Score

Evaluation of the score

•Raw Score: sum of the substitution scores and gap penalties. Not very informative.

BLAST Statistics: P-Values

Derived Statistics

•p-value: Probability of finding an alignment with such a score, by chance. The lower, the better.

Just as the sum of a large number of independent identically distributed (i.i.d.) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution.

[Figure: normal distribution vs. extreme value distribution (Gumbel)]

P-Value: Probability that a random alignment obtains a score greater than or equal to X

K must be calibrated with the database composition

Lambda is calibrated with the matrix being used
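The formula itself did not survive on this slide. As an assumption about what was shown, the standard Karlin-Altschul form that uses K and Lambda is given below, with m the query length and n the database length (symbols not named in the original):

```latex
E = K \, m \, n \, e^{-\lambda x},
\qquad
P(S \ge x) = 1 - e^{-E}
```

Here E is the expected number of chance alignments scoring at least x, and P follows from treating the number of such alignments as Poisson distributed.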

BLAST Statistics: E-Values

Derived Statistics

•E-value: Number of alignments expected by chance. The lower, the better: <0.00001

For values lower than 0.0001, E-Value ~ P-Value

The E-Values are easier to compare than P-Values
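The claim that the E-value and the P-value coincide for small values can be checked numerically. The sketch below assumes the usual Poisson relation P = 1 - exp(-E), which is standard for BLAST statistics but is not spelled out on the slide.

```python
import math


def p_from_e(e_value: float) -> float:
    """P-value implied by an E-value under a Poisson model: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


for e in (10.0, 1.0, 0.1, 1e-4, 1e-10):
    print(f"E = {e:g}  ->  P = {p_from_e(e):.6g}")
```

For large E the P-value saturates near 1 and stops discriminating between hits, which is one reason E-values are easier to compare.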

BLAST Statistics: Bit-Score

•Bit Score: evaluates the amount of information in the alignment. Makes it possible to compare alignments.
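A minimal sketch of how a raw score is normalised into a bit score, using the standard relations S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'). The values of lambda and K below, like the query and database lengths, are illustrative assumptions; in practice they depend on the matrix and gap penalties in use.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalise a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters (assumed, not taken from the slides).
LAMBDA, K = 0.267, 0.041
bits = bit_score(100, LAMBDA, K)
print(f"bit score = {bits:.1f}")
print(f"E-value   = {e_value_from_bits(bits, 300, 1_000_000):.3g}")
```

Because lambda and K are folded into the bit score, two alignments scored with different matrices can be compared directly, which is not true of raw scores.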

BLAST Statistics: Booby Trap!

The E-Value depends on N, the Database size.

If N increases, some Hits can be lost

P31383 Vs YEAST

P31383 Vs UniProt
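The booby trap can be made concrete with a small calculation. Since the E-value is proportional to the database size N, the same alignment score gets a larger E-value in a larger database. The database sizes and the starting E-value below are hypothetical; the reporting cutoff of 10 is the usual BLAST default.

```python
def rescale_e_value(e_value: float, old_db_size: float, new_db_size: float) -> float:
    """E-value of the same alignment score in a database of a different size."""
    return e_value * new_db_size / old_db_size


def still_reported(e_value: float, cutoff: float = 10.0) -> bool:
    """True if the hit passes the E-value reporting cutoff."""
    return e_value <= cutoff


# Hypothetical sizes: a single proteome versus a database 1000 times larger.
small_db, large_db = 3e6, 3e9
e_small = 0.05
e_large = rescale_e_value(e_small, small_db, large_db)
print(e_small, still_reported(e_small))
print(e_large, still_reported(e_large))
```

A weak but genuine hit found against a single proteome can therefore drop out of the hit list when the same query is run against a much larger database.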

The Many Flavors of BLAST

Database Against Database: « Farm-Blast »

Ideal for finding Orthologues

Genome 1

Genome 2

The Classics: 1 Sequence Vs A sequence Db

Program    Query                      Database
blastp     protein                    protein
blastn     nucleotide                 nucleotide
tblastn    protein                    nucleotide (translated into protein)
blastx     nucleotide (translated)    protein
tblastx    nucleotide (translated)    nucleotide (translated into protein)

The Many Flavors of BLAST

Program       Query      Database
Psi-blast     protein    protein
RPS-blast     protein    Domain
DART-blast    protein    protein
mega-blast    DNA        Large DNA
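The classic programs differ only in the type of the query, the type of the database, and whether a translation happens. A small lookup makes this explicit. The function name and its `translate_both` flag are hypothetical helpers for illustration, not part of any BLAST distribution.

```python
# (query type, database type) -> classic BLAST program.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "nucleotide"): "tblastn",   # database translated into protein
    ("nucleotide", "protein"): "blastx",    # query translated into protein
}


def choose_program(query_type: str, db_type: str, translate_both: bool = False) -> str:
    """Pick the classic BLAST program for a query/database type pair."""
    if translate_both:
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return PROGRAMS[(query_type, db_type)]


print(choose_program("protein", "nucleotide"))
print(choose_program("nucleotide", "nucleotide", translate_both=True))
```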

If your Sequence is a Protein

If your Sequence is made of DNA

BLASTing with DNA: Asking the right question.

Keeping an Eye on the Public Servers.

Using BLAST: The Basic Way

Database Search Result = Prediction

Protein X IS or IS NOT homologous to the QUERY.

Submitting your Query

Understanding the BLAST Output

Graphic Display

Hit List

Alignments

Understanding the Graphic Display

Understanding the Hit List

Understanding the Alignments

Low Complexity Regions

Regions with a single residue repeated many times (like the AFGP) can produce meaningless alignments.

The statistics expect ALL the regions to look the same « on average ».

By default, BLAST replaces these regions with Xs
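Masking can be illustrated with a simplified filter: slide a window along the sequence and replace it with Xs when its composition is too uniform. This is a stand-in based on Shannon entropy, not the filter BLAST actually runs, and the window size and entropy cutoff are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2) -> str:
    """Replace every window whose entropy is below min_entropy with Xs."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical sequence: a diverse stretch followed by a single-residue repeat.
print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY" + "A" * 20))
```

Masked positions cannot seed words, so the repeat no longer produces meaningless alignments with every other repeat-rich sequence in the database.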

Reproducing The Experiment

Everything you need to know to reproduce your search is at the bottom.

BLAST searches are notoriously difficult to reproduce

Database Searches: A few Guidelines

DataBase Search According to Pearson

Using Weak Matches To Identify Domains

RNA Recognition Motif

Three Short-Sighted Witnesses are more Informative than a single eagle-eyed witness

Using BLAST: Troubleshooting

[Figure: hits covering Domain 1 and hits covering Domain 2, with No Overlap between them]

Advanced Blast on the EMBnet

www.ch.embnet.org/software/aBLAST.html

• More choice on the databases
• Change all the parameters

Adapting BLAST To your Problem

Domain-Flavored BLAST

PSI-BLAST: BLAST's latest Flavor

-Position Specific Iterated Version of BLAST.

-Uses Profiles.

-More Sensitive.

Psi-BLAST Iteration

[Figure: three successive Psi-BLAST iterations; the aligned hits share conserved C residues, with an occasional S in place of a C]

BLAST PSSM or weight matrix

      M   Y   C   E   Q   U   E   N   C   E   S  ..
A     0   2  -1   0   0   0   0  -1   0  -1   3
S    -1  -1  -1   0  -1   0   0   0   5  -1  -1
C    -1  -1  10   1  -1   0   0   5   5   4  -1
..
Y    -1   6  -1  -1  -1   0  -1  -1  -1  -1  -1
V    -1   1  -1  -1  -1   0  -1  -1  -1   1  -1
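Scoring a sequence against a PSSM means reading one value per position, from the row of the residue found there. The sketch below copies the five rows visible on the slide (A, S, C, Y, V), so it can only score sequences made of those residues. The test sequence is made up.

```python
# Rows copied from the slide's partial matrix; one column per query position.
PSSM = {
    "A": [0, 2, -1, 0, 0, 0, 0, -1, 0, -1, 3],
    "S": [-1, -1, -1, 0, -1, 0, 0, 0, 5, -1, -1],
    "C": [-1, -1, 10, 1, -1, 0, 0, 5, 5, 4, -1],
    "Y": [-1, 6, -1, -1, -1, 0, -1, -1, -1, -1, -1],
    "V": [-1, 1, -1, -1, -1, 0, -1, -1, -1, 1, -1],
}


def pssm_score(sequence: str, pssm: dict[str, list[int]]) -> int:
    """Sum the position-specific scores of an ungapped sequence."""
    width = len(next(iter(pssm.values())))
    if len(sequence) != width:
        raise ValueError(f"sequence must have exactly {width} residues")
    return sum(pssm[residue][pos] for pos, residue in enumerate(sequence))


print(pssm_score("AYCAAAACCCA", PSSM))
```

Unlike a substitution matrix, the same residue scores differently at different positions: a C is worth 10 in the third column but -1 in the first.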

Asking a Question With Psi-BLAST

Is the Leghemoglobin related to the Human Hemoglobin ?

Which Domain Organisation For Your Protein: (Reverse PSI-BLAST)

Asking a Question With RPS-BLAST

PSI-BLAST: Discovering Domains

RPS-BLAST: Which KNOWN Domain in my protein ?

[Figure: a Sequence searched against a Domain Database]

False Hits caused by the domain low complexity (see E-values)

RPS-BLAST: Filtering Or Not Filtering Low Complexity

How Many Proteins Have the same Domain Structure as Mine ? (CDART)

Asking a Question With CDART

CDART: Conserved Domain Architecture Retrieval Tool

Finds the proteins that contain the same domains as your protein.

PSI-BLAST: Discovering Domains

RPS-BLAST: Which known Domain in my protein ?

CDART:

-Which domains are COMMONLY ASSOCIATED with the domain I am interested in ?

-Which proteins have the SAME DOMAIN ORGANIZATION as my protein ?

Filtering:

-By Domain

-By Species

-I want to find all the insect proteins containing a Jun/Fos organisation.

Genome-Flavored BLAST

Standard Blastn with long word size

MegaBLAST=Longer Words

Faster BUT Less sensitive
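The effect of the word size can be seen by counting how many exact query words are found in a slightly diverged subject. The two DNA sequences below are made up: the subject is the query with two substitutions. The word sizes 7 and 11 are chosen only to make the contrast visible at this scale.

```python
def shared_words(query: str, subject: str, word_size: int) -> int:
    """Number of query words of the given size found exactly in the subject."""
    subject_words = {subject[i:i + word_size]
                     for i in range(len(subject) - word_size + 1)}
    return sum(query[i:i + word_size] in subject_words
               for i in range(len(query) - word_size + 1))


# Hypothetical pair: the subject differs from the query at two positions.
query = "ACGTTGCAAGCTTAGGCTAACCGT"
subject = "ACGTTGCCAGCTTAGTCTAACCGT"

for size in (7, 11):
    print(f"word size {size}: {shared_words(query, subject, size)} seed(s)")
```

With the longer word no seed survives the two substitutions, so the match is never examined. Fewer seeds mean less work, hence faster but less sensitive.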

The NcBi BlAsT GEnoMe SecTion is MesSy

Makes it possible to select predicted proteomes

Venter-BLAST

When it comes to BLASTing Eukaryotic Genomes:

WWW.ENSEMBL.ORG

Asking a Question With ENSEMBL-BLAST

ENSEMBL:

WHERE are the genes coding for Homologues of my protein located ?

CONCLUSION

-BLAST is a fast approximation of full local dynamic programming. It is convenient for scanning databases.

-BLAST computes the statistical significance of the alignments (E-Value, P-Value).

Searching Databases

-The main pitfall to avoid is low complexity regions

-USE Psi-Blast to find remote homologues

-USE blastp, the best educated BLAST, to discover the function of your protein

-USE RPS-Blast to find domains in your protein (Interpro for EBI)

-USE ENSEMBL-Blast for the human Genome

A few Extra Resources

Tuning BLAST
