Quality control of FASTQ data

Practical session: exome detection

Outline

● Theory 1: the FastQ format and quality encoding

● Practical session 1: converting quality scores (file illumina.fastq)

● Theory 2: quality control and the FastQC tool

● Practical session 2: cleaning the data (dataset pickrel exon chr12.fastq)

FastQ

● 1 sequence = 4 lines in the file

● 1st line = sequence identifier
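The four-line layout can be sketched as a small reader. This is a minimal illustration, not the parser used by Galaxy or FastQC. The slide only names lines 1 and 4; the code assumes the usual FASTQ layout for the other two (line 2 holds the bases, line 3 is a '+' separator). The names `FastqRecord` and `read_fastq` are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@'
    sequence: str    # line 2 (assumed: the bases)
    quality: str     # line 4, one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


# Tiny in-memory example with a single made-up read.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
```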

Quality

● 4th line = quality

● Quality = a computed score

● 2 ways of computing the score exist (PHRED and SOLEXA)

● Pe: estimated probability of error
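The two score calculations did not survive on the slide. The sketch below assumes the standard definitions: PHRED is Q = -10 log10(Pe) and SOLEXA is Q = -10 log10(Pe / (1 - Pe)). The function names are illustrative.

```python
import math


def phred_score(pe: float) -> float:
    """PHRED quality for an estimated error probability pe (0 < pe <= 1)."""
    return -10 * math.log10(pe)


def solexa_score(pe: float) -> float:
    """SOLEXA quality: based on the odds pe / (1 - pe), so it can be negative."""
    return -10 * math.log10(pe / (1 - pe))
```

The two scores are close for small error probabilities and diverge as Pe grows: at Pe = 0.5 the SOLEXA score is 0 while the PHRED score is about 3.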

Encoding quality

● Scores are encoded in ASCII (e.g. '%' => 37)

● Several encodings exist: S (Sanger), X (Solexa), I (Illumina 1.3+), J (Illumina 1.5+), L (Illumina 1.8+)

● Each encoding corresponds to a score computed with the PHRED or SOLEXA formula

● 33 or 64 is added to this score

● The resulting value is converted to ASCII and written to the file

● Example for a Sanger encoding, from probability to ASCII character:

– 10% (probability of error, i.e. a 90% chance that the base call is correct)

– 10 (phred score)

– 10 + 33 = 43

– 43: '+' ASCII

● Several encodings exist, but the tools accept only one format

● FASTQ Groomer: converts quality scores
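The Sanger example above can be reproduced in a few lines. This is a sketch under the assumption that the score is the PHRED score rounded to the nearest integer; the helper names are hypothetical.

```python
import math

SANGER_OFFSET = 33  # the other offset mentioned on the slide is 64


def error_prob_to_char(pe: float, offset: int = SANGER_OFFSET) -> str:
    """Error probability -> PHRED score -> score + offset -> ASCII character."""
    score = round(-10 * math.log10(pe))
    return chr(score + offset)


def char_to_error_prob(char: str, offset: int = SANGER_OFFSET) -> float:
    """Reverse direction: ASCII character -> PHRED score -> error probability."""
    score = ord(char) - offset
    return 10 ** (-score / 10)
```

For instance, `error_prob_to_char(0.1)` follows the slide step by step: a 10% error probability gives a score of 10, 10 + 33 = 43, and ASCII 43 is '+'.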

Practical session 1

● Click “Shared Data” then “Published Histories”

● Click “TP-QC Olivier”

● Click “Import history”

● View the content of the file illumina.fastq (using the eye icon)

● Which encoding is used for these data?
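One way to answer the question by hand is to look at the lowest ASCII code found in the quality lines. The heuristic below is a sketch, not what FastQC does internally; it assumes the usual ranges (offset-33 encodings start at '!' = 33, SOLEXA+64 can go down to ';' = 59, PHRED+64 starts at '@' = 64). A file containing only high-quality reads can be misclassified.

```python
from typing import Iterable


def guess_encoding(quality_lines: Iterable[str]) -> str:
    """Guess the quality encoding from the lowest character observed."""
    codes = [ord(char) for line in quality_lines for char in line]
    if not codes:
        raise ValueError("no quality characters given")
    lowest = min(codes)
    if lowest < 59:
        return "Phred+33"   # Sanger or Illumina 1.8+
    if lowest < 64:
        return "Solexa+64"  # only SOLEXA scores go below 0 with offset 64
    return "Phred+64"       # Illumina 1.3+ or 1.5+ (ambiguous if all reads are good)
```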

Checking the encoding

● In the tool search box, type “FastQC”

● Select the tool “FastQC: Read QC”

● Select the file illumina.fastq

● In the tool search box, type “FastQ Groomer”

● Select “FASTQ Groomer convert between various FASTQ quality formats”

● For “File to groom”, select the file illumina.fastq

● Which encoding do these data use now?
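Conceptually, converting between two PHRED-based encodings only shifts the offset. The sketch below illustrates that idea for offset 64 to offset 33; it is not FASTQ Groomer's implementation, and it assumes the input scores are PHRED scores (SOLEXA scores would also need the score itself converted).

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a quality string from offset 64 to offset 33."""
    converted = []
    for char in quality:
        score = ord(char) - 64  # recover the PHRED score
        if score < 0:
            raise ValueError(f"{char!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))  # write it back with the Sanger offset
    return "".join(converted)
```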

FastQC

● “A quality control tool for high throughput sequence data”

● Quality control of sequencing data

● Several analyses are run on the data; for each analysis:

Per base sequence quality

● X = Position in read

● Y = Quality score

● Box-and-whisker plot

● Green, orange, red background: good, reasonable, poor quality

● Quality of calls will degrade as the run progresses ...

Per base sequence quality

● Warning: lower quartile for any base < 10, or median for any base < 25

● Failure: lower quartile for any base < 5, or median for any base < 20
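These two thresholds can be sketched as a check over the quality scores at each read position. The code assumes the first line is the warning level and the second the failure level, as in FastQC's documentation. The quartile method ("inclusive") is a choice made for this sketch and may differ from FastQC's own calculation.

```python
from statistics import median, quantiles
from typing import Sequence


def per_base_quality_flag(reads_scores: Sequence[Sequence[int]]) -> str:
    """reads_scores: one list of PHRED scores per read (at least two reads).

    Positions are compared column by column; zip() stops at the shortest read.
    """
    flag = "pass"
    for column in zip(*reads_scores):
        lower_quartile = quantiles(column, n=4, method="inclusive")[0]
        position_median = median(column)
        if lower_quartile < 5 or position_median < 20:
            return "fail"
        if lower_quartile < 10 or position_median < 25:
            flag = "warn"
    return flag
```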

Per Sequence Quality Scores

● X = Quality scores

● Y = Nb sequences

● See if a proportion of sequences in a run have low quality => indicates a systematic problem (one end of a flowcell, ...)

Per Sequence Quality Scores

● Warning: most frequently observed mean quality < 27 (0.2% error rate)

● Failure: most frequently observed mean quality < 20 (1% error rate)
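The error rates quoted next to the two thresholds follow from the PHRED formula: 10^(-27/10) is about 0.002 and 10^(-20/10) is 0.01. The sketch below computes that, plus the most frequent mean quality per read. It assumes offset-33 quality strings and rounds each read's mean to an integer, which is a simplification chosen here.

```python
from collections import Counter
from typing import Iterable


def error_rate(score: float) -> float:
    """PHRED score -> probability of error."""
    return 10 ** (-score / 10)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Mean PHRED score of one read."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def most_frequent_mean_quality(quality_lines: Iterable[str], offset: int = 33) -> int:
    """Peak of the per-sequence quality distribution (means rounded to integers)."""
    counts = Counter(round(mean_quality(q, offset)) for q in quality_lines)
    return counts.most_common(1)[0][0]
```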

Per Base Sequence Content

● X = position in read

● Y = Sequence content (%T, %C, %A, %G)

● In a random library: little to no difference between the different bases of a sequence run

● Detect overrepresented sequences (contamination)

Per Base Sequence Content

● Warning: difference between A and T, or G and C, > 10% at any position

● Failure: difference between A and T, or G and C, > 20% at any position
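A minimal sketch of this check, assuming equal-length reads and taking the first threshold as the warning level and the second as the failure level. Percentages are computed over all reads at a position, including any N.

```python
from collections import Counter
from typing import Sequence


def base_content_flag(sequences: Sequence[str]) -> str:
    """Compare %A with %T and %G with %C at every read position."""
    flag = "pass"
    for column in zip(*sequences):
        counts = Counter(column)
        percent = {base: 100 * counts[base] / len(column) for base in "ACGT"}
        gap = max(abs(percent["A"] - percent["T"]),
                  abs(percent["G"] - percent["C"]))
        if gap > 20:
            return "fail"
        if gap > 10:
            flag = "warn"
    return flag
```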

Per Base GC Content

● X = position in read

● Y = Sequence content (%GC)

● In a random library: little to no difference between the different bases of a sequence run

● Detect overrepresented sequences (contamination)

Per Base GC Content

● Warning: GC content of any base deviates by more than 5% from the mean GC content

● Failure: GC content of any base deviates by more than 10% from the mean GC content
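The same idea applied to GC content, as a sketch. It assumes equal-length reads and takes the "mean GC content" to be the average of the per-position values, which is an interpretation made for this example.

```python
from typing import List, Sequence


def gc_per_position(sequences: Sequence[str]) -> List[float]:
    """Percentage of G or C at each read position."""
    return [100 * sum(base in "GC" for base in column) / len(column)
            for column in zip(*sequences)]


def per_base_gc_flag(sequences: Sequence[str]) -> str:
    """Flag positions whose %GC strays too far from the mean %GC."""
    per_position = gc_per_position(sequences)
    mean_gc = sum(per_position) / len(per_position)
    worst = max(abs(value - mean_gc) for value in per_position)
    if worst > 10:
        return "fail"
    if worst > 5:
        return "warn"
    return "pass"
```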

Per Sequence GC Content

● X = mean GC content

● Y = nb sequence

● Compute a normal distribution (blue)

● Plot raw data (red)

● An unusually shaped distribution could indicate a contaminated library or some other kind of biased subset

Per Sequence GC Content

● Warning: the sum of the deviations from the normal distribution > 15% of the reads

● Failure: the sum of the deviations from the normal distribution > 30% of the reads

Per Base N Content

● X = position in read

● Y = N content

● It's not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence.

● However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.

Per Base N Content

● Warning: any position shows an N content of >5%

● Failure: any position shows an N content of >20%
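A short sketch of the N-content check, assuming equal-length reads and the warning and failure levels listed above.

```python
from typing import Sequence


def n_content_flag(sequences: Sequence[str]) -> str:
    """Highest percentage of N at any read position decides the flag."""
    worst = max(100 * column.count("N") / len(column)
                for column in zip(*sequences))
    if worst > 20:
        return "fail"
    if worst > 5:
        return "warn"
    return "pass"
```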

Sequence Length Distribution

● X = sequence length

● Y = nb sequence

● Detect sequences trimmed by the pipelines (to remove poor quality)

● Warning: not all sequences have the same length

● Failure: any of the sequences has zero length
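The length check is simple enough to sketch directly. The function name is illustrative.

```python
from collections import Counter
from typing import Sequence


def length_flag(sequences: Sequence[str]) -> str:
    """Build the length distribution and apply the two rules."""
    lengths = Counter(len(sequence) for sequence in sequences)
    if 0 in lengths:
        return "fail"  # at least one empty read
    if len(lengths) > 1:
        return "warn"  # reads of different lengths, e.g. after trimming
    return "pass"
```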

Sequence duplication level

● X = sequence duplication level

● Y = proportion of non-unique vs. unique sequences

● A low level of duplication may indicate a very high level of coverage

● A high level of duplication indicates some kind of enrichment bias

Sequence duplication level

● Warning: non-unique sequences make up more than 20% of the total

● Failure: non-unique sequences make up more than 50% of the total
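A simplified sketch of the duplication check. It counts a read as non-unique when its exact sequence occurs more than once in the input; FastQC's own estimate works differently in its details, so this only illustrates the principle.

```python
from collections import Counter
from typing import Sequence


def duplication_flag(sequences: Sequence[str]) -> str:
    """Percentage of reads whose exact sequence appears at least twice."""
    counts = Counter(sequences)
    non_unique = sum(n for n in counts.values() if n > 1)
    percent = 100 * non_unique / len(sequences)
    if percent > 50:
        return "fail"
    if percent > 20:
        return "warn"
    return "pass"
```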

Overrepresented Sequences

● lists all of the sequences which make up more than 0.1% of the total

● looks for matches in a database of common contaminants

Overrepresented Sequences

● Warning: any sequence is found to represent more than 0.1% of the total

● Failure: any sequence is found to represent more than 1% of the total
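Listing overrepresented sequences amounts to counting exact sequences and keeping those above a percentage threshold. The sketch below does only that; the contaminant database lookup is not reproduced. The read names in the example are placeholders, not real data.

```python
from collections import Counter
from typing import Dict, Sequence


def overrepresented(sequences: Sequence[str], threshold_pct: float = 0.1) -> Dict[str, float]:
    """Return each sequence above the threshold with its percentage of the total."""
    total = len(sequences)
    return {sequence: 100 * n / total
            for sequence, n in Counter(sequences).items()
            if 100 * n / total > threshold_pct}


# Placeholder input: one repeated string among 1980 distinct ones.
demo_reads = ["ADAPTER"] * 20 + [f"seq{i}" for i in range(1980)]
demo_hits = overrepresented(demo_reads)
```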

Overrepresented Kmers

● K-mers? (5-mers)

● Long sequences with poor quality: sequencing errors reduce the counts for exactly duplicated sequences.

● a partial sequence which is appearing at a variety of places (won't be seen by per base content plot or the duplicate sequence analysis).

● a graph for the top 6 hits: enrichment of that Kmer across the length of your reads.

● This will show if you have a general enrichment, or if there is a pattern of bias at different points over your read length.

● based on the base content of the library: calculates an expected level at which this k-mer should have been seen

● uses the actual count to calculate an observed/expected ratio for that k-mer

● Warning: any k-mer is enriched more than 3 fold overall, or more than 5 fold at any individual position

● Failure: any k-mer is enriched more than 10 fold at any individual base position
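The observed/expected ratio can be sketched as follows. The expected count is taken here as the product of the library-wide base frequencies multiplied by the number of k-mer positions; this is one plausible reading of "based on the base content of the library" and only covers the overall ratio, not the per-position one.

```python
from collections import Counter
from math import prod
from typing import Dict, Sequence


def kmer_enrichment(sequences: Sequence[str], k: int = 5) -> Dict[str, float]:
    """Observed/expected ratio for every k-mer seen in the reads."""
    base_counts = Counter(base for sequence in sequences for base in sequence)
    total_bases = sum(base_counts.values())
    frequency = {base: n / total_bases for base, n in base_counts.items()}

    kmer_counts = Counter(sequence[i:i + k]
                          for sequence in sequences
                          for i in range(len(sequence) - k + 1))
    total_kmers = sum(kmer_counts.values())

    return {kmer: observed / (total_kmers * prod(frequency[base] for base in kmer))
            for kmer, observed in kmer_counts.items()}
```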

Practical session 2

● Caution! “chr12 exon”: processing times are substantial

● The results are available in the history “groomer + fastQC on chr12”

● Click “Shared Data” then “Published Histories”

● Click “groomer + fastQC on chr12”

● Click “Import history”

● Which tool was used to obtain dataset no. 2?

● View the results

Dataset

• Public data: exome sequenced by the International HapMap Project

• Single-end reads of 100bp, Illumina Genome Analyzer IIx

• RNA-seq data of this exome available (Pickrell et al., Nature, 2010)

● Starting from dataset 9, look at which tool was used

● What does a window size of 1 mean?

● Why was the quality value 28 chosen?

● Identify “trimmed” reads

● Which quality values were removed?

FastQ Quality Trimmer

Simple trimming of the ends

● ATCCTTTATAAATAATTAATA => Min qual <= 28 ?

● ATCCTTTATAAATAATTAAT => Min qual <= 28 ?

● ATCCTTTATAAATAATTAA => Min qual <= 28 ?

● ... => Min qual <= 28 ?

Quality scores after trimming
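The end-trimming loop illustrated above can be sketched as follows. This is not the Galaxy tool's code: it assumes offset-33 qualities, trims from the 3' end only, and uses the slide's test (remove the last base while the minimum quality in the window is <= 28). With a window size of 1, the window is just the last base.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str, threshold: int = 28,
                window: int = 1, offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the end until the window's minimum quality exceeds threshold."""
    end = len(sequence)
    while end > 0:
        start = max(0, end - window)
        window_scores = [ord(char) - offset for char in quality[start:end]]
        if min(window_scores) > threshold:
            break    # the end of the read is now good enough
        end -= 1     # remove one base and test again
    return sequence[:end], quality[:end]
```

For example, `trim_3prime("ACGTAC", "IIII+5")` removes the last two bases (scores 20 and 10) and stops at the first base with score 40.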
