Quality control of FASTQ data
Practical session: exome detection
Outline
● Theory 1: the FastQ format and quality encoding
● Practical session 1: converting quality scores (file illumina.fastq)
● Theory 2: quality control and the FastQC tool
● Practical session 2: cleaning the data (dataset pickrel exon chr12.fastq)
FastQ
● 1 sequence = 4 lines in the file
● Line 1 = sequence identifier
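To make the 4-lines-per-sequence layout concrete, here is a minimal sketch of a reader. It assumes the common convention that line 1 starts with '@', line 2 holds the bases, line 3 starts with '+', and line 4 holds one quality character per base. The function name `read_fastq` and the two example reads are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (a blank line also stops this simple reader)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality

# Two made-up reads, used only to illustrate the layout.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!#I\n")
records = list(read_fastq(example))
```

Each tuple in `records` corresponds to one block of four lines in the file.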
Quality
● Line 4 = quality
● Quality = a computed score
● Two score calculations exist (Phred and Solexa)
● Pe: estimated probability of error
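The two score calculations are not spelled out on this slide. The sketch below assumes the standard definitions: Phred Q = -10 log10(Pe) and Solexa Q = -10 log10(Pe / (1 - Pe)). The function names are illustrative only.

```python
import math

def phred_score(pe):
    """Phred quality for an estimated error probability pe (0 < pe <= 1)."""
    return -10 * math.log10(pe)

def solexa_score(pe):
    """Solexa quality for an estimated error probability pe (0 < pe < 1)."""
    return -10 * math.log10(pe / (1 - pe))

# The two scales diverge for poor calls and converge for good ones.
for pe in (0.5, 0.1, 0.001):
    print(pe, round(phred_score(pe), 2), round(solexa_score(pe), 2))
```

With these definitions the two scores are nearly identical for small Pe, and the Solexa score can be negative when Pe exceeds 0.5.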
Encoding the quality
● Scores are encoded as ASCII characters (e.g. '%' => 37)
● Several encodings exist: S, X, I, J, L (Sanger, Solexa, Illumina 1.3+, Illumina 1.5+, Illumina 1.8+)
● Each encoding corresponds to a score computed with either the PHRED or the SOLEXA formula
● An offset of 33 or 64 is added to this score
● The resulting value is converted to an ASCII character and written to the file
● Example for the Sanger encoding, from probability to ASCII character:
– 90% (probability that the base call is correct, i.e. Pe = 10%)
– 10 (Phred score)
– 10 + 33 = 43
– 43: '+' in ASCII
● Several encodings exist, but the tools accept only a single format
● FASTQ Groomer: converts the qualities
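The Sanger example above can be reproduced in a few lines. This is a sketch under two assumptions: the score is a Phred score rounded to an integer, and conversion between encodings only shifts the offset (valid between Phred-based encodings, not for Solexa-scaled scores). It illustrates the idea and is not how FASTQ Groomer is implemented; all names are hypothetical.

```python
import math

SANGER_OFFSET = 33       # Phred+33
ILLUMINA_13_OFFSET = 64  # Phred+64

def error_probability_to_char(pe, offset=SANGER_OFFSET):
    """Turn an error probability into the ASCII character written in the file."""
    score = round(-10 * math.log10(pe))  # Phred score
    return chr(score + offset)

def convert_quality(quality, from_offset, to_offset):
    """Re-encode a quality string by shifting the ASCII offset."""
    return "".join(chr(ord(c) - from_offset + to_offset) for c in quality)

sanger_char = error_probability_to_char(0.1)  # Pe = 10% -> score 10 -> ASCII 43
converted = convert_quality("hhhh", ILLUMINA_13_OFFSET, SANGER_OFFSET)
```

Here 'h' (ASCII 104) is a score of 40 in Phred+64, which becomes 'I' (ASCII 73) in Phred+33.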
Practical session 1
● Click “shared data”, then “publish histories”
● Click “TP-QC Olivier”
● Click “Import history”
● View the contents of the file illumina.fastq (using the eye icon)
● Which encoding is used for these data?
Checking the encoding
● In the tool search box, type “FastQC”
● Select the tool “FastQC: Read QC”
● Select the file illumina.fastq
● In the tool search box, type “FastQ Groomer”
● Select “FASTQ Groomer convert between various FASTQ quality formats”
● For “File to groom”, select the file illumina.fastq
● Which encoding is used for these data now?
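One way to answer the encoding question by hand is to look at the lowest ASCII code among the quality characters. The heuristic below is a sketch, not the method used by FastQC or FASTQ Groomer. It assumes that codes below 59 can only occur with an offset of 33, because the lowest Solexa+64 character is 59. A file containing only high-quality Phred+33 reads would be misclassified, so treat the result as a hint.

```python
def guess_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen."""
    lowest = min(ord(c) for line in quality_lines for c in line)
    if lowest < 59:
        return 33  # characters this low are impossible with an offset of 64
    return 64      # probably offset 64; ambiguous if all qualities are high

# Made-up quality strings for illustration.
offset_a = guess_offset(["IIII!", "III#5"])
offset_b = guess_offset(["hhhh", "ffgh"])
```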
FastQC
● “A quality control tool for high throughput sequence data”
● Quality control of the sequenced data
● Several analyses are run on the data; each analysis is flagged as pass, warning or failure
Per base sequence quality
● X = position in read
● Y = quality score
● Box-and-whisker plot
● Green, orange, red
● Quality of calls will degrade as the run progresses ...
Per base sequence quality
● Warning: lower quartile for any base < 10, or median for any base < 25
● Failure: lower quartile for any base < 5, or median for any base < 20
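The two thresholds above can be checked position by position. The sketch below assumes Phred+33 qualities and uses Python's `statistics.quantiles` with the "inclusive" method for the lower quartile; FastQC's own percentile calculation may differ slightly. The function name is hypothetical.

```python
import statistics

def per_base_status(qualities, offset=33):
    """Return 'pass', 'warn' or 'fail' from per-position quality statistics."""
    length = max(len(q) for q in qualities)
    status = "pass"
    for pos in range(length):
        # Scores of every read that is long enough to cover this position.
        scores = [ord(q[pos]) - offset for q in qualities if pos < len(q)]
        median = statistics.median(scores)
        if len(scores) > 1:
            lower_quartile = statistics.quantiles(scores, n=4, method="inclusive")[0]
        else:
            lower_quartile = scores[0]
        if lower_quartile < 5 or median < 20:
            return "fail"
        if lower_quartile < 10 or median < 25:
            status = "warn"
    return status

# Made-up reads: 'I' = 40, '5' = 20, '#' = 2 in Phred+33.
good = per_base_status(["IIII", "IIII", "IIII"])
degraded = per_base_status(["III5", "III5", "III5"])
bad = per_base_status(["III#", "III#", "III#"])
```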
Per Sequence Quality Scores
● X = quality scores
● Y = number of sequences
● Shows whether a proportion of the sequences in a run have low quality, which indicates a systematic problem (one end of a flowcell, ...)
Per Sequence Quality Scores
● Warning: most frequently observed mean quality < 27 (0.2% error rate)
● Failure: most frequently observed mean quality < 20 (1% error rate)
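The error rates quoted with these thresholds follow from inverting the Phred formula, Pe = 10^(-Q/10). The sketch below also shows one possible way to find the most frequently observed mean quality. It assumes Phred+33 and non-empty reads, and rounds each mean to an integer; the names are illustrative.

```python
from collections import Counter

def error_rate(q):
    """Error probability corresponding to a Phred score q."""
    return 10 ** (-q / 10)

def per_sequence_status(qualities, offset=33):
    """Classify a run from the most frequent per-read mean quality."""
    means = [round(sum(ord(c) - offset for c in q) / len(q)) for q in qualities]
    most_common_mean, _ = Counter(means).most_common(1)[0]
    if most_common_mean < 20:
        return "fail"
    if most_common_mean < 27:
        return "warn"
    return "pass"

# Made-up reads: 'I' = 40, '5' = 20, '#' = 2 in Phred+33.
status_good = per_sequence_status(["IIII", "IIII", "5555"])
status_warn = per_sequence_status(["5555", "5555", "IIII"])
status_fail = per_sequence_status(["####"])
```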
Per Base Sequence Content
● X = position in read
● Y = sequence content (%T, %C, %A, %G)
● In a random library: little to no difference between the different bases of a sequence run
● Detects overrepresented sequences (contamination)
Per Base Sequence Content
● Warning: difference between A and T, or G and C, > 10% at any position
● Failure: difference between A and T, or G and C, > 20% at any position
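A minimal sketch of this check: compute the percentage of each base at every position, then compare A with T and G with C. It assumes the differences are measured in percentage points; the function name is hypothetical.

```python
def per_base_content_status(sequences):
    """Return 'pass', 'warn' or 'fail' from A/T and G/C imbalance per position."""
    length = max(len(s) for s in sequences)
    status = "pass"
    for pos in range(length):
        column = [s[pos] for s in sequences if pos < len(s)]
        pct = {base: 100 * column.count(base) / len(column) for base in "ACGT"}
        diff = max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
        if diff > 20:
            return "fail"
        if diff > 10:
            status = "warn"
    return status

# Made-up reads: balanced at every position vs. all 'A' at the first position.
balanced = per_base_content_status(["AC", "TG", "AC", "TG"])
biased = per_base_content_status(["AA", "AA", "AT", "AT"])
```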
Per Base GC Content
● X = position in read
● Y = sequence content (%GC)
● In a random library: little to no difference between the different bases of a sequence run
● Detects overrepresented sequences (contamination)
Per Base GC Content
● Warning: GC content of any base deviates by more than 5% from the mean GC content
● Failure: GC content of any base deviates by more than 10% from the mean GC content
Per Sequence GC Content
● X = mean GC content
● Y = number of sequences
● Computes a normal distribution (blue)
● Plots the raw data (red)
● An unusually shaped distribution could indicate a contaminated library or some other kind of biased subset
Per Sequence GC Content
● Warning: the sum of the deviations from the normal distribution represents > 15% of the reads
● Failure: the sum of the deviations from the normal distribution represents > 30% of the reads
Per Base N Content
● X = position in read
● Y = N content
● It's not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence.
● However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.
Per Base N Content
● Warning: any position shows an N content of > 5%
● Failure: any position shows an N content of > 20%
Sequence Length Distribution
● X = sequence length
● Y = number of sequences
● Detects sequences trimmed by the pipelines (to remove poor quality)
● Warning: not all sequences are the same length
● Failure: any of the sequences has zero length
Sequence duplication level
● X = sequence duplication level
● Y = proportion of non-unique vs. unique sequences
● A low level of duplication may indicate a very high level of coverage
● A high level of duplication indicates some kind of enrichment bias
Sequence duplication level
● Warning: non-unique sequences make up more than 20% of the total
● Failure: non-unique sequences make up more than 50% of the total
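As a rough illustration of these thresholds, the sketch below counts every read whose exact sequence occurs more than once as non-unique. That definition is an assumption made for this example; FastQC's own procedure (for instance any sampling or truncation of reads) is not reproduced here.

```python
from collections import Counter

def non_unique_percentage(sequences):
    """Percentage of reads whose exact sequence appears more than once."""
    counts = Counter(sequences)
    non_unique = sum(n for n in counts.values() if n > 1)
    return 100 * non_unique / len(sequences)

def duplication_status(sequences):
    pct = non_unique_percentage(sequences)
    if pct > 50:
        return "fail"
    if pct > 20:
        return "warn"
    return "pass"

# Made-up reads for illustration.
status_unique = duplication_status(["A", "C", "G", "T"])
status_some = duplication_status(["A", "A", "C", "G", "T"])
status_many = duplication_status(["A", "A", "A", "C"])
```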
Overrepresented Sequences
● Lists all of the sequences which make up more than 0.1% of the total
● Looks for matches in a database of common contaminants
Overrepresented Sequences
● Warning: any sequence is found to represent more than 0.1% of the total
● Failure: any sequence is found to represent more than 1% of the total
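The 0.1% and 1% thresholds can be illustrated with a simple counter over exact sequences. This sketch leaves out the contaminant database lookup, and the example reads are synthetic.

```python
from collections import Counter
from itertools import islice, product

def overrepresented(sequences, threshold_pct=0.1):
    """Map each sequence above threshold_pct to its percentage of the total."""
    total = len(sequences)
    percentages = {seq: 100 * n / total for seq, n in Counter(sequences).items()}
    return {seq: pct for seq, pct in percentages.items() if pct > threshold_pct}

def overrepresented_status(sequences):
    hits = overrepresented(sequences)
    if any(pct > 1 for pct in hits.values()):
        return "fail"
    return "warn" if hits else "pass"

# Synthetic data: 3 copies of "AAAA" among 2000 reads (0.15%), all others unique.
reads = ["AAAA"] * 3 + ["".join(p) for p in islice(product("ACGT", repeat=6), 1997)]
hits = overrepresented(reads)
status = overrepresented_status(reads)
```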
Overrepresented Kmers
● K-mers? (5-mers)
● Long sequences and poor quality reduce the counts for exactly duplicated sequences.
● Detects a partial sequence which appears at a variety of places (it won't be seen by the per base content plot or the duplicate sequence analysis).
● A graph for the top 6 hits shows the enrichment of that k-mer across the length of your reads.
● This will show if you have a general enrichment, or if there is a pattern of bias at different points over your read length.
● Based on the base content of the library, calculates an expected level at which this k-mer should have been seen
● Uses the actual count to calculate an observed/expected ratio for that k-mer
● Warning: any k-mer is enriched more than 3 fold overall, or more than 5 fold at any individual position
● Failure: any k-mer is enriched more than 10 fold at any individual base position
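The observed/expected ratio described above can be sketched as follows. The expected count is taken here as the number of k-mer positions multiplied by the product of the library-wide frequencies of the k-mer's bases. This is one reading of "based on the base content of the library" and only covers the overall enrichment, not the per-position one. The function name is hypothetical.

```python
from collections import Counter
from math import prod

def kmer_enrichment(sequences, k=5):
    """Return the observed/expected ratio of every k-mer seen in the reads."""
    base_counts = Counter("".join(sequences))
    total_bases = sum(base_counts.values())
    base_freq = {base: n / total_bases for base, n in base_counts.items()}
    kmer_counts = Counter(
        s[i:i + k] for s in sequences for i in range(len(s) - k + 1)
    )
    total_kmers = sum(kmer_counts.values())
    ratios = {}
    for kmer, observed in kmer_counts.items():
        # Expected count if bases were drawn independently at library frequencies.
        expected = total_kmers * prod(base_freq[base] for base in kmer)
        ratios[kmer] = observed / expected
    return ratios

# Tiny made-up library: half A, half C, so each 5-mer is expected 2 * 0.5**5 times.
ratios = kmer_enrichment(["AAAAA", "CCCCC"])
```

A ratio above 3 for any k-mer would correspond to the overall warning threshold listed above.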
Practical session 2
● Warning! “chr12 exon”: processing times are substantial
● The results are available in the history “groomer + fastQC on chr12”
● Click “shared data”, then “publish histories”
● Click “groomer + fastQC on chr12”
● Click “Import history”
● Which tool was used to produce dataset no. 2?
● View the results
Dataset
• Public data: exome sequenced by the International HapMap Project
• Single-end reads of 100bp, Illumina Genome Analyzer IIx
• RNA-seq data of this exome available (Pickrell et al., Nature, 2010)
● Starting from dataset 9, look at the tool that was used
● What does a window size of 1 mean?
● Why was the quality value 28 chosen?
● Identify “trimmed” reads
● Which quality values were removed?
FastQ Quality Trimmer
Simple trimming of the ends
● ATCCTTTATAAATAATTAATA: min qual <= 28 ?
● ATCCTTTATAAATAATTAAT: min qual <= 28 ?
● ATCCTTTATAAATAATTAA: min qual <= 28 ?
● ...
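The end-trimming loop pictured above can be sketched in a few lines. This is one reading of the slide, not the tool's actual code: a window size of 1, Phred+33 qualities, trimming from the 3' end only, and removing the last base for as long as its quality is <= 28. The function name and the quality string are made up for the example.

```python
def trim_3prime(sequence, quality, threshold=28, offset=33):
    """Drop bases from the 3' end while the last base has quality <= threshold."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset <= threshold:
        end -= 1  # window of size 1: test one base at a time
    return sequence[:end], quality[:end]

# Read from the slide with a made-up quality string: 19 good bases ('I' = 40),
# then two poor ones ('5' = 20 and '#' = 2).
read = "ATCCTTTATAAATAATTAATA"
qual = "I" * 19 + "5#"
trimmed_read, trimmed_qual = trim_3prime(read, qual)
```

With this quality string the two poor bases at the end are removed and the loop stops at the first base with a quality above 28.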
Quality scores after trimming