Statistical Analysis of Microarray Data · Data from Ogawa et al. New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed

Visualization

Statistical Analysis of Microarray Data

Jacques van Helden

[email protected] Aix-Marseille Université, France

Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090)

http://jacques.van-helden.perso.luminy.univ-amu.fr/

FORMER ADDRESS (1999-2011) Université Libre de Bruxelles, Belgium

Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/

Heat maps

•  Eisen (1998) introduced a visualization tool that displays the expression profiles of many genes at once.
   -  Each row represents one gene, each column one chip.
   -  Gene profiles can be aligned along the dendrogram resulting from hierarchical clustering.
•  This visualization mode combines clustering and expression profiles.
•  Problem of isomorphism:
   -  The two outgoing branches from each intermediate node can be swapped arbitrarily.
   -  The distance between two genes is represented on the horizontal axis (depth of their first common parent node).
   -  The vertical distance between two genes does not reflect the calculated distance: some genes are direct neighbours on the vertical axis although they are very distant in the tree.

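[Figure: heat map of clustered gene expression profiles. Gene cluster labels: MCM, CLB2, SIC1, MAT, CLN2, Y', MET. Column groups: Alpha, cdc15, cdc28, Elu.]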

Spellman et al. (1998). Mol Biol Cell 9(12), 3273-97.
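The isomorphism problem can be illustrated with a small sketch. It assumes a hypothetical binary tree over four made-up genes (g1 to g4), written as nested tuples, and enumerates every top-to-bottom leaf order obtained by swapping the two branches at each intermediate node. This is not the clustering procedure of the cited papers, only a way to count the equivalent drawings of one tree.

```python
from itertools import product


def leaf_orders(tree):
    """Return every leaf order obtainable by swapping branches at internal nodes."""
    if isinstance(tree, str):              # a leaf: a single gene name
        return [[tree]]
    left, right = tree                     # an internal node has two branches
    orders = []
    for lo, ro in product(leaf_orders(left), leaf_orders(right)):
        orders.append(lo + ro)             # branches as drawn
        orders.append(ro + lo)             # the same node with branches swapped
    return orders


# Hypothetical tree: g1 and g2 join first, g3 and g4 join first,
# and the two pairs only meet at the root.
tree = (("g1", "g2"), ("g3", "g4"))
orders = leaf_orders(tree)

print(len(orders))                         # 2 ** (4 - 1) = 8 equivalent drawings
for order in orders:
    print(" ".join(order))
```

All eight orders describe the same tree. In some of them g2 and g3 are direct neighbours although they only join at the root, which is why vertical proximity in the heat map should not be read as similarity.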

Reduction in data dimension


Why reduce dimensionality?

•  A series of microarrays can be represented as an N x p matrix, where
   -  each of the p columns contains information about an experiment (different conditions, treatments, tissues);
   -  each of the N rows contains information about a spot (gene).
•  Object dimensions
   -  Each gene can be considered as a p-dimensional object (one dimension per experiment).
   -  Each experiment can be considered as an N-dimensional object (one dimension per gene).
•  Visualization
   -  Visualization devices are restricted to 2 (printer) or at best 3 (space explorer) dimensions.
   -  One would thus like to display objects in 2D or 3D while retaining as much information as possible.
   -  After reduction of dimensions, some clusters may already appear in the data set.
•  Analysis
   -  Some analysis methods lose accuracy when there are too many variables (over-fitting).
   -  Reducing the data to a subset of dimensions allows a trade-off between the loss of information and the gain in accuracy. In this case, the appropriate number of dimensions may be higher than 3; its choice depends on the data itself (e.g. number of objects per training group).

How to reduce dimensionality?

•  Several methods are available for reducing the number of dimensions of a data set:
   -  Principal Component Analysis
   -  Singular Value Decomposition
   -  Spring embedding

Principal component analysis – Principle of the method

•  Gilbert, D., Schroeder, M. & van Helden, J. (2000). Trends in Biotechnology 18(Dec), 487-495.

A.  Multidimensional data
   -  n objects, p variables (in this case p=2)
B.  Principal components
   -  n objects, p factors
   -  Each factor is a linear combination of variables
C.  Reduction in dimensions
   -  Selection of a subset of principal components
   -  q factors, with q < p (in this case, q=1)

[Figure: three panels (A, B, C) with axes Variable 1 / Variable 2, Factor 1 / Factor 2, and Factor 1 alone.]
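Steps A to C can be sketched in a few lines. The example below is a minimal illustration on synthetic data (200 made-up objects described by 2 correlated variables), assuming NumPy is available. It obtains the factors from the eigenvectors of the covariance matrix, which is one common way to compute principal components and not necessarily the one used for the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# A. Multidimensional data: n objects, p = 2 correlated variables (synthetic).
n = 200
variable1 = rng.normal(size=n)
variable2 = 0.8 * variable1 + 0.3 * rng.normal(size=n)
X = np.column_stack([variable1, variable2])          # shape (n, p)

# B. Principal components: each factor is a linear combination of the variables.
Xc = X - X.mean(axis=0)                              # center each variable
cov = np.cov(Xc, rowvar=False)                       # p x p covariance matrix
eigval, eigvec = np.linalg.eigh(cov)                 # eigenvalues in ascending order
order = np.argsort(eigval)[::-1]                     # sort by decreasing variance
eigval, eigvec = eigval[order], eigvec[:, order]
factors = Xc @ eigvec                                # shape (n, p): n objects, p factors

# C. Reduction in dimensions: keep the first q factors, with q < p.
q = 1
reduced = factors[:, :q]                             # shape (n, q)

print("variance per factor:", eigval)
print("reduced shape:", reduced.shape)
```

The columns of `eigvec` hold the weights of the linear combinations, and the variance of each factor equals the corresponding eigenvalue, so dropping the last factor discards the direction with the least variance.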

PCA example: phosphate data (Ogawa 2000)

•  Data from Ogawa et al. New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis. Mol Biol Cell (2000) vol. 11 (12) pp. 4309-21.
•  Script: in statistics for bioinformatics, multivariate_analysis.R
•  The original data set contains 5783 objects (genes) characterized by 8 variables (samples).
•  The plot shows the first (PC1) and second (PC2) principal components.
•  Colors highlight some genes known to be regulated by phosphate (blue), by methionine (green), or by neither (black).
•  The first component visibly separates several of the phosphate-responding genes from the other genes.

PCA on Golub data set

•  PCA applied to a selection of 367 genes showing a significant difference in expression between cancer types (AML vs ALL).
•  The first component
   -  explains 70% of the variance;
   -  perfectly separates the two main patient types (AML and ALL).
•  The second component reveals two sub-groups among patients of the ALL type.
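The share of variance explained by each component can be computed directly from the singular values of the centered data matrix. The sketch below uses synthetic data (two made-up groups of 20 samples and 50 genes, with an arbitrary shift between groups), not the Golub data set, and assumes NumPy. It only shows how such a percentage is obtained and how a separation along PC1 can be checked.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic expression table: rows = samples, columns = genes.
n_per_group, n_genes = 20, 50
group = np.repeat([0, 1], n_per_group)               # hypothetical class labels
shift = np.where(group[:, None] == 1, 3.0, 0.0)      # group 1 is shifted on every gene
X = rng.normal(size=(2 * n_per_group, n_genes)) + shift

# PCA through the SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                                       # sample coordinates on each component
explained = s ** 2 / np.sum(s ** 2)                  # fraction of variance per component

pc1 = scores[:, 0]
a, b = pc1[group == 0], pc1[group == 1]
separated = a.max() < b.min() or b.max() < a.min()   # do the groups overlap on PC1?

print("variance explained by PC1: %.1f%%" % (100 * explained[0]))
print("groups separated on PC1:", separated)
```

The fractions in `explained` sum to 1, so a value such as 70% for the first component means that the remaining components together carry 30% of the variance.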

PCA – Den Boer (2009)

[Figure: PCA; Den Boer (2009); 190 samples * 22283 genes. Three scatter plots of the samples: PC1 versus PC2, PC3 versus PC4, and PC5 versus PC6. Points are labelled by sample group (T, Bt, Bth, Bh, BE, BEp, BEs, Bc, Bch, BM, Bo). Legend: BCR-ABL + hyperdiploidy, BCR-ABL, E2A-rearranged (E), E2A-rearranged (E-sub), E2A-rearranged (EP), MLL, T-ALL, TEL-AML1 + hyperdiploidy, TEL-AML1, hyperdiploid, pre-B ALL.]

Singular value decomposition

Gilbert et al. (2000). Trends Biotech. 18(Dec), 487-495.

[Figure: gene cluster labels MCM, CLB2, SIC1, MAT, CLN2, Y', MET.]

Singular value decomposition - Cell cycle

•  Calculate a distance matrix between objects
   -  in this case, Pearson's coefficient of correlation
•  Assign 2D coordinates that reflect the distances as closely as possible

[Figure: two panels, "Cell cycle data" and "Randomized data".]
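The two steps above can be sketched as follows. The slides do not give the exact procedure, so this is one possible reading: the Pearson correlation matrix between genes is decomposed by SVD and the first two singular vectors, scaled by the square roots of their singular values, are used as 2D coordinates. The profiles are synthetic (30 made-up genes in two groups with different phases over 12 time points), and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic periodic profiles: 15 genes following sin(t), 15 following cos(t).
t = np.linspace(0, 4 * np.pi, 12)
profiles = np.vstack([
    np.sin(t) + 0.2 * rng.normal(size=(15, 12)),
    np.cos(t) + 0.2 * rng.normal(size=(15, 12)),
])

# Step 1: Pearson correlation between every pair of genes (30 x 30, symmetrical).
corr = np.corrcoef(profiles)

# Step 2: SVD of the correlation matrix, keeping two dimensions.
U, s, Vt = np.linalg.svd(corr)
coords = U[:, :2] * np.sqrt(s[:2])                   # one 2D point per gene

# coords @ coords.T is the best rank-2 approximation of the correlation matrix.
approx = coords @ coords.T
residual = np.linalg.norm(corr - approx)

centroid_a = coords[:15].mean(axis=0)
centroid_b = coords[15:].mean(axis=0)
print("distance between group centroids:", np.linalg.norm(centroid_a - centroid_b))
print("residual of the rank-2 approximation:", residual)
```

Because the correlation matrix is symmetrical and positive semi-definite, its SVD coincides with its eigendecomposition, so genes with highly correlated profiles end up close together in the plane, within the limits of a 2D representation.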

Data reduction with principal components

•  Data from Gasch (2000). Growth on alternate carbon sources (13 chips).
•  Selection of 224 genes which are significantly regulated in at least one chip.
•  The plot represents the first two components after PCA transformation.
•  Colours represent 15 clusters obtained with K-means clustering.

Singular value decomposition - Carbon sources

•  Data from Gasch (2000). Growth on alternate carbon sources (13 chips).
•  Subset of 224 genes significantly regulated in at least one chip.
•  Singular value decomposition (SVD) on the correlation matrix.
•  The clusters are better separated than with PCA.
•  The proximity between two dots reflects their correlation (within the constraints of the 2D space).


[Figure: workflow diagram linking data, processes and displays.]

Data
•  Raw data
•  Multivariate data matrix: n objects, p variables
•  Normalized table: n elements, p dimensions
•  Distance matrix: n x n distances, symmetrical
•  Coordinates: n elements, d dimensions
•  Clusters, tree

Processing
•  Normalization: mean, variance, covariance
•  Pairwise distance measurement
•  Principal component analysis
•  Multidimensional scaling: PCoA, spring embedding
•  Reduction to significant dimensions
•  Clustering
•  Ordering (optional): row swapping, column swapping
•  Coloring (optional)

Visualization
•  Tree drawing. Dendrogram: rooted, unrooted, n leaves
•  Space explorer (VRML). Euclidean space: 1D to 3D, n dots, coloring, dot volume, interactive
•  Matrix viewer. Matrix: n rows, p columns, coloring

Adapted from Gilbert et al. (2000). Trends Biotech. 18(Dec), 487-495.
