57
Reconstructing ancient genomes architectures A la recherche des g´ enomes perdus Cedric Chauve Departement of Mathematics, Simon Fraser University VanBUG, September 8, 2011 C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 1 / 39

A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing ancient genomes architecturesA la recherche des genomes perdus

Cedric Chauve

Departement of Mathematics, Simon Fraser University

VanBUG, September 8, 2011

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 1 / 39

Page 2: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

1 Introduction and background

2 Reconstructing the ancestral boreoeutherian genome: CARs

3 Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

4 Conflicts in non-C1P matrices

5 Handling missing markers: the sandwich-C1P

6 Conclusion and perspectives

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 2 / 39

Page 3: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Introduction and background

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 3 / 39

Page 4: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Genome evolution

Figure 4. Boreoeutherian and teleost ancestral genomes. The two ancestral genomes are placed in a simplified cladistic tree of chordateevolution. Colour codes are according to human chromosomes, shown at the bottom. The duplicated teleost genome displays a differentcolour code, to illustrate the distribution of each duplicated chromosome in the extant teleost and medaka genomes. Although notrepresentative of all boreoeutherian, the mouse, gibbon and human karyotypes illustrate how the latter has remained relatively similar totheir last common ancestor. Colours in the teleost ancestral genome do no reflect the actual position of human segments relative to eachother, simply their relative abundance. Data were collected from Ref 31 for the boreoeutherian ancestor except for one edition (see text),from Ensembl v47 (http://www.ensembl.org) for orthologous genes between the mouse and the human genomes, from the Chromhomedatabase (http://www.chromhome.org) for the Zoo-FISH data between human and gibbon of Ref. 18 and fromRef. 9 for the preduplicationancestor and its anchorage in extant fish genomes.

Review articles

130 BioEssays 30.2

[Muffato and Roest-Croellius, 2008]

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 4 / 39

Page 5: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Computational paleogenomics

Reconstruct the genome (architecture) of extinct species.

[Murphy et al, 2005] Dynamics of mam-malian chromosome evolution inferred frommultispecies comparative maps. [Kemkemeret al, 2009] Gene synteny comparisons be-tween different vertebrates provide new in-sights into breakage and fusion events dur-ing mammalian karyotype evolution. [Ma etal, 2006] Reconstructing contiguous regionsof an ancestral genome. [Alekseyev andPevzner, 2009] Breakpoint graphs and an-cestral genome reconstructions. [Nakataniet al, 2007] Reconstruction of the ver-tebrate ancestral genome reveals dynamicgenome reorganization in early vertebrates.[Salse et al, 2009] Improved criteria andcomparative genomics tool provide new in-sights into grass paleogenomics. [Rascol etal, 2007] Ancestral animal genome recon-struction. [Faraut 2008] Addressing chro-mosome evolution in the whole-genome se-quence era. [Muffato and Roest-Crollius,2008] Paleogenomics, or the recovery of lostgenomes from the mist of times.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 5 / 39

Page 6: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Applications

Evolutionary genomics

Infer evolutionary genome rearrangements, correlation between GC content andchromosome size [Romiguier et al, 2010] , evolution of long-range regulation[Mongin et al, 2009], . . .

Comparative genome assembly and mapping

Anchoring contigs in partially assembled genomes, designing genome maps.

Gar Lepisosteida

Number of known species

Genome duplication

Sardine Clupeomorpha 357

Zebra!sh*

Carp

Cat!sh†

Salmon

Trout†

Cod† Paracanthopterygii 1212

Medaka*

Halibut

Stickleback* Gasterosteiformes 257

Tilapia† Perciformes 9293

Fugu*

Tetraodon*

Protacanthopterygii 312

Atherinomorpha 1283

Ostariophysi 6375

Pleuronectiformes 570

Tetraodontiformes 339

Family

Outgroup

[Davidson et al, 2010]

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 6 / 39

Page 7: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Cytogenetics (FISH and local parsimony)

Data acquisition: homology detection by Zoo-FISH

8 CHAPITRE 2. ÉTAT DE L’ART DES MÉTHODES DE RECONSTRUCTION

et d’ordre conservé. Le langage courant utilise néanmoins le terme de «segment de syn-ténie» pour désigner un bloc de gènes dont l’ordre est conservé entre deux espèces. Entredeux espèces, la synténie et l’ordre des gènes sont presque complètement conservés aumoment de leur spéciation d’un ancêtre commun, et se dégradent progressivement au gréde l’évolution et des réarrangements du génome.

2.1 CytogénétiqueLes premières expériences de génomique comparative datent de plus de 30 ans, et

ont ouvert la possibilité de comparer les chromosomes (morphologie et bandes chromoso-miques) à partir de cellules en métaphase. Ces techniques ont fourni les premières don-nées pour permettre de déterminer les réarrangements ancestraux de chromosomes chezles vertébrés [Rumpler et Dutrillaux, 1976, Yunis et Prakash, 1982]. Plus récemment,des expériences d’hybridation fluorescente in-situ (FISH : fluorescence in situ hybridi-sation) entre espèces ont été développées et ont permis d’améliorer fortement la préci-sion et la portée de ces approches. Dans une expérience typique, l’ADN d’un chromosomedonné d’une espèce de référence (souvent l’homme) est purifié, marqué par fluorescence,découpé, puis hybridé sur tous les chromosomes d’une autre espèce, cible. Il est mêmepossible d’étudier la répartition de plusieurs chromosomes de l’espèce référence à la foisen utilisant plusieurs types de marqueurs fluorescents. Cette technique s’appelle Chro-mosome painting ou encore Zoo-FISH [Scherthan et al., 1994, Wienberg et al., 1990].L’analyse des images au microscope (Figure 2.1) permet de découvrir des différences chro-mosomiques de grande échelle (de l’ordre du mégabase), comme les fusions, les fissions,ou les translocations d’une grande région. En revanche, les translocations de petites ré-gions ne sont pas détectables, et les inversions (intra-chromosomique) sont, par principede l’expérience, impossibles à repérer.

A : B : C :

FIGURE 2.1 – Exemples d’études de cytogénétique (hybridation) tirés de Svartman et al.[2006]. Dans chaque expérience, deux chromosomes humains sont hybridés simultané-ment. A : Les chromosomes 13 et 18 humains sont chacun d’un seul tenant chez le tatouà neuf bandes, Dasypus novemcinctus, ce qui laisse supposer que c’était aussi le cas chezleur dernier ancêtre commun. B : Les chromosomes humains 2 et 6 sont fragmentés endeux morceaux chacun dans le génome du fourmilier à collier, Tamandua tetradactyla, cequi révèle des réarrangements ancestraux de translocations, fusions ou fissions. C : Leschromosomes humains 3 et 21 s’hybrident sur le même chromosome du paresseux d’Hoff-mann, Choleopus Hoffmanii, ce qui peut signifier une association ancestrale 3/21. Pour Bet C, il faudra une troisième espèce pour décider de la configuration ancestrale.

Classiquement, les expériences révèlent le nombre de segments chromosomiques del’espèce cible qui correspondent à un chromosome (entier) de l’espèce de référence. Si ce

(a) human/armadillo (b) human/anteater (c) human/sloth[Svartmann et al, 2006]

Ancestral characters inferrence: Dollo parsimony

If two chromosomes of a reference genome hybridize together in two targetgenomes whose path contain an ancestor, they are likely to define a chromosomalassociation in this ancestor.

Characteristics

High resolution (1Mb). Limited to recent ancestors (mammalians). Can use alarge number of extant species. Two-stage process.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 7 / 39

Page 8: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Cytogenetics (FISH and local parsimony)

Data acquisition: homology detection by Zoo-FISH

8 CHAPITRE 2. ÉTAT DE L’ART DES MÉTHODES DE RECONSTRUCTION

et d’ordre conservé. Le langage courant utilise néanmoins le terme de «segment de syn-ténie» pour désigner un bloc de gènes dont l’ordre est conservé entre deux espèces. Entredeux espèces, la synténie et l’ordre des gènes sont presque complètement conservés aumoment de leur spéciation d’un ancêtre commun, et se dégradent progressivement au gréde l’évolution et des réarrangements du génome.

2.1 CytogénétiqueLes premières expériences de génomique comparative datent de plus de 30 ans, et

ont ouvert la possibilité de comparer les chromosomes (morphologie et bandes chromoso-miques) à partir de cellules en métaphase. Ces techniques ont fourni les premières don-nées pour permettre de déterminer les réarrangements ancestraux de chromosomes chezles vertébrés [Rumpler et Dutrillaux, 1976, Yunis et Prakash, 1982]. Plus récemment,des expériences d’hybridation fluorescente in-situ (FISH : fluorescence in situ hybridi-sation) entre espèces ont été développées et ont permis d’améliorer fortement la préci-sion et la portée de ces approches. Dans une expérience typique, l’ADN d’un chromosomedonné d’une espèce de référence (souvent l’homme) est purifié, marqué par fluorescence,découpé, puis hybridé sur tous les chromosomes d’une autre espèce, cible. Il est mêmepossible d’étudier la répartition de plusieurs chromosomes de l’espèce référence à la foisen utilisant plusieurs types de marqueurs fluorescents. Cette technique s’appelle Chro-mosome painting ou encore Zoo-FISH [Scherthan et al., 1994, Wienberg et al., 1990].L’analyse des images au microscope (Figure 2.1) permet de découvrir des différences chro-mosomiques de grande échelle (de l’ordre du mégabase), comme les fusions, les fissions,ou les translocations d’une grande région. En revanche, les translocations de petites ré-gions ne sont pas détectables, et les inversions (intra-chromosomique) sont, par principede l’expérience, impossibles à repérer.

A : B : C :

FIGURE 2.1 – Exemples d’études de cytogénétique (hybridation) tirés de Svartman et al.[2006]. Dans chaque expérience, deux chromosomes humains sont hybridés simultané-ment. A : Les chromosomes 13 et 18 humains sont chacun d’un seul tenant chez le tatouà neuf bandes, Dasypus novemcinctus, ce qui laisse supposer que c’était aussi le cas chezleur dernier ancêtre commun. B : Les chromosomes humains 2 et 6 sont fragmentés endeux morceaux chacun dans le génome du fourmilier à collier, Tamandua tetradactyla, cequi révèle des réarrangements ancestraux de translocations, fusions ou fissions. C : Leschromosomes humains 3 et 21 s’hybrident sur le même chromosome du paresseux d’Hoff-mann, Choleopus Hoffmanii, ce qui peut signifier une association ancestrale 3/21. Pour Bet C, il faudra une troisième espèce pour décider de la configuration ancestrale.

Classiquement, les expériences révèlent le nombre de segments chromosomiques del’espèce cible qui correspondent à un chromosome (entier) de l’espèce de référence. Si ce

(a) human/armadillo (b) human/anteater (c) human/sloth[Svartmann et al, 2006]

Ancestral characters inferrence: Dollo parsimony

If two chromosomes of a reference genome hybridize together in two targetgenomes whose path contain an ancestor, they are likely to define a chromosomalassociation in this ancestor.

Characteristics

High resolution (1Mb). Limited to recent ancestors (mammalians). Can use alarge number of extant species. Two-stage process.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 7 / 39

Page 9: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Cytogenetics (FISH and local parsimony)

Data acquisition: homology detection by Zoo-FISH

8 CHAPITRE 2. ÉTAT DE L’ART DES MÉTHODES DE RECONSTRUCTION

et d’ordre conservé. Le langage courant utilise néanmoins le terme de «segment de syn-ténie» pour désigner un bloc de gènes dont l’ordre est conservé entre deux espèces. Entredeux espèces, la synténie et l’ordre des gènes sont presque complètement conservés aumoment de leur spéciation d’un ancêtre commun, et se dégradent progressivement au gréde l’évolution et des réarrangements du génome.

2.1 CytogénétiqueLes premières expériences de génomique comparative datent de plus de 30 ans, et

ont ouvert la possibilité de comparer les chromosomes (morphologie et bandes chromoso-miques) à partir de cellules en métaphase. Ces techniques ont fourni les premières don-nées pour permettre de déterminer les réarrangements ancestraux de chromosomes chezles vertébrés [Rumpler et Dutrillaux, 1976, Yunis et Prakash, 1982]. Plus récemment,des expériences d’hybridation fluorescente in-situ (FISH : fluorescence in situ hybridi-sation) entre espèces ont été développées et ont permis d’améliorer fortement la préci-sion et la portée de ces approches. Dans une expérience typique, l’ADN d’un chromosomedonné d’une espèce de référence (souvent l’homme) est purifié, marqué par fluorescence,découpé, puis hybridé sur tous les chromosomes d’une autre espèce, cible. Il est mêmepossible d’étudier la répartition de plusieurs chromosomes de l’espèce référence à la foisen utilisant plusieurs types de marqueurs fluorescents. Cette technique s’appelle Chro-mosome painting ou encore Zoo-FISH [Scherthan et al., 1994, Wienberg et al., 1990].L’analyse des images au microscope (Figure 2.1) permet de découvrir des différences chro-mosomiques de grande échelle (de l’ordre du mégabase), comme les fusions, les fissions,ou les translocations d’une grande région. En revanche, les translocations de petites ré-gions ne sont pas détectables, et les inversions (intra-chromosomique) sont, par principede l’expérience, impossibles à repérer.

A : B : C :

FIGURE 2.1 – Exemples d’études de cytogénétique (hybridation) tirés de Svartman et al.[2006]. Dans chaque expérience, deux chromosomes humains sont hybridés simultané-ment. A : Les chromosomes 13 et 18 humains sont chacun d’un seul tenant chez le tatouà neuf bandes, Dasypus novemcinctus, ce qui laisse supposer que c’était aussi le cas chezleur dernier ancêtre commun. B : Les chromosomes humains 2 et 6 sont fragmentés endeux morceaux chacun dans le génome du fourmilier à collier, Tamandua tetradactyla, cequi révèle des réarrangements ancestraux de translocations, fusions ou fissions. C : Leschromosomes humains 3 et 21 s’hybrident sur le même chromosome du paresseux d’Hoff-mann, Choleopus Hoffmanii, ce qui peut signifier une association ancestrale 3/21. Pour Bet C, il faudra une troisième espèce pour décider de la configuration ancestrale.

Classiquement, les expériences révèlent le nombre de segments chromosomiques del’espèce cible qui correspondent à un chromosome (entier) de l’espèce de référence. Si ce

(a) human/armadillo (b) human/anteater (c) human/sloth[Svartmann et al, 2006]

Ancestral characters inferrence: Dollo parsimony

If two chromosomes of a reference genome hybridize together in two targetgenomes whose path contain an ancestor, they are likely to define a chromosomalassociation in this ancestor.

Characteristics

High resolution (1Mb). Limited to recent ancestors (mammalians). Can use alarge number of extant species. Two-stage process.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 7 / 39

Page 10: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Bioinformatics (global parsimony)

Global parsimony.

Given a set of extant genomes and genomic markers, together with a species tree,find the ancestors that minimize the number of genome rearrangements along thespecies tree.

An old problem

From manual inspection [Sturtevant and Novitski, 1941] to large-scalecomputational analysis [Murphy et al, 2005], through a well definedcombinatorial optimization problem [Sankoff, 1992].

Characteristics

Low resolution (100kb). Limited to fully assembled genomes. Limited to 1-to-1orthologous markers. Two-stage process (inferring markers, inferring evolutionaryscenarios). Pitfall: large number of optimal or slightly sub-optimal solutions, nosampling methods available.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 8 / 39

Page 11: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Bioinformatics (global parsimony)

Global parsimony.

Given a set of extant genomes and genomic markers, together with a species tree,find the ancestors that minimize the number of genome rearrangements along thespecies tree.

An old problem

From manual inspection [Sturtevant and Novitski, 1941] to large-scalecomputational analysis [Murphy et al, 2005], through a well definedcombinatorial optimization problem [Sankoff, 1992].

Characteristics

Low resolution (100kb). Limited to fully assembled genomes. Limited to 1-to-1orthologous markers. Two-stage process (inferring markers, inferring evolutionaryscenarios). Pitfall: large number of optimal or slightly sub-optimal solutions, nosampling methods available.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 8 / 39

Page 12: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

Bioinformatics (global parsimony)

Global parsimony.

Given a set of extant genomes and genomic markers, together with a species tree,find the ancestors that minimize the number of genome rearrangements along thespecies tree.

An old problem

From manual inspection [Sturtevant and Novitski, 1941] to large-scalecomputational analysis [Murphy et al, 2005], through a well definedcombinatorial optimization problem [Sankoff, 1992].

Characteristics

Low resolution (100kb). Limited to fully assembled genomes. Limited to 1-to-1orthologous markers. Two-stage process (inferring markers, inferring evolutionaryscenarios). Pitfall: large number of optimal or slightly sub-optimal solutions, nosampling methods available.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 8 / 39

Page 13: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

The two parsimonies

!

" !

"

!"#

$

%

&

& %

#

$

%

$

"!

&

#

'()*+,-,.(/-01)2,3(4561,7(1-**8(-93251

!"#"$"%&'()*+"#","#-$(,.%#./0,(0*-0(

-+.(-12-)./0(3/(.40-/0(,5.)3.,(6.+.(

5+"7-7$&(-12-)./0(3/(0*.(-/).,0+-$(%./"#.8

%

$

&

#

!

"

&

%

#

$

9.-++-/%.#./0'()"#5:0.(-(5-+,3#"/3":,(

+.-++-/%.#./0(,)./-+3";(3/(-(%3<./(#"1.$(

"=(+.-++-/%.#./08

!

" !

"

!"#

$

%

&

& %

#

$

%

$

&

#

%

$

&

#

!

"

>*3)?./ !:#-/ @"% >*3)?./ !:#-/ @"%

The global parsimony approach

!

" !

"

!"#

$

%

&

& %

#

$

%

$

"!

&

#

'()*+,-,.(/-01)2,3(4561,7(1-**8(-93251

!"#"$"%&'()*+"#","#-$(,.%#./0,(0*-0(

-+.(-12-)./0(3/(.40-/0(,5.)3.,(6.+.(

5+"7-7$&(-12-)./0(3/(0*.(-/).,0+-$(%./"#.8

%

$

&

#

!

"

&

%

#

$

9.-++-/%.#./0'()"#5:0.(-(5-+,3#"/3":,(

+.-++-/%.#./0(,)./-+3";(3/(-(%3<./(#"1.$(

"=(+.-++-/%.#./08

!

" !

"

!"#

$

%

&

& %

#

$

%

$

&

#

%

$

&

#

!

"

>*3)?./ !:#-/ @"% >*3)?./ !:#-/ @"%

The local/Dollo parsimony approach

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 9 / 39

Page 14: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Introduction and background

2006: one ancestor, two approaches, two architecturesTwo approaches ... and two different architectures

for the same ancestral genome

7

Deux reconstructions di!érentes

du même génome (protoboreoeutherien)

Bourque, Tesler & Pevzner Genome Research 2006

Boreoeutherian ancestor[Froenicke et al, 2006; Bourque et al, 2006]

[Froenicke et al, 2006]: cytogenetics and bioinformatics (globalparsimony) disagree, but cytogenetics is right.

[Bourque et al, 2006]: global parsimony methods are not intended toinfer ancestors.

[Rocchi et al, 2006]: the computational method of [Ma and al,

2006] agrees with cytogenetics.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 10 / 39

Page 15: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral boreoeutherian genome: CARs

Reconstructing the ancestral boreoeutherian genome:Contiguous Ancestral Regions

[C. and Tannier, PLoS Comput Biol, 2008]

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 11 / 39

Page 16: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral boreoeutherian genome: CARs

From genomes to binary matrices

Definition

Each column of M is a gene/marker (orthology block), assumed to bepresent in a single occurrence in the ancestor.

Each row of M is a set of genes assumed to be contiguous in the ancestralgenome (common interval, adjacency).

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 12 / 39

Page 17: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral boreoeutherian genome: CARs

The Consecutive Ones Property (C1P)

Definition [Fulkerson and Gross, 1965]

Can we arrange the columns of M is such a way that, on each row, all entries 1are contiguous.

!"#$"#%&%'($'()(*#+#,-*%-.%+//%$-)),0/(%-'1(')%,*%+%234#'((

567%8

567%9

567%:

567%;

Fundamental property

If all rows of M are correct, then M isC1P. In such a case, all possible C1Pcolumn orders are encoded in a PQ-treethat define Contiguous AncestralRegions (CARs).

Otherwise, rows from M need to bediscarded or gaps need to be allowed.

Remark

A classical approach in physical mapping of genomes [Alizadeh et al, 2005].

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 13 / 39

Page 18: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral boreoeutherian genome: CARs

Results

[C. and Tannier, 2008]

Properties

Dataset of 8 mammaliangenomes (plus opossum andchicken as outgroups).

M: 824 columns (orthologyblocks, at resolution 200kb),1431 rows.

Agrees with cytogenetics.

Remark

The matrix is almost C1P (14discarded rows).

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 14 / 39

Page 19: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral boreoeutherian genome: CARs

Conclusion

Summary

A three-stages process: markers, matrix, PQ-tree.

A flexible framework.

Extends and formalizes the method of [Ma et al, 2006].

Combines well defined principles from cytogenetics and physical mapping.

Limitations

One-to-one orthologous markers.

Detects a strict signal of contiguity.

Matrices from real data are not C1P: false positive column/rows

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 15 / 39

Page 20: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral boreoeutherian genome: CARs

Conclusion

Summary

A three-stages process: markers, matrix, PQ-tree.

A flexible framework.

Extends and formalizes the method of [Ma et al, 2006].

Combines well defined principles from cytogenetics and physical mapping.

Limitations

One-to-one orthologous markers.

Detects a strict signal of contiguity.

Matrices from real data are not C1P: false positive column/rows

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 15 / 39

Page 21: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

Reconstructing the ancestral amniote genome:Ancestral Syntenic Groups

[Ouangraoua, Boyer, McPherson, Tannier and C., ISBRA 2009]

[Ouangraoua, Tannier and C., Bioinformatics, 2011]

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 16 / 39

Page 22: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

A similar and very different problem

Adapted from[Muffato and Roest-Croellius, 2008]

A much older ancestor.

Loss of the contiguity signal :there are 163 CARs.

!"#$%&'()*#$%'+#,*)-$./0,

."%-%+*#-(,*(+,$)1$*"#$%'+#,*-%2$./0,

3$456$./0,

3$7)$+"-)&),)&%2$,8'*#'(#,$1)-$+"(+9#'

!"#$%&'

No non-duplicated assembledoutgroup (teleost fishes).

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 17 / 39

Page 23: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

A complementary approach

1. Inferring CARs.Connected components of CARs

2. Mining for a weaker ancestral signal :groups of syntenic CARs.

Doubly Conserved Synteny (DCS)3. Comparing ingroups (amniotes) andoutgroups (teleost fishes) using a combi-natorial model integrating a WGD in out-groups [Jaillon et al, 2004].

!"#$%&'()*%#('"+,-*%(.*/'-0,-*),("#,-1*%*2!3*-,)#,(+*3*'-*-450*

+0%+6*

! Minimum number of genes having orthologs on A (resp. B): 7

! Maximum gap of genes having no ortholog on A or B: 1

! SA and S

B do overlap.

3+&'(),(+*.,+,5+'"(*"/*2!3*-')(%7

84#%(*-,)#,(+*93:

;,.%<%*-,)#,(+*9=:

;,.%<%*-,)#,(+*9>:

3>

3=

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 18 / 39

Page 24: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

Illustration!"#$%&'%(#)%&*+,$-./0-#(

!"#$%&' ($)*

+,-.$/0&$%#

1%))

2'0330#4'"'#&

2'0330#4'"'#&

! 5#.67080$.09.'7/%""%#74'#%"$/7"03:'3)7

9'&;''#70"#$%&')70#<7=$)*')>74'#')?7

@A#)'"9.7BBC?

! D%%37/%#)'380&$%#7%=7/%#&$4,$&67)$4#0.?

! D%))$9.'7&%7<'&'/&707+%,9.67E%#)'38'<7

F6#&'#67@+EFC7)$4#0.7 D3$#/$-.'7%=7+EF7>7G'..$)7'&70.?7HIIJ

!"#$%&'%(#)%&*+,$-./0-#(

!"#$%&' ($)*

+,-.$/0&$%#

1%))

2'0330#4'"'#&

2'0330#4'"'#&

! 5#.67080$.09.'7/%""%#74'#%"$/7"03:'3)7

9'&;''#70"#$%&')70#<7=$)*')>74'#')?7

@A#)'"9.7BBC?

! D%%37/%#)'380&$%#7%=7/%#&$4,$&67)$4#0.?

! D%))$9.'7&%7<'&'/&707+%,9.67E%#)'38'<7

F6#&'#67@+EFC7)$4#0.7 D3$#/$-.'7%=7+EF7>7G'..$)7'&70.?7HIIJ

Figure 8: Examples of Double Conserved Syntenies mapped onto chromosomes 3 to 10 ofhuman genome. There are four lines for each human chromosome. On the first line areindicated the ancestral markers from our dataset. On the three last line are mapped thedouble syntenies with the three teleost fishes: in each box are represented the genes, mappingon either one segment (top side of the box), or the other (bottom side of the box) of theconsidered teleost fish.

26

!"#$%&$'()$"*+,-.$/%$0#+0%12$(34"

!"#$%&'()!*(+, !"#$%&'()!*(+-

.)/(%#(01"2#(304%"%5%"'(6

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 19 / 39

Page 25: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

Detecting DCS

!"#$%&'()*%#('"+,-*%(.*/'-0,-*),("#,-1*%*2!3*-,)#,(+*3*'-*-450*

+0%+6*

! Minimum number of genes having orthologs on A (resp. B): 7

! Maximum gap of genes having no ortholog on A or B: 1

! SA and S

B do overlap.

3+&'(),(+*.,+,5+'"(*"/*2!3*-')(%7

84#%(*-,)#,(+*93:

;,.%<%*-,)#,(+*9=:

;,.%<%*-,)#,(+*9>:

3>

3=

Matching an amniote segment S on two teleost chromosomes A an d B.

Minimum number of genes on S , and minimum number of genes on A (resp .B): handled by a statistical test of synteny [Sankoff and Durand, 2003].

Maximal gap on S : 1. Maximal gap on A,B: ∞SA and SB overlap.

Parameters chosen to maximize the covering of amniote genomes by DCS.

Results match the ”manual” results of [Jaillon et al, 2004].

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 20 / 39

Page 26: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

Results

[Kohn et al, 2006] [Nakatani et al, 2007] [Ouangraoua, et al, 2009] [Ouangraoua, C. and Tannier, 2011]

18-19-27 18-27 18-271-24 1-24

21-23-26-32 26-32 21-26 21-26Z-22 Z-22

17-Z 10-13-17-Z 17-Z2-9-16 2-9

1-3-14-18, 3-14 3-144-22, 8-28

5-10, 1-7, 3-52-12, 1-8, 4-20, 9-19

1-2

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 21 / 39

Page 27: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

Conclusion

Summary

A five-stages process: markers, matrix, PQ-tree, DCS detection, DCSpost-processing.

Important contribution: an original method to detect DCS.

Relies on well defined principles, each associated to a specific evolutionarysignal.

Additional remarks

Validation: no simulation but methodological principles, stringentimplementations, statitiscal tests and extensive robustness analysis.

The gain from relying on a weaker signal (synteny) comes at the expense ofthe mathematical model for representing the ancestor (graph).

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 22 / 39

Page 28: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Reconstructing the ancestral amniote genome: Ancestral Syntenic Groups

Conclusion

Summary

A five-stages process: markers, matrix, PQ-tree, DCS detection, DCSpost-processing.

Important contribution: an original method to detect DCS.

Relies on well defined principles, each associated to a specific evolutionarysignal.

Additional remarks

Validation: no simulation but methodological principles, stringentimplementations, statitiscal tests and extensive robustness analysis.

The gain from relying on a weaker signal (synteny) comes at the expense ofthe mathematical model for representing the ancestor (graph).

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 22 / 39

Page 29: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

Conflict in non-C1P matrices

[C., Haus, Stephen, You, J Comput Biol, 2010]

[C., Jones, Stephen and Tamayo, work in progress]

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 23 / 39

Page 30: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

False positives

Probleme statement

Binary matrices generated from real data (boreoeutherian, amniotes) are notC1P, but almost C1P (signal quite clear).

Non-C1P matrices: some rows do not represent groups of markers that werecontiguous in the ancestor. False positives.

Fundamental question. Which ones and why?

A counting point of view [Bouvel, C., Mishna and Rossin, 2011]

Under the null hypothesis of random genomes, one expects to find only 2 commonintervals, both of size 2 (Poisson law, with mean 2) [Kaplanski, 1945; Xu et

al, 2008].

Question

Can some small non-ancestral common intervals (resulting from assemblymistakes or convergent evolution for example) create C1P obstructions?

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 24 / 39

Page 31: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

False positives

Probleme statement

Binary matrices generated from real data (boreoeutherian, amniotes) are notC1P, but almost C1P (signal quite clear).

Non-C1P matrices: some rows do not represent groups of markers that werecontiguous in the ancestor. False positives.

Fundamental question. Which ones and why?

A counting point of view [Bouvel, C., Mishna and Rossin, 2011]

Under the null hypothesis of random genomes, one expects to find only 2 commonintervals, both of size 2 (Poisson law, with mean 2) [Kaplanski, 1945; Xu et

al, 2008].

Question

Can some small non-ancestral common intervals (resulting from assemblymistakes or convergent evolution for example) create C1P obstructions?

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 24 / 39

Page 32: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

False positives

Probleme statement

Binary matrices generated from real data (boreoeutherian, amniotes) are notC1P, but almost C1P (signal quite clear).

Non-C1P matrices: some rows do not represent groups of markers that werecontiguous in the ancestor. False positives.

Fundamental question. Which ones and why?

A counting point of view [Bouvel, C., Mishna and Rossin, 2011]

Under the null hypothesis of random genomes, one expects to find only 2 commonintervals, both of size 2 (Poisson law, with mean 2) [Kaplanski, 1945; Xu et

al, 2008].

Question

Can some small non-ancestral common intervals (resulting from assemblymistakes or convergent evolution for example) create C1P obstructions?

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 24 / 39

Page 33: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

C1P obstructions

Combinatorics: Tucker patterns [Tucker 1972]

Five families of submatrices that characterize exactly the non-C1P binary matrices.

Figure 1: The five Tucker patterns

of M is an undirected graph GM = (V, E), whose vertices are pairs (ci, cj)(for i, j = 1, . . . , n, i != j). Two vertices (ci, cj) and (cj, ck) are adjacent, ifone of the following holds:

1. ci = ck.2. There exists a row rl in M such that Mli, Mlk = 1 but Mlj = 0.

We sometimes refer to these edges as type 1 or type 2, as appropriate. Theseedges represent incompatible pairs of orderings, i.e. each edge correspondsto two relative orderings of the columns that cannot appear simultaneouslyin a consecutive ones ordering of the matrix. McConnell noted that theincompatibility graph is bipartite if and only if the matrix is C1P. Thus oddcycles in the incompatibility graph certify that a matrix is not C1P.

The forcing graph FM = (V, E!) is an undirected graph whose vertex set

is same as that of GM and whose edge set is a set of all pairs ((ci, cj), (ck, cj))where ((ci, cj), (cj, ck)) is an edge of GM . It is not hard to see that theincompatibility graph and the forcing graph both have n(n" 1) vertices andare symmetric. As with the incompatibility graph, the forcing graph can beused to certify that a matrix is not C1P: a path in this graph from (ci, cj) to(ci, cj) for and i, j, represents a chain of implications (“forcings”) leading toa contradiction.

In fact, McConnell observed that these certificates are almost the same:such a path in the forcing graph can be transformed to a cycle in the incom-

3

Property. Each false positive row that creates a C1P-conflict belongs to suchconfiguration.

Algorithms [C., Stephen and Tamayo, 2011] .

An output-sensitive algorithm to enumerate all Tucker patterns.

Application: Spotting potential false-positives. Rows (or columns) thatbelong to many Tucker patterns are more likely to be false positives.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 25 / 39

Page 34: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

C1P obstructions

Combinatorics: Tucker patterns [Tucker 1972]

Five families of submatrices that characterize exactly the non-C1P binary matrices.

Figure 1: The five Tucker patterns

of M is an undirected graph GM = (V, E), whose vertices are pairs (ci, cj)(for i, j = 1, . . . , n, i != j). Two vertices (ci, cj) and (cj, ck) are adjacent, ifone of the following holds:

1. ci = ck.2. There exists a row rl in M such that Mli, Mlk = 1 but Mlj = 0.

We sometimes refer to these edges as type 1 or type 2, as appropriate. Theseedges represent incompatible pairs of orderings, i.e. each edge correspondsto two relative orderings of the columns that cannot appear simultaneouslyin a consecutive ones ordering of the matrix. McConnell noted that theincompatibility graph is bipartite if and only if the matrix is C1P. Thus oddcycles in the incompatibility graph certify that a matrix is not C1P.

The forcing graph FM = (V, E!) is an undirected graph whose vertex set

is same as that of GM and whose edge set is a set of all pairs ((ci, cj), (ck, cj))where ((ci, cj), (cj, ck)) is an edge of GM . It is not hard to see that theincompatibility graph and the forcing graph both have n(n" 1) vertices andare symmetric. As with the incompatibility graph, the forcing graph can beused to certify that a matrix is not C1P: a path in this graph from (ci, cj) to(ci, cj) for and i, j, represents a chain of implications (“forcings”) leading toa contradiction.

In fact, McConnell observed that these certificates are almost the same:such a path in the forcing graph can be transformed to a cycle in the incom-

3

Property. Each false positive row that creates a C1P-conflict belongs to suchconfiguration.

Algorithms [C., Stephen and Tamayo, 2011] .

An output-sensitive algorithm to enumerate all Tucker patterns.

Application: Spotting potential false-positives. Rows (or columns) thatbelong to many Tucker patterns are more likely to be false positives.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 25 / 39

Page 35: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

Illustration with the amniote ancestor

Summary

Large 1746× 1861 binary matrix. Almost C1P (10 rows discarded), due to thehigh similarity of chicken and zebrafinch genomes. Contains 1374 Tucker patterns.

Example of conflict

Bos tauruschr16. . . . -7 -6 -5 -4 -3 -2 -1 -8 . . .Canis familiarischr2. . . . -9 -8 -7 -6 chr5. . . . 1 2 3 4 5 . . .Equus caballuschr2. . . . -9 -8 -7 -6 -5 -4 -3 -2 -1 . . .Primates+Rodentschr1/4/5. 1 2 3 4 5 6 7 8 9 . . .

Monodelphis domesticachr4. . . . -9 x x -7 x -6 -5 2 -8 3 . . . 4Taeniopygia guttatachr21. 9 7 8 -3 -2 -1 -6 -5 -4Gallus galluschr21. -3 -2 -1 4 5 6 -8 -7 -9

The short common interval {2, 3, 8} is clearly a dubious ancestral character, within alarger well supported common interval {1, 2, 3, 4, 5, 6, 7, 8, 9} (22Mb of human genome).It is a row of the matrix that belongs to all the Tucker patterns that could contain it.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 26 / 39

Page 36: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

Conclusion

Summary

An alternative to combinatorial optimization to deal with false positive signal.

Relies on a ”deep” combinatorial understanding of non-C1P matrices.

Perspectives: characterizing false positives

Characterization of potential convergent evolution (probabilistic modelling,gap modelling?).

Characterization of potential assemblies mistakes (inversions/translocations).

Characterization of wrong orthologous markers.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 27 / 39

Page 37: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conflicts in non-C1P matrices

Conclusion

Summary

An alternative to combinatorial optimization to deal with false positive signal.

Relies on a ”deep” combinatorial understanding of non-C1P matrices.

Perspectives: characterizing false positives

Characterization of potential convergent evolution (probabilistic modelling,gap modelling?).

Characterization of potential assemblies mistakes (inversions/translocations).

Characterization of wrong orthologous markers.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 27 / 39

Page 38: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Handling missing markers: the sandwich-C1P

[Gavranovic, C., Salse and Tannier, Bioinformatics, 2011]

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 28 / 39

Page 39: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Missing (lost) markers: plants and vertebrates genomes

The impact of gene losses following a whole-genome duplication (plant genomes)

Ohnologous chromosomes: speciationRice/Sorghum posterior to a whole-genomeduplication (orthology Rice 1/Sorghum 3and Rice 5/Sorghum 9).

“Random” loss of many duplicated copiesfollowing the WGD, together with high geneorder shuffling.

The impact of an increasing number of genomes

15 genomes (12 mammals, 3 outgroups),markers (100kb) cover 24% of human genome.

Issues: Whole-genome alignment and one-to-oneorthology blocks inference[Pham and Pevzner,

2010].

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 29 / 39

Page 40: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Missing (lost) markers: plants and vertebrates genomes

The impact of gene losses following a whole-genome duplication (plant genomes)

Ohnologous chromosomes: speciationRice/Sorghum posterior to a whole-genomeduplication (orthology Rice 1/Sorghum 3and Rice 5/Sorghum 9).

“Random” loss of many duplicated copiesfollowing the WGD, together with high geneorder shuffling.

The impact of an increasing number of genomes

15 genomes (12 mammals, 3 outgroups),markers (100kb) cover 24% of human genome.

Issues: Whole-genome alignment and one-to-oneorthology blocks inference[Pham and Pevzner,

2010].

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 29 / 39

Page 41: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Missing (lost) markers: plants and vertebrates genomes

The impact of gene losses following a whole-genome duplication (plant genomes)

Ohnologous chromosomes: speciationRice/Sorghum posterior to a whole-genomeduplication (orthology Rice 1/Sorghum 3and Rice 5/Sorghum 9).

“Random” loss of many duplicated copiesfollowing the WGD, together with high geneorder shuffling.

The impact of an increasing number of genomes

15 genomes (12 mammals, 3 outgroups),markers (100kb) cover 24% of human genome.

Issues: Whole-genome alignment and one-to-oneorthology blocks inference[Pham and Pevzner,

2010].

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 29 / 39

Page 42: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Missing (lost) markers: plants and vertebrates genomes

The impact of gene losses following a whole-genome duplication (plant genomes)

Ohnologous chromosomes: speciationRice/Sorghum posterior to a whole-genomeduplication (orthology Rice 1/Sorghum 3and Rice 5/Sorghum 9).

“Random” loss of many duplicated copiesfollowing the WGD, together with high geneorder shuffling.

The impact of an increasing number of genomes

15 genomes (12 mammals, 3 outgroups),markers (100kb) cover 24% of human genome.

Issues: Whole-genome alignment and one-to-oneorthology blocks inference[Pham and Pevzner,

2010].

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 29 / 39

Page 43: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Modelling missing markers

Principle

Indicate missing markers by an X instead of a 0

A toy example

We see two chromosomes following a genome-duplication (Chr1A and Chr1B) andone non-duplicated chromosome from another genome (Chr2).Row 1: B, D and G absent from Chr1A, H from Chr2, E and C “consecutive” inboth chromosomes.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 30 / 39

Page 44: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

The sandwich C1P (SC1P)

Definition.

Given a ternary (0, 1,X ) matrix M, can we order its columns in such a way thatthere is no 0 with a 1 on its left and on its right (bad zero, bad row)?

Combinatorial objectives

1 Decision. Deciding if M satisfies the SC1P is NP-complete [Golumbic and

Wasserman, 1998; Golumbic, 1998]. No PQ-tree concept.

2 Optimization. Order the columns of M to minimize the number of bad rowsand/or bad zeros (and variants): already NP-hard for the C1P.

Contribution

1 A heuristic for the decision problem based on partition refinement.

2 A reduction to the TSP (and local search methods).

3 A lower bound on the number of bad rows.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 31 / 39

Page 45: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

The sandwich C1P (SC1P)

Definition.

Given a ternary (0, 1,X ) matrix M, can we order its columns in such a way thatthere is no 0 with a 1 on its left and on its right (bad zero, bad row)?

Combinatorial objectives

1 Decision. Deciding if M satisfies the SC1P is NP-complete [Golumbic and

Wasserman, 1998; Golumbic, 1998]. No PQ-tree concept.

2 Optimization. Order the columns of M to minimize the number of bad rowsand/or bad zeros (and variants): already NP-hard for the C1P.

Contribution

1 A heuristic for the decision problem based on partition refinement.

2 A reduction to the TSP (and local search methods).

3 A lower bound on the number of bad rows.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 31 / 39

Page 46: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

The sandwich C1P (SC1P)

Definition.

Given a ternary (0, 1,X ) matrix M, can we order its columns in such a way thatthere is no 0 with a 1 on its left and on its right (bad zero, bad row)?

Combinatorial objectives

1 Decision. Deciding if M satisfies the SC1P is NP-complete [Golumbic and

Wasserman, 1998; Golumbic, 1998]. No PQ-tree concept.

2 Optimization. Order the columns of M to minimize the number of bad rowsand/or bad zeros (and variants): already NP-hard for the C1P.

Contribution

1 A heuristic for the decision problem based on partition refinement.

2 A reduction to the TSP (and local search methods).

3 A lower bound on the number of bad rows.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 31 / 39

Page 47: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Results on the boreoeutherian ancestor

Data

Twelve ingroups (primates, rodents, ferungulate, carnivores) and threeoutgroups (opossum, birds).

1764 Markers: alignments from Ensembl that were present in at least 2species whose evolutionary path goes by the boreoeutherian ancestor.”No” pre-processing for constructing orthology blocks.

Results

Ternary matrix: 89023 rows (1s: 0.16%, X s: 12.62%).

316 bad rows (0.3%) and 2753 bad zeros.

26 CARs, almost all in agreement with the expected ancestor (12-22-10).

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 32 / 39

Page 48: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Results on the boreoeutherian ancestor

Data

Twelve ingroups (primates, rodents, ferungulate, carnivores) and threeoutgroups (opossum, birds).

1764 Markers: alignments from Ensembl that were present in at least 2species whose evolutionary path goes by the boreoeutherian ancestor.”No” pre-processing for constructing orthology blocks.

Results

Ternary matrix: 89023 rows (1s: 0.16%, X s: 12.62%).

316 bad rows (0.3%) and 2753 bad zeros.

26 CARs, almost all in agreement with the expected ancestor (12-22-10).

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 32 / 39

Page 49: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Conclusion

Summary

A generalized C1P model well adapted to handling unequal markers content.

A very limited knowledge of the corresponding combinatorial framework.

Combinatorial optimization algorithms (TSP).

Perspectives

Understanding obstructions to the sandwich-C1P (no Tucker patterns).

More generally: understanding the combinatorics of matrices that are almostsandwich-C1P.

Breaking the arbitrary distinction between orthology blocks (localrearrangements) and CARs (large rearrangements) and defining markers onlyfrom significant pairwise alignments.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 33 / 39

Page 50: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Handling missing markers: the sandwich-C1P

Conclusion

Summary

A generalized C1P model well adapted to handling unequal markers content.

A very limited knowledge of the corresponding combinatorial framework.

Combinatorial optimization algorithms (TSP).

Perspectives

Understanding obstructions to the sandwich-C1P (no Tucker patterns).

More generally: understanding the combinatorics of matrices that are almostsandwich-C1P.

Breaking the arbitrary distinction between orthology blocks (localrearrangements) and CARs (large rearrangements) and defining markers onlyfrom significant pairwise alignments.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 33 / 39

Page 51: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conclusion and perspectives

Conclusion and perspectives

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 34 / 39

Page 52: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conclusion and perspectives

Summary

Combining two fundamental problems

Inferring ancestral genomic characters (evolutionary genomics).

Linear ordering of genomic characters for unassembled genomes: (assembly,mapping, . . . ).

Contributions: more craftmanship than high throughput.

Clear methodology, with a well defined combinatorial framework for two ofthem: orthologous markers, CARs (C1P), synteny (DCS/CSAM).

Combines methodological principles from cytogenetics, physical mapping,genome rearrangements.

Efficient methods for reconstructing ancestral genomes architectures, appliedto mammalians, amniotes, yeasts, plants ancestors.

Importance of combinatorial understanding versus combinatorialoptimization: exploring the combinatorial and algorithmic properties of theC1P in relation to ancestral genome reconstrution.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 35 / 39

Page 53: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conclusion and perspectives

Summary

Combining two fundamental problems

Inferring ancestral genomic characters (evolutionary genomics).

Linear ordering of genomic characters for unassembled genomes: (assembly,mapping, . . . ).

Contributions: more craftmanship than high throughput.

Clear methodology, with a well defined combinatorial framework for two ofthem: orthologous markers, CARs (C1P), synteny (DCS/CSAM).

Combines methodological principles from cytogenetics, physical mapping,genome rearrangements.

Efficient methods for reconstructing ancestral genomes architectures, appliedto mammalians, amniotes, yeasts, plants ancestors.

Importance of combinatorial understanding versus combinatorialoptimization: exploring the combinatorial and algorithmic properties of theC1P in relation to ancestral genome reconstrution.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 35 / 39

Page 54: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conclusion and perspectives

Perspectives (1)

Integrating the history of other genomic markers: gene families, mammaliansegmental duplications, functional DNA.

DCS: a simple and elegant principle, a nightmare to implement.

!"#$%&'()*%#('"+,-*%(.*/'-0,-*),("#,-1*%*2!3*-,)#,(+*3*'-*-450*

+0%+6*

! Minimum number of genes having orthologs on A (resp. B): 7

! Maximum gap of genes having no ortholog on A or B: 1

! SA and S

B do overlap.

3+&'(),(+*.,+,5+'"(*"/*2!3*-')(%7

84#%(*-,)#,(+*93:

;,.%<%*-,)#,(+*9=:

;,.%<%*-,)#,(+*9>:

3>

3=

An almost completely open problem. Several vari-ants of interest (quadruple conserved syntenies).

Ancestral Syntenic Groups: no linear ordering constraints.

Inferring quantitative information such as ancestral dis-tances between markers (gapped-C1P, back to Morgan andSturtevant approaches?)

Probabilistic evolutionary models of qualitative (adjacencies) or quantitative(distances) genomic characters.

Comparison of different ancestral genomes proposals.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 36 / 39

Page 55: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conclusion and perspectives

Perspectives (2)

Variants of the C1P.

Telomeres and the C1P with multiplicity [C.,

Manuch, Patterson and Wittler, 2011].

Integrating gaps (for matrices that are almost C1P) [Murray Patterson

PhD, 2011].

Integrating genome rearrangement (global parsimony) methods.

!

" !

"

!"#

$

%

&

& %

#

$

%

$

"!

&

#

'()*+,-,.(/-01)2,3(4561,7(1-**8(-93251

!"#"$"%&'()*+"#","#-$(,.%#./0,(0*-0(

-+.(-12-)./0(3/(.40-/0(,5.)3.,(6.+.(

5+"7-7$&(-12-)./0(3/(0*.(-/).,0+-$(%./"#.8

%

$

&

#

!

"

&

%

#

$

9.-++-/%.#./0'()"#5:0.(-(5-+,3#"/3":,(

+.-++-/%.#./0(,)./-+3";(3/(-(%3<./(#"1.$(

"=(+.-++-/%.#./08

!

" !

"

!"#

$

%

&

& %

#

$

%

$

&

#

%

$

&

#

!

"

>*3)?./ !:#-/ @"% >*3)?./ !:#-/ @"%

Using characters that are present in all solutions tomedian problems [Zhao and Bourque, 2009] [C.

Gavranovic, Ouangraoua and Tannier, 2010;

Ouangraoua, Tannier and C., 2011].

Computing distances and rearrangement events between partially resolvedgenomes (PQ-trees [Jiang, C. and Zhu, 2010]).

Several concepts are likely to have application for other assembly/mappingproblems (tumor genomes, pangenomes, metagenomes, . . . ).

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 37 / 39

Page 56: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conclusion and perspectives

Acknowledgments

Collaborators

Eric Tannier, Aıda Ouangraoua, Haris Gavranovic, Tamon Stephen, MathildeBouvel, Dominique Rossin, Marni Mishna, Roland Wittler, Jano Manuch, BinhaiZhu.

Students

Andrew McPherson, Murray Patterson, Vivija You, Maria Tamayo, RosemaryMcCloskey, Brad Jones.

Funding

NSERC, SFU, CNRS, ANR, FFCR.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 38 / 39

Page 57: A la recherche des g enomes perdus - VanBUG · deux morceaux chacun dans le g nome du fourmilier collier, Tamandua tetradactyla , ce qui r v le des r arrangements ancestraux de translocations,

Conclusion and perspectives

References

A. Bergeron, M. Blanchette, A. Chateau and C. Chauve. Reconstructing Ancestral Gene Orders UsingConserved Intervals. Proceedings of WABI 2004.

C. Chauve and E. Tannier. A methodological framework for the reconstruction of contiguous regions ofancestral genomes and its application to mammalian genomes. PLoS Comptutational Biology, vol. 4,no. 11. 2008.

A. Ouangraoua, F. Boyer, A. McPherson, E. Tannier and C. Chauve. Prediction of Contiguous Regionsin the Amniote Ancestral Genome. Proceedings of ISBRA 2009.

C. Chauve, J. Manuch and M. Patterson. On the Gapped Consecutive-Ones Property. Electronic Notesin Discrete Mathematics, vol. 34. 2009.

C. Chauve, H. Gavranovic, A. Ouangraoua and E. Tannier. Yeast ancestral genome reconstructions: thepossibilities of computational methods II. Journal of Computational Biology, vol. 17, no. 9. 2010.

C. Chauve, U.-U. Haus, T. Stephen and V. P. You. Minimal Conflicting Sets for the Consecutive OnesProperty in Ancestral Genome Reconstruction. Journal of Computational Biology, vol. 17, no. 9. 2010.

H. Jiang, C. Chauve and B. Zhu. Breakpoint Distance and PQ-Trees. Proceedings of CPM 2010.

H. Gavranovic, C. Chauve, J. Salse and E. Tannier. Mapping ancestral genomes with massive gene loss:a matrix sandwich problem. Bioinformatics, vol. 27, no. 13. 2011.

C. Chauve, J. Manuch, M. Patterson and R. Wittler. Tractability results for the Consecutive-OnesProperty with multiplicity. Proceedings of CPM 2011.

A. Ouangraoua, E. Tannier and C. Chauve. Reconstructing the architecture of the ancestral amniotegenome. To appear in Bioinformatics. 2011.

M. Bouvel, C. Chauve, M. Mishna and D. Rossin. Average-Case Analysis of Perfect Sorting byReversals. To appear in Discrete Mathematics Algorithms and Applications. 2011.

C. Chauve (Math. SFU) Looking for lost genomes VanBUG 2011 39 / 39