Statistical methods for the classification of postural maintenance data

Université Paris Descartes
Laboratoire MAP5, UMR CNRS 8145

École doctorale Mathématiques Paris Centre

THESIS
submitted for the degree of

Docteur de l'Université Paris Descartes

Specialty: Mathematics

Presented by

Christophe DENIS

Statistical methods for the classification
of postural maintenance data

Defended on 29 November 2012 before a jury composed of:

Mr. Philippe BESSE (Examiner)
Mr. Gérard BIAU (Referee)
Mr. Antoine CHAMBAZ (Thesis advisor)
Ms. Adeline SAMSON (Thesis advisor)
Mr. Jérôme SARACCO (Examiner)
Mr. Nicolas VAYATIS (Referee)
Mr. Pierre-Paul VIDAL (Examiner)

Summary

The purpose of this thesis is to contribute to the elaboration of a notion of postural style through the development of procedures for classifying patients according to their postural maintenance. Postural maintenance disorders are liable to cause falls, which are one of the leading causes of mortality among the elderly. The longer-term objective is to develop protocols for identifying postural maintenance disorders and to adapt functional rehabilitation protocols on a case-by-case basis.

The work carried out in this thesis relies on a clinical study dedicated to the evaluation of postural maintenance. The cohort under study consists of 70 patients divided into three distinct groups: hemiplegic patients, vestibular patients and, finally, patients qualified as normal. Each patient followed up to four different experimental protocols evaluating the characteristics of postural maintenance by recording the points of maximal pressure exerted by each foot while the patient's balance is artificially perturbed.

In the first part of this thesis we propose and study two procedures for classifying patients according to their postural maintenance. We first consider the subproblem of classifying patients as hemiplegic or normal, then the even more ambitious problem of classifying patients as hemiplegic, vestibular or normal. To this end, we develop a range of techniques that involve a wide variety of statistical themes. Thus, we rank the experimental protocols according to the relevance of the information they provide, in an approach that belongs to the paradigm of semiparametric statistics. We exploit a stochastic process model and its inference to extract low-dimensional characteristic features from the postural maintenance trajectories, which are on the contrary high-dimensional. In addition, we enrich this parametric approach with a change point detection procedure, which turns out to improve it greatly. The principle of aggregating predictors by cross-validation is also widely used in order to obtain the classifiers with the best performance.

The second part of this thesis is devoted to related questions. We carry out the theoretical study of the top scoring pair classifier, which plays an important role in the first part. Originally designed by Geman et al. (2004) for classification based on genetic profiles, this classification procedure has not been studied theoretically, to the best of our knowledge. We take it out of its original framework of application and explore its asymptotic behavior. Furthermore, we propose a procedure for estimating the change points of Cox-Ingersoll-Ross processes observed at discrete times. We also introduce and study an extension of the aggregation principle called super-learning (van der Laan et al., 2007) to the framework of multiclass classification.

Keywords: postural maintenance; supervised classification; cross-validation; top scoring pair classifier; super-learning; change point detection.


Abstract

Our study contributes to the search for a notion of postural style, focusing on the issue of classifying subjects in terms of postural maintenance. A deficit in postural maintenance often results in falling, which is particularly hazardous in elderly people. The long-term goal is to develop protocols to identify such deficits and to adapt on a case-by-case basis the medical protocols for functional rehabilitation.

Our work relies on a clinical study dedicated to evaluating postural maintenance. Seventy subjects are enrolled in the cohort. Each subject belongs to one of three groups of hemiplegic, vestibular and normal patients. Every subject undergoes up to four experimental protocols that were designed to capture the characteristics of the subject's strategy to maintain posture. The data notably include the complex trajectories of the points where the maximal pressure is exerted by each foot as the subject's balance is artificially perturbed.

In the first part of the manuscript, we propose and study two classification procedures in terms of postural maintenance. We first address the problem of classifying hemiplegic versus normal subjects. Then we tackle the more ambitious problem of the classification of subjects as hemiplegic, vestibular or normal. To reach our goal, we develop a variety of statistical techniques relying on a wide spectrum of statistical principles. For instance, we formulate and solve the issue of ranking the experimental protocols by decreasing order of relevance relative to postural maintenance in the paradigm of semiparametric models. We exploit a stochastic process model and its inference to extract low-dimensional relevant features from the high-dimensional trajectories derived from the experimental protocols. On top of that, we add a layer of abrupt change detection to enhance the latter procedure of extraction of low-dimensional features. On several occasions we take advantage of the principle of aggregation of predictors to obtain better classifiers.

The second part of this manuscript is devoted to related topics. We study the asymptotic behavior of the top scoring pair classifier, which plays a major role in the first part. Originally designed by Geman et al. (2004) to classify diseases based on genetic profiles, this classification procedure had not been studied theoretically before, to the best of our knowledge. Furthermore, we propose an inference procedure for abrupt changes in Cox-Ingersoll-Ross processes observed at discrete times. We also introduce and study an extension of the aggregation principle called super-learning (van der Laan et al., 2007) to the framework of classification in more than two classes.

Keywords: postural maintenance; supervised classification; cross-validation; top scoring pair classifier; super-learning; change point estimation.


Acknowledgments

My first thanks go to Adeline and Antoine, my thesis advisors. I wish to thank you for your perfect supervision, which began with my Master 2 internship with Adeline. Thank you for guiding my first steps in the world of research and for introducing me to the fascinating world of statistics and its applications. Thank you for your unfailing support, which allowed me to move forward serenely during these three years of doctoral work. Thank you for your optimism, your energy and your rigor, which are models for me. Finally, I wish to thank you for introducing me to the fascinating problem of the study of postural maintenance.

I wish to thank Gérard Biau and Nicolas Vayatis for the interest they took in my work by agreeing to review my thesis. I also wish to thank Philippe Besse, Jérôme Saracco and Pierre-Paul Vidal for doing me the honor of sitting on my thesis jury.

I wish to thank Annie Raoult, director of MAP5, and Christine Graffigne, director of the U.F.R. Math-Info, for their welcome and their decisive help with my application for secondment.

I thank Fabienne for her friendliness, her kindness and her always valuable help with my administrative procedures.

I thank Valentine for her kindness and her availability in answering some of my questions.

I thank Avner for his friendliness, for our enriching discussions and for having integrated me into MAP5.

I thank Jean-Stéphane Dhersin for the trust he placed in me during my Master 2 internship.

My three years at MAP5 were rich and very pleasant. I warmly thank all the members of the laboratory for their good humor and their friendliness, and most particularly the doctoral students, post-doctoral fellows and A.T.E.R. I also wish to give a special mention to Marie-Hélène for always being available.

I also thank the teaching staff of the second-year mathematics bachelor's program of Université Pierre et Marie Curie, within which I carried out my teaching assistantship.

I also wish to thank the young statisticians met during these three years, or even before for some of them, with whom it was a real pleasure to share relaxing moments over a meal, a drink (or even several), a coffee... Many thanks to Gaëlle, Thierry, Olivier, Robin, Sébastien, Florian, Benjamin, Sarah, Aurélie, Maud, Sylvain, Robbi, Pierre, Imen, Djeneba, Adriana, Caroline, Solange, Cheikh, Laureen, Cyrielle and all the others!

I wish to address special thanks to Laure, a statistician friend from the very beginning. Thank you for all our enriching discussions (mathematical or not) and for those moments when we had a good laugh! I take this opportunity to thank our evening companions: Jérémy, Geoffrey, Matthias and Mélanie.


I also thank all my friends for the good times spent together.

I thank my whole family. Thank you to my parents Lilianne and Robert for their support during all these years; I owe you a great deal. Thank you to my brother Frédéric, an inveterate merry soul. Thank you to my grandparents Renée, Jean and Suzanne for their great wisdom. Thank you to Philippe and François, my cousins, for our frenzied card games. Thank you to my uncle Jean-Claude and my aunt Andrée for all the good meals shared together. I thank Annie and Joël for their welcome and their kindness, and Étienne for our incongruous bike rides!

Finally, thank you, Anne-Claire, for, among a billion other things, your humor, your joie de vivre and all the beautiful moments spent together!

To Pépé


Table of contents

1 General introduction
  1.1 Context
  1.2 Description of the data
  1.3 Statistical formulation of the problem
  1.4 Classification procedure for postural maintenance data
  1.5 Summary of the results of the thesis

2 Classification according to postural style
  2.1 Introduction
  2.2 Data description
  2.3 Classification procedure
  2.4 Simulation study
  2.5 Application to the real dataset
  2.6 Supplement to "Classification in postural style"

3 Diffusion processes for classification in postural maintenance
  3.1 Introduction
  3.2 Data and modeling
  3.3 Inference
  3.4 Classification
  3.5 Application

4 Study of the top scoring pair classifier
  4.1 Introduction
  4.2 General framework
  4.3 Empirical TSP classifiers
  4.4 Cross-validated TSP classifiers
  4.5 Proof
  4.6 Appendix

5 Complements
  5.1 Estimation of change points
  5.2 Super-learning for multiclass classification

6 Conclusion
  6.1 Assessment and perspectives for classification in postural maintenance
  6.2 Assessment and perspectives for the study of the TSP classifier

Scientific output related to the thesis

Accepted article

[1] A. Chambaz and C. Denis (2012), Classification in postural style, Annals of Applied Statistics, 6(3), 977-993

Submitted article

[2] C. Denis, Classification in postural style based on stochastic process modeling

Article in preparation

[3] C. Denis, Top scoring pair classifier: asymptotics and applications

International conferences

• Dynstoch, Classification in postural maintenance based on stochastic process modeling, poster, Paris, June 2012
• Séminaire StatMathAppli, Classification in postural style, Fréjus, September 2011

National conferences

• Journées de la SFB, Classification en maintien postural, Paris, November 2012
• Journées MAS, Classification in postural style, Clermont-Ferrand, August 2012
• Journées de la SFDS, Processus stochastiques pour la classification selon le style postural, Bruxelles, May 2012
• Colloque des Jeunes probabilistes et statisticiens, Processus stochastiques pour la classification selon le style postural, Marseille, April 2012
• Journées de la SFDS, Classification en maintien postural, Tunis, May 2011

Working groups

• Séminaire des doctorants du LSTA, Processus stochastiques pour la classification selon le style postural, Paris, April 2012
• Séminaire TEST, Processus stochastiques pour la classification selon le style postural, Paris, November 2011
• Groupe de travail de Statistiques du MAP5, Classification en maintien postural, Paris, April 2011
• Séminaire des doctorants du MAP5, Estimation de paramètres dans un modèle de transmission du virus de l'hépatite C, Paris, September 2009


Chapter 1

General introduction


This general introduction consists of five parts. We introduce the biomedical context in Section 1.1. The data set on which this work relies is described in Section 1.2. We address the statistical formulation of the objectives of the thesis, and the statistical issues they raise, in Section 1.3. The main lines of the construction of a classification procedure for postural maintenance data are laid out in Section 1.4. Finally, we describe the organization of the thesis in Section 1.5.

1.1 Context

Posture can be defined as "the maintenance of the body or of one of its segments in a given position" (Bouisset and Maton, 1995). It is necessary for any physical activity. Posture can be seen as a response to two constraints: keeping one's balance and positioning one's body according to the action to be performed. Although the postures adopted vary greatly, it is nevertheless accepted that, in humans, the reference posture is the standing position (Paillard, 1971).

Postural maintenance (or postural control) can be defined as the ability to interact with and adapt to one's environment, in a given situation, in order to keep a stable posture (Massion, 1994). Postural maintenance is based on the processing by the brain of sensory information encoded by the visual, proprioceptive and vestibular systems. The processing of this information varies across individuals and depends on their sensorimotor experience as well as on various factors such as age or physical condition. An optimal strategy of postural maintenance can be defined as the ability to process the most relevant sensory information at a given time in a given situation.

We describe in Section 1.1.1 the contribution of each sensory system to postural maintenance. The stakes of the study of postural maintenance are presented in Section 1.1.2. Finally, in Section 1.1.3, we describe different procedures for evaluating postural maintenance.

1.1.1 Role of the sensory systems in postural maintenance

Information from the visual system contributes to the definition of position and movement. It makes it possible to define the orientation of the gaze relative to the body and the orientation of the individual relative to the environment (Genthon, 2006). To study the influence of the visual system on postural maintenance, various means have been proposed, such as limiting, perturbing or reinforcing the visual input (Asai et al., 1990; Mergner et al., 2005).

The proprioceptive system consists of the sensory receptors located in the vicinity of the muscles, bones, joints and skin. Proprioceptive information contributes to the definition of the relative position of the body segments and of the body in the environment. The contribution of proprioceptive information to postural maintenance has been studied by means of perturbing or stimulating protocols (Ribot et al., 1989; Fransson et al., 2003).

The vestibular system is composed of sensory receptors located in the inner ear and of the vestibulocochlear nerve. It underlies the sense of balance. Vestibular information contributes to the perception of movement and of the orientation of the body relative to the vertical. The sensory receptors of the vestibular system are specialized in measuring the accelerations of the head (Genthon, 2006). The involvement of the vestibular system in postural control has been studied by means of galvanic stimulations (electrical stimulation of the vestibular neurons), caloric stimulations (irrigation of the ear canals with cold water) or optokinetic stimulations (Inglis et al., 1999; Fransson et al., 2003).

1.1.2 Stakes of the study of postural maintenance

Postural disorders are of diverse kinds. They can be caused by a deficiency of one of the sensory systems contributing to postural maintenance. They can also result from a poor use of the information involved in postural control, for instance the systematic use of a single type of information (typically visual) to the detriment of the others. This may then lead to a functional regression of the sensory systems that are little used. Note that this situation can occur even in high-level athletes (Astrua et al., 1997). Without an appropriate stimulation of balance, all the systems involved in postural maintenance can undergo functional regression. One of the main consequences of a postural disorder is falling, which can lead to a loss of autonomy and, in the worst case, to the death of the individual (falls and their after-effects cause the death of more than 10,000 people per year). The elderly are the most affected by this kind of problem. One may also note that the degradation of postural stability is responsible for nearly 20% of falls among the elderly (Jacquot et al., 1999).

Rehabilitation, which aims to increase, improve or facilitate the integration of sensory factors, is one of the main means of managing postural disorders. However, this management relies on the identification of the affected systems, so that the therapist can direct patients suffering from postural disorders toward suitable physiotherapy programs. Thus, detecting postural disorders and their nature is a crucial public health issue.

1.1.3 Evaluation of postural maintenance

Identifying postural disorders requires being able to evaluate postural maintenance. There are many ways of assessing postural control, all of which have advantages and drawbacks. Two types of evaluation can be distinguished: clinical evaluations and instrumental evaluations.

Clinical evaluations consist of balance assessment scales and clinical tests. Assessment scales consist of a series of tasks of increasing difficulty to be performed by the patient (maintaining or changing posture). The benefit of this type of evaluation is that it requires no equipment and reflects the daily activities of the patient (Tinetti, 1986). Clinical tests consist in asking the patient to maintain a posture or to seek the limit of stability (Pinsault, 2009). Although these clinical tests have the advantage of being inexpensive and easy to implement, they provide only limited information on the potential postural disorders from which the patient suffers.

Instrumental evaluations of postural maintenance consist in recording the displacements of the center of gravity and/or of the plantar centers of pressure of the patient in conditions where posture may be perturbed. The displacements of the center of gravity are recorded by means of videographic and accelerometric systems. The analysis of the displacements of the center of gravity provides information on the nature of postural oscillations. However, the reconstruction of the positions of the center of gravity is difficult to carry out. The displacement of the plantar centers of pressure is recorded by means of a force platform (comparable to a bathroom scale). The plantar center of pressure represents the point of application of the resultant of the reaction forces exerted by the subject to maintain balance (Bouisset and Maton, 1995). Consequently, the study of the displacements of the plantar center of pressure provides information on the subject's control of balance. Factors such as the instructions given to the patients, the duration of the recordings and the number of trials performed can influence the degree of reliability of the postural parameters collected (Pinsault, 2009).

1.1.4 Objectives of the thesis

The purpose of this thesis is to contribute to the elaboration of a notion of postural style. The postural style of an individual can be seen as the type of sensory afferents favored by that individual. We propose in this thesis to develop methods allowing the identification and characterization of postural disorders and of their nature. In this work, the identification and characterization of postural disorders rely on the data set that we describe in Section 1.2. In this framework, the problem can be formulated as follows:

- given a patient suffering from potential postural disorders;

- given the trajectories associated with the different protocols as well as, possibly, a set of covariates (provided that they are available);

- determine the postural disorder(s) affecting the patient.


1.2 Description of the data

The work carried out in this thesis relies on a study conducted at CESEM (Centre d'étude de la sensorimotricité, UMR CNRS 8194, Université Paris Descartes). This study was conducted on a cohort of 70 patients. For this set of patients, postural control was evaluated by means of recordings of the plantar centers of pressure. We first describe in Section 1.2.1 the set of patients who took part in the study. The protocols for evaluating postural maintenance considered in this work are presented in Section 1.2.2, while the trajectories resulting from these protocols are described in Section 1.2.3.

1.2.1 Description of the set of patients

The set of patients is subdivided into three groups. The first group consists of 22 patients who are hemiplegic following a stroke. The second group consists of 16 vestibular patients (suffering from a lesion of the vestibular system). Finally, the third group consists of 32 patients who, after clinical examination, were declared free of any postural disorder. In the sequel, these patients are referred to as normal. Among the disorders associated with hemiplegia, postural disorders are the first factor limiting the performance of daily tasks. Hemiplegic patients suffer from proprioceptive disorders caused by the partial paralysis of one half of the body. Bonan et al. (1996) also show that hemiplegic patients develop a very pronounced preference for visual afferents. For the groups of hemiplegic and normal patients, a set of covariates was collected: age, weight, gender, height and laterality.

1.2.2 Description of the protocols for evaluating postural maintenance

Each patient underwent different experimental protocols evaluating postural control by recording the plantar centers of pressure. The patient stands on a force platform. The device records at regular intervals the successive positions of the points where the left foot and the right foot of the patient separately exert maximal pressure. The displacements are recorded every δ = 25 milliseconds. Each protocol lasts 70 seconds and is divided into three phases:

- a first phase of 15 seconds during which postural maintenance is not perturbed;

- a second phase of 35 seconds during which postural maintenance is perturbed by stimulating one or several sensory systems;

- a third phase of 20 seconds during which postural maintenance is no longer perturbed.
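As a rough illustration of the recording scheme just described, the following Python sketch converts the three phase durations into ranges of sample indices. Only the sampling step δ = 25 ms and the durations of 15, 35 and 20 seconds come from the text; the function and constant names are hypothetical, and the exact position of the phase boundaries within the recorded samples is an assumption.

```python
# Hypothetical names; only the numerical values come from the text.
DELTA_MS = 25             # sampling step delta, in milliseconds
PHASES_S = (15, 35, 20)   # durations of the three phases, in seconds


def phase_slices(delta_ms=DELTA_MS, phases_s=PHASES_S):
    """Return one slice of 0-based sample indices per phase.

    The 0-based index k corresponds to the observation time (k + 1) * delta.
    Assumption: each phase boundary falls exactly on a sampling time.
    """
    slices, start = [], 0
    for duration in phases_s:
        n_samples = duration * 1000 // delta_ms
        slices.append(slice(start, start + n_samples))
        start += n_samples
    return slices


# Total number of observations in one protocol.
N = sum(PHASES_S) * 1000 // DELTA_MS
```

Under these assumptions, a 70-second protocol yields 2800 observations, split into 600, 1400 and 800 observations for the three phases.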


The stimulations considered for perturbing posture are diverse and aim at exploring one or several of the three sensory afferents involved in postural maintenance. The stimulations can be visual (closing the eyes), vestibular (galvanic or optokinetic stimulations), proprioceptive (vibratory stimulations), or both proprioceptive and visual (closing the eyes together with vibratory stimulations). Each patient underwent several protocols (about ten for some of them). However, only a small number of protocols were followed by all the participants in the study. Four protocols were followed by the groups of hemiplegic and normal subjects. Among these four protocols, two were followed by the vestibular patients. Table 1.1 describes the four protocols.

protocol   0→15 s            15→50 s                              50→70 s
1          no perturbation   eyes closed                          no perturbation
2          no perturbation   muscular stimulation                 no perturbation
3          no perturbation   eyes closed, muscular stimulation    no perturbation
4          no perturbation   optokinetic stimulation              no perturbation

Table 1.1 – Description of the four protocols used for the study of postural maintenance. In the original table, boldface marks the stimulations of the two protocols undergone by all the patients. Each protocol is divided into three phases: a first phase without perturbation of postural maintenance is followed by a second phase with perturbation, then by a third phase without perturbation. The protocols are of different kinds: visual (closing the eyes), proprioceptive (muscular stimulation) or vestibular (optokinetic stimulation).

Finally, from the data initially produced, we derive two data sets:

- a first data set consists of the covariates and trajectory data of the 54 hemiplegic and normal patients;

- a second data set consists of the trajectory data of the 70 patients associated with the two protocols followed by all of them.

1.2.3 Trajectories of interest derived from the protocols

For each patient, the data from a protocol yield a trajectory observed in discrete time, denoted $(L_i, R_i)_{i \le N}$, where $(L_i, R_i)$ is the observation at time $t_i = i\delta$, $i = 1, \dots, N = 2800$. For every $i = 1, \dots, N$, $L_i = (L_i^1, L_i^2) \in \mathbb{R}^2$ ($R_i = (R_i^1, R_i^2) \in \mathbb{R}^2$, respectively) is the position of the center of maximal pressure exerted by the left (right, respectively) foot at time $i\delta$. Figure 1.1 shows two examples of trajectories $(L_i, R_i)_{i \le N}$.

Figure 1.1 – Trajectories $i \mapsto L_i$ (left) and $i \mapsto R_i$ (right) of the positions of the center of maximal pressure of each foot, derived from protocol 3. Top: a patient qualified as normal; bottom: a hemiplegic subject.

The trajectories $(L_i, R_i)_{i \le N}$ resulting from the different protocols considered are complex and seem difficult to analyze as such. Following the study by Chambaz et al. (2009), we propose a first summary of the raw trajectories by considering two trajectories built from $(L_i, R_i)_{i \le N}$. We first consider the trajectory $(B_i)_{i \le N}$, where $B_i = \frac{1}{2}(L_i + R_i)$. Heuristically, $(B_i)_{i \le N}$ represents the trajectory of the projection of the patient's center of gravity onto the force platform. This trajectory provides relevant information on the orientation of the patient on the force platform during the protocol considered. We also consider the one-dimensional trajectory $(C_i)_{i \le N}$ defined for every $i = 1, \dots, N$ by

$$C_i = \|B_i - b\|_2,$$

where $b$ is defined as the median value of $(B_i)_{i \le 400}$. This trajectory provides relevant information on the evolution of the distance from the patient to a reference point. Figure 1.2 displays the trajectories $(B_i)_{i \le N}$ and $(C_i)_{i \le N}$.
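The two summary trajectories can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `summary_trajectories` and its argument `n_ref` are hypothetical, and the median of the two-dimensional points $(B_i)_{i \le 400}$ is taken coordinate by coordinate, which is one possible reading of "median value" that the text does not spell out.

```python
from math import hypot
from statistics import median


def summary_trajectories(L, R, n_ref=400):
    """Compute the trajectories (B_i) and (C_i) from (L_i) and (R_i).

    L, R: sequences of (x, y) pairs of equal length N.
    n_ref: number of initial observations used for the reference point b.
    """
    # B_i = (L_i + R_i) / 2, midpoint of the two centers of pressure.
    B = [((l1 + r1) / 2, (l2 + r2) / 2) for (l1, l2), (r1, r2) in zip(L, R)]
    # Assumption: b is the coordinate-wise median of B_1, ..., B_{n_ref}.
    ref = B[:n_ref]
    b = (median(p[0] for p in ref), median(p[1] for p in ref))
    # C_i = Euclidean distance between B_i and the reference point b.
    C = [hypot(x - b[0], y - b[1]) for x, y in B]
    return B, C
```

With δ = 25 ms, the first 400 observations cover the first 10 seconds of the protocol, that is, a portion of the unperturbed first phase.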


Figure 1.2 – Trajectories $i \mapsto B_i$ (top) and $i \mapsto C_i$ (bottom) corresponding to two protocols followed by the same hemiplegic patient (protocol 1 on the left, protocol 3 on the right).

1.3 Statistical formulation of the problem

The problem stated in Section 1.1.4 can naturally be addressed by the statistician from the viewpoint of supervised classification. We present in Section 1.3.1 the statistical framework and the objectives of supervised classification. In Section 1.3.2, we describe the principle of supervised classification in the context of the data set described in Section 1.2. In Section 1.3.3, we address a question underlying the objective of classifying patients: the ranking of the protocols. Finally, we discuss the approach adopted in this thesis in Section 1.3.4.

1.3.1 Supervised classification

Supervised classification belongs to the general framework of statistical learning. We refer to Hastie et al. (2009) for an introduction to statistical learning theory. For an integer $n \ge 1$, we consider $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, a sequence of independent and identically distributed random vectors with the same distribution $P_0$, assumed unknown, as a random vector $(X, Y)$. We assume that the variable $X$, called the input variable, takes its values in a measurable space $\mathcal{X}$. We also assume that the variable $Y$, called the response variable, takes its values in a measurable space $\mathcal{Y}$. In supervised classification, the variable $Y$ is discrete and indicates the class to which $X$ belongs. Here, $\mathcal{Y} = \{1, \dots, K\}$, where $K$ is the number of classes. Note that when $K = 2$ (binary classification), one often takes $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, 1\}$.

For a new pair $(X_0, Y_0)$ with distribution $P_0$, independent of $D_n$ and whose response $Y_0$ is not observed, the objective of supervised classification is to predict $Y_0$. Based on the sample $D_n$, one builds an empirical classification rule (or classifier) $g(D_n)$ such that $g(D_n) : \mathcal{X} \to \mathcal{Y}$ is measurable. The prediction associated with the input $X_0$ is then given by $g(D_n)(X_0)$.

In the remainder of this introduction, we write $g_n \equiv g(D_n)$. We denote the set of classification rules by $\mathcal{G} = \{g : \mathcal{X} \to \mathcal{Y},\ g \text{ measurable}\}$.

1.3.2 Supervised classification for postural control

In this thesis, we address the following supervised classification problem:
- given the trajectories associated with the protocols, together with possible covariates;
- determine whether the patient is hemiplegic, vestibular or healthy.

Following the framework of Section 1.3.1, this problem can be restated as follows:
- given a sample $D_n = \{(X_i, Y_i),\ i = 1, \ldots, n\}$ such that, for $i = 1, \ldots, n$, the input variable $X_i$ consists of the trajectories associated with the different protocols together with possible covariates, while the response variable $Y_i$ gives the class of the patient (hemiplegic, vestibular or healthy);
- build a classification rule $g_n$ such that, for a new patient with observation $X_0$, $g_n(X_0)$ indicates whether the patient is hemiplegic, vestibular or healthy.

Building a procedure that classifies patients according to their postural control on the sole basis of the trajectories and possible covariates proves to be a very difficult task! Indeed, the covariates carry little information in this respect. Moreover, the trajectory data are a potentially rich source of information, but one that is very complex to exploit, because they are both very high-dimensional and endowed with a strong and intricate dependence structure. Thus, the direct application of classical methods that were not designed to classify on the basis of trajectory data cannot seriously be considered, because of the “curse of dimensionality” (Donoho, 2000). Concretely, the theoretical results describing their performance, notably in terms of rates of convergence, are not guaranteed. And should one nonetheless venture down that path, the computing times of the classification algorithms would be prohibitively long.

Consequently, building robust classification procedures for postural control data requires a preliminary study of the trajectories produced by the different protocols.

1.3.3 Ranking the protocols

In Chapter 2, we address the question of reducing the number of protocols a patient has to undergo for the detection of postural disorders. This question, which to our knowledge is not treated in the literature, echoes the fact that each patient individually underwent about ten protocols, an experience that proved trying for some of them. To this end, we use the first dataset, corresponding to the 54 patients for whom we have covariates and the trajectory data associated with the four protocols. We first propose to rank the protocols in terms of the information they provide on postural control. We then build a procedure for classifying into the two classes (hemiplegic and healthy) that uses only the most informative protocol, then the two and the three most informative ones, and finally all the protocols.

More precisely, let $X = (X_1, X_2, X_3, X_4)$ be the vector made up of the trajectories associated with the four protocols where, for $j = 1, 2, 3, 4$, $X_j$ is the vector formed by the trajectory associated with protocol $j$. The protocol-ranking procedure proposed in Chapter 2 induces a vector $(X_{(1)}, X_{(2)}, X_{(3)}, X_{(4)})$ where, for $j = 1, 2, 3, 4$, $X_{(j)}$ is the vector formed by the trajectory associated with the protocol ranked in $j$-th position according to the information it provides on postural control. Finally, we propose to build four classifiers $g_n^1, g_n^2, g_n^3, g_n^4$ such that, for $j = 1, 2, 3, 4$, the classifier $g_n^j$ is built on the basis of the covariates and of the trajectory data $X_{(1)}, \ldots, X_{(j)}$. We then compare the performance of the classifiers $g_n^1, g_n^2, g_n^3, g_n^4$ in order to assess the relevance of reducing the number of protocols for classification.

1.3.4 Discussion

At first sight, classifying hemiplegic, vestibular and healthy patients in a supervised approach may seem easy, or even pointless. Yet in this context the exercise turns out to be both very delicate and highly instructive. It is by solving it that we were able to grasp the specificity of postural control trajectory data and learned how to extract from them information on the nature of postural control. This framework also makes it possible to validate, theoretically and numerically, according to widely accepted criteria, the performance of the methods we develop. In short, addressing the classification question in a supervised approach allows us to lay the first milestones in the elaboration of the notion of postural style.

The statistical formulation of the problem could also be considered from the standpoint of unsupervised classification. This alternative formulation would make it possible, for example, to take into account the notion of sensory preference; indeed, patients from the same class may well not share a single postural style. In this other framework, briefly addressed in Chambaz et al. (2009), the sample on which the prediction for a new observation is based consists of input variables only. Moreover, the number of classes, if it is well defined at all, is not known a priori, and estimating it is very delicate.

In the rest of this dissertation, we write “classification” for supervised classification.

1.4 Classification procedure for postural control data

Building a classification rule in our study of postural control must rely on a preliminary study of the trajectories described in Section 1.2.3. The relevant statistical framework is that of trajectory classification. Section 1.4.1 describes the main principles of trajectory classification. Section 1.4.2 presents the statistical processing of the trajectories in the context of postural control data. Section 1.4.3 outlines the construction of a classification procedure for postural control data as we envisage it in this thesis. Sections 1.4.4 and 1.4.5 describe two classification procedures: random forests and the top scoring pair (TSP) classifier. Finally, Section 1.4.6 describes a principle for aggregating classification procedures: super-learning.

1.4.1 Classification of trajectories

In many applications, the collected data are regarded as discretized observations of random functions. These observations may represent the evolution over time of the temperature at some point of the globe, the evolution of a stock price or, as in our study, the evolution over time of the displacement of a patient on a force-platform. The very high dimension of the observations, together with their strong correlation, rules out the use of the usual classification algorithms.

In this setting, it is therefore relevant to regard the observations as functions rather than as high-dimensional vectors. Here again, classical methods cannot be applied directly, since the observations then live in an infinite-dimensional space. Abraham et al. (2006) show, for example, that the kernel classification rule, which is universally consistent in finite dimension, is no longer so without restriction in infinite dimension. To overcome these difficulties, many functional data analysis (FDA) methods have been developed in recent years. A key reference in this field is the book by Ramsay and Silverman (2005).

The aim of FDA is to exploit the functional nature of the data. Much work on FDA has focused on linear methods such as principal component analysis (Cardot, 2000) or linear models (Cardot, 1999). Nonlinear methods have also been studied. Hall et al. (2001) propose, for example, to combine principal component analysis with kernel estimators. Ferraty and Vieu (2003) propose a nonparametric method based on kernel estimators of the conditional density.

To cope with the infinite dimension, most FDA methods are built around two principles: filtering and regularization. The filtering approach first carries out a preliminary dimension-reduction step so as to work in finite dimension. One example of a filtering method is the projection method, which reduces the dimension of the observations by retaining only a finite number of coefficients of the data expanded in a suitable basis (trigonometric, spline or wavelet basis). This type of method is studied by Biau et al. (2005). Abraham et al. (2003) propose an application of it to clustering. The regularization approach restricts the complexity of the data by imposing regularization constraints (Frank and Friedman, 1993).

Methods of this type can indeed be considered for processing the trajectories described in Section 1.2.3. Using the first part of the dataset, made up of the 54 patients (hemiplegic and healthy), we evaluated the performance of a classification procedure based on a B-spline expansion of the trajectories $(C_i)_{i \leq N}$ from protocol 3. Evaluated by the leave-one-out rule, this procedure correctly classifies 72% of the patients. We also evaluated a classification procedure based on the method of Ferraty and Vieu (2003), which correctly classifies 59% of the patients. These results are not satisfactory and suggest that off-the-shelf trajectory classification methods are not appropriate in our study. This is not surprising given that many specific features of the trajectories produced by the protocols, such as the perturbation phase of postural control inherent in the way they are obtained, are not taken into account by the usual trajectory classification methods. The main objective of this thesis is to propose a classification procedure in terms of postural control that performs well by taking the specific features of the trajectory data into account.
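The projection (filtering) step described above can be illustrated by a minimal sketch. It assumes a trajectory sampled on a regular grid and uses a trigonometric basis, not the B-spline basis used in the experiment reported above; the function name `trig_coefficients` and the parameter `n_harmonics` are hypothetical and chosen for illustration only.

```python
import math

def trig_coefficients(trajectory, n_harmonics):
    """Project a regularly sampled trajectory onto the first trigonometric
    basis functions and return the finite vector of coefficients
    [a_0, a_1, b_1, ..., a_K, b_K] with K = n_harmonics (K < len/2 assumed)."""
    n = len(trajectory)
    coeffs = [sum(trajectory) / n]  # a_0: mean level of the trajectory
    for k in range(1, n_harmonics + 1):
        a_k = 2.0 / n * sum(x * math.cos(2 * math.pi * k * i / n)
                            for i, x in enumerate(trajectory))
        b_k = 2.0 / n * sum(x * math.sin(2 * math.pi * k * i / n)
                            for i, x in enumerate(trajectory))
        coeffs += [a_k, b_k]
    return coeffs
```

A trajectory of several thousand points is thus replaced by a vector of `1 + 2 * n_harmonics` coefficients, which can then be fed to any finite-dimensional classifier.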


1.4.2 Processing the trajectories in the context of postural control data

We rely on the filtering principle to process the trajectories. This step, which proves crucial, consists in extracting from the trajectories measures that summarize them. Our choice of summary measures results from careful prior reflection on the nature of the physical experiment that produced the trajectories.

A large body of work is devoted to the analysis and modeling of trajectories of the same nature as those described in Section 1.2.3. Publications on postural control fall into two categories: on the one hand, work relying on static quantities such as the mean displacement or the mean velocity; on the other hand, work relying instead on the temporal dynamics of the trajectories, through approaches such as time series models (Sabatini, 2000), hidden Markov models (Chambaz et al., 2009), or a modeling of the trajectories by Brownian motion (Collins and De Luca, 1995; Chiari et al., 2000; Bardet and Bertrand, 2007) or by diffusion processes (Newell et al., 1997; Frank et al., 2000).

However, to our knowledge, none of these models has ever been used to classify subjects in terms of postural control. Here we take advantage of both types of information extraction by building summary measures that are partly static and partly based on the temporal dynamics of the trajectories.

1.4.3 Construction of a classification procedure for postural control data

In this section, we present the principle underlying the construction of the classification procedures developed in this thesis. As seen in Section 1.4.2, building a classification procedure for the data described in Section 1.2 relies on a study of the trajectories, from which summary measures are derived and then used to set up a classification rule.

In supervised classification, the quality of a rule is measured by its generalization error, defined by

$$R(g) = P_0(g(X) \neq Y).$$

Note that the generalization error of a classifier $g_n$ is a random variable, as a function of the learning sample $D_n$. We define the Bayes classification rule by

$$g^*(x) = \arg\max_{y \in \mathcal{Y}} P_0(Y = y \mid X = x), \quad \forall x \in \mathcal{X}.$$

The Bayes classification rule is characterized by

$$R(g^*) = \min_{g \in \mathcal{G}} R(g).$$

This classification rule is therefore optimal with respect to the generalization error $R$. Of course, the Bayes rule cannot be computed in practice, since it depends on the unknown distribution $P_0$.

Different strategies can be considered to build classifiers. A classifier can be built by estimating the conditional probabilities of class membership: for all $y \in \mathcal{Y}$, one builds from the learning sample $D_n$ an estimator $\hat{P}(Y = y \mid X = \cdot)$ of $P_0(Y = y \mid X = \cdot)$, then considers the classifier characterized by

$$g_n(x) = \arg\max_{y \in \mathcal{Y}} \hat{P}(Y = y \mid X = x), \quad \forall x \in \mathcal{X}.$$

A classifier can also be built by minimizing the empirical risk over a subset $G \subset \mathcal{G}$. The classifier $g_n$ thus associated with $G$ is characterized by

$$g_n = \arg\min_{g \in G} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{g(X_i) \neq Y_i\}.$$
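The first strategy (plug-in of estimated conditional probabilities) can be sketched as follows. This is an illustration only: the conditional probabilities are estimated by class frequencies among the $k$ nearest neighbours of a real-valued input, which is an assumption made for this sketch and not a method used in the thesis; all function names are hypothetical. The helper `empirical_risk` computes the quantity minimized in the second strategy.

```python
from collections import Counter

def knn_posterior(sample, x, k):
    """Estimate P(Y = y | X = x) by the class frequencies among the k points
    of the learning sample closest to x (real-valued inputs assumed)."""
    neighbours = sorted(sample, key=lambda xy: abs(xy[0] - x))[:k]
    counts = Counter(y for _, y in neighbours)
    return {y: c / k for y, c in counts.items()}

def plug_in_classifier(sample, k):
    """Return g_n: x -> argmax_y of the estimated conditional probabilities."""
    def g_n(x):
        posterior = knn_posterior(sample, x, k)
        # ties are broken in favour of the smallest label
        return max(sorted(posterior), key=lambda y: posterior[y])
    return g_n

def empirical_risk(g, sample):
    """Proportion of observations (X_i, Y_i) of the sample misclassified by g."""
    return sum(g(x) != y for x, y in sample) / len(sample)
```

Note that the empirical risk computed on the learning sample itself is an optimistic estimate of the generalization error $R(g_n)$, which is why evaluation rules such as leave-one-out are used elsewhere in the text.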

The construction of classification rules is the subject of a vast literature. In this thesis, we use several classification algorithms, such as random forests (Breiman, 2001), whose principle is described in Section 1.4.4, and the top scoring pair classifier (Geman et al., 2004), whose principle is described in Section 1.4.5 and for which we propose a theoretical study in Chapter 4. In addition, we rely on super-learning, a statistical learning method based on a principle of convex aggregation by V-fold cross-validation (van der Laan et al., 2007). This method aggregates different classification procedures into a single one. We describe super-learning in Section 1.4.6.

1.4.4 Random forests

In this section, we present the principle of random forests introduced by Breiman (2001). For a detailed introduction in French, we refer to the PhD thesis of Genuer (2010). Random forests are a statistical learning method that aggregates tree-based predictors. They apply to regression as well as to classification; here we present the principle in the classification setting. Different versions of random forests have been studied (Genuer, 2010). In this section, we present the version most commonly used in practice: Random Forests-RI (Random Forest with Random Inputs), implemented by Leo Breiman and Adele Cutler. This is the version implemented in the R package randomForest. We first present the principle of CART (Classification And Regression Trees), then that of Random Forests-RI.

Principle of classification trees

Introduced by Breiman et al. (1984), CART is a statistical method that builds predictors described by trees. We present the principle in the classification setting and refer to the PhD thesis of Gey (2002) for a detailed description of the method in regression.

Classification trees are binary and correspond to a dyadic partition of the space of input variables. At each step, the space is cut into two parts. To simplify the presentation, we assume that $\mathcal{X} = \mathbb{R}^p$. The root of the tree is therefore the space $\mathbb{R}^p$. The root is then split into two child nodes. To do so, one selects a pair $(j, d) \in \{1, \ldots, p\} \times \mathbb{R}$. Once the pair $(j, d)$ is selected, the splitting rule is as follows (denoting by $X_i^j$ the $j$-th coordinate of the vector $X_i$):
- the set of observations $\{(X_i, Y_i) \in D_n,\ X_i^j \leq d\}$ forms the left child node;
- while the set of observations $\{(X_i, Y_i) \in D_n,\ X_i^j > d\}$ forms the right child node.

The selection of the pair $(j, d)$ relies on the Gini index of the child nodes. The Gini index of a node $t$ is defined by

$$I(t) = 1 - \sum_{y=1}^{K} p_y^2(t),$$

where $p_y(t)$ is the proportion of observations of class $y$ in node $t$. The pair $(j, d)$ is the one for which the sum of the Gini indices of the resulting child nodes is minimal. Minimizing the Gini index thus amounts to increasing the homogeneity of the nodes obtained, a node being perfectly homogeneous (or pure) if it contains only observations from a single class (its Gini index then equals 0).
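The selection of the pair $(j, d)$ can be sketched as follows. The sketch follows the criterion as stated in the text (plain sum of the two children's Gini indices); common implementations typically weight each child's index by its share of the observations, which is not done here. Candidate thresholds are restricted to observed values that leave both children non-empty, which is an assumption of this illustration; the function names are hypothetical and coordinates are indexed from 0.

```python
def gini(labels):
    """Gini index 1 - sum_y p_y^2 of a node given the labels it contains."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Return (j, d, score) minimising the sum of the children's Gini indices,
    or None when no coordinate takes two distinct values."""
    best = None
    p = len(X[0])
    for j in range(p):
        values = sorted(set(row[j] for row in X))
        for d in values[:-1]:  # thresholds leaving both children non-empty
            left = [yi for row, yi in zip(X, y) if row[j] <= d]
            right = [yi for row, yi in zip(X, y) if row[j] > d]
            score = gini(left) + gini(right)
            if best is None or score < best[2]:
                best = (j, d, score)
    return best
```

A score of 0 means that both child nodes are pure.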

Once the root of the tree has been split, the same process is applied to each of the child nodes, and so on until a stopping criterion is met. Different stopping criteria can be considered, such as not splitting nodes that contain fewer than a given number of observations. The terminal nodes are called the leaves of the tree. Once the tree is built, a last step called pruning selects the best subtree of the full tree (by minimizing a penalized criterion chosen beforehand).

Principle of Random Forests-RI

The principle of Random Forests-RI is first to generate several bootstrap samples $D_n^{\Theta_1}, \ldots, D_n^{\Theta_q}$. Then, on each sample $D_n^{\Theta_l}$, $l = 1, \ldots, q$, a classification tree is built following a slightly different version of the procedure presented above: the split of a node is carried out on a subset of $m \leq p$ randomly drawn variables. More precisely, for each node, a subset $\{j_1, \ldots, j_m\} \subset \{1, \ldots, p\}$ is drawn uniformly and without replacement. The node is then split by considering only the coordinates $X_i^{j_k}$, $k = 1, \ldots, m$, of the observations $(X_i, Y_i)$ making up the sample $D_n^{\Theta_l}$ at hand. The resulting tree is not pruned. The collection of trees obtained is then aggregated by majority vote to yield the Random Forests-RI predictor.
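The Random Forests-RI principle can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm of the R package randomForest: splits minimise the plain sum of the children's Gini indices as in the text, candidate thresholds are observed values, trees are grown until the nodes are pure, and class labels are assumed not to be tuples (tuples encode internal nodes). All function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, m, rng):
    """Grow an unpruned tree on rows = [(x, y), ...]; at each node only
    m coordinates, drawn uniformly without replacement, are tried."""
    labels = [y for _, y in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority  # pure node: leaf
    best = None
    for j in rng.sample(range(len(rows[0][0])), m):
        for d in sorted({x[j] for x, _ in rows})[:-1]:
            left = [r for r in rows if r[0][j] <= d]
            right = [r for r in rows if r[0][j] > d]
            score = gini([y for _, y in left]) + gini([y for _, y in right])
            if best is None or score < best[0]:
                best = (score, j, d, left, right)
    if best is None:
        return majority  # drawn coordinates are constant: leaf
    _, j, d, left, right = best
    return (j, d, grow_tree(left, m, rng), grow_tree(right, m, rng))

def predict_tree(tree, x):
    while isinstance(tree, tuple):  # internal node (j, d, left, right)
        j, d, left, right = tree
        tree = left if x[j] <= d else right
    return tree

def random_forest(sample, q, m, seed=0):
    """Build q trees on bootstrap samples and aggregate them by majority vote."""
    rng = random.Random(seed)
    n = len(sample)
    trees = []
    for _ in range(q):
        bootstrap = [sample[rng.randrange(n)] for _ in range(n)]
        trees.append(grow_tree(bootstrap, m, rng))
    def forest(x):
        votes = Counter(predict_tree(t, x) for t in trees)
        return votes.most_common(1)[0][0]
    return forest
```

The two sources of randomness (bootstrap resampling and the random choice of coordinates at each node) are what distinguish the individual trees from one another before the vote.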

Many theoretical analyses of the different versions of random forests have been carried out, such as those of Lin and Jeon (2006) and Biau et al. (2008). The practical use and parameter tuning of the R package randomForest are studied by Breiman (2001), Díaz-Uriarte and Alvarez de Andrés (2006) and Genuer (2010).

1.4.5 Top scoring pair classifier

The top scoring pair (TSP) classifier is a classification rule originally proposed by Geman et al. (2004) for the analysis of post-genomic data. The principle of the procedure is to identify pairs of genes whose differences in expression carry significant information for discriminating healthy patients from patients suffering from a given disease. Below, we describe the procedure proposed by Geman et al. (2004). We assume that $(X, Y) \in \mathbb{R}^G \times \{0, 1\}$. We write $X = (X_1, \ldots, X_G)$ and $D_n = \{(X^k, Y^k),\ k = 1, \ldots, n\}$. We also write $\mathcal{J} = \{J = (i, j) \in \{1, \ldots, G\}^2,\ i < j\}$, $Z_J = \mathbf{1}\{X_i < X_j\}$ for $J \in \mathcal{J}$ and, for $y = 0, 1$, $I(y) = \{k \leq n,\ Y^k = y\}$ and $N(y) = \mathrm{card}(I(y))$. The TSP classifier procedure consists in:

1) computing, for all $J \in \mathcal{J}$, the score $\Delta_J = |p_J(1) - p_J(0)|$ where, for $y = 0, 1$,

$$p_J(y) = \frac{\mathbf{1}\{N(y) > 0\}}{N(y)} \sum_{k \in I(y)} Z_J^k$$

(with the convention $0/0 = 0$);

2) determining $\hat{J} = \arg\max_{J \in \mathcal{J}} \Delta_J$;

3) considering the classifier $\Phi_{\hat{J}}$ defined by

$$\Phi_{\hat{J}}(X) = \Phi_{\hat{J}}(Z_{\hat{J}}) = \mathbf{1}\{p_{\hat{J}}(1) > p_{\hat{J}}(0)\}\,\mathbf{1}\{Z_{\hat{J}} = 1\} + \mathbf{1}\{p_{\hat{J}}(1) \leq p_{\hat{J}}(0)\}\,\mathbf{1}\{Z_{\hat{J}} = 0\}.$$

An implementation of the TSP classifier is available in the R package tspair. Figure 1.3 illustrates, for a pair $J = (i, j) \in \mathcal{J}$, the empirical classification rule $\Phi_J(X) = \Phi_J(Z_J) = \mathbf{1}\{p_J(1) > p_J(0)\}\,\mathbf{1}\{Z_J = 1\} + \mathbf{1}\{p_J(1) \leq p_J(0)\}\,\mathbf{1}\{Z_J = 0\}$.

To our knowledge, no theoretical study of the TSP classifier has been carried out. We propose one in Chapter 4. In that study, we show that the TSP classifier as proposed by Geman et al. (2004) is not optimal with respect to the generalization error. We propose a modified version that is optimal with respect to the generalization error.


Figure 1.3 – Illustration of the empirical classification rule $\Phi_J$ for a pair $J = (i, j)$. In this example, $N(1) = 10$ and $N(0) = 5$, and we have $p_J(1) = 1/2$ and $p_J(0) = 3/5$. For a new observation $X$, the classifier $\Phi_J$ is therefore defined by $\Phi_J(X) = 0$ if $X_j > X_i$ and $\Phi_J(X) = 1$ if $X_j \leq X_i$. Moreover, the score of this pair is $\Delta_J = |1/2 - 3/5| = 1/10$.
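The three steps of the TSP procedure can be sketched as follows. The function names `tsp_fit` and `tsp_predict` are hypothetical and unrelated to the interface of the R package tspair; coordinates are indexed from 0, and ties between pairs with equal scores are broken in favour of the first pair encountered, which is an assumption of this sketch.

```python
from itertools import combinations

def tsp_fit(X, y):
    """X: list of feature vectors, y: list of 0/1 labels.
    Returns (J_hat, p_J(1), p_J(0), Delta_J) for the top scoring pair."""
    def p(J, label):
        i, j = J
        idx = [k for k, yk in enumerate(y) if yk == label]
        if not idx:
            return 0.0  # convention 0/0 = 0
        return sum(X[k][i] < X[k][j] for k in idx) / len(idx)

    best = None
    for J in combinations(range(len(X[0])), 2):  # all pairs (i, j), i < j
        p1, p0 = p(J, 1), p(J, 0)
        delta = abs(p1 - p0)
        if best is None or delta > best[3]:
            best = (J, p1, p0, delta)
    return best

def tsp_predict(model, x):
    """Phi(x) = 1{p1 > p0} 1{Z = 1} + 1{p1 <= p0} 1{Z = 0}, Z = 1{x_i < x_j}."""
    (i, j), p1, p0, _ = model
    z = 1 if x[i] < x[j] else 0
    return z if p1 > p0 else 1 - z
```

With ten class-1 observations of which five satisfy $X_i < X_j$, and five class-0 observations of which three do, the sketch recovers the values of the illustration above: $p_J(1) = 1/2$, $p_J(0) = 3/5$, and a new observation with $X_j > X_i$ is assigned to class 0.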

1.4.6 Super-learning

Super-learning is a statistical learning procedure based on a principle of convex aggregation by V-fold cross-validation, proposed by van der Laan et al. (2007). The aim of this method is to estimate a statistical parameter $h^* \in \mathcal{H}$, where $\mathcal{H}$ is a set of statistical parameters, defined as

$$h^* = \arg\min_{h \in \mathcal{H}} E_{P_0}\left[L((X, Y), h)\right],$$

for a loss function $L$. van der Laan et al. (2007) propose this method for the estimation, based on the sample $D_n$, of the parameter defined by $h^*(X) = E_{P_0}[Y \mid X]$ with $L((X, Y), h) = (Y - h(X))^2$, for $h \in \mathcal{H}$. Below, we describe super-learning in the setting of binary classification, where $\mathcal{Y} = \{0, 1\}$ and $\mathcal{X} = \mathbb{R}^p$. Note that in this case $h^*(X) = P_0(Y = 1 \mid X)$.

We consider a collection $H = \{h_1, \ldots, h_M\}$ of estimation procedures for the parameter $h^*$. By estimation procedure we mean a function of the empirical distribution that returns an estimator of $P_0(Y = 1 \mid X = \cdot)$. We denote by $P_n = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Dirac}(X_i, Y_i)$ the empirical distribution of $D_n$. The principle, based on cross-validation, is as follows:

1) for all $m = 1, \ldots, M$, compute $h_m(P_n)$;

2) consider a regular partition $(B_v)_{1 \leq v \leq V}$ of $\{1, \ldots, n\}$, with $V \geq 2$ an integer; for $v = 1, \ldots, V$, write $D_n^{(v)} = (X_i, Y_i)_{i \in B_v}$ and $D_n^{(-v)} = (X_i, Y_i)_{i \notin B_v}$, whose empirical distribution is given by $P_n^{(-v)} = \frac{1}{n - \mathrm{Card}(B_v)} \sum_{i \notin B_v} \mathrm{Dirac}(X_i, Y_i)$;

3) for $v = 1, \ldots, V$ and $m = 1, \ldots, M$, compute $h_m(P_n^{(-v)})$;

4) for $v = 1, \ldots, V$ and $m = 1, \ldots, M$, compute $h_m(P_n^{(-v)})(X_i) = H_{i,m}$ for all $i \in B_v$;

5) this finally yields a matrix $\mathbf{H} = (H_{i,m})_{i \leq n,\, m \leq M} \in \mathcal{M}_{n,M}(\mathbb{R})$;

6) then compute

$$\hat{\alpha} = \arg\min_{\alpha \in C_M} \sum_{i=1}^{n} \left(Y_i - (\mathbf{H}\alpha)_i\right)^2,$$

with $C_M = \{\alpha \in [0, 1]^M,\ \sum_{m=1}^{M} \alpha_m = 1\}$;

7) the super-learning estimator $\hat{h}$ is then defined as

$$\hat{h} = \sum_{m=1}^{M} \hat{\alpha}_m h_m(P_n).$$

The R package SuperLearner provides an implementation of super-learning. In Chapter 5, we propose a version of super-learning for multi-class classification.
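The super-learning steps listed above can be sketched as follows for real-valued inputs. This is a minimal illustration, not the algorithm of the R package SuperLearner: the convex weights are obtained by projected gradient descent on the simplex $C_M$ (one possible solver, chosen here as an assumption), folds are formed by taking every $V$-th index, and an estimation procedure is represented by a function mapping a sample to an estimator of $P_0(Y = 1 \mid X = \cdot)$. All names are hypothetical.

```python
def project_simplex(v):
    """Euclidean projection of v onto {a : a_m >= 0, sum_m a_m = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        css += uk
        t = (css - 1.0) / k
        if uk - t > 0:
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def convex_weights(H, y, steps=2000):
    """Minimise sum_i (y_i - (H alpha)_i)^2 over the simplex (step 6)."""
    M = len(H[0])
    alpha = [1.0 / M] * M
    # 2 * trace(H^T H) bounds the Lipschitz constant of the gradient
    L = 2.0 * sum(h * h for row in H for h in row) or 1.0
    for _ in range(steps):
        resid = [sum(a * h for a, h in zip(alpha, row)) - yi
                 for row, yi in zip(H, y)]
        grad = [2.0 * sum(r * row[m] for r, row in zip(resid, H))
                for m in range(M)]
        alpha = project_simplex([a - g / L for a, g in zip(alpha, grad)])
    return alpha

def super_learner(procedures, sample, V):
    """sample: list of (x, y) with y in {0, 1}. Returns (alpha_hat, h_hat)."""
    n = len(sample)
    folds = [list(range(v, n, V)) for v in range(V)]      # partition B_1..B_V
    H = [[0.0] * len(procedures) for _ in range(n)]
    for B in folds:
        held_out = set(B)
        train = [sample[i] for i in range(n) if i not in held_out]
        fitted = [proc(train) for proc in procedures]     # h_m(P_n^(-v))
        for i in B:
            for m, h in enumerate(fitted):
                H[i][m] = h(sample[i][0])                 # entry H_{i,m}
    alpha = convex_weights(H, [y for _, y in sample])
    full = [proc(sample) for proc in procedures]          # h_m(P_n)
    return alpha, (lambda x: sum(a * h(x) for a, h in zip(alpha, full)))
```

Because each entry $H_{i,m}$ is computed by an estimator trained without observation $i$, the weights reward procedures that predict well out of sample rather than those that merely fit the learning sample.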

1.5 Summary of the results of the thesis

This section gives a general presentation of the work carried out during the thesis, organized in four chapters. Two chapters are devoted to studies of the postural control data described in Section 1.2; they are presented in Section 1.5.1. The following two chapters are devoted to work on various notions encountered in the first two; they are presented in Section 1.5.2.

1.5.1 Studies of the postural control data

Chapter 2 is devoted to a study based on the first part of the dataset, made up of the 54 patients (hemiplegic and healthy), the trajectories from the four protocols described in Section 1.2, and the covariates. In this study, we propose to classify the patients into two classes (hemiplegic and healthy) on the basis of the covariates and of the trajectories from the four protocols followed by all the patients. We consider the trajectories $(B_i)_{i \leq N}$ and $(C_i)_{i \leq N}$ described in Section 1.2.3. The summary measures of the trajectories are built from static quantities. Beforehand, we propose a method for ranking the protocols from the most informative to the least informative (in terms of postural control). The ranking of the protocols is based on a set of statistical tests. Once the protocols are ranked, we build four classification procedures that use the information from the one, two, three or four most informative protocols, respectively. This allows us to evaluate the performance of classification procedures that do not necessarily require a patient to undergo all the protocols. Indeed, the experiment can prove trying, even for a healthy patient. This work has been published in the Annals of Applied Statistics.

Chapter 3 is devoted to a study based on the second part of the dataset, made up of the 70 patients (hemiplegic, vestibular and healthy) and the trajectories associated with protocols 2 and 3. In this work, we build a procedure for classifying the patients into the three classes (hemiplegic, vestibular and healthy) that relies on the study of the trajectories $(C_i)_{i \leq N}$ from the two protocols. We build the summary measures from a modeling of the trajectories $(C_i)_{i \leq N}$ by Cox-Ingersoll-Ross processes. This modeling allows us to take the temporal dynamics of the trajectories into account. In addition, we introduce the notion of change point into our model. Change points are meant to account for the fact that a patient does not react immediately to the perturbations of the protocols and that, moreover, this reaction time is not the same for all patients. The results of this study show that the proposed modeling is relevant, the introduction of change points in particular bringing important information for classification. This work is the subject of an article currently submitted for publication.

1.5.2 Work related to postural control

In Chapter 4, we study the theoretical performance of the TSP classifier proposed by Geman et al. (2004) and used in the classification procedure of Chapter 2. In this study, we prove a consistency result for the TSP classifier proposed by Geman et al. (2004). We also propose variants of this procedure, for which we likewise give consistency results. This work is the subject of an article in preparation.

Chapter 5 is devoted to two complements. The first deals with the estimation of change points for Cox-Ingersoll-Ross processes observed at discrete times. We propose a procedure for estimating the change points when the number of change points is unknown. The proposed procedure also makes it possible to determine which parameters are affected by the change points. We evaluate the performance of this procedure by simulation. The results are encouraging and open the way to a theoretical study of the performance of this procedure. The second complement proposes an application of super-learning to multi-class classification. Super-learning is used to set up the classification procedures of Chapter 2. In this work, we give a consistency result for super-learning in the multi-class classification setting. The performance of this procedure is studied on three real datasets.

Chapter 2

Classification according to postural style

Classification in postural style

This chapter, a modified version of the article Chambaz and Denis (2012), was written in collaboration with Antoine Chambaz.

abstract:

This article contributes to the search for a notion of postural style, focusing on the issue of classifying subjects in terms of how they maintain posture. Longer term, the hope is to make it possible to determine on a case by case basis which sensorial information is prevalent in postural control, and to improve/adapt protocols for functional rehabilitation among those who show deficits in maintaining posture, typically seniors. Here, we specifically tackle the statistical problem of classifying subjects sampled from a two-class population. Each subject (enrolled in a cohort of 54 participants) undergoes four experimental protocols which are designed to evaluate potential deficits in maintaining posture. These protocols result in four complex trajectories, from which we can extract four small-dimensional summary measures. Because undergoing several protocols can be unpleasant, and sometimes painful, we try to limit the number of protocols needed for the classification. Therefore, we first rank the protocols by decreasing order of relevance, then we derive four plug-in classifiers which involve the best (i.e., most informative), the two best, the three best and all four protocols. This two-step procedure relies on the cutting-edge methodologies of targeted maximum likelihood learning (a methodology for robust and efficient inference) and super-learning (a machine learning procedure for aggregating various estimation procedures into a single better estimation procedure). A simulation study is carried out. The performances of the procedure applied to the real dataset (and evaluated by the leave-one-out rule) go as high as an 87% rate of correct classification (47 out of 54 subjects correctly classified), using only the best protocol.

Keywords: aggregation; classification; cross-validation; postural style; targeted minimum loss learning (TMLE); top-scoring pairs classifier; super-learning.

2.1 Introduction

This article contributes to the search for a notion of postural style, focusing on the issue of classifying subjects in terms of how they maintain posture.

Posture is fundamental to all activities, including locomotion and prehension. Posture is the fruit of a dynamic analysis by the brain of visual, proprioceptive and vestibular information. Proprioceptive information stems from the ability to sense the position, location, orientation and movement of the body and its parts. Vestibular information roughly relates to the sense of equilibrium. Every individual develops his/her own preferences according to his/her sensorimotor experience. Sometimes, a sole kind of information (usually visual) is processed in all situations. Although this kind of processing may be efficient for maintaining posture in one’s usual environment, it is likely not adapted to reacting to new or unexpected situations. Such situations may result in falling, the consequences of a fall being particularly bad in seniors. Longer term, the hope is to make it possible to determine on a case by case basis which sensorial information is prevalent in postural control, and to improve/adapt protocols for functional rehabilitation among those who show deficits in maintaining posture, typically seniors.

As in earlier studies (Bertrand et al., 2001; Chambaz et al., 2009, and references therein), our approach to characterizing postural control involves the use of a force-platform. Subjects standing on a force-platform are exposed to different perturbations, following different experimental protocols (or simply protocols in the sequel). The force-platform records over time the center-of-pressure of each foot, that is “the position of the global ground reactions forces that accommodates the sway of the body” (Newell et al., 1997). A protocol is divided into three phases: a first phase without perturbation, followed by a second phase with perturbation, followed by a last phase without perturbation. Different kinds of perturbations are considered. They can be characterized either as visual, or proprioceptive or vestibular, depending on which sensorial system is perturbed.

We specifically tackle the statistical problem of classifying subjects sampled from a two-class population. The first class regroups subjects who do not show any deficit in postural control. The second class regroups hemiplegic subjects, who suffer from a proprioceptive deficit. Even though differentiating two subjects from the two groups is relatively easy by visual inspection, it is a much more delicate task when relying on some general baseline covariates and the trajectories provided by a force-platform. Furthermore, since undergoing several protocols can be unpleasant, and sometimes painful (some sensitive subjects have to lie down for 15 minutes in order to recover from dizziness after a series of protocols), we also try to limit the number of protocols used for classifying.

Our classification procedure relies on cutting-edge statistical methodologies. In particular, we propose a preliminary ranking of the four protocols (in view of how much we can learn from them on postural control) which involves the targeted maximum likelihood methodology (van der Laan and Rubin, 2006; Rose and van der Laan, 2011), a statistical procedure for robust and efficient inference. The targeted maximum likelihood methodology relies on the super-learning procedure, a machine learning methodology for aggregating various estimation procedures (or simply estimators) into a single better estimation procedure/estimator (van der Laan et al., 2007; Rose and van der Laan, 2011). In addition to being a key element of the targeted maximum likelihood ranking of the protocols, the super-learning procedure also plays a crucial role in the construction of our classification procedure.

We show that it is possible to achieve an 87% rate of correct classification (47 out of 54 subjects correctly classified; the performance is evaluated by the leave-one-out rule), using only the most informative protocol. Our classification procedure is easy to generalize (we actually provide an example of generalization), so we reasonably hope that even better results are within reach (especially considering that more data should soon augment our small dataset). The interest of the article goes beyond the specific application. It nicely illustrates the versatility and power of the targeted maximum likelihood and super-learning methodologies. It also shows that retrieving and comparing small-dimensional summary measures from complex trajectories may be convenient to classify them.

The article is organized as follows. In Section 2.2, we describe the dataset which is at the core of the study. The classification procedure is formally presented in Section 2.3, and its performances, evaluated by simulations, are discussed in Section 2.4. We report in Section 2.5 the results obtained by applying the latter classification procedure to the real dataset. We relegate to Section 2.6 a self-contained presentation of the super-learning procedure as it is used here, and the description of an estimation procedure/estimator that will play a great role in the super-learning procedure applied to the construction of our classification procedure.

2.2 Data description

The dataset, collected at the Center for the study of sensorimotor functioning (CESEM, Université Paris Descartes), is described in Section 2.2.1. We motivate the introduction of a summarized version of each observed trajectory, and present its construction in Section 2.2.2.

2.2.1 Original dataset

Each subject undergoes four protocols that are designed to evaluate potential deficits in maintaining posture. The specifics of the latter protocols are presented in Table 2.1. Protocols 1 and 2 respectively perturb the processing of visual data and proprioceptive information by the brain. Protocol 3 cumulates both perturbations. Protocol 4 relies on perturbing the processing of vestibular information by the brain through a visual stimulation.

A total of $n = 54$ subjects are enrolled. For each of them, the age, gender, laterality (the preference that most humans show for one side of their body over the other), height and weight are collected. Among the 54 subjects, 22 are hemiplegic (due to a cerebrovascular accident), and therefore suffer from a proprioceptive deficit in postural control. Initial medical examinations concluded that the 32 other subjects show no pronounced deficits in postural control. We will refer to those subjects as normal subjects.

For each protocol, the center-of-pressure of each foot is recorded over time. Thus each protocol results in a trajectory $(X_t)_{t \in T} = (L_t, R_t)_{t \in T}$, where $L_t = (L_t^1, L_t^2) \in \mathbb{R}^2$ (respectively, $R_t = (R_t^1, R_t^2)$) gives the position of the center-of-pressure of the left (respectively, right) foot on the force-platform at time $t$, for each $t$ in $T = \{k\delta : 1 \le k \le 2800\}$, where the time-step is $\delta = 0.025$ seconds (the protocol lasts 70 seconds). We represent in Figure 2.1 two such trajectories $(X_t)_{t \in T}$ associated with a normal subject and a hemiplegic subject, both undergoing the third protocol (see Table 2.1). Note that we do not take into account the first few seconds of the recording that a generic subject needs to reach a stationary behavior.

Table 2.1: Specifics of the four protocols designed to evaluate potential deficits in postural control. A protocol is divided into three phases: a first phase without perturbation of the posture is followed by a second phase with perturbations, which is followed by a last phase without perturbation. Different kinds of perturbations are considered. They can be characterized either as visual (closing the eyes), or proprioceptive (muscular stimulation) or vestibular (optokinetic stimulation), depending on which sensorial information is perturbed.

| protocol | 1st phase (0→15s) | 2nd phase (15→50s) | 3rd phase (50→70s) |
|---|---|---|---|
| 1 | no perturbation | eyes closed | no perturbation |
| 2 | no perturbation | muscular stimulation | no perturbation |
| 3 | no perturbation | eyes closed, muscular stimulation | no perturbation |
| 4 | no perturbation | optokinetic stimulation | no perturbation |

Figure 2.1 confirms the intuition that the structure of a generic trajectory $(X_t)_{t \in T}$ is complicated, and that a mere visual inspection is, at least on this example, of little help for differentiating the normal and hemiplegic subjects. Although several articles investigate how to model and use such trajectories directly (Bertrand et al., 2001; Chambaz et al., 2009), we choose to rely on a summary measure of $(X_t)_{t \in T}$ instead of relying on $(X_t)_{t \in T}$ itself.

2.2.2 Constructing a summary measure

The summary measure that we construct is actually a summary measure of a one-dimensional trajectory $(C_t)_{t \in T}$ that we initially derive from $(X_t)_{t \in T}$. First, we introduce the trajectory of barycenters, $(B_t)_{t \in T} = (\frac{1}{2}(L_t + R_t))_{t \in T}$. Second, we evaluate a reference position $b$ which is defined as the componentwise median value of $(B_t)_{t \in T \cap [0,15]}$ (that is, the median value over the first phase of the protocol). Third, we set $C_t = \|B_t - b\|_2$ for all $t \in T$, the Euclidean distance between $B_t$ and the reference position $b$, which provides a relevant description of the sway of the body during the course of the protocol. We plot in Figure 2.2 two examples of $(C_t)_{t \in T}$ corresponding to two different protocols undergone by a hemiplegic subject.
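The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the study: the function name `sway_distance` and the array layout (one row per time point $t = k\delta$, two columns per foot) are assumptions made for the example.

```python
import numpy as np

def sway_distance(left, right, delta=0.025, baseline_end=15.0):
    """Return C_t = ||B_t - b||_2 for a trajectory sampled at t = k*delta, k = 1..n.

    left, right: arrays of shape (n, 2) holding the center-of-pressure of each foot.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    barycenters = 0.5 * (left + right)               # B_t
    n_baseline = int(round(baseline_end / delta))    # number of samples with t <= 15 s
    # Reference position b: componentwise median over the first phase.
    reference = np.median(barycenters[:n_baseline], axis=0)
    return np.linalg.norm(barycenters - reference, axis=1)
```

With the sampling grid of the study (2800 points, $\delta = 0.025$), the first 600 rows are the ones used to compute $b$.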

Figure 2.1: Sequences $t \mapsto L_t$ (left) and $t \mapsto R_t$ (right) of positions of the center-of-pressure over $T$ of both feet on the force-platform, associated with a normal subject (top) and a hemiplegic subject (bottom), who undergo the third protocol (see Table 2.1).

Because the most informative features can be found at the start and end of the second phase, we use the following finite-dimensional summary measure of $(X_t)_{t \in T}$ (through $(C_t)_{t \in T}$):

$$Y = (C_1^+ - C_1^-,\; C_2^- - C_1^+,\; C_2^+ - C_2^-) \tag{2.1}$$

where

$$C_1^- = \frac{\delta}{5} \sum_{t \in T \cap [10,15[} C_t, \qquad C_1^+ = \frac{\delta}{5} \sum_{t \in T \cap ]15,20]} C_t,$$

$$C_2^- = \frac{\delta}{5} \sum_{t \in T \cap [45,50[} C_t, \qquad C_2^+ = \frac{\delta}{5} \sum_{t \in T \cap ]50,55]} C_t$$

are the averages of $C_t$ computed over the intervals $[10, 15[$, $]15, 20]$, $[45, 50[$ and $]50, 55]$ (that is, over the last/first 5 seconds before/after the beginning/ending of the second phase of the protocol of interest). We arbitrarily choose this 5-second threshold. Note that $C_2^- - C_1^- = Y_1 + Y_2$, $C_2^+ - C_1^+ = Y_2 + Y_3$ and $C_2^+ - C_1^- = Y_1 + Y_2 + Y_3$ are linear combinations of the components of $Y$. We refer to Figure 2.3 for a visual representation of the definition of the summary measure $Y$.
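As a hedged illustration of definition (2.1), the sketch below computes the four window averages and the vector $Y$ from a sampled trajectory $(C_t)$. The helper names `window_mean` and `summary_measure` are hypothetical; window membership is decided on the integer index $k$ (with $t = k\delta$) to avoid floating-point comparisons at the interval endpoints.

```python
import numpy as np

def window_mean(c, start, stop, delta, closed="left"):
    """Average of C_t over [start, stop[ (closed='left') or ]start, stop] (closed='right')."""
    k = np.arange(1, len(c) + 1)                      # sample k is taken at time k*delta
    lo, hi = int(round(start / delta)), int(round(stop / delta))
    if closed == "left":
        mask = (k >= lo) & (k < hi)
    else:
        mask = (k > lo) & (k <= hi)
    return c[mask].mean()

def summary_measure(c, delta=0.025):
    """Return Y = (C1+ - C1-, C2- - C1+, C2+ - C2-) as in (2.1)."""
    c = np.asarray(c, dtype=float)
    c1_minus = window_mean(c, 10, 15, delta, "left")
    c1_plus = window_mean(c, 15, 20, delta, "right")
    c2_minus = window_mean(c, 45, 50, delta, "left")
    c2_plus = window_mean(c, 50, 55, delta, "right")
    return np.array([c1_plus - c1_minus, c2_minus - c1_plus, c2_plus - c2_minus])
```

Each window contains $5/\delta = 200$ samples, so the plain mean used here coincides with the factor $\delta/5$ in front of the sums.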


Figure 2.2: Representing the trajectories $t \mapsto C_t$ over $T$ which correspond to two different protocols undergone by a hemiplegic subject (protocol 1 on the left, protocol 3 on the right).

2.3 Classification procedure

We describe hereafter our two-step classification procedure. We formally introduce the statistical framework that we consider in Section 2.3.1. The first step of the classification procedure consists in ranking the protocols from the most to the least informative with respect to some criterion, see Section 2.3.2. The second step consists of the classification, see Section 2.3.3.

2.3.1 Statistical framework

The observed data structure $O$ writes as $O = (W, A, Y^1, Y^2, Y^3, Y^4)$, where
– $W \in \mathbb{R} \times \{0, 1\}^2 \times \mathbb{R}^2$ is the vector of baseline covariates (corresponding to initial age, gender, laterality, height and weight, see Section 2.2.1);
– $A \in \{0, 1\}$ indicates the subject’s class (with convention $A = 1$ for hemiplegic subjects and $A = 0$ for normal subjects);
– for each $j \in \{1, 2, 3, 4\}$, $Y^j \in \mathbb{R}^3$ is the summary measure (as defined in (2.1)) associated with the $j$th protocol.

We denote by $P_0$ the true distribution of $O$. Since we do not know much about $P_0$, we simply see it as an element of the non-parametric set $\mathcal{M}$ of all possible distributions of $O$.

Figure 2.3: Visual representation of the definition of the finite-dimensional summary measure $Y$ of $(X_t)_{t \in T}$. The four horizontal segments (solid lines) represent, from left to right, the averages $C_1^-$, $C_1^+$, $C_2^-$, $C_2^+$ of $(C_t)_{t \in T}$ over the intervals $[10, 15[$, $]15, 20]$, $[45, 50[$, $]50, 55]$. The three vertical segments (solid lines ending by an arrow) represent, from top to bottom, the components $Y_1$, $Y_2$ and $Y_3$ of $Y$. Two additional vertical lines indicate the beginning and ending of the second phase of the considered protocol.

We need a criterion to rank the four protocols from the most to the least informative in view of the subject’s class. To this end, we introduce the functional $\Psi : \mathcal{M} \to \mathbb{R}^{12}$ such that, for any $P \in \mathcal{M}$, $\Psi(P) = (\Psi^j(P))_{1 \le j \le 4}$ where

$$\Psi^j(P) = \left( E_P\left\{ E_P[Y_i^j \mid A = 1, W] - E_P[Y_i^j \mid A = 0, W] \right\} \right)_{1 \le i \le 3}.$$

The component $\Psi_i^j(P)$ is known in the literature as the variable importance measure of $A$ on the summary measure $Y_i^j$ controlling for $W$ (Rose and van der Laan, 2011). Under causal assumptions, it can be interpreted as the effect of $A$ on $Y_i^j$. More generally, we are interested in $\Psi_i^j(P_0)$ because the further it is from zero, the more knowledge on $A$ we expect to gain from the observation of $W$ and the summary measure $Y_i^j$ (i.e., by comparing the averages of $(C_t)_{t \in T}$ computed over the time intervals corresponding to index $i$, see Section 2.2.2). For instance, say that $\Psi_1^2(P_0) > 0$: this means that (in $P_0$-average) the change between the mean postures $C_1^-$ and $C_1^+$ of a hemiplegic subject, computed before and after the beginning of the muscular perturbation, is larger than that of a normal subject. In words, the postural control of a hemiplegic subject is more affected by the beginning of the muscular perturbation than the postural control of a normal subject.
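To make the parameter concrete, here is a minimal sketch of a plain substitution (plug-in) estimator of one component $E_P\{E_P[Y \mid A = 1, W] - E_P[Y \mid A = 0, W]\}$. It is deliberately not the targeted maximum likelihood estimator used in the article: the conditional mean is fitted with an ordinary least-squares working model (main effects plus $A \times W$ interactions), which is an assumption of the example, as is the function name `plug_in_importance`.

```python
import numpy as np

def plug_in_importance(y, a, w):
    """Plug-in estimate of E{E[Y|A=1,W] - E[Y|A=0,W]} based on a linear working model."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    w = np.asarray(w, dtype=float).reshape(len(y), -1)

    def design(a_col):
        # Columns: intercept, A, W, A*W
        return np.column_stack([np.ones(len(y)), a_col, w, a_col[:, None] * w])

    coef, *_ = np.linalg.lstsq(design(a), y, rcond=None)
    q1 = design(np.ones(len(y))) @ coef    # fitted E[Y | A=1, W] for every subject
    q0 = design(np.zeros(len(y))) @ coef   # fitted E[Y | A=0, W] for every subject
    return float(np.mean(q1 - q0))         # average over the empirical law of W
```

Such a plug-in estimator is only as good as its working model; the targeted approach described in the next section is designed to be robust to misspecification of either the conditional mean or the conditional distribution of $A$ given $W$.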

2.3.2 Targeted maximum likelihood ranking of the protocols

Our ranking of the four protocols relies on testing the null hypotheses

$$\text{“}\Psi_i^j(P_0) = 0\text{”}, \quad (i, j) \in \{1, 2, 3\} \times \{1, 2, 3, 4\},$$

against their two-sided alternatives. Heuristically, rejecting “$\Psi_i^j(P_0) = 0$” tells us that the value of the $i$th coordinate of the summary measure $Y^j$ provides helpful information for the sake of determining whether $A = 0$ or $A = 1$.

We consider tests based on the targeted maximum likelihood methodology (van der Laan and Rubin, 2006; Rose and van der Laan, 2011). Because presenting a self-contained introduction to the methodology would significantly lengthen the article, we provide below only a very succinct description of it. The targeted maximum likelihood methodology relies on the super-learning procedure, a machine learning methodology for aggregating various estimators into a single better estimator (van der Laan et al., 2007; Rose and van der Laan, 2011), based on the cross-validation principle. Since super-learning also plays a crucial role in our classification procedure (see Section 2.3.3), and because it is possible to present a relatively short self-contained introduction to the construction of a super-learner, we propose such an introduction in Section 2.6.

Let $O_{(1)}, \dots, O_{(n)}$ be $n$ independent copies of $O$. For each $(i, j) \in \{1, 2, 3\} \times \{1, 2, 3, 4\}$, we compute the targeted maximum likelihood estimator (TMLE) $\Psi_{i,n}^j$ of $\Psi_i^j(P_0)$ based on $O_{(1)}, \dots, O_{(n)}$ and an estimator $\sigma_{i,n}^j$ of its asymptotic standard deviation $\sigma_i^j(P_0)$. The methodology applies because $\Psi_i^j$ is a “smooth” parameter. It notably involves the super-learning of the conditional means $Q_i^j(P_0)(A, W) = E_{P_0}(Y_i^j \mid A, W)$ and of the conditional distribution $g(P_0)(A \mid W) = P_0(A \mid W)$ (the collection of estimators aggregated by super-learning is given in Section 2.6). Under some regularity conditions, the estimator $\Psi_{i,n}^j$ of $\Psi_i^j(P_0)$ is consistent when either $Q_i^j(P_0)$ or $g(P_0)$ is consistently estimated, and it satisfies a central limit theorem. In addition, if $g(P_0)$ is consistently estimated by a maximum-likelihood based estimator, then $\sigma_{i,n}^j$ is a conservative estimator of $\sigma_i^j(P_0)$. Thus, we can consider in the sequel the test statistics $T_{i,n}^j = \sqrt{n}\, \Psi_{i,n}^j / \sigma_{i,n}^j$ (all $(i, j) \in \{1, 2, 3\} \times \{1, 2, 3, 4\}$).

Now, we rank the four protocols by comparing the 3-dimensional vectors of test statistics $(T_{1,n}^j, T_{2,n}^j, T_{3,n}^j)$ for $1 \le j \le 4$. Several criteria for comparing the vectors were considered. They all relied on the fact that the larger $|T_{i,n}^j|$ is, the less likely the null “$\Psi_i^j(P_0) = 0$” is to be true. Since the results were only slightly affected by the criterion, we focus here on a single one. Thus, we decide that protocol $j$ is more informative than protocol $j'$ if

$$\sum_{i=1}^{3} (T_{i,n}^{j'})^2 < \sum_{i=1}^{3} (T_{i,n}^{j})^2.$$

This rule is motivated by the fact that, if $\sigma_{1,n}^j, \sigma_{2,n}^j, \sigma_{3,n}^j$ are consistent estimators of $\sigma_1^j(P_0), \sigma_2^j(P_0), \sigma_3^j(P_0)$, then $\sum_{i=1}^{3} (T_{i,n}^j)^2$ asymptotically follows the $\chi^2(3)$ distribution under $H_0^j$: “$\Psi^j(P_0) = 0$”.

By definition of $O$ and by construction of the TMLE procedure, this rule yields almost surely a final ranking of the four protocols from the most to the least informative for the sake of determining whether $A = 0$ or $A = 1$.
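The ranking rule itself is a one-liner once the test statistics are available. The sketch below assumes they are stored in a $3 \times 4$ array (rows indexed by $i$, columns by $j$); the function name `rank_protocols` and the numbers in the usage example are made up for illustration.

```python
import numpy as np

def rank_protocols(t_stats):
    """Rank protocols by decreasing sum of squared test statistics.

    t_stats[i, j] holds the statistic for component i+1 of protocol j+1.
    Returns (protocol labels, most informative first; per-protocol scores).
    """
    t_stats = np.asarray(t_stats, dtype=float)
    scores = np.sum(t_stats ** 2, axis=0)            # one criterion value per protocol
    order = np.argsort(-scores, kind="stable")       # decreasing order of the criterion
    return (order + 1).tolist(), scores
```

For instance, `rank_protocols([[1, 2, 3, 0.5], [0, 1, -4, 0.1], [1, -2, 1, 0]])` gives the scores 2, 9, 26 and 0.26, hence the ranking 3, 2, 1, 4.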

2.3.3 Classifying a new subject

We now build a classifier $\phi$ for determining whether $A = 0$ or $A = 1$ based on the baseline covariates $W$ and summary measures $(Y^1, Y^2, Y^3, Y^4)$. To study the influence of the ranking on the classification, we actually build four different classifiers $\phi^1, \phi^2, \phi^3, \phi^4$ which respectively use only the best (most informative) protocol, the two best, the three best, and all four protocols. So $\phi^j$ is a function of $W$ and of $j$ among the four vectors $Y^1, Y^2, Y^3, Y^4$.


Say that $\mathcal{J} \subset \{1, 2, 3, 4\}$ has $J$ elements. First, we build an estimator $h_n^J(W, Y^j, j \in \mathcal{J})$ of $P_0(A = 1 \mid W, Y^j, j \in \mathcal{J})$ based on $O_{(1)}, \dots, O_{(n)}$, relying again on the super-learning methodology (the collection of estimators involved in the super-learning is given in Section 2.6). Second, we define

$$\phi^J(W, Y^j, j \in \mathcal{J}) = \mathbf{1}\{h_n^J(W, Y^j, j \in \mathcal{J}) \ge \tfrac{1}{2}\}$$

and decide to classify a new subject with information $(W, Y^j, j \in \mathcal{J})$ into the group of hemiplegic subjects if $\phi^J(W, Y^j, j \in \mathcal{J}) = 1$ or into the group of normal subjects otherwise.

Thus, the classifier $\phi^J$ relies on a plug-in rule, in the sense that the Bayes decision rule $\mathbf{1}\{P_0(A = 1 \mid W, Y^j, j \in \mathcal{J}) \ge \tfrac{1}{2}\}$ is mimicked by the empirical version where one substitutes an estimator of $P_0(A = 1 \mid W, Y^j, j \in \mathcal{J})$ for the latter regression function. Such classifiers can converge with fast rates under a complexity assumption on the regression function and the so-called margin condition (Audibert and Tsybakov, 2007).
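The plug-in rule can be written generically: any estimate of the regression function is turned into a classifier by thresholding at 1/2. In the sketch below, `plug_in_classifier` is a hypothetical helper and `toy_probability` is a made-up logistic function standing in for the super-learner estimate; neither comes from the article.

```python
import numpy as np

def plug_in_classifier(estimated_probability):
    """Turn an estimate h of P(A = 1 | features) into the rule 1{h(features) >= 1/2}."""
    def classify(features):
        h = np.asarray(estimated_probability(features), dtype=float)
        return (h >= 0.5).astype(int)   # 1: hemiplegic group, 0: normal group
    return classify

def toy_probability(features):
    """Hypothetical stand-in for the estimated regression function."""
    features = np.asarray(features, dtype=float)
    return 1.0 / (1.0 + np.exp(-(features @ np.array([1.0, -2.0]))))
```

Note the convention at the boundary: a subject whose estimated probability is exactly 1/2 is assigned to the hemiplegic group, in line with the non-strict inequality in the definition above.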

2.4 Simulation study

In this section, we carry out and report the results of a simulation study of the performances of the classification procedure described in Section 2.3. The details of the simulation scheme are presented in Section 2.4.1, and the results are reported and evaluated in Section 2.4.2.

2.4.1 Simulation scheme

Instead of simulating $(W, A)$ and the four complex trajectories $(X_t^1)_{t \in T}$, $(X_t^2)_{t \in T}$, $(X_t^3)_{t \in T}$, $(X_t^4)_{t \in T}$ associated with four fictitious protocols, we generate directly $(W, A)$ and the summary measures $Y^1, Y^2, Y^3, Y^4$ that one would have derived from the trajectories $(X_t^1)_{t \in T}$, $(X_t^2)_{t \in T}$, $(X_t^3)_{t \in T}$, $(X_t^4)_{t \in T}$. Three different scenarios/probability distributions $P_0^1, P_0^2, P_0^3$ are considered. They only differ from each other with respect to the conditional distributions $g(P_0^1), g(P_0^2), g(P_0^3)$ (see Table 2.2 for their characterization).

For each $k = 1, 2, 3$, an observation $O = (W, A, Y^1, Y^2, Y^3, Y^4)$ drawn from $P_0^k$ meets the following constraints:

1. $W$ is drawn from a slightly perturbed version of the empirical distribution of $W$ as obtained from the original dataset (the same for all $k = 1, 2, 3$);

2. conditionally on $W$, $A$ is drawn from $g(P_0^k)(\cdot \mid W)$;

3. conditionally on $(A, W)$ and for each $(i, j) \in \{1, 2, 3\} \times \{1, 2, 3, 4\}$, $Y_i^j$ is drawn from the Gaussian distribution with mean $Q_i^j(A, W)$ (the same for all $k = 1, 2, 3$; see Table 2.3 for the definition of the conditional means) and common standard deviation $\sigma \in \{0.5, 1\}$.
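The three constraints translate into a short sampling routine. The sketch below is illustrative only: the logit is that of scenario 2 in Table 2.2, but the conditional means in `toy_conditional_means` are made up (they are not those of Table 2.3), and the perturbation of $W$ (a small Gaussian jitter on the continuous covariates of a resampled row) is an assumption, since the text does not specify it.

```python
import numpy as np

def logit_g_scenario2(w):
    # Scenario 2 of Table 2.2; w[0] stands for W1 and w[4] for W5.
    return np.cos(w[0] + w[4]) + np.sin(w[0] + w[4])

def toy_conditional_means(a, w):
    # Hypothetical 4x3 array of conditional means: row j-1 is protocol j, column i-1 is component i.
    shift = np.array([[0.2, 0.0, -0.2], [0.5, 0.1, -0.5],
                      [1.0, 0.2, -1.0], [0.1, 0.0, -0.1]])
    return a * shift + 0.01 * w[0]

def draw_observation(rng, w_pool, logit_g, cond_means, sigma, jitter=0.01):
    """Draw one (W, A, Y) following steps 1-3 of the simulation scheme."""
    # Step 1: resample a covariate row, then jitter age, height and weight (assumed perturbation).
    w = w_pool[rng.integers(len(w_pool))].astype(float)
    w[[0, 3, 4]] += rng.normal(0.0, jitter, size=3)
    # Step 2: draw the class from the conditional distribution of A given W.
    p = 1.0 / (1.0 + np.exp(-logit_g(w)))
    a = int(rng.random() < p)
    # Step 3: Gaussian summary measures with common standard deviation sigma.
    y = rng.normal(cond_means(a, w), sigma)
    return w, a, y
```

Calling `draw_observation` $n = 54$ times with the same generator yields one simulated dataset of the size used in the study.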


Although that may not be clear when looking at Table 2.2, the difficulty of the classification problem should vary from one scenario to the other. When using the first conditional distribution $g(P_0^1)$, the conditional probability of $A = 1$ given $W$ is concentrated around $\frac{1}{2}$, as seen in Figure 2.4 (solid line), with $P_0^1(g(P_0^1)(1 \mid W) \in [0.48, 0.54]) \simeq 1$. In words, the covariates provide little information for predicting the class $A$. On the other hand, estimating $g(P_0^1)$ from the data is easy since $\operatorname{logit} g(P_0^1)(A = 1 \mid W)$ is a simple linear function of $W$. The conditional probabilities of $A = 1$ given $W$ under $g(P_0^2)$ and $g(P_0^3)$ are less concentrated around $\frac{1}{2}$, as seen in Figure 2.4 (dashed and dotted lines, respectively). Thus, the covariates may provide valuable information for predicting the class. But this time, $\operatorname{logit} g(P_0^2)$ and $\operatorname{logit} g(P_0^3)$ are tricky functions of $W$.

Likewise, the family of conditional means $Q_i^j(A, W)$ of $Y_i^j$ given $(A, W)$ that we use in the simulation scheme is meant to cover a variety of situations with regard to how difficult it is to estimate each of them and how much they tell about the class prediction. Instead of representing the latter conditional means, we find it more relevant to provide the reader with the values (computed by Monte-Carlo simulations) of

$$S^j(P_0^k) = \sum_{i=1}^{3} \left( \frac{\Psi_i^j(P_0^k)}{\sigma_i^j(P_0^k)} \right)^2$$

for $(j, k) \in \{1, 2, 3, 4\} \times \{1, 2, 3\}$ and $\sigma \in \{0.5, 1\}$, see Table 2.4. Indeed, $n S^j(P_0^k)$ should be interpreted as a theoretical counterpart to the criterion $\sum_{i=1}^{3} (T_{i,n}^j)^2$. In particular, we derive from Table 2.4 the theoretical ranking of the protocols: for every scenario $P_0^k$ and $\sigma \in \{0.5, 1\}$, the protocols ranked by decreasing order of informativeness are protocols 3, 2, 1, 4.
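The correspondence between the empirical criterion and $n S^j(P_0^k)$ follows from unfolding the definition of the test statistics given in Section 2.3.2. The display below is a heuristic sketch, not part of the original derivation: it assumes that the estimators of both the parameter and its asymptotic standard deviation are consistent under $P_0^k$.

```latex
\sum_{i=1}^{3} \bigl(T_{i,n}^{j}\bigr)^{2}
  = n \sum_{i=1}^{3} \left( \frac{\Psi_{i,n}^{j}}{\sigma_{i,n}^{j}} \right)^{2}
  \approx n \sum_{i=1}^{3} \left( \frac{\Psi_{i}^{j}(P_{0}^{k})}{\sigma_{i}^{j}(P_{0}^{k})} \right)^{2}
  = n \, S^{j}(P_{0}^{k}).
```

Under this approximation, ordering the protocols by the empirical criterion amounts, for large $n$, to ordering them by $S^j(P_0^k)$.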

Table 2.2: Characterization of the three conditional distributions $g(P_0^k)$, $k = 1, 2, 3$, as considered in the simulation scheme.

scenario 1: $\operatorname{logit} g(P_0^1)(A = 1 \mid W) = \frac{W_1}{50} + \frac{W_2}{50} - \frac{W_3}{10} - \frac{W_4}{2000} + W_5$

scenario 2: $\operatorname{logit} g(P_0^2)(A = 1 \mid W) = \cos(W_1 + W_5) + \sin(W_1 + W_5)$

scenario 3: $\operatorname{logit} g(P_0^3)(A = 1 \mid W) = \lfloor 10 \cos(W_1 + W_3) \rfloor + \sqrt{5 \cos(W_1 + W_3) - \lfloor 5 \cos(W_1 + W_3) \rfloor}\; \frac{\pi}{50} \sin(10 \cos(W_1 + W_3))$

Table 2.3: Conditional means $Q_i^j(A, W)$ of $Y_i^j$ given $(A, W)$, $(i, j) \in \{1, 2, 3\} \times \{1, 2, 3, 4\}$, as used in the three different scenarios of the simulation scheme.

Fictitious protocol $j = 1$:
– $Q_1^1(A, W) = 2[A \sin(W_1 + W_4) + (1 - A) \cos(W_1 + W_5)]$
– $Q_2^1(A, W) = 3[(1 - 6A) X^5 - A X^4 + X^3 - (1 - \frac{A}{2}) X^2 + A X]$, where $X = \frac{(1 - 2A) W_5}{160} + \frac{A}{4}$
– $Q_3^1(A, W) = A \tan(W_4) + (1 - A) \tan(W_5 + W_1 W_2)$

Fictitious protocol $j = 2$:
– $Q_1^2(A, W) = \frac{1}{120}[A + W_1 + W_2 + W_3 + W_5 + W_1 W_2 + (1 - A) W_5 + W_2 W_3 W_4]$
– $Q_2^2(A, W) = 5[A \sin(W_1 + W_4) + (1 - A) \cos(W_1 + W_4)]$
– $Q_3^2(A, W) = \frac{1}{20}[A(2 W_1 + \frac{3}{2} W_3) + (1 - A) W_5]$

Fictitious protocol $j = 3$:
– $Q_1^3(A, W) = A \log(2 W_1 + \frac{3}{2} W_3) + (1 - A) \log(W_5)$
– $Q_2^3(A, W) = \frac{1}{45}(X + 7)(X + 2)(X - 7)(X - 3)$, where $X = \frac{W_4 + W_5}{145 + A W_1}$
– $Q_3^3(A, W) = \pi[A \sin(X)(\lfloor 2X \rfloor + \sqrt{2X - \lfloor 2X \rfloor}) + (1 - A) \cos(X)(\lfloor 2X \rfloor + \sqrt{2X - \lfloor 2X \rfloor})]$, where $X = \cos(W_3 + W_4 + W_5)$

Fictitious protocol $j = 4$:
– $Q_1^4(A, W) = \frac{1}{100}(2 X^3 + X^2 - X - 1)$, where $X = \frac{A W_2 + W_4 + W_5}{30}$
– $Q_2^4(A, W) = \frac{1}{60}(A + W_1 + W_2 + W_3 + W_5)$
– $Q_3^4(A, W) = \frac{1}{1000}[\frac{W_1 W_3 W_4}{3} + (1 - A)(W_1 + W_3 W_4) + A W_2 W_5]$

Figure 2.4: Visual representation of the three conditional distributions considered in the simulation scheme. We plot the empirical cumulative distribution functions of $\{g(P_0^k)(A = 1 \mid W_{(\ell)}) : \ell = 1, \dots, L\}$ for $k = 1$ (solid line), $k = 2$ (dashed line) and $k = 3$ (dotted line), where $W_{(1)}, \dots, W_{(L)}$ are independent copies of $W$ drawn from the marginal distribution of $W$ under $P_0^k$ (which does not depend on $k$), and $L = 10^5$.

2.4.2 Leave-one-out evaluation of the performances of the classification procedure

We rely on the leave-one-out rule to evaluate the performances of the classification procedure. We acknowledge that it usually results in overly optimistic error rates. Specifically, we repeat independently $B = 100$ times the following steps for $k = 1, 2, 3$:

1. Draw independently $O_{(1,b)}, \dots, O_{(n,b)}$ from $P_0^k$, with $n = 54$; we denote by $A_{(\ell,b)}$ the group membership indicator associated with $O_{(\ell,b)}$, and by $O'_{(\ell,b)}$ the observed data structure $O_{(\ell,b)}$ deprived of $A_{(\ell,b)}$.

2. For each $\ell \in \{1, \dots, n\}$,
(a) set $S_{(\ell,b)} = \{O_{(\ell',b)} : \ell' \neq \ell, \ell' \le n\}$;
(b) based on $S_{(\ell,b)}$, rank the protocols (see Section 2.3.2) then build four classifiers $\phi^1_{(\ell,b)}$, $\phi^2_{(\ell,b)}$, $\phi^3_{(\ell,b)}$ and $\phi^4_{(\ell,b)}$ (see Section 2.3.3), which respectively use only the best (most informative), the two best, the three best and all four protocols (thus $\phi^J_{(\ell,b)}$ is a function of the covariate $W$ and of $J$ among the four vectors $Y^1, Y^2, Y^3, Y^4$);
(c) classify $O_{(\ell,b)}$ according to the four classifications $\phi^1_{(\ell,b)}(O'_{(\ell,b)})$, $\phi^2_{(\ell,b)}(O'_{(\ell,b)})$, $\phi^3_{(\ell,b)}(O'_{(\ell,b)})$, $\phi^4_{(\ell,b)}(O'_{(\ell,b)})$.

3. Compute $\mathrm{Perf}_b^J = \frac{1}{n} \sum_{\ell=1}^{n} \mathbf{1}\{A_{(\ell,b)} = \phi^J_{(\ell,b)}(O'_{(\ell,b)})\}$ for $J = 1, 2, 3, 4$.

From these results, we compute for each $J \in \{1, 2, 3, 4\}$ the mean and standard deviation of the sample $(\mathrm{Perf}_1^J, \dots, \mathrm{Perf}_B^J)$. First, all the standard deviations are approximately equal to 5%. Second, for every value of $\sigma \in \{0.5, 1\}$, performance $\mathrm{Perf}^J$ actually depends only slightly on $J$ (i.e., on the number of protocols taken into account in the classification procedure), without any significant difference for $J = 1, 2, 3, 4$. Third, the latter performances all equal approximately 80% when $\sigma = 1$, and increase to approximately 90% when $\sigma = 0.5$. This increase is the expected illustration of the fact that the larger the variability of the summary measures, the more difficult the classification. By contrast, it is a little bit surprising that the conditional distributions $g(P_0^1), g(P_0^2), g(P_0^3)$ do not affect significantly the performances. Anecdotally, the estimated ranking of the protocols always coincides with the ranking that we derived from Table 2.4.


2.5 Application to the real dataset

We present here the results of the classification procedure of Section 2.3 applied to the real dataset. Thus, we first rank the protocols from the most to the least informative regarding postural control, see Section 2.5.1; then we construct the four classifiers and rely on the leave-one-out rule to evaluate their performances, see Section 2.5.2. A natural extension of the classification procedure is considered and applied in Section 2.5.3, and yields significantly better results. We conclude the article with a discussion, see Section 2.5.4.

2.5.1 Targeted maximum likelihood ranking of the protocols over the real dataset

Hemiplegic subjects are known to be sensitive to muscular stimulations, and also to tend to compensate for their proprioceptive deficit by developing a preference for visual information in order to maintain posture (Bonan et al., 1996). This suggests that protocols involving muscular and/or visual stimulations should rank high. What do the data tell us?

We derive and report in Table 2.5 the results of the ranking of the protocols using the entire dataset. Table 2.5 teaches us that the most informative protocol is protocol 3 (visual and muscular stimulations), and that the three next protocols ranked by decreasing order of informativeness are protocols 2 (muscular stimulation), 1 (visual stimulation), and 4 (optokinetic stimulation). Apparently, protocols 3 and 2 (which have in common that muscular stimulations are involved) are highly relevant for differentiating normal and hemiplegic subjects based on postural control data. On the contrary (and perhaps surprisingly, given the introductory remark), protocols 1 and 4 seem to provide significantly less information for the same purpose.

2.5.2 Classification procedures applied to the real dataset

To evaluate the performances of the classification procedure applied to the real dataset, we carry out steps 2a, 2b, 2c from the leave-one-out rule described in Section 2.4.2 where we substitute the real dataset $O_{(1)}, \dots, O_{(n)}$ for the simulated one. We actually do it twice. The first time, the super-learning methodology involves a large collection of estimators; the second time, we justify resorting to a smaller collection (see Section 2.6). We report the results in Table 2.6, where the second and third rows respectively correspond to the first (larger collection) and second (smaller collection) round of performance evaluation.

Consider first the performances of the classification procedure relying on the larger collection. The proportion of subjects correctly classified (evaluated by the leave-one-out rule) equals only 70% (38 out of the 54 subjects are correctly classified) when the sole most informative protocol (i.e., protocol 3) is exploited. This rate jumps to 80% (43 out of 54 subjects are correctly classified) when the two most informative protocols (i.e., protocols 3 and 2) are exploited. Including one or two of the remaining protocols decreases the performances.

The theoretical properties of the super-learning procedure are asymptotic, i.e., valid when the sample size $n$ is large, which is not the case in this study. Even though this is contradictory to the philosophy of the super-learning methodology, it is tempting to reduce the number of estimators involved in the super-learning. We therefore keep only two of them, and run again steps 2a, 2b, 2c from the leave-one-out rule described in Section 2.4.2 where we substitute the real dataset $O_{(1)}, \dots, O_{(n)}$ for the simulated one. Results are reported in Table 2.6 (third row). We obtain better performances: for each value of $J$ (i.e., each number of protocols taken into account in the classification procedure), the second classifier outperforms the first one. The best performance is achieved when all four protocols are used, yielding a rate of correct classification equal to 85% (46 out of the 54 subjects are correctly classified). This is encouraging, notably because one can reasonably expect that performances will improve when a larger cohort is available.

Yet, this is not the end of the story. We have built a general methodology that can be easily extended, for instance by enriching the small-dimensional summary measure derived from each complex trajectory. We explore the effects of such an extension in the next section.

2.5.3 Extension

Thus, we enrich the small-dimensional summary measure initially defined in Section 2.2.2. Since it mainly involves distances from a reference point, the most natural extension is to add information pertaining to orientation. Relying on polar coordinates of the trajectory $(B_t)_{t \in T}$ poses some technical issues. Instead, we propose to fit simple linear models $y(B_t) = v\, x(B_t) + u$ (where $x(B_t)$ and $y(B_t)$ are the abscissa and ordinate of $B_t$) based on the datasets $\{B_t : t \in T \cap [10, 15[\}$, $\{B_t : t \in T \cap [15, 20[\}$, $\{B_t : t \in T \cap [20, 45[\}$, $\{B_t : t \in T \cap [45, 50[\}$ and $\{B_t : t \in T \cap [50, 55[\}$, and to use the slope estimates as summary measures of an average orientation over each time interval. The observed data structure and parameter of interest still write as $O = (W, A, Y^1, Y^2, Y^3, Y^4)$ and $\Psi(P) = (\Psi^j(P))_{1 \le j \le 4}$, but $Y^j$ and $\Psi^j(P)$ now belong to $\mathbb{R}^8$ (and not $\mathbb{R}^3$ anymore). The ranking of the protocols now relies on the criterion $\sum_{i=1}^{8} (T_{i,n}^j)^2$, whose definition straightforwardly extends that of the criterion introduced in Section 2.3.2. The values of the criteria are reported in Table 2.7. The ranking of protocols remains unchanged, but the discrepancies between the values for protocol 2 on one hand and for protocols 1 and 4 on the other hand are smaller.
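The five additional coordinates can be obtained with an ordinary least-squares line fit on each window. The following sketch assumes the barycenters are stored as an array with one row per time point $t = k\delta$ and columns (abscissa, ordinate); the function name `orientation_slopes` is hypothetical.

```python
import numpy as np

# Time windows (in seconds), each read as [start, stop[ as in the text.
WINDOWS = [(10, 15), (15, 20), (20, 45), (45, 50), (50, 55)]

def orientation_slopes(barycenters, delta=0.025, windows=WINDOWS):
    """Least-squares slope v of y(B_t) = v x(B_t) + u on each time window."""
    b = np.asarray(barycenters, dtype=float)
    k = np.arange(1, len(b) + 1)                 # sample k is taken at time k*delta
    slopes = []
    for start, stop in windows:
        mask = (k >= round(start / delta)) & (k < round(stop / delta))
        slope, _intercept = np.polyfit(b[mask, 0], b[mask, 1], deg=1)
        slopes.append(slope)
    return np.array(slopes)
```

Concatenating these five slopes with the three components of (2.1) gives an 8-dimensional summary measure per protocol, matching the dimension stated above.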

We finally apply once again steps 2a, 2b, 2c from the leave-one-out rule described in Section 2.4.2 where we substitute the real dataset $O_{(1)}, \dots, O_{(n)}$ for the simulated one, and use either all estimators or only two of them in the super-learner. The results are reported in Table 2.8.


When we include all estimators in the super-learner, the classification procedure that relies on the extended small-dimensional summary measure of the complex trajectories outperforms the classification procedure that relies on the initial summary measure, for every value of $J$ (i.e., each number of protocols taken into account in the classification procedure). The performances are even better when we only include two estimators. Remarkably, the best performance is achieved using only the most informative protocol, with a proportion of subjects correctly classified (evaluated by the leave-one-out rule) equal to 87% (47 out of the 54 subjects are correctly classified).

2.5.4 Discussion

We conducted a brief simulation study to evaluate the performances of the classification procedure. With its three different scenarios (i.e., three conditional distributions $g(P_0^k)$) and four trajectories (i.e., twelve conditional means $Q_i^j$), the simulation scheme is far from comprehensive. Rather than extending the simulation study, we discuss here what additional scenarios would need to be considered before applying the procedure more generally. In the same spirit as in Section 2.4, one should consider:

– other conditional distributions $g(P_0^k)$, $|g(P_0^k)(A = 1 \mid W) - 1/2|$ being close to 0 with high probability ($W$ weak predictor of $A$) or low probability ($W$ strong predictor of $A$);
– other conditional means $Q_i^j$, $(i, j) \in \{1, 2, 3\} \times \{1, 2, 3, 4\}$, and standard deviation $\sigma$, $\{S^j(P_0^k) : j = 1, 2, 3, 4\}$ having one, two, three or four well-separated values.

A straightforward generalization would consist in allowing the standard deviation of $Y_i^j$ to depend on $(i, j)$. Furthermore, another approach to simulating could be considered, where the trajectories $(X_t^1)_{t \in T}$, $(X_t^2)_{t \in T}$, $(X_t^3)_{t \in T}$, $(X_t^4)_{t \in T}$ would be obtained as realizations of stochastic processes satisfying a variety of piecewise stochastic differential equations (SDEs). For instance, the same SDE could be used to simulate the trajectory during the first and third phases (0→15s and 50→70s, without perturbations), and another SDE could be used to simulate during the second phase (15→50s, with perturbations). On top of that, the breaking points could be drawn randomly from two symmetric distributions centered at 15s and 50s.

This alternative approach to simulating arose while we were trying to quantify in some way how much information is lost when one substitutes a summary measure for the original trajectory for the purpose of classifying. Ultimately such a quantification could make it possible to devise new summary measures with minimal information loss. We did not obtain a satisfactory answer to this very difficult question. However, we identified important information that can be derived from the original trajectory, such as mean orientation, as used in Section 2.5.3, and empirical breaking points, as evoked for the sake of simulating in the previous paragraph, and used for the sake of classifying by Denis (2011).


2.6 Supplement to “Classification in postural style”

We gather in Appendix 2.6.1 a short and self-contained description of the construction of a super-learner, as well as the estimation procedures that we choose to involve for the sake of classifying subjects in postural style. One of those estimation procedures, a variant of the top-scoring pairs classification procedure, is presented in Appendix 2.6.2.

2.6.1 Specifics of the super-learning procedures

We refer to (van der Laan et al., 2007; Rose and van der Laan, 2011) for a general presentation of the super-learning methodology, which is a method for combining/aggregating, by V-fold cross-validation, a family of candidate estimation procedures (or simply estimators) of a regression function.

Denote by $X \in \mathbb{R}^d$ the vector of predictors and by $Y \in \mathbb{R}$ the outcome of interest (in the classification framework, $Y \in \{0, 1\}$). Consider a learning dataset composed of n independent copies $(X_{(1)}, Y_{(1)}), \ldots, (X_{(n)}, Y_{(n)})$ of $(X, Y)$. The empirical distribution of the sample is denoted by $P_n$. For each $k \in \mathcal{K}$, a finite set of cardinality K, $f_k(P_n)$ is an estimator of the regression function $E(Y|X)$ built on $P_n$. The objective is to aggregate the K estimators into a single one which will enjoy some optimality/oracle properties with respect to a certain criterion of interest. The criterion of interest drives the choice of a loss function L, which maps a generic observation $(X, Y)$ and a candidate estimator f of the regression function to a real number $L((X, Y), f)$. We use two different loss functions, one for the estimation of the conditional means $Q_i^j(P_0)$ and the conditional distribution $g(P_0)$ as needed to rank the protocols (see Section 3.2), one for the classification of subjects (see Section 3.3). Regarding how to combine $(f_k(P_n))_{k \in \mathcal{K}}$, we decide to consider convex combinations.

We use the value V = 10, and therefore draw an n-tuple $(V_{(1)}, \ldots, V_{(n)}) \in \{1, \ldots, V\}^n$ such that $\max_{v, v' \le V} |\sum_{i=1}^n 1\{V_{(i)} = v\} - \sum_{i=1}^n 1\{V_{(i)} = v'\}| \le 1$ (the v-strata have the same cardinality, up to one unit). In the classification framework, we also impose that $\min_{v \le V, y = 0, 1} \sum_{i=1}^n 1\{Y_{(i)} = y, V_{(i)} = v\} \ge 1$ (each v-stratum contains at least one observation from each class). For every $v \le V$, denote by $P_n^v$ the empirical distribution of those observations for which $V_{(i)} \neq v$; the training set made of the latter observations yields K estimators $(f_k(P_n^v))_{k \in \mathcal{K}}$. We define the minimizer of the V-fold cross-validated risk:

\[
\alpha^*(P_n) = \arg\min_{\alpha \in S_K} \sum_{v \le V} \sum_{i=1}^{n} L\Big( (X_{(i)}, Y_{(i)}), \sum_{k \in \mathcal{K}} \alpha_k f_k(P_n^v) \Big) 1\{V_{(i)} = v\}
\]

(where $S_K = \{u \in \mathbb{R}_+^K : \sum_{k \le K} u_k = 1\}$), which finally yields the super-learner obtained as the $\alpha^*(P_n)$-convex combination of the K estimators $(f_k(P_n))_{k \in \mathcal{K}}$ trained on the whole sample, $f^*(P_n) = \sum_{k \in \mathcal{K}} \alpha_k^*(P_n) f_k(P_n)$.
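The fold constraints and the choice of the convex weights can be sketched as follows. This is a minimal illustration, not the authors' R implementation: the function names `balanced_folds` and `cv_weight` are hypothetical, the weight search is restricted to two candidate estimators and a regular grid on [0, 1], and the out-of-fold predictions are assumed to have been computed beforehand.

```python
import random


def balanced_folds(y, V, seed=0):
    """Assign each observation to a fold in {1, ..., V}.

    Observations are ordered class by class (shuffled within class) and
    dealt round-robin, so fold sizes differ by at most one, and every
    class with at least V members is represented in every fold.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    order = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        order.extend(members)
    folds = [0] * len(y)
    for rank, i in enumerate(order):
        folds[i] = rank % V + 1
    return folds


def cv_weight(preds_a, preds_b, y, loss, grid=101):
    """Grid search for the weight alpha of a two-estimator convex combination.

    preds_a[i] and preds_b[i] are assumed to be the predictions for
    observation i made by estimators trained WITHOUT the fold containing i,
    so the summed loss is the V-fold cross-validated risk.
    """
    best_alpha, best_risk = 0.0, float("inf")
    for g in range(grid):
        alpha = g / (grid - 1)
        risk = sum(loss(yi, alpha * a + (1.0 - alpha) * b)
                   for a, b, yi in zip(preds_a, preds_b, y))
        if risk < best_risk:
            best_alpha, best_risk = alpha, risk
    return best_alpha
```

With more than two candidates the grid would be replaced by a constrained optimizer over the simplex; the structure of the criterion stays the same.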

We now turn to the description of the loss function L and of the estimators $(f_k)_{k \in \mathcal{K}}$ specifically used in the article.


– Estimation of the conditional means $Q_i^j(P_0)$ and the conditional distribution $g(P_0)$. We choose the squared error loss function, characterized by

\[
L_2((X, Y), f) = (Y - f(X))^2,
\]

whose expectation is minimized, over the set of measurable functions of X, at the targeted regression function $E(Y|X)$.
Regarding the estimators $(f_k)_{k \in \mathcal{K}}$,
  – conditional means $Q_i^j(P_0)$: we rely on (alphabetical order) the elastic net (Tibshirani, 1996; Tibshirani et al., 2010), generalized additive models (Hastie and Tibshirani, 1990), linear models, loess local regressions (Loader, 1999), random forests (Breiman, 2001), and support vector machines (Cristianini and Shawe-Taylor, 2000). Different values of the various tuning parameters are considered.
  – conditional distribution $g(P_0)$: we rely on (alphabetical order) the k-nearest-neighbors, logistic linear regressions, random forests, and support vector machines. Different values of the various tuning parameters are considered.

– Classification of subjects. We choose the loss function characterized by

\[
L_3((X, Y), f) = \big(Y - \mathrm{expit}\{\beta (f(X) - \tfrac{1}{2})\}\big)^2
\]

with β = 200 a large positive number. This loss function is meant to provide a trade-off between the loss functions characterized by $L_1((X, Y), f) = (Y - 1\{f(X) \ge \tfrac{1}{2}\})^2$ and $L_2((X, Y), f) = (Y - f(X))^2$. Since $L_3$ is very close to $L_1$, the super-learner optimizes the convex-combination parameter $\alpha^*(P_n)$ to be applied to the collection $(f_k(P_n))_{k \in \mathcal{K}}$ of estimators in view of the plug-in rule that will ultimately be used (see Section 3.3). Yet $L_3$ is smoother than $L_1$ (which takes its values in {0, 1}), thus in that sense not so far away from $L_2$, which makes the numerical computation of $\alpha^*(P_n)$ easier for deriving the super-learner.
Regarding the estimators $(f_k)_{k \in \mathcal{K}}$, we rely on (alphabetical order) the k-nearest-neighbors, logistic regressions, random forests, and the top-scoring pairs classification procedure (see Appendix 2.6.2). Different values of the tuning parameters are considered for the k-nearest-neighbors and random forests. We also consider the smaller collection of estimators that reduces to random forests (with a single choice of the tuning parameters) and the top-scoring pairs classification procedure.
Interestingly, the top-scoring pairs classification procedure involves pairwise comparisons $1\{X^j \le X^{j'}\}$ ($1 \le j \neq j' \le d$) of predictors: how relevant such comparisons may be for the sake of classification is not taken into account by our method to rank protocols by decreasing order of informativeness relative to postural control.
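To make the relation between the three loss functions concrete, the following sketch evaluates them at a single prediction. The function names are illustrative choices, and the stable form of `expit` is an implementation detail not discussed in the text; only the formulas and the value β = 200 come from the description above.

```python
import math

BETA = 200.0  # value of beta stated in the text


def expit(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss_l1(y, fx):
    """Misclassification loss of the plug-in rule 1{f(X) >= 1/2}."""
    return (y - (1.0 if fx >= 0.5 else 0.0)) ** 2


def loss_l2(y, fx):
    """Squared error loss."""
    return (y - fx) ** 2


def loss_l3(y, fx, beta=BETA):
    """Smoothed version of loss_l1 used for the classification step."""
    return (y - expit(beta * (fx - 0.5))) ** 2
```

Away from the threshold 1/2 the values of `loss_l3` and `loss_l1` are numerically almost identical, while `loss_l3` remains differentiable in the prediction, which is the property the text relies on when computing the convex weights.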

The R (R Development Core Team, 2010) coding of our procedure was eased by the following R-packages: SuperLearner, e1071, gam, glmnet, randomForest and stats (we refer to The Comprehensive R Archive Network, www.cran.r-project.org/, for details).


2.6.2 The top-scoring pairs classification procedure

The top-scoring pair (abbreviated to TSP) classification procedure was introduced by Geman et al. (2004) for the purpose of molecular classification based on some genetic information. It is parameter-free and simply relies on pairwise comparisons of predictors, hence its remarkable robustness. Even though the context of the present article greatly differs from that of (Geman et al., 2004) (mostly in that the number of predictors is small here whereas it is huge there), the very good performances enjoyed by the TSP classifier for molecular classification motivate making a variant of the TSP classification procedure one of the candidate estimators in the super-learner.

Denote by $X = (X^1, \ldots, X^d) \in \mathbb{R}^d$ the vector of predictors and $Y \in \{0, 1\}$ the class membership indicator. The objective is to estimate the regression function $P(Y = 1|X)$ based on n independent copies of $(X, Y) \sim P$ that we denote $(X_{(1)}, Y_{(1)}), \ldots, (X_{(n)}, Y_{(n)})$. One first derives the so-called TSP,

\[
(k_0, \ell_0) = \arg\max_{1 \le k < \ell \le d} \left| \frac{\sum_{i=1}^n 1\{X^k_{(i)} \le X^\ell_{(i)}, Y_{(i)} = 1\}}{\sum_{i=1}^n 1\{Y_{(i)} = 1\}} - \frac{\sum_{i=1}^n 1\{X^k_{(i)} \le X^\ell_{(i)}, Y_{(i)} = 0\}}{\sum_{i=1}^n 1\{Y_{(i)} = 0\}} \right|
\]

(we implicitly assume that the latter arg max reduces to a single pair; the procedure easily extends to the case that the arg max contains several pairs by making them vote, see (Geman et al., 2004)). The TSP $(k_0, \ell_0)$ is chosen in such a way that the empirical conditional probabilities of having $X^{k_0} \le X^{\ell_0}$ given Y = 1 or Y = 0 differ as much as possible. Introduce now

\[
p^-_{0,n} = \frac{\sum_{i=1}^n 1\{X^{k_0}_{(i)} \le X^{\ell_0}_{(i)}, Y_{(i)} = 1\}}{\sum_{i=1}^n 1\{X^{k_0}_{(i)} \le X^{\ell_0}_{(i)}\}},
\qquad
p^+_{0,n} = \frac{\sum_{i=1}^n 1\{X^{k_0}_{(i)} > X^{\ell_0}_{(i)}, Y_{(i)} = 1\}}{\sum_{i=1}^n 1\{X^{k_0}_{(i)} > X^{\ell_0}_{(i)}\}}.
\]

Our TSP estimator of the regression function $P(Y = 1|X)$ finally writes as

\[
P_n^{TSP}(Y = 1|X) = p^-_{0,n} 1\{X^{k_0} \le X^{\ell_0}\} + p^+_{0,n} 1\{X^{k_0} > X^{\ell_0}\}.
\]

An alternative definition (closer to the original TSP procedure of Geman et al. (2004)) could have been to introduce

\[
\pi^y_{0,n} = \frac{\sum_{i=1}^n 1\{X^{k_0}_{(i)} \le X^{\ell_0}_{(i)}, Y_{(i)} = y\}}{\sum_{i=1}^n 1\{Y_{(i)} = y\}}
\]

for y = 0, 1 and to estimate the regression function $P(Y = 1|X)$ by $\pi^1_{0,n} 1\{\pi^1_{0,n} \ge \pi^0_{0,n}\} + (1 - \pi^0_{0,n}) 1\{\pi^1_{0,n} < \pi^0_{0,n}\}$ if $X^{k_0} \le X^{\ell_0}$, and by $(1 - \pi^0_{0,n}) 1\{\pi^1_{0,n} \ge \pi^0_{0,n}\} + \pi^1_{0,n} 1\{\pi^1_{0,n} < \pi^0_{0,n}\}$ otherwise.
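The TSP variant retained in this appendix (the estimator built from $p^-_{0,n}$ and $p^+_{0,n}$, not the alternative definition) can be sketched in a few lines. The names `fit_tsp` and `predict_tsp` are hypothetical; ties in the arg max are broken by keeping the first pair found rather than by voting, and the fallback used when one side of the comparison is empty is an assumption that the text does not address. Both classes must be present and d must be at least 2.

```python
def fit_tsp(X, y):
    """Fit the TSP variant.

    X: list of n predictor vectors of length d; y: list of n labels in {0, 1}.
    Returns (k0, l0, p_minus, p_plus) with 0-based predictor indices.
    """
    n, d = len(X), len(X[0])
    n1 = sum(y)
    n0 = n - n1
    best_pair, best_score = None, -1.0
    for k in range(d):
        for l in range(k + 1, d):
            c1 = sum(1 for x, yi in zip(X, y) if x[k] <= x[l] and yi == 1)
            c0 = sum(1 for x, yi in zip(X, y) if x[k] <= x[l] and yi == 0)
            score = abs(c1 / n1 - c0 / n0)
            if score > best_score:  # ties: keep the first pair found
                best_pair, best_score = (k, l), score
    k0, l0 = best_pair
    below = [yi for x, yi in zip(X, y) if x[k0] <= x[l0]]
    above = [yi for x, yi in zip(X, y) if x[k0] > x[l0]]
    # Assumed fallback: overall frequency of class 1 when a side is empty.
    p_minus = sum(below) / len(below) if below else n1 / n
    p_plus = sum(above) / len(above) if above else n1 / n
    return k0, l0, p_minus, p_plus


def predict_tsp(model, x):
    """Estimated P(Y = 1 | X = x) under the fitted TSP model."""
    k0, l0, p_minus, p_plus = model
    return p_minus if x[k0] <= x[l0] else p_plus
```

Because the estimator only takes two values, its contribution to a convex combination is a step function of a single pairwise comparison, which is what makes it complementary to smoother candidates such as random forests.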


Acknowledgments

The authors thank I. Bonan (Service de Médecine Physique et de réadaptation, CHU Rennes) and P-P. Vidal (CESEM, Université Paris Descartes) for introducing them to this interesting problem and providing the dataset. They also warmly thank A. Samson (MAP5, Université Paris Descartes) for several fruitful discussions, and the editor for suggesting improvements.


Table 2.4: Values of $S^j(P_0^k)$ for $(j, k) \in \{1, 2, 3, 4\} \times \{1, 2, 3\}$ and $\sigma \in \{0.5, 1\}$. They notably teach us that, for every scenario $P_0^k$ and $\sigma \in \{0.5, 1\}$, the protocols ranked by decreasing order of informativeness are protocols 3, 2, 1, 4.

fictitious        scenario 1          scenario 2          scenario 3
protocol       σ = 0.5   σ = 1     σ = 0.5   σ = 1     σ = 0.5   σ = 1
j = 1           0.14     0.04       0.11     0.03       0.14     0.04
j = 2           0.86     0.37       0.74     0.31       0.85     0.37
j = 3           2.94     1.12       2.49     0.93       2.90     1.11
j = 4           0.06     0.01       0.04     0.01       0.06     0.01

Table 2.5: Ranking the four protocols using the entire real dataset. We report the realizations of the criteria $\sum_{i=1}^{3} (T^j_{i,n})^2$ obtained for protocols j = 1, 2, 3, 4. These values teach us that the most informative protocol is protocol 3, and that the three next protocols ranked by decreasing order of informativeness are protocols 2, 1, and 4.

protocol                                      j = 3    j = 2    j = 1    j = 4
criterion $\sum_{i=1}^{3} (T^j_{i,n})^2$      75.51    33.13    6.80     5.53

Table 2.6: Leave-one-out performances $\mathrm{Perf}_J$ of the classification procedure using the real dataset. Performance $\mathrm{Perf}_J$ corresponds to the classifier based on J among the four vectors $Y^1, Y^2, Y^3, Y^4$ (those associated with the J most informative protocols) and either using all estimators (second row) or only two of them (third row) in the super-learner (see Appendix A in 2.6).

                     J = 1           J = 2           J = 3           J = 4
Perf_J (all est.)    0.70 (38/54)    0.80 (43/54)    0.74 (40/54)    0.78 (42/54)
Perf_J (two est.)    0.74 (40/54)    0.81 (44/54)    0.78 (42/54)    0.85 (46/54)

Table 2.7: Ranking the four protocols using the entire real dataset and the extended small-dimensional summary measure of the complex trajectories. We report the realizations of the criteria $\sum_{i=1}^{8} (T^j_{i,n})^2$ obtained for protocols j = 1, 2, 3, 4. The ranking is the same as that derived from Table 2.5.

protocol                                      j = 3    j = 2    j = 1    j = 4
criterion $\sum_{i=1}^{8} (T^j_{i,n})^2$      83.64    43.61    14.92    12.60


Table 2.8: Leave-one-out performances $\mathrm{Perf}_J$ of the classification procedure using the real dataset and the extended small-dimensional summary measure of the complex trajectories. Performance $\mathrm{Perf}_J$ corresponds to the classifier based on J among the four vectors $Y^1, Y^2, Y^3, Y^4$ (those associated with the J most informative protocols) and either using all estimators (second row) or only two of them (third row) in the super-learner (see Appendix A in 2.6).

                     J = 1           J = 2           J = 3           J = 4
Perf_J (all est.)    0.82 (44/54)    0.80 (43/54)    0.80 (43/54)    0.78 (42/54)
Perf_J (two est.)    0.87 (47/54)    0.85 (46/54)    0.80 (43/54)    0.82 (44/54)


Chapter 3

Diffusion processes for classification in postural maintenance


Classification in postural style based on stochastic process modeling

Abstract:

We address the statistical challenge of classifying subjects as hemiplegic, vestibular or normal based on complex trajectories obtained through two experimental protocols which were designed to evaluate potential deficits in postural control. The classification procedure involves a dimension reduction step where the complex trajectories are summarized by finite-dimensional summary measures based on a stochastic process model for a real-valued trajectory. This allows us to retrieve from the trajectories information relative to their temporal dynamics. A leave-one-out evaluation yields a 79% performance of correct classification for a total of n = 70 subjects, with 22 hemiplegic (31%), 16 vestibular (23%) and 32 normal (46%) subjects.

Keywords: change point estimation; multiclass classification; cross-validation; postural maintenance; stochastic process modeling.

3.1 Introduction

This article contributes to the study of postural maintenance. Posture is fundamental for physical activity. A deficit in postural maintenance often results in falling, which is particularly hazardous in elderly people. The main objective of the research in postural maintenance is to adapt protocols for functional rehabilitation for people who display deficits in maintaining posture. We focus here on the issue of classifying subjects in terms of postural maintenance. A cohort of 70 subjects has been followed at the center for the study of sensorimotor functioning (CESEM) of the University Paris Descartes. The subjects who did not exhibit deficit in postural maintenance were labelled as normal. The others were hemiplegic and vestibular subjects and labelled accordingly. A hemiplegic subject suffers from a disorder of his proprioceptive system, which pertains to the sense of position, location, orientation of the body and its parts. A vestibular subject suffers from a deficit of his vestibular system, which is the system composed of the inner ear and the vestibular nerve and contributes to the sense of balance.

Each subject completed two experimental protocols designed to evaluate his/her potential deficits in maintaining posture. Each protocol is divided into three phases: a first phase of 15 seconds with no postural perturbation, a second phase of 35 seconds with postural perturbation followed by a last phase of 20 seconds without postural perturbation. During each protocol, measurements of the center-of-pressure of each foot are performed at discrete times, which results in temporal trajectories. Our objective is to classify subjects as hemiplegic, vestibular, and normal based on these trajectories. This is a significantly more difficult extension of the problem considered by Chambaz and Denis (2012), where only hemiplegic and normal subjects were to be classified (particularly because classifying into three classes is more difficult than classifying into two classes), and thus another step in the direction of clustering subjects in terms of their postural style. We refer to the bibliography in (Chambaz and Denis, 2012) for a review on the analysis of postural control.

Our present study is related to the topic of functional data classification. There is a sizeable literature dedicated to this topic. A variety of methods have been proposed, relying for instance on linear discriminant analysis (James and Hastie, 2001), principal component analysis (Hall et al., 2001), or a functional data version of the nearest-neighbors classification rule (Biau et al., 2005). We refer to (Ramsay and Silverman, 2005) for a general introduction to functional data analysis. Our approach here involves a dimension reduction step of our complex trajectories based on change point estimation and inference on a stochastic process model. This allows us to make better use of the data at hand than in (Chambaz and Denis, 2012) in the sense that we manage to retrieve from the trajectories information relative to their temporal dynamics. In contrast, Chambaz and Denis (2012) retrieve static information from the trajectories in the sense that the dimension reduction step relies on comparisons of basic statistics (such as the mean value of a segment of the trajectory, see Section 3.5.2) computed on arbitrarily chosen time intervals starting or ending where the perturbation phase starts or ends.

The dataset and its modeling are introduced in Section 3.2. We present our inference procedure in Section 3.3. We carry it out on real and simulated data, and summarize the results in the latter section. The classification procedure is presented in Section 3.4. The results of its application to the real dataset are presented and commented on in Section 3.5, which concludes with some perspectives for future research.

3.2 Data and modeling

The dataset at hand is described in Section 3.2.1. We introduce and motivate its modeling in Section 3.2.2.

3.2.1 Original dataset

The dataset was collected at the center for the study of sensorimotor functioning (CESEM) of the University Paris Descartes. It is composed of a cohort of n = 70 subjects. Among the 70 subjects, 22 are hemiplegic (due to cerebrovascular accidents), 16 are vestibular, while the 32 remaining subjects are identified as normal based on an initial medical evaluation.

Each subject completes two protocols which are designed to evaluate potential deficits in maintaining posture. These protocols have been identified as the most informative among four similar protocols for classifying hemiplegic versus normal subjects in the earlier study (Chambaz and Denis, 2012). Both protocols are divided into three phases: a first phase of 15 seconds with no postural perturbation, a second phase of 35 seconds with postural perturbations (either some muscular perturbations or a combination of muscular and visual perturbations), and a third phase of 20 seconds with no postural perturbation. We expect that the subject’s postural sway changes around the beginning and the end of the perturbation phase (around 15 and 50 seconds).

For each protocol, the center-of-maximal-pressure exerted by each foot on a force-platform is recorded at equispaced discrete times. Thus each protocol results in a trajectory $(L_i, R_i)_{i \le N}$, where $(L_i, R_i)$ is the observation at time $t_i = i\delta$ for $i = 1, \ldots, N = 2800$ and $\delta = 0.025$ seconds. For each $i = 1, \ldots, N$, $L_i = (L_i^1, L_i^2) \in \mathbb{R}^2$ and $R_i = (R_i^1, R_i^2) \in \mathbb{R}^2$ respectively correspond to the left and right feet.

protocol   1st phase (0→15s)   2nd phase (15→50s)                   3rd phase (50→70s)
1          no perturbation     muscular stimulation                 no perturbation
2          no perturbation     eyes closed, muscular stimulation    no perturbation

Table 3.1: Specifics of the two protocols considered in this study. A protocol is divided into three phases: a first phase with no postural perturbation is followed by a second phase with perturbations, which is itself followed by a third phase without perturbation. Different kinds of perturbations are considered. In protocol 1, one perturbs the processing by the brain of proprioceptive information (through muscular stimulation). In protocol 2, one perturbs the processing by the brain of both visual information (subjects must close their eyes) and proprioceptive information (through muscular stimulation).

Following Chambaz and Denis (2012), we derive from $(L_i, R_i)_{i \le N}$ the one-dimensional trajectory $C_{1:N}$ (sometimes we will denote an n-tuple $(x_1, \ldots, x_n)$ by $x_{1:n}$) characterized by

\[
C_i = \left\| \frac{L_i + R_i}{2} - \gamma \right\|_2,
\]

where γ is defined as the componentwise median value of $(\tfrac{1}{2}(L_i + R_i))_{i \le 400}$ over the first ten seconds of the protocol. The process $C_{1:N}$ provides a relevant description of the sway of the body during the course of the protocol. In Figure 3.1 we display the trajectories $C_{1:N}$ corresponding to protocol 1 (left plot) and protocol 2 (right plot) as completed by a single hemiplegic subject. Figure 3.1 confirms the intuition that the subject’s postural sway is not necessarily instantaneously affected by the start and end of perturbations.
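The construction of the sway trajectory can be illustrated with a short sketch. The function name `sway_trajectory` and the representation of each foot's trajectory as a list of (x, y) pairs are assumptions made for the example; the default of 400 reference points corresponds to the first ten seconds at δ = 0.025 s.

```python
import math
from statistics import median


def sway_trajectory(left, right, n_ref=400):
    """Distance of the mid-point between the feet to its reference position.

    left, right: lists of (x, y) positions, one pair per time step.
    The reference gamma is the componentwise median of the mid-point
    over the first n_ref time steps.
    """
    mid = [((l[0] + r[0]) / 2.0, (l[1] + r[1]) / 2.0)
           for l, r in zip(left, right)]
    gamma = (median(m[0] for m in mid[:n_ref]),
             median(m[1] for m in mid[:n_ref]))
    return [math.hypot(m[0] - gamma[0], m[1] - gamma[1]) for m in mid]
```

By construction the resulting values are nonnegative, which is one of the two features the modeling section uses to motivate the choice of process.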

3.2.2 Data modeling

We model the trajectory $C_{1:N}$ as the observation at discrete times of a stochastic process $(C(t))_{t \in [\delta, N\delta]}$ characterized by a stochastic differential equation.

We model the effects of the perturbations by possible changes in the volatility and drift functions at two unknown change points. In order to account for the fact that subjects react differently, we also assume that the change points differ among subjects. We denote by $(T_1, T_2)$ the change points of a trajectory, with $T_0 = \tau_0\delta = \delta < T_1 = \tau_1\delta < T_2 = \tau_2\delta < T_3 = 70$ ($\tau_0 = 1$, $\tau_1$ and $\tau_2$ are integers). We assume that there exist two functions a, b, and two parameters $(\phi_1, \phi_2, \phi_3)$ and $(\sigma_1, \sigma_2, \sigma_3)$ such that, for all $t \in [T_{k-1}, T_k[$ and k = 1, 2, 3,

\[
\begin{cases}
dC(t) = a(C(t), \phi_k)\,dt + b(C(t), \sigma_k)\,dW(t) \\
C(T_{k-1}) = C_{\tau_{k-1}},
\end{cases}
\]

where $(W(t))_{t \in [\delta, N\delta]}$ is a standard Wiener process.

Figure 3.1: Two trajectories $C_{1:N}$ respectively corresponding to protocol 1 (left) and protocol 2 (right) undergone by a single hemiplegic subject.

We now specify the parametric forms of the drift and volatility functions a and b. Because Figure 3.1 indicates that the variance of $C_{1:N}$ is not constant over time and because, in addition, the process $C_{1:N}$ takes positive values, we decide to rely on the classical Cox-Ingersoll-Ross (CIR) process (Kloeden and Platen, 1992), by setting $a(x, \phi = (\lambda, \mu)) = \lambda(\mu - x)$ and $b(x, \sigma) = \sigma\sqrt{x}$ (for all $x \ge 0$).

In summary, we model the trajectory $C_{1:N}$ as the observation at discrete times of the stochastic process $(C(t))_{t \in [\delta, N\delta]}$ characterized, for $t \in [T_{k-1}, T_k[$ (k = 1, 2, 3), by

\[
\begin{cases}
dC(t) = \lambda_k(\mu_k - C(t))\,dt + \sigma_k\sqrt{C(t)}\,dW(t) \\
C(T_{k-1}) = C_{\tau_{k-1}},
\end{cases}
\tag{3.1}
\]

where the initial condition $C(T_{k-1})$ is the observation $C_{\tau_{k-1}}$ at time $T_{k-1}$, and $\lambda_k$, $\mu_k$ and $\sigma_k$ are positive constants. It is known that if $2\mu_k\lambda_k/\sigma_k^2 \ge 1$ then the process remains positive and admits a stationary distribution. This distribution is a Gamma distribution with shape parameter $2\mu_k\lambda_k/\sigma_k^2$ and scale parameter $\sigma_k^2/(2\lambda_k)$ (Kloeden and Platen, 1992). Therefore, the complete parameter writes as $(\tau_1, \tau_2, \theta_1, \theta_2, \theta_3)$ with $\theta_k = (\lambda_k, \mu_k, \sigma_k)$ for k = 1, 2, 3. We assume that the changes necessarily affect the drift parameter through a change in µ, or put in other words that $\mu_1 \neq \mu_2$ and $\mu_2 \neq \mu_3$. Changes can also affect the other parameters.
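As a quick check on the stationary regime, the first two moments follow directly from the shape and scale parameters of the Gamma distribution stated above (mean equals shape times scale, variance equals shape times squared scale). The computation below is a standard property of the Gamma distribution rather than a result taken from the text; it shows why $\mu_k$ can be read as the long-run level of the sway on the k-th interval.

```latex
\[
E[C] = \frac{2\mu_k\lambda_k}{\sigma_k^2} \cdot \frac{\sigma_k^2}{2\lambda_k} = \mu_k,
\qquad
\mathrm{Var}[C] = \frac{2\mu_k\lambda_k}{\sigma_k^2} \cdot \left(\frac{\sigma_k^2}{2\lambda_k}\right)^2 = \frac{\mu_k\sigma_k^2}{2\lambda_k}.
\]
```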


3.3 Inference

In this section we address the inference of the parameter $(\tau_1, \tau_2, \theta_1, \theta_2, \theta_3)$. We first describe the estimating procedure in Section 3.3.1, then we present its results on the real dataset in Section 3.3.2, and finally we summarize a simulation study performed to evaluate its performances in Section 3.3.3.

3.3.1 A two-step estimating procedure

We define a two-step estimating procedure: we first estimate the sequence of change points, then we estimate $(\theta_1, \theta_2, \theta_3)$ on each interval characterized by the previously estimated change points.

A great variety of methods have been proposed for the estimation of change points (Basseville and Nikiforov, 1993; Bai, 1994; Cuenod et al., 2011; Lavielle, 2005; Bardet et al., 2012, among many others). We choose to adopt a popular approach originally proposed by Bai (1994). The approach relies on a least-squares criterion and aims at detecting change points which affect the mean of a linear process. The estimator $(\hat\tau_1, \hat\tau_2)$ of the sequence of change points $(\tau_1, \tau_2)$ is defined as:

\[
(\hat\tau_1, \hat\tau_2) = \arg\min_{(\tau_1, \tau_2)} \frac{1}{N} \sum_{k=1}^{3} \sum_{i=\tau_{k-1}}^{\tau_k - 1} (C_i - \bar{C}_k)^2,
\]

where $\bar{C}_k$ is the arithmetic mean of $C_{\tau_{k-1}:\tau_k - 1}$. This makes sense because if $(C(t))_{t \in [T_{k-1}, T_k[}$ reaches a stationary regime then the random variables $C_{\tau_{k-1}}, \ldots, C_{\tau_k - 1}$ are identically distributed from a stationary distribution whose mean parameter is $\mu_k$.
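The least-squares criterion can be minimized by exhaustive search over all pairs of split positions, using prefix sums so that each segment cost is obtained in constant time. The sketch below is one possible implementation, not the authors' code: the name `estimate_change_points`, the `min_len` argument and the 0-based convention are assumptions. It returns split positions (t1, t2) such that the three segments are `c[:t1]`, `c[t1:t2]` and `c[t2:]`, it includes the last observation in the third segment, and it drops the constant factor 1/N, which does not affect the arg min.

```python
def estimate_change_points(c, min_len=2):
    """Two change points minimising the within-segment sum of squares.

    c: list of observations; returns 0-based split positions (t1, t2).
    The search is quadratic in len(c).
    """
    n = len(c)
    s1, s2 = [0.0], [0.0]  # prefix sums of values and of squared values
    for v in c:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(a, b):
        """Sum of squared deviations from the mean over c[a:b]."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best, best_cost = (None, None), float("inf")
    for t1 in range(min_len, n - 2 * min_len + 1):
        left = sse(0, t1)
        for t2 in range(t1 + min_len, n - min_len + 1):
            cost = left + sse(t1, t2) + sse(t2, n)
            if cost < best_cost:
                best, best_cost = (t1, t2), cost
    return best
```

For a trajectory of N = 2800 points this amounts to a few million cost evaluations, which remains tractable without dynamic programming.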

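Set $k \in \{1, 2, 3\}$. On each interval $[T_{k-1}, T_k[ = [\tau_{k-1}\delta, \tau_k\delta[$, we estimate $\theta_k$ by minimizing a contrast function which is based on the log-likelihood of the approximate discrete-time Euler-Maruyama scheme (Kessler, 1997, for instance). The latter scheme with step size δ guarantees that, for $i = \tau_{k-1}, \ldots, \tau_k - 1$,

\[
C_{i+1} \approx (1 - \delta\lambda_k) C_i + \delta\lambda_k\mu_k + \sigma_k\sqrt{\delta}\sqrt{C_i}\,\eta_{i+1},
\tag{3.2}
\]

where $(\eta_i)_{\tau_{k-1} < i \le \tau_k}$ is a sequence of independent random variables with standard Gaussian distribution. By (3.2), it is convenient to consider the following equivalent parametrization: $\theta_k = (\theta_{1,k}, \theta_{2,k}, \theta_{3,k})$, with $\theta_{1,k} = (1 - \delta\lambda_k)$, $\theta_{2,k} = \delta\lambda_k\mu_k$ and $\theta_{3,k} = \sigma_k\sqrt{\delta}$. The estimator $\hat\theta_k$ of parameter $\theta_k$ is defined as:

\[
\hat\theta_k = \arg\min_{\theta_k \in \mathbb{R}_+^3} \sum_{i=\tau_{k-1}}^{\tau_k - 1} \frac{(C_{i+1} - \theta_{1,k} C_i - \theta_{2,k})^2}{C_i\,\theta_{3,k}^2} + (\tau_k - \tau_{k-1}) \log(\theta_{3,k}^2),
\]

which actually yields closed-form expressions:

\[
\hat\theta_{1,k} = \frac{(\tau_k - \tau_{k-1}) \sum \frac{C_{i+1}}{C_i} - \sum C_{i+1} \sum \frac{1}{C_i}}{(\tau_k - \tau_{k-1})^2 - \sum C_i \sum \frac{1}{C_i}},
\]

\[
\hat\theta_{2,k} = \frac{\sum C_{i+1} - \hat\theta_{1,k} \sum C_i}{\tau_k - \tau_{k-1}},
\]

\[
\hat\theta_{3,k} = \sqrt{\frac{\sum \frac{(C_{i+1} - \hat\theta_{1,k} C_i - \hat\theta_{2,k})^2}{C_i}}{\tau_k - \tau_{k-1}}}
\]

(the sums in the above expressions range over $[\tau_{k-1}, \tau_k - 1]$).

Under mild conditions, and if the process reaches the stationary regime, then $(\hat\tau_1, \hat\tau_2)$ consistently estimates $(\tau_1, \tau_2)$ (see Lavielle, 1999; Lavielle and Ludeña, 2000, for instance). Furthermore, if the true change points $(\tau_1, \tau_2)$ are known then, under another set of mild conditions, the estimators $(\hat\theta_1, \hat\theta_2, \hat\theta_3)$ consistently estimate $(\theta_1, \theta_2, \theta_3)$ (see Kessler, 1997, Theorem 1). To the best of our knowledge, there is no satisfactory result in the literature regarding the joint estimation of the change points $(\tau_1, \tau_2)$ and the parameter $(\theta_1, \theta_2, \theta_3)$.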
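The closed-form expressions are those of a weighted least-squares regression of $C_{i+1}$ on $C_i$ with weights $1/C_i$, followed by the weighted residual variance. A minimal sketch for one segment follows; the name `estimate_theta` is hypothetical, all observations are assumed strictly positive, and the segment is assumed non-constant (otherwise the denominator of the first estimator vanishes).

```python
import math


def estimate_theta(c):
    """Closed-form contrast estimators on one segment.

    c: observations C_{tau_{k-1}}, ..., C_{tau_k}, all > 0, so that the
    m = len(c) - 1 transitions (C_i, C_{i+1}) are used.
    Returns (theta1, theta2, theta3).
    """
    cur, nxt = c[:-1], c[1:]
    m = len(cur)
    s_ratio = sum(b / a for a, b in zip(cur, nxt))
    s_next = sum(nxt)
    s_cur = sum(cur)
    s_inv = sum(1.0 / a for a in cur)
    theta1 = (m * s_ratio - s_next * s_inv) / (m * m - s_cur * s_inv)
    theta2 = (s_next - theta1 * s_cur) / m
    resid = sum((b - theta1 * a - theta2) ** 2 / a for a, b in zip(cur, nxt))
    theta3 = math.sqrt(resid / m)
    return theta1, theta2, theta3
```

On a noiseless recursion $C_{i+1} = \theta_1 C_i + \theta_2$ the first two estimators recover the coefficients up to rounding and the third is numerically zero, which gives a convenient sanity check.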

3.3.2 Application to the real dataset

We undertake a simulation study of the properties of the two-step estimating procedure presented in the previous section, and summarize its results in the next section. Because we characterize our simulation scheme based on the results obtained when applying the latter procedure to the real dataset, we first present them here.

For each subject and protocol, we estimate $(\tau_1, \tau_2, \theta_1, \theta_2, \theta_3)$ from the corresponding (real) observed trajectory. The results pertaining to the estimation of $(\tau_1, \tau_2)$ are summarized in Table 3.2 (the mean and standard deviation of the estimates $(\hat\tau_1\delta, \hat\tau_2\delta)$ computed over each group of subjects are provided) and illustrated in Figure 3.2. Results pertaining to the estimation of $(\theta_1, \theta_2, \theta_3)$ are summarized in Table 3.5 (the mean and standard deviation of the estimates $(\hat\theta_1, \hat\theta_2, \hat\theta_3)$ computed over each group of subjects are provided).

                       protocol 1                                   protocol 2
                       $\hat\tau_1\delta$   $\hat\tau_2\delta$      $\hat\tau_1\delta$   $\hat\tau_2\delta$
hemiplegic subjects    18.4 (7.4)           43.3 (12.2)             16.8 (1.0)           44.5 (12.4)
vestibular subjects    20.9 (5.4)           49.9 (6.5)              19.5 (6.9)           50.3 (5.3)
normal subjects        21.4 (8.2)           47.5 (10.7)             22.4 (8.1)           51.4 (2.7)

Table 3.2: For each subject and protocol, we estimate the change points $(\tau_1, \tau_2)$. For each group of subjects, we compute over the group the mean and standard deviation (given between parentheses) of the estimates $(\hat\tau_1\delta, \hat\tau_2\delta)$.

Three features of Table 3.2 are worth commenting on. First, one notes that there is no significant difference across groups of subjects (however, the means of $\hat\tau_1\delta$ and $\hat\tau_2\delta$ are slightly shifted to the left in the group of hemiplegic subjects relative to the two other groups). Second, the means of $\hat\tau_1\delta$ are close to the time of start of perturbations for all groups and protocols. As for the means of $\hat\tau_2\delta$, they are close to the time of end for vestibular and normal subjects and both protocols. In hemiplegic subjects, one notes that the standard deviations are quite large and that the mean of $\hat\tau_2\delta$ is shifted to the left relative to the time of end of perturbations. This is due to the fact that for each protocol, 30% of hemiplegic subjects feature an estimator $\hat\tau_2\delta$ close to 30 seconds. Third, judging by the standard deviations, all hemiplegic subjects tend to react similarly by adjusting quickly to the perturbations undergone in protocol 2. Likewise, all normal subjects tend to react similarly by adjusting quickly to the end of perturbations undergone in protocol 2. In protocol 1, the large standard deviation associated with the mean value of $\hat\tau_2\delta$ computed over the group of normal subjects reflects the fact that 20% of these subjects feature an estimator $\hat\tau_2\delta$ close to 30 seconds.

Figure 3.2: Representing the estimated change points $(\hat\tau_1\delta, \hat\tau_2\delta)$ (vertical lines) as obtained based on the two (real) observed trajectories of the same single hemiplegic subject as in Figure 3.1 (left: protocol 1, right: protocol 2).

Regarding Table 3.5, we consider in turn the parameters $\theta_{1,k}$ (k = 1, 2, 3), $\theta_{2,k}$ (k = 1, 2, 3) and $\theta_{3,k}$ (k = 1, 2, 3). First, the estimates $\hat\theta_{1,k}$ (k = 1, 2, 3) behave similarly across groups of subjects, protocols and intervals $[T_{k-1}, T_k[$ (k = 1, 2, 3). On the contrary, the estimates of $\theta_{2,k}$ (k = 1, 2, 3) behave quite differently across groups of subjects, protocols and intervals $[T_{k-1}, T_k[$ (k = 1, 2, 3). This is a promising feature for the sake of classifying subjects by group, which is our main problem at stake. As for the estimates $\hat\theta_{3,k}$ (k = 1, 2, 3), for any given group of subjects and protocol, they behave quite similarly. However, for protocol 2, it seems that the estimates $\hat\theta_{3,k}$ (k = 1, 2, 3) in normal subjects behave differently from their counterparts in hemiplegic or vestibular subjects. This is another promising feature.

In conclusion, and for the sake of illustrating our modeling and two-step estimating procedure,
(i) we arbitrarily choose a hemiplegic subject and a normal subject;
(ii) for each of them, we retrieve the estimator $(\hat\tau_1, \hat\tau_2, \hat\theta_1, \hat\theta_2, \hat\theta_3)$ of $(\tau_1, \tau_2, \theta_1, \theta_2, \theta_3)$ based on the trajectory obtained under protocol 2;
(iii) for each of them, we simulate a trajectory under $(\hat\tau_1, \hat\tau_2, \hat\theta_1, \hat\theta_2, \hat\theta_3)$ (we refer to Section 3.3.3 for the specifics of the simulation);
(iv) we represent on the same plots the real and simulated trajectories, see Figure 3.3.

It appears that the real and simulated trajectories resemble each other reasonably well.

Figure 3.3: In each plot (left: hemiplegic subject, right: normal subject), we represent the real (solid line) and simulated (dotted line) trajectories, where the parameters of the simulation are derived from the real trajectory by applying our two-step estimating procedure.

3.3.3 Simulation study

We carry out a simulation study to evaluate the performances of the two-step estimating procedure presented in Section 3.3.1. We directly simulate the trajectory $C_{1:N}$. Specifically,

(i) we set $(T_1, T_2) = (\tau_1\delta, \tau_2\delta) = (15, 50)$;

(ii) we rely on the Euler scheme (3.2) with step size δ/10 to approximate the sampling of $(C(t))_{t \in [\delta, N\delta]}$ from the distribution characterized by (3.1) where, for each $k \in \{1, 2, 3\}$, $\theta_k$ equals the mean of its 22 estimates based on the 22 real trajectories associated with the 22 hemiplegic subjects and protocol 1 (see Section 3.3.2 and Table 3.5);

(iii) we conclude by sub-sampling $(C(t))_{t \in [\delta, N\delta]}$ to derive $C_{1:N}$.
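Steps (i) to (iii) can be sketched as follows. The function name `simulate_piecewise_cir`, its argument layout and the truncation inside the square root are assumptions made for the example: the text does not say how the Euler scheme is handled when the discretized process becomes negative, and any parameter values passed to the function are up to the user.

```python
import math
import random


def simulate_piecewise_cir(params, change_times, c0, delta=0.025,
                           n_obs=2800, refine=10, seed=0):
    """Euler simulation of a CIR process with two change points.

    params: three (lam, mu, sigma) triples, one per phase.
    change_times: (T1, T2) in seconds.
    Returns n_obs values observed at times delta, 2*delta, ..., n_obs*delta.
    """
    rng = random.Random(seed)
    h = delta / refine                  # fine step, delta/10 by default
    c = c0
    observed = [c0]                     # observation at time delta
    for step in range(1, (n_obs - 1) * refine + 1):
        t = delta + (step - 1) * h      # time at the start of this step
        if t < change_times[0]:
            lam, mu, sigma = params[0]
        elif t < change_times[1]:
            lam, mu, sigma = params[1]
        else:
            lam, mu, sigma = params[2]
        # Assumed guard: the discretised process may dip below zero,
        # so the square root is taken of max(c, 0).
        noise = sigma * math.sqrt(max(c, 0.0) * h) * rng.gauss(0.0, 1.0)
        c = c + lam * (mu - c) * h + noise
        if step % refine == 0:          # sub-sample back to step delta
            observed.append(c)
    return observed
```

Simulating on the finer grid and then keeping every tenth point reduces the discretization error of the Euler scheme relative to simulating directly at step δ.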

We sample B = 100 independent copies of $C_{1:N}$. For each copy we estimate the parameters $(\tau_1, \tau_2, \theta_1, \theta_2, \theta_3)$. The means and standard deviations of the estimated change points and parameters computed over the 100 independent replications are reported in Table 3.3.

Three comments on Table 3.3 are in order. First, regarding the estimation of $(\tau_1, \tau_2)$, we note that the means of the estimated change points are very close to their respective true values. Moreover, the standard deviations are small. Second, regarding the estimation of $(\theta_{1,k}, \theta_{3,k})$ (k = 1, 2, 3), we note that the means of the estimates of $\theta_{1,k}$ and $\theta_{3,k}$ are very close to their respective true values and that the standard deviations are small. Third, regarding the estimation of $\theta_{2,k}$ (k = 1, 2, 3), we emphasize that the means are quite far from their respective true values. Moreover, the standard deviations are not small. Overall this indicates a poorer estimation of $\theta_{2,k}$ (k = 1, 2, 3) than of $(\theta_{1,k}, \theta_{3,k})$ (k = 1, 2, 3). This is probably due to the fact that the time intervals $[T_{k-1}, T_k[$ (k = 1, 2, 3) are relatively narrow given the value of δ.

                                       k = 1            k = 2            k = 3
$\hat\theta_{1,k}$, mean (sd)          0.975 (0.010)    0.975 (0.009)    0.973 (0.006)
$\theta_{1,k}$, true value             0.970            0.980            0.980
$\hat\theta_{2,k}$, mean (sd)          0.114 (0.033)    0.473 (0.121)    0.214 (0.076)
$\theta_{2,k}$, true value             0.110            0.390            0.160
$\hat\theta_{3,k}^2$, mean (sd)        0.078 (0.004)    0.058 (0.002)    0.060 (0.003)
$\theta_{3,k}^2$, true value           0.080            0.060            0.060

$\hat\tau_1\delta$, mean (sd)          16.2 (0.9)       $\hat\tau_2\delta$, mean (sd)    50.8 (2.1)
$\tau_1\delta$, true value             15.0             $\tau_2\delta$, true value       50.0

Table 3.3: For each of the B = 100 independently simulated datasets, we derive the estimates $(\hat\tau_1\delta, \hat\tau_2\delta, \hat\theta_1, \hat\theta_2, \hat\theta_3)$. We report here the true values and means and standard deviations (between parentheses) computed over the B = 100 replications.

3.4 Classification

In Section 3.4.2 we describe our procedure for classifying subjects as hemiplegic, vestibular or normal based on their trajectories obtained under the two protocols. It is built upon the previous section. Indeed, it does not rely on the trajectories themselves but rather on their finite-dimensional summary measures whose definition, given in Section 3.4.1, depends on the results of our two-step estimation procedure.

3.4.1 Summary measures

Most classification procedures based on trajectories involve a preliminary step of dimension reduction where the high-dimensional trajectories are summarized into a low-dimensional summary measure (Biau et al., 2005; Ramsay and Silverman, 2005). Here we build a tailored finite-dimensional summary measure of every trajectory which relies on the estimates (τ1, τ2, θ1, θ2, θ3) derived from it by applying our two-step estimation procedure.

We argue in Section 3.3.2 that, overall, $\tau_1$, $\tau_2$, $\theta_{2,k}$ and $\theta_{3,k}$ ($k = 1, 2, 3$) may be relevant for the sake of classifying subjects as hemiplegic, vestibular or normal. Because the ratio $\theta_{2,k}/(1 - \theta_{1,k}) = \mu_k$ is easier to interpret than $\theta_{2,k}$ (and since $\theta_{1,k}$ varies very little across subjects and protocols) and because $\theta_{3,k} = \sigma_k\sqrt{\delta}$, we choose to define our finite-dimensional summary measure as $(\tau_1, \tau_2, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3)$. Hereafter we denote by $X^1$ the latter vector derived from the trajectory associated with protocol 1 and by $X^2$ that derived from the trajectory associated with protocol 2.
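The change of parametrization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `summary_measure` is hypothetical, the numerical value of `delta` used in the example is an arbitrary placeholder (the actual sampling step is not restated in this section), and the parameter values are the "true values" of Table 3.3, where the third row reports $\theta_{3,k}^2$.

```python
import math

def summary_measure(tau1, tau2, theta, delta):
    """Map (tau1, tau2, theta_1, theta_2, theta_3) to the 8-dimensional summary.

    theta is a list of three triples (theta_{1,k}, theta_{2,k}, theta_{3,k}),
    one per interval k = 1, 2, 3; delta is the sampling step.
    """
    # mu_k = theta_{2,k} / (1 - theta_{1,k})
    mus = [t2 / (1.0 - t1) for (t1, t2, _t3) in theta]
    # theta_{3,k} = sigma_k * sqrt(delta)  =>  sigma_k = theta_{3,k} / sqrt(delta)
    sigmas = [t3 / math.sqrt(delta) for (_t1, _t2, t3) in theta]
    return (tau1, tau2, *mus, *sigmas)

# Illustration with the true values of Table 3.3 (theta_{3,k} is the square
# root of the tabulated theta_{3,k}^2); delta = 0.025 is a made-up placeholder.
theta_example = [
    (0.97, 0.11, math.sqrt(0.08)),
    (0.98, 0.39, math.sqrt(0.06)),
    (0.98, 0.16, math.sqrt(0.06)),
]
x_example = summary_measure(15.0, 50.0, theta_example, delta=0.025)
```

The resulting tuple has the same layout as the summary measure defined above: two change points, three values of $\mu_k$, three values of $\sigma_k$.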

3.4.2 Classification procedure

We actually construct two classifiers, $\phi_1$ and $\phi_2$, to classify subjects as hemiplegic, vestibular or normal: $\phi_1$ is based on either $X^1$ or $X^2$, and $\phi_2$ on both $(X^1, X^2)$.

The generic observed data structure associated with a given subject is $(X^1, X^2, Y)$, where $Y \in \{1, 2, 3\}$ indicates the subject's group (with convention $Y = 1$ for hemiplegic, $Y = 2$ for vestibular, and $Y = 3$ for normal). We denote by $P_0$ the true distribution of $(X^1, X^2, Y)$.

Let $\mathcal{S}$ be the set of all classifiers based on $X = (X^1, X^2)$. The misclassification risk associated with $S \in \mathcal{S}$ is $R^{(P_0)}(S) = E_{P_0}[\mathbf{1}\{S(X) \neq Y\}]$. Denote by $R^* = \min_{S \in \mathcal{S}} R^{(P_0)}(S)$ its minimum. It is achieved at the Bayes classifier $S^* \in \mathcal{S}$, characterized by $S^*(X) = \arg\max_{y \in \{1,2,3\}} P_0(Y = y \mid X)$. For $j = 1, 2$, we also introduce the Bayes classifier $S_j^* \in \mathcal{S}$ which relies only on $X^j$: $S_j^*(X) = \arg\max_{y \in \{1,2,3\}} P_0(Y = y \mid X^j)$.

Our objective is to construct $\phi_1$ and $\phi_2$ which respectively estimate the better classifier among $S_1^*$ and $S_2^*$ (i.e., $\arg\min_{S \in \{S_1^*, S_2^*\}} R^{(P_0)}(S)$) and $S^*$. We choose to rely on the popular methodology of random forests (Breiman, 2001), which has proved powerful in a variety of applications. This is made easy thanks to the R package randomForest (we used the default tuning). The construction of the estimators $\hat S_1^*$ and $\hat S_2^*$ of $S_1^*$ and $S_2^*$, and that of $\phi_2$, is straightforward. We derive $\phi_1$ from $(\hat S_1^*, \hat S_2^*)$ by $V$-fold cross-validation, with $V = 10$.

3.5 Application

This section is devoted to the application of our classification procedure to the real dataset. In Section 3.5.1 we apply it exactly as it is described in Section 3.4. In Section 3.5.2, we consider a slightly enhanced version, which performs better. We conclude with some perspectives in Section 3.5.3.

3.5.1 Performance of the classification procedure

We evaluate the performances of the classification procedure by the leave-one-out rule. We acknowledge that it may result in overly optimistic error rates. Resorting to the leave-one-out rule is notably motivated by the relatively small sample size of our dataset. The results are reported in Table 3.4 (second row).

With its leave-one-out performance equal to 74%, the best classifier is φ1, which involves one protocol only. Out of curiosity, we also evaluate the performances of φ1 and φ2 when systematically replacing (τ1δ, τ2δ) with (15, 50) (the start and end times of the perturbation phase). We report the results in Table 3.4 (first row). Quite satisfactorily, we note that both φ1 and φ2 then perform worse than the original versions: estimating the change points proves very relevant.

3.5.2 Performance of an extended classification procedure

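It is tempting to extend our classification procedure by simply extending the definition of the summary measure which is at its core. Following Chambaz and Denis (2012), we merely substitute $(X^1, U^1)$ and $(X^2, U^2)$ for $X^1$ and $X^2$, with $U^j$ ($j = 1, 2$) derived from $C_{1:N}$ as $U^j = (C_1^+ - C_1^-, C_2^- - C_1^+, C_2^+ - C_2^-)$ where

\[ C_1^- = \frac{\delta}{5} \sum_{i \in [10/\delta, 15/\delta[} C_i, \qquad C_1^+ = \frac{\delta}{5} \sum_{i \in ]15/\delta, 20/\delta]} C_i, \]

\[ C_2^- = \frac{\delta}{5} \sum_{i \in [45/\delta, 50/\delta[} C_i, \qquad C_2^+ = \frac{\delta}{5} \sum_{i \in ]50/\delta, 55/\delta]} C_i. \]

Each of these four quantities is thus the average of the trajectory over a time window of length 5 located just before or just after one of the times 15 and 50. We refer to Chambaz and Denis (2012) for a justification. Then we apply the extended classification procedure and report its performances in Table 3.4 (third row).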
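The additional features $U^j$ of Section 3.5.2 are differences of window averages, which can be sketched as follows. The sketch assumes, for convenience only, that $\delta = 1/f$ for an integer sampling frequency `f`, so that the window bounds $10/\delta, 15/\delta, \ldots$ are integers; the function names and the toy trajectory are hypothetical.

```python
def window_mean(c, f, start, stop, left_closed):
    """(delta / 5) * sum of C_i over a window of length 5, with delta = 1 / f.

    c[i - 1] holds C_i, the observation at time i * delta.
    left_closed=True  -> i ranges over [start / delta, stop / delta[
    left_closed=False -> i ranges over ]start / delta, stop / delta]
    """
    lo, hi = start * f, stop * f
    idx = range(lo, hi) if left_closed else range(lo + 1, hi + 1)
    # With stop - start = 5 this is exactly (delta / 5) * sum.
    return sum(c[i - 1] for i in idx) / ((stop - start) * f)

def extra_features(c, f):
    """Return U = (C1+ - C1-, C2- - C1+, C2+ - C2-)."""
    c1_minus = window_mean(c, f, 10, 15, True)
    c1_plus = window_mean(c, f, 15, 20, False)
    c2_minus = window_mean(c, f, 45, 50, True)
    c2_plus = window_mean(c, f, 50, 55, False)
    return (c1_plus - c1_minus, c2_minus - c1_plus, c2_plus - c2_minus)

# Toy trajectory (made up): equal to 1 strictly after time 15 and up to time 50,
# and to 0 elsewhere, observed at f = 4 points per time unit over 70 time units.
f_example = 4
c_example = [1.0 if 15 * f_example < i <= 50 * f_example else 0.0
             for i in range(1, 70 * f_example + 1)]
u_example = extra_features(c_example, f_example)
```

On this toy trajectory the first coordinate captures the jump at time 15, the second the absence of drift during the perturbation phase, and the third the drop at time 50.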

With its leave-one-out performance equal to 79%, the best classifier is $\phi_2$, which involves both extended summary measures $(X^1, U^1)$ and $(X^2, U^2)$. This is slightly better than the best performance obtained in Section 3.5.1: enriching the summary measures seems to provide relevant additional information for the sake of classifying subjects as hemiplegic, vestibular or normal.

                                                              performances
                                                              φ1       φ2
procedure of Section 3.4.2 with (τ1δ, τ2δ) = (15, 50)         61%      64%
procedure of Section 3.4.2                                    74%      68%
extended procedure of Section 3.5.2                           76%      79%

Table 3.4: Leave-one-out performances of φ1 and φ2 on the real dataset for the sake of classifying subjects as hemiplegic, vestibular or normal. First row: classification procedure of Section 3.4.2 when imposing (τ1δ, τ2δ) = (15, 50). Second row: classification procedure of Section 3.4.2. Third row: extended classification procedure of Section 3.5.2.

3.5.3 Perspectives

We address the statistical challenge of classifying subjects as hemiplegic, vestibular or normal based on complex trajectories obtained through two experimental protocols which were designed to evaluate potential deficits in postural control. The classification procedure involves a dimension reduction step where the complex trajectories are summarized by finite-dimensional summary measures based on a stochastic process model for a real-valued trajectory. This allows us to retrieve from the trajectories information relative to their temporal dynamics. A leave-one-out evaluation yields a 79% rate of correct classification for a total of n = 70 subjects, with 22 hemiplegic (31%), 16 vestibular (23%) and 32 normal (46%) subjects.

In future work, we will extend the classification procedure by introducing finite-dimensional summary measures based on a stochastic process model for the original trajectories in R4. We will also take advantage of our good understanding of the classification problem to tackle the closely related next statistical challenge of clustering our subjects in terms of postural style.

Acknowledgments

I wish to thank I. Bonan (Service de Médecine Physique et de réadaptation, CHU Rennes) and P-P. Vidal (CESEM, Université Paris Descartes), my PhD advisors A. Chambaz and A. Samson (MAP5, Université Paris Descartes), and my colleague V. Genon-Catalot (MAP5, Université Paris Descartes), for introducing me to this interesting problem and providing the dataset, for their attentive guidance and all the fruitful discussions.

                                  protocol 1                                protocol 2
                       k = 1        k = 2        k = 3        k = 1        k = 2        k = 3

hemiplegic   θ1,k    0.97 (0.02)  0.98 (0.02)  0.98 (0.01)  0.97 (0.02)  0.98 (0.01)  0.97 (0.02)
subjects     θ2,k    0.11 (0.08)  0.39 (0.48)  0.16 (0.11)  0.14 (0.09)  0.38 (0.33)  0.38 (0.62)
             θ3,k²   0.08 (0.05)  0.06 (0.06)  0.06 (0.04)  0.10 (0.06)  0.10 (0.10)  0.11 (0.09)

vestibular   θ1,k    0.99 (0.01)  0.99 (0.01)  0.99 (0.01)  0.98 (0.02)  0.98 (0.01)  0.98 (0.01)
subjects     θ2,k    0.06 (0.03)  0.23 (0.24)  0.09 (0.09)  0.14 (0.12)  0.56 (0.37)  0.19 (0.20)
             θ3,k²   0.05 (0.04)  0.07 (0.08)  0.07 (0.06)  0.10 (0.07)  0.12 (0.09)  0.09 (0.07)

normal       θ1,k    0.99 (0.01)  0.99 (0.01)  0.99 (0.01)  0.98 (0.02)  0.99 (0.01)  0.98 (0.01)
subjects     θ2,k    0.02 (0.22)  0.13 (0.15)  0.10 (0.09)  0.08 (0.08)  0.15 (0.29)  0.15 (0.14)
             θ3,k²   0.04 (0.04)  0.02 (0.03)  0.04 (0.05)  0.06 (0.05)  0.04 (0.04)  0.06 (0.06)

Table 3.5: For each subject and each protocol, we first estimate the change points (τ1δ, τ2δ), then we compute the estimates of (θ1, θ2, θ3) over each of the three resulting intervals [δ, τ1δ[, [τ1δ, τ2δ[ and [τ2δ, Nδ], where θk = (θ1,k, θ2,k, θ3,k) (k = 1, 2, 3). For each group of subjects and each protocol, we compute over the group and for that protocol the mean and standard deviation (between parentheses) of the estimates of (θ1, θ2, θ3).


Chapter 4

Study of the top scoring pair classifier


4.1 Introduction

This chapter is devoted to the study of two top scoring pair (TSP) classifiers inspired by the original TSP classifier defined and studied by Geman et al. (2004).

The original TSP classifier was coined to address the classification of different cancers based on gene-expression profiles. The main feature of the original TSP classifier is that it is based on pairwise comparisons. More precisely, the method consists in differentiating between two classes by finding pairs of genes whose expression levels typically invert from one class to the other. A single pair of genes (or sometimes a handful of them) is selected by maximizing a score. Then the resulting TSP classification rule is based only on the selected pair(s) of genes. Thus, the TSP classifier does not suffer from the lack of interpretability which often arises in the statistical analysis of microarray data. Indeed, it is easy to interpret the fact that the expression level of a gene is larger than the expression level of another gene, even if these expression levels are obtained under different experimental conditions. Hopefully, one can pass the so-called "elevator exam": explaining to one's colleague from the biology department how this classifier works in the elevator between the first and the fourth floors. Moreover, this classification rule is robust to quantization effects and invariant to pre-processing such as normalization methods (Yang et al., 2001). Furthermore, the TSP classifier is easy to compute, and its implementation requires no tuning parameters. We refer to the tspair package for an R implementation. Geman et al. (2004) also argue that the TSP classifier behaves well in the "small n large p" paradigm, and they show on several real datasets that the TSP classification rule compares favorably with more complex ones. Note that the application of the TSP classification procedure in Chapter 2 also leads to good performances for classifying subjects in terms of postural maintenance.

Various extensions of the TSP classification procedure have been proposed. Tan et al. (2005) and Zhou et al. (2012) are interested in k-TSP procedures which involve the pairs that achieve the k largest scores rather than the highest score only. Note that the k-TSP procedure of Tan et al. (2005) applies to the multi-class framework. Furthermore, Czajkowski and Kretowski (2011) propose a TSP procedure based on decision trees. As far as we know, there is no theoretical study of the TSP classification procedure.

The aim of this work is to provide such a theoretical study. We show that the differences in risks (for two different risks and their cross-validated versions) of the two empirical TSP classifiers relative to their theoretical counterparts are $O(\sqrt{\log(p)/n})$ where $p$ is the number of pairs and $n$ the sample size. In particular, the results shed some light on how the empirical TSP classifiers behave in the "small n large p" paradigm.

In Section 4.2, we define the two TSP classification procedures of interest as two maximizers of two different scores. Their empirical counterparts are defined as maximizers of the corresponding empirical scores in Section 4.3, where we also carry out their asymptotic study. We introduce the cross-validated versions of the two TSP classification rules in Section 4.4, where we also study their asymptotic behaviors. The proofs of the main results are postponed to Section 4.5.

4.2 General framework

This section is devoted to the definition of two TSP classifiers. We first introduce some useful notations and definitions in Section 4.2.1. We define the TSP classifiers in Section 4.2.2.

4.2.1 Notations

Let $O = (X, Y)$ be the observed data-structure taking values in $\mathbb{R}^G \times \{0, 1\}$ (for a possibly large integer $G$). For instance, $X$ can be viewed as the expression levels of $G$ genes while $Y$ can indicate if the subject is healthy or not. The true data-generating distribution is $P_0$, which is an element of the set $\mathcal{M}$ of all candidate data-generating distributions. We denote by $D_n = \{O^k = (X^k, Y^k), k = 1, \ldots, n\}$ a learning dataset, where $O^1 = (X^1, Y^1), \ldots, O^n = (X^n, Y^n)$ are independent copies of $O$. We set $\mathcal{J} = \{J = (i, j) \in \{1, \ldots, G\}^2, i < j\}$ and $Z_J = \mathbf{1}\{X_i < X_j\}$ for all $J = (i, j) \in \mathcal{J}$. Obviously, $\mathrm{card}(\mathcal{J}) = G(G - 1)/2$.

We introduce the following notations: $p_1 = P_0(Y = 1) = 1 - p_0$, $p = \min(p_1, p_0)$, and for each $J \in \mathcal{J}$, $\alpha_J = P_0(Z_J = 1)$, $\eta_J(Z_J) = P_0(Y = 1 \mid Z_J)$, $p_J(1) = P_0(Z_J = 1 \mid Y = 1)$, and $p_J(0) = P_0(Z_J = 1 \mid Y = 0)$. We assume that $\mathrm{card}(\mathcal{J}) \geq 2$ and $p > 0$.

Let $\mathcal{F}$ be the set of functions which map $\mathbb{R}^G$ to $\{0, 1\}$, and consider the loss function $L : \mathbb{R}^G \times \{0, 1\} \times \mathcal{F} \to \mathbb{R}_+$ such that

\[ L((X, Y), \Psi) = \mathbf{1}\{\Psi(X) \neq Y\} = Y \mathbf{1}\{\Psi(X) \neq 1\} + (1 - Y) \mathbf{1}\{\Psi(X) \neq 0\}. \]

The loss function $L$ is the usual loss function in the classification framework. For all $P \in \mathcal{M}$, this yields the risks $R_1^{(P)}, R_2^{(P)} : \mathcal{F} \to \mathbb{R}_+$ characterized by

\[ R_1^{(P)}(\Psi) = E_P[L(O, \Psi)] = P(\Psi(X) \neq Y), \quad \text{and} \]

\[ R_2^{(P)}(\Psi) = E_P[L(O, \Psi) \mid Y = 1] + E_P[L(O, \Psi) \mid Y = 0] = \frac{1}{p_1} P(\Psi(X) \neq 1, Y = 1) + \frac{1}{p_0} P(\Psi(X) \neq 0, Y = 0). \]

The risk $R_1$ is called the misclassification risk. We call $R_2$ a weighted misclassification risk. The risk $R_2$ is particularly useful when $p \ll \max(p_1, p_0)$ and it is important to identify the elements of the rare class.
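To see why the weighted risk matters when one class is rare, the following sketch computes the empirical analogues of the two risks for a given vector of predictions. The function name and the toy labels are hypothetical; the class-wise error rates are simply the empirical counterparts of the two conditional expectations defining $R_2$.

```python
def empirical_risks(predictions, labels):
    """Return (r1, r2): empirical misclassification and weighted misclassification risks."""
    n = len(labels)
    # r1: overall proportion of errors.
    r1 = sum(pred != y for pred, y in zip(predictions, labels)) / n
    # r2: sum over the two classes of the within-class error rate.
    r2 = 0.0
    for cls in (0, 1):
        in_class = [pred for pred, y in zip(predictions, labels) if y == cls]
        if in_class:  # a class that is absent contributes 0
            r2 += sum(pred != cls for pred in in_class) / len(in_class)
    return r1, r2

# Toy example (made up): 5 observations from the rare class 1, 95 from class 0,
# and a classifier that always predicts the frequent class.
labels_example = [1] * 5 + [0] * 95
always_zero = [0] * 100
r1_example, r2_example = empirical_risks(always_zero, labels_example)
```

The constant classifier looks good in terms of the misclassification risk (5% of errors) but its weighted risk equals 1, the value reached by any classifier that ignores the data, because every element of the rare class is missed.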

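Finally, we define $\mathcal{F}_{pair} = \bigcup_{J \in \mathcal{J}} \mathcal{F}_J$ where $\mathcal{F}_J$ is the set of functions $t$ of $Z_J$ such that $t(Z_J) \in \{0, 1\}$. For $J \in \mathcal{J}$, a classifier $t \in \mathcal{F}_J$ is called a pair classifier. Note that $\mathrm{card}(\mathcal{F}_{pair}) = 4\,\mathrm{card}(\mathcal{J})$.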


4.2.2 Definition of the TSP classifiers

The two TSP classifiers that we consider here are elements of $\mathcal{F}_{pair}$. Their definitions involve the risks $R_1$ and $R_2$. Of course there is no guarantee a priori that classifying based on basic comparisons as they do will prove efficient. However, they are so simple and so fast that one can try them at almost no cost.

TSP for the misclassification risk

We first introduce the TSP classifier for the misclassification risk $R_1$. For each $J \in \mathcal{J}$, let $\Psi_J$ denote the Bayes classifier on the set $\mathcal{F}_J$, defined by

\[ \Psi_J(X) = \Psi_J(Z_J) = \mathbf{1}\{\eta_J(Z_J) \geq 1/2\}. \tag{4.1} \]

The classifier $\Psi_J$ votes for the class with the larger probability conditionally on $X_i < X_j$ or $X_i \geq X_j$. We recall that $\Psi_J$ is also characterized by $\Psi_J \in \arg\min_{t \in \mathcal{F}_J} R_1^{(P_0)}(t)$. We define the score $\gamma_J$ of the pair $J$ as

\[ \gamma_J = \alpha_J |\eta_J(1) - 1/2| + (1 - \alpha_J) |\eta_J(0) - 1/2|. \tag{4.2} \]

The following lemma connects the score of a pair $J \in \mathcal{J}$ to the misclassification risk of $\Psi_J$.

Lemma 4.1. For each $J \in \mathcal{J}$ it holds that $\gamma_J = 1/2 - R_1^{(P_0)}(\Psi_J)$.

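Proof. Set $J \in \mathcal{J}$. We first decompose $R_1^{(P_0)}(\Psi_J)$ as follows:

\[ R_1^{(P_0)}(\Psi_J) = E_{P_0}\big[\mathbf{1}\{\Psi_J(1) \neq 1\}\mathbf{1}\{Y = 1\}\mathbf{1}\{Z_J = 1\}\big] + E_{P_0}\big[\mathbf{1}\{\Psi_J(0) \neq 1\}\mathbf{1}\{Y = 1\}\mathbf{1}\{Z_J = 0\}\big] + E_{P_0}\big[\mathbf{1}\{\Psi_J(1) \neq 0\}\mathbf{1}\{Y = 0\}\mathbf{1}\{Z_J = 1\}\big] + E_{P_0}\big[\mathbf{1}\{\Psi_J(0) \neq 0\}\mathbf{1}\{Y = 0\}\mathbf{1}\{Z_J = 0\}\big]. \]

From this decomposition, we deduce that

\[ R_1^{(P_0)}(\Psi_J) = \alpha_J \big[\mathbf{1}\{\Psi_J(1) \neq 1\}\eta_J(1) + (1 - \mathbf{1}\{\Psi_J(1) \neq 1\})(1 - \eta_J(1))\big] + (1 - \alpha_J) \big[\mathbf{1}\{\Psi_J(0) \neq 1\}\eta_J(0) + (1 - \mathbf{1}\{\Psi_J(0) \neq 1\})(1 - \eta_J(0))\big]. \]

Using the facts that, firstly, $\Psi_J(1) \neq 1$ if and only if $\eta_J(1) < 1/2$ and, secondly, $\Psi_J(0) \neq 1$ if and only if $\eta_J(0) < 1/2$, we obtain that

\[ 1/2 - \big[\mathbf{1}\{\Psi_J(1) \neq 1\}\eta_J(1) + (1 - \mathbf{1}\{\Psi_J(1) \neq 1\})(1 - \eta_J(1))\big] = |\eta_J(1) - 1/2|, \]

\[ 1/2 - \big[\mathbf{1}\{\Psi_J(0) \neq 1\}\eta_J(0) + (1 - \mathbf{1}\{\Psi_J(0) \neq 1\})(1 - \eta_J(0))\big] = |\eta_J(0) - 1/2|. \]

These last equalities, combined with $1/2 = \alpha_J/2 + (1 - \alpha_J)/2$, complete the proof.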


Lemma 4.1 teaches us that the larger the score $\gamma_J$, the better the classification based only on the pair $J$. Therefore, the TSP $J_1^*$ is characterized by

\[ J_1^* \in \arg\max_{J \in \mathcal{J}} \gamma_J. \tag{4.3} \]

It yields the TSP classifier for the misclassification risk:

\[ \Psi_{J_1^*}(X) = \Psi_{J_1^*}(Z_{J_1^*}) = \mathbf{1}\{\eta_{J_1^*}(Z_{J_1^*}) \geq 1/2\}. \]

By (4.3) and Lemma 4.1, one can equivalently characterize this TSP classifier as

\[ \Psi_{J_1^*} \in \arg\min_{t \in \mathcal{F}_{pair}} R_1^{(P_0)}(t), \]

showing that $\Psi_{J_1^*}$ can also be viewed as a risk minimizer; we will take advantage of this remark later.

TSP for the weighted misclassification risk

We now introduce the TSP classifier for the weighted misclassification risk. It is the original TSP classifier of Geman et al. (2004). It can be viewed as a weighted counterpart of the TSP classifier $\Psi_{J_1^*}$ in the sense that it is a minimizer of the weighted misclassification risk over $\mathcal{F}_{pair}$.

For each $J \in \mathcal{J}$, we introduce the classifier $\Phi_J \in \mathcal{F}_J$ defined by

\[ \Phi_J(X) = \Phi_J(Z_J) = \mathbf{1}\{p_J(1) > p_J(0)\}\mathbf{1}\{Z_J = 1\} + \mathbf{1}\{p_J(1) \leq p_J(0)\}\mathbf{1}\{Z_J = 0\}. \tag{4.4} \]

The classifier $\Phi_J$ votes for the class where the observed ordering between $X_i$ and $X_j$ is the more likely. We also introduce the score $\Delta_J$ of each $J \in \mathcal{J}$ as

\[ \Delta_J = |p_J(1) - p_J(0)|. \tag{4.5} \]

The following lemma teaches us that one can interpret $\Delta_J$ as the weighted counterpart of $\gamma_J$ and $\Phi_J$ as the weighted counterpart of $\Psi_J$.

Lemma 4.2. Set $J \in \mathcal{J}$. For all $t \in \mathcal{F}_J$, it holds that

\[ R_2^{(P_0)}(t) - R_2^{(P_0)}(\Phi_J) = \Delta_J \big(\mathbf{1}\{t(1) \neq \Phi_J(1)\} + \mathbf{1}\{t(0) \neq \Phi_J(0)\}\big), \]

which implies that $\Phi_J \in \arg\min_{t \in \mathcal{F}_J} R_2^{(P_0)}(t)$. Moreover, $\Delta_J = 1 - R_2^{(P_0)}(\Phi_J)$.

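Proof. Set $J \in \mathcal{J}$, $t \in \mathcal{F}_J$, and define $A_1 = E_{P_0}[\mathbf{1}\{t(Z_J) \neq 1\} \mid Y = 1]$ and $A_0 = E_{P_0}[\mathbf{1}\{t(Z_J) \neq 0\} \mid Y = 0]$. We can decompose $A_1$ as

\[ A_1 = E_{P_0}[\mathbf{1}\{t(1) \neq 1\}\mathbf{1}\{Z_J = 1\} \mid Y = 1] + E_{P_0}[\mathbf{1}\{t(0) \neq 1\}\mathbf{1}\{Z_J = 0\} \mid Y = 1] = p_J(1)\mathbf{1}\{t(1) \neq 1\} + (1 - p_J(1))\mathbf{1}\{t(0) \neq 1\}. \tag{4.6} \]

Similarly,

\[ A_0 = E_{P_0}[\mathbf{1}\{t(1) \neq 0\}\mathbf{1}\{Z_J = 1\} \mid Y = 0] + E_{P_0}[\mathbf{1}\{t(0) \neq 0\}\mathbf{1}\{Z_J = 0\} \mid Y = 0] = p_J(0)\mathbf{1}\{t(1) \neq 0\} + (1 - p_J(0))\mathbf{1}\{t(0) \neq 0\} = p_J(0)\big(1 - \mathbf{1}\{t(1) \neq 1\}\big) + (1 - p_J(0))\big(1 - \mathbf{1}\{t(0) \neq 1\}\big). \tag{4.7} \]

Since $R_2^{(P_0)}(t) = A_1 + A_0$, we deduce from (4.6) and (4.7) that

\[ R_2^{(P_0)}(t) = 1 + (p_J(0) - p_J(1))\mathbf{1}\{t(0) \neq 1\} + (p_J(1) - p_J(0))\mathbf{1}\{t(1) \neq 1\}. \tag{4.8} \]

Equation (4.8) holds in particular when $t = \Phi_J$. Therefore,

\[ R_2^{(P_0)}(t) - R_2^{(P_0)}(\Phi_J) = (p_J(0) - p_J(1))\big(\mathbf{1}\{t(0) \neq 1\} - \mathbf{1}\{\Phi_J(0) \neq 1\}\big) + (p_J(1) - p_J(0))\big(\mathbf{1}\{t(1) \neq 1\} - \mathbf{1}\{\Phi_J(1) \neq 1\}\big) = \Delta_J \big(\mathbf{1}\{t(1) \neq \Phi_J(1)\} + \mathbf{1}\{t(0) \neq \Phi_J(0)\}\big), \]

which is the first stated result. Moreover, a direct application of (4.8) with $t = \Phi_J$ yields the second result.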

The TSP classifier for the weighted misclassification risk is $\Phi_{J_2^*}$ with $J_2^*$ characterized by

\[ J_2^* \in \arg\max_{J \in \mathcal{J}} \Delta_J. \tag{4.9} \]

By Lemma 4.2 and (4.9), it holds that $\Phi_{J_2^*} \in \arg\min_{t \in \mathcal{F}_{pair}} R_2^{(P_0)}(t)$, showing that $\Phi_{J_2^*}$ can be viewed as a minimizer of the weighted misclassification risk over $\mathcal{F}_{pair}$.

4.3 Empirical TSP classifiers

In this section, we introduce our empirical TSP classifiers and study their asymptotic behaviors in terms of risk control. Section 4.3.1 and Section 4.3.2 are devoted to the empirical TSP classification procedures for the misclassification risk and the weighted misclassification risk, respectively.

4.3.1 Empirical TSP classifier for the misclassification risk

The definition of the empirical TSP classifier for the misclassification risk relies on estimators of $J_1^*$ and $\eta_{J_1^*}$ that we plug into (4.1).


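For every $t \in \mathcal{F}_{pair}$, we set $\hat R_1(t) = \frac{1}{n} \sum_{k=1}^n \mathbf{1}\{t(X^k) \neq Y^k\}$, the empirical misclassification risk of $t$. For each $J \in \mathcal{J}$, let $\hat\gamma_J = \hat\alpha_J |\hat\eta_J(1) - 1/2| + (1 - \hat\alpha_J) |\hat\eta_J(0) - 1/2|$ be the empirical score, where $\hat\alpha_J = \frac{1}{n} \sum_{k=1}^n Z_J^k$ and

\[ \hat\eta_J(z) = \begin{cases} \dfrac{1}{n \hat\beta_J(z)} \sum_{k=1}^n \mathbf{1}\{Z_J^k = z\}\mathbf{1}\{Y^k = 1\} & \text{if } \hat\beta_J(z) > 0, \\ 1/2 & \text{otherwise,} \end{cases} \]

with $\hat\beta_J(z) = z\hat\alpha_J + (1 - z)(1 - \hat\alpha_J)$ (for both $z = 0, 1$). The random variable $\hat\eta_J(z)$ is the empirical version of $\eta_J(z)$. If $\mathrm{card}(\{k, Z_J^k = z\}) = 0$, we choose $\hat\eta_J(z) = 1/2$ by convention. The plug-in estimator $\hat\Psi_J(\cdot) = \mathbf{1}\{\hat\eta_J(\cdot) \geq 1/2\}$ of $\Psi_J$ implements a majority voting rule:

\[ \hat\Psi_J(z) = \begin{cases} 1 & \text{if } \mathrm{card}(\{k, Z_J^k = z, Y^k = 1\}) \geq \mathrm{card}(\{k, Z_J^k = z, Y^k = 0\}), \\ 0 & \text{otherwise,} \end{cases} \]

hence

\[ \hat\Psi_J \in \arg\min_{t \in \mathcal{F}_J} \hat R_1(t). \tag{4.10} \]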

We illustrate the classification rule $\hat\Psi_J$ in Figure 4.1. Finally, $\hat J_1 \in \arg\max_{J \in \mathcal{J}} \hat\gamma_J$ defines an estimator of the TSP $J_1^*$ which leads to the empirical TSP classifier $\hat\Psi_{\hat J_1}$. A slight adaptation of the proof of Lemma 4.1 shows the following result:

Lemma 4.3. For each $J \in \mathcal{J}$, it holds that $\hat\gamma_J = 1/2 - \hat R_1(\hat\Psi_J)$.

Lemma 4.3 and (4.10) entail that

\[ \hat\Psi_{\hat J_1} \in \arg\min_{t \in \mathcal{F}_{pair}} \hat R_1(t). \tag{4.11} \]

This property leads to the following asymptotic result which teaches us that, in the limit, $\hat\Psi_{\hat J_1}$ performs as well as the TSP classifier $\Psi_{J_1^*}$.

Theorem 4.1. It holds that

\[ 0 \leq E\big[R_1^{(P_0)}(\hat\Psi_{\hat J_1}) - R_1^{(P_0)}(\Psi_{J_1^*})\big] = O\left(\sqrt{\frac{\log(\mathrm{card}(\mathcal{J}))}{n}}\right). \]

This is the classical rate of convergence that one expects for a classifier which can be viewed as a minimizer, over the set $\mathcal{F}_{pair}$ of pair classifiers, of the empirical misclassification risk (Bousquet et al., 2004). We see clearly how the number of pairs affects the rate of convergence.

The proof of Theorem 4.1 is postponed to Section 4.5.2.
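The empirical TSP classifier for the misclassification risk can be sketched as follows. This is a minimal illustration, not the tspair implementation: the function name, the 0-based indexing of the pairs and the toy dataset are assumptions made for the example. The sketch uses the identity $\hat\alpha_J|\hat\eta_J(1) - 1/2| = |c_{11} - c_{10}|/(2n)$, where $c_{zy}$ is the number of observations with $Z_J^k = z$ and $Y^k = y$ (and similarly for $z = 0$), which also covers the convention $\hat\eta_J(z) = 1/2$ when no observation has $Z_J^k = z$.

```python
from itertools import combinations

def empirical_tsp(X, Y):
    """Return (pair, score, rule) maximizing the empirical score gamma_hat_J.

    X: list of n feature vectors of length G; Y: list of n labels in {0, 1}.
    Ties between pairs are broken in favor of the first pair in lexicographic order.
    """
    n, G = len(X), len(X[0])
    best_pair, best_score, best_votes = None, -1.0, None
    for i, j in combinations(range(G), 2):
        # cnt[z][y] = number of k with Z_J^k = z and Y^k = y, Z_J = 1{X_i < X_j}
        cnt = [[0, 0], [0, 0]]
        for x, y in zip(X, Y):
            cnt[int(x[i] < x[j])][y] += 1
        score = (abs(cnt[1][1] - cnt[1][0]) + abs(cnt[0][1] - cnt[0][0])) / (2 * n)
        if score > best_score:
            # Majority vote within each value of Z_J (ties vote for class 1).
            votes = [int(cnt[z][1] >= cnt[z][0]) for z in (0, 1)]
            best_pair, best_score, best_votes = (i, j), score, votes
    i, j = best_pair
    return best_pair, best_score, lambda x: best_votes[int(x[i] < x[j])]

# Toy dataset (made up): the ordering of features 0 and 2 separates the classes.
X_toy = [[1, 5, 2], [0, 3, 4], [2, 1, 3], [5, 2, 1], [4, 0, 3], [3, 9, 0]]
Y_toy = [1, 1, 1, 0, 0, 0]
pair_toy, score_toy, rule_toy = empirical_tsp(X_toy, Y_toy)
```

In line with Lemma 4.3, the selected score equals 1/2 minus the empirical misclassification risk of the selected rule, which is 0 on this toy dataset.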


Figure 4.1: Illustration of the empirical classification rules $\hat\Psi_J$ and $\hat\Phi_J$ for a pair $J = (i, j)$. First, we have $\hat\eta_J(1) = 5/8$ and $\hat\eta_J(0) = 5/7$. Therefore, for a new observation $X$, $\hat\Psi_J(X) = 1$ if $X_j > X_i$ and $\hat\Psi_J(X) = 1$ if $X_j \leq X_i$. Moreover, the score $\hat\gamma_J$ of the pair $J$ is equal to $(8/15)|5/8 - 1/2| + (7/15)|5/7 - 1/2| = 1/6$. For the computation of $\hat\Phi_J$, $\hat p_J(1) = 1/2$ and $\hat p_J(0) = 3/5$. Therefore, for a new observation $X$ we obtain $\hat\Phi_J(X) = 0$ if $X_j > X_i$ and $\hat\Phi_J(X) = 1$ if $X_j \leq X_i$. Moreover, the score $\hat\Delta_J$ of the pair $J$ is equal to $|1/2 - 3/5| = 1/10$.
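The numbers quoted in the caption of Figure 4.1 can be reproduced with exact arithmetic. The counts below are an assumption: they are the only counts with n = 15 observations compatible with the ratios 5/8, 5/7, 8/15 and 7/15 given in the caption, since the figure itself is not reproduced here.

```python
from fractions import Fraction

# counts[(z, y)] = number of observations with Z_J = z and Y = y
# (inferred from the caption, not read from the figure).
counts = {(1, 1): 5, (1, 0): 3, (0, 1): 5, (0, 0): 2}
n = sum(counts.values())
half = Fraction(1, 2)

# Quantities attached to the misclassification risk.
alpha = Fraction(counts[(1, 1)] + counts[(1, 0)], n)
eta = {z: Fraction(counts[(z, 1)], counts[(z, 1)] + counts[(z, 0)]) for z in (0, 1)}
gamma = alpha * abs(eta[1] - half) + (1 - alpha) * abs(eta[0] - half)
psi = {z: int(eta[z] >= half) for z in (0, 1)}

# Quantities attached to the weighted misclassification risk.
p = {y: Fraction(counts[(1, y)], counts[(1, y)] + counts[(0, y)]) for y in (0, 1)}
delta = abs(p[1] - p[0])
phi = {1: int(p[1] > p[0]), 0: int(p[1] <= p[0])}

# Empirical misclassification risk of psi, to check the identity of Lemma 4.3.
errors = sum(c for (z, y), c in counts.items() if psi[z] != y)
risk_psi = Fraction(errors, n)
```

With these counts the majority vote is for class 1 whatever the ordering of $X_i$ and $X_j$, whereas the rule attached to the weighted risk does use the ordering.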

4.3.2 Empirical TSP classifier for the weighted misclassification risk

The definition of the empirical TSP classifier for the weighted misclassification risk relies on estimators of $J_2^*$ and $p_{J_2^*}$. It is the original empirical TSP classifier of Geman et al. (2004).

Set $I(y) = \{k \leq n, Y^k = y\}$ and $N(y) = \mathrm{card}(I(y))$ for $y \in \{0, 1\}$. For each $J \in \mathcal{J}$, the empirical score $\hat\Delta_J$ is defined as $\hat\Delta_J = |\hat p_J(1) - \hat p_J(0)|$, where, for each $y = 0, 1$,

\[ \hat p_J(y) = \frac{\mathbf{1}\{N(y) > 0\}}{N(y)} \sum_{k \in I(y)} Z_J^k \]

(with convention $0/0 = 0$). We also define for each $J \in \mathcal{J}$ the empirical counterpart $\hat\Phi_J$ of $\Phi_J$ as $\hat\Phi_J(X) = \hat\Phi_J(Z_J) = \mathbf{1}\{\hat p_J(1) > \hat p_J(0)\}\mathbf{1}\{Z_J = 1\} + \mathbf{1}\{\hat p_J(1) \leq \hat p_J(0)\}\mathbf{1}\{Z_J = 0\}$. Finally, $\hat J_2 \in \arg\max_{J \in \mathcal{J}} \hat\Delta_J$ defines an estimator of the TSP $J_2^*$, which leads to the empirical TSP classifier $\hat\Phi_{\hat J_2}$ for the weighted misclassification risk.

The following asymptotic result shows that $\hat\Phi_{\hat J_2}$ performs as well, in the limit, as $\Phi_{J_2^*}$:

Theorem 4.2. It holds that

\[ 0 \leq E\Big[\big(R_2^{(P_0)}(\hat\Phi_{\hat J_2}) - R_2^{(P_0)}(\Phi_{J_2^*})\big)\mathbf{1}\{0 < N(1) < n\}\Big] = O\left(\sqrt{\frac{\log(\mathrm{card}(\mathcal{J}))}{np}}\right). \tag{4.12} \]

The above rate of convergence is the same as in Theorem 4.1 with $n$ replaced by $np$, the expected number of those observations $O^k = (X^k, Y^k)$ such that $Y^k = y$ where $y$ is the rare outcome (i.e., $p = P_0(Y = y)$). The additional factor $1/\sqrt{p}$ featured in (4.12) quantifies to what extent working with $R_2$ instead of $R_1$ makes the classification problem more difficult.

The proof of Theorem 4.2 is given in Section 4.5.3.
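The empirical TSP classifier for the weighted misclassification risk of Section 4.3.2 can be sketched along the same lines. Again, this is only an illustration and not the tspair implementation: the function name, the 0-based pair indices and the toy dataset are assumptions.

```python
from itertools import combinations

def empirical_tsp_weighted(X, Y):
    """Return (pair, score, rule) maximizing |p_hat_J(1) - p_hat_J(0)|.

    X: list of n feature vectors of length G; Y: list of n labels in {0, 1}.
    """
    G = len(X[0])
    n1 = sum(Y)
    n0 = len(Y) - n1
    best = None
    for i, j in combinations(range(G), 2):
        # p_hat_J(y): proportion of {X_i < X_j} within class y (0 if the class is empty).
        z1 = sum(x[i] < x[j] for x, y in zip(X, Y) if y == 1)
        z0 = sum(x[i] < x[j] for x, y in zip(X, Y) if y == 0)
        p1 = z1 / n1 if n1 > 0 else 0.0
        p0 = z0 / n0 if n0 > 0 else 0.0
        score = abs(p1 - p0)
        if best is None or score > best[1]:
            best = ((i, j), score, p1 > p0)
    (i, j), score, ordering_votes_for_one = best

    def rule(x):
        z = int(x[i] < x[j])
        # Phi_hat_J(z) = z if p_hat_J(1) > p_hat_J(0), and 1 - z otherwise.
        return z if ordering_votes_for_one else 1 - z

    return (i, j), score, rule

# Toy dataset (made up): the ordering of features 0 and 2 separates the classes.
X_demo = [[1, 5, 2], [0, 3, 4], [2, 1, 3], [5, 2, 1], [4, 0, 3], [3, 9, 0]]
Y_demo = [1, 1, 1, 0, 0, 0]
pair_demo, score_demo, rule_demo = empirical_tsp_weighted(X_demo, Y_demo)
```

Unlike the majority vote used for the misclassification risk, this rule compares within-class frequencies, so its output does not depend on the class proportions in the sample.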

4.4 Cross-validated TSP classifiers

This section parallels Section 4.3. The main idea is to adopt a different approach to estimate the two TSP classifiers: instead of building the empirical TSP classifiers that we introduce and study in Section 4.3, we rely here on the cross-validation principle. By doing so, we could possibly achieve greater stability and better performances for the resulting estimators. The cross-validation principle has been widely studied both from the theoretical and practical viewpoints (Dudoit and van der Laan, 2005; Arlot, 2007, and references therein). We define the cross-validated counterparts of $R_1$ and $R_2$ in Section 4.4.1. We introduce the two cross-validated TSP classifiers in Section 4.4.2, and we study their asymptotic behaviors in Section 4.4.3.

4.4.1 Cross-validated risk estimator

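We set an integer $V \geq 2$ and a regular partition $(B_v)_{1 \leq v \leq V}$ of $\{1, \ldots, n\}$, i.e., a partition such that, for each $v = 1, \ldots, V$, $\mathrm{card}(B_v) \in \{\lfloor n/V \rfloor, \lfloor n/V \rfloor + 1\}$.

For each $v \in \{1, \ldots, V\}$, we denote by $D_n^{(v)}$ (respectively $D_n^{(-v)}$) the dataset $\{O^k, k \in B_v\}$ (respectively $\{O^k, k \notin B_v\}$), and define the corresponding empirical measures

\[ P_n^{(v)} = \frac{1}{\mathrm{card}(B_v)} \sum_{k \in B_v} \mathrm{Dirac}(O^k), \quad \text{and} \quad P_n^{(-v)} = \frac{1}{n - \mathrm{card}(B_v)} \sum_{k \notin B_v} \mathrm{Dirac}(O^k). \]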

Let $t$ be a pair classifier, i.e., a function mapping the empirical distribution to $\mathcal{F}_{pair}$. Note that $t$ can be viewed simply as a black box algorithm that one applies to data. We characterize the empirical cross-validated risk estimators $\hat R_{1,n}$, for the misclassification risk, and $\hat R_{2,n}$, for the weighted misclassification risk, by

\[ \hat R_{1,n}(t) = \frac{1}{V} \sum_{v=1}^V R_1^{(P_n^{(v)})}\big(t(P_n^{(-v)})\big), \quad \text{and} \quad \hat R_{2,n}(t) = \frac{1}{V} \sum_{v=1}^V R_2^{(P_n^{(v)})}\big(t(P_n^{(-v)})\big) \]

for all $t$. For each $v \in \{1, \ldots, V\}$ and $m = 1, 2$, $R_m^{(P_n^{(v)})}(t(P_n^{(-v)}))$ is the empirical estimator of $R_m^{(P_0)}(t(P_n^{(-v)}))$, based on $D_n^{(v)}$ and conditionally on $D_n^{(-v)}$. Obviously, it holds that, for every $v \in \{1, \ldots, V\}$,

\[ R_1^{(P_n^{(v)})}\big(t(P_n^{(-v)})\big) = \frac{1}{\mathrm{card}(B_v)} \sum_{k \in B_v} L\big(O^k, t(P_n^{(-v)})\big), \quad \text{and} \]

\[ R_2^{(P_n^{(v)})}\big(t(P_n^{(-v)})\big) = \frac{\mathbf{1}\{N_v(1) > 0\}}{N_v(1)} \sum_{k \in I_v(1)} L\big(O^k, t(P_n^{(-v)})\big) + \frac{\mathbf{1}\{N_v(0) > 0\}}{N_v(0)} \sum_{k \in I_v(0)} L\big(O^k, t(P_n^{(-v)})\big), \]

with $I_v(y) = \{k \in B_v, Y^k = y\}$ and $N_v(y) = \mathrm{card}(I_v(y))$ for $y = 0, 1$.

4.4.2 V-fold cross-validation principle

Let $t_1, \ldots, t_L$ be $L$ pair classifiers (with $L$ a possibly large integer).

We first address the case of the misclassification risk $R_1$. Each pair classifier can be viewed as a candidate to estimate the TSP classifier $\Psi_{J_1^*}$ for the misclassification risk $R_1$. One could for instance take $L = \mathrm{card}(\mathcal{J})$ and $\{t_1, \ldots, t_L\} = \{\hat\Psi_J, J \in \mathcal{J}\}$. The goal is to select a pair classifier in the collection $\{t_1, \ldots, t_L\}$ whose risk is the closest to $R_1^{(P_0)}(\Psi_{J_1^*})$. The $V$-fold cross-validation procedure consists in selecting the pair classifier which minimizes the cross-validated risk $\hat R_{1,n}$. So, we introduce the cross-validated selector $\hat\ell_{1,n} \in \arg\min_{\ell \leq L} \hat R_{1,n}(t_\ell)$. The cross-validated TSP classifier is finally defined as $\hat\Psi_n = t_{\hat\ell_{1,n}}$.

Consider now the case of the weighted misclassification risk $R_2$. In that case, each pair classifier can be viewed as a candidate to estimate the TSP classifier $\Phi_{J_2^*}$ for the weighted misclassification risk $R_2$. The set of the $L$ pair classifiers could be $\{t_1, \ldots, t_L\} = \{\hat\Phi_J, J \in \mathcal{J}\}$ for instance. Similarly, we set $\hat\ell_{2,n} \in \arg\min_{\ell \leq L} \hat R_{2,n}(t_\ell)$ and $\hat\Phi_n = t_{\hat\ell_{2,n}}$.
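The V-fold selection step can be sketched as follows for the misclassification risk. This is a hedged illustration only: the helper names, the interleaved construction of the regular partition, the 0-based indices and the toy dataset are all assumptions. Each candidate is represented as a learning algorithm, i.e. a function that maps a training sample to a classification rule, in the spirit of the "black box" view of a pair classifier.

```python
def regular_partition(n, V):
    """Split {0, ..., n-1} into V blocks of size floor(n/V) or floor(n/V) + 1."""
    return [list(range(v, n, V)) for v in range(V)]

def cv_risk(learner, X, Y, blocks):
    """Cross-validated misclassification risk of a learning algorithm."""
    total = 0.0
    for block in blocks:
        held_out = set(block)
        train = [k for k in range(len(Y)) if k not in held_out]
        # Train on the observations outside the block ...
        rule = learner([X[k] for k in train], [Y[k] for k in train])
        # ... and evaluate on the block.
        total += sum(rule(X[k]) != Y[k] for k in block) / len(block)
    return total / len(blocks)

def cv_select(learners, X, Y, V):
    """Return the index of the learner with the smallest cross-validated risk, and all risks."""
    blocks = regular_partition(len(Y), V)
    risks = [cv_risk(learner, X, Y, blocks) for learner in learners]
    return min(range(len(learners)), key=risks.__getitem__), risks

def pair_learner(i, j):
    """Majority-vote pair classifier attached to the fixed pair (i, j)."""
    def learn(X, Y):
        cnt = [[0, 0], [0, 0]]
        for x, y in zip(X, Y):
            cnt[int(x[i] < x[j])][y] += 1
        votes = [int(c[1] >= c[0]) for c in cnt]
        return lambda x: votes[int(x[i] < x[j])]
    return learn

# Toy dataset (made up) and the three candidate pairs for G = 3 features.
X_cv = [[1, 5, 2], [0, 3, 4], [2, 1, 3], [5, 2, 1], [4, 0, 3], [3, 9, 0]]
Y_cv = [1, 1, 1, 0, 0, 0]
candidates = [pair_learner(0, 1), pair_learner(0, 2), pair_learner(1, 2)]
selected, cv_risks = cv_select(candidates, X_cv, Y_cv, V=3)
```

The same scheme applies to the weighted risk by replacing the per-block error rate with the sum of the two within-class error rates computed on the block.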


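4.4.3 Asymptotic performances of the cross-validated TSP classifiers

The asymptotic results that we obtain for the cross-validated TSP classifiers defined in Section 4.4.2 are similar in nature to those of Dudoit and van der Laan (2005). They are expressed as comparisons to the oracle counterparts of the cross-validated TSP classifiers in terms of risks. Accordingly, define $\tilde R_{1,n}$ and $\tilde R_{2,n}$ the oracle counterparts of $\hat R_{1,n}$ and $\hat R_{2,n}$: for any $t$,

\[ \tilde R_{1,n}(t) = \frac{1}{V} \sum_{v=1}^V R_1^{(P_0)}\big(t(P_n^{(-v)})\big), \quad \text{and} \]

\[ \tilde R_{2,n}(t) = \frac{1}{V} \sum_{v=1}^V \Big( E_{P_0}\big[L(O, t(P_n^{(-v)})) \mid Y = 1\big]\mathbf{1}\{N_v(1) > 0\} + E_{P_0}\big[L(O, t(P_n^{(-v)})) \mid Y = 0\big]\mathbf{1}\{N_v(0) > 0\} \Big). \]

They yield the oracle counterparts $\tilde\ell_{1,n} \in \arg\min_{\ell \leq L} \tilde R_{1,n}(t_\ell)$ and $\tilde\ell_{2,n} \in \arg\min_{\ell \leq L} \tilde R_{2,n}(t_\ell)$ of $\hat\ell_{1,n}$ and $\hat\ell_{2,n}$, which yield in turn the oracle counterparts $\tilde\Psi_n = t_{\tilde\ell_{1,n}}$ and $\tilde\Phi_n = t_{\tilde\ell_{2,n}}$ of $\hat\Psi_n$ and $\hat\Phi_n$. We obtain the following result:

Theorem 4.3. It holds that

\[ E\big[\tilde R_{1,n}(\hat\Psi_n) - \tilde R_{1,n}(\tilde\Psi_n)\big] = O\left(\sqrt{\frac{\log(L)}{\lfloor n/V \rfloor}}\right), \quad \text{and} \tag{4.13} \]

\[ E\big[\tilde R_{2,n}(\hat\Phi_n) - \tilde R_{2,n}(\tilde\Phi_n)\big] = O\left(\sqrt{\frac{\log(L)}{\lfloor n/V \rfloor p}}\right). \tag{4.14} \]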

As usual when one deals with cross-validated estimators, the theorem compares $\hat\Psi_n$ and $\hat\Phi_n$ to their oracle counterparts in terms of the oracle cross-validated risks. The theorem teaches us that, in the limit, $\hat\Psi_n$ and $\hat\Phi_n$ perform as well as $\tilde\Psi_n$ and $\tilde\Phi_n$.

If we choose $\{t_1, \ldots, t_L\}$ equal to $\{\hat\Psi_J, J \in \mathcal{J}\}$ or $\{\hat\Phi_J, J \in \mathcal{J}\}$, then the results in Theorem 4.3 are similar to those in Theorems 4.1 and 4.2. However, the rates of convergence in Theorem 4.3 are slightly slower than those of Theorems 4.1 and 4.2 due to the factor $\sqrt{V}$.

Equation (4.13) directly stems from (Dudoit and van der Laan, 2005, Theorem 2). The proof of (4.14) is postponed to Section 4.5.4.

4.5 Proof

This section gathers the proofs of Theorems 4.1, 4.2 and 4.3.


4.5.1 Two useful lemmas

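Lemma 4.4. Set two positive integers $N, M$ and introduce the function $f$ defined on the set of non-negative real numbers by $f(x) = \min(1, \exp(\log(2M) - 2Nx^2))$. The following inequality holds:

\[ \int_0^{+\infty} f(x)\,dx \leq \sqrt{\frac{\log(2M)}{2N}} + \frac{\sqrt{\pi}}{2\sqrt{2N}}. \]

Proof. For all $x \geq 0$, we have $f(x) = \exp(-(2Nx^2 - \log(2M))_+)$. Therefore

\[ \int_0^{+\infty} f(x)\,dx = \sqrt{\frac{\log(2M)}{2N}} + \int_{x \geq \sqrt{\log(2M)/(2N)}} \exp\big(-(2Nx^2 - \log(2M))\big)\,dx. \tag{4.15} \]

Since $a^2 - b^2 \geq (a - b)^2$ for $a \geq b \geq 0$, note that:

\[ \int_{x \geq \sqrt{\log(2M)/(2N)}} \exp\big(-(2Nx^2 - \log(2M))\big)\,dx \leq \int_{x \geq \sqrt{\log(2M)/(2N)}} \exp\left(-2N\left(x - \sqrt{\frac{\log(2M)}{2N}}\right)^2\right) dx = \frac{1}{\sqrt{2N}} \int_0^{+\infty} \exp(-x^2)\,dx = \frac{\sqrt{\pi}}{2\sqrt{2N}}. \tag{4.16} \]

Finally, Equation (4.15) and Equation (4.16) yield the result.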
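The inequality of Lemma 4.4 can be checked numerically. The sketch below approximates the integral by a midpoint rule on a truncated interval; the truncation point, the number of steps and the values of (N, M) are arbitrary choices made for the illustration, and such a check is of course no substitute for the proof.

```python
import math

def f(x, N, M):
    """f(x) = min(1, exp(log(2M) - 2 N x^2))."""
    return min(1.0, math.exp(math.log(2 * M) - 2 * N * x * x))

def integral_of_f(N, M, upper=5.0, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [0, upper]."""
    h = upper / steps
    return h * sum(f((s + 0.5) * h, N, M) for s in range(steps))

def lemma_4_4_bound(N, M):
    """Right-hand side of Lemma 4.4."""
    return math.sqrt(math.log(2 * M) / (2 * N)) + math.sqrt(math.pi) / (2 * math.sqrt(2 * N))

# Arbitrary values of (N, M) used for the illustration.
checks = [(integral_of_f(N, M), lemma_4_4_bound(N, M)) for N, M in [(1, 1), (50, 10), (200, 1000)]]
```

For each pair (N, M), the first entry of the tuple approximates the left-hand side and the second entry is the bound of the lemma.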

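Lemma 4.5. Let $Z \sim \mathcal{B}(n, p)$ be a binomial random variable. Then

\[ E\left[\frac{\mathbf{1}\{Z > 0\}}{\sqrt{Z}}\right] \leq \sqrt{\frac{2}{(n + 1)p}}. \]

Proof. By the Cauchy-Schwarz inequality, it holds that

\[ \left(E\left[\frac{1}{\sqrt{Z + 1}}\right]\right)^2 \leq E\left[\frac{1}{Z + 1}\right] = \sum_{k=0}^n \frac{1}{k + 1} \binom{n}{k} p^k (1 - p)^{n - k} = \int_0^1 (xp + 1 - p)^n\,dx \leq \frac{1}{(n + 1)p}. \]

Now, since $1/\sqrt{k} \leq \sqrt{2}/\sqrt{k + 1}$ for all $k \geq 1$, we obtain

\[ E\left[\frac{\mathbf{1}\{Z > 0\}}{\sqrt{Z}}\right] = \sum_{k=1}^n \frac{1}{\sqrt{k}} \binom{n}{k} p^k (1 - p)^{n - k} \leq \sqrt{2} \sum_{k=1}^n \frac{1}{\sqrt{k + 1}} \binom{n}{k} p^k (1 - p)^{n - k} \leq \sqrt{2}\, E\left[\frac{1}{\sqrt{Z + 1}}\right] \leq \sqrt{\frac{2}{(n + 1)p}}, \]

which is the stated result.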
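The bound of Lemma 4.5 can also be checked by exact summation over the binomial distribution. The function names and the values of (n, p) below are arbitrary choices made for the illustration.

```python
import math

def expected_inverse_sqrt(n, p):
    """E[1{Z > 0} / sqrt(Z)] for Z ~ B(n, p), computed by exact summation."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) / math.sqrt(k)
        for k in range(1, n + 1)
    )

def lemma_4_5_bound(n, p):
    """Right-hand side of Lemma 4.5: sqrt(2 / ((n + 1) p))."""
    return math.sqrt(2 / ((n + 1) * p))

# Arbitrary values of (n, p) used for the illustration.
binomial_checks = [
    (expected_inverse_sqrt(n, p), lemma_4_5_bound(n, p))
    for n, p in [(1, 0.5), (20, 0.3), (100, 0.05)]
]
```

The last case, with a small success probability, mimics the situation of a rare class, for which the lemma is used in the proofs of Theorems 4.2 and 4.3.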


4.5.2 Proof of Theorem 4.1

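The proof of Theorem 4.1 relies on the characterization (4.11) of the empirical TSP classifier. We have:

\[ 0 \leq R_1^{(P_0)}(\hat\Psi_{\hat J_1}) - R_1^{(P_0)}(\Psi_{J_1^*}) = \big(R_1^{(P_0)}(\hat\Psi_{\hat J_1}) - \hat R_1(\hat\Psi_{\hat J_1})\big) + \big(\hat R_1(\hat\Psi_{\hat J_1}) - R_1^{(P_0)}(\Psi_{J_1^*})\big). \]

By (4.11), this yields that

\[ 0 \leq R_1^{(P_0)}(\hat\Psi_{\hat J_1}) - R_1^{(P_0)}(\Psi_{J_1^*}) \leq 2 \sup_{t \in \mathcal{F}_{pair}} \big|R_1^{(P_0)}(t) - \hat R_1(t)\big|. \]

Therefore

\[ 0 \leq E\big[R_1^{(P_0)}(\hat\Psi_{\hat J_1}) - R_1^{(P_0)}(\Psi_{J_1^*})\big] \leq 2 E\left[\sup_{t \in \mathcal{F}_{pair}} \big|R_1^{(P_0)}(t) - \hat R_1(t)\big|\right]. \tag{4.17} \]

Next, we provide an upper bound for the right-hand side expectation. By the Bonferroni inequality, we have for all $h \geq 0$,

\[ P\left(\sup_{t \in \mathcal{F}_{pair}} \big|R_1^{(P_0)}(t) - \hat R_1(t)\big| \geq h\right) \leq \min\left(1, \sum_{t \in \mathcal{F}_{pair}} P\big(\big|R_1^{(P_0)}(t) - \hat R_1(t)\big| \geq h\big)\right). \]

Since for each $t \in \mathcal{F}_{pair}$, $\hat R_1(t)$ is an empirical mean of i.i.d. Bernoulli random variables with common mean $R_1^{(P_0)}(t)$, we deduce from Hoeffding's inequality that:

\[ P\left(\sup_{t \in \mathcal{F}_{pair}} \big|R_1^{(P_0)}(t) - \hat R_1(t)\big| \geq h\right) \leq \min\big(1, \exp\big(\log(2\,\mathrm{card}(\mathcal{F}_{pair})) - 2nh^2\big)\big). \]

Now, with $\mathrm{card}(\mathcal{F}_{pair}) = 4\,\mathrm{card}(\mathcal{J})$,

\[ E\left[\sup_{t \in \mathcal{F}_{pair}} \big|R_1^{(P_0)}(t) - \hat R_1(t)\big|\right] = \int_0^{+\infty} P\left(\sup_{t \in \mathcal{F}_{pair}} \big|R_1^{(P_0)}(t) - \hat R_1(t)\big| \geq h\right) dh \leq \sqrt{\frac{\log(8\,\mathrm{card}(\mathcal{J}))}{2n}} + \frac{\sqrt{\pi}}{2\sqrt{2n}}, \]

by Lemma 4.4. Then (4.17) yields the theorem.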



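4.5.3 Proof of Theorem 4.2

We have:

\[ 0 \leq R_2^{(P_0)}(\hat\Phi_{\hat J_2}) - R_2^{(P_0)}(\Phi_{J_2^*}) = \big(R_2^{(P_0)}(\hat\Phi_{\hat J_2}) - R_2^{(P_0)}(\Phi_{\hat J_2})\big) + \big(R_2^{(P_0)}(\Phi_{\hat J_2}) - R_2^{(P_0)}(\Phi_{J_2^*})\big) \]

\[ = \big(R_2^{(P_0)}(\hat\Phi_{\hat J_2}) - R_2^{(P_0)}(\Phi_{\hat J_2})\big) + \big(\Delta_{J_2^*} - \Delta_{\hat J_2}\big) \]

\[ = \big(R_2^{(P_0)}(\hat\Phi_{\hat J_2}) - R_2^{(P_0)}(\Phi_{\hat J_2})\big) + \big(\Delta_{J_2^*} - \hat\Delta_{J_2^*}\big) + \big(\hat\Delta_{J_2^*} - \Delta_{\hat J_2}\big) \]

\[ \leq \big(R_2^{(P_0)}(\hat\Phi_{\hat J_2}) - R_2^{(P_0)}(\Phi_{\hat J_2})\big) + \big(\Delta_{J_2^*} - \hat\Delta_{J_2^*}\big) + \big(\hat\Delta_{\hat J_2} - \Delta_{\hat J_2}\big), \]

where the second equality follows from the identity $\Delta_J = 1 - R_2^{(P_0)}(\Phi_J)$ of Lemma 4.2 and the inequality holds by definition of $\hat J_2$. To complete the proof, it remains to control $E\big[\mathbf{1}\{0 < N(1) < n\} \sup_{J \in \mathcal{J}} |\Delta_J - \hat\Delta_J|\big]$ and $E\big[\mathbf{1}\{0 < N(1) < n\} \sup_{J \in \mathcal{J}} |R_2^{(P_0)}(\hat\Phi_J) - R_2^{(P_0)}(\Phi_J)|\big]$, by relying on Lemmas 4.6 and 4.7.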

Lemma 4.6. For all $J \in \mathcal{J}$, it holds that

\[ R_2^{(P_0)}(\hat\Phi_J) - R_2^{(P_0)}(\Phi_J) \leq 2\big(|\hat p_J(1) - p_J(1)| + |\hat p_J(0) - p_J(0)|\big), \quad \text{and} \tag{4.18} \]

\[ \big|\hat\Delta_J - \Delta_J\big| \leq |\hat p_J(1) - p_J(1)| + |\hat p_J(0) - p_J(0)|. \tag{4.19} \]

Proof. Inequality (4.18) is a by-product of Lemma 4.2 and the fact that, for each $y \in \{0, 1\}$, $\hat\Phi_J(y) \neq \Phi_J(y)$ implies $\Delta_J = |p_J(1) - p_J(0)| \leq |\hat p_J(1) - p_J(1)| + |\hat p_J(0) - p_J(0)|$.

To show this implication, we just check one of the four different cases that can arise (the others can be addressed similarly). For instance, if $y = 1$ and $\hat\Phi_J(1) = 0$ then $\hat p_J(0) \geq \hat p_J(1)$ and $p_J(0) < p_J(1)$. Thus,

\[ \Delta_J = |p_J(1) - p_J(0)| = p_J(1) - p_J(0) = (p_J(1) - \hat p_J(1)) + (\hat p_J(1) - p_J(0)) \leq (p_J(1) - \hat p_J(1)) + (\hat p_J(0) - p_J(0)) \leq |\hat p_J(1) - p_J(1)| + |\hat p_J(0) - p_J(0)|. \]

Inequality (4.19) relies on a direct application of the reverse triangle inequality.

Lemma 4.7. For each y ∈ {0, 1}, it holds that

EP0

[1{N(y)>0} sup

J∈J|pJ(y)− pJ(y)|

]≤

√2 log(2card(J ))

np+

√π

2np. (4.20)

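Proof. By symmetry, it suffices to present the proof in the case where $y=1$. Let $\mathcal{Y}$ denote the σ-field spanned by $\{Y^k, k=1,\dots,n\}$. We have:

$$E_{P_0}\Big[\mathbf{1}\{N(1)>0\}\sup_{J\in\mathcal{J}}|\hat p_J(1)-p_J(1)|\Big]=E_{P_0}\Big[E\Big[\mathbf{1}\{N(1)>0\}\sup_{J\in\mathcal{J}}|\hat p_J(1)-p_J(1)|\,\Big|\,\mathcal{Y}\Big]\Big],$$

which equals

$$E_{P_0}\Big[\mathbf{1}\{N(1)>0\}\int_0^{+\infty}P\Big(\sup_{J\in\mathcal{J}}|\hat p_J(1)-p_J(1)|\ge h\,\Big|\,\mathcal{Y}\Big)\,dh\Big].$$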

If $N(1)>0$ then, conditionally on $\mathcal{Y}$ and for each $J\in\mathcal{J}$, the random variable $\hat p_J(1)$ is an empirical mean of i.i.d. Bernoulli random variables with common mean $p_J(1)$. Therefore, by the Bonferroni and Hoeffding inequalities, we obtain for all $h\ge 0$:

$$\mathbf{1}\{N(1)>0\}\,P\Big(\sup_{J\in\mathcal{J}}|\hat p_J(1)-p_J(1)|\ge h\,\Big|\,\mathcal{Y}\Big)\le\mathbf{1}\{N(1)>0\}\min\Big(1,\exp\big(\log(2\,\mathrm{card}(\mathcal{J}))-2N(1)h^2\big)\Big).$$

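Applying Lemma 4.4 then gives

$$\mathbf{1}\{N(1)>0\}\int_0^{+\infty}P\Big(\sup_{J\in\mathcal{J}}|\hat p_J(1)-p_J(1)|\ge h\,\Big|\,\mathcal{Y}\Big)\,dh\le\frac{\mathbf{1}\{N(1)>0\}}{\sqrt{2N(1)}}\Big(\sqrt{\log(2\,\mathrm{card}(\mathcal{J}))}+\frac{\sqrt{\pi}}{2}\Big).\qquad(4.21)$$

Since $N(1)\overset{\mathcal{L}}{=}\mathcal{B}(n,p_1)$, (4.21) and Lemma 4.5 yield the result.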


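We recall that (4.13) directly stems from Dudoit and van der Laan (2005). We now give the proof of (4.14). By definition of $\tilde\Phi_n$, one has

$$\begin{aligned}
0\le\tilde R_{2,n}(\hat\Phi_n)-\tilde R_{2,n}(\tilde\Phi_n)&=\big(\tilde R_{2,n}(\hat\Phi_n)-\hat R_{2,n}(\hat\Phi_n)\big)+\big(\hat R_{2,n}(\hat\Phi_n)-\tilde R_{2,n}(\tilde\Phi_n)\big)\\
&\le\big(\tilde R_{2,n}(\hat\Phi_n)-\hat R_{2,n}(\hat\Phi_n)\big)+\big(\hat R_{2,n}(\tilde\Phi_n)-\tilde R_{2,n}(\tilde\Phi_n)\big)\\
&\le 2\sup_{\ell\in\{1,\dots,L\}}\big|\hat R_{2,n}(t_\ell)-\tilde R_{2,n}(t_\ell)\big|.
\end{aligned}$$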

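Now, for each $\ell\in\{1,\dots,L\}$, $\hat R_{2,n}(t_\ell)-\tilde R_{2,n}(t_\ell)$ is equal to

$$\frac{1}{V}\sum_{v=1}^{V}\frac{\mathbf{1}\{N_v(1)>0\}}{N_v(1)}\sum_{i\in I_v(1)}\Big(L\big(O_i,t_\ell(P_n^{(-v)})\big)-E_{P_0}\big[L\big(O,t_\ell(P_n^{(-v)})\big)\,\big|\,Y=1\big]\Big)$$

$$+\;\frac{1}{V}\sum_{v=1}^{V}\frac{\mathbf{1}\{N_v(0)>0\}}{N_v(0)}\sum_{i\in I_v(0)}\Big(L\big(O_i,t_\ell(P_n^{(-v)})\big)-E_{P_0}\big[L\big(O,t_\ell(P_n^{(-v)})\big)\,\big|\,Y=0\big]\Big),$$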


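hence

$$\sup_{\ell\in\{1,\dots,L\}}\big|\hat R_{2,n}(t_\ell)-\tilde R_{2,n}(t_\ell)\big|\le\frac{1}{V}\sum_{v=1}^{V}\Big(\sup_{\ell\in\{1,\dots,L\}}\big|H^1_{\ell,v}\big|+\sup_{\ell\in\{1,\dots,L\}}\big|H^0_{\ell,v}\big|\Big),\qquad(4.22)$$

where, for $y=0,1$,

$$H^y_{\ell,v}=\frac{\mathbf{1}\{N_v(y)>0\}}{N_v(y)}\sum_{i\in I_v(y)}\Big(L\big(O_i,t_\ell(P_n^{(-v)})\big)-E_{P_0}\big[L\big(O,t_\ell(P_n^{(-v)})\big)\,\big|\,Y=y\big]\Big).$$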

For each $v\in\{1,\dots,V\}$ and $y\in\{0,1\}$, conditionally on $D_n^{(-v)}$ and $(Y^i)_{i\in B_v}$, $H^y_{\ell,v}$ is an empirical mean of i.i.d. bounded centered variables. Thus, the Bonferroni and Hoeffding inequalities imply that, for all $h\ge 0$,

$$P\Big(\sup_{\ell\in\{1,\dots,L\}}\big|H^y_{\ell,v}\big|\ge h\,\Big|\,D_n^{(-v)},(Y^i)_{i\in B_v}\Big)\le\min\big(1,\exp(\log(2L)-2N_v(y)h^2)\big),$$

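so that, for each $v\in\{1,\dots,V\}$, we deduce by Lemma 4.4 that

$$E\Big[\sup_{\ell\in\{1,\dots,L\}}\big|H^y_{\ell,v}\big|\,\Big|\,D_n^{(-v)},(Y^i)_{i\in B_v}\Big]\le\frac{\mathbf{1}\{N_v(y)>0\}}{\sqrt{2N_v(y)}}\Big(\sqrt{\log(2L)}+\frac{\sqrt{\pi}}{2}\Big).$$

Since $N_v(y)\overset{\mathcal{L}}{=}\mathcal{B}(n,p_y)$, we complete the proof by applying again Lemma 4.5 and (4.22).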

4.6 Appendix

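In this section, we present another version of Theorem 4.1. This second version is not as good as Theorem 4.1, but it does not involve the fact that the empirical TSP classifier for $R_1$ is a minimizer of the empirical misclassification risk over the set $\mathcal{F}_{\mathrm{pair}}$.

Theorem 4.4. Set $\alpha_*=\min_{J\in\mathcal{J}}\alpha_J$ and $\alpha^*=\max_{J\in\mathcal{J}}\alpha_J$. If $\alpha_\star=\min(\alpha_*,1-\alpha^*)>0$ then

$$0\le E\big[R_1^{(P_0)}(\hat\Psi_{\hat J_1})-R_1^{(P_0)}(\Psi_{J_1^*})\big]=O\Big(\sqrt{\frac{\log(\mathrm{card}(\mathcal{J}))}{n}}\Big)+O\Big(\sqrt{\frac{\log(\mathrm{card}(\mathcal{J}))}{n\alpha_\star}}\Big)+O\Big(\frac{\log(\mathrm{card}(\mathcal{J}))^{3/2}}{n\alpha_\star^2}\Big)+O\big(\exp(\log(\mathrm{card}(\mathcal{J}))-n\alpha_\star)\big).$$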


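Proof. We have

$$\begin{aligned}
0&\le R_1^{(P_0)}(\hat\Psi_{\hat J_1})-R_1^{(P_0)}(\Psi_{J_1^*})\\
&=\big(R_1^{(P_0)}(\hat\Psi_{\hat J_1})-R_1^{(P_0)}(\Psi_{\hat J_1})\big)+\big(R_1^{(P_0)}(\Psi_{\hat J_1})-R_1^{(P_0)}(\Psi_{J_1^*})\big)\\
&=\big(R_1^{(P_0)}(\hat\Psi_{\hat J_1})-R_1^{(P_0)}(\Psi_{\hat J_1})\big)+(\gamma_{J_1^*}-\gamma_{\hat J_1})\\
&=\big(R_1^{(P_0)}(\hat\Psi_{\hat J_1})-R_1^{(P_0)}(\Psi_{\hat J_1})\big)+(\gamma_{J_1^*}-\hat\gamma_{J_1^*})+(\hat\gamma_{J_1^*}-\gamma_{\hat J_1})\\
&\le\big(R_1^{(P_0)}(\hat\Psi_{\hat J_1})-R_1^{(P_0)}(\Psi_{\hat J_1})\big)+(\gamma_{J_1^*}-\hat\gamma_{J_1^*})+(\hat\gamma_{\hat J_1}-\gamma_{\hat J_1}),
\end{aligned}$$

by definition of $J_1^*$. To complete the proof, it simply remains to apply Lemmas 4.8 and 4.9 to control the quantities $E\big[\sup_{J\in\mathcal{J}}\big|R_1^{(P_0)}(\hat\Psi_J)-R_1^{(P_0)}(\Psi_J)\big|\big]$ and $E\big[\sup_{J\in\mathcal{J}}|\hat\gamma_J-\gamma_J|\big]$.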

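Lemma 4.8. For each $J\in\mathcal{J}$, the following inequalities hold:

$$0\le R_1^{(P_0)}(\hat\Psi_J)-R_1^{(P_0)}(\Psi_J)\le 2\big(|\hat\eta_J(1)-\eta_J(1)|+|\hat\eta_J(0)-\eta_J(0)|\big),\quad\text{and}\qquad(4.23)$$

$$|\hat\gamma_J-\gamma_J|\le|\hat\alpha_J-\alpha_J|+|\hat\eta_J(1)-\eta_J(1)|+|\hat\eta_J(0)-\eta_J(0)|.\qquad(4.24)$$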

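Proof. First, it holds that

$$R_1^{(P_0)}(\Psi_J)=\alpha_J\big[\mathbf{1}\{\Psi_J(1)\ne 1\}\eta_J(1)+(1-\eta_J(1))\mathbf{1}\{\Psi_J(1)\ne 0\}\big]+(1-\alpha_J)\big[\mathbf{1}\{\Psi_J(0)\ne 1\}\eta_J(0)+(1-\eta_J(0))\mathbf{1}\{\Psi_J(0)\ne 0\}\big].$$

From this decomposition, we deduce that:

$$R_1^{(P_0)}(\hat\Psi_J)-R_1^{(P_0)}(\Psi_J)=\alpha_J\mathbf{1}\{\hat\Psi_J(1)\ne\Psi_J(1)\}|2\eta_J(1)-1|+(1-\alpha_J)\mathbf{1}\{\hat\Psi_J(0)\ne\Psi_J(0)\}|2\eta_J(0)-1|.\qquad(4.25)$$

Since, for $z\in\{0,1\}$, $\hat\Psi_J(z)\ne\Psi_J(z)$ implies $|2\eta_J(z)-1|\le 2|\hat\eta_J(z)-\eta_J(z)|$, (4.23) follows from (4.25).

Second, it also holds that

$$\begin{aligned}
\hat\alpha_J|\hat\eta_J(1)-1/2|-\alpha_J|\eta_J(1)-1/2|&=\hat\alpha_J|\hat\eta_J(1)-1/2|-\hat\alpha_J|\eta_J(1)-1/2|+\hat\alpha_J|\eta_J(1)-1/2|-\alpha_J|\eta_J(1)-1/2|\\
&=\hat\alpha_J\big(|\hat\eta_J(1)-1/2|-|\eta_J(1)-1/2|\big)+(\hat\alpha_J-\alpha_J)\,|\eta_J(1)-1/2|.
\end{aligned}$$

By the reverse triangle inequality, and since $\hat\alpha_J\le 1$ and $|\eta_J(1)-1/2|\le 1/2$, we deduce from these equalities that

$$\big|\hat\alpha_J|\hat\eta_J(1)-1/2|-\alpha_J|\eta_J(1)-1/2|\big|\le|\hat\eta_J(1)-\eta_J(1)|+\frac{1}{2}|\hat\alpha_J-\alpha_J|.$$

By symmetry, we also have

$$\big|(1-\hat\alpha_J)|\hat\eta_J(0)-1/2|-(1-\alpha_J)|\eta_J(0)-1/2|\big|\le|\hat\eta_J(0)-\eta_J(0)|+\frac{1}{2}|\hat\alpha_J-\alpha_J|,$$

which completes the proof.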
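The elementary inequality behind (4.24) can be checked exhaustively on a grid. In the sketch below, `a_hat`, `a`, `e_hat` and `e` are hypothetical stand-ins for an estimated and a true class weight and an estimated and a true conditional probability, all in $[0,1]$; the grid resolution is an arbitrary choice.

```python
from itertools import product


def slack(a_hat, a, e_hat, e):
    """Right-hand side minus left-hand side of the inequality; should never be negative."""
    lhs = abs(a_hat * abs(e_hat - 0.5) - a * abs(e - 0.5))
    rhs = abs(e_hat - e) + 0.5 * abs(a_hat - a)
    return rhs - lhs


grid = [k / 10 for k in range(11)]  # values 0.0, 0.1, ..., 1.0
worst = min(slack(*point) for point in product(grid, repeat=4))
print(worst)  # smallest slack over the grid, up to rounding error
```

The bound is tight: for instance `a_hat = 1`, `a = 0`, `e_hat = e = 1` gives equality, so no smaller constant than 1/2 can be placed in front of the weight term.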


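Lemma 4.9. Let $C=\sqrt{2\log(2\,\mathrm{card}(\mathcal{J}))}$. For each $y\in\{0,1\}$ it holds that:

$$E\Big[\sup_{J\in\mathcal{J}}|\hat\alpha_J-\alpha_J|\Big]\le\frac{C}{\sqrt{n}},\quad\text{and}\qquad(4.26)$$

$$E\Big[\sup_{J\in\mathcal{J}}|\hat\eta_J(y)-\eta_J(y)|\Big]\le\frac{\sqrt{2}\,C}{\sqrt{n\alpha_\star}}+\frac{C^3}{n\alpha_\star^2}+\frac{1}{2}\exp\big(\log(\mathrm{card}(\mathcal{J}))-n\alpha_\star\big).\qquad(4.27)$$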

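Proof. By the Bonferroni inequality, we have for all $h\ge 0$,

$$P\Big(\sup_{J\in\mathcal{J}}|\hat\alpha_J-\alpha_J|\ge h\Big)\le\min\Big(1,\sum_{J\in\mathcal{J}}P\big(|\hat\alpha_J-\alpha_J|\ge h\big)\Big).$$

Since for each $J\in\mathcal{J}$, $\hat\alpha_J$ is an empirical mean of i.i.d. Bernoulli random variables with common mean $\alpha_J$, we deduce from Hoeffding's inequality and the previous bound that:

$$P\Big(\sup_{J\in\mathcal{J}}|\hat\alpha_J-\alpha_J|\ge h\Big)\le\min\big(1,\exp(\log(2\,\mathrm{card}(\mathcal{J}))-2nh^2)\big).$$

Since $\mathrm{card}(\mathcal{J})\ge 2$, applying Lemma 4.4 yields (4.26).

Consider now (4.27). By symmetry it suffices to present the proof in the case where $y=1$, for instance. For each $J\in\mathcal{J}$, we denote $N_J=n\hat\alpha_J$, $\mathcal{J}_1=\{J\in\mathcal{J}: N_J>0\}$ and $\mathcal{J}_2=\mathcal{J}\setminus\mathcal{J}_1$. Let $\mathcal{Z}$ denote the σ-field spanned by $\{Z_J^k : k=1,\dots,n;\ J\in\mathcal{J}\}$. We have:

$$E\Big[\sup_{J\in\mathcal{J}}|\hat\eta_J(1)-\eta_J(1)|\Big]=E\Big[E\Big[\sup_{J\in\mathcal{J}}|\hat\eta_J(1)-\eta_J(1)|\,\Big|\,\mathcal{Z}\Big]\Big]=E\Big[\int_0^{+\infty}P\Big(\sup_{J\in\mathcal{J}}|\hat\eta_J(1)-\eta_J(1)|\ge h\,\Big|\,\mathcal{Z}\Big)\,dh\Big].$$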

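Let us first derive an upper bound for $I(\mathcal{Z})=\int_0^{+\infty}P\big(\sup_{J\in\mathcal{J}}|\hat\eta_J(1)-\eta_J(1)|\ge h\,\big|\,\mathcal{Z}\big)\,dh$. Note that $I(\mathcal{Z})\le I_1(\mathcal{Z})+I_2(\mathcal{Z})$, with

$$I_m(\mathcal{Z})=\int_0^{+\infty}P\Big(\sup_{J\in\mathcal{J}_m}|\hat\eta_J(1)-\eta_J(1)|\ge h\,\Big|\,\mathcal{Z}\Big)\,dh\qquad(m=1,2).$$

Since, conditionally on $\mathcal{Z}$, the random variable $\hat\eta_J(1)$ is an empirical mean of i.i.d. Bernoulli random variables with common mean $\eta_J(1)$, the Bonferroni and Hoeffding inequalities yield, for all $h\ge 0$,

$$P\Big(\sup_{J\in\mathcal{J}_1}|\hat\eta_J(1)-\eta_J(1)|\ge h\,\Big|\,\mathcal{Z}\Big)\le\min\Big(1,\exp\Big(\log(2\,\mathrm{card}(\mathcal{J}))-2\min_{J\in\mathcal{J}_1}N_J\,h^2\Big)\Big).$$

Since $\mathrm{card}(\mathcal{J})\ge 2$, applying Lemma 4.4 then yields

$$I_1(\mathcal{Z})\le C\max_{J\in\mathcal{J}_1}\frac{1}{\sqrt{N_J}}\,\mathbf{1}\{\mathcal{J}_1\ne\emptyset\}.$$

Since for each $J\in\mathcal{J}_2$ we have $\hat\eta_J(1)=\frac{1}{2}$, it holds that

$$P\Big(\sup_{J\in\mathcal{J}_2}|\hat\eta_J(1)-\eta_J(1)|\ge h\,\Big|\,\mathcal{Z}\Big)\le\mathbf{1}\{h\le 1/2\}\,\mathbf{1}\{\mathcal{J}_2\ne\emptyset\}$$

for all $h\ge 0$, hence $I_2(\mathcal{Z})\le\frac{1}{2}\mathbf{1}\{\mathcal{J}_2\ne\emptyset\}$.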

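Let us now derive upper bounds on $I_1=E\big[C\max_{J\in\mathcal{J}_1}\frac{1}{\sqrt{N_J}}\mathbf{1}\{\mathcal{J}_1\ne\emptyset\}\big]$ and $I_2=\frac{1}{2}P(\mathcal{J}_2\ne\emptyset)$.

Set $\beta>0$. One has

$$\max_{J\in\mathcal{J}_1}\frac{1}{\sqrt{N_J}}\le\frac{1}{\beta}\log\Big(\sum_{J\in\mathcal{J}_1}\exp\Big(\frac{\beta}{\sqrt{N_J}}\Big)\Big).$$

Observe that

$$\begin{aligned}
\sum_{J\in\mathcal{J}_1}\exp\Big(\frac{\beta}{\sqrt{N_J}}\Big)&=\sum_{J\in\mathcal{J}_1}\exp\Big(\frac{\beta}{\sqrt{N_J}}\Big)\mathbf{1}\Big\{N_J\le\frac{n\alpha_J}{2}\Big\}+\sum_{J\in\mathcal{J}_1}\exp\Big(\frac{\beta}{\sqrt{N_J}}\Big)\mathbf{1}\Big\{N_J\ge\frac{n\alpha_J}{2}\Big\}\\
&\le\exp(\beta)\sum_{J\in\mathcal{J}}\mathbf{1}\Big\{N_J\le\frac{n\alpha_J}{2}\Big\}+\mathrm{card}(\mathcal{J})\exp\Big(\beta\sqrt{\frac{2}{n\alpha_\star}}\Big),
\end{aligned}$$

so that Jensen's inequality implies

$$E\Big[\max_{J\in\mathcal{J}_1}\frac{1}{\sqrt{N_J}}\mathbf{1}\{\mathcal{J}_1\ne\emptyset\}\Big]\le\frac{1}{\beta}\log\Big(E\Big[\exp(\beta)\sum_{J\in\mathcal{J}}\mathbf{1}\Big\{N_J\le\frac{n\alpha_J}{2}\Big\}+\mathrm{card}(\mathcal{J})\exp\Big(\beta\sqrt{\frac{2}{n\alpha_\star}}\Big)\Big]\Big).\qquad(4.28)$$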

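Because $N_J\overset{\mathcal{L}}{=}\mathcal{B}(n,\alpha_J)$, Hoeffding's inequality yields, for each $J\in\mathcal{J}$, $P(N_J\le n\alpha_J/2)\le\exp(-n\alpha_J^2/2)\le\exp(-n\alpha_\star^2/2)$. Combining the previous inequality and (4.28) implies

$$I_1\le\frac{C}{\beta}\log\Big(\mathrm{card}(\mathcal{J})\Big(\exp\Big(\beta-\frac{n\alpha_\star^2}{2}\Big)+\exp\Big(\beta\sqrt{\frac{2}{n\alpha_\star}}\Big)\Big)\Big).\qquad(4.29)$$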


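Finally, choosing $\beta=n\alpha_\star^2/2$ in (4.29) yields

$$I_1\le C\Big(\frac{2\log(2\,\mathrm{card}(\mathcal{J}))}{n\alpha_\star^2}+\sqrt{\frac{2}{n\alpha_\star}}\Big).\qquad(4.30)$$

Regarding $I_2$, we note that $P(\mathcal{J}_2\ne\emptyset)\le\sum_{J\in\mathcal{J}}P(N_J=0)$, hence

$$I_2\le\frac{\mathrm{card}(\mathcal{J})}{2}\exp(-n\alpha_\star).\qquad(4.31)$$

Inequalities (4.30) and (4.31) yield the result.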


Chapter 5

Complements

5.1 Change-point estimation

In this section, we give some complements on the estimation of change points by contrast minimization for a Cox-Ingersoll-Ross process observed at discrete times, when the number of change points is unknown. In particular, we propose a procedure for estimating change points that affect the drift and/or the volatility. The framework and objectives are set out in Section 5.1.1. The proposed procedure is described in Section 5.1.2, and a simulation study is carried out in Section 5.1.3 to assess its numerical performance.

5.1.1 Framework and objective

We consider $C_{1:N}$, the discrete observations of a stochastic process $(C(t))_{t\in[T_0,T]}$. We denote by $T_0=\delta<T_1=\tau_1^*\delta<\dots<T_{K^*-1}=\tau^*_{K^*-1}\delta<T_{K^*}=T$ the sequence of unknown change points, with $\tau_k^*\in\mathbb{N}$. We assume that the integer $K^*$ is unknown, but that an upper bound $\bar K\ge K^*$ is known. We write $\tau^*=(\tau_1^*,\dots,\tau^*_{K^*-1})$.

For $k=1,\dots,K^*$, we assume that $(C(t))_{t\in[T_{k-1},T_k[}$ is defined by the following stochastic differential equation (SDE):

$$\begin{cases}dC(t)=\lambda_k(\mu_k-C(t))\,dt+\sigma_k\sqrt{C(t)}\,dW(t)\\ C(T_{k-1})=C_{\tau_{k-1}},\end{cases}$$

where the parameters $\lambda_k,\mu_k,\sigma_k$ are unknown and strictly positive, and the initial condition $C(T_{k-1})$ is the observation $C_{\tau_{k-1}}$ at time $T_{k-1}$. We write $\mathcal{K}=\{0,\dots,K^*\}$. We assume that the changes may affect the parameters $\mu_k$ and $\sigma_k$ for $k=1,\dots,K^*$. We denote by $\mathcal{K}_1=\{k_0=0<k_1^{(1)}<\dots<k^{(1)}_{K_1^*}=K^*\}$ the subset of $\mathcal{K}$ characterized, for $j\in\{1,\dots,K_1^*\}$, by

$$\mu_k=\mu_{k_j^{(1)}},\ \forall k\in\{k_{j-1}+1,\dots,k_j\},\quad\text{and}\quad\mu_{k_j^{(1)}}\ne\mu_{k_{j+1}^{(1)}},\ j\ne K_1^*.$$

Thus $(\tau_k^*)_{k\in\mathcal{K}_1\setminus\{0,K^*\}}$ is the sequence of change points affecting the parameter $\mu$. Similarly, we denote by $\mathcal{K}_2=\{k_0=0<k_1^{(2)}<\dots<k^{(2)}_{K_2^*}=K^*\}$ the subset of $\mathcal{K}$ such that $(\tau_k^*)_{k\in\mathcal{K}_2\setminus\{0,K^*\}}$ is the sequence of change points affecting the parameter $\sigma$. Finally, we denote by $A\in\{1,2,3\}^{K^*-1}$ the vector defined, for $k=1,\dots,K^*-1$, by $A_k=1$ if $k\notin\mathcal{K}_2$, $A_k=2$ if $k\notin\mathcal{K}_1$, and $A_k=3$ if $k\in\mathcal{K}_1\cap\mathcal{K}_2$ (in which case the change affects $\mu$ and $\sigma$ simultaneously).

To our knowledge, no study has addressed the estimation of an unknown number of change points affecting the parameters $\mu$ and/or $\sigma$ of Cox-Ingersoll-Ross processes. However, many works have dealt with the estimation of a known number of change points affecting either the drift or the volatility of diffusion processes (Iacus and Yoshida, 2012). In this work, we propose a procedure for estimating $K^*$, $\tau^*$ and $A$.


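We introduce the quantities $\Delta_{\tau^*}$, $\Delta_\mu$ and $\Delta_\sigma$ defined by:

$$\Delta_{\tau^*}=\min_{k\in\{1,\dots,K^*\}}|\tau_k^*-\tau_{k-1}^*|,\qquad\Delta_\mu=\min_{k\in\{1,\dots,K^*\}}|\mu_k-\mu_{k+1}|,\qquad\Delta_\sigma=\min_{k\in\{1,\dots,K^*\}}|\sigma_k-\sigma_{k+1}|.$$

We assume that the real numbers $\Delta_{\tau^*}$, $\Delta_\mu$ and $\Delta_\sigma$ are known and strictly positive. This last assumption is classical for parameter identifiability. In the sequel, we write $\theta_k=(1-\delta\lambda_k,\ \delta\lambda_k\mu_k,\ \sqrt{\delta}\,\sigma_k)$.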

5.1.2 Estimation procedure

In this section, we describe the procedure for estimating $K^*$, $\tau^*$ and $A$. Our procedure builds on the work of Lavielle (1999) and Lavielle and Ludeña (2000) on change-point estimation by penalized contrast minimization. We adapt their procedure to the particular setting of change points affecting the drift and the volatility of an SDE. To this end, we define contrasts based on the pseudo-likelihood of the Euler scheme with time step $\delta$:

$$J_1(K,\tau,C_{1:N})=\frac{1}{N}\sum_{k=1}^{K}\sum_{i=\tau_{k-1}}^{\tau_k}\big(C_{i+1}-\hat\theta_{1,k}C_i-\hat\theta_{2,k}\big)^2,$$

$$J_2(K,\tau,C_{1:N})=\frac{1}{N}\sum_{k=1}^{K}(\tau_k-\tau_{k-1}+1)\log\big(\hat\theta_{3,k}^2\big),$$

where, for $k=1,\dots,K$, $\hat\theta_{3,k}$ is the estimator of the parameter $\theta_{3,k}$ defined in Chapter 3 and $(\hat\theta_{1,k},\hat\theta_{2,k})$ is defined by:

$$(\hat\theta_{1,k},\hat\theta_{2,k})=\arg\min_{\theta_{1,k},\theta_{2,k}}\sum_{i=\tau_{k-1}}^{\tau_k}\big(C_{i+1}-\theta_{1,k}C_i-\theta_{2,k}\big)^2.$$
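As an illustration of the contrast $J_1$, the sketch below fits the regression of $C_{i+1}$ on $C_i$ by ordinary least squares on each segment and sums the residual squares. It is only a sketch under stated assumptions: observations are stored in a 0-indexed list, segments are taken as half-open index ranges (so that no pair $(C_i,C_{i+1})$ is counted twice), and the names `ols_segment` and `contrast_j1` are hypothetical.

```python
def ols_segment(c, start, stop):
    """Least-squares fit of c[i+1] ~ theta1 * c[i] + theta2 for i in [start, stop)."""
    xs = c[start:stop]
    ys = c[start + 1:stop + 1]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx if sxx > 0 else 0.0
    theta2 = mean_y - theta1 * mean_x
    rss = sum((y - theta1 * x - theta2) ** 2 for x, y in zip(xs, ys))
    return theta1, theta2, rss


def contrast_j1(c, tau):
    """Sum of per-segment residual squares divided by N; tau = [tau_0, ..., tau_K]."""
    return sum(ols_segment(c, tau[k - 1], tau[k])[2] for k in range(1, len(tau))) / len(c)


# Noise-free toy series whose intercept changes at index 20 (hypothetical data).
series = [10.0]
for _ in range(20):
    series.append(0.5 * series[-1] + 1.0)
for _ in range(20):
    series.append(0.5 * series[-1] + 3.0)

print(contrast_j1(series, [0, 20, 40]))  # essentially zero: correct segmentation
print(contrast_j1(series, [0, 40]))      # clearly positive: the change is ignored
```

On this toy series the contrast vanishes at the true segmentation and is strictly larger when the change point is omitted, which is the behaviour the penalized criterion exploits.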

The estimation of $(K^*,\tau^*,A)$ proceeds in two stages:
– first, we compute:

1. $(\hat\tau_\mu,\hat K_1)$, the estimators of the change points affecting $\mu$ (i.e. of the parameters $((\tau_k^*)_{k\in\mathcal{K}_1\setminus\{0,K^*\}},K_1^*)$), defined by:

$$(\hat\tau_\mu,\hat K_1)=\arg\min_{(\tau,\,K\le\bar K)}J_1(K,\tau,C_{1:N})+\beta_N^{(1)}K,$$

with $\beta_N^{(1)}$ a real sequence converging to 0;

2. $(\hat\tau_\sigma,\hat K_2)$, the estimators of the change points affecting $\sigma$ (i.e. of the parameters $((\tau_k^*)_{k\in\mathcal{K}_2\setminus\{0,K^*\}},K_2^*)$), defined by

$$(\hat\tau_\sigma,\hat K_2)=\arg\min_{(\tau,\,K\le\bar K)}J_2(K,\tau,C_{1:N})+\beta_N^{(2)}K,$$

with $\beta_N^{(2)}$ a real sequence converging to 0;

the resulting estimators define a vector $\hat\tau=(\hat\tau_\mu,\hat\tau_\sigma)\in\mathbb{R}^{\hat K_1-1}\times\mathbb{R}^{\hat K_2-1}$ of candidate change points; we define $\hat A=(\hat A_\mu,\hat A_\sigma)\in\{1\}^{\hat K_1-1}\times\{2\}^{\hat K_2-1}$, the vector containing the type of each change; at this stage, we assign the value 1 if the change affects $\mu$ and 2 if it affects $\sigma$; since we have not yet determined which elements of $\hat\tau$ correspond to changes affecting $\mu$ and $\sigma$ simultaneously, no element of $\hat A$ takes the value 3;

– the last stage consists in identifying the changes that affect $\mu$ and $\sigma$ simultaneously; to this end we consider:
$\tau^{\mathrm{prop}}=(\tau_i^{\mathrm{prop}},\ i\le\hat K_1+\hat K_2-2)$, the vector made of the elements of $\hat\tau$ sorted in increasing order;
$A^{\mathrm{prop}}$, the vector containing the types of the changes associated with $\tau^{\mathrm{prop}}$;

– finally, we compute the estimators of $(K^*,\tau^*,A)$ by the following iterative scheme:
1. we set $\tau^{\mathrm{fin}}=\tau_1^{\mathrm{prop}}$, $A^{\mathrm{fin}}=A_1^{\mathrm{prop}}$, $i=1$, $\hat K=1$ and a threshold $s>0$; while $i<\hat K_1+\hat K_2-2$, we repeat the following step:
2. $i\leftarrow i+1$;
(a) if $|\tau_i^{\mathrm{prop}}-\tau^{\mathrm{fin}}_{\hat K}|>s$, then $\tau^{\mathrm{fin}}\leftarrow(\tau^{\mathrm{fin}},\tau_i^{\mathrm{prop}})$, $A^{\mathrm{fin}}\leftarrow(A^{\mathrm{fin}},A_i^{\mathrm{prop}})$ and $\hat K\leftarrow\hat K+1$;
(b) if $|\tau_i^{\mathrm{prop}}-\tau^{\mathrm{fin}}_{\hat K}|\le s$, then $\tau^{\mathrm{fin}}$ and $\hat K$ are left unchanged; moreover,
. if $A_i^{\mathrm{prop}}\ne A^{\mathrm{fin}}_{\hat K}$ then $A^{\mathrm{fin}}_{\hat K}\leftarrow 3$;
. otherwise $A^{\mathrm{fin}}$ is left unchanged.

Finally, we take $(\hat K,\tau^{\mathrm{fin}},A^{\mathrm{fin}})$ as the estimator of $(K^*,\tau^*,A)$.
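The merging stage above can be sketched in a few lines. The function name `merge_change_points` and the example values are hypothetical; the sketch assumes that candidates are given as index positions, that the threshold `s` is expressed in the same unit, and that the gap between consecutive candidates is measured in absolute value.

```python
def merge_change_points(tau_mu, tau_sigma, s):
    """Merge candidate change points for mu (label 1) and sigma (label 2).

    Candidates closer than s to the last retained point are merged with it;
    if their labels differ, the retained point gets label 3 (both parameters).
    Returns the retained change points and their labels.
    """
    proposals = sorted([(t, 1) for t in tau_mu] + [(t, 2) for t in tau_sigma])
    if not proposals:
        return [], []
    tau_fin = [proposals[0][0]]
    a_fin = [proposals[0][1]]
    for t, label in proposals[1:]:
        if abs(t - tau_fin[-1]) > s:
            # Far enough from the last retained point: keep it as a new change point.
            tau_fin.append(t)
            a_fin.append(label)
        elif label != a_fin[-1]:
            # Same location, different parameter: the change affects mu and sigma.
            a_fin[-1] = 3
    return tau_fin, a_fin


print(merge_change_points([150, 500], [152, 700], s=80))
# -> ([150, 500, 700], [3, 1, 2])
```

In the example, the candidates 150 (for $\mu$) and 152 (for $\sigma$) are merged into a single change of type 3, while 500 and 700 are kept as separate changes of type 1 and 2.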

The choice of the sequences $\beta_N^{(1)}$ and $\beta_N^{(2)}$ relies on the results of Lavielle (1999) and Lavielle and Ludeña (2000), who show, for sequences of dependent variables and under general assumptions, that a good choice for $\beta_N^{(1)}$ and $\beta_N^{(2)}$ is the rate of convergence of the estimators of the parameters affected by the change. Now, Kessler (1997) shows, under general assumptions, that if the change points are known, then for $k=1,\dots,K_1^*$ the estimator $\hat\mu_k$ of $\mu_k$ obtained from $\hat\theta_{1,k}$ and $\hat\theta_{2,k}$ is consistent and satisfies $|\hat\mu_k-\mu_k|=O_P\big(\frac{1}{\sqrt{N\delta}}\big)$, and for $k=1,\dots,K_2^*$ the estimator $\hat\sigma_k$ of $\sigma_k$ obtained from $\hat\theta_{3,k}$ is consistent and satisfies $|\hat\sigma_k-\sigma_k|=O_P\big(\frac{1}{\sqrt{N}}\big)$. We therefore propose to take the sequences $\beta_N^{(1)}=\frac{1}{\sqrt{N\delta}}$ and $\beta_N^{(2)}=\frac{1}{\sqrt{N}}$. The choice of the parameter $s$ is based on $\Delta_{\tau^*}$.


5.1.3 Simulation study

In this section, we carry out a simulation study to assess the numerical performance of the procedure described in Section 5.1.2. To study the influence of $K^*$, we consider two configurations: $K^*=2$, $T_1=\delta\tau_1^*=15$ (scenario 1) and $K^*=3$, $T_1=\delta\tau_1^*=15$ and $T_2=\delta\tau_2^*=50$ (scenario 2). For scenario 1, we consider one set of parameters $S_1$ for $(\lambda_k,\mu_k,\sigma_k)_{k=1,\dots,K^*}$. For scenario 2, we consider two sets of parameters, $S_2$ and $S_3$. The parameter sets are given in Table 5.1.

            scenario 1      scenario 2
            S1              S2                     S3
param.      k=1    k=2      k=1    k=2    k=3      k=1    k=2    k=3
λ_k         1.2    1.2      1.2    1.2    1.2      1.2    1.2    1.2
μ_k         9      5        9      5      5        5      9      5
σ_k         1.5    1        1      1      1.5      1      1.5    1

Table 5.1: Parameter sets. For each scenario, we give the set of parameters $(\lambda,\mu,\sigma)$ considered. For scenario 1, the change affects both $\mu$ and $\sigma$, so $A=3$. For scenario 2 with parameter set $S_2$, the first change affects $\mu$ and the second affects $\sigma$, so $A=(1,2)$. For scenario 2 with parameter set $S_3$, both changes affect the two parameters simultaneously, so $A=(3,3)$.

For each scenario and each parameter set, we simulate 100 trajectories with the Euler scheme with time step 0.0025; for each of them we keep only the subset of the trajectory made of the observations at times $i\delta$ for $i=1,\dots,N$. Then, for each resulting trajectory, we apply the procedure described in Section 5.1.2 with $\bar K=4$, $\delta s=8$, $\beta_N^{(1)}=\frac{2}{\sqrt{\delta N}}$ and $\beta_N^{(2)}=\frac{1}{\sqrt{N}}$. Note that in practice we use $\frac{2}{\sqrt{\delta N}}$ rather than $\frac{1}{\sqrt{\delta N}}$ for the sequence $\beta_N^{(1)}$, because it gives better results. For $j=1,\dots,100$, we obtain $\hat K_j,\hat\tau_j,\hat A_j$, the estimated values of $(K^*,\tau^*,A)$ for the $j$-th trajectory. Finally, we compute:

$$\mathrm{Perf}=\frac{1}{100}\sum_{j=1}^{100}\mathbf{1}\{\hat A_j=A\}\quad\text{and}\quad\bar\tau=\frac{1}{\#\{j:\hat A_j=A\}}\sum_{j:\,\hat A_j=A}\hat\tau_j,$$

the rate of correct estimation of $A$ and the vector of empirical means of the estimated values of $\tau^*$ when $A$ is correctly estimated. The results are summarized in Table 5.2.
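The simulation design can be sketched as follows for scenario 1. This is an illustrative sketch, not the code used for the study: the sampling step `delta = 0.1` is a hypothetical value (the text only fixes the fine Euler step 0.0025 and $N\delta=70$), the process is started at the first mean, and the state is truncated at 0 to keep the square root defined, which is one common convention among several.

```python
import math
import random


def simulate_cir(segments, t_end, delta, fine_step=0.0025, seed=0):
    """Euler scheme for a CIR process with piecewise-constant parameters.

    segments: list of (end_time, lam, mu, sigma), sorted by end_time.
    Returns the observations at times delta, 2*delta, ..., t_end.
    """
    rng = random.Random(seed)
    ratio = round(delta / fine_step)      # fine steps per observation
    n_obs = round(t_end / delta)
    c = segments[0][2]                    # assumption: start at the first mean
    seg = 0
    observations = []
    for step in range(1, n_obs * ratio + 1):
        t = step * fine_step
        while seg < len(segments) - 1 and t > segments[seg][0]:
            seg += 1                      # move to the regime active at time t
        _, lam, mu, sigma = segments[seg]
        dw = rng.gauss(0.0, math.sqrt(fine_step))
        c = c + lam * (mu - c) * fine_step + sigma * math.sqrt(max(c, 0.0)) * dw
        c = max(c, 0.0)                   # truncation keeps the state non-negative
        if step % ratio == 0:
            observations.append(c)
    return observations


# Scenario 1 (parameter set S1): change at T1 = 15, horizon N * delta = 70.
scenario_1 = [(15.0, 1.2, 9.0, 1.5), (70.0, 1.2, 5.0, 1.0)]
obs = simulate_cir(scenario_1, t_end=70.0, delta=0.1)
before = sum(obs[:150]) / 150
after = sum(obs[150:]) / len(obs[150:])
print(len(obs), before, after)
```

The empirical mean before the change is expected to be close to 9 and the mean after it close to 5; repeating the call with different seeds gives the 100 trajectories of the study.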

            scenario 1      scenario 2
            S1              S2                          S3
Perf        95%             94%                         75%
δ τ̄        14.1 (1.7)      15.8 (1.6)   50.0 (0.2)     15.0 (0.5)   48.9 (2.0)

Table 5.2: Performance of the procedure. For each scenario and each parameter set, Perf gives the proportion of trajectories for which the vector $A$ is correctly estimated. Over the trajectories for which $A$ is correctly estimated, we compute the empirical mean of the change-point estimators, with the standard deviation in parentheses.

The results in Table 5.2 show that the proposed procedure performs well numerically for scenario 1 (parameter set $S_1$) and for scenario 2 with parameter set $S_2$. In these cases, the vector $A$ is correctly estimated in 95% of the cases (or almost: 94% for scenario 2), and the estimation of the change points is very satisfactory: the empirical means are very close to the true values and the standard deviations are small. For scenario 2 with parameter set $S_3$, the performance is not as good: the vector $A$ is correctly estimated in only 75% of the cases. Note, however, that when $A$ is correctly estimated the change-point estimates remain satisfactory: the empirical means are close to the true values and the standard deviations are small. Thus, the procedure is less effective at recognizing changes that affect $\mu$ and $\sigma$ simultaneously when there is more than one change. This may be explained by the fact that $N\delta=70$ is not large enough, which degrades the performance of the estimator $(\hat\tau_\mu,\hat K_1)$ (whatever the scenario and the parameter set, the integer $K_2^*$ is always correctly estimated).

Finally, we conclude from this simulation study that one may hope to extend the consistency results (in particular the rates of convergence) for change-point estimators obtained by Lavielle (1999) and Lavielle and Ludeña (2000) to the estimation, by minimization of the Euler pseudo-likelihood, of change points affecting the parameter $\mu$ and/or $\sigma$ of Cox-Ingersoll-Ross processes observed at discrete times (and possibly of other types of processes). It is worth noting that Iacus and Yoshida (2012) proved a result of this type for diffusion processes observed at discrete times with unknown drift, for the estimation of a single change point affecting the volatility.


5.2 Super-learning for multi-class classification

In this section, we propose an extension of super-learning to multi-class classification. We give some reminders in Section 5.2.1 and present the problem of interest in Section 5.2.2. We describe the principle of super-learning for multi-class classification in Section 5.2.3 and its implementation in Section 5.2.4. Finally, we apply super-learning to a few classical data sets in Section 5.2.5.

5.2.1 Framework and notation

The generic observation is $(X,Y)\sim P_0$ with $X\in\mathcal{X}$ and $Y\in\mathcal{Y}=\{1,\dots,K\}$, where $K\ge 3$. We write $p^*=(p_1^*,\dots,p_K^*)$ with, for all $j\in\{1,\dots,K\}$, $p_j^*(x)=P_0(Y=j\,|\,X=x)$ for all $x\in\mathcal{X}$. We also write $R^*=R(g^*)$ for the generalization error of the Bayes classifier. We introduce $\mathcal{F}=\{f:\mathcal{X}\to\mathbb{R}^K,\ f=(f_1,\dots,f_K),\ f_1+\dots+f_K=0\}$. Note that an element $f\in\mathcal{F}$ determines a classifier $g\in\mathcal{G}$ through the relation:

$$g(X)=\arg\max_{j\in\mathcal{Y}}f_j(X).$$

If, for all $j\in\{1,\dots,K\}$, $p_j^*(X)<1$ a.s., we can consider the function $f^*\in\mathcal{F}$ defined, outside a set of measure zero, by:

$$f_j^*(X)=\frac{1}{K}\sum_{l=1}^{K}\log(1-p_l^*(X))-\log(1-p_j^*(X))\qquad(j=1,\dots,K).$$

The function $f^*$ is called the target function and satisfies:

$$g^*(X)=\arg\max_{j\in\mathcal{Y}}f_j^*(X).$$

Let $\Psi:\mathbb{R}\to\mathbb{R}_+$ be a real function. We denote by $L_\Psi$ the Ψ-loss function characterized, for all $f\in\mathcal{F}$, by:

$$L_\Psi((X,Y),f)=\sum_{j=1}^{K}\mathbf{1}\{Y\ne j\}\,\Psi(-f_j(X)),$$

and by $R_\Psi(f)=E_{P_0}\big(L_\Psi((X,Y),f)\big)$ the associated Ψ-risk. This type of loss function is considered in Rubin (2010) for a multi-class version of the AdaBoost procedure. We recall the following lemma (see Rubin (2010)):

Lemma 5.1. If $\Psi$ is convex and differentiable at 0 with $\Psi'(0)<0$, then $\Psi$ is calibrated. In other words, for any sequence of functions $(f^m)_{m\ge 0}\in\mathcal{F}^{\mathbb{N}}$, defining $g^m$ by $g^m(x)=\arg\max_{j\in\mathcal{Y}}f_j^m(x)$ (all $m\ge 0$ and $x\in\mathcal{X}$), if $\lim_{m\to\infty}R_\Psi(f^m)=R_\Psi^*\equiv\inf_{f\in\mathcal{F}}R_\Psi(f)$ then $\lim_{m\to\infty}R(g^m)=R^*$.

Moreover, if $\Psi$ is strictly convex and if $p_j^*(X)<1$ a.s. for all $j\in\{1,\dots,K\}$, then:

$$R_\Psi(f)=R_\Psi^*\ \text{ if and only if }\ f(X)=f^*(X)\ \text{a.s.}$$

Lemma 5.1 links the Ψ-risk $R_\Psi$ to the risk $R$. It shows that constructing a classification rule that is optimal for the Ψ-risk $R_\Psi$ guarantees a classification rule that is optimal for the risk $R$.

In the sequel, we write $\mathcal{C}_M=\{\alpha\in[0,1]^M,\ \sum_{m=1}^{M}\alpha_m=1\}$ for $M\ge 1$ an integer.

5.2.2 Formulation of the problem of interest

In multi-class classification, and for the loss function $L((X,Y),g)=\mathbf{1}\{g(X)\ne Y\}$ for $g\in\mathcal{G}$, the objective of super-learning can be formulated as follows:

– for $M\ge 2$ an integer, and given estimators $(\hat p^1,\dots,\hat p^M)$ of $p^*$ defining, for $\alpha\in\mathcal{C}_M$, the classification rule

$$g_\alpha(X)=\arg\max_{j\in\mathcal{Y}}\sum_{m=1}^{M}\alpha_m\hat p_j^m(X);$$

– determine $g_{\alpha^*}$ with

$$\alpha^*=\arg\min_{\alpha\in\mathcal{C}_M}R(g_\alpha).$$

The objective of super-learning is thus formulated as an optimization problem, which is not convex. It is therefore necessary to consider an analogous but convex formulation of this problem. Lemma 5.1 suggests considering the following problem of interest:

– for $M\ge 2$ an integer, and given estimators $(\hat f^1,\dots,\hat f^M)$ of $f^*$;
– determine $\hat f_{\alpha^*}$ with

$$\alpha^*=\arg\min_{\alpha\in\mathcal{C}_M}R_\Psi\Big(\sum_{m=1}^{M}\alpha_m\hat f^m\Big),$$

where $\Psi$ is a real convex function, differentiable at 0 and such that $\Psi'(0)<0$.

The super-learning procedure relies on the principle of V-fold cross-validation: the unknown risk $R_\Psi$ is estimated by its cross-validated estimator. In the sequel we consider Ψ-loss functions of exponential type: we restrict ourselves to the case where $\Psi$ is an element of $\boldsymbol{\Psi}=\{x\mapsto\exp(-\beta x),\ \beta>0\}$.
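For the exponential family above, the Ψ-loss of Section 5.2.1 reduces to a sum of $\exp(\beta f_j(X))$ over the wrong classes $j\ne Y$. The short sketch below makes this explicit; the function names and the example score vector are hypothetical, and classes are numbered from 1 as in the text.

```python
import math


def psi_loss(y, f, beta=1.0):
    """Psi-loss for Psi(u) = exp(-beta * u): sum over j != y of Psi(-f_j) = exp(beta * f_j)."""
    return sum(math.exp(beta * fj) for j, fj in enumerate(f, start=1) if j != y)


def empirical_psi_risk(labels, scores, beta=1.0):
    """Average Psi-loss over a sample of (label, score vector) pairs."""
    return sum(psi_loss(y, f, beta) for y, f in zip(labels, scores)) / len(labels)


f = [2.0, -1.0, -1.0]          # scores summing to zero, favouring class 1
print(psi_loss(1, f))          # small loss when the true class is 1
print(psi_loss(2, f))          # much larger loss when the true class is 2
```

A larger $\beta$ penalizes more heavily a large score given to a wrong class, which is the role this parameter plays in the numerical experiments.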

5.2.3 Multi-class super-learning: principle

Let $\Psi\in\boldsymbol{\Psi}$. We consider a regular partition $(B_v)_{1\le v\le V}$ of $\{1,\dots,n\}$, with $V\ge 2$ an integer. For all $v\in\{1,\dots,V\}$, we write $D_n^{(v)}=(X_i,Y_i)_{i\in B_v}$ and $D_n^{(-v)}=(X_i,Y_i)_{i\notin B_v}$. We recall that $P_n^{(v)}$ ($P_n^{(-v)}$, respectively) denotes the empirical distribution of $D_n^{(v)}$ (of $D_n^{(-v)}$, respectively). For $M\ge 2$, we consider a set $\mathcal{H}=\{h_1,\dots,h_M\}$ of estimation procedures. By estimation procedure we mean a function of the empirical distribution that returns an element of $\mathcal{F}$. We define the cross-validated empirical risk $\hat R_\Psi$ and its oracle counterpart $\tilde R_\Psi$, for all $h\in\mathcal{H}$, by

$$\hat R_\Psi(h)=\frac{1}{V}\sum_{v=1}^{V}R_\Psi^{(P_n^{(v)})}\big(h(P_n^{(-v)})\big),$$

$$\tilde R_\Psi(h)=\frac{1}{V}\sum_{v=1}^{V}R_\Psi\big(h(P_n^{(-v)})\big),$$

where, for all $v=1,\dots,V$, conditionally on $D_n^{(-v)}$, $R_\Psi^{(P_n^{(v)})}(h(P_n^{(-v)}))$ is the empirical estimator of $R_\Psi(h(P_n^{(-v)}))$ characterized by

$$R_\Psi^{(P_n^{(v)})}\big(h(P_n^{(-v)})\big)=\frac{1}{\mathrm{Card}(B_v)}\sum_{i\in B_v}L_\Psi\big((X_i,Y_i),h(P_n^{(-v)})\big).$$

The super-learning estimator is then defined as

$$h_{\hat\alpha}(P_n)=\sum_{m=1}^{M}\hat\alpha_m h_m(P_n)\qquad(5.1)$$

with:

$$\hat\alpha=\arg\min_{\alpha\in\mathcal{C}_M}\hat R_\Psi\Big(\sum_{m=1}^{M}\alpha_m h_m\Big).$$

In the same way, we define its oracle counterpart $h_{\tilde\alpha}(P_n)=\sum_{m=1}^{M}\tilde\alpha_m h_m(P_n)$ with:

$$\tilde\alpha=\arg\min_{\alpha\in\mathcal{C}_M}\tilde R_\Psi\Big(\sum_{m=1}^{M}\alpha_m h_m\Big).$$

van der Laan et al. (2007) prove an optimality result for the super-learning estimator in the setting of the estimation of $E[Y\,|\,X]$ with the quadratic loss. In Theorem 5.1 we give a result of the same type for the super-learning estimator defined by (5.1). This result shows that the super-learning estimator is asymptotically as good as its oracle counterpart.

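Theorem 5.1. If there exists $C_0>0$ such that, for all $h\in\mathcal{H}$, $h(P_n)\in\mathcal{F}$ and, for all $x\in\mathcal{X}$, $\max_{j\in\{1,\dots,K\}}|h_j(P_n)(x)|\le C_0$, then the following inequality holds:

$$E_{P_0}\big[\tilde R_\Psi(h_{\hat\alpha})-R_\Psi^*\big]\le E_{P_0}\big[\tilde R_\Psi(h_{\tilde\alpha})-R_\Psi^*\big]+O\Big(\sqrt{\frac{\log(n)}{\lfloor n/V\rfloor}}\Big).$$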

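Proof. We write $\mathcal{C}_n=\{(\frac{i_1}{n},\dots,\frac{i_M}{n}),\ (i_1,\dots,i_M)\in\{0,\dots,n\}^M\}\cap\mathcal{C}_M$. We also write, for all $\alpha\in\mathcal{C}_M$, $h_\alpha=\sum_{m=1}^{M}\alpha_m h_m$. We have:

$$\begin{aligned}
\tilde R_\Psi(h_{\hat\alpha})-\tilde R_\Psi(h_{\tilde\alpha})&=\big(\tilde R_\Psi(h_{\hat\alpha})-\hat R_\Psi(h_{\hat\alpha})\big)+\big(\hat R_\Psi(h_{\hat\alpha})-\tilde R_\Psi(h_{\tilde\alpha})\big)\\
&\le\big(\tilde R_\Psi(h_{\hat\alpha})-\hat R_\Psi(h_{\hat\alpha})\big)+\big(\hat R_\Psi(h_{\tilde\alpha})-\tilde R_\Psi(h_{\tilde\alpha})\big)\\
&\le 2\sup_{\alpha\in\mathcal{C}_M}\big|\hat R_\Psi(h_\alpha)-\tilde R_\Psi(h_\alpha)\big|.
\end{aligned}$$

For all $\alpha\in\mathcal{C}_M$, there exists $\alpha^n\in\mathcal{C}_n$ such that $\|\alpha-\alpha^n\|_\infty\le\frac{1}{n}$ and

$$\big|\hat R_\Psi(h_\alpha)-\tilde R_\Psi(h_\alpha)\big|\le\big|\hat R_\Psi(h_\alpha)-\hat R_\Psi(h_{\alpha^n})\big|+\big|\tilde R_\Psi(h_\alpha)-\tilde R_\Psi(h_{\alpha^n})\big|+\big|\hat R_\Psi(h_{\alpha^n})-\tilde R_\Psi(h_{\alpha^n})\big|.$$

Moreover, since there exists $C_0>0$ such that, for all $h\in\mathcal{H}$ and all $x\in\mathcal{X}$, $\max_{j\in\{1,\dots,K\}}|h_j(P_n)(x)|\le C_0$, we deduce that there exists a constant $C_1>0$ such that:

$$\big|\hat R_\Psi(h_\alpha)-\hat R_\Psi(h_{\alpha^n})\big|\le\frac{C_1}{n},$$

and the same bound holds with $\tilde R_\Psi$ in place of $\hat R_\Psi$. Finally, we obtain that:

$$\big|\hat R_\Psi(h_\alpha)-\tilde R_\Psi(h_\alpha)\big|\le 2\sup_{\alpha\in\mathcal{C}_n}\big|\hat R_\Psi(h_\alpha)-\tilde R_\Psi(h_\alpha)\big|+\frac{2C_1}{n}.$$

The arguments developed in Dudoit and van der Laan (2005) and discussed in Chapter 4 for the control of $E\big[\sup_{\alpha\in\mathcal{C}_n}\big|\hat R_\Psi(h_\alpha)-\tilde R_\Psi(h_\alpha)\big|\big]$, with $\mathrm{card}(\mathcal{C}_n)\le(n+1)^{M-1}$, then allow us to conclude.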
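The cardinality bound used in the last step can be checked by direct enumeration for small values. The sketch below lists the grid points of the simplex with coordinates in $\{0,1/n,\dots,1\}$; the function name `simplex_grid` and the values `n = 10`, `m = 3` are illustrative.

```python
from itertools import product
from math import comb


def simplex_grid(n, m):
    """All points (i_1/n, ..., i_m/n) with integer i_k in {0, ..., n} summing to n."""
    return [tuple(i / n for i in idx)
            for idx in product(range(n + 1), repeat=m)
            if sum(idx) == n]


n, m = 10, 3
grid = simplex_grid(n, m)
print(len(grid), comb(n + m - 1, m - 1), (n + 1) ** (m - 1))
```

The count equals the number of compositions of $n$ into $m$ non-negative parts, $\binom{n+m-1}{m-1}$, which is at most $(n+1)^{m-1}$ because the last coordinate is determined by the first $m-1$.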

5.2.4 Multi-class super-learning: implementation

In practice, we do not have direct access to estimation procedures for $f^*$: classification algorithms are estimation procedures for $p^*$. For $\hat p$ an estimator of $p^*$ and $X$ an observation independent of $D_n$, we must therefore specify how the estimator $\hat f$ of $f^*$ is computed. If $\hat p_j(X)<1$ for all $j=1,\dots,K$, we can define $\hat f(X)$ by:

$$\hat f_j(X)=\frac{1}{K}\sum_{l=1}^{K}\log(1-\hat p_l(X))-\log(1-\hat p_j(X)).$$

If there exists $j\le K$ such that $\hat p_j(X)=1$, then $\hat f(X)$ is not defined. In practice this situation is not rare. To overcome this problem, we propose to proceed by thresholding:

– we fix $0<\varepsilon<\frac{1}{2}$ and define $a\in\mathbb{R}^K$ by $a_1=1-\varepsilon$, $a_2=\varepsilon$, $a_j=0$ for $j=3,\dots,K$;
– we compute $t=\frac{1}{K}\sum_{l=1}^{K}\log(1-a_l)-\log(1-a_1)>0$;
– if there exists $j_0$ such that $\hat p_{j_0}(X)=1$ or $\frac{1}{K}\sum_{l=1}^{K}\log(1-\hat p_l(X))-\log(1-\hat p_{j_0}(X))>t$, then $\hat f(X)=(\hat f_1(X),\dots,\hat f_K(X))$ is defined by $\hat f_{j_0}(X)=K-1$ and $\hat f_j(X)=-1$ for $j\ne j_0$;
– otherwise, for all $j=1,\dots,K$,

$$\hat f_j(X)=\frac{K-1}{t}\Big(\frac{1}{K}\sum_{l=1}^{K}\log(1-\hat p_l(X))-\log(1-\hat p_j(X))\Big).$$

This way of proceeding ensures that the proposed estimator $\hat f$ is always well defined, and that it is bounded (with $\min_{j\le K}\hat f_j(X)\ge -1$ and $\max_{j\le K}\hat f_j(X)\le K-1$).
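The thresholding rule can be sketched as follows. The function name `threshold_scores` is hypothetical, classes are indexed from 0 in the code, and the rescaling in the last case is read as the factor $(K-1)/t$ applied to the whole centred score, which is an interpretation of the formula that keeps the scores summing to zero.

```python
import math


def threshold_scores(p, eps=1e-2):
    """Map class probabilities p (length K) to bounded scores summing to zero."""
    k = len(p)
    a = [1.0 - eps, eps] + [0.0] * (k - 2)
    # Threshold t: the raw score of the top class for the reference vector a.
    t = sum(math.log(1.0 - x) for x in a) / k - math.log(1.0 - a[0])

    def saturated(j0):
        return [float(k - 1) if j == j0 else -1.0 for j in range(k)]

    for j0, pj in enumerate(p):
        if pj >= 1.0:                      # log(1 - p_j) undefined: saturate
            return saturated(j0)
    mean_log = sum(math.log(1.0 - x) for x in p) / k
    raw = [mean_log - math.log(1.0 - x) for x in p]
    j0 = max(range(k), key=raw.__getitem__)
    if raw[j0] > t:                        # score above the threshold: saturate
        return saturated(j0)
    return [(k - 1) / t * r for r in raw]  # rescaled centred scores


print(threshold_scores([1.0, 0.0, 0.0]))          # saturated case
print(threshold_scores([0.5, 0.3, 0.2]))          # regular case
print(threshold_scores([0.999, 0.0005, 0.0005]))  # saturated by the threshold
```

In the regular case the scores keep the ordering of the probabilities and their largest component stays below $K-1$.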

Let $\Psi\in\boldsymbol{\Psi}$ and let $\mathcal{H}$ be the set of estimation procedures. The super-learning procedure can be described as follows:

1) let $V\ge 2$ be an integer and $(B_v)_{1\le v\le V}$ a regular partition of the set $\{1,\dots,n\}$;
2) for all $v\in\{1,\dots,V\}$:
  i) we determine $h_m^{(-v)}=h_m(P_n^{(-v)})$ for all $m\in\{1,\dots,M\}$;
  ii) then, for all $i\in B_v$, we compute $H^{(m,i)}=h_m^{(-v)}(X_i)\in\mathbb{R}^K$;
3) we then consider the matrix $H\in\mathcal{M}_{n,KM}$ such that, for $m\in\{1,\dots,M\}$ and $j\in\{1,\dots,K\}$:

$$H_{i,(K(m-1)+j)}=H_j^{(m,i)};$$

4) for all $\alpha\in\mathcal{C}_M$, we consider the matrix $\Phi(\alpha)\in\mathcal{M}_{n,K}$ defined for $(i,j)\in\{1,\dots,n\}\times\{1,\dots,K\}$ by

$$\Phi_{i,j}(\alpha)=\Psi\Big(-\sum_{m=1}^{M}\alpha_m H_j^{(m,i)}\Big),$$

and compute $\hat\alpha$ defined by:

$$\hat\alpha=\arg\min_{\alpha\in\mathcal{C}_M}\frac{1}{n}\sum_{i=1}^{n}\varphi_i(\alpha),$$

where $(\varphi_1(\alpha),\dots,\varphi_n(\alpha))=\mathrm{diag}(\Phi(\alpha)Z^T)$, with $Z=(\mathbf{1}\{Y_i\ne y_j\})_{1\le i\le n,\,1\le j\le K}$. Note that:

$$\frac{1}{n}\sum_{i=1}^{n}\varphi_i(\alpha)=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{K}\mathbf{1}\{Y_i\ne y_j\}\,\Psi\Big(-\sum_{m=1}^{M}\alpha_m H_j^{(m,i)}\Big);$$

5) finally, we consider the following classification rule:

$$\hat g=\arg\max_{j\in\mathcal{Y}}\sum_{m=1}^{M}\hat\alpha_m\big(h_m(P_n)\big)_j.$$

Note that the empirical Ψ-risk is obtained through a matrix computation. This remark matters for the implementation of the algorithm in R, which is very efficient at matrix computations.
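Step 4 can be illustrated with a small sketch written in Python rather than R. It assumes that the out-of-fold scores $H^{(m,i)}$ have already been computed and stored in a nested list `scores[m][i][j]`, with labels coded from 0. The minimization over $\mathcal{C}_M$ is replaced here by an exhaustive search over a grid of the simplex, in the spirit of the set $\mathcal{C}_n$ used in the proof of Theorem 5.1; this swap is a simplification for illustration, not the optimizer used in the text, and all names are hypothetical.

```python
import math
from itertools import product


def cv_psi_risk(alpha, scores, labels, beta=1.0):
    """Cross-validated Psi-risk of the combination with weights alpha.

    scores[m][i][j]: out-of-fold score of learner m for observation i and class j.
    labels[i]: class of observation i, coded in {0, ..., K-1}.
    """
    n_classes = len(scores[0][0])
    total = 0.0
    for i, y in enumerate(labels):
        for j in range(n_classes):
            if j != y:
                combined = sum(a * scores[m][i][j] for m, a in enumerate(alpha))
                total += math.exp(beta * combined)  # Psi(-u) with Psi(u) = exp(-beta*u)
    return total / len(labels)


def grid_search_weights(scores, labels, beta=1.0, resolution=10):
    """Exhaustive search of the weights on a grid of the simplex."""
    best_alpha, best_risk = None, float("inf")
    for idx in product(range(resolution + 1), repeat=len(scores)):
        if sum(idx) != resolution:
            continue
        alpha = tuple(i / resolution for i in idx)
        risk = cv_psi_risk(alpha, scores, labels, beta)
        if risk < best_risk:
            best_alpha, best_risk = alpha, risk
    return best_alpha, best_risk


# Toy example: learner 0 always scores the right class, learner 1 always favours class 0.
labels = [0, 1, 2]
good = [[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]]
poor = [[2.0, -1.0, -1.0]] * 3
alpha_hat, risk_hat = grid_search_weights([good, poor], labels)
print(alpha_hat, risk_hat)
```

In this toy example all the weight goes to the first learner. Because the objective is convex in $\alpha$ for the exponential losses considered here, a convex solver would be the natural choice in a real implementation.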

5.2.5 Application to some classical data sets

We illustrate this multi-class version of super-learning on a few real data sets available from the UCI Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/). The data sets are presented in Table 5.3.

data set    sample size    covariates    classes
Iris        150            4             3
Wine        178            13            3
Glass       214            9             6

Table 5.3: Presentation of the data sets. For each of the three data sets we give the sample size, the number of covariates and the number of classes.

To evaluate the performance of super-learning on each of the three data sets, we use B-fold cross-validation with $B=5$. We repeat the process 10 times and compute the classification error rate of super-learning as well as that of each classification procedure used to build the library. The results are presented in Table 5.5. For the implementation, the library was built from several R packages; it is described in Table 5.4. We evaluate the performance of super-learning for two different Ψ-loss functions, $\Psi_1:x\mapsto\exp(-x)$ and $\Psi_2:x\mapsto\exp(-10x)$. The resulting super-learning procedures are called SL1 for $\Psi_1$ and SL2 for $\Psi_2$. We use $V=10$ for the cross-validation parameter for the Iris and Wine data sets. For the Glass data set, we use $V=2$, because the low representation of one of the classes (9/214) does not allow a larger value. The parameter $\varepsilon$ is set to $10^{-2}$.

            type                        R package       parameters
RF1         random forests              randomForest    default
RF2         random forests              randomForest    ntree = 1000
rpart1      classification trees        rpart           maxdepth = 2
rpart2      classification trees        rpart           maxdepth = 6
svm         support vector machine      e1071           default
polyclass   polychotomous regression    polspline       default

Table 5.4: Description of the library. For each classification algorithm in the library, we give the type of algorithm, the R package and the parameters used.

         RF1     RF2     svm     rpart1   rpart2   polyclass   SL1     SL2
Iris     0.045   0.048   0.041   0.065    0.060    0.059       0.057   0.040
Wine     0.019   0.021   0.020   0.143    0.088    0.074       0.069   0.014
Glass    0.142   0.143   0.320   0.420    0.227    0.406       0.146   0.273

Table 5.5: Results. For each data set, we evaluate the performance of each classification procedure by repeated cross-validation.

We draw two remarks from the results presented in Table 5.5. First, whatever the data set, the classification rule obtained by super-learning performs as well as (or even better than, for the Iris and Wine data sets) the best-performing procedure in the library. The second remark is that the choice of the parameter $\beta$ for the loss function $\Psi$ is sensitive to the number of classes. Indeed, for the Iris and Wine data sets, which have 3 classes, it is the super-learning built from $\Psi_2$ ($\beta=10$) that performs well, whereas the choice of $\Psi_1$ ($\beta=1$) leads to an unsatisfactory classification rule. Conversely, for the Glass data set, which has 6 classes, it is the choice of $\Psi_1$ rather than $\Psi_2$ that leads to a good classification rule. This comes from the fact that the larger the number of classes, the larger the error made by a classification algorithm. Consequently, choosing a larger $\beta$ for data with a moderate number of classes gives more weight to the error made by a classification algorithm and thus discriminates better between the algorithms. Conversely, penalizing an error too heavily through too large a value of $\beta$ does not make it possible to distinguish the algorithms from one another either.

Chapitre 6

Conclusion

113

6.1. Bilan et perspectives pour la classification en maintien postural 115

6.1 Bilan et perspectives pour la classification en main-tien postural

Dans cette thèse, nous avons mis au point une procédure de classification de donnéesde maintien postural s’appuyant sur un traitement préalable des trajectoires associées auxdifférents protocoles par filtrage. Ceci a pour but d’exploiter certaines spécificités des tra-jectoires, afin d’extraire des informations pertinentes pour la classification tout en prenanten compte les difficultés liées à la classification de données fonctionnelles. Cette procé-dure est facilement généralisable au sens où la définition de nouvelles mesures résuméespeut toujours être envisagée sans modifier la nature de la procédure. Les performancesobtenues par les différentes versions de cette procédure de classification mises en œuvredans cette thèse sont satisfaisantes. La question de leur amélioration doit néanmoins êtreposée.

We believe that a promising avenue for progress is to build summary measures based on modeling the trajectories $(B_i)_{i \leq N}$ by a two-dimensional diffusion process (an Ornstein-Uhlenbeck process, for example). This is the natural extension of the approach undertaken in Chapter 3. Of course, by not working directly with the trajectories $(L_i, R_i)_{i \leq N}$, we do not exploit all the available information, and another approach could precisely be to work with the full set of trajectories. In this respect, the work of Bai et al. (2012) on movelets suggests very interesting lines of thought.
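To make the idea of diffusion-based summary measures concrete, here is a minimal sketch; it is an illustration, not the procedure studied in this thesis. It assumes, for simplicity, that the two coordinates of a trajectory follow independent Ornstein-Uhlenbeck processes $dX_t = -\theta (X_t - \mu)\,dt + \sigma\,dW_t$ observed at a regular time step `dt`. The function names `simulate_ou` and `ou_summary` and all numerical values are hypothetical. The per-coordinate estimates of $(\theta, \mu, \sigma)$ play the role of low-dimensional summary measures.

```python
import numpy as np


def simulate_ou(theta, mu, sigma, x0, dt, n_steps, rng):
    """Simulate independent OU coordinates using the exact AR(1) transition."""
    theta, mu, sigma = (np.asarray(v, dtype=float) for v in (theta, mu, sigma))
    a = np.exp(-theta * dt)                          # autoregression coefficient
    sd = sigma * np.sqrt((1.0 - a**2) / (2.0 * theta))  # transition std deviation
    x = np.empty((n_steps + 1, mu.size))
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = mu + a * (x[k] - mu) + sd * rng.standard_normal(mu.size)
    return x


def ou_summary(x, dt):
    """Least-squares AR(1) fit per coordinate, mapped back to (theta, mu, sigma)."""
    prev, nxt = x[:-1], x[1:]
    pm, nm = prev.mean(axis=0), nxt.mean(axis=0)
    a = ((prev - pm) * (nxt - nm)).sum(axis=0) / ((prev - pm) ** 2).sum(axis=0)
    b = nm - a * pm                                  # intercept = mu * (1 - a)
    resid = nxt - (a * prev + b)
    theta = -np.log(a) / dt                          # requires 0 < a < 1
    mu = b / (1.0 - a)
    sigma = np.sqrt(resid.var(axis=0) * 2.0 * theta / (1.0 - a**2))
    return theta, mu, sigma


# Hypothetical parameter values, chosen only for the demonstration.
rng = np.random.default_rng(0)
true_theta, true_mu, true_sigma = [1.0, 0.5], [0.0, 2.0], [0.3, 0.2]
path = simulate_ou(true_theta, true_mu, true_sigma, x0=true_mu,
                   dt=0.01, n_steps=20000, rng=rng)
theta_hat, mu_hat, sigma_hat = ou_summary(path, dt=0.01)
```

The six estimated numbers could then be fed to any classifier in place of the raw trajectory. A genuinely bivariate model would also need to estimate the coupling between the two coordinates, which this sketch deliberately ignores.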

The approach taken in this thesis is that of classification (binary and multiclass). We discussed in the introduction the value of an alternative approach based on unsupervised classification. We believe that the trajectory processing and the classification procedure developed here will serve as a springboard for undertaking this complementary approach. The works of Abraham et al. (2003) and De Gregorio and Iacus (2010) on the unsupervised classification of trajectories will provide us with lines of thought.

6.2 Summary and perspectives for the study of the TSP classifier

In our study of the TSP classifier, we obtained classical consistency results comparing the different empirical versions of the TSP classifier with their oracle counterparts. We draw several perspectives from this study. The first would be to investigate what kinds of assumptions on the distribution can guarantee that the TSP classifier is a good classification rule. Indeed, without any assumption on the distribution of the data, there is of course no guarantee that the classification rule induced by the TSP classifier is optimal. The second would be to extend the properties obtained to versions of the TSP classifier based on several pairs, and to build a procedure for selecting a number of pairs that guarantees better performance for the resulting classifier. The last perspective would be to exploit the results obtained in this thesis to study binary classification trees built from criteria such as score maximization. This approach is in fact considered by Czajkowski and Kretowski (2011).
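As a reminder of what score maximization means for a single pair, the following minimal sketch implements an empirical top scoring pair rule for two classes. It assumes the standard score of Geman et al. (2004), namely the absolute difference between the two class-conditional frequencies of the event $X_i < X_j$; the function names `fit_tsp` and `predict_tsp` are hypothetical, and this is not the implementation used in this thesis.

```python
import numpy as np


def fit_tsp(X, y):
    """Return (i, j, label_if_less) for the pair with the largest empirical score.

    X has shape (n_samples, n_features); y contains labels 0 and 1.
    Memory grows as n_samples * n_features**2, so this suits small examples only.
    """
    X, y = np.asarray(X), np.asarray(y)
    # less[n, i, j] is True when feature i is smaller than feature j in sample n.
    less = X[:, :, None] < X[:, None, :]
    p1 = less[y == 1].mean(axis=0)   # frequency of X_i < X_j within class 1
    p0 = less[y == 0].mean(axis=0)   # frequency of X_i < X_j within class 0
    score = np.abs(p1 - p0)
    i, j = np.unravel_index(np.argmax(score), score.shape)
    # Predict the class in which the event X_i < X_j is more frequent.
    label_if_less = 1 if p1[i, j] > p0[i, j] else 0
    return int(i), int(j), label_if_less


def predict_tsp(model, X):
    """Classify each row of X by comparing the two selected features."""
    i, j, label_if_less = model
    X = np.asarray(X)
    return np.where(X[:, i] < X[:, j], label_if_less, 1 - label_if_less)
```

A tree-based extension of the kind mentioned above would apply such a pairwise comparison at each node instead of a threshold on a single feature.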

Bibliography

Abraham, C., Biau, G. and Cadre, B. (2006). On the kernel rule for function classification. Annals of the Institute of Statistical Mathematics 58, 619–633.

Abraham, C., Cornillon, P., Matzner-Lober, E. and Molinari, N. (2003). Unsupervised curve clustering using B-splines. Scandinavian Journal of Statistics 30, 581–595.

Arlot, S. (2007). Rééchantillonnage et sélection de modèles. PhD thesis. Université Paris-Sud, Orsay.

Asai, M., Nakagawa, H. and Ohashi, N. (1990). Difference between visual feedback and visual suppression upon the stabilization of body sway in normal subjects. Journal for Oto-Rhino-Laryngology and its Related Specialties 52, 226–231.

Astrua, M., Fontani, G., Kratter, G., Riva, D. and Soardo, G. (1997). Quantitative assessment of disequilibrium management in skiers, related to the qualification level of the athletes. In Proceedings of the IV IOC World Congress on Sport Sciences.

Audibert, J.-Y. and Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers. The Annals of Statistics 35, 608–633.

Bai, J. (1994). Least squares estimation of a shift in linear processes. J. Time Ser. Anal. 15, 453–472.

Bai, J., Goldsmith, J., Caffo, B., Glass, T. and Crainiceanu, C. (2012). Movelets: a dictionary of movement. Electronic Journal of Statistics 6, 559–578.

Bardet, J. and Bertrand, P. (2007). Identification of the multiscale fractional Brownian motion with biomechanical applications. Journal of Time Series Analysis 28, 1–57.

Bardet, J. M., Kengne, W. and Wintenberger, O. (2012). Multiple breaks detection in general causal time series using penalized quasi-likelihood. Electronic Journal of Statistics 6, 435–477.

Basseville, M. and Nikiforov, I. V. (1993). Detection of abrupt changes: theory and application. Prentice Hall Information and System Sciences Series. Prentice Hall Inc., Englewood Cliffs, NJ. ISBN 0-13-126780-9.


Bertrand, P., Bardet, J., Dabonneville, M., Mouzat, A. and Vaslin, P. (2001). Automatic determination of the different control mechanisms in upright position by a wavelet method. Engineering in Medicine and Biology Society, 2001. Proceedings of the 23rd Annual International Conference of the IEEE 2, 1163–1166. ISSN 1094-687X.

Biau, G., Bunea, F. and Wegkamp, M. (2005). Functional classification in Hilbert spaces. IEEE Trans. Inform. Theory 51, 2163–2172.

Biau, G., Devroye, L. and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research 9, 2015–2033.

Bonan, I., Yelnik, A., Laffont, I., Vitte, E. and Freyss, G. (1996). Selection of sensory information in postural control of hemiplegics after unique stroke. Annales de Réadaptation et de Médecine Physique 39, Issue 3, 157–163.

Bouisset, S. and Maton, B. (1995). Muscles, posture et mouvement : bases et applications de la méthode électromyographique. Hermann, Paris.

Bousquet, O., Boucheron, S. and Lugosi, G. (2004). Introduction to statistical learning theory. Advanced Lectures in Machine Learning, Springer, pp. 169–207.

Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and regression trees. Wadsworth Advanced Books and Software, Belmont, CA.

Cardot, H. (1999). Functional linear model. Statistics and Probability Letters 45, 11–22.

Cardot, H. (2000). Nonparametric estimation of smoothed principal components analysis of sampled noisy functions. Journal of Nonparametric Statistics 12, 503–538.

Chambaz, A., Bonan, I. and Vidal, P.-P. (2009). Deux modèles de Markov caché pour processus multiples et leur contribution à l’élaboration d’une notion de style postural. J. SFdS 150, 73–100. ISSN 2102-6238.

Chambaz, A. and Denis, C. (2012). Classification in postural style. Annals of Applied Statistics 6, 977–993.

Chiari, L., Cappello, A., Lenzi, D. and Della Croce, U. (2000). An improved technique for the extraction of stochastic parameters from stabilograms. Gait Posture 12, 225–234.

Cristianini, N. and Shawe-Taylor, J. (2000). An introduction to Support Vector Machines. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.


Collins, J. and De Luca, C. (1995). Open-loop and closed-loop control of posture: A random-walk analysis of center-of-pressure trajectories. Experimental Brain Research 95, 308–318.

Cuenod, C., Favetto, B., Genon-Catalot, V., Rozenholc, Y. and Samson, A. (2011). Parameter estimation and change-point detection from Dynamic Contrast Enhanced MRI data using stochastic differential equations. Math. Biosci. 233, 68–76.

Czajkowski, M. and Kretowski, M. (2011). Top scoring pair decision tree for gene expression data analysis. Software Tools and Algorithms for Biological Systems, Springer 3, 27–36.

De Gregorio, A. and Iacus, S. (2010). Clustering of discretely observed diffusion processes. Computational Statistics and Data Analysis 54, 598–606.

Denis, C. (2011). Classification in postural maintenance based on stochastic process modeling. URL http://hal.archives-ouvertes.fr/hal-00653316/en/. Preprint MAP5_2011-34, HAL reference hal-00653316.

Donoho, D. (2000). High-dimensional data analysis: the curses and blessings of dimensionality. Math Challenges of the 21st Century.

Dudoit, S. and van der Laan, M. J. (2005). Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat. Methodol. 2, 131–154. ISSN 1572-3127.

Díaz-Uriarte, R. and Alvarez de Andrés, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7.

Ferraty, F. and Vieu, P. (2003). Curves discrimination: a nonparametric functional approach. Computational Statistics and Data Analysis 44, 161–173.

Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools. Technometrics 35, 109–148.

Frank, T., Daffertshofer, A. and Beek, P. (2000). Multivariate Ornstein-Uhlenbeck processes with mean-field dependent coefficients: Application to postural sway. Physical Review E 63, 011905.

Fransson, P., Hafström, A., Karlberg, M., Magnusson, M., Tjäder, A. and Johansson, R. (2003). Postural control adaptation during galvanic vestibular and vibratory proprioceptive stimulation. IEEE Transactions on Biomedical Engineering 50, 1310–1319.

Geman, D., d’Avignon, C., Naiman, D. Q. and Winslow, R. L. (2004). Classifying gene expression profiles from pairwise mRNA comparisons. Statistical Applications in Genetics and Molecular Biology 3.


Genthon, N. (2006). Déficience unilatérale et adaptation de la fonction posturale. Rôle de chacun des appuis dans le maintien de la station debout. PhD thesis. Université de Savoie, Chambéry.

Genuer, R. (2010). Forêts aléatoires : aspects théoriques, sélection de variables et applications. PhD thesis. Université Paris-Sud, Orsay.

Gey, S. (2002). Bornes de risque, détection de ruptures, boosting : trois thèmes statistiques autour de CART en régression. PhD thesis. Université Paris-Sud, Orsay.

Hall, P., Poskitt, D. and Presnell, B. (2001). A functional data-analytic approach to signal discrimination. Technometrics 43.

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The elements of statistical learning. Springer Series in Statistics. Springer, New York.

Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models, vol. 43. Chapman and Hall Ltd., London.

Iacus, S. and Yoshida, N. (2012). Estimation for the change point of the volatility in stochastic differential equation. Stochastic Processes and Their Applications 122, 1068–1092.

Inglis, J., Lauk, M. and Pavlik, A. (1999). The effects of stochastic galvanic vestibular stimulation on human postural sway. Experimental Brain Research 124, 273–280.

Jacquot, J., Pélissie, J. and Strubel, D. (1999). La chute de la personne âgée. Masson, Paris.

James, G. and Hastie, T. (2001). Functional linear discriminant analysis for irregularly sampled curves. J. Roy. Statist. Soc. Ser. B 63, 533–550.

Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scandinavian Journal of Statistics. Theory and Applications 24, 211–229.

Kloeden, P. and Platen, E. (1992). Numerical Solution of Stochastic Differential Equations, vol. 23 of Stochastic modelling and applied probability. Springer.

Lavielle, M. (1999). Detection of multiple changes in a sequence of dependent variables. Stochastic Processes and Their Applications 83, 79–102.

Lavielle, M. (2005). Using penalized contrasts for the change-point problem. Signal Processing 85, n. 8, 1501–1510.

Lavielle, M. and Ludeña, C. (2000). The multiple change-points problem for the spectral distribution. Bernoulli 6, 845–869.


Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association 101, 578–590.

Loader, C. (1999). Local Regression and Likelihood. Statistics and Computing. Springer, New York.

Massion, J. (1994). Postural control system. Current Opinion in Neurobiology 4, 877–887.

Mergner, T., Schweigart, G. and Blümle, A. (2005). Human postural responses to motion of real and virtual environments under different support base conditions. Experimental Brain Research 167, 535–556.

Newell, K., Slobounov, S., Slobounova and Molenaar, P. (1997). Stochastic processes in postural center-of-pressure profiles. Experimental Brain Research 113, 158–164. DOI: 10.1007/BF02454152.

Paillard, J. (1971). Les déterminants moteurs de l’organisation dans l’espace. Cahiers de psychologie 14, 261–316.

Pinsault, N. (2009). De l’objectivation des évaluations posturales et de la compréhension des mécanismes de contrôle de la posture bipédique à leur application en Médecine Physique et de Réadaptation. PhD thesis. Université Joseph Fourier-Grenoble 1, Grenoble.

R Development Core Team (2010). R: A Language and Environment for Statistical Computing. Vienna, Austria. URL http://www.R-project.org.

Ramsay, J. and Silverman, B. (2005). Functional Data Analysis. Springer, New York.

Ribot, E., Roll, J. and Vedel, J. (1989). Alteration of proprioceptive messages induced by tendon vibration in man: a microneurographic study. Experimental Brain Research 76, 213–222.

Rose, S. and van der Laan, M. J. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Verlag.

Rubin, D. (2010). A calibrated multiclass extension of AdaBoost. Statistical Applications in Genetics and Molecular Biology 10, 54.

Sabatini, A. (2000). A statistical mechanical analysis of postural sway using non-Gaussian FARIMA stochastic models. IEEE Transactions on Biomedical Engineering 47, 1219–1227.

Tan, A., Naiman, D., Xu, L., Winslow, R. and Geman, D. (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21, 3896–3904.


Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58, 267–288.

Tibshirani, R., Hastie, T. and Friedman, J. H. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, Issue 1.

Tinetti, M. (1986). Performance-oriented assessment of mobility problems in elderly patients. The Journal of the American Geriatrics Society 34, 119–126.

van der Laan, M. J., Polley, E. C. and Hubbard, A. E. (2007). Super learner. Stat. Appl. Genet. Mol. Biol. 6, Art. 25, 23 pp. (electronic). ISSN 1544-6115.

van der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. Int. J. Biostat. 2, Art. 11, 40. ISSN 1557-4679.

Yang, Y., Dudoit, S., Luu, P., Peng, V., Ngai, J. and Speed, T. (2001). Normalization for cDNA microarray data. Microarrays: Optical Technologies and Informatics 4266, 141–152.

Zhou, C., Wang, S., Blanzieri, E. and Liang, Y. (2012). An entropy-based improved k-top scoring pairs (TSP) method for classifying human cancers. African Journal of Biotechnology 41, 10438–10445.
