
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

CONTINUOUS EXTRACTION OF TASK CONSTRAINTS IN A ROBOT PROGRAMMING BY DEMONSTRATION FRAMEWORK

THÈSE NO 3814 (2007)

PRÉSENTÉE LE 6 JUILLET 2007

À LA FACULTÉ DES SCIENCES ET TECHNIQUES DE L'INGÉNIEUR

Laboratoire d'algorithmes et systèmes d'apprentissage

SECTION DE MICROTECHNIQUE

POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES

PAR

Sylvain CALINON

ingénieur en microtechnique diplômé EPF
de nationalité suisse et française

acceptée sur proposition du jury:

Prof. H. Bleuler, président du jury
Prof. A. Billard, directrice de thèse
Prof. H. Bourlard, rapporteur
Dr Y. Demiris, rapporteur
Prof. S. Schaal, rapporteur

Lausanne, EPFL
2007


Abstract

Robot Programming by Demonstration (RbD), also referred to as Learning by Imitation, explores user-friendly means of teaching a robot new skills. Recent advances in RbD have identified a number of key issues for ensuring a generic approach to the transfer of skills across various agents and contexts. This thesis focuses on the two generic questions of what-to-imitate and how-to-imitate, which are respectively concerned with the problem of extracting the essential features of a task and with determining a way to reproduce these essential features in different situations. The perspective adopted in this work is that a skill can be described efficiently at a trajectory level, and that the robot may infer the important characteristics of the skill by observing multiple demonstrations of it, assuming that these characteristics are invariant across the demonstrations.

The proposed approach is based on the use of well-established statistical methods in an RbD application, using Hidden Markov Models (HMM) as a first approach and then moving on to the joint use of Gaussian Mixture Models (GMM) and Gaussian Mixture Regression (GMR). Even though these methods have been applied extensively in various fields of research, including human motion analysis and robotics, their use has essentially focused on gesture recognition rather than on gesture reproduction. Moreover, the models were usually trained offline using large datasets. Thus, the use and combination of these machine learning techniques in a Learning by Imitation framework is challenging and has not yet been extensively studied. In this thesis, we show that these methods are well suited for incremental and active teaching scenarios where the user provides only a small number of demonstrations, either by wearing a set of motion sensors attached to his/her body, or by helping the robot refine its skill through kinesthetic teaching, that is, by embodying the robot and putting it through the motion. These techniques are then applied to enable Fujitsu HOAP-2 and HOAP-3 humanoid robots to automatically extract different constraints by observing several manipulation skills, and to reproduce these skills in various situations through Gaussian Mixture Regression (GMR).

The contributions of this thesis are threefold: (1) it contributes to RbD by proposing a generic probabilistic framework to deal with recognition, generalization, reproduction and evaluation issues, and more specifically with the automatic extraction of task constraints and with the determination of a controller satisfying several constraints simultaneously to reproduce the skill in a different context; (2) it contributes to Human-Robot Interaction (HRI) by proposing active teaching methods that put the human teacher "in the loop" of the robot's learning, using an incremental scaffolding process and different modalities to produce the demonstrations; (3) it finally contributes to robotics and HRI through various real-world applications of the framework, showing learning and reproduction of communicative gestures and manipulation skills. The generality of the proposed framework is also demonstrated through its joint use with other learning approaches in collaborative experiments.

Keywords: Robot Programming by Demonstration (RbD), Learning by Imitation, Human-Robot Interaction (HRI), Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Gaussian Mixture Regression (GMR).
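The reproduction step mentioned above, Gaussian Mixture Regression, admits a compact numerical illustration. The sketch below is a minimal, self-contained rendition and not the thesis implementation; the two-component toy model and all names are our own. A joint GMM over a temporal input t and a spatial output x is conditioned on t, and the component predictions are blended by their responsibilities to retrieve a point of the generalized trajectory together with its variance envelope.

```python
import numpy as np

def gauss1d(x, mu, var):
    """Density of a scalar x under a 1D Gaussian with mean mu and variance var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmr(t_query, priors, mus, sigmas):
    """Gaussian Mixture Regression on a toy joint GMM p(t, x):
    condition on the temporal input t_query and return E[x | t], Var[x | t].
    mus[k] = [mu_t, mu_x]; sigmas[k] is the 2x2 joint covariance of component k."""
    K = len(priors)
    # responsibility of each component for the input t_query
    h = np.array([priors[k] * gauss1d(t_query, mus[k][0], sigmas[k][0, 0])
                  for k in range(K)])
    h /= h.sum()
    mean, var = 0.0, 0.0
    for k in range(K):
        s = sigmas[k]
        # conditional mean and variance of component k given t_query
        mk = mus[k][1] + s[1, 0] / s[0, 0] * (t_query - mus[k][0])
        vk = s[1, 1] - s[1, 0] * s[0, 1] / s[0, 0]
        mean += h[k] * mk
        var += h[k] * (vk + mk ** 2)  # law of total variance
    var -= mean ** 2
    return mean, var
```

Sweeping t_query over the demonstration's time range yields the full generalized trajectory; in the thesis this conditioning is done with multivariate temporal and spatial components rather than the scalar toy case shown here.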


Résumé

La programmation de robots par démonstration, ou apprentissage par imitation, explore l'utilisation de principes d'apprentissage conviviaux pour apprendre de nouvelles tâches à un robot humanoïde. L'engouement récent pour ce domaine d'application a permis d'identifier un certain nombre de questions clés pour assurer le transfert d'une tâche de manière générique entre différents agents et en considérant divers contextes. Dans cette thèse nous nous penchons sur la problématique de l'extraction automatique des caractéristiques importantes d'une tâche (ou d'un geste), ainsi que sur la problématique visant à généraliser et à reproduire une tâche apprise dans de nouvelles situations. La perspective adoptée est que le robot peut apprendre les caractéristiques essentielles d'une tâche en observant plusieurs démonstrations de celle-ci et en supposant que les éléments importants ne varient que faiblement au cours des différentes démonstrations.

L'approche proposée est basée sur l'utilisation commune de méthodes statistiques dans un contexte de programmation de robots par démonstration, en se basant dans une première approche sur l'utilisation de chaînes de Markov cachées, puis en s'orientant vers un modèle de régression statistique basé sur des mixtures de gaussiennes. Bien que ces méthodes aient déjà été utilisées dans de nombreux domaines scientifiques incluant l'analyse de geste et la robotique, leur utilisation s'est essentiellement focalisée sur le problème de reconnaissance, ne laissant que peu de place au problème de reproduction de gestes. De plus, avant une utilisation pratique, les modèles concernés ont la plupart du temps été préalablement entraînés à l'aide de volumineuses bases de données. Ainsi, l'utilisation et la fusion de ces différents algorithmes d'apprentissage dans un cadre d'apprentissage par imitation reste un défi scientifique qui n'a que peu été étudié jusqu'à présent. Cette thèse montre que ces outils mathématiques sont appropriés à un apprentissage incrémental où l'utilisateur n'effectue qu'un nombre restreint de démonstrations pour apprendre une nouvelle tâche à un robot. Le processus d'apprentissage se déroule de manière active en utilisant en premier lieu des capteurs de mouvement placés sur différentes parties du corps de l'utilisateur, puis par un processus d'apprentissage cinesthésique qui consiste à saisir les bras du robot et à les mouvoir dans l'espace pour effectuer un geste donné pendant que ce robot enregistre par proprioception les mouvements effectués par son propre corps. Ces techniques sont ensuite appliquées à deux robots Fujitsu, HOAP-2 et HOAP-3, permettant d'extraire automatiquement les contraintes de mouvement liées à diverses tâches de manipulation. L'apprentissage est effectué de manière incrémentale, en rectifiant progressivement un modèle statistique encapsulant les multiples observations de la tâche à effectuer. La reproduction de ces mouvements dans de nouvelles situations peut alors être considérée par application de procédés de régression statistique sur le modèle appris.

Les contributions de cette thèse sont triples : (1) elle contribue à un apprentissage robotique par imitation en proposant un modèle générique et probabiliste qui permet de traiter les problèmes de reconnaissance, de généralisation, de reproduction et d'évaluation, et plus particulièrement de traiter l'extraction automatique des contraintes d'une tâche et la détermination d'un contrôleur satisfaisant plusieurs contraintes simultanément pour reproduire la tâche en question dans des contextes variés ; (2) elle contribue à la recherche sur l'interaction homme-robot en proposant des méthodes actives d'apprentissage qui intègrent l'utilisateur en tant qu'élément-clé au cycle d'apprentissage du robot, en apprenant la tâche à reproduire de manière incrémentale, en s'aidant du soutien fourni par l'utilisateur et en utilisant différentes modalités pour exécuter les démonstrations ; (3) elle contribue finalement à la robotique et à l'étude des interactions homme-robot au travers d'applications pratiques variées du modèle montrant l'apprentissage de gestes de communication et de tâches de manipulation. La généricité du modèle proposé est également démontrée au travers d'expériences en collaboration avec d'autres chercheurs, montrant ainsi comment l'approche proposée se combine facilement avec d'autres méthodologies d'apprentissage.

Mots clés : programmation de robots par démonstration, apprentissage par imitation, interaction homme-robot, mixtures de gaussiennes, chaînes de Markov cachées, régression statistique.


Acknowledgments

I would like to thank Aude Billard for her continuous support, patience and encouragement as supervisor during my PhD, for guiding my research in a wise and involved manner, for offering me great opportunities to present my work at various conferences and workshops, and for bringing me into contact with other researchers in different universities and institutes. This acknowledgment would not be complete without thanking her particularly for the time she devoted to reading manuscripts of this thesis and of the papers, articles and book chapter that I published during this PhD.

I also thank my three examiners, Prof. Hervé Bourlard, Dr Yiannis Demiris and Prof. Stefan Schaal, for providing useful comments and advice on an earlier version of the manuscript.

I would like to thank all of my colleagues for the support and friendship that they offered me during this PhD, as well as for their help and collaboration in the various experiments that we set up with the robot during these years.

Thanks go out to my family and friends for encouraging me and for bringing me out into the natural light when I started getting hypnotized by the computer screen (or compulsively playing imitation games with the robot).

Thanks also to the École Polytechnique Fédérale de Lausanne for offering me the opportunity to work in a great scientific environment and enjoyable atmosphere.

The financial support for this work was provided in part by the Secrétariat d'État à l'Éducation et à la Recherche (SER), under Contract FP6-002020, Integrated Project Cogniron of the European Commission Division FP6-IST Future and Emerging Technologies, and by the Swiss National Science Foundation, through grant 620-066127 of the SNF Professorships program.


Table of Contents

1 Introduction
  1.1 Review of Robot Programming by Demonstration (RbD)
  1.2 Current state of the art in RbD
  1.3 Illustration of our RbD approach
  1.4 Contributions of the thesis
  1.5 Organization of the thesis

2 System architecture
  2.1 Outline of the chapter
  2.2 Encoding through Gaussian Mixture Model (GMM)
  2.3 Encoding through Hidden Markov Model (HMM)
  2.4 Reproduction and extraction of constraints through Hidden Markov Model (HMM)
  2.5 Reproduction and extraction of constraints through Gaussian Mixture Regression (GMR)
  2.6 Reproduction of trajectories by considering multiple constraints
  2.7 Recognition, classification and evaluation of a reproduction attempt - Toward defining a metric of imitation
  2.8 Learning of the model parameters
  2.9 Reduction of dimensionality and latent space projection
  2.10 Number of components in GMM/HMM and initialization
  2.11 Regularization of the GMM parameters
  2.12 Temporal alignment of the demonstrated trajectories through Dynamic Time Warping (DTW)
  2.13 Use of prior information
  2.14 Extension to mixture models of different density distributions
  2.15 Summary of the chapter

3 Experimental setup
  3.1 Outline of the chapter
  3.2 Humanoid robots
  3.3 Stereoscopic vision
  3.4 Motion sensors

4 Testing and optimization of the parameters
  4.1 Outline of the chapter
  4.2 Optimal latent space of motion for HMM and robustness to noise
  4.3 Optimal latent space of motion for GMM
  4.4 Optimal selection of the number of parameters in the GMM
  4.5 Robustness evaluation of the incremental learning process

5 Extraction of constraints experiments
  5.1 Outline of the chapter
  5.2 Extraction of constraints and reproduction by using a single controller
  5.3 Extraction of constraints and reproduction combining multiple controllers
  5.4 Extraction of constraints by using incremental teaching methods

6 Discussion and further work
  6.1 Outline of the chapter
  6.2 Roles of an active teaching scenario
  6.3 Observational learning using motion sensors and decomposition of the gesture into joint angles
  6.4 Advantages of the HMM representation for imitation learning
  6.5 Similarities shared by GMR and other regression methods
  6.6 Failures and limitations of the proposed RbD system
  6.7 Collaborative work
  6.8 Further work

7 Conclusion

A Additional results
  A.1 Additional results for the search of an optimal latent space for HMM
  A.2 Additional results for the search of an optimal latent space for GMM
  A.3 Additional results for the optimal selection of the number of parameters in the GMM
  A.4 Additional results for the robustness evaluation of the incremental learning process

B Publications of the author

C Source codes available on the internet

References


List of Abbreviations

BIC Bayesian Information Criterion.

BMM Bernoulli Mixture Model.

BSS Blind Source Separation.

CCA Canonical Correlation Analysis.

DOF Degree Of Freedom.

DTW Dynamic Time Warping.

EM Expectation-Maximization.

GMM Gaussian Mixture Model.

GMR Gaussian Mixture Regression.

HMM Hidden Markov Model.

HRI Human-Robot Interaction.

ICA Independent Component Analysis.

MLE Maximum Likelihood Estimation.

PbD Programming by Demonstration.

PCA Principal Component Analysis.

RbD Robot Programming by Demonstration.


List of Symbols

(·)^T Transpose of (·).

(·)† Pseudo-inverse of (·).

n Number of demonstrations.

T Number of time steps in a demonstration.

N Number of datapoints in a training set.

(d − 1) Spatial dimensionality in the original data space.

d Dimensionality in the original data space with temporal information.

(D − 1) Spatial dimensionality in the latent space.

D Dimensionality in the latent space with temporal information.

K Number of Gaussian components.

x Training set in the original Cartesian data space.

θ Training set in the original joint angle data space.

A Linear matrix projecting the original data space to a latent space.

v Eigenvector.

λ Eigenvalue.

ξ Training set (in the latent space).

ξt Temporal component of the training set (in the latent space).

ξs Spatial component of the training set (in the latent space).

ξ̄s Expected mean of the generalized trajectory in the latent space (spatial component).

Σ̄s Expected covariance of the generalized trajectory in the latent space (spatial component).

N(µ, Σ) Gaussian distribution described by mean µ and covariance matrix Σ.

N(ξ; µ, Σ) Probability of datapoint ξ where the density function is a Gaussian distribution.

π Prior probability of a Gaussian distribution in a GMM.

µ Mean of a Gaussian distribution.

Σ Covariance matrix of a Gaussian distribution.

ϖ Initial state probability in an HMM.

ai,j Transition probability (from state i to state j) in an HMM.

E Cumulated posterior probability.

B(µ) Bernoulli distribution of parameter µ.

B(ξs; µ) Probability of a binary datapoint ξs where the density function is a Bernoulli distribution.

𝔼 Expectation.

L Likelihood.

H Metric of imitation.

SBIC Score of the Bayesian Information Criterion (BIC).
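Two of the symbols above, N(ξ; µ, Σ) and SBIC, can be made concrete with a short sketch. The code below is a minimal illustration under our own assumptions and is not the thesis implementation; in particular, the thesis may use a different scaling of the criterion, whereas here we use the common form −2 log L + n_p log N, with n_p the number of free GMM parameters. This is the kind of score used to select the number of components K (cf. Section 2.10).

```python
import numpy as np

def gaussian_pdf(xi, mu, sigma):
    """N(xi; mu, Sigma): density of datapoint xi under a Gaussian distribution."""
    d = mu.shape[0]
    diff = xi - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def gmm_log_likelihood(data, priors, mus, sigmas):
    """log L of a training set: sum over datapoints of log sum_k pi_k N(xi; mu_k, Sigma_k)."""
    ll = 0.0
    for xi in data:
        p = sum(pi * gaussian_pdf(xi, mu, sig)
                for pi, mu, sig in zip(priors, mus, sigmas))
        ll += np.log(p)
    return ll

def bic_score(data, priors, mus, sigmas):
    """One common form of the BIC score: -2 log L + n_p log N."""
    N, D = data.shape
    K = len(priors)
    # free parameters: K-1 priors, K means of size D, K symmetric covariances
    n_p = (K - 1) + K * D + K * D * (D + 1) // 2
    return -2.0 * gmm_log_likelihood(data, priors, mus, sigmas) + n_p * np.log(N)
```

Fitting GMMs with increasing K and keeping the model with the lowest such score trades data fit against model complexity.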


1 Introduction

The use of Programming by Demonstration (PbD) for humanoid robots explores user-friendly means of teaching a robot new skills without the need to re-program it through a computer language. The aim is to have a robot that can quickly learn new skills by observing a human user performing a task, and that can re-use the learned skill in new situations. In this introduction, we first provide in Section 1.1 an overview of the use of PbD in robotics, namely Robot Programming by Demonstration (RbD). We show how this paradigm appeared in software development and how it was taken up in robotics, starting from a purely engineering viewpoint and progressively adopting machine learning perspectives to generalize the learned skills. We then present how the development of humanoid robots introduced new perspectives in robotics, and how it subsequently affected the PbD paradigm, introducing new challenges in terms of human-robot interaction. Section 1.2 presents the current state of the art in RbD, highlighting the various directions adopted in that field and how the present work is situated within it. Section 1.3 presents an illustrative example of the issues tackled by this thesis. Section 1.4 presents the contributions of this thesis and Section 1.5 its organization.

1.1 Review of Robot Programming by Demonstration (RbD)

1.1.1 PbD in software development

Programming by Example (PbE) or Programming by Demonstration (PbD) appeared in software development research in the early 1990s to define an intuitive way for the end-user to re-program a computer simulation without having to learn a new language (Cypher, 1993; Smith, Cypher, & Spohrer, 1994; Mitchell, Caruana, Freitag, McDermott, & Zabowski, 1994). It mainly developed from the observation that most end-users were skilled at using a computer and its associated interfaces and applications (edition skills) but were only able to re-program them in a limited way (by setting preferences). Due to the limited impact of the numerous attempts at creating simple languages for end-users to re-program their computers,¹ some researchers focused on re-formulating the problem from its source and took the perspective that the main issue in teaching a task to a computer was not to design a better syntax for a programming language but to allow the computer to learn differently, taking into consideration that most computer applications followed similar graphical user interface strategies (an environment with windows, menus, icons and a pointer controlled by the mouse). By taking advantage of the user's knowledge of the task (the user knows how to perform a task but does not know how to program it), they designed software that was able to extract rules simply through the use of the interfaces (Cypher, 1993; Lieberman, 2001). The syntax of the program was then hidden from the users, who only demonstrated the skill to the computer. Due to the restricted set of behaviours available in the graphical environment considered, the learned skills consisted mostly of discrete sets of "IF-THEN" rules that were extracted automatically by demonstrating the tasks to the computer. It was then possible to generalize the skill and reproduce the behaviour in different circumstances sharing similarities with the learned skill (Smith et al., 1994). For example, Mitchell et al. (1994) proposed to use feed-forward neural networks to learn the user's preferences when using a calendar, where the application progressively became able to automatically plan adequate meetings and appointments with respect to the end-user's needs.

¹ See, e.g., the LOGO programming language (Goldenberg & Feurzeig, 1987).

1.1.2 Early work in Robot Programming by Demonstration (RbD)

The PbD paradigm naturally attracted a lot of interest in robotics, due to the costs involved in the development and maintenance of robot programs (Friedrich, Muench, Dillmann, Bocionek, & Sassin, 1996; Muench, Kreuziger, Kaiser, & Dillmann, 1994). In this field, the operator has implicit knowledge of the task to achieve (he/she knows how to do it), but does not usually have the programming skills (or the time) required to reconfigure the robot. Demonstrating how to achieve the task through examples would thus make it possible to learn the skill without explicitly programming each detail. However, in robotics the problem is more complex than for PbD in software applications, because the demonstrations and the reproduction attempts are not necessarily performed on the same medium. Indeed, when using PbD for software applications, the demonstrator shares the same embodiment as the imitator. In contrast, by using different architectures to demonstrate and reproduce the skills (as is the case in robotics), the issues related to the difference of embodiments are also introduced, referred to later as a correspondence problem (Alissandrakis, Nehaniv, & Dautenhahn, 2002). By considering the software and hardware architectures simultaneously, researchers soon noticed that the embodiment of the artificial systems played a major role in the learning paradigm (Pfeifer & Scheier, 2001).

The first PbD strategies proposed in robotics were based on teach-in, guiding or play-back methods that consisted basically of moving the robot (through a dedicated interface or manually) through a set of relevant configurations that the robot should adopt sequentially (position, orientation, state of the gripper). After explicitly collecting a set of keypoints, the full motion was then computed by interpolation (Lozano-Perez, 1983). The method was then progressively improved by focusing principally on teleoperation control and by using different interfaces such as vision (Kuniyoshi, Inaba, & Inoue, 1994) or a laser range finder (Ikeuchi & Suchiro, 1992).
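As a concrete illustration of this guiding/play-back idea, the sketch below (a toy example of our own, not code from the thesis or the cited works) records a few keypoint configurations and fills in the motion between them; we use simple linear interpolation over joint positions, whereas practical systems typically used richer interpolation schemes.

```python
import numpy as np

def interpolate_keypoints(keypoints, steps_per_segment=10):
    """Turn a sparse sequence of recorded robot configurations
    (one row per keypoint) into a dense trajectory by linear
    interpolation between consecutive keypoints."""
    keypoints = np.asarray(keypoints, dtype=float)
    path = []
    for q0, q1 in zip(keypoints[:-1], keypoints[1:]):
        # sample the segment from q0 (inclusive) to q1 (exclusive)
        for s in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            path.append((1.0 - s) * q0 + s * q1)
    path.append(keypoints[-1])  # include the final configuration
    return np.array(path)

# Three demonstrated configurations of a hypothetical 2-DOF arm (joint angles in rad)
traj = interpolate_keypoints([[0.0, 0.0], [0.5, 1.0], [1.0, 0.0]], steps_per_segment=5)
```

The record/replay limitation discussed next is visible here: the trajectory passes exactly through the demonstrated keypoints and cannot adapt if the task context changes.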

1.1.3 Toward the use of machine learning techniques in

RbD

However, these RbD methods still used direct repetition, which was useful in

industry only when conceiving an assembly line using exactly the same product

components. To apply this concept to products with different variants or to

apply the programs to new robots, the generalization issue became a crucial

point. To address this issue, the first attempts at generalizing the skill were

mainly based on the help of the user through queries about the user’s intentions

(Heise, 1989; Friedrich et al., 1996). Then, different levels of abstractions were

proposed to resolve the generalization issue, basically dichotomized in learning

methods at a symbolic level (Muench et al., 1994) or at a trajectory level (Ude,

1993).2 A task at a symbolic level is described by the sequential or hierarchical

organization of a discrete set of primitives that are pre-determined or extracted

with pre-defined rules. A task at a trajectory level is described by temporally

continuous signals representing different configuration properties changing over

time (e.g., current, torques, positions, orientations). Different levels of abstrac-

tion can be used (e.g., the position of the end-effector is a representation of

the robot’s configuration which is of higher-level than the current sent to the

motors to set the robot to this posture), which will also be referred to later

as the notion of granularity, i.e., by considering different levels of segmentation

(Alissandrakis et al., 2002). However, a task processed at a trajectory level still

uses a lower level of abstraction than a task described at a symbolic level, which

will be the main focus of this thesis.

By introducing the generalization issue in RbD, Machine Learning (ML)

soon became a close friend to robotics (Thrun & Mitchell, 1993; Heise, 1989;

Friedrich et al., 1996). Several techniques proposed in ML could be tested with

multiple examples of input/output (sensory/actuators) datasets that perfectly

fit in most of the frameworks developed theoretically in ML, while robotics

benefitted from the ML ability to cope with multivariate data and generaliza-

tion capabilities. It was then possible to extend the record/replay process to a

generalization of the skill to different contexts.3

2Note that in these two mainstreams, different levels of abstraction were considered aswell. For example, at trajectory level, one can consider the torques or current as being at thelowest level, followed by joint angles trajectories. The trajectories followed by the end-effectorin Cartesian space (computed by direct kinematics) would then constitute a more abstractlevel in the hierarchy.

3Note that this record/replay memory-based process is also referred to as motor tapes, in

3

Page 16: lasa.epfl.chlasa.epfl.ch/publications/uploadedFiles/EPFL_TH3814.pdf · POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES PAR ingénieur en microtechnique diplômé EPF de nationalité

In industrial robotics, Muench et al. (1994) suggested the use of ML algo-

rithms to analyze Elementary Operators (EOs) defining a discrete set of basic

motor skills. In this early work, the authors already established several key-

issues of ML in robotics. Even if they did not provide answers for every issues

introduced, they pointed to several key problems that will be discussed through-

out this thesis in terms of encoding, generalization, reproduction of a skill in

new situations, evaluation of a reproduction attempt, and role of the user in the

learning paradigm.

First they introduced the problems related to the segmentation and extrac-

tion of EOs, which was resolved in their framework through the user’s support.

To leverage this task, they suggested to use statistical analysis to keep only the

EOs that appeared repeatedly (generalization at a symbolic level). Then, by

observing several sequences of EOs, they pointed to the relevance of extracting

a structure from these sequences (generalization at a structure level). In their

work, both generalization processes were supervised by the user, who is asked to

tell whether an observation is example-specific or task-specific. This extraction

of the dependencies at a symbolic level is usually referred to as a functional

induction problem (Dufay & Latombe, 1984), and has been explored further by

many researchers (Saunders, Nehaniv, & Dautenhahn, 2006; Ekvall & Kragic,

2006; Pardowitz, Zoellner, Knoop, & Dillmann, 2007; Steil, Rothling, Haschke,

& Ritter, 2004; Nicolescu & Mataric, 2003).

Muench et al. (1994) admitted that generalizing over a sequence of discrete

actions is only one part of the problem since the controller of the robot also

requires the learning of continuous functions to control the actuators. They

proposed to overcome the missing parts of the learning process by delegating them to the user, who took an active role in the teaching process. They highlighted the importance of providing a set of examples that are usable by the

robot: (1) by constraining the demonstrations to modalities that the robot can

understand; and (2) by providing a sufficient number of examples to achieve a

desired generality. They noted the importance of providing an adaptive con-

troller to reproduce the task in new situations, that is, how to adjust an already

acquired program. The evaluation of a reproduction attempt was also delegated to the user by letting him/her provide additional examples of the skill in the

regions of the learning space that had not been covered yet. In this way, the

teacher/expert could control the generalization capabilities of the robot.4 Even though research in RbD was turned toward fully autonomous learning systems, early results quickly showed the importance and advantages of using an interaction

that the robot stores an explicit representation of a movement trajectory in memory and executes the appropriate tape to reproduce a task. From this simple process, more sophisticated versions proposed to extend this concept to the combination of a set of tapes to produce the final movement (Atkeson et al., 2000).

4Note that the term "expert" can be misleading because it makes no distinction between an expert at executing the skill and an expert at teaching it. The first implies that the user has skills for the execution of the task considered, while the second implies pedagogical skills.


process involving the user and the robot to cope with the deficiencies of the

current learning systems.5

To summarize, early work in RbD adopted principally a user-guided general-

ization strategy by querying the user to provide additional sources of information

for the induction process. The approach followed in this thesis is to lighten this process (or at least to modify the user's role so as to avoid burdening him/her with queries from the robot) by adapting and combining several probabilistic ML algorithms in an RbD framework. As the datasets considered in RbD are usually smaller than the ones considered in ML, the convergence of the learning system becomes a principal issue, which we suggest tackling here by using a statistical framework encapsulating the important properties of a skill.

We now show how the first appearances of humanoid robots introduced new challenges in this field, and how this consequently modified the view of RbD.

1.1.4 From industrial robots to humanoid robots

Building a human-like artificial creature is a longstanding scientific dream,6 which robotics engineers have pursued since Waseda University introduced WABOT-1, the first full-scale humanoid robot, in 1973 (Kato, 1973).

Besides its intriguing nature and its technical challenge, humanoid robotics is a demanding field in terms of learning and communication, where the robot is expected to collaborate with humans in a shared environment. Compared to an industrial robot programmed to repetitively perform a task in a static environment, a humanoid robot is expected to perform a large repertoire of skills in dynamic situations (Thrun, 2004). The roles of the hardware (robot) and software

(learning) components are conceptually recast. Indeed, instead of re-designing

the environment for the robot, the robot architecture and learning components

are re-designed to adapt to changing environments. Besides its ergonomic benefit when used in the surroundings of humans, the humanoid architecture is also

optimized for communication, which also increases its capabilities for teamwork

and companionship (Kanda, Hirano, Eaton, & Ishiguro, 2004). Indeed, hu-

manoids can inspire a sense of familiarity that increases the social dimension

of the interaction (Alissandrakis et al., 2002).7 With this emerging field of research, the aim is not to replace every robot by its humanoid counterpart, but to use humanoids where they can present social or collaborative advantages (Dautenhahn, 1997; Kahn, Freier, Friedman, Severson, & Feldman, 2004). It would for

example be counterproductive to replace a robot performing a single repetitive

operation in an industrial assembly line by its humanoid version.

5Note that in this body of work, the supervision and guidance of the user took a large part (at the demonstration, generalization and reproduction levels) compared to the recent trends in RbD following this work.

6See for example the humanoid automaton designed by Leonardo da Vinci in the 15th century (Rosheim, 2006).

7Note that this social dimension can take culturally various forms (Kaplan, 2004).


New challenges came along with these new architectures, especially for learning in robotics. As the humanoid robot is by its nature supposed to adapt to new environments, not only is the human appearance important, but the algorithms used for its control also require flexibility and versatility. Due to continuously changing environments and to the huge variety of tasks that a robot is expected to perform, the robot should have the ability to continuously learn new skills and adapt existing skills to new contexts. As the robot is expected to physically cooperate with the human user (e.g., by moving in the same rooms, sharing the same devices or manipulating the same tools), it also needs to be predictable and to behave in a human-like way regarding social interaction, gestures or learning behaviour. Indeed, it has been shown that humans may negatively evaluate a robot that looks too much like a human but behaves too little like one (Ishiguro, 2006; Dautenhahn, 2003).8

Numerous fields of research emerged from this necessity to develop flexible

robot controllers. Most efforts were put into the control of the legs and the development of natural and flexible walking motions. Compared to the control of

wheel-based autonomous robots, the main issue tackled here is concerned with

the design of closed-loop controllers and dynamical systems to ensure stability

and to control walking patterns flexibly (Yamane & Nakamura, 2003; Nakanishi

et al., 2004; Righetti & Ijspeert, 2006). Researchers also tackled the issue of leg stability when performing an additional task with the rest of the body (Khatib,

Sentis, Park, & Warren, 2004; Naksuk & Lee, 2005; Matsumoto, Konno, Gou,

& Uchiyama, 2006). Recently, research also turned toward running controllers

in humanoid robotics (Kajita, Nagasaki, Yokoi, Kaneko, & Tanie, 2002).

Concerning manipulation skills, several issues were addressed. A body of

work concentrated on manipulation at the level of the hand and fingers by

learning several dexterous types of grasping, depending on the object to be ma-

nipulated (Caurin, Albuquerque, & Mirandola, 2004; Aleotti & Caselli, 2006a;

Ekvall & Kragic, 2005). Another body of work focused on learning the dynamics

of simple reaching movements (Hersch & Billard, 2006; Svinin, Goncharenko,

Luo, & Hosoe, 2006; Arimoto, Hashiguchi, Sekimoto, & Ozawa, 2005). Sev-

eral researchers adopted the perspective that teleoperation is currently the most

suitable approach to control upper-body motion (Sakamoto, Kanda, Ono, Ishiguro, & Hagita, 2007), and enlarged the perspective of the approach by using

bilateral teleoperation where the experience of the robot in the environment is

reflected back to the operator (Yoon, Suehiro, Onda, & Kitagaki, 2006; Takubo,

Inoue, Arai, & Nishii, 2006), and by using dedicated haptic interfaces to control a

high number of degrees of freedom simultaneously (Yokoi et al., 2006). Another

approach proposed was to reduce the multidimensional control complexity by

mixing teleoperation with autonomous behaviours (Chestnutt, Michel, Nishi-

waki, Kuffner, & Kagami, 2006). Another body of work suggested separating the learning phase, performed in a virtual world, from the reproduction phase performed on the real robot (Dillmann, 2004; Aleotti, Caselli, & Reggiani, 2004).

8This notion is referred to as the Uncanny Valley phenomenon (Mori, 1970).

To develop flexible robot controllers, another body of work, which is the one

adopted in this thesis, is concerned with the use of probabilistic tools to gener-

alize a skill by observing multiple instances of it. We now present an overview

of this approach.

1.1.5 From a simple copy to the generalization of a skill

To reproduce a skill in a new situation, the robot cannot simply copy an

observed behaviour; it must have the capability to generalize. This can be

achieved by asking explicit questions to the user, as proposed by Muench et

al. (1994). The robot can also find a control strategy by self-experimentation

and practice. Following this, Hwang, Choi, and Hong (2006) proposed to use

genetic algorithms to learn manipulation skills. Other approaches focused on

reinforcement learning to let the robot discover new skills (Peters, Vijayakumar,

& Schaal, 2003; Yoshikai, Otake, Mizuuchi, Inaba, & Inoue, 2004).

Several researchers in robotics adopted the perspective that learning through

self-experimentation and practice can be used jointly with imitation, where observations are used to initiate a further search for a controller through repeated reproduction attempts. In Atkeson and Schaal (1997), a single observation is used to initiate this search. The authors demonstrated their approach with a robot manipulator

learning how to swing a pendulum by first observing the user performing the

task and then updating the model through practice. In Bentivegna, Atkeson,

and Cheng (2004), a humanoid robot learns how to play air-hockey and a ded-

icated robot learns how to move a ball in a pan-tilt maze game by following a

similar learning strategy. The sub-goals of a skill are first extracted by obser-

vation; the robot then learns how to achieve these sub-goals by practicing the

task using a reinforcement learning strategy. A body of work also considered the individual exploration of the search space as a basis to investigate how knowledge is propagated among a population of agents (Jansen & Belpaeme, 2006;

Steels, 2001).

Another trend of research, which is the one adopted in this thesis, took the

perspective that observing multiple demonstrations can help generalize a skill by extracting the task requisites. To extract and reproduce only

the essential aspects of a task, different approaches based on statistics have

been proposed which aim at extracting the invariant features across multiple

demonstrations (Ekvall & Kragic, 2006; Pardowitz et al., 2007; Delson & West,

1996; Sato, Genda, Kubotera, Mori, & Harada, 2003; Ogawara, Takamatsu,

Kimura, & Ikeuchi, 2003; Nicolescu & Mataric, 2003; Jansen & Belpaeme, 2006).

In Delson and West (1996), the amount of human variation among multiple

trials is used to define the accuracy requirements along a motion in a Pick &

Place task. The region bounded by the different trajectories is used to define


a range of acceptable trajectories, where a lower variation indicates a higher

accuracy requirement.

In Ogawara et al. (2003), simple statistics are used at each time step to

compute the mean and variance of a set of spatial variables observed through

the task. Then, the sequences of means and variances provide respectively a generalized trajectory and the associated constraints. There are three drawbacks to

this approach: (1) the system is memory-based and requires keeping all historical

data, which can lead to a scaling-up problem (see the rapid development of

sensors for humanoid robots exploiting various modalities); (2) as RbD considers

only a few demonstrations of the task, using simple statistics is usually not

sufficient to guarantee the generation of trajectories that are smooth enough

to be replayed by the robot; and (3) the constraints concerning the correlations across the different variables are not extracted.
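The kind of per-time-step statistics used in this family of approaches can be sketched as follows. This is a simplified illustration, assuming the demonstrations have already been temporally aligned and resampled to a common length; the `generalize` function and the toy data are illustrative, not taken from Ogawara et al. (2003).

```python
# Sketch of per-time-step statistics over a set of aligned demonstrations.
# Assumes K demonstrations of equal length T, e.g., after dynamic time warping.

def generalize(demos):
    """demos: list of K trajectories, each a list of T scalar positions.
    Returns (means, variances), one pair per time step."""
    T = len(demos[0])
    K = len(demos)
    means, variances = [], []
    for t in range(T):
        values = [demo[t] for demo in demos]
        mu = sum(values) / K
        var = sum((v - mu) ** 2 for v in values) / K
        means.append(mu)
        variances.append(var)  # low variance = high accuracy requirement
    return means, variances

# Three toy demonstrations of a 3-step motion
demos = [[0.0, 1.0, 2.1], [0.2, 1.0, 1.9], [0.1, 1.0, 2.0]]
means, variances = generalize(demos)
```

Time steps with low variance correspond to portions of the motion that must be reproduced accurately, while high-variance portions leave more freedom to the reproduction.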

1.1.6 From a purely engineering perspective to an

interdisciplinary approach

In the previous sections, we noted the importance of the embodiment and high-

lighted the use of machine learning techniques to tackle the generalization issue

in RbD. With the development of humanoid robot architectures, researchers

focused on the use of intelligent controllers to make these robots move appropriately to their embodiment, i.e., with human-like movements and appropriate

cognitive capabilities.

Perhaps due to this rise of humanoids, research in RbD progressively

departed from its original purely engineering perspective to adopt an inter-

disciplinary approach, taking insights from neuroscience and social sciences to

emulate the process of imitation in humans and animals. The first insight was

particularly marked by the discovery in primates of specific neural mechanisms for visuo-motor imitation (referred to as mirror neurons) (Rizzolatti,

Fogassi, & Gallese, 2001; Rizzolatti, Fadiga, Fogassi, & Gallese, 2002; Decety,

Chaminade, Grezes, & Meltzoff, 2002). These works inspired a large trend of

research aiming at finding biologically plausible solutions to the RbD paradigm

(Billard & Schaal, 2006; Arbib, Billard, Iacobonic, & Oztop, 2000; Demiris

& Hayes, 2002). The second insight was marked by a collection of work ex-

ploring the developmental stages of imitation capabilities in children (Piaget,

1962; Meltzoff & Moore, 1977; Bekkering, Wohlschlaeger, & Gattis, 2000; Nadel,

Guerini, Peze, & Rivet, 1999; Zukow-Goldring, 2004). With the increasing con-

sideration of this body of work in robotics, the notion of Robot Programming by

Demonstration was also progressively replaced by the more biological label of

Learning by Imitation (Billard & Dillmann, 2006).

To understand and describe the imitation paradigm, a direction of research

concentrated on identifying a number of key-issues for ensuring a generic ap-

proach to the transfer of skills across various agents and situations. In this direction, Nehaniv and Dautenhahn (2000) defined a generic framework for imitation,

where the key-issues were formulated into a set of generic questions, namely

what-to-imitate, how-to-imitate, when-to-imitate and who-to-imitate. Following

this approach, Nehaniv and Dautenhahn (2001) suggested mapping the state, effect and action metrics between a demonstrator and an imitator in a generic

algebraic framework, and showed that the framework can be used to address

the correspondence problem in both natural and artificial systems (Nehaniv &

Dautenhahn, 2002). Demiris and Hayes (2002) proposed a framework high-

lighting the dual role of perception and prediction in imitation, and explored

the use of this framework in robotics. Based on this approach, the framework

was then extended to the use of building blocks consisting of pairs of inverse

and forward models that compete in parallel and in a hierarchical manner to

both plan and execute an action (Demiris & Khadhouri, 2006). The architecture allows both perception and action to be treated in a unified model, and is able

(through the forward model) to provide an estimate of the upcoming outcomes

when observing an action performed by the demonstrator.

1.2 Current state of the art in RbD

The previous section described the different approaches adopted in RbD by

providing principally an historical overview. In this section, we develop some of

the aspects previously discussed and present how this thesis is situated within the various directions of research in the current state of the art. We review here the major trends of research by focusing essentially on work that has been directly applied to robotics.

1.2.1 Gesture interfaces

One trend of research is to focus on the sensory system of the robot, by assum-

ing that for learning new skills, the robot should first be able to perceive and

interpret a wide range of communicative modalities and cues. Among the var-

ious human-robot interfaces, gesture-based interfaces are of particular interest for ensuring the transfer of manipulation skills to the robot.

Mice and keyboards are input devices that are very common in Human-Computer Interaction (HCI) to control software through a screen interface, that is, to control a motion with 2 degrees of freedom (DOFs) or a sequence of symbolic

commands. However, these input devices are not well designed to control the

high number of DOFs of a robot, particularly when considering a manipulation

robot or a humanoid robot. For transferring skills to these robots, research in

human-robot interfaces focused on the simultaneous control of the high number

of DOFs, by using for example solutions originally developed for virtual reality

or 3D computer graphics applications.

Interfaces for body gesture tracking can be roughly dichotomized into vision-based and non-vision-based systems. Vision-based systems are often assumed to

provide the most natural way of communicating between human and machine

collaborators. However, this type of interaction depends considerably on the progress in the computer vision field (Shon, Storz, & Rao, 2007; D. Lee &

Nakamura, 2007).

Information concerning human body gestures can also be retrieved by various mechanical, inertial or magnetic sensors attached to different body parts, recording position and/or orientation information that allows the evolution of the body pose to be reconstructed through time. These sensors use and combine

different modalities to record information, commonly based on gyroscopes, ac-

celerometers or magnetometers (see Hightower and Borriello (2001) and Rolland,

Davis, and Baillot (2001) for surveys on tracking technologies). Compared to

vision-based trackers, these motion sensors have the advantage of being robust to occlusions and of providing position and orientation information at a usually lower computational cost than vision. Note that magnetic motion sensors

can nevertheless suffer from the interference of low-frequency current-generating

devices such as a CRT-type display (Shin, Lee, Shin, & Gleicher, 2001). The

main drawback of these non-vision based devices is that the sensors must be

attached to the human body and usually require a calibration phase. Note that

similar requirements also appear when using visual markers, and that the recent development of clothes directly embedding the sensors can lower this constraint (Rossi, Bartalesi, Lorussi, Tognetti, & Zupone, 2006).

The use of mechanically based trackers was also explored by attaching a

rigid articulated structure (with encoders at the joint level) to the body of the

user (Hollerbach & Jacobsen, 1996; Tachi, Maeda, Hirata, & Hoshino, 1994).

A similar strategy can be used to teach a robot new skills by demonstrating

a gesture using an external robotic tracker, that is, not attached directly to

the body of the user. The term kinesthetic teaching will be used throughout

this thesis to describe this process. This can be achieved either by setting the

robot’s motors in a passive mode (Peshkin, Colgate, & Moore, 1996), or by

implementing a gravity compensated controller (Heinzmann & Zelinsky, 1999;

Massie & Salisbury, 1994). In this case, force sensors are often used to perceive

the force applied at the end-effector of the robot’s arm. Haptic interfaces that

convey force and torque feedback are widely used in telerobotics to dynamically

interact with robotic agents operating on remote sites (Yokoi et al., 2006). Sim-

ilarly, arm exoskeletons used primarily for rehabilitation can also be used as a way

to control robots by executing gestures (Gupta & O’Malley, 2006).

As the human’s hands allow to produces a wide range of poses due to the high

number of articulations, they can also convey robust information for interaction

(e.g., sign language, pointing). Similarly, hand gestures can convey robust infor-

mation for a human-robot interaction application by using data gloves designed

to record the joint angles trajectories of the hands, based principally on optical

or resistive properties of flexible materials (bend-sensing devices). A survey of


glove-based input interfaces can be found in (Sturman & Zeltzer, 1994), and

applications in robotics range from the control of mobile robots by recognizing

specific hand gestures (Iba, Weghe, Paredis, & Khosla, 1999) to the use of data

gloves to demonstrate a manipulation skill to a manipulator robot (Dillmann,

2004).

In this thesis, we adopted the perspective that a natural human-robot inter-

action does not necessarily pass through the use of ”natural” systems to record

gestures. We thus do not restrict ourselves to the use of vision to track gestures, and use the "extra-sensorial" abilities of the robot's current tracking technology to observe a gesture through motion sensors. One important aspect of this

thesis is to build a generic learning system allowing the use of various modal-

ities to record gestures. The solution retained will be to use motion sensors

combining inertial and magnetic measures to record body gestures and to use

the robot’s body as an external system to demonstrate a gesture by setting the

robot’s motors in a passive mode and manipulating the robot’s arms manu-

ally. We also deliberately use commercial products (for the robots and sensory

systems used) to allow other researchers to reproduce the set of experiments

presented throughout the thesis or to build new applications upon the current

framework (which has been achieved through collaborative work).

Transferring a skill to a robot involves a correspondence problem between the

user’s body and the robot’s body. Indeed, transferring a skill does not mean to

simply copy the gesture but requires to determine which are the task constraints

to reproduce the skill with respect to the robot’s particular characteristics. To

deal with this issue, a body of work proposed to consider an intermediate in-

terface between the demonstration of the skill by the user and the reproduction

of the skill by the robot, based on Virtual Reality (VR) or Augmented Reality

(AR) interfaces where the user can provide additional advice for the transfer of

the task (Kheddar, 2001). This intermediate layer of interaction allows the user

to concentrate on the task to achieve by relieving him/her of the need to consider the specific robot capabilities. This interface can thus act as a more flexible control space providing additional information and feedback before the real execution

on the robot, by considering a bilateral teleoperation where the experience of

the robot in the environment is reflected back to the operator (Yoon et al.,

2006; Takubo et al., 2006). An overview of the synergy between virtual reality

and robotics can be found in (Burdea, 1999; Nguyen et al., 2001). In Robot

Programming by Demonstration (RbD), a solution proposed for the transfer of

a skill thus consists in performing the learning phase in a virtual world and the

reproduction phase on the real robot (Dillmann, 2004; Aleotti et al., 2004). To

do this, the robot is usually modeled in a virtual space, where virtual fixtures

(or virtual guides) are used to simplify and highlight the task requirements for

the human operator (Kheddar, 2001; Otmane, Mallem, Kheddar, & Chavand,

2000). A similar paradigm using Augmented Reality (AR) can be considered

by displaying a real view of the robot space (e.g., by visualizing the robot on a


video) where additional information is added to the image, which helps the user transfer skills to a robot (Milgram, Rastogi, & Grodski, 1995; Young, Xin, & Sharlin, 2007). A review of AR techniques can be found in Azuma et

al. (2001). In Cannon and Thomas (1997), AR is used to collaboratively plan

the execution of a robot action in a hazardous environment by allowing multi-

ple users with different fields of expertise to visualize the effects of the motion

virtually before controlling the real robot. In Bejczy, Kim, and Venema (1990),

a graphical representation of the robot superimposed on the video of the real

robot (called a phantom robot) is used to plan the further steps of the task.

1.2.2 Gesture representation

Encoding of human motion

Another perspective adopted in robotics to transfer a skill to a robot is to search

for an optimal way of representing gestures, which usually depends on the ap-

plication. The problem of determining an optimal human motion representa-

tion has been studied in various fields of research, including computer graph-

ics (Brand & Hertzmann, 2000), biomechanics (Wu et al., 2005) and robotics

(Gams & Lenarcic, 2006; Lenarcic & Stanisic, 2003). It plays a crucial role in

understanding the mechanics of the human body and the dynamics of human

motion.

To transfer a skill to a robot, a body of work adopted the perspective that the

robot first needs to have the ability to encode gestures in an efficient way to allow

generalization of the skill. A large body of work followed the approach that gen-

eralization should pass through a symbolic representation of the skill (Muench

et al., 1994; Friedrich et al., 1996; Nicolescu & Mataric, 2003; Pardowitz et al.,

2007; Saunders et al., 2006). Alissandrakis, Nehaniv, and Dautenhahn (2007)

suggested to represent a motion as a set of pre-defined postures, positions or

configurations by considering different levels of granularity for the symbolic rep-

resentation of the motion. This a priori knowledge was then used to explore the

correspondence problem through several simulated setups including motion in

joint space of arm links and displacements of objects on a 2D plane. Pardowitz

et al. (2007) used pre-defined subtasks to encode, at a symbolic level, a skill consisting in setting a table by sequentially placing different types of objects

such as plates or cups. The setup was used to explore the organization of the

skill through a hierarchical and incremental approach. Saunders et al. (2006)

used pre-defined behaviours to encode a skill that consisted of moving through

a maze where a wheeled robot must avoid several kinds of obstacles and reach a set

of specific subgoals. The symbolic representation of the skill was then used to

explore the scaffolding issue in the teaching process and the organization of the

skill in a hierarchical framework. In Nicolescu and Mataric (2003), a graph-

based approach is used to generalize a skill (at a symbolic level) across multiple

demonstrations performed by a human instructor. In their model, each node in


the graph represents a symbolic behaviour and the generalization takes place

at the level of the topological representation of the graph, which is updated

incrementally by computing the common subsequences and alternate paths to

generalize the skill.

The advantage of these symbolic approaches is that high-level tasks (consist-

ing of sequences of symbolic cues) can be learned efficiently through an inter-

active process. However, due to their symbolic nature, the methods rely on the

pre-determination of the observed cues and on the efficiency of the segmentation

process. Our approach uses a similar generalization paradigm but is aimed at

generalizing a skill represented as a continuous stream of sensory data, which

allows the robot to learn more generic motor skills. Other researchers followed

this approach of encoding a skill at a trajectory level (Ude, 1993; Yang, Xu, &

Chen, 1997; Yamane & Nakamura, 2003). Ude (1993) suggested to use spline

smoothing techniques to deal with the uncertainty contained in several demon-

strations of motion performed in joint space or in task space. Yang et al. (1997)

used HMMs to encode the motion of a robot’s gripper either in joint space or

task space by using either positions or velocities. They considered an assembly task where the HMMs were used to encode the skill efficiently, i.e., to provide the ability to generalize the skill over several examples.
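To give a concrete flavour of this kind of encoding, the sketch below scores a quantized trajectory against a discrete left-right HMM with the classical forward algorithm. The two-state model and its parameters are invented for illustration and are not those of Yang et al. (1997).

```python
# Minimal discrete HMM likelihood computation (forward algorithm), to
# illustrate how a trajectory, once quantized into symbols, can be scored
# against a learned left-right model.

def forward_likelihood(obs, pi, A, B):
    """obs: sequence of observation symbol indices.
    pi: initial state probabilities, A: transitions, B: emissions.
    Returns P(obs | model)."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(pi))) * B[j][o]
                 for j in range(len(pi))]
    return sum(alpha)

# Two-state left-right model over two observation symbols (e.g., "low"/"high")
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.0, 1.0]]
B = [[0.9, 0.1], [0.2, 0.8]]

# A trajectory consistent with the model versus a reversed one
p_typical = forward_likelihood([0, 0, 1, 1], pi, A, B)
p_atypical = forward_likelihood([1, 1, 0, 0], pi, A, B)
```

A demonstrated trajectory consistent with the learned model receives a much higher likelihood than an atypical one, which is what makes HMMs usable both for recognizing a skill and for generalizing it over several examples.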

Other researchers focused on the use of forces or torques to efficiently represent a manipulation skill (Yamane & Nakamura, 2003). Due to hardware limitations, we will assume throughout this thesis that kinematic information is sufficient to describe the skill and that force information can be learned through a more generic motor learning process.

Other researchers adopted a dynamical approach to deal with perturbations

and to allow on-line imitation. Ito, Noda, Hoshino, and Tani (2006) proposed to

model the imitative interaction between a human user and a humanoid robot by

using a Recurrent Neural Network (RNN) learning the dynamics of the motion

and allowing the robot to switch between different motions through a dynamic interaction.

They illustrated their approach by using dancing motions where the user first

initiated imitation and where the roles of the imitator and demonstrator could

then be dynamically interchanged. They also considered cyclic manipulation

tasks where the robot continuously moved a ball from one hand to the other

and was able to switch dynamically to another cyclic motion that consisted of

lifting and releasing the ball. In this thesis, we take the perspective that a skill

can be modelled in a continuous form by using only kinesthetic information,

that is, by assuming that the constraints can be represented in an open-loop

form at a trajectory level.9

A large body of work adopted the perspective that the mapping between sen-

sory information and motor output (sensory-motor learning) is the most relevant

9Note that Section 6.7.1 will present collaborative experiments exploring the generality of the approach through joint applications of the proposed framework with other learning techniques based on dynamics.


representation of a skill for RbD (Atkeson, 1990; Ghahramani & Jordan, 1994).

Following this approach, Locally Weighted Regression (LWR) soon appeared very appealing (Atkeson, 1990; Atkeson, Moore, & Schaal, 1997; Moore, 1992), as LWR combines the simplicity of linear least squares regression with the flexibility of nonlinear regression. But LWR is a memory-based approach, and its computation becomes limited by the memory capacity of the robot when faced with large training sets. To overcome this limitation, further work mainly concentrated

on moving on from a memory-based approach to a model-based approach, and

moving on from a batch learning process to an incremental learning strategy

(Schaal & Atkeson, 1996, 1998; Vijayakumar, D’souza, & Schaal, 2005). Schaal

and Atkeson (1998) introduced Receptive Field Weighted Regression (RFWR) as a non-parametric approach to learn the fitting function incrementally without storing the whole training set in memory (i.e., without using historical data). Vijayakumar et al. (2005) then improved the approach to operate efficiently in high-dimensional spaces, proposing Locally Weighted Projection Regression (LWPR). As these methods are complementary to our approach, the similarities will be discussed in a dedicated section after the framework has been technically defined (Section 6.5).
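The LWR principle sketched above can be illustrated with a minimal Python fragment (the kernel width, function names and toy data below are invented for illustration and are not taken from the cited works):

```python
import numpy as np

def lwr_predict(x_query, X, y, h=0.05):
    """Locally Weighted Regression: fit a Gaussian-weighted linear
    model around the query point and return its prediction there."""
    # Gaussian kernel weights centered on the query point
    w = np.exp(-((X - x_query) ** 2) / (2 * h ** 2))
    # Design matrix with a bias term for the local linear fit
    A = np.column_stack([np.ones_like(X), X])
    W = np.diag(w)
    # Weighted least squares: beta = (A^T W A)^-1 A^T W y
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return beta[0] + beta[1] * x_query

# Noisy samples of a nonlinear function
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * X) + 0.05 * rng.standard_normal(50)
print(lwr_predict(0.25, X, y))  # close to the peak value sin(pi/2) = 1
```

Each prediction re-solves a weighted least-squares problem around the query point, which is why the approach is memory-based: the whole training set must remain available at prediction time.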

Latent space of human motion

Another major issue in motor learning is concerned with the difficulty of learn-

ing motor signals that are highly dimensional. To deal with this problem, a

direction of research focused on trying to represent the high-dimensional mo-

tor space in a subspace of lower dimensionality. To search for a latent space

encapsulating the main characteristics of the motion, several linear and non-

linear methods have been proposed. Among linear methods, Principal Component Analysis (PCA) is one of the most common approaches used in various fields of research where the datasets consist of human motion data (Chalodhorn, Grimes, Maganis, Rao, & Asada, 2006; Urtasun, Glardon, Boulic, Thalmann, & Fua, 2004). Even though PCA is a simple way of projecting human motion data, the process remains fast and generic for a wide range of signals, and can easily be used as a pre-processing step to reduce the dimensionality for further analysis. Another body of work focused on local

dimensionality reduction techniques to deal with the nonlinearities of the mo-

tion in latent space (Vijayakumar & Schaal, 2000; Vijayakumar et al., 2005;

Kambhatla, 1996). Direct non-linear methods were also proposed to analyze

human motion (for transferring a skill to a humanoid robot) by using either

Gaussian Process Latent Variable Model (GPLVM) (Shon, Grochow, & Rao,

2005; Grochow, Martin, Hertzmann, & Popovic, 2004; Lawrence, 2005), Isomap

(Jenkins & Mataric, 2004), or Nonlinear Principal Component Neural Networks

(NLPCNN) (MacDorman, Chalodhorn, & Asada, 2004).
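As an illustrative sketch of such a pre-processing step (the synthetic "motion" data and function name are invented for illustration; none of this code comes from the cited works), PCA can project joint-angle data into a linear latent space:

```python
import numpy as np

def pca_project(X, d):
    """Project the D-dimensional rows of X onto the d first
    principal components (a linear latent space of motion)."""
    Xc = X - X.mean(axis=0)
    # Eigen-decomposition of the sample covariance matrix
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]        # sort by decreasing variance
    components = vecs[:, order[:d]]
    return Xc @ components, components

# Synthetic motion: 4 joint angles all driven by one latent signal
t = np.linspace(0, 1, 100)
latent = np.sin(2 * np.pi * t)
rng = np.random.default_rng(2)
X = np.outer(latent, [1.0, 0.5, -0.3, 0.8]) + 0.01 * rng.standard_normal((100, 4))
Z, comps = pca_project(X, d=1)
print(Z.shape)  # (100, 1)
```

Here a single latent dimension recovers (up to sign and scale) the common signal that drives all four joints, which is the sense in which the latent space encapsulates the main characteristics of the motion.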


1.2.3 Gesture recognition

Another approach adopted in robotics to transfer a skill to a robot is to focus on

the ability of the robot to recognize a set of gestures that are further used to con-

trol the robot and/or teach new skills by using the previously learned gestures.

Different methods were explored to recognize gestures within a human-robot

interaction framework. Brethes, Menezes, Lerasle, and Hayet (2004) suggested

to use particle filters to track the position and posture of hands using a vision

system. Ardizzone, Chella, and Pirrone (2000) used Support Vector Machine

(SVM) to recognize simple upper-body postures. Zoellner, Rogalla, Dillmann,

and Zoellner (2002) used SVM to recognize dynamic hand grasps through the

use of a glove based input device. Waldherr, Romero, and Thrun (2000) used a

feed-forward neural network to recognize gestures for the control of a wheeled

robot.

HMMs have also been applied successfully to recognize body motion trajec-

tories. C. Lee and Xu (1996) used HMMs to recognize hand gestures from the

sign language alphabet using a glove based interface within an HRI application.

Nam and Wohn (1996) used HMMs to recognize hand posture and hand motion

using vision. Park et al. (2005) used HMMs to recognize upper-body gestures

used to control a humanoid robot through vision. Fujie, Ejiri, Nakajima, Mat-

susaka, and Kobayashi (2004) and Kapoor and Picard (2001) also suggested to

use HMMs for a conversational robot to recognize para-linguistic information

such as head nods and head shakes, which were then used to clarify whether the user's knowledge was correctly transferred to the robot.
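The recognition scheme common to these HMM-based approaches, scoring an observed sequence under each candidate gesture model and picking the most likely one, can be sketched as follows (a toy discrete HMM with invented parameters, not a model from any of the cited works):

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the scaled forward algorithm (pi: initial state probabilities,
    A: transition matrix, B: per-state emission probabilities)."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

# Two toy 2-state gesture models with mirrored emission preferences
pi = np.array([0.9, 0.1])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B1 = np.array([[0.9, 0.1], [0.1, 0.9]])   # model 1: state 0 favours symbol 0
B2 = np.array([[0.1, 0.9], [0.9, 0.1]])   # model 2: reversed preferences
obs = [0, 0, 0, 1, 1, 1]
# Recognition: keep the model with the highest log-likelihood
scores = {"g1": forward_loglik(obs, pi, A, B1),
          "g2": forward_loglik(obs, pi, A, B2)}
print(max(scores, key=scores.get))
```

The same comparison of log-likelihoods underlies the sign-language, hand-posture and head-gesture recognizers cited above, with the discrete emissions replaced by density functions over continuous sensor data.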

1.2.4 Gesture reproduction

In the applications mentioned above, the role of HMMs stopped at the recog-

nition part. Some researchers also focused on gesture reproduction using this

probabilistic model. In Computer Graphics, Brand and Hertzmann (2000) sug-

gested to use HMMs to identify common elements in a motion and synthesize

new motions by combining and blending these elements. A body of work sug-

gested to use an averaging approach to retrieve human motion sequences from

HMMs (Inamura, Kojo, & Inaba, 2006; Inamura et al., 2005; D. Lee & Nakamura, 2006, 2007). Their framework, called the "Mimesis Model", allows human motion to be transferred to a humanoid robot and different motions to be blended.

By considering an HMM as a finite-state automaton where each state is described by a density function modelling a set of observed variables, and where the transitions between the states are described stochastically,10 the reproduction process consists in: (1) using the transition probabilities of the HMM to stochastically generate multiple sequences of states; (2) using for each sequence of states the corresponding density functions to generate multiple trajectories

10 The HMM parameters will be fully described in Section 2.3; we only briefly summarize here the components for a better comprehension of the different approaches adopted.


stochastically; and (3) retrieving a generalized trajectory by averaging over a large number of generated sequences.11 The authors also suggested comparing different models through a distance measure between HMMs based on

Kullback-Leibler divergence (Kullback & Leibler, 1951), which was then used

to combine different HMMs for the reproduction process. The approach was

also extended to the generation of motion by concatenating different HMMs

(Takano, Yamane, & Nakamura, 2007).
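A hedged sketch of this averaging-based retrieval for a toy one-dimensional, left-to-right HMM (all parameters are invented; the actual Mimesis Model implementation differs):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy left-to-right HMM: per-state Gaussian output distributions
means = np.array([0.0, 1.0, 0.5])
stds = np.array([0.1, 0.1, 0.1])
A = np.array([[0.8, 0.2, 0.0],        # left-to-right transition matrix
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])

def sample_trajectory(T=30):
    """Steps (1) and (2): stochastically generate a state sequence,
    then sample one observation per visited state."""
    s = 0
    traj = np.empty(T)
    for t in range(T):
        traj[t] = rng.normal(means[s], stds[s])
        s = rng.choice(3, p=A[s])
    return traj

# Step (3): average a large number of sampled trajectories
generalized = np.mean([sample_trajectory() for _ in range(500)], axis=0)
print(generalized[:5])
```

Averaging many stochastically generated trajectories is precisely what makes this retrieval process time-consuming, a drawback that motivated the analytical alternatives explored later in this thesis.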

Another approach proposed originally for speech recognition and speech syn-

thesis is to augment the observation vectors of the HMM by the differences

of nearby observations, called δ features, where the relationships between the

static and dynamic features are explicit (Furui, 1986). Originally, these fea-

tures are however considered as being independent when modeled by the HMM,

which allowed inconsistency between static and dynamic features. Recently,

Zen, Tokuda, and Kitamura (2007) re-formulated the problem by imposing ex-

plicit relationships between the static and dynamic features and proposed a

model called trajectory-HMM for speech recognition and speech synthesis which

is statistically correct, and can be trained by dedicated methods proposed by the

authors. Although this approach has only been applied in a speech recognition and synthesis context, it is still of interest for the RbD paradigm.

Another approach investigated by Mayer et al. (2007) is to use principles

known from fluid dynamics to reproduce a skill at a trajectory level in a RbD

context. Their approach first considered multiple demonstrations of a skill as a basis for the reproduction and then generalized the skill to different contexts (different initial positions or the presence of new obstacles) by computing new trajectories through simulation. The generalization is performed by considering the trajectory as a fluid which adapts itself to new situations by bypassing obstacles and smoothing out discontinuities. The system was tested on a knot-tying task in surgery. The main drawback of the approach is that

the generalization and reproduction process is very long (more than 10 minutes),

and, thus, cannot be used for an interactive teaching process in HRI.

The first approach adopted during this PhD followed the recognition and

reproduction approach of the "Mimesis Model", where the HMM is used for both recognition and retrieval processes within the same probabilistic model. However, the averaging approach was not satisfactory with respect to the smoothness of the retrieved trajectories, i.e., these trajectories could not be used directly as a controller for our robot. The first direction of research adopted was thus to complement the approach by eliminating the time-consuming stochastic process used to retrieve trajectories. In a first phase, we explored different analytical ways of retrieving generalized trajectories from an HMM, and finally moved toward the joint use of GMM/GMR, which is better suited for the extraction of the task constraints and the reproduction of natural human motion.

11 Note that this approach will be discussed further in Section 2.4.2.


1.2.5 Human-Robot Interaction (HRI)

Another perspective adopted for an efficient transfer of skills is to focus on the

interaction aspect of the transfer process. As this transfer problem is complex

and involves a combination of social mechanisms, several directions in Human-

Robot Interaction (HRI) were explored. Indeed, when considering humanoid robots, the shape of the robot and the humanlike properties of the teaching system have an interesting impact on the interaction. MacDorman and Cowley (2006) argued that a humanoid robot is the platform best equipped for reciprocal relationships with a human being, and showed how this helps the teacher consider the robot as a peer. Kahn, Ishiguro, Friedman, and Kanda (2006) explored which components of the interaction are required to efficiently learn new skills through the teaching interaction. Salter, Dautenhahn, and Boekhorst (2006)

explored the possible interactions through speech, gestures, imitation, obser-

vational learning, instructional scaffolding, as well as physical interaction such

as molding or kinesthetic teaching. B. Lee, Hope, and Witts (2006) explored

the use of behavioral synchronization as a powerful communicative experience

between a human user and a robot, and showed through experiments that in

human dyadic interactions, close relationship partners are more engaged in joint attention to objects in the environment and to their respective body parts than

strangers. Thomaz, Berlin, and Breazeal (2005) presented HRI experiments in

which the user can guide the robot’s understanding of the objects in its environ-

ment through facial expression, vocal analysis, detection of shared attention and

by using an affective memory system. Rohlfing, Fritsch, Wrede, and Jungmann

(2006) highlighted the importance of having multimodal cues to reduce the com-

plexity of human-robot skill transfer and considered multimodal information as

an essential element to structure the demonstrated tasks.

To speed up learning, a body of work concentrated on the use of priors as part

of the imitation learning process. Indeed, several hints can be used to transfer a

skill not only by demonstrating the task multiple times but also by highlighting

the important components of the skill, which can be achieved by various means

and by using different modalities. A large body of work explored the use of

natural pointing and gazing cues to convey the intention of the user (Scassellati,

1999; Kozima & Yano, 2001; Ishiguro, Ono, Imai, & Kanda, 2003; Nickel &

Stiefelhagen, 2003; Ito & Tani, 2004; Hafner & Kaplan, 2005; Dillmann, 2004;

Breazeal, Buchsbaum, Gray, Gatenby, & Blumberg, 2005). Speech recognition

used in a RbD context was explored in various applications as well (Dominey et

al., 2005; Dillmann, 2004; Breazeal et al., 2005). Riley, Ude, Atkeson, and Cheng

(2006) highlighted the importance of an active participation of the teacher not

only to demonstrate a model of expert behaviour but also to refine the acquired

motion by vocal feedback. More generically, without looking at the linguistic

content of speech, the analysis of vocal prosody was also explored in order to

provide useful information on the user's communicative intent (Thomaz et al., 2005; Breazeal & Aryananda, 2002).

[Figure 1.1: two panels, "Constraints after the first demonstration" and "Constraints after the second demonstration", each showing the workspace (ξ1, ξ2) with objects A and B, together with a histogram of the probability P of the qualitative events Right A, Left A, Right B and Left B.]

Figure 1.1: Extraction of task constraints at a symbolic level. The initial and final positions of the ball are respectively represented in dashed and solid lines.

In this thesis, we adopt the perspective that the transfer of skills can take advantage of several psychological factors that the user may encounter during the teaching interaction. Similarly to Yoshikawa, Shinozawa, Ishiguro, Hagita, and Miyamoto (2006), we also consider the role of the teacher as one of the key components for an efficient transfer of the skill. The teaching interaction is then designed to let the user become an active participant in the

learning process (and not only a model of expert behaviour). By taking inspi-

ration from the human tutelage paradigm, Breazeal et al. (2004) showed that a

socially guided approach could improve both the human-robot interaction and

the machine learning process by taking into account human benevolence. In

their work, they highlighted the role of the teacher in organizing the skill into

manageable steps and maintaining an accurate mental model of the learner’s un-

derstanding. However, the tasks considered were mainly built on discrete events

organized hierarchically. Our work shares similarities with theirs in terms of the

tutelage paradigm, but we focus on learning continuous motion trajectories and

actions on objects at a trajectory level instead of considering discrete events.

Saunders et al. (2006) provided experiments where a wheeled robot is tele-

operated through a screen interface to simulate a moulding process, that is, by

letting the robot experience sensory information when exploring its environment

through the teacher’s support. Their model used a memory-based approach in

which the user provides labels for the different components of the task to teach

hierarchically high-level behaviors. Our work shares similar ideas, but follows

a model-based approach. The drawback of using teleoperation is avoided by letting the user directly move the robot's arms while the robot records sensory information through its motor encoders.

[Figure 1.2: three panels showing the same workspace (ξ1, ξ2) with objects A and B, with the task constraints represented at three levels: based on a high-level description (symbolic constraints), based on the end-position (static constraints), and based on the trajectory (continuous constraints).]

Figure 1.2: Extraction of task constraints by using different levels of representation of the constraints. The initial and final positions of the ball are respectively represented in dashed and solid lines.

1.3 Illustration of our RbD approach

We first present an example to motivate the continuous approach adopted

throughout this thesis to extract task constraints within a probabilistic frame-

work.

An illustration of the extraction of constraints through statistics is first pre-

sented in Figures 1.1 and 1.2. Let us imagine that the aim of the task is to

place the ball on the right-hand side of object B (object A is not relevant for

the task). In Figure 1.1, a qualitative description of the ball’s position relative

to the two objects A and B is used to describe the task (description of the task

at a symbolic level). "Right A" and "Left A" correspond to a final position of the ball respectively on the right-hand side and on the left-hand side of object A. Similarly, "Right B" and "Left B" correspond to a final position of the ball respectively on the right-hand side and on the left-hand side of object B. A histogram plot shows the extraction of constraints through discrete statistics. After the first demonstration, the ball is positioned on the right-hand side of the two objects. The probabilities of observing these two constraints are then equal (P = 0.5). After the second demonstration, performed in a new initial condition, the ball is moved to the left-hand side of object A and to the right-hand side of object B. As the observation relative to object B remains the same, the probability that this qualitative feature is an important constraint becomes higher than for the other events. After two demonstrations, the principal constraint of the task

detected is thus to place the ball on the right-hand side of object B. This con-

straint is extracted from the multiple observations of the same task performed

in different initial configurations.
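The discrete statistics of Figure 1.1 can be reproduced in a few lines (the event labels and the normalization over all observed events are illustrative choices):

```python
from collections import Counter

# Qualitative features observed at the end of each demonstration
demo1 = ["right_A", "right_B"]   # ball ends right of A and right of B
demo2 = ["left_A", "right_B"]    # new initial condition: left of A, right of B

counts = Counter(demo1 + demo2)
total = sum(counts.values())
events = ["right_A", "left_A", "right_B", "left_B"]
# Normalized histogram over the possible qualitative events
probs = {e: counts[e] / total for e in events}
print(probs)
```

After the two demonstrations, "right_B" is the only event observed every time, so its probability dominates the histogram, mirroring the extraction of the principal constraint described above.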

Figure 1.2 shows the extraction of the task constraints by using different

levels of representation. When the constraints are based solely on a symbolic

description, the system can fail at reproducing a coherent behaviour because

there is no information about the distance of the ball with respect to object

B in the final position (first row). For example, is we consider that object B

represents a plate and that the ball represents a knife, it would be awkward

to place the knife one meter away from the plate. Even if the knife is on the

right-hand side of the plate, the task would not be qualitatively said to be

reproduced correctly. We see through this example that the simple left-hand

side and right-hand side rules defined to extract the symbolic features of the

task are not appropriate here for a generic reproduction of the skill.

In this example, a constraint based on the final relative position of the objects would be more representative (Figure 1.2, second row). By defining a mean position of the ball at the end of the motion and a covariance matrix defining the variations allowed around this position, we partly resolve the problem induced by the symbolic description of the task. The mean µ is now

used to describe the final relative position of the ball with respect to object B.

The covariance matrix Σ (represented by an ellipse) describes a permissible area


where the object should be placed, defined by the distribution probabilities over

a given threshold (the contour of the ellipse represents one standard deviation).
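A minimal sketch of such a static constraint check (the final positions below are invented toy data; the one-standard-deviation contour corresponds to a unit Mahalanobis distance):

```python
import numpy as np

# Final ball positions relative to object B across demonstrations (toy 2-D data)
finals = np.array([[0.10, 0.02], [0.12, -0.01], [0.09, 0.00], [0.11, 0.03]])
mu = finals.mean(axis=0)                 # mean final position
sigma = np.cov(finals, rowvar=False)     # covariance (the ellipse)

def within_one_std(x, mu, sigma):
    """True if x lies inside the one-standard-deviation ellipse,
    i.e. its squared Mahalanobis distance to the mean is below 1."""
    d2 = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)
    return d2 <= 1.0

print(within_one_std(mu, mu, sigma))                     # at the mean
print(within_one_std(np.array([1.0, 0.0]), mu, sigma))   # one meter away
```

With this representation, a knife placed one meter from the plate falls far outside the permissible ellipse and is rejected, even though it would still satisfy the symbolic "right of B" rule.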

However, with this static representation of the constraints, there is still no

information on the way the object is moved, which can be relevant for the

task as well. For example, we see in the two demonstrations that the motion follows a bell-shaped trajectory in order to displace the ball without hitting the other objects. This particularity of the task can be captured by defining

the constraints at a trajectory level (Figure 1.2, third row). By extending the

representation defined in the second row to each time step in the movement, we

can then define a temporally continuous representation of the task constraints,

defined by a smoothed averaged trajectory and associated covariance matrices,

representing at each time step a mean position of the ball and the variation

allowed around this position.
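Extending these statistics to every time step can be sketched as follows (synthetic bell-shaped demonstrations; the thesis uses a GMM/GMR encoding rather than this raw frame-by-frame estimate):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100
t = np.linspace(0, 1, T)

# Several demonstrations of a bell-shaped 2-D hand path, with noise
demos = np.stack([
    np.column_stack([t, np.sin(np.pi * t)]) + 0.02 * rng.standard_normal((T, 2))
    for _ in range(5)
])

# Continuous constraint: mean position and covariance at every time step
mean_traj = demos.mean(axis=0)                                             # (T, 2)
covs = np.stack([np.cov(demos[:, k, :], rowvar=False) for k in range(T)])  # (T, 2, 2)
print(mean_traj.shape, covs.shape)
```

The sequence of covariance matrices plays the role of the time-varying ellipses in the third row of Figure 1.2: wherever the demonstrations agree, the ellipse shrinks and the constraint is strong; wherever they vary, the ellipse widens and the reproduction is left more freedom.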

To do so, a generic methodology is proposed that encodes temporally continuous gestures in a probabilistic representation, where the data are encoded in a Gaussian Mixture Model (GMM) or Hidden Markov Model (HMM), which provides a generic framework to learn a model of the skill autonomously and incrementally. Having a probabilistic model of a skill makes it possible: (1) to provide a compact and efficient representation of high-dimensional data; (2) to classify existing skills and detect new skills; (3) to evaluate a reproduction attempt; (4) to extract the constraints of the skill; and (5) to reproduce the learned skill in a different situation by generalizing over multiple observations. Specifically, we suggest extracting the constraints of the task in a probabilistic and continuous form by using Gaussian Mixture Regression (GMR), which is used to define a metric of imitation performance and to determine an optimal controller for the robot.
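The conditioning operation underlying GMR can be sketched for a one-dimensional output (a simplified illustration with invented parameters; the framework detailed in Chapter 2 handles the full multivariate case):

```python
import numpy as np

def gmr(t_query, priors, means, covs):
    """Gaussian Mixture Regression: condition a joint GMM over (t, x)
    on time t to retrieve the expected position and its variance.
    means[k] = [mu_t, mu_x]; covs[k] is the 2x2 joint covariance."""
    K = len(priors)
    h, xk, sk = np.empty(K), np.empty(K), np.empty(K)
    for k in range(K):
        mt, mx = means[k]
        stt, stx, sxx = covs[k][0, 0], covs[k][0, 1], covs[k][1, 1]
        # Responsibility of component k for this time step
        h[k] = priors[k] * np.exp(-0.5 * (t_query - mt) ** 2 / stt) \
            / np.sqrt(2 * np.pi * stt)
        # Conditional mean and variance of x given t for component k
        xk[k] = mx + stx / stt * (t_query - mt)
        sk[k] = sxx - stx ** 2 / stt
    h /= h.sum()
    x_hat = np.sum(h * xk)
    # Law of total variance across the mixture components
    s_hat = np.sum(h * (sk + xk ** 2)) - x_hat ** 2
    return x_hat, s_hat

# Toy 2-component model along a 1-D motion
priors = [0.5, 0.5]
means = [np.array([0.25, 0.0]), np.array([0.75, 1.0])]
covs = [np.array([[0.02, 0.0], [0.0, 0.01]]) for _ in range(2)]
print(gmr(0.25, priors, means, covs))
```

Evaluating this conditional distribution at each time step yields exactly a smooth mean trajectory with an associated time-varying variance, i.e., a continuous representation of the task constraints that a controller can follow.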

The term constraint is used throughout this thesis to define the behaviour

of a set of data collected through multiple demonstrations of a skill. The con-

straints will be considered at a trajectory level by computing a local estimation

of the variations and correlations allowed across the different variables repre-

senting the task. Thus, the constraints (or local behaviour of the trajectories)

are represented by varying covariance matrices evaluated at each time step along

the movement. This process aims at finding a flexible controller to reproduce

the task in different situations, so that the system can retrieve smooth gener-

alized joint angle trajectories that can be combined and run on the robot in

situations that have not been explicitly demonstrated.

1.4 Contributions of the thesis

This thesis focuses on the generic questions addressed by the what-to-imitate

and how-to-imitate issues, which are respectively concerned with the problem of extracting the essential features of a task and with determining a way to reproduce these essential features under different conditions (Nehaniv & Dautenhahn, 2000). The perspective adopted is that multiple demonstrations of a


similar task can help the robot understand the key aspects of the skill. In

other words, when a skill shares similar components across multiple demon-

strations, it might suggest that these components are essential features that

should be reproduced by the robot. Based on this observation, we suggest using well-established probabilistic methods to analyze a task at a trajectory level,

using Hidden Markov Model (HMM) as a first approach, and moving on to the

joint use of Gaussian Mixture Model (GMM) and Gaussian Mixture Regression

(GMR).

The different models used in this thesis are not new and have already been

employed in different fields of research, but their use in RbD requires adapting the algorithms to the constraints inherent in the specificities of the paradigm, and testing practically whether these models can be used and combined efficiently in a Learning by Imitation scenario. Moreover, previous practical uses of the probabilistic models considered here focused essentially on the recognition aspect, leaving out the reproduction capabilities of the models. The use of these models in Learning by Imitation is thus challenging and has not been extensively explored.

The contributions of this thesis are threefold: (1) it contributes to RbD by

proposing a generic probabilistic framework to deal with recognition, general-

ization, reproduction and evaluation issues, and more specifically to deal with

the automatic extraction of task constraints and with the determination of a

controller satisfying several constraints simultaneously to reproduce the skill in

a different context; (2) it contributes to Human-Robot Interaction (HRI) by proposing active teaching methods that put the human teacher "in the loop" of the robot's learning, using an incremental scaffolding process and different modalities to perform the demonstrations; (3) it finally contributes

to robotics and HRI through various real-world applications of the framework

showing learning and reproduction of communicative gestures and manipulation

skills. The generality of the proposed framework is also demonstrated through

the joint use of the proposed framework with other learning approaches in col-

laborative experiments.

By giving other researchers access to the source codes (see Appendix C) and by using commercial products for the robots and sensory hardware, this

work also has two additional benefits: (1) the applications presented throughout

the thesis are reproducible by other researchers and new applications can be

built upon them; (2) it demonstrates the generality of the learning framework

by showing that the hardware components can be changed easily when a new

technology (or a new version of a product) is available, without having to apply

major changes to the framework.


1.5 Organization of the thesis

This thesis is decomposed into four main parts. Chapter 2 presents the theoret-

ical and technical characteristics of the probabilistic model. It also presents the

extensions and adaptations of the algorithms to be used in a RbD framework.

Chapter 3 presents the experimental setup using two versions of a humanoid

robot, a stereoscopic vision system and motion sensors. Chapter 4 presents

preliminary experiments for testing and optimizing the parameters of the algo-

rithms, with the principal aim of motivating their further use in the RbD applications presented in the following chapter. Chapter 5 then presents the core

of the experiments for the extraction of constraints, combination of constraints

and reproduction of manipulation skills in different situations. Discussions on

the capabilities and limitations of the model are discussed in Chapter 6, where

collaborative work and directions for further work are also presented. Finally,

Chapter 7 concludes the thesis and summarizes the results achieved.

Supplementary information concerning the list of publications of the author

related to this thesis (with descriptions) can be found in Appendix B. The

Matlab source codes of the proposed framework are listed in Appendix C.


2 System architecture

2.1 Outline of the chapter

This chapter presents the different models and algorithms used in the proposed

RbD probabilistic framework, where the sections are ordered with respect to the

different issues tackled by these models. We first present the encoding issue,

the extraction of constraints, the reproduction issue and the evaluation of a

reproduction attempt. We then discuss the issues concerning the learning of the

parameters, the pre-processing steps and the regularization of the parameters.

Note that the different algorithms are not presented in their processing order,

but in the order in which their introduction is required for the understanding of

the complete system. When different methods are proposed to tackle a similar

issue, the different solutions are presented by following an historical order, with

the aim of clarifying how we get to these different solutions when tackling the

concerned issue. Note that most of the models and algorithms presented here

are not new, and that the principal challenge relies on using these algorithms in

a RbD framework, which implies to adapt and combine the different methods

to build a probabilistic system allowing to teach new skills to a humanoid robot

and to re-use the learned skill in different situations (with a skill defined at a

trajectory level).

The chapter is organized as follows:

• Sections 2.2 and 2.3 first present the use of Gaussian Mixture Model

(GMM) and Hidden Markov Model (HMM) to encode a set of continu-

ous movements.

• Sections 2.4 and 2.5 discuss the reproduction issue using respectively an

HMM and a GMM representation.

• Section 2.6 presents the process used to reproduce trajectories when con-

sidering multiple constraints.

• Section 2.7 discusses the recognition and classification issues tackled by

GMM and HMM (Sections 2.7.1 and 2.7.2). The evaluation of a reproduc-

tion attempt and the definition of a metric of imitation are also discussed

here.

• Section 2.8 discusses the learning issue (the GMM and HMM parameters in the previous sections were assumed to be known). Batch


learning of the GMM and HMM parameters is first presented (Sections

2.8.1 and 2.8.2). Two incremental learning strategies are then proposed

in Section 2.8.3 to learn the GMM parameters without keeping historical

data in memory.

• Section 2.9 introduces the different pre-processing steps by first showing

how to reduce the dimensionality of the data by projecting the trajecto-

ries in a latent space of motion. We take the perspective that a linear

decomposition of the data is sufficient and propose to use either Principal

Component Analysis (PCA) (Section 2.9.1), Canonical Correlation Anal-

ysis (CCA) (Section 2.9.2) or Independent Component Analysis (ICA)

(Section 2.9.3) to find a latent space of motion.

• Section 2.10 discusses the selection of the optimal number of components

to represent the trajectories, and also discusses how to initialize the GMM

parameters. The most common strategy based on Bayesian Information

Criterion (BIC) (Section 2.10.1) is first presented. An alternative solu-

tion based on a curvature-based segmentation process is then introduced

(Section 2.10.2).

• Section 2.11 presents different ways of regularizing the GMM parame-

ters. We first show that by bounding the covariance matrices, the learned

GMM can retrieve smoother trajectories (Section 2.11.1). We then show

the importance of the temporal scaling depending on the GMM initial

parameters (Section 2.11.2).

• Section 2.12 suggests to use Dynamic Time Warping (DTW) as a pre-

processing step for GMM to deal with the problem of temporal distortions

across multiple demonstrations.

• Section 2.13 discusses the use of prior information in the probabilistic

framework proposed.

• Section 2.14 finally discusses the possible extensions of the model to mix-

ture models of different density distributions. An example of binary signals

encoded in Bernoulli Mixture Model (BMM) is presented in Section 2.14.1.

For an easier comprehension of the different issues tackled by the proposed

RbD framework, the different algorithms are illustrated throughout this chapter

by using a simple dataset of mostly one-dimensional trajectories.1 Real-world

applications using human motion data processed by the proposed models will

be presented further in Chapter 5.

1Note that a corresponding set of Matlab codes written by the author is available online for the principal algorithms presented here (see Appendix C).


2.2 Encoding through Gaussian Mixture

Model (GMM)

The theoretical analysis of GMMs and the development of related learning algorithms have been widely studied in machine learning (McLachlan & Peel,

2000; Vlassis & Likas, 2002; Dasgupta & Schulman, 2000). In our framework,

GMMs are used to encode the temporal and/or spatial components of contin-

uous trajectories. Let us assume that we have observed n demonstrations of a

skill, consisting of n observations of sensory data changing through time. We

assume that each demonstration can be rescaled to a fixed number of observations T = 100. The total number of observations is then given by N = nT. We form a dataset ξ = {ξj}_{j=1}^N from these N observations ξj ∈ R^D of dimensionality D. By using the subscripts t and s to denote temporal and spatial variables, each datapoint ξj = {ξt, ξs} consists of a temporal value ξt ∈ R and a spatial vector ξs ∈ R^(D−1) of dimensionality (D − 1). We assume first that the

trajectories can represent any sensory data changing through time (e.g., joint

angle trajectories or hand paths), and that the dataset can be represented in

any frame of reference (e.g., data projected in a latent space of motion). The

dataset ξ is modelled by a mixture of K components, defined by the probability

density function

p(ξj) = ∑_{k=1}^{K} p(k) p(ξj|k),   (2.1)

where p(k) is a prior and p(ξj |k) is a conditional probability density function.

For a mixture of K Gaussian distributions of dimensionality D, the parameters

in Equation (2.1) are defined by

p(k) = πk,
p(ξj|k) = N(ξj; µk, Σk)   (2.2)
        = 1/√((2π)^D |Σk|) exp(−½ (ξj − µk)ᵀ Σk⁻¹ (ξj − µk)).

The GMM parameters are then described by {πk, µk, Σk}_{k=1}^K, representing respectively the prior probabilities, mean vectors and covariance matrices.

Throughout this work, the notation N (µ,Σ) will be used to define a Gaussian

distribution of center µ and covariance matrix Σ, and N (ξ;µ,Σ) will be used

to define the probability of a data point ξ with respect to this distribution.
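As a minimal illustration of Equations (2.1)-(2.2), the mixture density can be evaluated with NumPy. This is a sketch, not the author's original Matlab code, and all parameter values below are invented placeholders:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """N(x; mu, sigma) for a D-dimensional Gaussian, Equation (2.2)."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

def gmm_pdf(x, priors, means, covs):
    """p(x) = sum_k pi_k N(x; mu_k, Sigma_k), Equation (2.1)."""
    return sum(pi * gaussian_pdf(x, mu, s)
               for pi, mu, s in zip(priors, means, covs))

# Hypothetical K=2 model in D=2 (one temporal, one spatial dimension)
priors = [0.5, 0.5]
means = [np.array([25.0, 0.05]), np.array([75.0, -0.02])]
covs = [np.eye(2) * 10.0, np.eye(2) * 10.0]
p = gmm_pdf(np.array([30.0, 0.04]), priors, means, covs)
```

The density is naturally highest near the component centers, which is what the learning algorithms of Section 2.8 exploit.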

2.2.1 Illustrative example

Assuming that the GMM parameters have already been estimated, Figure 2.1

illustrates the encoding of a dataset through GMM. The processes used to learn

the model parameters will be described later in Section 2.8.1.


Figure 2.1: Illustration of a dataset of D = 2 dimensions encoded in a GMM of K = 3 components, where the Gaussian distributions are used to model the joint density p(ξt, ξs). The dataset consists of n = 4 demonstrations of trajectories, where each trajectory is rescaled to a set of T = 100 datapoints defined by a temporal value ξt and a spatial value ξs.

2.3 Encoding through Hidden Markov Model

(HMM)

Similarly to GMM, Hidden Markov Model (HMM), in its continuous form, uses

a mixture of multivariate Gaussians to describe the distribution of the data

(Rabiner, 1989). The main difference is that HMM also has the ability to

encapsulate transition probabilities between different sets of Gaussian distribu-

tions, offering a method to describe probabilistically the temporal variations in

the dataset. HMM can thus be seen as a doubly stochastic process, described

by transition probabilities, output distributions and initial states distribution.

Throughout this thesis, we will only consider the case where each state is de-

scribed by a single multivariate Gaussian distribution {µk, Σk}. Thus, by draw-

ing a parallel with a GMM of K components, an HMM can be described by

considering each component k of the GMM as an output distribution for the

corresponding state k in the HMM. Usually, only the spatial components are

considered, since HMM deals with the temporal alignment of the spatial com-

ponents using these additional transition parameters. In the previous section,

the Gaussian distributions {µ, Σ} of the GMM considered were encapsulating both temporal and spatial values

    ( µt )        ( Σtt  Σts )
    ( µs ) ,      ( Σst  Σss ) .

When using HMM, the corresponding Gaussian distribution describing each observation output is defined by keeping only the spatial components {µs, Σss}. We define ϖk as the initial state probability of state k, and ai,j as the transition probability to switch from state i to state j. The complete set of parameters used to describe an HMM is then defined by {ϖ, a, µs, Σss}.

HMMs have been used successfully in various fields of research such as speech

recognition (Rabiner, 1989), handwriting recognition (Gilloux, 1994), or vision


Figure 2.2: Illustration of a 2-dimensional HMM of 5 states used to model the density p(ξs,1, ξs,2). The dataset consists of three demonstrations of trajectories, where each trajectory is rescaled to a set of T = 100 datapoints defined by two spatial values ξs,1 and ξs,2. First row: Representation of the data in output state distributions. Second row: Topology of the model and transition probabilities between the different states.

processing (Starner & Weaver, 1998). As seen in the previous chapter, HMMs have

also been applied successfully to recognize body motion trajectories (C. Lee &

Xu, 1996; Nam & Wohn, 1996), or to encode a RbD skill in a compact represen-

tation (Yang et al., 1997). HMM is an appealing model in RbD because of its

capacity to model locally the properties of multidimensional human motion and

because of its probabilistic representation, which allows it to deal with real-world noisy

data. Indeed, the spatial variability of the data is encapsulated probabilistically

through the Gaussian distributions of the observations, while the temporal vari-

ability is encapsulated through the transition probabilities between the states.

Compared to GMM, HMM can also handle non-homogeneous temporal deformations of the trajectories (and thus acts as a method to temporally align different trajectories).

2.3.1 Illustrative example

Figure 2.2 and Table 2.1 present an example of HMM encoding of 3 demon-

strations of a 2-dimensional motion. Note the difference with Figure 2.1, where

both temporal and spatial values {ξt, ξs} are encoded in the GMM. Here, as

the temporal sequence of data is represented probabilistically, only the spatial

components ξs are usually represented by the Gaussian distributions character-

izing the output distributions of the states in the HMM. The model considered

is fully-connected (each state is connected to each other state), with an initial-

ization of the parameters biased toward a left-right topology (Figure 2.2). This


Table 2.1: Transition probabilities ai,j and initial state probabilities ϖk for a fully-connected HMM encoding continuous data (see also Figure 2.2).

Transition probabilities ai,j
 i\j    1     2     3     4     5
  1   0.94  0.05  0.00  0.00  0.00
  2   0.00  0.93  0.07  0.00  0.00
  3   0.00  0.00  0.96  0.04  0.00
  4   0.00  0.00  0.00  0.95  0.05
  5   0.00  0.00  0.00  0.00  1.00

Initial state probabilities ϖk
   1     2     3     4     5
 1.00  0.00  0.00  0.00  0.00

way, we remain generic concerning the architecture of the model, but bias the solution toward the one expected from an engineering perspective, that is, to model sequential observations instead of cyclic motions. As the dataset considered shows neither periodic nor recurrent behaviour, the parameters converge

to a left-right topology, i.e., some of the transition parameters converge to zero

(learning of the parameters will be described further in Section 2.8.2). Table

2.1 shows the transition probabilities of the model, with non-zero values for

the recurrent probabilities (probability to stay in the same state) and for the

transition from state i to state i + 1 (transition to the next closest Gaussian

distribution).
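To make the roles of ϖ and a concrete, the sketch below samples a state sequence from the initial state probabilities and a left-right transition matrix in the spirit of Table 2.1 (the exact values here are illustrative, adjusted so each row sums to one):

```python
import numpy as np

def sample_state_sequence(init_probs, trans, T, seed=0):
    """Sample a state sequence q_1..q_T from (init_probs, trans)."""
    rng = np.random.default_rng(seed)
    q = [rng.choice(len(init_probs), p=init_probs)]
    for _ in range(T - 1):
        q.append(rng.choice(len(trans), p=trans[q[-1]]))  # transit from q[-1]
    return q

# Illustrative left-right transition matrix (rows sum to 1)
a = np.array([[0.94, 0.05, 0.01, 0.00, 0.00],
              [0.00, 0.93, 0.07, 0.00, 0.00],
              [0.00, 0.00, 0.96, 0.04, 0.00],
              [0.00, 0.00, 0.00, 0.95, 0.05],
              [0.00, 0.00, 0.00, 0.00, 1.00]])
pi0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
seq = sample_state_sequence(pi0, a, T=100)
```

Because the transition probabilities below the diagonal are zero, every sampled sequence visits the states in non-decreasing order, which is the sequential behaviour described above.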

2.4 Reproduction and extraction of constraints

through Hidden Markov Model (HMM)

We now consider the problem of retrieving a smooth trajectory from an HMM,

which is a generalized version of the different demonstrated trajectories encapsu-

lated in the model. This is not an easy task due to the double stochastic process

involved by the HMM representation. We start this section by describing how

to retrieve a sequence of states from an observed trajectory, which will be useful

for the further retrieval process. Then, different approaches considered in our

work and other work will be presented and discussed.

2.4.1 Retrieval of the underlying states sequences

corresponding to observed trajectories

We first present how to retrieve the underlying sequence of states corresponding

to an observed trajectory, using the Viterbi algorithm originally proposed in


Viterbi (1967).2 By using the notation {q1, q2, . . . , qT} and {ξ1, ξ2, . . . , ξT} to describe a sequence of states and a sequence of observations of length T (e.g., a 3D Cartesian path),3 we define the variable δk,t as the highest probability along a single sequence of states at time t which accounts for the first t observations and ends in state k, i.e., δk,t = max p(q1, q2, . . . , qt = k, ξ1, ξ2, . . . , ξt), which can be computed by induction. To keep track of the argument which maximizes δk,t for each time t and state k, we also define the variable εk,t. The optimal state sequence is denoted {q̂1, q̂2, . . . , q̂T}. Knowing the parameters {ϖ, a, µ, Σ} of the HMM, defining respectively the initial state probabilities, the transition probabilities, the centers and the covariances of the Gaussian distributions, δk,t is computed by induction as follows.

Initialization:

δk,1 = ϖk N(ξ1; µk, Σk)   ∀k ∈ {1, . . . , K},
εk,1 = 0   ∀k ∈ {1, . . . , K}.

Induction:

δk,t = max_{i∈{1,...,K}} (δi,t−1 ai,k) N(ξt; µk, Σk)   ∀k ∈ {1, . . . , K}, ∀t ∈ {2, . . . , T},
εk,t = arg max_{i∈{1,...,K}} (δi,t−1 ai,k)   ∀k ∈ {1, . . . , K}, ∀t ∈ {2, . . . , T}.

Termination:

q̂T = arg max_{i∈{1,...,K}} (δi,T).

State sequence backtracking:

q̂t = εq̂t+1,t+1   ∀t ∈ {T − 1, T − 2, . . . , 1}.

The underlying sequence of states corresponding to the observed trajectory ξ is then defined by q̂.
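The induction above can be sketched in the log domain as follows. This is a NumPy sketch assuming one Gaussian output distribution per state, not the thesis implementation; the toy observations are invented:

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """log N(x; mu, sigma) for a D-dimensional observation x."""
    d = x - mu
    D = len(mu)
    return -0.5 * (D * np.log(2 * np.pi) + np.log(np.linalg.det(sigma))
                   + d @ np.linalg.inv(sigma) @ d)

def viterbi(obs, init, trans, means, covs):
    """Most likely state sequence for a Gaussian-emission HMM."""
    T, K = len(obs), len(init)
    log_a = np.log(trans + 1e-300)
    delta = np.zeros((T, K))            # delta[t, k]: best log-prob ending in k
    eps = np.zeros((T, K), dtype=int)   # eps[t, k]: argmax backpointers
    delta[0] = np.log(init + 1e-300) + [log_gauss(obs[0], means[k], covs[k])
                                        for k in range(K)]  # initialization
    for t in range(1, T):               # induction
        for k in range(K):
            scores = delta[t - 1] + log_a[:, k]
            eps[t, k] = np.argmax(scores)
            delta[t, k] = scores[eps[t, k]] + log_gauss(obs[t], means[k], covs[k])
    q = np.zeros(T, dtype=int)
    q[-1] = np.argmax(delta[-1])        # termination
    for t in range(T - 2, -1, -1):      # backtracking
        q[t] = eps[t + 1, q[t + 1]]
    return q

# Toy example: two well-separated states, left-right transitions
obs = np.array([[0.0], [0.1], [4.9], [5.0]])
means = [np.array([0.0]), np.array([5.0])]
covs = [np.eye(1), np.eye(1)]
q = viterbi(obs, np.array([1.0, 0.0]),
            np.array([[0.9, 0.1], [0.0, 1.0]]), means, covs)  # -> [0, 0, 1, 1]
```

Working in the log domain avoids numerical underflow for long observation sequences, which is why log probabilities are preferred in practice.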

2.4.2 Reproduction of trajectories using an averaging

approach

As seen in the introduction, a body of work suggested using an averaging approach to retrieve human motion sequences from HMM (Inamura et al., 2006, 2005; D. Lee & Nakamura, 2006). The approach first uses the initial state distribution and transition probabilities to stochastically generate multiple sequences

of states. For each sequence of states, the corresponding output distributions

defined by the Gaussian distributions are used to generate multiple trajectories

stochastically. Generalized trajectories are then retrieved by averaging over a

large number of generated sequences. The main disadvantage of this approach

is that it is computationally expensive and that smoothness of the results can

2See also Rabiner (1989) for its particular use in an HMM framework.
3We deliberately omit here the subscript s defining spatial variables, for clarity of the further notation.


Figure 2.3: Encoding of raw data using HMM and retrieval using an averaging approach (using 100 generated trajectories). We see that the retrieved trajectories lack smoothness due to the stochastic nature of the process.

not be guaranteed, due to the stochastic nature of the retrieval process. More-

over, to be viable, reproduction usually requires a higher number of states in

the HMM than for recognition purposes. Another side-effect of the averaging

process is that it tends to cut off and smooth the local minima and maxima of

the signals, which can be essential to reproduce human gestures.
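A sketch of this averaging retrieval for a hypothetical 1-D emission model is shown below (the fixed emission noise level is an invented simplification):

```python
import numpy as np

def average_retrieval(init, trans, means, T=50, n_samples=200, emit_std=0.01,
                      seed=0):
    """Average many stochastically generated 1-D trajectories from an HMM.

    For each sample: draw a state sequence from (init, trans), emit one
    Gaussian sample per step, then average over all generated trajectories.
    """
    rng = np.random.default_rng(seed)
    trajs = np.zeros((n_samples, T))
    for s in range(n_samples):
        q = rng.choice(len(init), p=init)
        for t in range(T):
            trajs[s, t] = rng.normal(means[q], emit_std)  # emit from state q
            q = rng.choice(len(trans), p=trans[q])        # stochastic transition
    return trajs.mean(axis=0)  # averaging smooths but also flattens extrema

avg = average_retrieval(np.array([1.0, 0.0]),
                        np.array([[0.9, 0.1], [0.0, 1.0]]),
                        np.array([0.0, 1.0]))
```

The cost of generating and averaging many sample trajectories, and the flattening of extrema visible in the returned mean, are exactly the drawbacks discussed above.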

Illustrative example

Figure 2.3 presents an example of reproduction with HMM using an averaging

approach.

2.4.3 Reproduction of trajectories using keypoints and

interpolation

The severe drawbacks of the averaging approach presented in the previous section led us to explore other approaches, which are presented here.

The first approach that we suggested was to use HMM to encode keypoints

from multiple continuous trajectories, previously extracted in a pre-processing

phase (Calinon & Billard, 2004; Calinon, Guenter, & Billard, 2005). By doing

so, a gesture is considered as a sequence of relevant events in the trajectory,

consisting of inflexion points of the joint angle trajectories. These inflexion

points can be extracted either by considering zero-velocity crossings or a more elaborate segmentation based on a curvature measure, which will be described further in Section 2.10.2. This segmentation process agrees with several studies highlighting the presence of typical correlation patterns between joint angles at distinct phases of a manipulation process (Popovic & Popovic, 2001).

For reproduction, the most likely sequence of states is first retrieved by the

model and a third order spline fit interpolation is used to retrieve a continuous

generalized trajectory. By using piecewise cubic Hermite spline,4 a smooth tra-

jectory can be retrieved that is guaranteed to lie in the robot’s range of motion.

Also referred to as clamped cubic spline, this interpolation function ensured

BIBO stability of the system.5 Thus, under bounded disturbances, the original

signal remained bounded and did not diverge (Sun, 1999), i.e., the trajectory

lies within the range defined by the keypoints. By extracting the keypoints ap-

propriately (in these publications, we considered a basic segmentation of local

extrema), it is possible to retrieve trajectories using only a few HMM states.6

Based on Calinon and Billard (2004), other researchers followed this ap-

proach and improved its performance by suggesting different ways of extract-

ing the keypoints and of interpolating between the keypoints (Asfour, Gyarfas,

Azad, & Dillmann, 2006; Aleotti & Caselli, 2006b, 2005). This approach still

has several drawbacks: (1) it heavily relies on an efficient segmentation of the

trajectories which can be application-dependent and requires fine-tuning of the

segmentation parameters; (2) only the center µ of the distribution is used to

generate the generalized keypoint, which means that the correlation informa-

tion contained in the representation is not used for generating the generalized

trajectory;7 and (3) the extraction of constraints can only be performed at the

level of the keypoints.

Illustrative example

Figure 2.4 presents an example of HMM encoding and retrieval of keypoints,

using the same trajectories as in Figure 2.3. For one of the demonstrations, a

keypoint has been detected near the middle of the trajectory (corresponding to

state 3), because a slight perturbation occurred during this demonstration. In

this example, the overall process is however not perturbed by this single outlier,

but it still shows the weakness of an approach based on keypoints extracted in a

4A cubic Hermite spline is a third-degree spline where each polynomial of the spline is defined in Hermite form, which consists of two control points and two control tangents for each polynomial. A piecewise cubic Hermite spline guarantees monotonicity in each interval defined by two keypoints, still keeping smooth interpolation properties (Fritsch & Carlson, 1980).

5BIBO stands for Bounded-Input Bounded-Output, and is generally used in signal processing and control theory as a stability condition, i.e., if a system is stable then the output will be bounded for every input to the system that is bounded.

6We first suggested using a third order spline fit for Cartesian trajectories and a cosine fit for joint angle trajectories. The aim of the cosine fit was to keep the keypoints as local maxima or minima during the reproduction (corresponding to a cycloidal velocity profile). Piecewise cubic Hermite splines have the main advantage of providing a single interpolation function for the different modalities considered (joint angles and hand paths).

7The covariance information is however used to extract the constraints characterizing each keypoint.


Figure 2.4: Encoding of a sequence of keypoints using an HMM of 5 states and retrieval through interpolation (in thick black line). The original dataset is represented in thin grey line. The last row shows the topology learned by the model, described by transition probabilities.

Table 2.2: Transition probabilities and initial state distribution parameters of a fully-connected HMM encoding keypoints (see also Figure 2.4).

Transition probabilities ai,j
 i\j    1     2     3     4     5
  1   0.00  1.00  0.00  0.00  0.00
  2   0.00  0.00  0.22  0.78  0.00
  3   0.00  0.00  0.00  1.00  0.00
  4   0.00  0.00  0.00  0.00  1.00
  5   0.00  0.00  0.00  0.00  1.00

Initial state probabilities ϖk
   1     2     3     4     5
 1.00  0.00  0.00  0.00  0.00


pre-processing phase. The transition probabilities and initial state distributions

of the HMM are also reported in Table 2.2. As the keypoints reflect important

changes in the trajectories, modeling of these trajectories in HMM by using

these keypoints and retrieving a generalized version of the trajectories through

regression usually requires fewer states than encoding the raw continuous data.

The transition probabilities in Table 2.2 show that the non-zero values for each

state are mostly distributed to a single value (transition to the next keypoint),

except for state 2, because state 3 has not always been detected as a keypoint

for all the demonstrations, i.e., the transitions from state 2 to state 4 are the

ones that are most often observed (Figure 2.4). The most likely sequence of states {1, 2, 4, 5} is then used for the retrieval process.

2.4.4 Reproduction of trajectories using continuous data

and interpolation

Due to the limitations of the encoding approach through keypoints discussed

in the previous section, we then moved toward a more generic use of HMM by

directly encoding the continuous data (Calinon & Billard, 2007c, 2005; Billard,

Calinon, & Guenter, 2006). To do so, we first assume that the reproduction of a

gesture is triggered by the user, i.e., the gesture is reproduced after recognition

of a similar gesture performed by the demonstrator. Using the observed gesture,

an optimal sequence of states is first retrieved from the recognized model us-

ing the Viterbi algorithm presented in Section 2.4.1. Another solution consists

in generating one or multiple sequences of states stochastically by using the

model, and averaging over the sequences, as proposed in D. Lee and Nakamura

(2006). From an observed trajectory of T time steps, a sequence of states of

size T is thus extracted. This sequence of states is then reduced to a subset

of N states with associated durations computed from the number of steps where

a state transits to itself. From this sequence of N states, a set of N keypoints

is retrieved by considering the center µ of the corresponding Gaussian distribu-

tion. Knowing the duration in-between two keypoints and interpolating between

these keypoints, a generalized version of the trajectory is finally reconstructed

using piecewise cubic Hermite polynomial functions. Figure 2.5 illustrates this

reproduction process.
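The compression of the per-timestep state sequence into keypoints can be sketched as follows (placing each keypoint at the middle of its state's run is a simplifying assumption made for this illustration):

```python
def states_to_keypoints(state_seq, centers):
    """Reduce q_1..q_T to one keypoint per run of identical states.

    Keypoint time = middle of the run (a simplifying assumption here);
    keypoint value = center mu of the corresponding Gaussian.
    """
    keypoints = []
    start = 0
    for t in range(1, len(state_seq) + 1):
        # close the current run at a state change or at the end of the sequence
        if t == len(state_seq) or state_seq[t] != state_seq[start]:
            mid = (start + t - 1) / 2.0
            keypoints.append((mid, centers[state_seq[start]]))
            start = t
    return keypoints

# Toy sequence of 6 steps over 3 states, with invented state centers
kps = states_to_keypoints([0, 0, 0, 1, 1, 2], [10.0, 20.0, 30.0])
```

The run lengths give the durations between keypoints, and the resulting (time, center) pairs are what the interpolation step then joins into a continuous trajectory.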

This process has two principal drawbacks: (1) the covariance information is

not used to retrieve a generalized version of the trajectory, which is a severe loss

of information since the continuous trajectories considered have local direction

information directly encapsulated in the Gaussian distribution;8 and (2) the

information contained in the two extremities of the trajectories is lost during

the reproduction process, because only the center of the Gaussian distribution

is considered.

8The first principal component of the covariance matrix Σ usually provides the main direction of the trajectory.


Figure 2.5: Illustration of the reproduction process using an HMM encoding of the original continuous data. First row: Dataset used to train the HMM (grey solid line) and retrieval of the most likely sequence of states, where the output distributions are represented as boxes depicting means and variances. Second row: Extraction of keypoints from the sequence of states (points), and spline fit interpolation (black solid line). Third row: Retrieval of a generalized version of the data by resizing the interpolated trajectory in time.


Figure 2.6: Encoding of continuous data (thin grey line) and retrieval through interpolation (thick black line) using an HMM of 5 states.

Illustrative examples

Figures 2.6 and 2.7 present examples of the encoding and reproduction using

the continuous approach, when considering HMMs with respectively 5 and 7

states. The parameters of the first HMM are also reported in Table 2.3. For

comparison of the reproduction process, see also Figures 2.3 and 2.4 using the

same trajectories. We see that the reproduction with 7 states is more accurate

than with 5 states, mainly due to the information lost at the start and the end of the trajectories, as only the centers of the Gaussian distributions are used for the reproduction.

Table 2.3: Parameters of the HMM encoding raw data (see also Figure 2.6).

Transition probabilities ai,j
 i\j    1     2     3     4     5
  1   0.94  0.05  0.00  0.00  0.00
  2   0.00  0.93  0.07  0.00  0.00
  3   0.00  0.00  0.96  0.04  0.00
  4   0.00  0.00  0.00  0.95  0.05
  5   0.00  0.00  0.00  0.00  1.00

Initial state probabilities ϖk
   1     2     3     4     5
 1.00  0.00  0.00  0.00  0.00


Figure 2.7: Encoding of the continuous data (thin grey line) and retrieval through interpolation (thick black line) using an HMM of 7 states.

Figure 2.8: Illustrations of the Gaussian density function properties used in our model. Left: Linear combination N(µL, ΣL) ∼ N(µ1, Σ1) + N(µ2, Σ2). Middle: Product N(µP, ΣP) ∼ N(µ1, Σ1) · N(µ2, Σ2). Right: Conditional probability p(x2|x1) ∼ N(µ̂, Σ̂).

2.5 Reproduction and extraction of constraints

through Gaussian Mixture Regression

(GMR)

The drawbacks of HMM for reproducing trajectories led us to search for a solution that keeps the covariance information contained in the Gaussian distributions to generate data, with the additional constraint of generating smooth trajectories that can be run on the robot. Indeed, as a command is sent to the robot every millisecond, it is important to guarantee that the motion is smooth and free of vibrations, to avoid demanding large torques from the motors.

Given these constraints, we turned to regression to retrieve


smooth trajectories from a set of observations, and particularly to the use of

Gaussian Mixture Regression (GMR), after first encoding the temporal and

spatial values of a set of trajectories in a joint representation using GMM, as

presented in Section 2.2.

We first state the principles of the regression model proposed here in a

generic form. From a set of trajectories ξ = {ξt, ξs} consisting of temporal and spatial values, we first model the data by a joint probability distribution p(ξt, ξs). A generalized version of the trajectories can then be computed by estimating E[p(ξs|ξt)], and the constraints are extracted by estimating cov(p(ξs|ξt)). We use here the notations E[·] and cov(·) to express respectively expectation and covariance.

Gaussian Mixture Regression (GMR) is thus used to retrieve smooth generalized trajectories with associated covariance matrices describing the variations and correlations across the different variables. In a generic regression problem, one is given a set of predictor variables X ∈ R^p and response variables Y ∈ R^q.

The aim of regression is to estimate the conditional expectation of Y given X

on the basis of a set of observations {X, Y}. In our case, we use regression to

retrieve smooth trajectories which are generalized over a set of observed trajec-

tories. The advantage of the approach is to provide a fast and analytic solution

to produce smooth and reliable trajectories which can be used to control a robot

efficiently (i.e., there is no need to manually check or smooth the results). Thus,

on the basis of a set of observations {ξt, ξs}, where ξs represents a position vec-

tor at time step ξt, the aim of the regression is to estimate the conditional

expectation of ξs given ξt. Then, by computing the conditional expectation of

ξs at each time step, the whole expected trajectory is retrieved. Thus, GMR

offers a way of extracting a single generalized trajectory made up from a set

of trajectories used to train the model, where the generalized trajectory is not

part of the dataset but instead encapsulates all of its essential features.

Compared to traditional regression approaches, the regression function is

not approximated directly. Instead, the joint density of the set of trajectories

is first estimated by a model from which the regression function is derived.

Modeling the joint density instead of a regression function has the advantage

of providing a density distribution of the responses instead of a single value. It

thus generates at the same time a mean response estimate ξ̂s and a covariance response estimate Σ̂ss, conditioned on the predictor variables ξt. In our system, ξ̂s is used as a generalized trajectory, while Σ̂ss is used to extract the constraints

on this trajectory. This makes it possible to evaluate the covariance information not only at specific positions, as was the case when using HMM in our previous approach, but continuously along the movement.

Thus, the modeling phase becomes the computationally demanding part of the algorithm, while the regression phase itself is processed very quickly,

which is advantageous because the reproduction of smooth trajectories is fast

enough to be used at any appropriate time by the robot, while the joint density modeling is computed in a phase of interaction that does not particularly

need fast computation. Indeed, the estimation of E[p(ξs|ξt)] is computed by a

weighted sum of linear models.

Although the theoretical foundations of GMR were studied in the machine learning literature several years ago (Ghahramani & Jordan, 1994; Cohn, Ghahramani, & Jordan, 1996; Sung, 2004; Kambhatla, 1996), the theory has led to only a few applications, which is surprising since GMM and the associated EM learning algorithms are well established in various practical fields of research. Moreover, the approach is easy to implement and satisfies robustness and smoothness criteria that are common to different fields of research, including robotics.

Gaussian Mixture Regression (GMR) is based on the theorem of Gaussian

conditioning and on the linear combination properties of Gaussian distributions

(see also Figure 2.8):

Conditional Gaussian densities:

Let x ∼ N(µ, Σ) be defined by

    x = ( x1 ) ,   µ = ( µ1 ) ,   Σ = ( Σ11  Σ12 )
        ( x2 )         ( µ2 )        ( Σ21  Σ22 ) .

The conditional probability p(x2|x1) is then defined by

p(x2|x1) ∼ N(x2; µ̂, Σ̂),   (2.3)

with

µ̂ = µ2 + Σ21(Σ11)⁻¹(x1 − µ1),
Σ̂ = Σ22 − Σ21(Σ11)⁻¹Σ12.

Linear combination of Gaussian densities:

If x1 ∼ N(µ1, Σ1) and x2 ∼ N(µ2, Σ2), the linear transformation A1x1 + A2x2 + c, where A1 and A2 define transformation matrices and c defines a constant offset vector, follows the distribution

A1x1 + A2x2 + c ∼ N(µ, Σ),   (2.4)

with

µ = A1µ1 + A2µ2 + c,
Σ = A1Σ1A1ᵀ + A2Σ2A2ᵀ.

For a GMM encoding the set of trajectories ξ = {ξt, ξs}, we represent the temporal and spatial values of the Gaussian component k separately as

µk = {µt,k, µs,k},   Σk = [Σtt,k Σts,k; Σst,k Σss,k].

For each component k, the expected distribution of ξs,k given the temporal value ξt is defined by Equation (2.3), which gives

p(ξs,k|ξt, k) = N(ξs,k; ξ̂s,k, Σ̂ss,k),   (2.5)

ξ̂s,k = µs,k + Σst,k(Σtt,k)⁻¹(ξt − µt,k),
Σ̂ss,k = Σss,k − Σst,k(Σtt,k)⁻¹Σts,k.

Then, the K Gaussian distributions N(ξ̂s,k, Σ̂ss,k) are mixed according to the priors βk, i.e., we define

p(ξs|ξt) = ∑_{k=1}^K βk N(ξs; ξ̂s,k, Σ̂ss,k),   (2.6)

where βk = p(k|ξt) is the probability of component k being responsible for ξt, i.e.,

βk = p(k)p(ξt|k) / ∑_{i=1}^K p(i)p(ξt|i) = πk N(ξt; µt,k, Σtt,k) / ∑_{i=1}^K πi N(ξt; µt,i, Σtt,i).

An estimation of Equation (2.6) can be computed by using the linear combination properties of Gaussian distributions. Using Equation (2.4), an estimation of the conditional expectation of ξs given ξt is thus defined by p(ξs|ξt) ∼ N(ξ̂s, Σ̂ss), where the parameters of the Gaussian distribution are given by

ξ̂s = ∑_{k=1}^K βk ξ̂s,k ,   Σ̂ss = ∑_{k=1}^K βk² Σ̂ss,k.

Thus, by evaluating {ξ̂s, Σ̂ss} at different time steps ξt, a generalized form of the motions ξ̂ = {ξt, ξ̂s} and the associated covariance matrices Σ̂ss describing the constraints are computed.
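The GMR step of Equations (2.5)–(2.6) and the estimate above can be sketched as follows. This is a minimal illustration assuming a scalar temporal input and hypothetical model parameters, not the framework's actual implementation.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gmr(xi_t, priors, mu_t, mu_s, s_tt, s_st, s_ss):
    """One GMR query at temporal input xi_t, for K components with a scalar
    temporal part and a D-dimensional spatial part (Sigma_ts = Sigma_st^T)."""
    K = len(priors)
    # beta_k = p(k | xi_t): responsibility of component k for the query point.
    beta = np.array([priors[k] * gaussian_pdf(xi_t, mu_t[k], s_tt[k])
                     for k in range(K)])
    beta /= beta.sum()
    xi_hat = np.zeros(len(mu_s[0]))
    sigma_hat = np.zeros((len(mu_s[0]), len(mu_s[0])))
    for k in range(K):
        # Conditional mean and covariance of component k (Equation 2.5).
        xk = mu_s[k] + s_st[k] * (xi_t - mu_t[k]) / s_tt[k]
        sk = s_ss[k] - np.outer(s_st[k], s_st[k]) / s_tt[k]
        # Weighted sum of the conditional estimates (Equation 2.6).
        xi_hat += beta[k] * xk
        sigma_hat += beta[k] ** 2 * sk
    return xi_hat, sigma_hat

# Hypothetical 2-component model with a 1-D spatial variable.
priors = [0.5, 0.5]
mu_t = [10.0, 90.0]
mu_s = [np.array([0.0]), np.array([1.0])]
s_tt = [25.0, 25.0]
s_st = [np.array([0.0]), np.array([0.0])]
s_ss = [np.array([[0.04]]), np.array([[0.04]])]
xi_hat, sigma_hat = gmr(10.0, priors, mu_t, mu_s, s_tt, s_st, s_ss)
# Near the first component's temporal center, the output follows that component.
```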

When considering regression methods, a compromise must be found between the accuracy of the estimated response and its smoothness, known as the bias-variance tradeoff. An appropriate model complexity must then be selected in order to: (1) keep a low bias between the estimate and the real values (accuracy of the estimation with respect to the observed data); and (2) keep a low variance on the estimate (smoothness of the estimation). In GMR, the bias-variance tradeoff is directly related to the number of Gaussian components in the GMM.

Figure 2.9: Illustration of a Gaussian Mixture Regression (GMR) process used to compute the conditional density p(ξs|ξt). The joint density p(ξt, ξs) is first modeled by a 2-dimensional GMM of 3 components (left). A sequence of temporal data ξt is then used as query points to retrieve a sequence of spatial distributions with means ξ̂s and variances Σ̂ss (right). The expected Gaussian distributions are thus represented along the trajectory by the Gaussian distribution N(ξ̂s, Σ̂ss).

2.5.1 Illustrative examples

Figure 2.9 illustrates a regression process using the GMM representation. Figure

2.10 illustrates the bias-variance tradeoff when defining the number of GMM

components to encode the data. We see that with 6 Gaussian components, the

model is more accurate but the generalized trajectory is less smooth than when

using 3 Gaussian components.

Figures 2.11 and 2.12 present examples of GMM encoding and reproduc-

tion through regression, with respectively 5 and 7 Gaussian components. For

qualitative comparison, see also Figures 2.3, 2.4, 2.6 and 2.7 using the same

trajectories. Compared to the other approaches, we see that the generalized

trajectories extracted by regression fit qualitatively the data very well, which

is confirmed by a quantitative analysis based on a root mean square error mea-

sure comparing the reproduced signal with the original dataset (Figure 2.13).

We also see in this graph that the GMR approach is much less sensitive to the

number of states than a HMM continuous approach.

2.6 Reproduction of trajectories by considering multiple constraints

In the last section, we proposed and discussed several ways of extracting the constraints from a set of trajectories and reproducing a generalized version of these trajectories. We now consider the problem of finding an optimal controller for the robot when multiple constraints, treated as independent and encoded separately in different GMMs, must be satisfied.

If multiple constraints are considered (e.g., considering actions on different

objects), the joint probability (i.e., the probability of two events in conjunction)

is computed by estimating p(ξs|ξt) = p(ξ(1)s|ξt) · p(ξ(2)s|ξt) (independence assump-


Figure 2.10: Illustration of the bias-variance tradeoff when defining the number of GMM components to encode the data and when using this representation for regression purposes. Top: encoding in a GMM of 3 components and the corresponding GMR. Bottom: encoding in a GMM of 6 components and the corresponding GMR.

Figure 2.11: Encoding of the continuous data (thin grey line) and retrieval through GMR (thick black line), using a GMM of 5 states.


Figure 2.12: Encoding of the continuous data (thin grey line) and retrieval through GMR (thick black line), using a GMM of 7 states.

Figure 2.13: Comparison of the reproduction efficiency of the different processes based on HMM and GMM/GMR, for the models (from left to right) presented in Section 2.4.2 and Figure 2.3 (HMM average, 7 states), in Section 2.4.3 and Figure 2.4 (HMM keypoints, 5 states), in Section 2.4.4 and Figure 2.6 (HMM continuous, 5 states), in Figure 2.7 (HMM continuous, 7 states), in Section 2.5 and Figure 2.11 (GMR, 5 states) and in Figure 2.12 (GMR, 7 states). A root mean square (RMS) error measure is used to compare the methods, by taking the original data as reference and computing the RMS distance point by point with the generalized trajectory. The mean error and standard deviation for the different methods are represented here.


tion), and by retrieving a generalized version of the trajectories by evaluating

E[p(ξs|ξt)]. By using standard statistical properties of Gaussian distributions,

namely linear transformation, product and conditional distribution, we will see

in the next section that these probabilities can be defined easily in the case of

a GMM. We will also see in Section 2.6.2 that this result can be interpreted as

an optimization of a metric of imitation (by deriving a cost function measuring

the similarity of demonstrated and reproduced trajectories).

2.6.1 Reproduction of trajectories using a direct computation

After having extracted a set of M constraints, represented by the set of Gaussian distributions {ξ̂(h)s, Σ̂(h)ss}_{h=1}^M through the GMR process, the constraints must be combined optimally to reproduce the skill in a new situation. This is achieved by considering the multiplicative properties of Gaussian distributions:

Product of Gaussian densities:

The product of two Gaussian distributions N(µ1, Σ1) and N(µ2, Σ2) is given by

C · N(µ, Σ) = N(µ1, Σ1) · N(µ2, Σ2),   (2.7)

with

C = N(µ1; µ2, Σ1 + Σ2),
µ = ((Σ1)⁻¹ + (Σ2)⁻¹)⁻¹ ((Σ1)⁻¹µ1 + (Σ2)⁻¹µ2),
Σ = ((Σ1)⁻¹ + (Σ2)⁻¹)⁻¹.

The reproduced trajectory ξ′ is influenced by the set of generalized trajectories {ξ̂(h)}_{h=1}^M. However, the influence is not constant in time, and the constraints are directly reflected by the covariance matrices {Σ̂(h)ss}_{h=1}^M. The final expected trajectory ξ′ and associated covariance Σ′ss are thus computed by multiplying the Gaussian distributions at each time step, i.e., by considering the different constraints as independent and rewriting p(ξ′s|ξt) = ∏_{h=1}^M p(ξ(h)s|ξt) in terms of Gaussian distributions, i.e.,

N(ξ′s,j, Σ′ss,j) ∼ ∏_{h=1}^M N(ξ̂(h)s,j, Σ̂(h)ss,j)   ∀j ∈ {1, . . . , T}.

Using the Gaussian distribution multiplicative properties in Equation (2.7), ξ′ and Σ′ss are then computed using

ξ′s,j = (∑_{h=1}^M (Σ̂(h)ss,j)⁻¹)⁻¹ (∑_{h=1}^M (Σ̂(h)ss,j)⁻¹ ξ̂(h)s,j),
Σ′ss,j = (∑_{h=1}^M (Σ̂(h)ss,j)⁻¹)⁻¹   ∀j ∈ {1, . . . , T}.   (2.8)
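Equation (2.8) amounts to a precision-weighted average of the constraint means. A minimal sketch of this combination (with an arbitrary 1-D example, not data from the thesis):

```python
import numpy as np

def combine_constraints(means, covs):
    """Precision-weighted product of M Gaussian constraints (Equation 2.8):
    the result is pulled toward the constraints with the smallest variance."""
    precisions = [np.linalg.inv(c) for c in covs]
    sigma = np.linalg.inv(sum(precisions))
    mu = sigma @ sum(p @ m for p, m in zip(precisions, means))
    return mu, sigma

# Hypothetical 1-D example: a tight constraint (var 0.01) dominates a loose
# one (var 1.0), so the combined mean stays close to 0.
mu, sigma = combine_constraints(
    [np.array([0.0]), np.array([1.0])],
    [np.array([[0.01]]), np.array([[1.0]])])
```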

2.6.2 Reproduction of trajectories by deriving a metric of imitation

We now show that the result presented in Equation (2.8) can similarly be computed by deriving a metric of imitation. To measure the similarity between a candidate position ξs and a desired position ξ̂s, both of dimensionality (D − 1), a weighted Euclidean distance measure can be defined as

∑_{i=1}^{D−1} wi (ξs,i − ξ̂s,i)² = (ξs − ξ̂s)ᵀ W (ξs − ξ̂s),

where W is a (D − 1) × (D − 1) weight matrix. By setting W = Σ⁻¹, we define the weight matrix as the inverse of the covariance matrix observed during the demonstrations. In other words, if a variable is not consistent across the demonstrations (large variance), the associated weight is low, which indicates for the reproduction that this variable is not of high importance for the task.

Let us now consider a task where two constraints must be fulfilled, represented by the two GMR representations {ξ̂(1)s, Σ̂(1)ss} and {ξ̂(2)s, Σ̂(2)ss}. We define a metric of imitation performance H (i.e., a cost function for the task) by

H(ξs, ξ̂(1)s, Σ̂(1)ss, ξ̂(2)s, Σ̂(2)ss) = (ξs − ξ̂(1)s)ᵀ (Σ̂(1)ss)⁻¹ (ξs − ξ̂(1)s) + (ξs − ξ̂(2)s)ᵀ (Σ̂(2)ss)⁻¹ (ξs − ξ̂(2)s).   (2.9)

To find an optimal trajectory ξs minimizing H, we solve ∂H/∂ξs = 0. We consider a symmetric matrix Wᵀ = W (which is the case for inverse covariance matrices), and use the following derivative properties of quadratic forms (Petersen & Pedersen, 2007):

∂/∂x (Ax − s)ᵀ W (Ax − s) = 2Aᵀ W (Ax − s),
∂/∂x Ax = Aᵀ.

By deriving the metric H, we thus obtain

2 (Σ̂(1)ss)⁻¹ (ξs − ξ̂(1)s) + 2 (Σ̂(2)ss)⁻¹ (ξs − ξ̂(2)s) = 0
⇔ (Σ̂(1)ss)⁻¹ ξs + (Σ̂(2)ss)⁻¹ ξs = (Σ̂(1)ss)⁻¹ ξ̂(1)s + (Σ̂(2)ss)⁻¹ ξ̂(2)s
⇔ ξs = ((Σ̂(1)ss)⁻¹ + (Σ̂(2)ss)⁻¹)⁻¹ ((Σ̂(1)ss)⁻¹ ξ̂(1)s + (Σ̂(2)ss)⁻¹ ξ̂(2)s).

We then see that by extending this result to M different constraints, we recover the result presented in Equation (2.8).
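The equivalence can also be checked numerically: the precision-weighted mean of Equation (2.8) does minimize the cost H of Equation (2.9). The values below are arbitrary illustrations, not data from the thesis.

```python
import numpy as np

# Two hypothetical 1-D constraints: desired positions with their variances.
x1, s1 = np.array([0.0]), np.array([[0.04]])
x2, s2 = np.array([1.0]), np.array([[0.16]])

def H(x):
    """Metric of imitation (Equation 2.9): sum of Mahalanobis distances."""
    d1, d2 = x - x1, x - x2
    return float(d1 @ np.linalg.inv(s1) @ d1 + d2 @ np.linalg.inv(s2) @ d2)

p1, p2 = np.linalg.inv(s1), np.linalg.inv(s2)
x_opt = np.linalg.inv(p1 + p2) @ (p1 @ x1 + p2 @ x2)  # Equation (2.8)
# H is convex, so any perturbation around x_opt increases the cost.
assert all(H(x_opt) <= H(x_opt + d)
           for d in (np.array([0.01]), np.array([-0.01])))
```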


2.6.3 Illustrative examples

An illustrative example of the reproduction process when considering multiple constraints is presented in Figure 2.14, using the direct computation approach presented in Section 2.6.1. The datasets ξ(1) and ξ(2) describing the two different constraints are represented separately by two GMMs of 3 states (first and second row). GMR is then used to retrieve the expected distributions, represented respectively by the Gaussian distributions N(ξ̂(1)s, Σ̂(1)ss) and N(ξ̂(2)s, Σ̂(2)ss) (dashed lines in the third row). The final expected distribution is then retrieved by multiplying the distributions of each constraint at each time step, i.e., by computing p(ξs|ξt) ∼ p(ξ(1)s|ξ(1)t) · p(ξ(2)s|ξ(2)t), where p(ξs|ξt) is represented by the Gaussian distribution N(ξ̂s, Σ̂ss). Through regression, we thus extract a smooth trajectory ξ̂s (solid line) and associated constraints Σ̂ss satisfying the two sets of constraints. Note that this process guarantees that the generalized trajectory remains within the range defined by the different constraints, but does not guarantee that this trajectory is admissible in the environment (e.g., in the presence of an obstacle during the reproduction).

A toy problem is also presented in Figure 2.15 to illustrate the use of the system in a simple 2D screen interface, where the task consists of moving a mouse cursor to a File icon and bringing it to a Trashcan icon (we do not take the mouse click into consideration here). The initial positions of the cursor and of the icons on the desktop are generated randomly for each demonstration. GMMs are then used to encode the action constraints with respect to the two icons, i.e., the absolute trajectory is expressed relative to the initial position of each icon. The extraction of the constraints is presented in Figures 2.16 and 2.17. Reproductions of the skill in new situations that have not been observed during the demonstrations are then presented in Figure 2.18.

2.7 Recognition, classification and evaluation of a reproduction attempt - Toward defining a metric of imitation

2.7.1 Evaluation in GMM

A model can recognize gestures by estimating the likelihood that the observed data could have been generated by the model. The log-likelihood of a model Θ defined by parameters {πk, µk, Σk}_{k=1}^K in the latent space, when testing a set of N datapoints ξ, is defined by

L(ξ, Θ) = ∑_{j=1}^N log (p(ξj|Θ)),   (2.10)


Figure 2.14: Illustrative example of the regression result when considering 2 independent constraints and a direct computation approach. First and second rows: independent extraction of the two constraints {ξ̂(1)s, Σ̂(1)ss} and {ξ̂(2)s, Σ̂(2)ss}. Third row: retrieval of a generalized trajectory and associated constraints (solid line) by combining information from the two independent constraints (dashed lines).


Figure 2.15: Illustration of the drag-and-drop task used as a toy problem to show the continuous reproduction of a skill by considering multiple constraints on the trajectories relative to different objects (here, a File icon and a Trashcan icon). Two demonstrations in different initial situations are depicted here, used to extract the constraints of the task at a trajectory level.

Figure 2.16: Generalization and extraction of constraints for the trajectories relative to the first object (the File icon). Left: the 2D dataset composed of 10 demonstrations. Middle: GMM of 3 components used to encode the data. Right: extraction of constraints through GMR. We see that the trajectory is highly constrained in the middle of the task (when approaching the File icon).


Figure 2.17: Generalization and extraction of constraints for the trajectories relative to the second object (the Trashcan icon). Left: the 2D dataset composed of 10 demonstrations. Middle: GMM of 3 components used to encode the data. Right: extraction of constraints through GMR. We see that the trajectory is highly constrained at the end of the task, when bringing the File icon to the Trashcan icon.

Figure 2.18: Reproduction of the drag-and-drop task for different initial situations that have not been demonstrated. The cursor, File icon and Trashcan icon are respectively represented on the graphs as a circle, a cross and a plus sign. We see that for the different initial configurations, the different constraints of the task are fulfilled at a trajectory level, namely grabbing the File icon and bringing it to the Trashcan icon.


where p(ξj|Θ) is the probability that ξj is generated by the model Θ, computed using Equations (2.1) and (2.2). Similarly, in the original data space, the log-likelihood of a model Θ′ described by parameters {π′k, µ′k, Σ′k}_{k=1}^K, when testing a set of N datapoints x, is defined by

L′(x, Θ′) = ∑_{j=1}^N log (p(xj|Θ′)).   (2.11)

In our experiments, an absolute threshold and a relative threshold (difference between the two highest log-likelihoods) are used to determine whether a trajectory is recognized or not. The aim of the absolute threshold is to select gestures sharing enough similarities with the model, while the aim of the relative threshold is to assign a gesture to a model only if the gesture is sufficiently distant from the other models. The log-likelihood of a model Θ can also be used to evaluate a reproduction attempt consisting of N datapoints ξ in the latent space (or x in the original data space).
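A sketch of this recognition scheme follows; the function names and threshold values are illustrative assumptions (the thesis does not specify them), and only the double-threshold logic comes from the text.

```python
import numpy as np

def multivariate_pdf(x, mu, cov):
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def gmm_loglik(data, priors, mus, covs):
    """Log-likelihood of a set of datapoints under a GMM (Equation 2.10)."""
    return sum(np.log(sum(pi * multivariate_pdf(x, m, c)
                          for pi, m, c in zip(priors, mus, covs)))
               for x in data)

def recognize(data, models, abs_thresh, rel_thresh):
    """Keep the best model only if its log-likelihood passes an absolute
    threshold and beats the runner-up by a relative margin."""
    ranked = sorted(((gmm_loglik(data, *m), i) for i, m in enumerate(models)),
                    reverse=True)
    (best_ll, best_i), (second_ll, _) = ranked[0], ranked[1]
    if best_ll > abs_thresh and best_ll - second_ll > rel_thresh:
        return best_i
    return None
```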

By considering a dataset of input and output variables {X, Y}, Sung (2004) suggested the use of E[p(Y|X)] to evaluate the goodness-of-fit of the model for regression purposes and E[p(X, Y)] to evaluate classification (as described above). Indeed, the author showed that the best mixture model used for classification does not necessarily coincide with the best mixture model used for regression. For the specific kind of regression that we consider in our framework (only one temporal input variable and multivariate spatial output variables), the difference between considering the GMM-based metric described above and a GMR-based metric L′′(ξ, Θ) = ∑_{j=1}^N log (p(ξs,j|ξt,j, Θ)) is however negligible. We will thus use Equation (2.10) throughout this thesis to evaluate the log-likelihood of a model.

2.7.2 Evaluation in HMM

The evaluation of an observation or a reproduction attempt ξ = {ξ1, ξ2, . . . , ξT} in an HMM requires the use of a forward-backward procedure (Rabiner, 1989), which we describe briefly here. Let us first define qt as one of the K states underlying the observation at time t. We define the forward variable αk,t as the probability of observing the partial sequence {ξ1, ξ2, . . . , ξt} of length t and of being in state k at time t, i.e., αk,t = p(ξ1, ξ2, . . . , ξt, qt = k). Knowing the parameters {$, a, µ, Σ} of the HMM, and assuming that each state of the HMM is represented by a single Gaussian distribution, αk,t is computed inductively,


starting from αk,1, as follows.

Initialization:
αk,1 = $k N(ξ1; µk, Σk)   ∀k ∈ {1, . . . , K}.

Induction:
αk,t+1 = (∑_{i=1}^K αi,t ai,k) N(ξt+1; µk, Σk)   ∀k ∈ {1, . . . , K}, ∀t ∈ {1, . . . , T − 1}.

Termination:
p(ξ) = ∑_{i=1}^K αi,T.   (2.12)

The evaluation of a reproduction attempt ξ = {ξ1, ξ2, . . . , ξT} of length T in the latent space is thus computed by using the forward variable defined in Equation (2.12) to compute p(ξ) = ∑_{i=1}^K αi,T. The probability of a model Θ is usually represented in its log-likelihood form

L(ξ, Θ) = log (p(ξ|Θ)),   (2.13)

as for the GMM representation (see Equation (2.10)).
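The forward procedure of Equation (2.12) translates directly into code. A minimal sketch for scalar observations with one Gaussian per state (the parameter values are illustrative):

```python
import numpy as np

def gpdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def forward_loglik(obs, start, trans, mus, vars_):
    """Forward procedure (Equation 2.12) for an HMM with one Gaussian
    emission per state; returns log p(obs | model)."""
    K = len(start)
    # Initialization: alpha_{k,1} = pi_k N(xi_1; mu_k, Sigma_k).
    alpha = start * np.array([gpdf(obs[0], mus[k], vars_[k]) for k in range(K)])
    for x in obs[1:]:
        emit = np.array([gpdf(x, mus[k], vars_[k]) for k in range(K)])
        alpha = (alpha @ trans) * emit  # induction step
    return np.log(alpha.sum())          # termination: p(xi) = sum_k alpha_{k,T}

# Hypothetical 2-state model; a sequence staying near the first state's mean.
ll = forward_loglik([0.0, 0.1, 0.2],
                    np.array([0.9, 0.1]),
                    np.array([[0.8, 0.2], [0.2, 0.8]]),
                    [0.0, 5.0], [1.0, 1.0])
```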

2.8 Learning of the model parameters

Until now, the HMM and GMM parameters were assumed to be known in advance. This section is concerned with the automatic estimation of these parameters. We first describe the estimation of the GMM and HMM parameters in a batch mode, i.e., by using the whole dataset for the estimation. We then discuss how to incrementally learn the parameters of a GMM when the different demonstrations are provided one by one, by defining two update mechanisms using only the latest observation to refine the model.

2.8.1 Batch learning of the GMM parameters through Expectation-Maximization (EM) algorithm

Training of the GMM parameters is usually performed in a batch mode (i.e., using the whole training set) by the Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977), which is a simple local search technique that guarantees a monotone increase of the likelihood during optimization (see Section 2.7.1). Considering Equations (2.1) and (2.2), we define pk,j as the posterior probability p(k|ξj), computed using Bayes' theorem

p(k|ξj) = p(k)p(ξj|k) / ∑_{i=1}^K p(i)p(ξj|i),

and Ek as the sum of posterior probabilities (used to simplify the notation). The parameters {πk, µk, Σk, Ek} of the GMM are then estimated iteratively until convergence, starting from a rough estimation of the parameters by k-means


segmentation (MacQueen, 1967), as follows.

E-step:

p(t+1)k,j = π(t)k N(ξj; µ(t)k, Σ(t)k) / ∑_{i=1}^K π(t)i N(ξj; µ(t)i, Σ(t)i),
E(t+1)k = ∑_{j=1}^N p(t+1)k,j.

M-step:

π(t+1)k = E(t+1)k / N,
µ(t+1)k = ∑_{j=1}^N p(t+1)k,j ξj / E(t+1)k,
Σ(t+1)k = ∑_{j=1}^N p(t+1)k,j (ξj − µ(t+1)k)(ξj − µ(t+1)k)ᵀ / E(t+1)k.

The iteration stops when the increase of the log-likelihood at each iteration becomes too small, that is, when L(t+1)/L(t) < C1, with the log-likelihood L defined in Equation (2.10). The threshold C1 = 0.01 is used in this chapter and in all further experiments.
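The E- and M-steps above can be sketched for a 1-D GMM as follows. This is a simplified illustration with a fixed number of iterations rather than the log-likelihood stopping criterion, and the initialization is given directly instead of coming from k-means.

```python
import numpy as np

def em_gmm_1d(data, mus, vars_, priors, n_iter=20):
    """A few EM iterations for a 1-D GMM, following the E/M updates above."""
    data = np.asarray(data)
    for _ in range(n_iter):
        # E-step: posterior p(k | xi_j) for every point and component.
        resp = np.array([pi * np.exp(-0.5 * (data - m) ** 2 / v)
                         / np.sqrt(2.0 * np.pi * v)
                         for pi, m, v in zip(priors, mus, vars_)])
        resp /= resp.sum(axis=0)
        # M-step: re-estimate priors, means and variances from the posteriors.
        E = resp.sum(axis=1)
        priors = E / len(data)
        mus = (resp @ data) / E
        vars_ = np.array([(r * (data - m) ** 2).sum() / e
                          for r, m, e in zip(resp, mus, E)])
    return priors, mus, vars_
```

Running it on two well-separated clusters pulls the (deliberately poor) initial centers onto the cluster means, mirroring the convergence shown in Figure 2.19.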

Other batch training methods for GMM have been proposed as alternatives to Maximum Likelihood Estimation (MLE). For example, Sung (2004) proposed an Iterative Pairwise Replacement Algorithm (IPRA) to learn the model's parameters. This method iteratively merges similar pairs of components and estimates the parameters of the new component using a method of moments, which is faster than the EM update. This is particularly important for his method, since a very large number of components is estimated in his framework. While the parameters found by IPRA are poor estimates in an MLE sense, the principal advantage of the approach is that it can estimate the number of Gaussian components K required to fit the dataset without computing an MLE for every GMM. In empirical studies, the author showed that IPRA tends to produce smoother results when the mixture model is used for regression. However, he also showed that MLE training of the model resulted in better classification performance. When considering classification, Sung (2004) thus suggested using the IPRA estimate of the parameters to initialize an MLE estimation. As we consider both recognition and reproduction simultaneously in our application, treating these issues in separate models is not optimal, and using MLE remains a coherent option whose principal advantage is to allow the use of a simple and flexible EM update mechanism. To select the number of components in a GMM automatically prior to its estimation through the EM algorithm, an alternative approach using curvature information will be presented in Section 2.10.2. To produce smooth results when the mixture model is used for regression, we also propose an alternative based on a simpler regularization scheme, presented in Section 2.11.1, where the sharper fit induced by MLE estimation can easily be softened by bounding the covariance matrices.


Figure 2.19: Illustration of the EM iterative learning algorithm, from a random initialization through EM steps 1 to 5. For this 1 DOF trajectory, the algorithm converges to a local optimum in 5 steps.

Illustrative example

Figure 2.19 illustrates the EM process used to learn the GMM parameters. The initial centers of the Gaussian distributions (with constant diagonal covariance matrices) are voluntarily set to inappropriate random values to show the convergence performance of the algorithm.9

2.8.2 Batch learning of the HMM parameters through Baum-Welch algorithm

To learn the parameters of the HMM, we use the Baum-Welch algorithm, which

shares similarities with the EM algorithm presented in Section 2.8.1. Even if the

two learning processes (EM for GMM and EM for HMM) are often discussed

separately, it has been shown that they can be treated in the same framework

by using a similar notation (Bilmes, 1998; Xuan, Zhang, & Chai, 2001).

First, we need to define a backward variable βk,t as the probability of observing the partial sequence {ξt+1, ξt+2, . . . , ξT} of length (T − t), given that the current state is k at time t, i.e., βk,t = p(ξt+1, ξt+2, . . . , ξT|qt = k). Similarly to the forward variable defined in Equation (2.12), βk,t is computed inductively, starting from βk,T. Knowing the parameters {$, a, µ, Σ} of the HMM, β is computed as

9 Note that this convergence performance decreases when considering datasets of higher dimensionality.


follows.

Initialization:
βk,T = 1   ∀k ∈ {1, . . . , K}.

Induction:
βk,t = ∑_{i=1}^K ak,i N(ξt+1; µi, Σi) βi,t+1   ∀k ∈ {1, . . . , K}, ∀t ∈ {T − 1, T − 2, . . . , 1}.   (2.14)

We then define the variable γk,t as the probability of being in state k at time t given the observation sequence ξ, i.e., γk,t = p(qt = k|ξ). It is computed from the forward and backward variables of Equations (2.12) and (2.14) as

γk,t = αk,t βk,t / p(ξ) = αk,t βk,t / ∑_{i=1}^K αi,t βi,t.

Next, we define the variable ζi,j,t as the probability of being in state i at time t and in state j at time t + 1, given an observation sequence ξ = {ξ1, ξ2, . . . , ξT} of length T, i.e., ζi,j,t = p(qt = i, qt+1 = j|ξ). Knowing the parameters {$, a, µ, Σ} of the HMM, ζi,j,t is computed through the forward-backward procedure as

ζi,j,t = αi,t ai,j N(ξt+1; µj, Σj) βj,t+1 / p(ξ)
       = αi,t ai,j N(ξt+1; µj, Σj) βj,t+1 / ∑_{i=1}^K ∑_{j=1}^K αi,t ai,j N(ξt+1; µj, Σj) βj,t+1,

where the numerator represents p(qt = i, qt+1 = j, ξ).

Summation of ζi,j,t over j recovers the previously defined variable γi,t, i.e., γi,t = ∑_{j=1}^K ζi,j,t. Summation of ζi,j,t over t, i.e., ∑_{t=1}^{T−1} ζi,j,t, is the expected number of transitions from state i to state j. Similarly, ∑_{t=1}^{T−1} γk,t is the expected number of transitions from state k.

If {$(u), a(u), µ(u), Σ(u)} are the current estimates of the HMM parameters, and considering ζ(u) and γ(u) as additional variables computed from these estimates, we can re-estimate the initial probability $k of being in state k, the state transition probabilities ai,j, and the output distribution of each state by using

$(u+1)k = γ(u)k,1   ∀k ∈ {1, . . . , K},

a(u+1)i,j = ∑_{t=1}^{T−1} ζ(u)i,j,t / ∑_{t=1}^{T−1} γ(u)i,t   ∀i, j ∈ {1, . . . , K},

µ(u+1)k = ∑_{t=1}^T γ(u)k,t ξt / ∑_{t=1}^T γ(u)k,t   ∀k ∈ {1, . . . , K},

Σ(u+1)k = ∑_{t=1}^T γ(u)k,t (ξt − µ(u+1)k)(ξt − µ(u+1)k)ᵀ / ∑_{t=1}^T γ(u)k,t   ∀k ∈ {1, . . . , K}.


To make a link with the EM estimation used for GMM (Section 2.8.1), the

variable γk,t used for HMM corresponds to the posterior probability pk,j used

for GMM.
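The forward-backward computation of γ can be sketched as follows for scalar observations; this is an illustrative simplification (the ζ variable and the parameter re-estimation would build on the same quantities), not the framework's actual implementation.

```python
import numpy as np

def gpdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def forward_backward(obs, start, trans, mus, vars_):
    """Forward-backward pass for a Gaussian-emission HMM; returns the state
    posteriors gamma_{k,t} = p(q_t = k | obs) used in the Baum-Welch updates."""
    T, K = len(obs), len(start)
    emit = np.array([[gpdf(x, mus[k], vars_[k]) for k in range(K)] for x in obs])
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))        # initialization: beta_{k,T} = 1
    alpha[0] = start * emit[0]
    for t in range(1, T):         # forward induction (Equation 2.12)
        alpha[t] = (alpha[t - 1] @ trans) * emit[t]
    for t in range(T - 2, -1, -1):  # backward induction (Equation 2.14)
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

With well-separated state means, the posterior assigns each observation almost entirely to the nearest state, which is the behavior the re-estimation formulas rely on.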

2.8.3 Incremental learning of the GMM parameters

We have seen in Section 2.5 that a single demonstration of a skill is usually not sufficient to extract the task's particularities and goals. Compared to a batch learning procedure, the benefit of an incremental learning approach (Pardowitz et al., 2007; Saunders et al., 2006) or a dynamical learning approach (Ito et al., 2006; Vijayakumar et al., 2005) is that the interaction can be performed in an online manner, with the results observable immediately.

As seen in Section 2.8.1, the traditional GMM learning procedure starts from an initial estimate of the parameters and uses the Expectation-Maximization (EM) algorithm to converge to a locally optimal solution. In its basic version, this batch learning procedure is not optimal for RbD, where new incoming data should refine the model of the gesture without keeping the whole training data in memory.

The problem of incrementally updating a GMM by taking into account only the new incoming data and the previous estimate of the GMM parameters has already been addressed for online data stream clustering. Song and Wang (2005) suggested creating a new Gaussian component to encode the new incoming data, and building a compound model by merging the similar components of the old GMM with the new GMM. The main drawback of this approach is that it is computationally expensive and tends to produce more components than the standard EM algorithm. Arandjelovic and Cipolla (2006) suggested using the temporal coherence properties of data streams to update the GMM parameters. Their model assumes that the data vary smoothly in time, which allows the GMM parameters to be adjusted when new data are observed. The authors proposed a method to update the GMM parameters for each newly observed datapoint, allowing splitting and merging operations on the Gaussian components when the current number of Gaussian components did not represent the new data well.

In RbD, the issue is different in the sense that we do not need to model online streams. Nevertheless, we need to adjust an already existing model of a stream of datapoints when a new stream of datapoints is observed and recognized by the model. We thus suggest two approaches to update the model's parameters: (1) a direct update method that takes inspiration from Arandjelovic and Cipolla (2006), re-formulating the problem for a generic observation of multiple datapoints; and (2) a generative method that uses a batch learning process performed on data generated by Gaussian Mixture Regression. The two methods are described next.


Direct update method

The idea is to adapt the EM algorithm presented in Section 2.8.1 by separating
the part dedicated to the data already used to train the model from the part
dedicated to the newly available data, under the assumption that the set of
posterior probabilities $\{p(k|\xi_j)\}_{j=1}^{N}$ remains the same when the new
data $\{\tilde{\xi}_j\}_{j=1}^{\tilde{N}}$ are used to update the model. Note that
this holds only if the new data are close to the model, which is expected in our
framework since a novel observation is first recognized by the model before being
used to re-estimate it, i.e., the novel observations consist of gestures similar
to the ones already encoded in the model. Thus, the model is first created with
$N$ datapoints $\xi_j$ and updated iteratively during $T$ EM-steps (see Section
2.8.1), until convergence to the set of parameters
$\{\pi_k^{(T)}, \mu_k^{(T)}, \Sigma_k^{(T)}, E_k^{(T)}\}_{k=1}^{K}$. When a new
gesture is recognized by the model (see Section 2.7), $T$ EM-steps are again
performed to adjust the model to the new $\tilde{N}$ datapoints $\tilde{\xi}_j$,
starting from the initial set of parameters
$\{\pi_k^{(0)}, \mu_k^{(0)}, \Sigma_k^{(0)}, E_k^{(0)}\}_{k=1}^{K} =
\{\pi_k^{(T)}, \mu_k^{(T)}, \Sigma_k^{(T)}, E_k^{(T)}\}_{k=1}^{K}$. We assume that
the parameters $\{E_k^{(0)}\}_{k=1}^{K}$ remain constant during the update
procedure, i.e., we assume that the cumulated posterior probability does not
change much with the inclusion of the novel data in the model. Using the notation
$p_{k,j} = p(k|\tilde{\xi}_j)$, we then rewrite the EM procedure as follows.

E-step:

$$p_{k,j}^{(t+1)} = \frac{\pi_k^{(t)}\, \mathcal{N}\!\left(\tilde{\xi}_j;\, \mu_k^{(t)}, \Sigma_k^{(t)}\right)}{\sum_{i=1}^{K} \pi_i^{(t)}\, \mathcal{N}\!\left(\tilde{\xi}_j;\, \mu_i^{(t)}, \Sigma_i^{(t)}\right)}, \qquad E_k^{(t+1)} = \sum_{j=1}^{\tilde{N}} p_{k,j}^{(t+1)}.$$

M-step:

$$\pi_k^{(t+1)} = \frac{E_k^{(0)} + E_k^{(t+1)}}{N + \tilde{N}},$$

$$\mu_k^{(t+1)} = \frac{E_k^{(0)}\, \mu_k^{(0)} + \sum_{j=1}^{\tilde{N}} p_{k,j}^{(t+1)}\, \tilde{\xi}_j}{E_k^{(0)} + E_k^{(t+1)}},$$

$$\Sigma_k^{(t+1)} = \frac{E_k^{(0)} \left( \Sigma_k^{(0)} + \left(\mu_k^{(0)} - \mu_k^{(t+1)}\right)\left(\mu_k^{(0)} - \mu_k^{(t+1)}\right)^{\!\top} \right)}{E_k^{(0)} + E_k^{(t+1)}} + \frac{\sum_{j=1}^{\tilde{N}} p_{k,j}^{(t+1)} \left(\tilde{\xi}_j - \mu_k^{(t+1)}\right)\left(\tilde{\xi}_j - \mu_k^{(t+1)}\right)^{\!\top}}{E_k^{(0)} + E_k^{(t+1)}}.$$

The iteration stops when $\left|\frac{\mathcal{L}^{(t+1)}}{\mathcal{L}^{(t)}} - 1\right| < C_2$,
i.e., when the relative change of the log-likelihood falls below the threshold
$C_2 = 0.01$, which is used in this chapter and in the further experiments
presented in this thesis.
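The direct update can be sketched in a few lines of numpy. This is an illustrative implementation under the stated assumption that $E_k^{(0)}$, $\mu_k^{(0)}$ and $\Sigma_k^{(0)}$ are held fixed while the EM steps iterate over the new data only; the function names and the fixed number of EM steps are our own choices, not the thesis implementation.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density evaluated at the rows of x."""
    d = mu.shape[0]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    expo = -0.5 * np.sum(diff @ inv * diff, axis=1)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(expo) / norm

def direct_update(pi0, mu0, sigma0, E0, N, xi_new, n_steps=10):
    """EM steps on the new data only, combining them with the stored
    cumulated posteriors E0 of the N datapoints seen so far."""
    K = len(pi0)
    Nt = xi_new.shape[0]
    pi, mu, sigma = pi0.copy(), mu0.copy(), sigma0.copy()
    for _ in range(n_steps):
        # E-step: posteriors of the new datapoints only
        resp = np.array([pi[k] * gaussian_pdf(xi_new, mu[k], sigma[k])
                         for k in range(K)])
        resp /= resp.sum(axis=0, keepdims=True)
        E_new = resp.sum(axis=1)
        # M-step: mix the frozen old statistics (through E0) with the new data
        for k in range(K):
            denom = E0[k] + E_new[k]
            pi[k] = denom / (N + Nt)
            mu_k = (E0[k] * mu0[k] + resp[k] @ xi_new) / denom
            d0 = (mu0[k] - mu_k)[:, None]
            dn = xi_new - mu_k
            sigma[k] = (E0[k] * (sigma0[k] + d0 @ d0.T)
                        + (resp[k][:, None] * dn).T @ dn) / denom
            mu[k] = mu_k
    return pi, mu, sigma
```

Since the old data enter only through $E_k^{(0)}$, $\mu_k^{(0)}$ and $\Sigma_k^{(0)}$, no historical datapoints need to be stored.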

Generative method

We also propose an alternative to the direct update mechanism presented above,
using a stochastic method to update the parameters. An initial GMM
$\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ is first created using the batch EM
algorithm described in Section 2.8.1. The corresponding GMR model is also
estimated as described in Section 2.5. When new data are available and recognized
by the model, regression is performed on the current model to stochastically
generate new data, by considering the current GMR model
$\{\hat{\mu}_j, \hat{\Sigma}_j\}_{j=1}^{T}$. By creating a new dataset composed
of these generated data and the newly observed data $\{\tilde{\xi}_j\}_{j=1}^{T}$,
the GMM parameters are then refined again by the batch EM algorithm. To do so,
we define $\alpha \in [0,1]$ as the learning rate and $n = n_1 + n_2$ as the
number of samples used for the learning procedure, where $n_1 \in \mathbb{N}$ and
$n_2 \in \mathbb{N}$ are respectively the number of examples duplicated from the
new observation (copies of the observed trajectories) and the number of examples
generated stochastically by the current model. The new training dataset is then
defined by

$$\xi_{i,j} = \tilde{\xi}_j \;\; \text{if } 1 \le i \le n_1, \qquad \xi_{i,j} \sim \mathcal{N}(\hat{\mu}_j, \hat{\Sigma}_j) \;\; \text{if } n_1 < i \le n, \qquad \forall j \in \{1, \ldots, T\},$$

with

$$n_1 = [n\,\alpha].$$

Here, $[\cdot]$ denotes the nearest integer function. The training set of
$n$ trajectories is then used to refine the model by updating the current set of
parameters $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ with the batch EM algorithm
presented in Section 2.8.1. The learning rate $\alpha \in [0,1]$ can be fixed,
or can depend on the current number of demonstrations used to train the model.
In the latter case, when a new demonstration of $\tilde{N}$ datapoints becomes
available (knowing that $N$ datapoints from previous demonstrations were used to
train the model), $\alpha$ is set to $\frac{\tilde{N}}{N+\tilde{N}}$.
Similarly, when all demonstrations have the same number of datapoints, $\alpha$
can be computed recursively for each newly available demonstration. Starting from
$\alpha^{(0)} = 1$, we define the induction as

$$\alpha^{(t+1)} = \frac{\alpha^{(t)}}{\alpha^{(t)} + 1}. \qquad (2.15)$$

This recursive learning rate will be used in this chapter and in the further
experiments, with a fixed number of samples $n = 5$ and $T = 100$ time steps.
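As a sketch, the construction of the mixed training set and the recursive learning rate of Equation (2.15) can be written as follows. The helper names `generative_training_set` and `alpha_recursive` are our own, and the final refit by batch EM is left out.

```python
import numpy as np

def generative_training_set(xi_new, mu_gmr, sigma_gmr, alpha, n=5, seed=0):
    """Build the n-trajectory training set: n1 copies of the new observation
    plus n - n1 trajectories sampled from the current GMR model."""
    rng = np.random.default_rng(seed)
    T = xi_new.shape[0]
    n1 = int(round(n * alpha))              # n1 = [n * alpha]
    data = [xi_new.copy() for _ in range(n1)]
    for _ in range(n - n1):                 # stochastic samples from the GMR envelope
        traj = np.array([rng.multivariate_normal(mu_gmr[j], sigma_gmr[j])
                         for j in range(T)])
        data.append(traj)
    return np.stack(data)                   # shape (n, T, dim)

def alpha_recursive(t):
    """Eq. (2.15): alpha(0) = 1, alpha(t+1) = alpha(t) / (alpha(t) + 1)."""
    a = 1.0
    for _ in range(t):
        a = a / (a + 1.0)
    return a
```

Note that `alpha_recursive(t)` equals $1/(t+1)$, i.e., the new demonstration's share of the data when all demonstrations have equal length.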

Illustrative example

Figure 2.20 illustrates the encoding results of the GMM when incrementally
updating the parameters using the direct update method and the generative
method, respectively. We see that the two approaches converge to quasi-similar
local optima found by the iterative EM processes.


[Figure 2.20: two rows of six panels ("Demonstration 1" to "Demonstration 6"), each plotting $\xi_s$ against $\xi_t$, for the direct update method (top row) and the generative method (bottom row).]

Figure 2.20: Illustration of the incremental learning of data using the direct update method (top) and the generative method (bottom). The graphs show the consecutive encoding of the data in the GMM, after convergence of the EM algorithms. Both algorithms only use the latest observed trajectory, represented in black points, to update the models (no historical data is used).


2.9 Reduction of dimensionality and latent

space projection

In high dimensions (e.g., in the robot's joint space), estimating a GMM by
EM becomes difficult due to the sparsity of the data when considering small
datasets. Indeed, as the likelihood function flattens, it becomes difficult to
find a local optimum that provides an efficient model of the data. This curse of
dimensionality is a well-known problem of GMMs (Scott, 2004).

To reduce the dimensionality of the data, we can benefit from the fact that
the collected data present redundancies for the majority of the tasks considered.
As the degree and type of redundancy can differ from one task to another, we
look for a latent space onto which we can project the original dataset to find
an optimal representation for the given task. For example, an optimal latent
space for a writing task could typically be represented by a projection of the
original 3-dimensional Cartesian position of the hand onto a 2-dimensional
latent space defined by the writing surface. Similarly, a waving gesture could
be represented as a single 1-dimensional cyclic pattern.

In this thesis, we take the perspective that we do not know in advance the

optimal form of the functions to project human motion data, and assume that

a latent space of motion extracted through a linear combination of the data in

the original data space is sufficient for the further analysis of the movement. As

mentioned in the introduction, PCA is a common approach to project human

motion data and has been used in various fields of research (Chalodhorn et al.,

2006; Urtasun et al., 2004). This method can also be used in our framework

to reduce the dimensionality of the dataset for further analysis. We take the

perspective that the whole dataset can be first projected globally into a latent

space of motion, where the non-linearities of the signals are further represented

by a local encoding of the data through GMM.

Linear decomposition of the data can be formulated as a Blind Source Sep-

aration (BSS) problem, where the observed trajectories xs are assumed to be

composed of statistically independent trajectories. The goal is then to estimate

the mixing matrix A to recover the underlying independent trajectories ξs. To

do so, we consider a linear transformation of the dataset xs of N data points

in the original centered data space of dimensionality (d− 1) to a latent space of

motion of dimensionality (D − 1), which produces a new dataset ξs of N data

points. The transformation is defined by

$$x_s - \bar{x}_s = A\, \xi_s, \qquad (2.16)$$

where $\bar{x}_s \in \mathbb{R}^{N \times (d-1)}$ is a matrix containing the means
of the training set $x_s$ for each dimension, and $A$ is the transformation
matrix. This goal cannot be achieved exactly in practice due to the lack of a
general measure of statistical independence. However, other related criteria can
be used to approximate the


decomposition into statistically independent components. Among the multiple
criteria proposed to linearly decompose the data, we suggest using either:¹⁰

• Principal Component Analysis (PCA), which analytically finds a mixing matrix
$A$ projecting the dataset $x_s$ onto uncorrelated components $\xi_s$, with the
criterion of ordering the dimensions with respect to the variability of the
projected data.

• Canonical Correlation Analysis (CCA), which analytically finds a mixing matrix
$A$ projecting the dataset $x_s$ onto a set of uncorrelated components $\xi_s$,
with the criterion of having maximum temporal correlation within each component.

• Independent Component Analysis (ICA), which iteratively finds a mixing matrix
$A$ projecting the dataset $x_s$ onto independent components $\xi_s$, with the
criterion of having components that are as non-Gaussian as possible.

These methods require different assumptions on the dataset, and can produce
different projection results. These assumptions depend on the type of signals
and the domain of activity, and are often the critical factor in selecting one
method over another. PCA assumes that the important information is contained in
the energy of the signals; CCA assumes that the important information is
contained in the temporal ordering of the observations and that the observations
are composed of signals with different autocorrelation functions; ICA assumes
that the components follow non-Gaussian distributions. We briefly describe these
different methods here.

2.9.1 Principal Component Analysis (PCA)

PCA determines the directions along which the variability of the data is maximal
(Jolliffe, 1986), by extracting the eigenvectors of the covariance matrix of the
dataset $x_s$, $\Sigma_{xx} = E[x_s x_s^\top]$. The eigenvectors $v_i^x$ and
associated eigenvalues $\lambda_i$ are calculated by solving

$$\Sigma_{xx}\, v_i^x = \lambda_i\, v_i^x \qquad \forall i \in \{1, \ldots, d\}.$$

The mixing matrix is then defined as $A = \{v_1^x, v_2^x, \ldots, v_K^x\}$, where
$K$ is the minimal number of eigenvectors used to obtain a satisfying
representation of the data, i.e., such that the projection of the data onto the
reduced set of eigenvectors covers at least 98% of the data's spread,
$\sum_{i=1}^{K} \lambda_i > 0.98$. $A$ is thus an orthogonal transformation that
diagonalizes the covariance matrix $\Sigma_{xx}$.
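A minimal numpy sketch of this procedure is given below, keeping the smallest number of eigenvectors whose eigenvalues cover 98% of the spread. The helper name and the normalization of the eigenvalues by their sum are our assumptions.

```python
import numpy as np

def pca_mixing_matrix(x, spread=0.98):
    """Eigen-decomposition of the covariance matrix; keep the smallest K
    eigenvectors covering `spread` of the normalized eigenvalue mass."""
    xc = x - x.mean(axis=0)                 # center the data
    cov = np.cov(xc, rowvar=False)
    lam, vec = np.linalg.eigh(cov)          # eigh returns ascending order
    lam, vec = lam[::-1], vec[:, ::-1]      # sort descending by eigenvalue
    ratio = np.cumsum(lam) / lam.sum()      # cumulative explained spread
    K = int(np.searchsorted(ratio, spread) + 1)
    A = vec[:, :K]                          # mixing matrix, d x K
    latent = xc @ A                         # projected (latent) data
    return A, latent, K
```
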

2.9.2 Canonical Correlation Analysis (CCA)

Another way of separating a mixed signal $x_s$ is to find a linear combination of
$x_s$ that correlates most with a linear combination of the temporally delayed
version of the signal $y_s = x_s(t+1)$ (Borga, 1998). By defining the linear
combinations $\xi^x = (w^x)^\top x_s$ and $\xi^y = (w^y)^\top y_s$
($\xi^x, \xi^y \in \mathbb{R}$), the correlation between $\xi^x$ and $\xi^y$ is
given by¹¹

$$\rho = \frac{E[\xi^x \xi^y]}{\sqrt{E[\xi^x(\xi^x)^\top]\; E[\xi^y(\xi^y)^\top]}} = \frac{(w^x)^\top\, \Sigma_{xy}\, w^y}{\sqrt{\left((w^x)^\top\, \Sigma_{xx}\, w^x\right)\left((w^y)^\top\, \Sigma_{yy}\, w^y\right)}},$$

where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the within-set covariance matrices and
$\Sigma_{xy}$ is the between-sets covariance matrix. Computing the maximum of
$\rho$ gives the directions of maximum data correlation. Setting to zero the
partial derivatives of $\rho$ with respect to $w^x$ and $w^y$ gives¹²

$$(\Sigma_{xx}^{-1}\, \Sigma_{xy}\, \Sigma_{yy}^{-1}\, \Sigma_{yx})\; w_i^x = \rho_i^2\; w_i^x, \qquad (\Sigma_{yy}^{-1}\, \Sigma_{yx}\, \Sigma_{xx}^{-1}\, \Sigma_{xy})\; w_i^y = \rho_i^2\; w_i^y, \qquad \forall i \in \{1, \ldots, d\}.$$

Hence, $w_i^x$ and $w_i^y$ are found by computing the eigenvectors of the matrices
$(\Sigma_{xx}^{-1} \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx})$ and
$(\Sigma_{yy}^{-1} \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy})$. The corresponding
eigenvalues $\rho_i^2$ are the squared canonical correlation values. The
eigenvectors corresponding to the largest eigenvalue $\rho_1^2$ are the vectors
$w_1^x$ and $w_1^y$ that maximize the correlation between $\xi^x$ and $\xi^y$,
i.e., that maximize the autocorrelation (at lag one) of the resulting signal
$\xi_{s,1} = (w_1^x)^\top x_s$. $w_2^x$ and $w_2^y$ give the second best linear
transformation, providing a signal $\xi_{s,2} = (w_2^x)^\top x_s$ that is
uncorrelated with $\xi_{s,1}$. Following this, the linear transformation matrix
$A = \{w_1^x, w_2^x, \ldots, w_K^x\}$ is finally built. A necessary condition for
CCA is that the source signals $\xi_s$ have different autocorrelation functions.
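The lag-one formulation can be sketched with numpy's linear algebra routines. This is an illustrative reduction of the above equations, assuming the sample covariances are invertible; it solves only the first eigenproblem, since the columns $w_i^x$ are what the mixing matrix $A$ is built from.

```python
import numpy as np

def cca_bss(x):
    """Maximum-autocorrelation CCA between x(t) and its lag-one copy
    y(t) = x(t+1); returns the mixing matrix and the latent signals."""
    X, Y = x[:-1], x[1:]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxx = X.T @ X / len(X)
    Syy = Y.T @ Y / len(Y)
    Sxy = X.T @ Y / len(X)
    # eigenproblem (Sxx^-1 Sxy Syy^-1 Syx) w = rho^2 w
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    rho2, W = np.linalg.eig(M)
    order = np.argsort(-rho2.real)          # sort by canonical correlation
    A = W[:, order].real                    # columns w_i^x
    return A, (x - x.mean(axis=0)) @ A
```

In this sketch the component with the highest lag-one autocorrelation comes first, which is what separates a smooth signal from noise.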

2.9.3 Independent Component Analysis (ICA)

ICA defines a generative model where the data variables are assumed to be

linear mixtures of unknown latent variables and where the mixing system is

also unknown. The latent variables are assumed to be non-Gaussian and mu-

tually independent, and they are called the independent components of the

observed data. ICA is a more recently developed variant of factor analysis that
can perform Blind Source Separation (BSS), i.e., it is capable of iteratively
extracting the independent components of the data. Thus, ICA finds a
decomposition of the data minimizing a measure of Gaussianity (e.g., kurtosis,
negentropy, or mutual information measures), which can be interpreted as
producing components $\xi_s$ that are as independent as possible. Indeed, in
many practical situations, it

¹¹By considering $Y = X(t+1)$, a maximum autocorrelation approach is used. However, instead of considering only a datapoint and its next neighbor, CCA can be generalized to maximize the correlation between a datapoint and a linear combination of a neighboring region of this datapoint. In doing so, CCA not only finds an optimal combination of the signals but also optimal filters for each signal.

¹²See Borga (1998) for the detailed computation.


has also the effect of reducing the statistical dependence of the data. ICA
assumes that the components follow non-Gaussian distributions. This assumption
can be illustrated by considering the sum of two Gaussian distributions. Indeed,
as seen previously, the sum of two Gaussian random variables is again Gaussian.
Therefore, if we combine signals with Gaussian distributions, it will be
difficult to extract the original signals from the resulting signal. Similarly,
we have seen that the linear projection of a multi-dimensional Gaussian
distribution produces another Gaussian distribution. These two facts suggest
that one way of identifying the "interesting" features in a multi-dimensional
dataset is to find the projections onto which the signals are as non-Gaussian
as possible. While the aim of PCA is to find projections onto which the signals
present high variance (second-order statistics), ICA finds projections onto
which the signals present high kurtosis (fourth-order statistics).

In our experiments, a fixed-point iteration scheme is used to compute an
estimate of ICA, also referred to as the FastICA algorithm (Hyvarinen, 1999).
This iteration scheme is one of the most widely used algorithms for performing
independent component analysis because it is computationally efficient and yet
statistically robust.

ICA requires pre-whitening of the data, which is done by pre-processing the
data with PCA; the dimensionality is also reduced at this preprocessing stage by
discarding the components with lowest variance. The mixing matrix $A$ is square,
and is updated iteratively from a starting point usually selected at random.
The components $\xi_s$ found by ICA are presented in a random order. Depending
on the application, re-ordering the components with respect to their measure of
Gaussianity often does not provide satisfying results. To prevent this, we use
in our experiments the mixing matrix $A$ found by PCA as an initial estimate,
with the principal advantage of providing the same starting point at each run,
and the principal disadvantage of initializing the iterative search with an
estimate composed of second-order statistics, which can in some cases converge
to a poor local optimum. However, as PCA already finds a latent space of motion
using second-order statistics, it can also speed up the convergence of ICA,
because a large part of the work has already been done by PCA. In other words,
ICA can concentrate on the higher-order moments.
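A compact sketch of the fixed-point scheme with a tanh nonlinearity and symmetric orthogonalization (one common variant of FastICA) is given below. Here the starting point is a random orthonormal matrix rather than the PCA-based initialization discussed above, and the function is illustrative, not the implementation used in the experiments.

```python
import numpy as np

def fastica(x, n_comp, max_iter=200, tol=1e-6, seed=0):
    """Fixed-point ICA (tanh nonlinearity) on PCA-whitened data."""
    rng = np.random.default_rng(seed)
    xc = x - x.mean(axis=0)
    # PCA pre-whitening: keep the n_comp main eigenvectors, unit variance
    lam, E = np.linalg.eigh(np.cov(xc, rowvar=False))
    idx = np.argsort(lam)[::-1][:n_comp]
    z = xc @ (E[:, idx] / np.sqrt(lam[idx]))
    # random orthonormal starting point (the text instead suggests PCA's A)
    u, _, vt = np.linalg.svd(rng.standard_normal((n_comp, n_comp)))
    W = u @ vt
    for _ in range(max_iter):
        g = np.tanh(z @ W.T)
        g_prime = 1.0 - g ** 2
        # parallel fixed-point update: E[z g(Wz)] - E[g'(Wz)] W
        W_new = (g.T @ z) / len(z) - np.diag(g_prime.mean(axis=0)) @ W
        u, _, vt = np.linalg.svd(W_new)     # symmetric decorrelation
        W_new = u @ vt
        if np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0)) < tol:
            W = W_new
            break
        W = W_new
    return z @ W.T      # estimated components (up to sign and order)
```

As with any ICA estimate, the recovered components are determined only up to permutation, sign and scale.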

2.9.4 Illustrative example

Figure 2.21 illustrates the use of PCA, CCA and ICA to find an optimal latent
space representation of the motion and to reduce the dimensionality of the
dataset. In this example, we see that CCA and ICA could present an advantage for
the further encoding of the data in GMM or HMM: the different segments
characterizing the trajectories are better separated when using CCA or ICA.



Figure 2.21: Top: Illustration of the 3-dimensional dataset in the original data space. Bottom: Projection of the data onto a 2-dimensional latent space using either PCA, CCA or ICA.

2.9.5 Discussion on the different projection techniques

PCA, CCA and ICA share the common constraint of finding a linear transformation
$A$ that produces mutually uncorrelated components $\xi_s$. Uncorrelatedness is
a reasonable constraint since the components $\xi_s$ are assumed to be
independent, but it is still a weaker assumption than statistical independence.
Uncorrelatedness can be achieved by different linear transformations, and
therefore requires another criterion to single out a solution. While PCA
maximizes variance (the energy of the signal), ICA minimizes Gaussianity (or
another related measure), and CCA maximizes the autocorrelation of the resulting
components.

The disadvantage of PCA is that it strongly depends on the range of values

(or magnitude) of the different components. This is a severe drawback when

considering a robot endowed with multiple sensors collecting various information

and where the variance of each sensory component is unrelated to the impor-

tance of the received information. For example, if one would collect Cartesian

and joint angle data and perform PCA on this joint dataset, the directions

of the principal components could vary depending on the joint angle measure

adopted (e.g., degrees or radians). Even by using a single modality (e.g., joint

angles), since each joint angle is characterized by its own range of values, PCA

can lose important information concerning the joints with small ranges of mo-

tion. For example, the shoulder humeral rotation (axial rotation) usually shows

63

Page 76: lasa.epfl.chlasa.epfl.ch/publications/uploadedFiles/EPFL_TH3814.pdf · POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES PAR ingénieur en microtechnique diplômé EPF de nationalité

low variability but is really important for manipulation skills (e.g., to pass an

object from one hand to the other). A solution to this issue would be to nor-

malize the different joint angles with respect to their magnitude. However, this

measure can be task-dependent or difficult to define properly. With PCA, the

first component gives information about the direction of the following compo-

nent (perpendicular to the first direction), which is not the case for ICA where

the components are truly independent, i.e., no information about the second

component is given by the first component.

Correlation measures relations across the components that are invariant to
their magnitude. While both PCA and ICA are concerned with the shape of the
distribution of the data (variance and non-Gaussianity), CCA focuses on the
temporal characteristics of the signals. Indeed, an important property of
canonical correlations is that they are invariant to affine transformations,
and thus invariant to scalings of the signals. In both PCA and ICA, the temporal
correlations are not taken into account, i.e., the solutions are unchanged if
the data are re-ordered arbitrarily. Thus, depending on the data, ICA makes the
BSS problem unnecessarily difficult. By ignoring the temporal relations within
the signals, relevant information is discarded. This is a drawback when
analyzing human motion data, since these signals carry important temporal
information causing autocorrelation in the signals. CCA can then solve the BSS
problem with less computational effort by utilizing this autocorrelation
information.

However, in spite of these theoretical considerations, we already note here
that the further tests performed with real-world data (presented in Section 4.3)
will reveal that CCA and ICA in fact offer very little advantage over PCA for
the datasets considered, at least when the data are further processed by GMM
or HMM.

2.10 Number of components in GMM/HMM

and initialization

Until now, we have considered that the optimal number of Gaussian components
in each GMM (or HMM) used in the previous sections was known in advance.
In this section, we discuss how this number of components can be evaluated.
We also discuss the subsequent problem of initializing the Gaussian components
so that the iterative EM update algorithm (Section 2.8.1) can converge to an
appropriate local optimum (in a maximum likelihood sense). We first present one
of the most common criteria used for model selection, the Bayesian Information
Criterion (BIC), and then suggest an alternative approach based on curvature
information, which exploits the particular continuous properties of human
gestures.


Figure 2.22: Selection of the optimal number of components for an HMM. The selection is performed in a static phase (estimation of several Gaussian Mixture Models). The dynamic analysis is then provided by a single estimation of a Hidden Markov Model, using the number of states found by the Bayesian Information Criterion.

2.10.1 Model selection based on Bayesian Information

Criterion (BIC)

A drawback of EM is that the optimal number of components $K$ in a model
may not be known beforehand, and a tradeoff must be found between optimizing
the model's likelihood (a measure of how well the model fits the data) and
minimizing the number of parameters (i.e., the number of states used to encode
the data). One common method consists of estimating multiple models with an
increasing number of components, and selecting an optimum based on some model
selection criterion (Vlassis & Likas, 2002). Cross-validation is a common way
to do this, but it has the disadvantage of requiring additional demonstrations
to form a test set, which is inconvenient in RbD. To avoid this process, several
criteria based on information theory have been proposed; among them, the
Bayesian Information Criterion (BIC) (Schwarz, 1978) is one of the most commonly
used.¹³ The BIC score is defined as

$$S_{BIC} = -\mathcal{L} + \frac{n_p}{2}\, \log(N),$$

where $\mathcal{L}$ is the log-likelihood of the model (Section 2.7.1) using the
demonstrations as testing set, $n_p$ is the number of free parameters required
for a GMM of $K$ components, i.e.,
$n_p = (K-1) + K\left(D + \frac{1}{2}D(D+1)\right)$ for a GMM with full
covariance matrices, and $N$ is the number of $D$-dimensional datapoints. The
first term of the equation measures how well the model fits the data, while the
second term is a penalty factor aiming at minimizing the number of parameters.
Thus, multiple GMMs are estimated and the BIC score is used to select the
optimal number of GMM components $K$.
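For instance, the score can be evaluated for a range of candidate models as follows. The log-likelihood values below are made up for illustration; in practice they come from GMMs trained with increasing $K$.

```python
import numpy as np

def bic_score(log_lik, K, D, N):
    """BIC score for a full-covariance GMM: -L + (n_p / 2) log(N)."""
    n_p = (K - 1) + K * (D + 0.5 * D * (D + 1))
    return -log_lik + 0.5 * n_p * np.log(N)

# Hypothetical log-likelihoods of GMMs trained with K = 1..5 on N = 300
# datapoints in D = 2 dimensions; the likelihood gain saturates after K = 2.
log_liks = {1: -1450.0, 2: -1210.0, 3: -1195.0, 4: -1193.0, 5: -1192.0}
scores = {K: bic_score(L, K, D=2, N=300) for K, L in log_liks.items()}
best_K = min(scores, key=scores.get)   # smallest BIC score wins
```

With these hypothetical values, the penalty term outweighs the marginal likelihood gains beyond two components, so the score selects $K = 2$.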

When using HMM, model selection can be performed in the GMM initializa-

tion phase. Multiple GMMs are thus estimated, initialized by a rough k-means

¹³This criterion is similar to the Minimum Description Length (MDL) criterion.


estimate. The best model is then selected and a single HMM estimation is

performed (Figure 2.22).

2.10.2 Model selection and initialization based on

trajectory curvature segmentation

The main drawback of BIC is that multiple models must be estimated. Moreover,
initialization by k-means is sometimes not optimal when considering temporally
continuous trajectories, because the centers of the clusters are usually defined
initially at random. We first suggest initializing these centers by considering
prior knowledge on the dataset: knowing that the trajectories considered are
bounded in time and equally distributed in time, we can distribute the initial
k-means centers equally in time as well.

Sung (2004) showed that the optimal number of parameters K for regression

is usually lower than for density estimation, which is illustrated by the observa-

tion that regression using different values of K can result in the same regression

function. In this section, we tackle the model selection paradigm in a similar

spirit by using the specific particularities of the data considered in our RbD

framework, namely temporally continuous trajectories, and analyzing the cur-

vature of these trajectories to provide information on the number of components

required for the further encoding in GMM.

As noted by Ghahramani and Jordan (1994), when encoding trajectories,

the mixture of Gaussian distributions competitively partitions the input space

and learns a linear regression surface in each partition. Each Gaussian compo-

nent encodes a portion of the trajectory that is quasi-linear in a hyperspace,

i.e., the first eigenvalue of the covariance matrix is very large compared to the

remaining values (usually more than an order of magnitude). The associated

eigenvector is mainly directed toward the temporal axis (this point will also

be further discussed in Section 2.11.2). Then, the portions of trajectories in-

between two Gaussian distributions present generally sharper turns.14 When

considering temporally continuous trajectories, we thus developed an alterna-

tive to the initialization of GMM by k-means, when considering continuous

trajectories, by using an initialization process based on the segmentation of the

signals with generalized curvature information. We then take advantages of the

temporal information encapsulated in the representation, i.e., by using the se-

quential ordering of the points and the continuous characteristics of the motion

data.

Let us consider a $(D-1)$-dimensional curve $\xi(t)$. We define a Frenet frame
for the curve, which is a moving frame used to naturally describe local
properties of a $C^D$ curve in terms of a local reference system. It is defined
by the set of orthonormal vectors $\{e_1(t), e_2(t), \ldots, e_{D-1}(t)\}$,
constructed using the Gram-Schmidt orthogonalization process. This process
starts with one basis vector $e_1(t)$, defined by the first derivative of
$\xi(t)$:

$$q_1(t) = \frac{\partial \xi(t)}{\partial t}, \qquad e_1(t) = \frac{q_1(t)}{\|q_1(t)\|}. \qquad (2.17)$$

¹⁴Note that this statement is true for most of the human movements tested, but that it cannot be generalized to some types of regular movements such as circles or helices, which are a particular subset of 2D and 3D signals with constant generalized curvatures.

The next basis vectors are computed by orthogonally projecting the $j$-th
derivative of $\xi(t)$ using the previous basis vectors, in a recursive manner.
Using Equation (2.17), the remaining vectors $\{e_j(t)\}_{j=2}^{D-1}$ are then
recursively computed with

$$q_j(t) = \frac{\partial^j \xi(t)}{\partial t^j} - \sum_{i=1}^{j-1} e_i(t)^\top\!\left(\frac{\partial^j \xi(t)}{\partial t^j}\right) e_i(t), \qquad e_j(t) = \frac{q_j(t)}{\|q_j(t)\|}. \qquad (2.18)$$
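Equation (2.18) is a Gram-Schmidt pass over the successive derivatives of $\xi(t)$ evaluated at one time step. A sketch, with a hypothetical helper name and without the polynomial smoothing used to obtain the derivatives:

```python
import numpy as np

def frenet_frame(derivs):
    """Gram-Schmidt orthonormalization of the successive derivatives of
    xi(t) at a single time step (Eqs. 2.17-2.18); derivs[j] holds the
    (j+1)-th derivative as a vector."""
    basis = []
    for d in derivs:
        q = d.astype(float).copy()
        for e in basis:
            q -= (e @ d) * e            # remove projections on previous vectors
        basis.append(q / np.linalg.norm(q))
    return basis
```
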

Using the parameters defined in Equations (2.17) and (2.18), the generalized
curvatures $\{\chi_j(t)\}_{j=1}^{D-2}$ are defined by

$$\chi_j(t) = \frac{e_{j+1}(t)^\top\!\left(\frac{\partial e_j(t)}{\partial t}\right)}{\left\|\frac{\partial \xi(t)}{\partial t}\right\|}. \qquad (2.19)$$

For a $(D-1)$-dimensional curve $\xi(t)$, this process involves the computation
of the derivatives of $\xi(t)$ up to order $(D-1)$, and the first derivative of
$\{e_j(t)\}_{j=1}^{D-2}$. To do so, we locally approximate the curve by a
polynomial function of degree $(D-1)$ (smoothing process), using a Gaussian
window of variance $\frac{T}{100}$. The derivatives can then be computed
analytically, and are re-sampled to provide trajectories of $T$ datapoints.¹⁵

For the special case of 3-dimensional paths, e1(t), e2(t), e3(t) define respec-

tively the tangent, normal and binormal vectors, and e3(t) can be computed

using e3(t) = e1(t)× e2(t). The first generalized curvature χ1(t) = κ(t) is called

curvature and measures the deviance of ξ(t) from being a straight line rela-

tive to the osculating plane. The second generalized curvature χ2(t) = τ(t) is

called torsion and measures the deviance of ξ(t) from being a plane curve, i.e.,

if the torsion is null, the curve lies completely in the osculating plane. Then,

curvature is a measure of the rotation of the Frenet frame about the binor-

mal vector e3(t), whereas torsion is a measure of the rotation of the Frenet

frame about the tangent vector e1(t). The Darboux vector ω(t) provides a con-

cise way of interpreting curvature and torsion geometrically, and is defined by

ω(t) = τ(t)e1(t) + κ(t)e3(t) with norm ||ω(t)|| =√

τ(t)2 + κ(t)2 (Kehtarnavaz

15Note that as ξ(t) is represented as a finite sampling of only 100 points, approximation ofthese derivatives by numerical computation does not provide satisfying results.

67

Page 80: lasa.epfl.chlasa.epfl.ch/publications/uploadedFiles/EPFL_TH3814.pdf · POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES PAR ingénieur en microtechnique diplômé EPF de nationalité

& deFigueiredo, 1988). Similarly, using Equation (2.19), we define the norm Ω(t) for a generic (D−1)-dimensional curve as

Ω(t) = √( ∑_{i=1}^{D−2} χ_i(t)² ).

This value captures the multiple features of the curve ξ(t) in terms of generalized curvatures (Kehtarnavaz & deFigueiredo, 1988). Thus, the local maxima of Ω(t) provide points at which to segment the trajectory,16 where the data between two segmentation points represent parts of the trajectory whose direction does not vary much. The initial means and covariances {µ_k, Σ_k}_{k=1}^{K} are then computed by considering the different portions of the signals separated by two consecutive segmentation points.
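As an illustration, the segmentation step above can be sketched for a 3-dimensional path, forming Ω(t) = √(κ(t)² + τ(t)²) from the curvature and torsion and cutting at its local maxima. This is a simplified sketch: it uses smoothed finite differences instead of the local polynomial fit described above, and the function name, the default smoothing width and the peak-detection window are our own assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import argrelmax

def segment_by_total_curvature(xi, sigma=None, order=5):
    """Segment a 3-D trajectory xi (T x 3) at local maxima of the
    total-curvature norm Omega(t) = sqrt(kappa(t)^2 + tau(t)^2)."""
    T = len(xi)
    if sigma is None:
        sigma = T / 100.0                      # smoothing width (assumed value)
    xs = gaussian_filter1d(xi, sigma, axis=0)  # smooth before differentiating
    d1 = np.gradient(xs, axis=0)
    d2 = np.gradient(d1, axis=0)
    d3 = np.gradient(d2, axis=0)
    cross = np.cross(d1, d2)
    speed = np.linalg.norm(d1, axis=1)
    kappa = np.linalg.norm(cross, axis=1) / np.maximum(speed ** 3, 1e-12)
    tau = np.einsum('ij,ij->i', cross, d3) / np.maximum(
        np.linalg.norm(cross, axis=1) ** 2, 1e-12)
    omega = np.sqrt(kappa ** 2 + tau ** 2)
    peaks = argrelmax(omega, order=order)[0]   # interior local maxima of Omega
    # The beginning and end of the trajectory also count as segmentation points.
    return np.concatenate(([0], peaks, [T - 1])), omega
```

A trajectory with a sharp turn then yields a segmentation point near the turn, matching the behaviour illustrated in Figure 2.23.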

In a batch learning mode, when considering multiple demonstrations, the number of segmentation points can vary, as well as their positions. In this case, k-means clustering can be used to generalize over the different examples, where the number of clusters is defined by the most common number of segmentation points extracted across the different trajectories. A multivariate Gaussian is then fit to each portion of the trajectories between two cluster centers, and the number of Gaussian components used in the GMM is then directly defined by the number of clusters minus one.

2.10.3 Illustrative example

Figure 2.23 presents an example of the segmentation process based on the total

curvature information.

2.11 Regularization of the GMM parameters

In this section, we introduce two simple but very helpful regularization schemes

that we will consider throughout the experiments when estimating the GMM

parameters.

2.11.1 Bounding the covariance matrix when learning

the GMM parameters through EM

In a GMM, when the variables are locally highly dependent, the covariance ma-

trix encoding this portion of the signal is close to a singular matrix, resulting in

unstable parameter estimates. The associated drawbacks are twofold: (1) the

trajectory retrieved by regression can present sharp edges and bumpy transi-

tions between two neighboring Gaussian components; (2) the incremental EM

16We consider the beginning and the end of the trajectory as segmentation points.


Figure 2.23: Illustration of the curvature-based segmentation process used to initialize the GMM parameters. Left column: the first three graphs represent the 3D data ξ_{s,1}, ξ_{s,2}, ξ_{s,3} to segment, and the last graph shows the Ω(t) trajectory used to segment the data by extracting the local maxima (represented by points). Right column: the segmented data in 3D Cartesian space, where each segmentation point corresponds to a sharp turn in the motion path.


Figure 2.24: Illustration of the regression problems that might be encountered when one or multiple covariance matrices in the model are close to singular. First row: training with the classic EM algorithm (without bounding the covariance matrices). Second row: training with the EM algorithm while bounding the covariance matrices at each iteration. The first column presents the observation data, the second column the GMM, and the third column the generalized trajectory retrieved by GMR. We see in the first row that the transitions between neighboring Gaussian components overfit the corners around time steps 50 and 80, and that this effect is attenuated in the second row.


Figure 2.25: Illustration of the multimodal distribution issue when estimating the GMM parameters with EM, using k-means with random centers at initialization. Top row: encoding with small temporal variance; bottom row: encoding with large temporal variance (columns: initialization with k-means, after EM convergence, GMR). Here, the number of Gaussian components has been deliberately overestimated to show the influence of the temporal variance on the nature of the distributions retrieved by regression. This effect is softened when only a few Gaussian components are considered.

estimation can fail to update the GMM correctly when new data are available (the covariance matrices close to singularity remain stuck at their current estimate). To ensure numerical stability of the covariance matrix, the simple regularization scheme that we suggest here is to add a constant diagonal matrix at each iteration of the EM algorithm. This means that in the update mechanism described in Section 2.8.1, we replace Σ with Σ + δI, where I is the identity matrix and δ is a scalar defined proportionally to the variance of the data. Figure 2.24 illustrates the advantage for regression of bounding the covariance matrices.
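A minimal sketch of this regularized M-step follows; the function and parameter names are our own, and the proportionality factor `rho` for δ is an assumed value, since the text only states that δ is proportional to the variance of the data.

```python
import numpy as np

def regularized_m_step_covariances(data, resp, means, rho=0.01):
    """M-step covariance update with the diagonal bounding of Section 2.11.1.

    data: (N, D) samples, resp: (N, K) responsibilities, means: (K, D).
    rho scales the regularizer delta proportionally to the data variance
    (an assumed choice; the text does not fix the proportionality factor)."""
    N, D = data.shape
    K = means.shape[0]
    delta = rho * np.mean(np.var(data, axis=0))   # delta proportional to data variance
    covs = np.empty((K, D, D))
    for k in range(K):
        diff = data - means[k]
        w = resp[:, k][:, None]
        covs[k] = (w * diff).T @ diff / resp[:, k].sum()
        covs[k] += delta * np.eye(D)              # Sigma <- Sigma + delta * I
    return covs
```

Adding δI guarantees that every eigenvalue of the updated covariance is at least δ, which keeps the matrices invertible for the regression step.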

2.11.2 Single mode restriction when retrieving data

through GMR

Prior to encoding the motion in a GMM, we also suggest stretching the temporal values so that the temporal variable has a larger range of values than the spatial variables. This can be set by simply considering different units for the representation of the data. For example, when considering joint angles, we define t ∈ {1, 2, …, T} with T = 100 as the temporal range, while the joint angles are


Figure 2.26: Illustration of the importance of the initialization of the GMM parameters to avoid the problem of multi-peaked distributions when performing GMR. Top row: encoding with small temporal variance; bottom row: encoding with large temporal variance (columns: initialization using regular time intervals, after EM convergence, GMR). Here, the GMM parameters are initialized using regular temporal intervals to avoid this effect, and we see that the estimation of the parameters is robust to the different ranges of temporal values. For a comparison, see Figure 2.25, where the GMMs were initialized by k-means with random centers.


expressed in radians, i.e., with a maximum range of values θ ∈ [−π, π].17 The aim is to get rid of the multimodal distributions that could appear during the GMR process. Having multimodal distributions is not a modeling error per se, but such distributions are not handled appropriately by GMR (Figure 2.25). Indeed, at each time step, a single Gaussian distribution is estimated to model the multimodal distribution. As a single covariance estimate is used to describe the distribution of the data, it becomes difficult to extract the task constraints efficiently.

Figure 2.25 shows that initialization of the GMM parameters with k-means (using random initialization of the cluster centers) is influenced by the range of values used to describe the temporal and spatial components. In this figure, an example is presented where the system fails to encode the data efficiently if the temporal range of values is not appropriate. In the first row, where the range of values for the temporal component t is smaller than for the spatial component, we see that EM converged to a solution which is correct in a likelihood sense, but presents poor regression properties. Indeed, when performing GMR, multiple Gaussian distributions describe the data with similar weights, and the expected distributions at these time steps are multimodal (multiple peaks), yet each is estimated by a single distribution. The results of GMR thus show a correct generalized trajectory, but the associated covariance matrices fail to capture relevant information on the task constraints. In the second row, as the range of values for the temporal component ξ_t is an order of magnitude larger than for the spatial component ξ_s, the expected distribution when performing GMR at each time step is close to a unimodal distribution, i.e., the expected distribution presents a single peak.

Another solution to avoid this issue is to slightly change the k-means algorithm by distributing the initial cluster centers equally in time, as suggested in Section 2.10.2. Similarly, we can initialize the GMM parameters by segmenting the trajectories with respect to the temporal values of each datapoint, i.e., by considering each subset of data satisfying (i−1)T/K < ξ_t ≤ iT/K, and use this subset to compute analytically µ_i and Σ_i for each Gaussian distribution i ∈ {1, …, K}.

Figure 2.26 shows that the multi-peaked distributions resulting from the EM learning process disappear when initializing the GMM parameters by regular temporal intervals. Thus, the trajectories resulting from GMR become insensitive to the temporal scaling factor.
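The initialization by regular temporal intervals can be sketched as follows, assuming the time variable is stored in the first column of the data; the function name and the small diagonal term added to each covariance are our own.

```python
import numpy as np

def init_gmm_regular_intervals(data, K):
    """Initialize GMM parameters by splitting the trajectory into K regular
    temporal intervals: the subset with (i-1)T/K < xi_t <= iT/K yields the
    prior, mean and covariance of component i."""
    t = data[:, 0]
    edges = np.linspace(t.min(), t.max(), K + 1)
    edges[0] -= 1e-9                     # include the very first datapoint
    D = data.shape[1]
    priors = np.empty(K)
    means = np.empty((K, D))
    covs = np.empty((K, D, D))
    for i in range(K):
        sub = data[(t > edges[i]) & (t <= edges[i + 1])]
        priors[i] = len(sub) / len(data)
        means[i] = sub.mean(axis=0)
        covs[i] = np.cov(sub.T) + 1e-6 * np.eye(D)  # small safeguard (ours)
    return priors, means, covs
```

Because the components are ordered in time from the start, the subsequent EM refinement is far less likely to converge to the multi-peaked solutions of Figure 2.25.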

17 When the data are previously projected in a latent space of motion through PCA, the range of values can in some cases be approximately an order of magnitude larger than the initial range, but is still lower than the temporal range of values.


2.12 Temporal alignment of the demonstrated trajectories through Dynamic Time Warping (DTW)

We have seen in Sections 2.3 and 2.4 that Hidden Markov Models (HMMs) could be used to encapsulate the temporal variations of the trajectories, previously encoded in Gaussian Mixture Models (GMMs). In Section 2.5, we suggested the use of GMM/GMR as the most appropriate approach to extract the constraints in a continuous form, by encoding the trajectories directly in a mixture of Gaussians and considering the temporal component as an additional dimension. This way, the GMM treats the temporal and spatial components alike, and it becomes possible to retrieve a smooth trajectory analytically through regression. Even if the advantage of GMM/GMR for reproduction and extraction of constraints is clear, there remains a disadvantage concerning the robustness of the model to the temporal variability across the demonstrated trajectories. Indeed, GMR is mainly relevant for extracting variability and correlation information concerning the spatial values, but ideally, it could be useful in some cases to align the different demonstrations automatically before further processing. With HMMs, this was carried out automatically by modeling the temporal variability through transition probabilities.

To provide a similar option when considering a GMM/GMR approach, we thus suggest pre-processing trajectories presenting strong non-homogeneous temporal distortions by aligning them temporally using a pattern-based approach. When required, Dynamic Time Warping (DTW) is thus used as a template-matching pre-processing step to temporally align the trajectories (Chiu, Chao, Wu, & Yang, 2004). DTW is sometimes considered a weaker method than HMM, but it has the advantage of being simple and robust, and can be used conjointly with GMR. DTW finds a non-linear alignment of the demonstrated trajectories with respect to a reference trajectory. A distance table is first built, and a DTW path is searched through the table by a dynamic programming approach, with slope limits to prevent degenerate warps. Here, to improve the computational efficiency of the process, the alignment is performed in the latent space of motion, which is usually of lower dimensionality than the original data space.

Let us consider two trajectories ξ_s^A and ξ_s^B of length T. We define the distance between two datapoints of temporal indices k_1 and k_2 by h(k_1, k_2) = ||ξ_{s,k_1}^A − ξ_{s,k_2}^B||. A warping path S = {s_l}_{l=1}^{L} is defined by L elements s_l = {k_1, k_2}. The warping path is subject to several constraints. Boundary conditions are given by s_1 = {1, 1} and s_L = {T, T}. If s_l = {a, b} and s_{l−1} = {a′, b′}, monotonicity is given by a ≥ a′ and b ≥ b′, while continuity is defined by a − a′ ≤ 1 and b − b′ ≤ 1. Dynamic programming is used to minimize ∑_{l=1}^{L} h(s_l), by defining the cumulative distance γ(k_1, k_2) as the distance h(k_1, k_2) found in the current


Figure 2.27: Illustration of the GMM parameter estimation (top) and the GMR retrieval process (bottom) when the different trajectories are not aligned temporally (left column) and when they are pre-processed by DTW (right column).

cell and the minimum of the cumulative distances of the adjacent elements, by defining the following recursion.

Initialization:
γ(1, 1) = 0.

Induction:
γ(k_1, k_2) = h(k_1, k_2) + min[ γ(k_1−1, k_2−1), γ(k_1−1, k_2), γ(k_1, k_2−1) ].

Global and local constraints are usually defined to reduce the computational cost of the algorithm and to limit the permissible warping paths. Here, we experimentally fixed an adjustment window condition and a slope constraint condition defining the maximum amount of warping allowed, as in Sakoe and Chiba (1978). Knowing the warping path, it is then possible to align one of the trajectories by taking the other trajectory as reference. In our framework, GMR is used to retrieve a trajectory from the current GMM, which is then used as a reference trajectory to align the newly observed trajectories.
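The recursion above, together with the backtracking that recovers the warping path, can be sketched as follows; the function name and the optional Sakoe-Chiba `band` parameter are our own, and the slope constraints mentioned above are omitted for brevity.

```python
import numpy as np

def dtw_align(a, b, band=None):
    """DTW alignment of trajectory b onto reference a (both (T, D) arrays).

    Implements the recursion of Section 2.12: gamma(1,1) = 0, then
    gamma(k1,k2) = h(k1,k2) + min(diagonal, vertical, horizontal), with an
    optional adjustment window `band`. Returns the warping path as a list of
    (k1, k2) index pairs from (0, 0) to (T-1, T-1)."""
    T = len(a)
    h = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # distance table
    gamma = np.full((T, T), np.inf)
    gamma[0, 0] = 0.0
    for i in range(T):
        lo, hi = (0, T) if band is None else (max(0, i - band), min(T, i + band + 1))
        for j in range(lo, hi):
            if i == 0 and j == 0:
                continue
            prev = min(gamma[i - 1, j - 1] if i and j else np.inf,
                       gamma[i - 1, j] if i else np.inf,
                       gamma[i, j - 1] if j else np.inf)
            gamma[i, j] = h[i, j] + prev
    # Backtrack from the end of the table to recover the warping path.
    path, i, j = [(T - 1, T - 1)], T - 1, T - 1
    while i or j:
        _, i, j = min((gamma[i - 1, j - 1] if i and j else np.inf, i - 1, j - 1),
                      (gamma[i - 1, j] if i else np.inf, i - 1, j),
                      (gamma[i, j - 1] if j else np.inf, i, j - 1))
        path.append((i, j))
    return path[::-1]
```

Aligning a trajectory with itself returns the diagonal path, as expected for two identical demonstrations.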

2.12.1 Illustrative example

Figure 2.27 shows an example of temporal alignment of the different demonstrations, where the model is trained incrementally. For each newly observed trajectory, a generalized trajectory is computed by regression using the current model, and is then used as a reference trajectory to re-align the observed trajectory. The GMM is then updated with this new aligned observation. We see that without DTW pre-processing (left column), the system has difficulties


in extracting an efficient encoding of the trajectories and associated constraints, due to the strong temporal distortion of the trajectories. Note that the generalized trajectory retrieved through GMR still follows the main characteristics of the demonstrated trajectories, but the local minima and maxima of the original data are not attained, i.e., the peaks of the retrieved trajectory are overly smoothed. With DTW (right column), the essential features of the trajectories are encoded more efficiently in the GMM, and the trajectory retrieved by GMR follows the characteristics of the demonstrated trajectories, keeping the local minima and maxima as observed during the demonstrations. However, we also see that the transitions between two consecutive Gaussian distributions are more abrupt, because the alignment of the trajectories results in elongated Gaussian distributions. This drawback of DTW can still be reduced by bounding the covariance matrices after each EM iteration (Section 2.11.1).

2.13 Use of prior information

We have seen in Section 2.5 that we can extract the constraints of a task through GMR. When considering multiple constraints, we have seen that we can multiply the Gaussian distributions or derive a metric of imitation to find a controller taking both constraints into account equally. We now discuss how we can add a priori information on the relative importance of these constraints without having to extract these constraints by statistics. To do so, we consider the metric of imitation presented in Section 2.6.2, and redefine Equation (2.9) by adding a priori information on the constraints

H = w_1 (ξ_s − ξ_s^{(1)})^⊤ (Σ_ss^{(1)})^{−1} (ξ_s − ξ_s^{(1)}) + w_2 (ξ_s − ξ_s^{(2)})^⊤ (Σ_ss^{(2)})^{−1} (ξ_s − ξ_s^{(2)}).   (2.20)

Here, w_1 and w_2 define respectively the weights of the first and second constraints considered for the task. Similarly, by using the direct computation method presented in Section 2.6.1, prior information can be defined for each constraint by directly multiplying each covariance matrix by the inverse of the corresponding weight. Note that letting the priors vary along the motion could also be used as a blending approach to smoothly mix different constraints.
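As a sketch of this weighted direct computation (the function name is our own), scaling each covariance by the inverse of its weight before taking the product of the two Gaussians gives:

```python
import numpy as np

def weighted_gaussian_product(mu1, cov1, mu2, cov2, w1=1.0, w2=1.0):
    """Product of two Gaussians with prior weights, as in Section 2.13:
    each covariance is scaled by the inverse of its weight, so a larger
    weight enforces the corresponding constraint more strongly.
    Returns the mean and covariance of the resulting Gaussian."""
    p1 = w1 * np.linalg.inv(cov1)   # precision of N(mu1, cov1 / w1)
    p2 = w2 * np.linalg.inv(cov2)
    cov = np.linalg.inv(p1 + p2)
    mu = cov @ (p1 @ mu1 + p2 @ mu2)
    return mu, cov
```

With equal weights the result is the unweighted Gaussian product; increasing w_1 pulls the mean toward the first constraint, as illustrated by the three panels of Figure 2.28.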

2.13.1 Illustrative example

Figure 2.28 presents the use of priors to modify the reproduction of a skill composed of two separate constraints (see also Figure 2.14). The direct computation method presented in Section 2.6.1 is used, where the covariances w_1^{−1} Σ_ss^{(1)} and w_2^{−1} Σ_ss^{(2)} are used to compute a controller for the reproduction of the task. Note that the weights satisfy the relation w_1^{−1} w_2^{−1} = 1.


Figure 2.28: Illustration of the retrieval process through GMR when using different priors (left: w_1 = 20, w_2 = 0.05; middle: w_1 = 1, w_2 = 1; right: w_1 = 0.05, w_2 = 20).

2.14 Extension to mixture models of different density distributions

We have seen that the GMM is a useful probabilistic representation to encode continuous signals. However, as the sensory system of the robot is not composed solely of continuous signals, it would be useful to extend the proposed framework to other types of signals. Indeed, the Gaussian distribution is only a special case of density function when considering a more general mixture-of-experts framework, and mixture modelling is a popular approach for density approximation of continuous but also of binary or discrete data (McLachlan & Peel, 2000). We briefly present here a possible extension of the model to the use of binary signals, which is of particular interest for the robots that we will consider in our experiments, since the sensors and actuation of the hands are represented by a multivariate on/off binary signal, where the two-dimensional binary trajectories represent the status of the left and right hand.

2.14.1 Generalization of binary signals through Bernoulli Mixture Model (BMM)

We now consider ξ_s ∈ R^{N×2} as the dataset representing the activities of the two hands (binary signals defining the open/close status of the two hands). To encode in a probabilistic framework a set of N binary datapoints ξ_s = {ξ_{s,j}}_{j=1}^{N}, where each datapoint j is of dimensionality (D−1), i.e., ξ_{s,j} = {ξ_{s,j,i}}_{i=1}^{D−1}, a mixture of multivariate Bernoulli distributions can be used, similarly to the use of a mixture of multivariate Gaussian distributions for encoding continuous data (Carreira-Perpinan, 2002). The dependencies across the binary data are captured thanks to the contribution of the different components of the mixture. A mixture of (D−1)-dimensional Bernoulli density functions of parameters


Figure 2.29: Illustration of binary signal encoding in BMM, where the reproduction (thick line) is estimated at each time step by computing [µ_k], where [·] is the notation for the nearest integer function.

or prototypes {µ_{k,i}}_{i=1}^{D−1}, with µ_{k,i} ∈ [0, 1], is defined by Equation (2.1) and

p(k) = π_k,
p(ξ_{s,j}|k) = B(ξ_{s,j}; µ_k) = ∏_{i=1}^{D−1} (µ_{k,i})^{ξ_{s,j,i}} (1 − µ_{k,i})^{1−ξ_{s,j,i}}.

Similarly to the estimation of the GMM parameters (Section 2.8.1), the parameters {π_k, µ_k} are estimated by first defining the variables p_{k,j} = p(k|ξ_j) and E_k^{(t+1)} = ∑_{j=1}^{N} p_{k,j}^{(t+1)}, and then using the EM algorithm to learn the parameters iteratively.

E-step:
p_{k,j}^{(t+1)} = π_k^{(t)} B(ξ_{s,j}; µ_k^{(t)}) / ∑_{i=1}^{K} π_i^{(t)} B(ξ_{s,j}; µ_i^{(t)}),
E_k^{(t+1)} = ∑_{j=1}^{N} p_{k,j}^{(t+1)}.

M-step:
π_k^{(t+1)} = E_k^{(t+1)} / N,
µ_k^{(t+1)} = ∑_{j=1}^{N} p_{k,j}^{(t+1)} ξ_{s,j} / E_k^{(t+1)}.
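The E- and M-steps above can be sketched as follows; the log-domain computation and the clipping of µ away from 0 and 1 are our own numerical safeguards, and the function and parameter names are assumptions.

```python
import numpy as np

def bmm_em(x, K, n_iter=50, seed=0):
    """EM for a Bernoulli Mixture Model, following the E/M updates above.

    x: (N, D) binary array; returns priors pi (K,) and prototypes mu (K, D)."""
    rng = np.random.default_rng(seed)
    N, D = x.shape
    pi = np.full(K, 1.0 / K)
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # keep away from 0/1 initially
    for _ in range(n_iter):
        # E-step: responsibilities p_{k,j} proportional to pi_k * B(x_j; mu_k),
        # computed in the log domain for numerical stability (our safeguard).
        log_b = x @ np.log(mu).T + (1 - x) @ np.log(1 - mu).T   # (N, K)
        log_p = np.log(pi) + log_b
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)
        # M-step: E_k = sum_j p_{k,j}; pi_k = E_k / N; mu_k = weighted mean.
        E = p.sum(axis=0)
        pi = E / N
        mu = np.clip((p.T @ x) / E[:, None], 1e-6, 1 - 1e-6)
    return pi, mu
```

On data drawn from two well-separated prototypes, the recovered prototypes approach the generating ones up to component relabeling.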

Illustrative example

Figure 2.29 presents an illustrative example of encoding through BMM.


Figure 2.30: Summary of the different methods proposed to encode a set of trajectories, extract the constraints and reproduce a generalized version of the trajectories, using an HMM keypoints approach (Section 2.4.3, 5 states), an HMM continuous approach (Section 2.4.4, 6 states) and a GMM/GMR approach (Section 2.5, 6 states). For each method, the columns show the data, the model and the reproduction. The dataset used here is the one presented in Figure 2.21 after projection by CCA.


Table 2.4: Summary of the advantages/disadvantages of the different encoding approaches from a theoretical and computational point of view, using either an HMM keypoints approach (Section 2.4.3), an HMM continuous approach (Section 2.4.4) or a DTW/GMM/GMR approach (Section 2.5).

                                             HMM keypoints   HMM continuous   DTW, GMM and GMR
Example of the use of the algorithm          Section 5.2     Section 4.2      Sections 5.3-5.4
Representation of the temporal distortions   (1)             (2)              (3)
Robustness of the encoding process           (4)
Extraction of constraints                    (5)             (6)              (7)
Reproduction capabilities                    (8)             (9)              (10)

(1) Corresponding keypoints may be grouped incorrectly in different states of the HMM for strong temporal distortions across the demonstrations.

(2) Double stochastic process to deal efficiently with temporal and spatial variations (but also more parameters to learn simultaneously, compared to GMM).

(3) Combination of a pattern-based approach to deal with temporal variation (pre-processing by DTW) and a model-based approach to deal with spatial variation (encoding through GMM). The main disadvantage over HMM is that several parameters must be tuned for DTW.

(4) The extraction of keypoints requires a pre-processing phase sensitive to the parameters of the segmentation process.

(5) The constraints are extracted only for the keypoints previously extracted.

(6) The variability and correlation information is represented only for each state of the HMM.

(7) The variability and correlation information takes a continuous form and can be evaluated at any place along the trajectory through GMR.

(8) Generative model (or retrieval of a sequence of states by observation) sensitive to the interpolation function used.

(9) Generative model (or retrieval of a sequence of states by observation) sensitive to the interpolation function used, where information at both extremities of the trajectory is lost.

(10) Smooth reproduction using covariance information and simple handling of multiple constraints.


2.15 Summary of the chapter

Throughout this chapter, we have presented different methods to encode multiple demonstrations of a motion in a probabilistic framework. We have shown that the probabilistic representation can be used to extract the relevant constraints of a skill in a continuous form, and to reproduce a skill by generalizing over the different demonstrations encapsulated in the models. The datasets used in this chapter consisted of simple low-dimensional trajectories to illustrate the different processing steps of the algorithms and highlight their important characteristics. The applications of these algorithms in human-robot interaction experiments with real-world human motion data will be presented further in Chapters 4 and 5.

Figure 2.30 presents a summary of the different encoding schemes developed throughout this chapter to encode and retrieve a dataset consisting of several trajectories, using either an HMM keypoints approach (Section 2.4.3), an HMM continuous approach (Section 2.4.4) or a GMM/GMR approach (Section 2.5). This example highlights the drawbacks of the different approaches. We see in the first row that the HMM keypoints approach fails to retrieve an efficient generalization of the data due to an incorrect segmentation of the data into keypoints. Indeed, the quality of the reproduction drops significantly when one of the keypoints is incorrectly retrieved in the reproduction process, as we can see in this example where an additional keypoint (state 5) should be inserted in-between state 1 and state 2. With the HMM keypoints approach, the reproduction process thus greatly depends on the quality of the segmentation process, which is a severe drawback because it requires fine-tuning several parameters depending on the trajectories considered. Moreover, the encoding of the keypoints in Gaussian distributions does not provide information on the correlation constraints across the different variables.

We see in the second row of Figure 2.30 that the HMM continuous approach also fails to reproduce a satisfying generalization of the trajectories, principally because only the centers of the Gaussian distributions are used for retrieving a trajectory by interpolation. Indeed, when the trajectories present long subparts with low curvature (quasi-straight segments in a hyperspace), as we can see in this example, this encoding scheme does not produce satisfying results because the relevant information contained in the covariance matrices is not used during the reproduction process. The beginning and the end of the trajectories are also truncated when using this method.

We finally see in the third row of Figure 2.30 that the GMM/GMR approach produces the most satisfying results for the reproduction. This is mainly due to the use of covariance information (contained in the GMM representation) for the retrieval process. Moreover, this encoding scheme makes it possible to represent local constraints continuously along the trajectory in the form of covariance matrices.

Table 2.4 presents a summary of the advantages/disadvantages of the different approaches from a theoretical and computational point of view. For our application, the combination of DTW, GMM and GMR to extract the constraints of a task and to reproduce a generalized version of the demonstrated trajectories is clearly advantageous compared to HMM approaches, due to the analytic nature of the regression process. Even if the representations of constraints based on HMMs are not the most optimal solution, we preferred to present them in this thesis to show the reasons that led us to the choice of GMM/GMR. Some of the experiments that will be presented further in Chapters 4 and 5 use HMM approaches to encode the constraints. We have chosen to present these experiments because they show useful results for RbD, even if the methods used at the time of the experiments were probably not optimal in terms of encoding of the constraints. The most up-to-date experiments using a GMM/GMR approach will be presented further in Section 5.4.


3 Experimental setup

3.1 Outline of the chapter

This chapter is organized as follows:

• Section 3.2 presents the two robots used in the experiments, together with the kinesthetic teaching and moulding processes.

• Section 3.3 presents the stereoscopic vision system used to track the ob-

jects and the hands.

• Section 3.4 presents the motion sensors used to track the user’s body

gestures.

3.2 Humanoid robots

The experiments are conducted using two humanoid robots, HOAP-2 and HOAP-3, from the Fujitsu company (Figure 3.1), where HOAP stands for Humanoid for Open Architecture Platform.1 The HOAP-2 has 25 degrees of freedom (DOFs), of which only the 13 DOFs of the upper torso are used in the experiments. The HOAP-3 has 28 DOFs, of which only the 16 DOFs of the upper torso are used in the experiments (Figure 3.2). Compared to the HOAP-2, the HOAP-3 is provided with additional degrees of freedom (axial rotations

1 These robots are commercially available; see http://www.automation.fujitsu.com/ for more information.

Figure 3.1: HOAP-2 (left) and HOAP-3 (right) humanoid platforms used in the experiments.


Figure 3.2: Illustration of the DOFs for the upper torso of the HOAP-3. The backdrivable DOFs are represented in white and the DOFs that can only be controlled by the robot are represented in grey.

Figure 3.3: Processing, protocols and communication channels used by the human-robot interaction setup with the HOAP-3 robot (built-in stereo vision, microphones and sound speaker; X-Sens motion sensors and sound processing in Matlab; stereo-vision processing with OpenCV; real-time control under Linux; simulator screen display; communication through USB, Bluetooth, UDP and VGA).


of the wrists and tilt rotation of the head). The remaining DOFs of the legs

are set to a constant position so as to support the robot in a sitting posture. A

dynamic simulator provided by Fujitsu has been used occasionally to play joint

angle trajectories during a debugging phase before controlling the real robot.

Two webcams in the HOAP-3's head are used to track objects in 3D Cartesian

space. An external stereoscopic vision system is used for the experiments with

HOAP-2 to track the position of the hands based on skin color detection and

the initial positions of objects of specific colors. Alternatively, the initial positions of the objects can be recorded through a moulding process: the teacher grabs one of the robot's arms, moves it toward the object, places the robot's palm around the object and presses its fingers against it to let the robot feel that an object is currently in its hand. When the object touches the palm, a force sensor inside the robot's palm triggers a moulding behaviour, i.e., when the force sensor retrieves a value above a given threshold, the robot briefly grasps and releases the object while registering its position in 3D Cartesian space.
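The moulding behaviour can be sketched as a simple threshold trigger. This is an illustrative sketch only: the force reading, threshold value, and position registration below are hypothetical placeholders, not the actual HOAP controller code.

```python
def moulding_step(force_value, hand_position, threshold=0.5, registered=None):
    """One step of the moulding behaviour: if the palm force sensor exceeds
    the threshold, the robot would briefly grasp and release the object and
    register its 3D position. Threshold and units are hypothetical."""
    registered = [] if registered is None else registered
    if force_value > threshold:
        # On the real robot, a grasp-and-release would be commanded here.
        registered.append(tuple(hand_position))
    return registered

# Only the reading above the threshold triggers a registration
positions = moulding_step(0.2, (0.10, 0.20, 0.30))
positions = moulding_step(0.8, (0.15, 0.22, 0.31), registered=positions)
```

The threshold keeps spurious palm contacts from triggering the grasp-and-release routine.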

The kinesthetic teaching process consists in using the motor encoders of the

robot to record information while the teacher moves the robot’s arms. The

teacher selects the motors that he/she wants to control manually by slightly

moving the corresponding motors just a few milliseconds before the reproduction

starts. The selected motors are set to passive mode, which allows the user to

move freely the corresponding degrees of freedom while the robot executes the

task. The interaction when embodying the robot is more playful than using a

graphical simulation and it presents the advantage for the user to implicitly feel

the robot’s limitations in its real-world environment. In this way, the teacher

can provide partial demonstrations while the kinematics of each joint motion

are recorded at a rate of 1000 Hz. The trajectories are then resampled to a

fixed number of points T = 100. The robot is provided with motor encoders for

every DOF, except for the hands and the head actuators (Figure 3.2).
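The resampling to a fixed T = 100 can be sketched by linear interpolation over a normalized time axis; the NumPy implementation below is an illustrative assumption (the original processing was done in Matlab).

```python
import numpy as np

def resample(trajectory, T=100):
    """Resample an (N x D) trajectory to T datapoints by linear
    interpolation over a normalized time axis."""
    trajectory = np.asarray(trajectory, dtype=float)
    t_old = np.linspace(0.0, 1.0, trajectory.shape[0])
    t_new = np.linspace(0.0, 1.0, T)
    return np.column_stack([np.interp(t_new, t_old, trajectory[:, d])
                            for d in range(trajectory.shape[1])])

# A 1000-sample recording of 4 joint angles becomes a 100 x 4 array
raw = np.cumsum(np.random.default_rng(0).normal(size=(1000, 4)), axis=0)
resampled = resample(raw, T=100)
```

Linear interpolation preserves the trajectory endpoints, which matters when demonstrations must be aligned for later statistical encoding.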

Figure 3.3 shows the different processing and communication channels used

by HOAP-3. The robot is controlled under a Real-time Linux and Linux en-

vironment with motor controllers implemented in C. The recognition, encoding

and retrieval functions, as well as the processing of the data coming from the

motion sensors are implemented in Matlab, which communicates with the robot

through the standard User Datagram Protocol (UDP) communication channel.

3.3 Stereoscopic vision

Two webcams are used to track a set of objects in 3D Cartesian space, based

on color matching in YCbCr color space, where only the Cb and Cr channels are used to be

robust to changes in luminosity. The images of 320 × 240 pixels are processed

at a frame rate of 15 Hz by the OpenCV vision processing software, where each object to track is pre-defined in a calibration phase by fitting a Gaussian distribution on the CbCr subspace characterizing the color of the object.

Figure 3.4: Illustrations of the use of the external stereoscopic vision system with HOAP-2 to track hand paths and objects in the environment.

Then,

tracking is performed by using an adaptive search window similar to the Contin-

uously Adaptive Mean Shift (CamShift) algorithm (Bradski, 1998). Two exter-

nal webcams (Philips ToUCam Pro II) are used in the HOAP-2 setup (Figure

3.4), while two webcams integrated in the robot’s head (Logitech QuickCam)

are used for HOAP-3. To calibrate the webcams, the intrinsic and extrinsic

parameters are estimated by following the method proposed by Zhang (2000).

The error obtained between various positions of an object measured in a 3D

Cartesian space by the vision system and their real positions is 5.9 ± 2.8 mm

for the HOAP-2 setup in the configuration presented in Figure 3.4. Using the

HOAP-3 setup, the error is increased to 10.0 ± 5.7 mm, which is explained by

the reconstruction error on the depth value due to the closeness and alignment

of the cameras in this configuration.2
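The calibration and matching steps can be sketched as follows: a Gaussian is fitted on the CbCr values of an object's calibration pixels, and new pixels are then scored by their density under that Gaussian. The NumPy code and the sample CbCr values are illustrative, not the OpenCV implementation used in the experiments.

```python
import numpy as np

def fit_color_model(cbcr_samples):
    """Fit a 2D Gaussian on the CbCr values of an object's calibration pixels."""
    X = np.asarray(cbcr_samples, dtype=float)
    return X.mean(axis=0), np.cov(X.T)

def color_likelihood(cbcr, mu, sigma):
    """Gaussian density of a pixel's CbCr value under the color model."""
    d = np.asarray(cbcr, dtype=float) - mu
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d)

# Hypothetical CbCr calibration pixels of a reddish object
samples = [[110, 180], [112, 178], [108, 181], [111, 179]]
mu, sigma = fit_color_model(samples)
```

Tracking would then evaluate this density over the pixels inside an adaptive search window, CamShift-style, and move the window towards the density peak.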

3.4 Motion sensors

Motion sensors are used in our experiment to record the user’s gestures by

collecting joint angle trajectories of the upper torso. The user's movements

are recorded by up to 8 X-Sens motion sensors attached to the torso, upper-

arms, lower-arms, hands (at the level of the fingers) and back of the head.3 Each

sensor provides the 3D absolute orientation of each segment by integrating the

3D rate-of-turn, acceleration and earth-magnetic field at a rate of 50 Hz and

with a precision of 1.5 degrees. The data are sent to a computer for further

processing by wireless Bluetooth communication (or by USB connection). For

each joint, a rotation matrix is defined as the orientation of a distal limb segment

expressed in the frame of reference of its proximal limb segment. The kinematic

2 For a complete description, the interested reader can refer to the technical report describing the technical and evaluation aspects of the vision system developed here (Dubach, 2004).

3 These motion sensors are commercially available; see http://www.xsens.com for more information.

Figure 3.5: Illustration of the use of X-Sens motion sensors to record the user's gestures by collecting joint angle trajectories of the upper torso.

motion of each joint can then be computed by decomposing the rotation matrix

into joint angles (Figures 3.5 and 3.6).

Let us assume that the motion sensors return the orientation matrices RW→1 and RW→2, representing respectively the orientation of the proximal segment and of the distal segment, both expressed in a static world referential W. We define the orientation of the distal segment with respect to the proximal segment as R1→2, using the relation

RW→2 = RW→1 R1→2  ⇔  R1→2 = (RW→1)⁻¹ RW→2.
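The relation above can be checked numerically; in code, the inverse of a rotation matrix is simply its transpose. The example rotations below are illustrative.

```python
import numpy as np

def rot_z(a):
    """Rotation matrix of angle a about the Z axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def relative_rotation(R_w1, R_w2):
    """R_{1->2} = (R_{W->1})^{-1} R_{W->2}; for rotation matrices the
    inverse is the transpose."""
    return R_w1.T @ R_w2

# Two world-frame orientations differing by a rotation of pi/6 about Z
R_w1 = rot_z(0.4)
R_w2 = R_w1 @ rot_z(np.pi / 6)
R_12 = relative_rotation(R_w1, R_w2)
```

Expressing the distal segment in the proximal frame is what allows the joint angles to be read off independently of the body's orientation in the world.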

3.4.1 Calibration of the sensors

For each joint, we define a calibration posture by a rotation matrix R0. The set of matrices for all the joints defines a natural posture of the body. During the calibration process, the user is requested to set his/her upper body into a posture copying the posture R0 of a stick figure displayed on a screen. To be easily reproducible, the posture R0 corresponds to a standing posture where the person looks straight ahead, with the arms hanging along the body. For each segment, the rotation matrix recorded by the sensors during the calibration phase is denoted R̄0. The rotation error R∆ between the desired calibration posture R0 and the recorded calibration posture R̄0 is obtained by expressing R0 in the referential of R̄0, i.e., by computing R∆ = R0 R̄0⁻¹. After calibration, for each newly acquired posture, let R̄ be the rotation matrix recorded by the motion sensors. Further processing then uses the corrected rotation matrix R, which takes the rotation error R∆ into account:

R = R∆ R̄ = R0 R̄0⁻¹ R̄.
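The calibration correction can be sketched as follows, assuming the sensor error is a constant offset (here a fixed rotation about Z, purely illustrative): the rotation error is computed once from the calibration posture and then applied to every subsequent reading.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Desired calibration posture R0, and what the sensor reports for it
# when mounted with a constant 0.2 rad offset (illustrative).
offset = rot_z(0.2)
R0 = np.eye(3)
R0_bar = offset.T @ R0                  # recorded calibration posture

# Rotation error computed once from the calibration posture
R_delta = R0 @ R0_bar.T                 # R_delta = R0 * inv(R0_bar)

# A new posture: the true orientation, recorded with the same offset
R_true = rot_z(0.7)
R_bar = offset.T @ R_true
R_corrected = R_delta @ R_bar           # recovers the true orientation
```

The sketch only compensates a constant left-multiplied mounting error; sensor drift over time would require re-calibration.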

3.4.2 Defining joint angles from a rotation matrix

We then decompose this rotation matrix R as a succession of three rotations about different axes. We consider a decomposition order following the Cardanic convention (or Tait-Bryan convention),4 defined by 3 consecutive rotations Rx(φ) → Ry(θ) → Rz(ψ) around the axes X, Y′ and Z″, i.e.,

R = R_z(\psi) R_y(\theta) R_x(\phi)

  = \begin{pmatrix} c_\psi & -s_\psi & 0 \\ s_\psi & c_\psi & 0 \\ 0 & 0 & 1 \end{pmatrix}
    \begin{pmatrix} c_\theta & 0 & s_\theta \\ 0 & 1 & 0 \\ -s_\theta & 0 & c_\theta \end{pmatrix}
    \begin{pmatrix} 1 & 0 & 0 \\ 0 & c_\phi & -s_\phi \\ 0 & s_\phi & c_\phi \end{pmatrix}

  = \begin{pmatrix} c_\psi c_\theta & -s_\psi c_\phi + c_\psi s_\theta s_\phi & s_\psi s_\phi + c_\psi s_\theta c_\phi \\ s_\psi c_\theta & c_\psi c_\phi + s_\psi s_\theta s_\phi & -c_\psi s_\phi + s_\psi s_\theta c_\phi \\ -s_\theta & c_\theta s_\phi & c_\theta c_\phi \end{pmatrix},

where the rotation angles φ, θ and ψ define respectively the roll, pitch and yaw angles, and where s· and c· denote the functions sin(·) and cos(·).

By decomposing the rotation matrix into

R = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix},

4This choice will be discussed later in Section 6.3.


we then find

\theta = \sin^{-1}(-r_{31}), \quad \phi = \tan^{-1}\!\left(\frac{r_{32}}{r_{33}}\right), \quad \psi = \tan^{-1}\!\left(\frac{r_{21}}{r_{11}}\right). \qquad (3.1)

An alternative set of joint angles is given by5

\theta = \pi - \sin^{-1}(-r_{31}), \quad \phi = \tan^{-1}\!\left(\frac{r_{32}}{r_{33}}\right) + \pi, \quad \psi = \tan^{-1}\!\left(\frac{r_{21}}{r_{11}}\right) + \pi.

Figure 3.6 presents an illustration of this decomposition of a posture into a

set of joint angles (an example with the shoulder joint is considered).

5 Using a trace of the previous joint angles, it is possible to select the most appropriate set of joint angles by keeping, at each time step, the set whose values are nearest to those recorded at the previous time step.
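Eq. (3.1), its alternative solution, and the footnote's continuity-based selection can be sketched as follows; arctan2 is used instead of a plain arctangent for quadrant robustness, a small implementation choice not stated in the text.

```python
import numpy as np

def compose(phi, theta, psi):
    """R = Rz(psi) Ry(theta) Rx(phi) (roll-pitch-yaw composition)."""
    cph, sph = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cps, sps = np.cos(psi), np.sin(psi)
    Rz = np.array([[cps, -sps, 0.0], [sps, cps, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cth, 0.0, sth], [0.0, 1.0, 0.0], [-sth, 0.0, cth]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cph, -sph], [0.0, sph, cph]])
    return Rz @ Ry @ Rx

def cardan_angles(R):
    """The two candidate (phi, theta, psi) solutions of Eq. (3.1)."""
    theta = np.arcsin(-R[2, 0])
    phi = np.arctan2(R[2, 1], R[2, 2])
    psi = np.arctan2(R[1, 0], R[0, 0])
    return (phi, theta, psi), (phi + np.pi, np.pi - theta, psi + np.pi)

def pick_continuous(candidates, previous):
    """Footnote 5: keep the candidate set nearest to the previous time step."""
    return min(candidates,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(c, previous)))

R = compose(0.3, -0.5, 1.1)
sol1, sol2 = cardan_angles(R)   # both reproduce the same rotation matrix
```

Both candidate sets compose back to the same rotation matrix, so the continuity criterion is what disambiguates them along a trajectory.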


4 Testing and optimization of the parameters

4.1 Outline of the chapter

This chapter presents preliminary experiments addressing the testing and optimization of the different algorithms proposed in Chapter 2, using human motion data. These experiments concern the optimization of the different techniques proposed in the framework, notably the projection in the latent space, the selection of the number of parameters, and the testing of the incremental learning process.

The principal aim of this chapter is to motivate the use of the different algorithms in the further experiments presented in Chapter 5. Thus, the reader who prefers to move directly to the essential applications of the RbD framework can skip this chapter and proceed to the extraction-of-constraints experiments presented in Chapter 5.

The rest of this chapter is organized as follows:

• Section 4.2 presents experiments evaluating the robustness to noise of reproduction and recognition when the gestures are encoded in different latent spaces using HMMs.

• Section 4.3 discusses the selection of an optimal latent space of motion to

project the gestures.

• Section 4.4 discusses the selection of an optimal number of parameters to

represent a gesture in a GMM.

• Section 4.5 compares the robustness of the two suggested incremental learning approaches with that of a batch learning process.

4.2 Optimal latent space of motion for HMM

and robustness to noise

In this section, we compare the use of PCA (Section 2.9.1) with the use of ICA

(Section 2.9.3) to reduce the dimensionality of human motion data. We are here

interested in the case where the projected data are further encoded in an HMM.

This experiment has also been presented in Calinon and Billard (2005).


Figure 4.1: Left: Demonstrations of "waving goodbye", "knocking on a door" and "drinking from a glass" gestures. Right: Reproduction of a generalized version of the gestures. The trajectories of the demonstrator's hand and the imitator's hand (reconstructed by the stereoscopic vision system) are superimposed on the image.

4.2.1 Experimental setup

The dataset consists of 48 gestures performed by 8 healthy student volunteers

who were asked to imitate a set of six gestures shown in a video recording. The

gestures consist of (1) waving goodbye; (2) knocking on a door; (3) bringing a

cup to one’s mouth and putting it back on the table; (4)-(6) drawing the stylized

alphabet letters A, B and C (Figures 4.1 and 4.2).

Three X-Sens motion sensors attached to the torso, to the right upper-arm

and to the right lower-arm of the demonstrator are used to record the motion of

the shoulder joint (3 DOFs) and elbow joint (1 DOF), as presented in Section

3.4. An external color-based stereoscopic vision system tracks the 3D-position

of a marker placed on the demonstrator’s hand, as presented in Section 3.3. The

experiments are performed using HOAP-2 (Section 3.2), where only the robot’s

right arm (4 DOFs) is used for reproducing the gestures. The torso and legs are


Figure 4.2: Left: Demonstrations of gestures consisting in drawing the stylized alphabet letters "A", "B" and "C" on a board. Right: Reproduction of a generalized version of the gestures using another plane for drawing the alphabet letters. The trajectories of the demonstrator's hand and the imitator's hand (reconstructed by the stereoscopic vision system) are superimposed on the image.


set to a constant and stable position to support the robot in a standing posture.

The complete dataset consists of 7 × 48 trajectories describing the 4 joint

angles of the arm and the 3 Cartesian components of the hand path, where each

demonstration is rescaled to T = 100 datapoints. The data are first projected

onto a low-dimensional subspace using either PCA or ICA. The resulting signals

are then encoded in a set of HMMs using a continuous approach, as presented

in Section 2.4.4. A generalized form of the trajectories is reconstructed by

interpolating between the key-points retrieved by the HMMs, and by projecting

back the data onto the robot’s workspace. For each gesture, ξx, ξθ are then

used to train a fully-connected continuous HMM with Kx+Kθ output variables.

For each experiment, the dataset is split equally into a training set and

a testing set. Once trained, the HMM is used to recognize whether an ob-

served gesture is similar to the ones encoded in the model. As presented in

Section 2.7.2, a gesture is said to belong to a given model when the associated

log-likelihood L is strictly greater than a fixed threshold (L > −100 in this

experiment). To compare the predictions of two concurrent models, we also

set a minimal threshold for the difference across the log-likelihoods of the two

models (∆L > 100 in this experiment). Thus, for a gesture to be recognized

by a given model, the voting model must be very confident (i.e., generating a high L), while the other models' predictions must be sufficiently low in comparison.

Once recognized, the gesture is used to re-adjust the model’s parameters by as-

suming that the gestures used to train the model are still available. The HMM

is thus incrementally updated by using historical data through a batch learning

process.
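The two-threshold decision rule (L > −100 and ∆L > 100) can be sketched independently of the HMMs themselves; the log-likelihood values in the example below are illustrative.

```python
def recognize(loglikelihoods, min_L=-100.0, min_margin=100.0):
    """Return the index of the recognized model, or None if no model is
    confident enough: the winner must exceed min_L and beat every
    competitor's log-likelihood by more than min_margin."""
    best = max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])
    margin_ok = all(loglikelihoods[best] - L > min_margin
                    for i, L in enumerate(loglikelihoods) if i != best)
    return best if loglikelihoods[best] > min_L and margin_ok else None

# Model 1 is confident and well separated; the second case is too ambiguous
recognize([-400.0, -50.0, -350.0])   # -> 1
recognize([-90.0, -80.0, -300.0])    # -> None (margin between top two is 10)
```

Rejecting ambiguous cases prevents the incremental update from re-training a model on a gesture that may belong to a competitor.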

We now define the process that we use to generate noisy signals for the

further evaluation of the recognition and reproduction performance of HMM in

a latent space of motion.

4.2.2 Noise generation

To analyze the efficiency of the model when perturbed with noise, we have

defined a process to create temporal and spatial noise with respect to parameters

defining the degree of noise added to the signals. The basic principle of the

process is to randomly pick a subset of datapoints and to displace them in time or space.

Temporal noise is created by generating non-homogeneous deformations in

time on the original trajectories (Figure 4.3). For a trajectory of T datapoints,

the algorithm goes as follows:

1. Select randomly ntT key-points in the trajectory, with a uniform distri-

bution of size T .

2. Displace each key-point randomly in time following a Gaussian distri-

bution centered on the key-point with standard deviation rtσt, where


Figure 4.3: Noise generation process on a generic sinus curve, with parameters (nt, rt, ns, rs) = (10%, 50%, 10%, 50%). First row: random selection of 10 key-points. Second row: temporal noise added to the data. Third row: random selection of 10 key-points. Fourth row: spatial noise added to the data.


Figure 4.4: Decomposition by ICA and reconstruction of the trajectories for drawing the alphabet letter C. The training data with additional synthetic noise (nt, rt, ns, rs) = (10%, 20%, 10%, 20%) are represented as thin lines, with the reconstructed trajectories superimposed as thick lines. First column: joint angle trajectories. Second column: hand paths. Third column: the 2 components resulting from ICA for the joint angle trajectories. Fourth column: the 2 components resulting from ICA for the hand paths.

σt is the mean standard deviation of the distribution of key-points, i.e.,

\sigma_t(T) = \sqrt{\frac{1}{T-1} \sum_{i=1}^{T} \left(i - \frac{T}{2}\right)^2}.

3. Reconstruct the noisy signal by interpolating between the key-points.

Then, spatial noise is created by adding white noise to the original signal

(Figure 4.3). The algorithm goes as follows:

1. Select nsT key-points in the trajectory, with a uniform random distribu-

tion of size T .

2. Displace each key-point randomly in space following a Gaussian distribu-

tion centered on the key-point with standard deviation rsσs, where σs is

the mean standard deviation in space, i.e., σs = σθ for the joint angles

and σs = σx for the hand path.
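The two noise processes above can be sketched as follows; the parameter names mirror the text (nt, rt, ns, rs), while the NumPy interpolation details and the seeded random generators are illustrative choices.

```python
import numpy as np

def sigma_t(T):
    """Mean standard deviation of a uniform key-point distribution over T."""
    i = np.arange(1, T + 1)
    return np.sqrt(np.sum((i - T / 2.0) ** 2) / (T - 1))

def temporal_noise(x, nt=0.1, rt=0.2, rng=None):
    """Displace nt*T random key-points in time, then re-interpolate."""
    rng = rng if rng is not None else np.random.default_rng(0)
    T = len(x)
    idx = rng.choice(T, size=max(2, int(nt * T)), replace=False)
    shifted = np.clip(idx + rng.normal(0.0, rt * sigma_t(T), size=len(idx)),
                      0, T - 1)
    order = np.argsort(shifted)
    return np.interp(np.arange(T), shifted[order], x[idx][order])

def spatial_noise(x, ns=0.1, rs=0.2, rng=None):
    """Displace ns*T random datapoints in space by Gaussian noise."""
    rng = rng if rng is not None else np.random.default_rng(1)
    y = np.array(x, dtype=float)
    idx = rng.choice(len(y), size=int(ns * len(y)), replace=False)
    y[idx] += rng.normal(0.0, rs * np.std(y), size=len(idx))
    return y

# Corrupt a generic sinus curve, as in Figure 4.3
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 100))
noisy = spatial_noise(temporal_noise(clean))
```

Separating the two processes makes it possible to vary rt and rs independently when measuring recognition robustness.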

4.2.3 Encoding results

We first train an uncorrupted model with a dataset of 4 subjects performing the

6 different motions shown in Figures 4.1 and 4.2. The data are then projected in

a latent space of motion. The system finds that two PCA or ICA components

are sufficient to represent the hand path as well as the joint trajectories for


Figure 4.5: Recognition rates when considering a testing dataset corrupted with noise, using either PCA (left) or ICA (right) decomposition as a pre-processing step. The results are presented as a function of the spatial and temporal noise rs and rt, where the size of each square is proportional to the recognition rate, between 0% (smallest) and 100% (largest).

most gestures. The number of states in the HMM is then found by the BIC

criterion, as presented in Section 2.10.1. The letter A, waving, knocking and

drinking gestures are modelled by HMMs with 3 states. The letter B and letter

C gestures are respectively modelled by HMMs with 6 and 4 states. Figure 4.4

shows an example of the encoding and reproduction results when applying ICA

preprocessing. The remaining results are presented in Appendix A.1.
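The BIC-based selection of the number of states can be sketched as a comparison of penalized log-likelihoods, BIC = −2 log L + np log N. The parameter-count formula and the training log-likelihoods below are illustrative assumptions, not values from the experiment.

```python
import numpy as np

def bic(log_likelihood, n_params, n_data):
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * np.log(n_data)

def hmm_n_params(K, D):
    """Rough parameter count for a fully-connected continuous HMM with K
    states and D outputs (full covariances): transition probabilities,
    initial probabilities, means and covariance entries. This count is an
    assumption made for the sketch."""
    return K * (K - 1) + (K - 1) + K * D + K * D * (D + 1) // 2

# Hypothetical training log-likelihoods for two candidate state counts:
# the 6-state model fits slightly better, but BIC prefers the smaller model.
N, D = 300, 2
loglik = {3: -1200.0, 6: -1180.0}
scores = {K: bic(L, hmm_n_params(K, D), N) for K, L in loglik.items()}
best_K = min(scores, key=scores.get)
```

The log N penalty grows with the dataset size, so a larger model must improve the fit substantially to be selected.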

After training, the recognition performance is measured against a test set

of 4 other individuals performing the 6 motions. For an uncorrupted model, all

movements of the test set are recognized correctly.1

Then, to evaluate systematically the robustness of the system to recognize

and generate gestures against temporal and spatial noise, we generate two new

datasets to measure: (1) the recognition capabilities of the system (the dataset

consists of the original human data corrupted with spatial and temporal noise);

(2) the reconstruction capabilities of the system (the dataset consists of synthetic

data generated by a corrupted model). We now present the recognition results

of the system when faced with these two datasets.

4.2.4 Recognition performance

In order to measure the recognition performance of the HMM, we train a

model with an uncorrupted dataset of human gestures (note that the dataset still encapsulates the natural variability of human motion), and test its recog-

1 Note that, when reducing the number of components with PCA by considering \sum_{i=1}^{K} \lambda_i > 0.8 instead of \sum_{i=1}^{K} \lambda_i > 0.98, an error occurred for one instance of the knocking on a door motion, which was confused with the waving goodbye motion. This is not surprising, since both motions involve the same type of oscillatory component.


Table 4.1: Recognition rates as a function of the spatial noise, to evaluate the recognition and reproduction robustness of the HMM in a latent space defined by either PCA or ICA. Dataset 1 refers to the testing set composed of original human data corrupted with spatial and temporal noise. Dataset 2 refers to the testing set composed of synthetic data generated by corrupted models.

                               Dataset 1 (recognition)    Dataset 2 (reproduction)
                               PCA        ICA             PCA        ICA
Human data                     100%       100%            -          -
Human data + rs (rs = 10%)     72.0%      75.3%           80.3%      86.0%
Human data + rs (rs = 20%)     65.0%      73.0%           79.3%      81.0%
Human data + rs (rs = 30%)     54.0%      66.7%           73.7%      82.3%
Human data + rs (rs = 40%)     35.0%      34.7%           73.3%      84.3%
Human data + rs (rs = 50%)     15.3%      13.0%           74.0%      81.3%

Figure 4.6: Trajectory retrieved by a model trained successively with a dataset containing (from left to right) rs = 10%, 20%, 30%, 40%, 50% of noise, using PCA decomposition as a pre-processing step. The training data are represented as thin lines and the retrieved data as thick lines.

nition rate against a corrupted dataset. The corrupted dataset is created by

adding spatial and temporal noise to the original human dataset with nt=10%,

rt=10%, 20%, 30%, 40%, 50% and ns=100%, rs=10%, 20%, 30%, 40%, 50%.

Comparative results for PCA and ICA preprocessing are presented in Figure 4.5

and in the left column of Table 4.1. We see that the recognition rate clearly decreases when the spatial noise is increased. It decreases only slightly when the temporal noise is increased, in agreement with the known robustness of HMM encoding of time series when faced with temporal distortions.

4.2.5 Reconstruction performance

The same process is carried out to evaluate the reconstruction performance of

the system. We first train a set of corrupted models with human data corrupted

with temporal and spatial noise. We then regenerate a set of signals from each


Table 4.2: Automatic estimation of the dimensionality (D − 1) of the latent space of motion, for the 10 gestures used in the experiment.

Gesture ID   1   2   3   4   5   6   7   8   9   10
(D − 1)      4   2   3   3   4   3   4   2   2   3

of the corrupted models (Figure 4.6). We then measure the recognition rates

of the uncorrupted models (trained with uncorrupted human data) against the

set of reconstructed signals. Results are presented in the right column of Table

4.1. We see that the recognition performance is better for the regenerated

dataset than for the original corrupted dataset, which is not surprising, since

the signals regenerated from the corrupted models are by construction (through

the Gaussian estimation of the observation distributions) less noisy than the

corrupted trajectories used to measure the recognition performance (Dataset

1 ).

4.2.6 Discussion on the experimental results

Results show that the combinations PCA-HMM and ICA-HMM are both very

successful at reducing the dimensionality of the dataset and extracting the prim-

itives of each gesture, where the recognition and reconstruction performances

are very high. As expected, preprocessing of the data using PCA and ICA

removes the noise and makes the HMM encoding more robust. A second ad-

vantage of PCA/ICA encoding is that it reduces importantly the amount of

parameters required for encoding the gestures in the HMM (in contrast to using

raw data). This experiment shows that for the dataset used in the experiment,

ICA presents very few advantage over PCA, and that decorrelation may be a

sufficient global pre-processing step when the data are further encoded in an

HMM. Even if the uncorrelatedness assumption of PCA is weaker than statis-

tical independence, it is here a sufficient constraint to decompose this human

motion dataset, i.e., important information is mostly contained in the energy of

the signals.

4.3 Optimal latent space of motion for GMM

We set up another experiment to confirm that the results found in the previous

experiment are generalizable to trajectories encoded in GMM and generalized

through GMR, and compare the use of PCA (Section 2.9.1), CCA (Section 2.9.2)

and ICA (Section 2.9.3) to find an optimal latent space of motion onto which to linearly project the data. For this experiment, we use a dataset of 10 movements inspired

from the signals used by basketball officials to communicate various non-verbal

information such as scoring, clock-related events, administrative notifications,


[Figure content: (1) Stop clock for foul: one clenched fist, other palm down pointing to offender's waist. (2) Beckoning-in: open palm, wave towards the body. (3) Illegal dribble or double dribbling: patting motion. (4) Cancel score or cancel play: scissor-like action with arms, once across chest. (5) Time in: chop with hand. (6) Ball returned to backcourt: wave arm. (7) Carrying the ball: half rotation, forward direction. (8) Blocking (offence or defence): both hands on hips. (9) Deliberate foot ball: point arm to the foot. (10) Excessive swinging of elbows: swing elbow backwards.]

Figure 4.7: Selection of gestures used by basketball officials.

Figure 4.8: Illustration of the different teaching modalities used in the experiment. Left: The user performs a demonstration of a gesture while wearing motion sensors recording his upper-body movements (arms and head). Right: The user helps the robot reproduce the gesture by kinesthetic teaching, i.e., correcting the movement by physically moving the robot's limbs to their correct postures.


Figure 4.9: Automatic estimation of the dimensionality (D − 1) of the latent space of motion, for the 10 gestures used in the experiment. d represents the candidate dimensionality and \sum_{i=1}^{d} \lambda_i the cumulative sum of eigenvalues. For each graph, the selected dimensionality (D − 1) is represented by a point. The dashed line shows the threshold used (at least 98% of the variance retained by the first (D − 1) components).
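The 98%-variance rule illustrated in Figure 4.9 can be sketched as follows; the synthetic data matrix with two dominant directions of variance is illustrative.

```python
import numpy as np

def select_dimensionality(X, threshold=0.98):
    """Smallest d such that the d largest eigenvalues of the data covariance
    retain at least `threshold` of the total variance."""
    eigvals = np.linalg.eigvalsh(np.cov(X.T))[::-1]   # descending order
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, threshold) + 1)

# Synthetic 6-dimensional data with two dominant directions of variance
rng = np.random.default_rng(0)
scales = np.sqrt([5.0, 3.0, 0.01, 0.01, 0.01, 0.01])
X = rng.normal(size=(500, 6)) * scales
d = select_dimensionality(X)   # retains at least 98% of the variance
```

Lowering the threshold yields a smaller latent space at the cost of discarding more of the motion's variance, which is exactly the trade-off noted in footnote 1 of Section 4.2.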

types of violations or types of fouls. These gestures are presented in Figure 4.7.2

Officials’ signals in basketball provide a rich gesture vocabulary, characterized

by non-linearities at the level of the joint angles, which make them attractive

for researchers to use as test data (Gross & Shi, 2001; Tian & Sclaroff, 2005).

The robot first observes the user performing one of the gestures using motion

sensors attached to the user’s body, as presented in Section 3.4. This motion

is then refined by moving the robot’s limbs physically while performing the

gesture, as presented in Section 3.2. The gesture is refined partially by grasping

some of the robot’s limbs and moving the desired DOFs of the robot while the

robot controls the remaining DOFs.3 To do that, the user first selects the DOFs

that he/she wants to control by moving slightly the corresponding joints before

demonstration. The robot detects the motion and sets the corresponding DOFs

to a passive mode. For each gesture, 3 demonstrations using motion sensors and

3 demonstrations using kinesthetic teaching are provided. Figure 4.8 illustrates

the use of these two different modalities to provide the demonstrations.

First, we show in Figure 4.9 and Table 4.2 the results of an automatic esti-

mation of the dimensionality (D − 1) of the latent space of motion, for the 10

different gestures used in the experiment.

Figure 4.10 presents an example for the first gesture of the data projected in

a latent space of motion using PCA, CCA and ICA. The figures for the

2 Images reproduced from FIBA Central Board (2006) with permission.

3 Note that the number of variables that the user is able to control with his/her two arms is lower than when using the motion sensors. With kinesthetic teaching, the demonstrator can only control a subset of joint angles (the arms), while the robot controls the remaining joints (the head and the hands).


Figure 4.10: Encoding of Gesture 1 in a latent space of motion. Top cell: generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: the generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Table 4.3: Estimation of the number of components K in the GMM, for the 10 gestures used in the experiment, when using BIC, the curvature-based method, or manual segmentation by an expert. Only the first demonstration is used to select the optimal number of components.

Gesture ID           1  2  3  4  5  6  7  8  9  10
BIC                  4  4  5  4  3  4  4  4  4  5
Curvature-based      5  3  3  6  6  6  3  3  3  3
Manual segmentation  5  4  4  5  6  6  4  4  4  4

remaining gestures are presented in Appendix A.2. After modeling the data in their respective latent spaces using a GMM representation, a generalization of the data is computed by GMR (represented respectively in solid, dotted, and dash-dotted line). The generalized trajectories are then projected back in the original data space. We see that the different criteria used to project the data find different solutions for the projections. However, for these 10 gestures, the different criteria do not much affect the reconstructed data qualitatively (the different re-projected data are superimposed). This is confirmed quantitatively by the mean difference of (1.2 ± 0.5) · 10^-7 between the log-likelihood values of the GMMs computed in the original data space by projecting back the Gaussian components estimated in the PCA, CCA or ICA latent space. We can then conclude that, for the data used in the experiment, the different criteria considered to find a latent space retrieve different projections but are equivalent, in terms of log-likelihood, for the subsequent GMM encoding. Following these results, we will use PCA as a sufficient pre-processing step for the rest of the experiments presented in this thesis.

4.4 Optimal selection of the number of parameters in the GMM

In this section, we are now interested in automatically estimating the optimal number of components K in the GMM, using the same communicative gestures dataset as in the previous section (Figure 4.7). We consider the methods presented in Sections 2.10.1 and 2.10.2, based respectively on the Bayesian Information Criterion (BIC) and on the curvature analysis of the trajectory. The results are contrasted with a manual estimation of the number of components, and are presented in Table 4.3. We see that the BIC and curvature-based algorithms usually provide estimates that are close to the selection set up by expert segmentation, with a difference of one component in most cases. BIC is slightly better at estimating the number of parameters, but requires the computation of multiple models to provide this estimation. The curvature-based algorithm also has the advantage of providing initial estimates for the GMM.
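Using the score definition from the caption of Figure 4.11, S_BIC = −L + (n_p/2) log(N), a minimal sketch of BIC-based model selection might look as follows. The 1-D data, the quantile-based initialization, and the toy dataset are assumptions for illustration; the thesis fits GMMs in the latent space of motion:

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100):
    """Fit a 1-D Gaussian mixture by EM (quantile-seeded means for
    determinism); returns log-likelihood L and free-parameter count n_p."""
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)
    var = np.full(K, np.var(x))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        p = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
            / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: update priors, means and variances
        nk = r.sum(axis=0)
        pi, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    loglik = np.log(p.sum(axis=1)).sum()
    n_p = (K - 1) + K + K        # priors, means, variances
    return loglik, n_p

def bic_select(x, K_max=5):
    """Return the K minimizing S_BIC = -L + (n_p / 2) * log(N)."""
    scores = {}
    for K in range(1, K_max + 1):
        L, n_p = em_gmm_1d(x, K)
        scores[K] = -L + 0.5 * n_p * np.log(len(x))
    return min(scores, key=scores.get)

# Toy data drawn from two well-separated Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)])
k = bic_select(x)
print(k)  # BIC favors a small number of components for this bimodal sample
```

The penalty term grows with both the number of free parameters and the dataset size, which is why BIC trades likelihood against model complexity, at the cost of fitting one model per candidate K.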


Figure 4.11: Automatic estimation by BIC of the number of components K in the GMM, for the 10 gestures used in the experiment. For each graph, the dotted line represents the inverse log-likelihood −L, the dashed line represents (n_p/2) log(N) and the solid line represents the final score S_BIC. The point represents the selected number of components (minimum of S_BIC).


Figure 4.12: Segmentation of Gesture 1 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row), containing curvature information, is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).

Figure 4.13: Comparison for the 10 gestures of the log-likelihood L when initialized with a curvature-based segmentation process and when initialized with k-means (using the same number of components). C and C' denote respectively the log-likelihood L at initialization (using a curvature-based process) and after EM convergence. K and K' denote respectively the log-likelihood L at initialization (using k-means) and after EM convergence. The last graph presents the results averaged over the 10 gestures.


The automatic estimation of K by BIC is presented in Figure 4.11. An example of the automatic estimation of K for Gesture 1 by curvature-based analysis is presented in Figure 4.12. This figure also shows the initialization of the GMM parameters following this segmentation process, where the GMM parameters are then updated by the Expectation-Maximization (EM) algorithm (Section 2.8.1). The graphs for the remaining gestures are presented in Appendix A.3. We see that the curvature-based segmentation process provides, for most of the gestures, a qualitatively satisfying estimate for the initialization of the GMM parameters. Figure 4.13 presents a quantitative estimation based on log-likelihoods to measure the quality of the initial estimates of the GMM parameters when using either the curvature-based segmentation method or a k-means initialization process, using the same number of components (extracted by curvature-based segmentation). We see that k-means finds in most cases a better initial estimate (in terms of log-likelihood) of the parameters, which is often very close to the local optimum (EM increases the log-likelihood only slightly). The initialization based on curvature-based segmentation still provides quite decent estimates (which can then be refined by EM), and remains an alternative to the k-means initialization, with the principal advantage of simultaneously extracting an estimate of the number of parameters required to model the gesture (no need to evaluate multiple models). Despite these promising results, BIC will be used in the further experiments presented in this thesis, since BIC still produces a slightly better estimation of the number of parameters used to encode the gestures (Table 4.3), and provides a better estimate (in terms of log-likelihood) for the convergence of the EM algorithm to a local optimum when estimating the GMM parameters (Figure 4.13).
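A minimal sketch of the k-means initialization being compared here: the data are partitioned by k-means, the partition statistics seed the GMM parameters, and the initial model is scored by its log-likelihood (corresponding to the K bars of Figure 4.13). The 1-D toy data and the quantile seeding are assumptions for illustration:

```python
import numpy as np

def kmeans(x, K, n_iter=50):
    """Plain k-means on 1-D data, quantile-seeded for determinism."""
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - mu), axis=1)
        mu = np.array([x[labels == k].mean() for k in range(K)])
    return labels, mu

def gmm_loglik(x, pi, mu, var):
    """Log-likelihood of 1-D data under a Gaussian mixture."""
    p = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
        / np.sqrt(2 * np.pi * var)
    return np.log(p.sum(axis=1)).sum()

# Initial GMM parameters from the k-means partition
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 0.4, 150), rng.normal(2, 0.4, 150)])
labels, mu = kmeans(x, 2)
pi = np.bincount(labels, minlength=2) / len(x)
var = np.array([x[labels == k].var() for k in range(2)])
print(gmm_loglik(x, pi, mu, var))
```

The resulting parameters would then be refined by EM, which typically improves the score only slightly when the k-means partition is already good, as observed in Figure 4.13.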

4.5 Robustness evaluation of the incremental learning process

The two incremental teaching procedures presented in Section 2.8.3 are used

to teach the different basketball officials’ signals presented in Figure 4.7 to a

humanoid robot, using two different modalities to convey the demonstrations

(Figure 4.8), where the GMM parameters are updated after each demonstration.

This experiment has also been presented in Calinon and Billard (2007b).

In this experiment, we use PCA to decompose the data (Section 2.9.1) and

BIC to select an optimal number of GMM components (Section 2.10.1).4 The

dimensionality of the latent space and the number of Gaussian components used

to encode the data, estimated automatically by the system, have been presented

in Sections 4.3 and 4.4 (see Figure 4.9, Figure 4.11 and Table 4.3). The original

data space of 16 DOFs is then reduced to a latent space of 2-4 dimensions, which

is a suitable dimensionality to estimate the GMM parameters using an EM

4 Note that only the first demonstration is used to find the optimal number of components.


Figure 4.14: Illustration of the incremental learning process using the direct update method (Section 2.8.3), a modified version of the EM algorithm that incrementally updates the GMM parameters using only the new observations. The graphs show the consecutive encoding of the data in the GMM after convergence of the EM algorithm (only the first variable of Gesture 2 in the latent space is represented). The algorithm only uses the latest observed trajectory (black line) to update the model.

algorithm. 3-5 GMM components are required to encode the different gestures.

Figure 4.14 shows an example of the GMM encoding results when incrementally updating the parameters using the direct update method presented in Section 2.8.3. Note the particular importance of regularizing the parameters by bounding the covariance matrices, as presented in Section 2.11.1, when using the direct update method. We see in the first and second rows of Figure 4.14, respectively, that the first set of three demonstrations using motion sensors and the last set of three demonstrations using kinesthetic teaching present similarities within each set, but are quite different across the sets. We see that after the sixth demonstration, the two incremental training processes still efficiently refine the model to represent the whole training set, without using historical data to update the model.
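The exact update rules of the direct method are given in Section 2.8.3; the sketch below only illustrates the generic idea behind such recursive EM schemes: sufficient statistics (responsibility mass and weighted sums) are accumulated over demonstrations, so that past trajectories can be discarded. The 1-D setting, the class name and the variance floor are illustrative assumptions:

```python
import numpy as np

class IncrementalGMM1D:
    """Sketch of a direct-update scheme: cumulated sufficient statistics
    replace the stored demonstrations. This follows the generic
    recursive-EM idea, not necessarily the thesis's exact update rules."""

    def __init__(self, mu, var, pi):
        self.mu, self.var, self.pi = map(np.asarray, (mu, var, pi))
        self.n = np.zeros_like(self.mu)   # cumulated responsibility mass
        self.s1 = np.zeros_like(self.mu)  # cumulated weighted sum of x
        self.s2 = np.zeros_like(self.mu)  # cumulated weighted sum of x^2

    def update(self, x):
        # E-step on the new demonstration only, under the current model
        p = self.pi * np.exp(-0.5 * (x[:, None] - self.mu) ** 2 / self.var) \
            / np.sqrt(2 * np.pi * self.var)
        r = p / p.sum(axis=1, keepdims=True)
        # Accumulate sufficient statistics, then recompute parameters
        self.n += r.sum(axis=0)
        self.s1 += (r * x[:, None]).sum(axis=0)
        self.s2 += (r * x[:, None] ** 2).sum(axis=0)
        self.pi = self.n / self.n.sum()
        self.mu = self.s1 / self.n
        # Bounded variance, cf. the regularization of Section 2.11.1
        self.var = np.maximum(self.s2 / self.n - self.mu ** 2, 1e-3)

# Two noisy demonstrations of the same bimodal signal
rng = np.random.default_rng(3)
m = IncrementalGMM1D(mu=[-1.0, 1.0], var=[1.0, 1.0], pi=[0.5, 0.5])
for _ in range(2):
    demo = np.concatenate([rng.normal(-2, 0.3, 50), rng.normal(2, 0.3, 50)])
    m.update(demo)
print(m.mu)  # means move toward -2 and +2
```

Because only the statistics grow, the memory footprint is independent of the number of demonstrations, which is the property exploited in this experiment.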

The aim of this experiment is to compare the efficiency of the different

training processes, namely Batch (B), Incremental - direct update (IA) and In-

cremental - generative (IB), presented respectively in Sections 2.8.1 and 2.8.3.

To do so, we keep in memory each demonstration and each model updated after

having observed the corresponding demonstrations. For each gesture, we then

compute the log-likelihood of the model when faced with the different demon-

strations (movements already observed and remaining movements). Figure 4.15

presents the average results for the 10 gestures. We see that after the first


Figure 4.15: Evolution of the inverse log-likelihood −L when trained with the different methods. After each demonstration, the current model is compared with the whole dataset (data already observed and remaining data). The results are averaged over the 10 different gestures. The 3D vertical bars and the vertical segments represent respectively the means and the variances.


Figure 4.16: Inverse log-likelihood −L of the models after observation of the sixth demonstration, when faced with the six demonstrations used to train the corresponding models. Results are presented for the different gestures and for the different teaching approaches. B corresponds to the batch training procedure. IA and IB correspond to incremental training procedures using respectively the direct update method and the generative method. The last graph presents the inverse log-likelihood averaged over the whole set of gestures.

demonstration, Model 1 describes Data 1 very well (the log-likelihood is high, and −L is thus low). This first model also partially describes Data 2-3, but poorly describes Data 4-6. From the fourth demonstration, Model 4 begins to provide a good representation for the whole training set, which is finally optimized in Model 6. We see that the different training methods do not show significant differences. In particular, we see that after the sixth demonstration, the log-likelihoods of Model 6 trained with one of the incremental methods are almost constant for Data 1-6. Thus, each demonstration is equally represented by the model, following a linear learning rate, i.e., the model encodes the previously observed data as well as the newly observed data.

For each gesture, a closer look at the log-likelihoods of the final model (Model 6) is presented in Figure 4.16, where the last inset shows the average for the 10 gestures. It shows that the negative log-likelihoods for the incremental methods IA and IB are only slightly higher than for the batch method B, i.e., the resulting GMM representations for IA and IB are nearly as good as for B.5 Thus, we see that both the direct update and generative methods are efficient at refining the GMM representation of the data. The loss of performance induced by the incremental training procedures is negligible compared to the benefit of the methodology. Indeed, a learning system that can forget historical data is advantageous in terms of memory capacity and scaling-up issues. With the proposed experiment, the differences in processing speed between the batch and incremental procedures are insignificant (all the processes run in less than 1

5 Note that this is not systematic (see Gesture 2).


Figure 4.17: Reproduction using Gaussian Mixture Regression (GMR) in a latent space of motion (Gestures 1-2), with models trained with the batch (B) and incremental (IA and IB) training methods, representing respectively the direct update and generative method.


Figure 4.18: Reproduction of the 10 gestures by using an incremental training process (generative method), after having observed 6 demonstrations for each gesture. The resulting trajectories of the hands are represented when the generalized joint angle trajectories are executed by the robot. Note that all the gestures start with the arms hanging along the body.


second using a Matlab implementation running on a standard PC).6

Figure 4.17 shows the reproduction of Gestures 1-2 for the different training methods, using Gaussian Mixture Regression (GMR) in the latent space. The results for the remaining gestures are presented in Appendix A.4. We see that there is no significant qualitative difference in the trajectories generated by the different models. Figure 4.18 presents the hand paths in Cartesian space corresponding to the gestures learned by the generative incremental training method in a latent space of motion, where the motions have been projected back in the joint angle data space and run on the robot. We see that the essential characteristics of the motion are well retrieved by the models.
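For reference, a minimal sketch of the GMR step used in these reproductions: the joint model over (t, ξ) is conditioned on time, and the output is the responsibility-weighted sum of the component-wise conditional expectations. Section 2.6 gives the exact formulation; the 1-D output and the toy model below are assumptions:

```python
import numpy as np

def gmr(t_query, pi, mu, sigma):
    """Gaussian Mixture Regression for a 2-D GMM over (t, x):
    estimate E[x | t]. mu: (K, 2), sigma: (K, 2, 2)."""
    t_query = np.atleast_1d(t_query)
    K = len(pi)
    # Responsibility of each component for each query time
    h = np.array([pi[k] * np.exp(-0.5 * (t_query - mu[k, 0]) ** 2 / sigma[k, 0, 0])
                  / np.sqrt(2 * np.pi * sigma[k, 0, 0]) for k in range(K)])
    h /= h.sum(axis=0)
    # Conditional expectation of x given t, per component
    xk = np.array([mu[k, 1] + sigma[k, 1, 0] / sigma[k, 0, 0] * (t_query - mu[k, 0])
                   for k in range(K)])
    return (h * xk).sum(axis=0)

# A 2-component model encoding a rising-then-falling trajectory
pi = np.array([0.5, 0.5])
mu = np.array([[0.25, 1.0], [0.75, -1.0]])
sigma = np.array([[[0.02, 0.0], [0.0, 0.1]],
                  [[0.02, 0.0], [0.0, 0.1]]])
x_hat = gmr(np.linspace(0, 1, 5), pi, mu, sigma)
print(x_hat)
```

Each component thus acts as a local linear regressor, and the responsibilities blend them smoothly along the time axis, which is what produces the generalized trajectories of Figure 4.17.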

For the dataset considered in this experiment, we do not see any benefit of using one or the other method to incrementally learn the data. The direct update approach is recommended over the generative method because it does not rely on a stochastic process to update the model. However, the dataset must change only slowly between two consecutive examples, because the direct update method assumes that the cumulated posterior probability does not change much with the inclusion of the novel data in the model. This constraint is partially fulfilled when considering data that are first recognized by the model (i.e., that share enough similarities with the previous data) before the parameters are updated. The generative update method remains useful when considering datasets showing high variability between two consecutive demonstrations. This issue will be discussed later in Section 6.6.2.
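A sketch of the stochastic step that makes the generative method different: the current model is sampled to produce surrogate "historical" data, which are mixed with the new demonstration before an ordinary batch EM refit. The 1-D model and sample sizes below are illustrative assumptions:

```python
import numpy as np

def sample_gmm(pi, mu, var, n, rng):
    """Draw n points from a 1-D GMM: pick a component per point, then
    sample from that Gaussian. The samples stand in for the discarded
    historical demonstrations."""
    comp = rng.choice(len(pi), size=n, p=pi)
    return rng.normal(np.asarray(mu)[comp], np.sqrt(np.asarray(var)[comp]))

# Current model (learned from past demonstrations, now discarded)
pi, mu, var = np.array([0.5, 0.5]), np.array([-2.0, 2.0]), np.array([0.1, 0.1])
rng = np.random.default_rng(4)
replay = sample_gmm(pi, mu, var, 300, rng)   # surrogate for old data
new_demo = rng.normal(0.0, 0.3, 100)         # freshly observed data
training_set = np.concatenate([replay, new_demo])
# 'training_set' would then be fed to an ordinary batch EM (Section 2.8.1)
print(training_set.shape)
```

The ratio between replayed and new samples controls how strongly the new demonstration can reshape the model, which is why this variant tolerates high variability between consecutive demonstrations.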

6 However, when considering larger datasets, the incremental learning approach would present advantages by updating the model faster than a batch learning process.


5 Extraction of constraints: experiments

5.1 Outline of the chapter

This chapter presents the core of the experiments developed throughout this thesis. It illustrates the development of the framework by gradually increasing the difficulty of the extraction-of-constraints and reproduction issues tackled. We first start by describing our early attempts at using HMMs to show how to reproduce a skill by using a single controller (direct controller or inverse kinematics controller). Then, the problem of combining the different controllers is addressed. Finally, new challenges are introduced by presenting experiments concerning the incremental learning of a skill, the use of multiple constraints on different objects, and the use of different modalities to convey the demonstrations. Note that this last set of experiments (Section 5.4) shows the most recent advances of the proposed RbD framework. The chapter is organized as follows:

• Section 5.2 presents an experiment using HMMs to encode joint angle and

hand path data. In this experiment, the constraints related to these two

representations of the motion are extracted by the model and are used for

reproduction by selecting either a direct controller (using the joint angle

representation) or an inverse kinematics controller (using the hand path

representation).

• Section 5.3 presents experiments combining both representations to find

an optimal controller for the reproduction of the task using both joint

angle and hand path information.

• Section 5.4 finally presents three experiments to show the advantages of

using an incremental teaching process that puts the user in the loop of

the robot’s learning.

5.2 Extraction of constraints and reproduction by using a single controller

This experiment presents one of our early attempts at encoding gestures in HMMs and using this representation to extract the most effective representation of the data for the task considered (joint angle or hand path information).


Figure 5.1: Illustration of the goal-directed imitation issue tackled by the experiment ("Dots" condition in the left column, "No dots" condition in the right column). The demonstrator and imitator stand respectively on the left-hand side and on the right-hand side of the table. We consider here a mirror imitation strategy. With the presence of objects (left column), the reproduction of an ipsilateral gesture by using a mirror ipsilateral gesture is obvious (second row). Following this strategy, a contralateral gesture should thus be reproduced by a mirror contralateral gesture (third row). However, it is more difficult to determine here whether the reproduction task is still satisfied when the imitator chooses his/her other arm (the one closer to the dot) to reach for the dot. Without objects (right column), this issue disappears, i.e., the goal or constraints for the imitator now turn toward reproducing the same gesture as the demonstrator's.


Figure 5.2: Experimental setup.

The selection of an optimal controller is then based on the extraction of the

variability in one or the other representation. A reproduction using either the

hand path (using an inverse kinematics controller) or the joint angles (using

a direct controller) is thus selected. This experiment has also been described

in Calinon, Guenter, and Billard (2005). It extends the framework for solving the what-to-imitate problem by incorporating the notion of goal preference, and inherently proposes a method for optimizing the reproduction, namely the how-to-imitate issue (Nehaniv & Dautenhahn, 2000). The complete system is validated on the HOAP-2 humanoid platform, reproducing an influential experiment from developmental psychology.

Indeed, the suggested scenario is inspired by the experiments of Bekkering et al. (2000) showing that imitation is goal-directed. The authors used an imitation game to show that the hand paths and hand-object relationships have different levels of importance, following a hierarchy of relevance. They

also suggested that the use of the different levels mainly depends on the working

memory capacities of the infants/adults. While infants focus on a single level,

adults use multiple levels simultaneously with preferences for the levels of highest

relevance in the hierarchy.

The scenario of our experiment is a replication of a set of psychological

experiments conducted with young children and adults (Bekkering et al., 2000).

In these experiments, the authors showed that children have a tendency to

substitute ipsilateral for contralateral gestures when the dots are present. In

contrast, when the dots are absent from the demonstration, the number of


substitutions drops significantly. Thus, despite the fact that the gesture is the same in both conditions, the presence or absence of a physical object (the dot) strongly affects the reproduction. When the object is present, object selection takes the highest priority. Children then nearly always direct their imitation to the appropriate target object, at the cost of selecting the "wrong" hand. When the dots are removed, the complexity of the task (the number of

constraints to satisfy) decreases, and, hence, constraints of lower importance can

be fulfilled (such as producing the same gesture or using the same hand). An

illustration of this issue is presented in Figure 5.1. Note that similar experiments

conducted with adults have corroborated these results, by showing that the

presence of a physical object affects the reproduction. In that case, the response

latency was used as a comparative measure, instead of using the proportion of

errors observed in the reproduction attempts.

The experiment starts with the (human) demonstrator and the (robot) imitator standing in front of a table, facing each other (Figure 5.2). On both sides of the table, two colored dots (red and green) have been stamped at equal distance from the demonstrator's and imitator's starting positions. In a first set of demonstrations, the demonstrator reaches for each dot alternately with the left and right arm. If the demonstrator reaches for the dot on the left-hand side of the table with his/her left arm, he/she is said to perform an ipsilateral motion. If, conversely, the demonstrator reaches for the dot on the right-hand side of the table with his/her left arm, he/she is said to perform a contralateral motion. Then, the demonstrator produces the same ipsilateral and contralateral motions, but without the presence of dots.

Each of these motions is demonstrated five times consecutively. In each case, the demonstrator starts from a similar starting position. While observing the demonstration, the robot tries to make sense of the experiment and extracts the demonstrator's intention underlying the task by determining a set of constraints for the task in a statistical manner. When the demonstration ends, the robot computes the trajectory that best satisfies the constraints extracted during the demonstration and generates a motion that follows this trajectory.

The experiments of Bekkering et al. (2000) are informative in RbD to determine how to prioritize constraints (which we will also call goals throughout this experiment) in a given task, and as such help us solve the correspondence problem (Alissandrakis et al., 2002). For instance, in the particular scenario proposed, knowing the trajectory of the demonstrator's arm and hand path might not allow us to determine unequivocally the joint angle trajectories of the robot's arm. Indeed, depending on where the target is located, several constraints (goals) might compete, and satisfying all of them would not always lead to a solution. For instance, in the case of a contralateral motion, the robot's arm is too short to both reach the target and perform the same gesture. In that case, the robot should find a trade-off between satisfying each of the constraints.

This amounts to determining the importance of each constraint with respect to


Figure 5.3: Information flow across the system used in this experiment. The trajectories are first segmented into a set of keypoints. The sequences of keypoints are then encoded in an HMM, as presented in Section 2.4.3. The trajectories for the reproduction are then generated by interpolation through the sequence of keypoints regenerated by the HMM.

Figure 5.4: Example of a left-right continuous HMM used in this experiment, with 5 states and 2 output distributions describing y_t and y'_t.

one another.

5.2.1 Experimental setup

The demonstrator’s motions are recorded by 5 X-Sens motion sensors attached

to the torso and the upper- and lower-arms (Section 3.4). The joint angle

trajectories of the shoulder joint (3 degrees of freedom) and the elbow (1 degree

of freedom) are reconstructed by taking the torso as referential. The external

stereoscopic vision system presented in Section 3.3 is used used to track the

3D-position of the dots, the demonstrator’s hands and the robot’s hands. The

HOAP-2 robot presented in Section 3.2 is used, where trajectory control affects

only the two arms (4 DOFs each) in this experiment. The torso and legs are set

to a constant position to support the robot’s standing-up posture.

Figure 5.5: Function used to transform a standard deviation σ into a weight factor w ∈ [0, 1]. σmin corresponds to the accuracy of the sensors. σmax represents the maximal standard deviation measured during a set of demonstrations generated by randomly moving the arms in the environmental setup during one minute.

In order to reduce the dimensionality of the dataset to a subset of critical features, a pre-processing phase automatically segments the joint angle trajectories and the hand path into a set of keypoints, which are subsequently encoded

in Hidden Markov Models (HMMs), as presented in Section 2.4.3 (Figure 5.3).

The segmentation strategy is similar to the one presented in Section 2.10.2, but

the process implemented here is much simpler by considering only zero crossing

velocity.
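The zero-crossing segmentation described above can be sketched in a few lines; the helper below is an illustrative reconstruction (function name and endpoint conventions are assumptions of this sketch), not the thesis implementation:

```python
import numpy as np

def segment_keypoints(traj, dt=1.0):
    """Segment a 1-D trajectory into keypoints at zero-crossings of the
    velocity (local extrema), always keeping the two endpoints.
    Illustrative reconstruction, not the thesis implementation."""
    traj = np.asarray(traj, dtype=float)
    vel = np.diff(traj) / dt
    # Indices where the velocity changes sign between consecutive samples
    crossings = np.where(np.sign(vel[:-1]) * np.sign(vel[1:]) < 0)[0] + 1
    idx = np.unique(np.concatenate(([0], crossings, [len(traj) - 1])))
    return idx * dt, traj[idx]

# A sine-like joint angle trajectory yields keypoints at its extrema
t = np.linspace(0.0, 2.0 * np.pi, 101)
times, values = segment_keypoints(np.sin(t), dt=t[1] - t[0])
```

For a single sine period this retrieves four keypoints: the two endpoints plus the maximum and the minimum.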

Each of the 4 joint angle trajectories is encoded in a left-right continuous

HMM (Figure 5.4). The number of states is determined by the sequence with the

highest number of keypoints in the training set. Each hidden state represents a

keypoint j in the trajectory, and is associated with a stochastic representation,

encoding two variables yj , y′j, namely the time lag between two keypoints and

the joint angle value. The hand path is encoded in the same way, with 3 output

distributions to encode the Cartesian components.

The transition probabilities P (qt=j|qt−1=i) and the output distribution

p(yt|qt=i) are estimated by the Baum-Welch iterative method, as presented in

Section 2.8.2. The forward algorithm is used to estimate a log-likelihood value

that an observed sequence could have been generated by one of the models, as

presented in Section 2.7.2. The Viterbi algorithm is used to generate a general-

ization of a trajectory over the demonstrations, by retrieving the best sequence

of hidden states and the associated keypoint components, as presented in Sec-

tion 2.4.3. The corresponding trajectory is then reconstructed by applying a

3rd-order spline fit when using the Cartesian trajectory, and by applying a co-

sine fit when using the joint angle trajectory. The cosine fit corresponds to a

cycloidal velocity profile, and keeps the keypoints as local maxima or minima

during the reproduction.1
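The cosine fit between consecutive keypoints can be sketched as follows; this is an illustrative reconstruction (function name and segment handling are mine), assuming the keypoints are given as (time, value) pairs:

```python
import numpy as np

def cosine_fit(key_times, key_values, t):
    """Interpolate between consecutive keypoints with a cosine ease-in/ease-out.

    The velocity vanishes at every keypoint (a cycloidal velocity profile), so
    keypoints remain local maxima or minima of the reproduced trajectory.
    A sketch of the cosine fit described in the text, not the thesis code."""
    key_times = np.asarray(key_times, dtype=float)
    key_values = np.asarray(key_values, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Segment index for each query time (clipped to the valid range)
    seg = np.clip(np.searchsorted(key_times, t, side='right') - 1,
                  0, len(key_times) - 2)
    t0, t1 = key_times[seg], key_times[seg + 1]
    v0, v1 = key_values[seg], key_values[seg + 1]
    phase = (t - t0) / (t1 - t0)
    s = 0.5 * (1.0 - np.cos(np.pi * phase))   # eases from 0 to 1
    return v0 + (v1 - v0) * s

# Three keypoints at t = 0, 1, 2 with values 0, 1, -1
traj = cosine_fit([0.0, 1.0, 2.0], [0.0, 1.0, -1.0], np.linspace(0.0, 2.0, 9))
```

The interpolant passes through every keypoint with zero velocity, so the middle keypoint stays a local maximum of the reproduction.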

5.2.2 Effects of the priors on the metric of imitation

Let D = {x1, x2, . . . , xT} and D′ = {x′1, x′2, . . . , x′T} be a demonstration and a reproduction of a variable x. In order to measure the performance of the robot

1 Note that a more cohesive alternative is suggested in Section 2.4.3, using piecewise cubic Hermite polynomial functions for each trajectory retrieved by the model.


imitation, we define in this experiment a cost function H as

\[
H(D, D') = 1 - f(e), \qquad e = \frac{1}{T} \sum_{t=1}^{T} |x'_t - \mu_t|, \tag{5.1}
\]

where H ∈ [0, 1] gives an estimate of the quality of the reproduction.2 Opti-

mizing the imitation consists of minimizing H, where H = 0 corresponds to a

perfect reproduction. e is a measure of deviation between the observed data

D′ and the training data D, computed through the HMM representation of the

data. The Viterbi algorithm is first used to retrieve the best sequence of states

q1, q2, . . . , qT , given an observation D = x1, x2, . . . , xT of length T , where

µ1, µ2, . . . , µT is the sequence of means associated with the sequence of states.

A transformation function f(·), defined in Figure 5.5, normalizes and bounds

each variable within minimal and maximal values. This eliminates the effect

of the noise, intrinsic to each variable, so that the relative importance of each

variable can be compared. Let us consider now a dataset of K variables. The

metric H considered in this experiment is given by

\[
H = \frac{1}{K} \sum_{i=1}^{K} w_i \, H_i(D_i, D'_i), \tag{5.2}
\]

where wi ∈ [0, 1] is a measure for the relative importance of each cost function.

These weights reflect the variance of the data during the demonstration. To

evaluate this variability, we use the statistical representation provided by the

HMM. If q1, q2, . . . , qT is the best sequence of states retrieved by the model,

and if σ^i_1, σ^i_2, . . . , σ^i_T is the associated sequence of standard deviations of variable i, we define

\[
w_i = f\!\left( \frac{1}{T} \sum_{t=1}^{T} \sigma^i_t \right). \tag{5.3}
\]

If the variance of a given variable is high, i.e., showing no consistency across

demonstrations, this suggests that satisfying some particular constraints on this

variable will have little bearing on the task. The factors wi in the cost function

equation reflect this assumption; if the standard deviation of a given variable

is low, the value taken by the corresponding wi is close to 1. This way, the

corresponding variable will have a strong influence in the reproduction of the

task.
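A minimal sketch of this weighting, assuming a piecewise-linear transformation f between σmin and σmax (the text only specifies the saturation endpoints of the curve in Figure 5.5, so the linear interpolation in between is an assumption):

```python
import numpy as np

def sigma_to_weight(sigma, sigma_min, sigma_max):
    """Transform a standard deviation into a weight w in [0, 1] (Figure 5.5).

    Below sigma_min (sensor accuracy) the weight saturates at 1; above
    sigma_max (variability of random motions) it saturates at 0. The linear
    segment in between is an assumption of this sketch."""
    return float(np.clip((sigma_max - sigma) / (sigma_max - sigma_min), 0.0, 1.0))

def variable_weight(sigmas, sigma_min=0.01, sigma_max=0.5):
    """Weight w_i of Equation (5.3): f applied to the mean standard deviation
    along the optimal state sequence. The bounds are illustrative values."""
    return sigma_to_weight(np.mean(sigmas), sigma_min, sigma_max)

# A consistently demonstrated variable (low sigma) gets a weight close to 1
w_consistent = variable_weight([0.02, 0.01, 0.03])
# A highly variable one gets a weight of 0
w_loose = variable_weight([0.6, 0.5, 0.7])
```

A variable demonstrated consistently thus strongly constrains the reproduction, while a variable showing large spread is effectively ignored.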

Extracting the relative importance of each set of variables requires observing several demonstrations performed with sufficient variability to estimate

the relevance of the different sets of variables. To speed-up this extraction of

2 Note that in this experiment, we do not use a cost function based on the likelihood of the HMM (Section 2.7.2), in order to be consistent with the other components that will be introduced later to enrich the cost function. Indeed, as these components are not based on a probabilistic model, a simple RMS error computation is sufficient here.


constraints, we suggest in this experiment introducing goal-preference parameters αi, extending the cost function to express explicitly how a constraint should

be prioritized over another. In this experiment, the cost function is thus applied

to 4 sets of variables, namely the joint angle trajectories, the hand path, the

hand-objects relationships at the end of the motion, and the laterality of the

motion (which hand is used).

Let D = {Θ, X, d, h} and D′ = {Θ′, X′, d′, h′} be the datasets generated by the demonstrator and the imitator respectively. Let {θ1, θ2, θ3, θ4} be the generalized joint angle trajectories over the demonstrations, {x1, x2, x3} the generalized Cartesian trajectory of the hand, and {o1,1, o1,2, o1,3} and {o2,1, o2,2, o2,3} the 3D locations of the first and second dot respectively. dk = xk − ok defines the distance between the hand and object k at the end of a trajectory. h = 1 and h = 2 correspond to the usage of the left and right arm respectively. We make here the assumption that only one hand is used at a time. With N = 4 joint angles for each arm and O = 2 objects, we define the general cost function Htot as

\[
\begin{aligned}
H_{tot} = \; & \alpha_1 \frac{1}{N} \sum_{i=1}^{N} w^1_i \, H_1(\theta_i, \theta'_i)
+ \alpha_2 \frac{1}{3} \sum_{j=1}^{3} w^2_j \, H_2(x_j, x'_j) \\
& + \alpha_3 \frac{1}{3O} \sum_{k=1}^{O} \sum_{j=1}^{3} w^3_{kj} \, H_3(d_{kj}, d'_{kj})
+ \alpha_4 \, w_4 \, H_4(h, h').
\end{aligned}
\]

The factors αi (with Σi αi = 1 and 0 ≤ αi ≤ 1) are fixed by the experimenter and determine the relative importance of each subset of data, i.e., the importance of each constraint (or goal) in the overall task (α1 for the joint angle trajectories, α2 for the hand path, α3 for the hand-object distance, and α4 for the laterality of the movement).3 The cost functions H1, H2 and H3 are given by

Equation 5.2. H4 is given by H4(h, h′) = |p(h = 1)− p(h′ = 2)|, where p(h = 1)

is the probability of using the left arm during the demonstrations. wH1 , wH2

and wH3 follow Equation 5.3, and are set to 0 if the corresponding component

is missing (i.e., if the dot is not detected by the vision system). w4 is given by

w4 = 2 |p(h = 1) − 0.5|, and represents the importance of using either the left

or right hand (laterality of the imitation). It is defined by the probability with

which either hand has been used over the whole set of demonstrations (w4 = 0

if there is no preference).
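The composition of Htot with goal-preference priors can be sketched as follows; the per-constraint cost values are hypothetical, and the normalization bounds inside h_component are assumptions of this sketch:

```python
import numpy as np

def h_component(demo, repro, sigma_min=0.01, sigma_max=0.5):
    """Cost H of Eqs. (5.1)-(5.2) for one set of variables: mean absolute
    deviation passed through the normalizing transformation f. The bounds
    sigma_min and sigma_max are illustrative values."""
    e = np.mean(np.abs(np.asarray(repro) - np.asarray(demo)))
    f = np.clip((sigma_max - e) / (sigma_max - sigma_min), 0.0, 1.0)
    return 1.0 - f

def h_total(alphas, costs):
    """Weighted combination of per-constraint costs, as in H_tot."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.isclose(alphas.sum(), 1.0)   # priors must sum to one
    return float(np.dot(alphas, costs))

# Example per-constraint cost from a demonstrated vs. reproduced trajectory
h_joint = h_component([0.0, 0.5, 1.0], [0.05, 0.55, 1.05])

# Goal-directed priors (alpha1 = 1/2 alpha2 = 1/4 alpha3, alpha4 = 0)
alphas = np.array([1.0, 2.0, 4.0, 0.0]) / 7.0
# Hypothetical per-constraint costs for two candidate controllers
mirror = h_total(alphas, [0.1, 0.2, 0.8, 0.0])
ipsilateral = h_total(alphas, [0.5, 0.3, 0.05, 0.0])
best = min(mirror, ipsilateral)
```

With these priors, a controller that reaches the dot at the cost of a less faithful joint angle imitation obtains the lower total cost and is selected, mirroring the behavior reported in Table 5.1.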


Figure 5.6: Left: 5 demonstrations of the joint angle trajectories (top) and hand paths (bottom), for a contralateral motion with the right hand. Middle: Trajectory in joint and Cartesian space retrieved from the HMM encoding the 5 demonstrations. Right: Reproduction of a new motion by the robot to minimize the cost function H, with all αi set to the same value. The points in the graphs represent the keypoints segmented and retrieved by the HMMs. The square and the circle show the positions of the two dots on the table. Only the shoulder flexion-extension is represented for the joint angles. The trajectory is retrieved with a cosine fit for the joint angles, and with a 3rd-order spline fit for the hand path.

Table 5.1: Values of the cost function H after a demonstration of a contralateral motion with the right hand. The optimal value for the controller is represented in bold and is used for the reproduction of the skill. With no goal-preference, the robot tries to imitate the movement in mirror-fashion, with or without the dots (first and second columns). By constraining the importance of the different sets of variables, i.e., with a goal-directed cost function, the robot imitates with an ipsilateral controller when the dots are present (third column), i.e., by using the hand closest to the dot. When the dots are absent, it selects a controller to reproduce the movement in mirror-fashion (fourth column).

                       Without priors              With priors
                       (α1 = α2 = α3 = α4)         (α1 = ½α2 = ¼α3, α4 = 0)
                       Dots        No dots         Dots        No dots
Contralateral demo     0.16        0.14            0.22        0.11
Ipsilateral demo       0.36        0.47            0.08        0.16


Figure 5.7: Left: Demonstration of an ipsilateral (top) and contralateral (bottom) motion of the right arm. Right: Reproduction of the optimal motion candidate using only statistics to extract the goals (α1 = α2 = α3 = α4).

Figure 5.8: Same experiment as in Figure 5.7, adding a notion of goal-dominance to the general imitation cost function (α1 = ½α2 = ¼α3, α4 = 0). This time, the contralateral motion is reproduced by performing an ipsilateral motion with the other arm, i.e., by using the arm which is closer to the dot.


Figure 5.9: Summary of the experiment presented in Figures 5.7 and 5.8, using the left arm for the demonstrations (columns: "dots" and "no dots" conditions; rows: ipsilateral and contralateral demonstrations).

5.2.3 Experimental results

Table 5.1 presents the values of the cost function H for the different models of gesture that can be used for the reproduction when a contralateral motion is demonstrated, for the different conditions on the factors αi and in the two different situations (with or without the dots).4 As expected from the dataset used in this experiment, which presents low variability across the demonstrations, we find little variation (w^j_i close to 1) in the joint angle trajectories, the hand paths, the hand-object distances and the laterality, which forces the satisfaction of all constraints during

the reproduction of the motion. In such a situation, since the robot does not

share the same embodiment as that of the user, it is not able to satisfy every con-

straint. By introducing the factors αi in the cost function H, the constraints to

fulfill are prioritized. Thus, fewer demonstrations are usually required to extract

the task constraints.

In order to test the influence of setting a priori the preference of each set of

variables for the reproduction, we have computed H with two sets of values:

• α1 = α2 = α3 = α4 (general cost function).

• α1 = ½α2 = ¼α3, α4 = 0 (goal-directed cost function).

3 We make the assumption that a mirror imitation is used in preference by the robot.

4 When the target dots are not present, the associated weights w^3_j are null. Then, the hand path and joint angle trajectories become the remaining relevant features to reproduce.


Figure 5.10: Left: Illustrative schema of the what-to-imitate issue. A 2-DOF manipulator arm produces two demonstrations of a given task, namely drawing an S figure, starting from different joint angle configurations. The path of the end-effector (given by x1, x2) is invariant, while the joint trajectories θ1, θ2 vary considerably. Right: Illustrative schema of the how-to-imitate issue. As the manipulator considered for reproduction has a different embodiment, it can generate several alternatives to reproduce the demonstrated task, from purely satisfying the joint angle trajectories (left) to satisfying only the hand path (right). The issue is then to find an appropriate trade-off between satisfying the constraints related respectively to the joint angles and the path of the end-effector.

Figure 5.11: Illustration of the correspondence problem. Left: Demonstration. Middle: Reproduction using a different body. Right: Reproduction using a different object.

Table 5.1 gives the values of the cost functions H for each condition. The

optimal values are then used to select a controller for the reproduction of the

task. Figures 5.7 and 5.8 show the resulting trajectories tracked by the vision

system (see also Figure 5.9). Due to its limited range of motion and to its

shorter arm, the robot cannot reach for the dot with the contralateral motion.

By adding a notion of goal-dominance to the general imitation cost function,

the robot reproduces the contralateral motion of the demonstrator by doing an

ipsilateral motion with the other arm, which is closer to the dot to reach (Figure

5.8).


Figure 5.12: Left: Illustration of the what-to-imitate issue. Right: Illustrationof the how-to-imitate issue.

5.3 Extraction of constraints and reproduction combining multiple controllers

In the cost function presented in the previous section, the variability on the

different sets of variables was used to select either a direct controller or an

inverse kinematics controller. Here, we consider the case where an optimal

controller is found by combining both constraints.

To illustrate this issue, let us first consider the toy example of a planar

manipulator with two DOFs, performing two demonstrations of a given task,

each time starting from a different joint angle configuration (Figure 5.10, left).

By observing the trajectories of the joints θ1, θ2 and the paths followed by

the end-effector of the manipulator x1, x2, we see that the first set of variables varies considerably while the second set of variables remains fairly constant

across the demonstrations. Thus the information conveyed by the end-effector

path appears to be more reliable than that conveyed by the joint angles. In

other words, the task appears to put stronger constraints on the end-effector

path than on the joint angle trajectories. Thus, to reproduce the task, one

would give more weight to reproducing the end-effector path than the joint an-

gle trajectories. Once we have measured the relative importance of each set

of variables and incorporated this measure into a cost function H, we can de-

termine a trajectory for the joint angles and end-effector of the imitator’s arm

which is optimal with respect to H. The problem may not be as straightforward

as it seems since the demonstrator and the imitator may differ significantly in

their embodiment (arm lengths, number of degrees of freedom for each arm).

This issue is illustrated in the right cell of Figure 5.10, where the 2 segments

of the imitator’s manipulator differ in length from those of the demonstrator.

If the imitator’s arm replays directly the joint angle trajectories of the demon-

strator’s arm, we would end up with a very different path for the end-effector

than the one demonstrated. Conversely, replaying the end-effector path (using


an inverse kinematics controller) would result in major differences in the joint

angle trajectories. This figure illustrates the effect of generating different alter-

natives, from purely satisfying the joint trajectories to satisfying only the hand

path.

A similar issue is considered when transferring a skill from a human to a

robot, i.e., when considering either a different embodiment or different objects

to manipulate (Figure 5.11). Figure 5.12 provides an example for a Chess Task

consisting in grabbing a chess piece and moving it two squares forward. The

picture on the left shows the path followed by the robot’s hand during kines-

thetic teaching when starting from two different initial locations. To extract the

constraints of the task (i.e., to determine what-to-imitate), the robot computes

the variations and correlations across the variables. In this task, this analysis

will reveal weak correlations at the beginning of the motion, as a large set of

paths can be used to reach for the chess piece depending on the hand’s initial

position. However, the analysis will measure a strong correlation for grabbing

the piece and pushing it towards the desired location without hitting the other

chess pieces on the chessboard. The right snapshot of Figure 5.12 illustrates the

how-to-imitate issue. Once trained to perform a task in a particular context, the

robot should be able to generalize and reproduce the skill in a different context.

In this example, the robot should be able to grab and move the chess piece two

squares forward wherever it may be on the chess board. Since the demonstrated

joint angles and hand path are mutually exclusive in the imitator space, it is not

possible to fulfill both constraints at the same time. Depending on the situation,

the robot may have to find a very different joint angles configuration than the

one demonstrated. In order to do this, the robot computes the trajectory which

gives the optimal trade-off between satisfying the constraints of the task and its

own body constraints. In the next sections, we describe and discuss how this

trade-off can be computed.

5.3.1 Reproduction in joint angle space using a direct computation

As humanoid robot arms are kinematically redundant, it would be useful to

treat both hand paths x and joint angles θ in the same representational space.

This can be achieved by considering their first-order derivatives, i.e., by estimating the velocities from the set of trajectories collected by the robot as

\[
\dot{x}_{s,i,j} = x_{s,i,j} - x_{s,i,j-1}, \qquad
\dot{\theta}_{s,i,j} = \theta_{s,i,j} - \theta_{s,i,j-1}, \qquad
\forall i \in \{1, \ldots, n\}, \; \forall j \in \{2, \ldots, T\},
\]

where the indices s, i, j represent respectively the type of value (here, a spatial

value), the identification of the demonstration (the number of gestures provided

to the robot) and the time steps elapsed from the beginning of the demonstra-

tion.


θ̇s and ẋs are kinematically constrained. By considering an iterative, locally linear solution to the inverse kinematics, we can relate the two variables with a Jacobian matrix J by writing

\[
\dot{x}'_{s,i,j} = J(\theta_{s,i,j-1}) \, \dot{\theta}_{s,i,j}, \qquad
\forall i \in \{1, \ldots, n\}, \; \forall j \in \{2, \ldots, T\}. \tag{5.4}
\]

Similarly, we define

\[
\dot{\theta}'_{s,i,j} = J(\theta_{s,i,j-1})^{\dagger} \, \dot{x}_{s,i,j}, \qquad
\forall i \in \{1, \ldots, n\}, \; \forall j \in \{2, \ldots, T\}, \tag{5.5}
\]

where J(θs,i,j−1)† is the pseudo-inverse of J(θs,i,j−1). Since the generalized joint angle velocities and the generalized end-effector velocities can be mutually exclusive in the imitator workspace (due to the different embodiments of the demonstrator and the imitator and to the separate generalization processes), ẋs and ẋ′s can diverge in Cartesian space (similarly, θ̇s and θ̇′s can diverge in joint angle space). To fulfill both constraints simultaneously, we use the product properties of Gaussian distributions to find a controller satisfying the two independent constraints (see Equation (2.7) in Section 2.6.2). In joint angle space, using Equations (2.4) and (5.5), we obtain

\[
\dot{x}_s \sim \mathcal{N}\big(\hat{\dot{x}}_s, \Sigma^{\dot{x}}_s\big)
\;\Rightarrow\;
\dot{\theta}'_s \sim \mathcal{N}\big(J^{\dagger} \hat{\dot{x}}_s, \; J^{\dagger} \Sigma^{\dot{x}}_s (J^{\dagger})^{\top}\big). \tag{5.6}
\]

Using Equations (2.7), (5.6) and θ̇s ∼ N(θ̂̇s, Σ^{θ̇}_s), the product of the two Gaussian distributions describing θ̇s and θ̇′s is equivalent (up to a scaling factor) to the Gaussian distribution N(µ, Σ) defined by the parameters

\[
\mu = \big( (\Sigma^{\dot{\theta}}_s)^{-1} + J^{\top} (\Sigma^{\dot{x}}_s)^{-1} J \big)^{-1}
\big( (\Sigma^{\dot{\theta}}_s)^{-1} \hat{\dot{\theta}}_s + J^{\top} (\Sigma^{\dot{x}}_s)^{-1} \hat{\dot{x}}_s \big),
\]
\[
\Sigma = \big( (\Sigma^{\dot{\theta}}_s)^{-1} + J^{\top} (\Sigma^{\dot{x}}_s)^{-1} J \big)^{-1}. \tag{5.7}
\]
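Equation (5.7) can be checked numerically with a few lines of linear algebra; the dimensions, covariances and Jacobian below are illustrative values, not taken from the robot:

```python
import numpy as np

rng = np.random.default_rng(0)
n_joints, n_task = 4, 3
J = rng.standard_normal((n_task, n_joints))   # illustrative Jacobian

th_hat = rng.standard_normal(n_joints)        # desired joint velocities
x_hat = rng.standard_normal(n_task)           # desired end-effector velocities
S_th = 0.5 * np.eye(n_joints)                 # joint-space covariance
S_x = 0.1 * np.eye(n_task)                    # tighter task-space covariance

W_th, W_x = np.linalg.inv(S_th), np.linalg.inv(S_x)
Sigma = np.linalg.inv(W_th + J.T @ W_x @ J)        # Eq. (5.7), second line
mu = Sigma @ (W_th @ th_hat + J.T @ W_x @ x_hat)   # Eq. (5.7), first line
```

Because the task-space covariance is tighter here, the compromise µ pulls the resulting end-effector velocity J µ toward the demonstrated one, without discarding the joint-space constraint.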

5.3.2 Reproduction in joint angle space by deriving a metric of imitation

The result presented in Equation (5.7) can also be found by deriving a metric

of imitation performance (see also Section 2.6.2).

Let θ̂, x̂ be, respectively, the generalized joint angle trajectories and the generalized hand path extracted from the demonstrations. Let θ, x be the candidate trajectories for reproducing the motion. A metric of imitation performance (i.e., cost function for the task) H is defined by

\[
H = (\dot{\theta} - \hat{\dot{\theta}})^{\top} W^{\theta} (\dot{\theta} - \hat{\dot{\theta}})
+ (\dot{x} - \hat{\dot{x}})^{\top} W^{x} (\dot{x} - \hat{\dot{x}}), \tag{5.8}
\]


where W^θ and W^x are weight matrices defined respectively by W^θ = (Σ^{θ̇}_s)^{-1} and W^x = (Σ^{ẋ}_s)^{-1}. By considering the additional variables c1 and c2 defined by

\[
c_{1,i,j} = \hat{\theta}_{s,i,j} - \theta_{s,i,j-1}, \qquad
c_{2,i,j} = \hat{x}_{s,i,j} - x_{s,i,j-1}, \qquad
\forall i \in \{1, \ldots, n\}, \; \forall j \in \{2, \ldots, T\},
\]

we rewrite Equation (5.8) as

\[
H = (\dot{\theta} - c_1)^{\top} W^{\theta} (\dot{\theta} - c_1)
+ (\dot{x} - c_2)^{\top} W^{x} (\dot{x} - c_2). \tag{5.9}
\]

The problem is now reduced to finding a minimum of Equation (5.9), when subjected to the body constraint ẋ = J θ̇. Since H is a quadratic function, the problem can be solved analytically by Lagrange optimization. We define the Lagrangian as

\[
L(\dot{\theta}, \dot{x}, \lambda_1) = (\dot{\theta} - c_1)^{\top} W^{\theta} (\dot{\theta} - c_1)
+ (\dot{x} - c_2)^{\top} W^{x} (\dot{x} - c_2)
+ \lambda_1^{\top} (\dot{x} - J \dot{\theta}),
\]

where λ1 is the vector of associated Lagrange multipliers. To solve ∇L = 0, we use the property W⊤ = W of symmetric matrices (the inverse covariance matrices are by definition symmetric), and we derive respectively ∂L/∂θ̇ = 0 and ∂L/∂ẋ = 0, which gives

\[
-2 W^{\theta} (\dot{\theta} - c_1) - J^{\top} \lambda_1 = 0, \tag{5.10}
\]
\[
-2 W^{x} (\dot{x} - c_2) + \lambda_1 = 0. \tag{5.11}
\]

By combining Equations (5.10) and (5.11), we find

\[
W^{\theta} (\dot{\theta} - c_1) + J^{\top} W^{x} (\dot{x} - c_2) = 0,
\]

and solving for θ̇ (using ẋ = J θ̇), we finally obtain

\[
\dot{\theta} = \big( W^{\theta} + J^{\top} W^{x} J \big)^{-1}
\big( W^{\theta} c_1 + J^{\top} W^{x} c_2 \big).
\]

By considering W^θ = (Σ^{θ̇}_s)^{-1} and W^x = (Σ^{ẋ}_s)^{-1}, we then see that this result is similar to the µ found in Equation (5.7). The main advantage of the direct computation method presented in Section 5.3.1 is that it not only determines an optimal controller satisfying both constraints, but also computes the resulting constraint for this controller, represented as Σ in Equation (5.7).
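The equivalence between the Lagrangian solution and the Gaussian-product controller can also be verified numerically; the following sketch (with illustrative values) checks that the closed-form θ̇ indeed minimizes H under the body constraint ẋ = J θ̇:

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((3, 4))                 # illustrative Jacobian
c1, c2 = rng.standard_normal(4), rng.standard_normal(3)
W_th, W_x = 2.0 * np.eye(4), 5.0 * np.eye(3)    # illustrative weight matrices

# Closed-form solution derived from the Lagrangian
theta_dot = np.linalg.solve(W_th + J.T @ W_x @ J, W_th @ c1 + J.T @ W_x @ c2)

def H(td):
    """Metric of Eq. (5.9) with the body constraint substituted in."""
    xd = J @ td
    return (td - c1) @ W_th @ (td - c1) + (xd - c2) @ W_x @ (xd - c2)

# Random perturbations of the solution should never decrease H
perturbed = [H(theta_dot + 1e-3 * rng.standard_normal(4)) for _ in range(20)]
```

Since H is strictly convex (the weight matrices are positive definite), the closed-form solution is the unique minimizer.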


5.3.3 Reproduction in latent space by deriving a metric of imitation

Let us now consider a task with three types of constraints: (1) at the joint angle

level (angular trajectories θ); (2) at the hand path level (absolute trajectories

x); (3) at the hand-object relationship level (trajectories relative to an object

y).5 We assume in this experiment that the two arms of the robot can be used.

For each of the n demonstrations, ys is defined as a distance vector between the initial position of the object os ∈ R⁶ (the Cartesian position in R³ is duplicated for the two hands) and the positions of the hands xs, i.e., we define

\[
y_{s,i,j} = x_{s,i,j} - o_{s,i}, \qquad
\forall i \in \{1, \ldots, n\}, \; \forall j \in \{1, \ldots, T\}. \tag{5.12}
\]

Similarly to Equation (5.4), θ̇s and ẋs are kinematically constrained, but we consider this time the inverse kinematics issue in the latent space of motion, by computing

\[
\dot{x}_s = J(\theta_s) \, \dot{\theta}_s
\;\Leftrightarrow\; A^x \dot{\xi}^x_s = J\big(A^{\theta} \xi^{\theta}_s + \bar{\theta}_s\big) A^{\theta} \dot{\xi}^{\theta}_s
\;\Leftrightarrow\; \dot{\xi}^x_s = \tilde{J}\big(\xi^{\theta}_s\big) \, \dot{\xi}^{\theta}_s, \tag{5.13}
\]

with \(\tilde{J}(\xi^{\theta}_s) = (A^x)^{\dagger} J\big(A^{\theta} \xi^{\theta}_s + \bar{\theta}_s\big) A^{\theta}\).

Here, (A^x)† is the pseudo-inverse of A^x, J is the Jacobian in the original data space, and J̃ is the Jacobian in the latent space.

Equation (5.12) can be expressed in terms of velocities, and rewritten in the latent space, providing a relation between ξ̇x_s and ξ̇y_s, i.e., we define

\[
\dot{y}_s = \dot{x}_s - \dot{o}_s
\;\Leftrightarrow\; A^y \dot{\xi}^y_s = A^x \dot{\xi}^x_s - \dot{o}_s
\;\Leftrightarrow\; \dot{\xi}^y_s = A^z \dot{\xi}^x_s - (A^y)^{\dagger} \dot{o}_s, \tag{5.14}
\]

with A^z = (A^y)† A^x.

Here, ȯs is the velocity of the object (which we consider to be equal to zero in this experiment) and (A^y)† is the pseudo-inverse of A^y. In the latent space, let ξ̂θ_s, ξ̂x_s, ξ̂y_s be, respectively, the generalized joint angle trajectories, the generalized hand path and the generalized hands-object distance vectors extracted from the demonstrations. Let ξθ_s, ξx_s, ξy_s be the candidate trajectories for reproducing the motion. The metric of imitation performance (i.e., cost function for the task) H is given by

\[
H = (\dot{\xi}^{\theta}_s - \hat{\dot{\xi}}^{\theta}_s)^{\top} W^{\theta} (\dot{\xi}^{\theta}_s - \hat{\dot{\xi}}^{\theta}_s)
+ (\dot{\xi}^{x}_s - \hat{\dot{\xi}}^{x}_s)^{\top} W^{x} (\dot{\xi}^{x}_s - \hat{\dot{\xi}}^{x}_s)
+ (\dot{\xi}^{y}_s - \hat{\dot{\xi}}^{y}_s)^{\top} W^{y} (\dot{\xi}^{y}_s - \hat{\dot{\xi}}^{y}_s). \tag{5.15}
\]

5 More generically, note that this task can also be considered as being constrained by the joint angles and by actions on two different objects, where the first object is fixed in the environment.

By considering the additional variables c1, c2 and c3 defined by

\[
c_{1,i,j} = \hat{\xi}^{\theta}_{s,i,j} - \xi^{\theta}_{s,i,j-1}, \qquad
c_{2,i,j} = \hat{\xi}^{x}_{s,i,j} - \xi^{x}_{s,i,j-1}, \qquad
c_{3,i,j} = \hat{\xi}^{y}_{s,i,j} - \xi^{y}_{s,i,j-1}, \qquad
\forall i \in \{1, \ldots, n\}, \; \forall j \in \{2, \ldots, T\},
\]

we rewrite Equation (5.15) as

\[
H = (\dot{\xi}^{\theta}_s - c_1)^{\top} W^{\theta} (\dot{\xi}^{\theta}_s - c_1)
+ (\dot{\xi}^{x}_s - c_2)^{\top} W^{x} (\dot{\xi}^{x}_s - c_2)
+ (\dot{\xi}^{y}_s - c_3)^{\top} W^{y} (\dot{\xi}^{y}_s - c_3). \tag{5.16}
\]

The problem is now reduced to finding a minimum of Equation (5.16), when subjected to the body constraint and the environmental constraint defined respectively by Equations (5.13) and (5.14). Since H is a quadratic function, the problem can be solved analytically by Lagrange optimization. We define the Lagrangian as

\[
L(\dot{\xi}^{\theta}_s, \dot{\xi}^{x}_s, \dot{\xi}^{y}_s, \lambda_1, \lambda_2) = H
+ \lambda_1^{\top} \big(\dot{\xi}^{x}_s - \tilde{J} \dot{\xi}^{\theta}_s\big)
+ \lambda_2^{\top} \big(\dot{\xi}^{y}_s - A^z \dot{\xi}^{x}_s + (A^y)^{\dagger} \dot{o}_s\big),
\]

where λ1 and λ2 are the vectors of associated Lagrange multipliers. To solve ∇L = 0, we use the symmetry W⊤ = W of the weight matrices and derive respectively ∂L/∂ξ̇θ_s = 0, ∂L/∂ξ̇x_s = 0 and ∂L/∂ξ̇y_s = 0, which gives

\[
-2 W^{\theta} (\dot{\xi}^{\theta}_s - c_1) - \tilde{J}^{\top} \lambda_1 = 0, \tag{5.17}
\]
\[
-2 W^{x} (\dot{\xi}^{x}_s - c_2) + \lambda_1 - (A^z)^{\top} \lambda_2 = 0, \tag{5.18}
\]
\[
-2 W^{y} (\dot{\xi}^{y}_s - c_3) + \lambda_2 = 0. \tag{5.19}
\]

Using Equations (5.18) and (5.19), we find

\[
\lambda_1 = 2 W^{x} (\dot{\xi}^{x}_s - c_2) + (A^z)^{\top} \, 2 W^{y} (\dot{\xi}^{y}_s - c_3). \tag{5.20}
\]

Using Equations (5.20) and (5.17), we find

\[
W^{\theta} (\dot{\xi}^{\theta}_s - c_1) + \tilde{J}^{\top} W^{x} (\dot{\xi}^{x}_s - c_2)
+ \tilde{J}^{\top} (A^z)^{\top} W^{y} (\dot{\xi}^{y}_s - c_3) = 0.
\]


Solving for ξ̇θ_s, we obtain

\[
\dot{\xi}^{\theta}_s = \big( W^{\theta} + \tilde{J}^{\top} W^{x} \tilde{J}
+ (A^z \tilde{J})^{\top} W^{y} (A^z \tilde{J}) \big)^{-1}
\big( W^{\theta} c_1 + \tilde{J}^{\top} W^{x} c_2 + (A^z \tilde{J})^{\top} W^{y} c_4 \big),
\]

with c4 = (A^y)† ȯs + c3.

We can then compute the joint angle trajectories iteratively with

\[
\xi^{\theta}_{s,i,j} = \xi^{\theta}_{s,i,j-1} + \dot{\xi}^{\theta}_{s,i,j}, \qquad
\forall i \in \{1, \ldots, n\}, \; \forall j \in \{2, \ldots, T\}.
\]

The joint angle trajectories used for the reproduction are finally found using the relation θs = A^θ ξ^θ_s + θ̄s.
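A numerical sketch of this latent-space solution (with random matrices standing in for the PCA projections, and a fixed latent Jacobian, all values illustrative) verifies the stationarity condition obtained after eliminating the Lagrange multipliers:

```python
import numpy as np

rng = np.random.default_rng(2)
d_th, d_x, d_y = 3, 2, 2                      # latent dimensionalities
Jt = rng.standard_normal((d_x, d_th))         # latent-space Jacobian (fixed here)
Ax = rng.standard_normal((6, d_x))            # stand-ins for the PCA matrices
Ay = rng.standard_normal((6, d_y))
Az = np.linalg.pinv(Ay) @ Ax                  # A^z = (A^y)^dagger A^x

W_th, W_x, W_y = np.eye(d_th), 2.0 * np.eye(d_x), 3.0 * np.eye(d_y)
c1, c2, c3 = (rng.standard_normal(d_th), rng.standard_normal(d_x),
              rng.standard_normal(d_y))
o_dot = np.zeros(6)                           # object at rest, as in the text
c4 = np.linalg.pinv(Ay) @ o_dot + c3

M = W_th + Jt.T @ W_x @ Jt + (Az @ Jt).T @ W_y @ (Az @ Jt)
xi_dot = np.linalg.solve(M, W_th @ c1 + Jt.T @ W_x @ c2 + (Az @ Jt).T @ W_y @ c4)

# Stationarity condition with both constraints substituted in
residual = (W_th @ (xi_dot - c1) + Jt.T @ W_x @ (Jt @ xi_dot - c2)
            + (Az @ Jt).T @ W_y @ (Az @ Jt @ xi_dot - c4))
```

The residual vanishes, confirming that the closed-form expression satisfies the combined optimality condition of the three constraints.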

5.3.4 Experimental setup

Using the process described in Section 5.3.3, three tasks are taught to the HOAP-2 robot, of which only the 11 DOFs of the arms and torso are required in the experiments (8 DOFs for the arms, 1 DOF for the torso, and 2 binary commands to open and close the robot's hands). The remaining DOFs of the legs are set to a constant position, so as to support the robot in an

upright posture facing a table. The three tasks are illustrated in Figure 5.13,

where the robot is taught through kinesthetics, as presented in Section 3.2. By

letting the user physically move its limbs, the robot senses its own motion by

registering the joint angle data through the motor encoders. The initial positions of the different objects are registered by helping the robot grasp and release the

objects.

In these experiments, we make the assumption that the important sensory information comes from: (1) the posture of the robot, defined by the joint

angles provided by the motor encoders actuating the upper-body part; (2) the

absolute positions of the hands of the robot in a Cartesian space, calculated by

direct kinematics using the joint angles; (3) the relative positions between the

hands and the initial position of the object, calculated from the absolute initial

position of the object and the absolute positions of the hands in a Cartesian

space; (4) the open/close status of the two hands of the robot.

The different datasets are first projected into latent spaces of motion using PCA, as presented in Section 2.9.1. The different trajectories are then aligned temporally using the DTW approach presented in Section 2.12. The continuous trajectories are encoded in a GMM using a batch learning procedure, as presented in Section 2.8.1. Generalized versions of these trajectories are retrieved through GMR, as presented in Section 2.5. The open/close status of the two hands is encoded in a BMM, as presented in Section 2.14.1.
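The GMR step can be illustrated with a minimal sketch; the two-component model below is hand-specified (not learned by EM) and one-dimensional in both time and output, purely to show the conditioning mechanism:

```python
import numpy as np

def gmr(t_query, priors, means, covs):
    """Gaussian Mixture Regression sketch: condition a GMM encoding (t, y)
    jointly on the input t to retrieve the expected output E[y | t].
    Scalar t and y for clarity; see Section 2.5 for the general case."""
    # Responsibility of each component for the query input
    h = np.array([p * np.exp(-0.5 * (t_query - m[0]) ** 2 / c[0, 0])
                  / np.sqrt(2.0 * np.pi * c[0, 0])
                  for p, m, c in zip(priors, means, covs)])
    h /= h.sum()
    # Mixture of the component-wise conditional means
    return sum(hk * (m[1] + c[1, 0] / c[0, 0] * (t_query - m[0]))
               for hk, m, c in zip(h, means, covs))

# Hand-specified two-component model, illustrative only
priors = [0.5, 0.5]
means = [np.array([0.25, 1.0]), np.array([0.75, -1.0])]
covs = [np.array([[0.02, 0.0], [0.0, 0.1]])] * 2

y_left = gmr(0.25, priors, means, covs)
y_right = gmr(0.75, priors, means, covs)
```

Near the center of a component, that component dominates the responsibilities and the regression output approaches its conditional mean.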

The hand-object directional vectors and the nonlinear combination of the

joint angles used to retrieve the position of the hands are hard-coded instead of



Figure 5.13: Illustration of the kinesthetic teaching process for the 3 tasks considered in this experiment. Chess Task: Grabbing and moving a chess piece two squares forward. Bucket Task: Grabbing and bringing a bucket to a specific position. Sugar Task: Grabbing a piece of sugar and bringing it to the mouth, using either the right or left hand.



Figure 5.14: Estimation of the number of components required to reduce the dimensionality of the data space for the Bucket Task, using eigenvalues (solid line for the joint angles, dashed line for the hands paths and dash-dotted line for the hands-object relationship). The point corresponds to the number of dimensions retained to represent at least 98% of the data variance.
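The 98% criterion used with Figure 5.14 can be sketched as follows, with synthetic data whose intrinsic dimensionality is known; the function name and threshold handling are assumptions of this sketch:

```python
import numpy as np

def latent_dim(data, var_ratio=0.98):
    """Smallest number of principal components whose eigenvalues retain at
    least var_ratio of the total data variance. `data` has shape (samples, D)."""
    eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, var_ratio) + 1)

# Synthetic data that is essentially 2-D, embedded in 5 dimensions
rng = np.random.default_rng(3)
latent = rng.standard_normal((500, 2))
mixing = np.array([[1.0, 0.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0, 0.0, 0.0]])
data = latent @ mixing + 1e-4 * rng.standard_normal((500, 5))
d = latent_dim(data)
```

The criterion recovers the intrinsic dimensionality of 2, discarding the three directions that only carry noise.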

being extracted autonomously by the system, which creates additional redundant information in the dataset. As we are considering manipulation tasks, we decided to provide this information to the system directly, since this information (i.e., the position of the hands in a fixed frame of reference) is meaningful for the skill considered and remains general for a broad range of tasks. This way, the number of examples needed to extract the important aspects of the task is reduced, since hard-coding this knowledge alleviates the need of learning the direct kinematics of the robot.6

Depending on the task, the robot is provided with 4 to 7 demonstrations by an expert user, where each demonstration is rescaled to T = 100 datapoints. Note that the number of examples required for an efficient reproduction of the task depends on the teaching efficiency of the user. Indeed, an expert teacher produces demonstrations that explore as much as possible the variations allowed by the task, while a naive user may demonstrate the task several times in the same manner without fully exploring the constraints required by the task. This issue will be illustrated later in Section 5.4 and discussed in Section 6.2, showing how it can be alleviated by considering incremental active teaching scenarios.

Once trained, the robot is required to reproduce each task under different constraints by placing the object at different locations in the robot's workspace. This procedure aims at demonstrating the robustness of the system when the constraints are transposed to different locations within the workspace.

5.3.5 Experimental results

Figures 5.14 and 5.15 provide an example, for the Bucket Task, of selecting the dimensionality of the latent space and the number of GMM components used to encode the projected data. The results for the other tasks are summarized in Table 5.2. We see that the dimensionalities of the latent space

6 Note that choosing the adequate task-dependent variables initially may not be trivial for more complex paradigms.


[Figure 5.15 plots: BIC score versus number of Gaussian components K, for each of the three data types.]

Figure 5.15: Estimation of the number of Gaussian components required to model the trajectories in the latent space for the Bucket Task (solid line for the joint angles, dashed line for the hands paths and dash-dotted line for the hands-object relationship). Adding more Gaussian components increases the log-likelihood, but also increases the number of parameters. The BIC criterion defines a trade-off to select an optimal number of parameters.
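The BIC-based selection illustrated in Figure 5.15 can be sketched with scikit-learn's `GaussianMixture`, whose `bic()` method implements the same trade-off between log-likelihood and number of parameters; the clustered toy data below is an assumption chosen only for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy latent-space data: three well-separated clusters in 2-D.
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                  for c in ([0, 0], [4, 0], [2, 4])])

# Fit GMMs with an increasing number of components and score each by BIC.
bics = []
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(data)
    bics.append(gmm.bic(data))

best_k = int(np.argmin(bics)) + 1  # K minimizing the BIC score
print(best_k)  # 3 for this toy dataset
```

The minimum of the BIC curve plays the role of the selected K in the figure.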

Table 5.2: Number of parameters extracted automatically by the system.

                                        Data space        Latent space
                               n    T    (d − 1)        (D − 1)     K

Chess Task                θ    7    100     9              4        5
obj1: chess board (fixed) x    7    100     6              4        5
obj2: chess piece         y    7    100     6              4        4
                          h    7    100     2             (2)       1

Bucket Task               θ    7    100     9              5        4
obj1: table (fixed)       x    7    100     6              4        7
obj2: bucket              y    7    100     6              4        6
                          h    7    100     2             (2)       1

Sugar Task - left         θ    4    100     9              2        5
obj1: mouth (fixed)       x    4    100     6              3        5
obj2: piece of sugar      y    4    100     6              3        6
                          h    4    100     2             (2)       1

Sugar Task - right        θ    4    100     9              2        6
obj1: mouth (fixed)       x    4    100     6              3        6
obj2: piece of sugar      y    4    100     6              3        6
                          h    4    100     2             (2)       1


[Figure 5.16 plots, in three rows: joint angles θ1–θ9 over time t; GMM in latent space of ξθ1–ξθ5; GMR in latent space of ξθ1–ξθ5.]

Figure 5.16: Probabilistic encoding and extraction of constraints for the joint angles θ in the Bucket Task.


[Figure 5.17 plots, in three rows: hands paths x1–x6 over time t; GMM in latent space of ξx1–ξx4; GMR in latent space of ξx1–ξx4.]

Figure 5.17: Probabilistic encoding and extraction of constraints for the hands paths x in the Bucket Task.


[Figure 5.18 plots, in three rows: hands-object relationship y1–y6 over time t; GMM in latent space of ξy1–ξy4; GMR in latent space of ξy1–ξy4.]

Figure 5.18: Probabilistic encoding and extraction of constraints for the hands-object relationship y in the Bucket Task.


[Figure 5.19 plots: binary hand signals h1 and h2 over time t.]

Figure 5.19: Encoding and generalization of the open/close binary commands of the two hands in the Bucket Task. h1 and h2 correspond respectively to the opening signals of the right and left hand. The different demonstrations are represented in thin lines and encoded in a Bernoulli Mixture Model, as presented in Section 2.14.1. The binary signal generalized over the demonstrations is represented in thick line. In this task, as the bucket is grasped by the two hands at nearly the same time, the system finds that a single component is sufficient to describe the multivariate binary signals.
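For the single-component case found here, the Bernoulli Mixture Model reduces to estimating, at each time step, the mean of the binary signal over the demonstrations and thresholding it. The sketch below illustrates only this degenerate case on synthetic grasp signals (the general multi-component BMM of Section 2.14.1 requires EM; all data here are hypothetical):

```python
import numpy as np

# Hypothetical demonstrations: 4 runs of a binary grasp signal over
# T = 100 steps, with slight timing jitter around step 40.
rng = np.random.default_rng(2)
T, demos = 100, []
for _ in range(4):
    onset = 40 + rng.integers(-3, 4)
    demos.append((np.arange(T) >= onset).astype(float))
demos = np.array(demos)

# Single-component Bernoulli model: the ML estimate of the parameter
# at each time step is the mean over demonstrations; the generalized
# command is obtained by thresholding at 0.5.
p = demos.mean(axis=0)
generalized = (p > 0.5).astype(int)
print(generalized[:5], generalized[-5:])  # stays 0 early, 1 late
```

The thick generalized line in Figure 5.19 plays the role of `generalized` here.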

Figure 5.20: Legend for the graphs in Figures 5.21-5.26.


[Figure 5.21 graph: hand trajectory in the (x1, x2) plane, in mm.]

Figure 5.21: Reproduction of the Chess Task with an initial position of the object which is close to the generalized trajectories. The three snapshots represent the motion at the different time steps A, B and C represented on the graph. The snapshots also show the Cartesian trajectory of the hand tracked by the stereoscopic vision system.

[Figure 5.22 graphs 1–3: reproduction trajectories in the (x1, x2) plane, in mm.]

Figure 5.22: Bottom left: Map depicting the mean values taken by the cost function H for the Chess Task, with respect to the initial position of the chess piece on the board. The graphs 1, 2 and 3 show the reproduction attempts for the corresponding locations on the map (see also the legend in Figure 5.20).


[Figure 5.23 graph: hands trajectories in the (x1, x2) plane, in mm.]

Figure 5.23: Reproduction of the Bucket Task with an initial position of the object which is close to the generalized trajectories. The three snapshots represent the motion at the different time steps A, B and C represented on the graph. The snapshots also show the Cartesian trajectories of the hands tracked by the stereoscopic vision system.

[Figure 5.24 graphs 1–3 and map of the cost function H, in the (x1, x2) plane, in mm.]

Figure 5.24: Bottom left: Map depicting the mean values taken by the cost function H for the Bucket Task, with respect to the initial position of the bucket on the table. The graphs 1, 2 and 3 show the reproduction attempts for the corresponding locations on the map (see also the legend in Figure 5.20).


[Figure 5.25 graph: hand trajectory in the (x1, x2) plane, in mm.]

Figure 5.25: Reproduction of the Sugar Task with an initial position of the object which is close to the generalized trajectories. The three snapshots represent the motion at the different time steps A, B and C represented on the graph. The snapshots also show the Cartesian trajectory of the hand tracked by the stereoscopic vision system.

for the Chess Task and Bucket Task are higher than for the Sugar Task, probably due to the fact that both hands are used simultaneously in these cases (for the Chess Task, the left hand is used to bend over the table). For the Sugar Task, one hand is usually motionless while the other is performing the task. The number of Gaussian components is between 4 and 7 (automatically extracted by BIC). Note that the total number of parameters depends quadratically on the dimensionality of the latent space and linearly on the number of Gaussian components, i.e., the total numbers of parameters used to encode the data are nPCA = (D − 1)(d − 1) and nGMM = (K − 1) + K(D + D(D + 1)/2). A single component is found by BMM to efficiently encode the hands' binary signals h, because one or two hands are closed simultaneously for each task considered.
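These two parameter counts can be written as small helper functions; the example numbers below are hypothetical, chosen only in the spirit of Table 5.2:

```python
def n_pca_params(d_minus_1, D_minus_1):
    """Free parameters of the PCA projection: (D - 1)(d - 1)."""
    return D_minus_1 * d_minus_1

def n_gmm_params(D, K):
    """Free parameters of a K-component GMM over D-dimensional data:
    (K - 1) independent priors, K means of size D, and K symmetric
    full covariance matrices with D(D + 1)/2 free entries each."""
    return (K - 1) + K * (D + D * (D + 1) // 2)

# Hypothetical example: a 9-D data space reduced to a 5-D latent
# space, with time appended before GMM encoding (D = 6) and K = 4.
print(n_pca_params(9, 5))   # 45
print(n_gmm_params(6, 4))   # (4-1) + 4*(6 + 21) = 111
```

The quadratic dependence on D comes entirely from the covariance term D(D + 1)/2.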

Figures 5.16, 5.17, 5.18 and 5.19 show a complete encoding example for the Bucket Task. In Figures 5.16, 5.17 and 5.18, the first rows show the acquisition of data and generalized trajectories for the three different constraints θ, x and y. θ1 represents the torso joint angle trajectory. θ2, . . . , θ5 and θ6, . . . , θ9 represent respectively the right and left arm joint angles. x1, x2, x3 and x4, x5, x6 represent respectively the right and left hand paths. y1, y2, y3 and y4, y5, y6 represent respectively the right and left hand paths with respect to the bucket. The demonstrations are represented in grey lines and the generalized signal reconstructed from the latent space is represented in thick line. The second rows


[Figure 5.26 graphs 1–4, map of the cost function H, and left/right controller selection map, in the (x1, x2) plane, in mm.]

Figure 5.26: Bottom left: Map depicting the mean values taken by the cost function H for the Sugar Task, with respect to the initial position of the piece of sugar on the box. The graphs 1, 2, 3 and 4 show the reproduction attempts for the corresponding locations on the map (see also the legend in Figure 5.20). Two different areas are visible for the left and right arm. Bottom right: Selection of a left/right arm controller depending on the value of H (black areas correspond to Hleft < Hright).


show the reduction of dimensionality and temporal alignment, representing the original trajectories projected onto a latent space of lower dimensionality using PCA (Section 2.9.1) and processed by DTW (Section 2.12). The resulting signals ξ are encoded in GMM using a batch mode (Section 2.8.1). The third rows show the extraction of the constraints, where the generalized trajectories ξ in the latent space are represented in thick line, with corresponding covariance matrices Σss represented as envelopes around ξ. We see in Figure 5.18 that the trajectories for y are highly constrained when the user grasps the object at time steps 30-50, i.e., the generalized trajectories present narrow envelopes for each dimension, which means that the relative position of the hands with respect to the bucket does not allow much variation. We also see in Figure 5.17 that the trajectories for x are highly constrained at the end of the motion, since the ending positions vary very little across the demonstrations (the bucket is always placed at a specific location on the table after being grasped). Additionally, the joint angles in Figure 5.16 provide a constraint on the gestures used to achieve the task. They can present some redundancies with the hands paths, but they still provide useful additional information. Indeed, as the robot's arms permit positioning the hands in space with different joint angle configurations, an inverse kinematics problem arises when hands paths are reproduced. Adding the constraint of matching the inverse kinematics solution to the joint angle solution demonstrated by the user produces more natural-looking results.
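The GMR step that produces the generalized trajectory ξ and its covariance envelope amounts to conditioning a joint GMM over (t, ξ) on time. A minimal 1-D sketch with hypothetical, hand-picked mixture parameters (not values extracted from the experiments):

```python
import numpy as np

def gmr(priors, means, covs, t_query):
    """Condition a GMM over (t, xi) on time t: return the conditional
    mean and variance of xi (1-D latent variable for simplicity).
    means[k] = [mu_t, mu_xi]; covs[k] = [[S_tt, S_tx], [S_xt, S_xx]]."""
    K = len(priors)
    # Responsibility of each component for the query time.
    h = np.array([priors[k]
                  * np.exp(-0.5 * (t_query - means[k][0]) ** 2 / covs[k][0, 0])
                  / np.sqrt(2 * np.pi * covs[k][0, 0])
                  for k in range(K)])
    h /= h.sum()
    mu, second_moment = 0.0, 0.0
    for k in range(K):
        # Conditional mean and variance of component k given t.
        mu_k = means[k][1] + covs[k][1, 0] / covs[k][0, 0] * (t_query - means[k][0])
        s_k = covs[k][1, 1] - covs[k][1, 0] ** 2 / covs[k][0, 0]
        mu += h[k] * mu_k
        second_moment += h[k] * (s_k + mu_k ** 2)
    return mu, second_moment - mu ** 2

# Two components covering the early and late parts of a trajectory.
priors = [0.5, 0.5]
means = [np.array([25.0, 0.0]), np.array([75.0, 1.0])]
covs = [np.array([[100.0, 3.0], [3.0, 0.2]]),
        np.array([[100.0, -3.0], [-3.0, 0.2]])]
mu, var = gmr(priors, means, covs, 25.0)
print(round(mu, 3), var > 0)  # mean near 0: the first component dominates
```

Sweeping `t_query` over 1..100 traces out the thick regression line, and `var` gives the width of the envelope at each step.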

Figures 5.21 and 5.22 show the reproduced trajectories for the Chess Task, depending on the initial position of the chess piece (see also Figure 5.20 for the legend). Knowing that the right shoulder position is (x1, x2) = (100, −35), we see on the map of Figure 5.22 that the best location to reproduce the motion is to initially place the chess piece in front of the right arm, see graph (1). In graphs (2) and (3), the chess piece is initially placed at positions unobserved during the demonstrations.

Figures 5.23 and 5.24 show the reproduced trajectories for the Bucket Task, depending on the initial position of the bucket (see also Figure 5.20 for the legend). The optimal trajectory satisfying the learned constraints globally follows the demonstrated hands paths, while still using the demonstrated object-hands trajectories when approaching the bucket. Note also that an example of failure in this task will be presented later in Figure 6.6 (and discussed in Section 6.6.3).

Figures 5.25 and 5.26 show the reproduced trajectories for the Sugar Task, depending on the initial position of the piece of sugar (see also Figure 5.20 for the legend). In this task, a box is centered in front of the robot and two different gestures are taught: (1) the robot is taught how to grasp with its right hand the piece of sugar located at the far right on the top of the box; (2) the robot is taught how to grasp with its left hand a piece of sugar located at the far left on the top of the box. A model of the skill is then created for each of the left and right arms. Depending on the situation, an optimal controller for the reproduction


is then selected by evaluating each metric respectively. We see in the bottom-right inset of Figure 5.26 that the boundary between the left and right controllers is clearly situated at x1 = 0 (i.e., on the symmetry axis of the robot).

In conclusion, the essential features of the three tasks have been extracted successfully using a continuous representation. An optimal controller for the reproduction is computed by deriving a metric of imitation, which finds for each task a solution that matches as closely as possible the object-hands relationships, the hands paths and the joint angle trajectories demonstrated by the user. Depending on the task, these variables have different importance, which is extracted by observing a human expert demonstrating a similar skill several times in varying situations.
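One standard way to weight such Gaussian-encoded constraints against each other is a product of Gaussians, where each constraint contributes proportionally to its precision (inverse covariance). The 1-D sketch below illustrates the principle only, not the actual metric of imitation derived in the thesis:

```python
import numpy as np

def combine_constraints(mus, sigmas):
    """Product of 1-D Gaussian constraints: the combined estimate
    weights each constraint by its inverse variance, so tightly
    constrained variables dominate the reproduction."""
    precisions = 1.0 / np.asarray(sigmas)
    mu = (precisions * np.asarray(mus)).sum() / precisions.sum()
    return mu, 1.0 / precisions.sum()

# A tight hands-object constraint (variance 0.01) and a loose hands
# path constraint (variance 1.0) pulling toward different targets.
mu, var = combine_constraints([0.5, 0.0], [0.01, 1.0])
print(round(mu, 3))  # 0.495: the tight constraint dominates
```

When the envelopes in Figures 5.16-5.18 narrow, the corresponding variable's precision grows and it takes over the reproduction at that time step.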

By projecting the original data onto a latent space of motion and encoding the resulting data in GMM, the system is able to generalize the skill over different situations (over different initial positions of the object). However, it is important to note that the generalization capabilities (or degree of extension of the skill to varied situations) depend directly on the variations in the demonstrations provided to the robot and on the dimensionality of the latent space obtained by the system. For example, if the system detects that the hands are constrained to moving an object in a plane (2-dimensional hands paths latent space), the system will not be able to generalize over initial positions of the object that are not in this plane. For small changes in the initial position of the object between the demonstrations and the reproduction, we have seen that the robot was able to correctly adapt its gestures and reproduce the important qualitative features of each task, namely: (1) grasping and moving the chess piece with a specific relative path; (2) grasping the bucket with two hands and moving it to a specific location; (3) grasping the piece of sugar and bringing it to its mouth with either the left or right arm. These essential features are reproduced even though none of these high-level goals were explicitly represented in the robot's control system.

5.4 Extraction of constraints by using incremental teaching methods

In the previous section, we presented a set of experiments using a probabilistic framework for extracting the relevant components of a task by observing multiple demonstrations of it, using Principal Component Analysis (PCA), Gaussian Mixture Model (GMM) and Gaussian Mixture Regression (GMR). Training was performed in a batch mode, which means that the refinement of the model was not possible without the use of historical data.

Figure 5.27: Different modalities are used to convey the demonstrations and scaffolds required by the robot to learn a skill. The user first demonstrates the whole movement using motion sensors (left) and then helps the robot refine its skill by kinesthetic teaching (right), that is, by embodying the robot and putting it through the motion.

We discuss in this section the importance of adding social components to the teaching paradigm, not only in the user interface but also in the teaching process. We adopt here the perspective that the transfer of skills can take advantage of several psychological factors that the user might encounter during teaching. Thus, the teacher is no longer considered solely a model of expert behavior but becomes an active participant in the learning process (Yoshikawa et al., 2006). First, we suggest using different modalities to produce the demonstrations (Figure 5.27). We then present three experiments using the incremental learning approach with the direct update method presented in Section 2.8.3, allowing the teacher to gradually see the results of his/her demonstrations. Using the motion sensors (Section 3.4) allows the user to demonstrate a high-dimensional movement, while kinesthetic teaching (Section 3.2) provides a way of supporting the robot in its reproduction of the task (Figure 5.27). By using scaffolds, the user provides support to the robot by manually articulating a decreasing subset of motors. The scaffolds progressively fade away until the user finally lets the robot perform the task on its own, allowing the robot to experience the skill independently. The teacher thus uses the scaffolding process to let the robot gradually generalize the skill over an increasing range of contexts. Knowing that, it may be important to reproduce the acquired skill after each demonstration to help the teacher prepare the following demonstration according to the outcome. This shares similarities with the human process of refining the understanding of a task through practice and corrective feedback, by incrementally combining new information with previous understanding. It thus suggests the use of incremental learning approaches when teaching humanoid robots, allowing them to extend, refine and elaborate their understanding of the skill.


[Figure 5.28 block diagram — DEMONSTRATION: movement demonstration (X-Sens), kinesthetic teaching (encoders), visual objects tracking (stereo-vision); MODEL: latent space projection, GMM encoding or refinement, recognition of existing motion, GMR retrieval, data space re-projection, refinement of the model; REPRODUCTION: inverse kinematics, HOAP-3 simulator, HOAP-3 robot.]

Figure 5.28: Information flow across the complete system.

5.4.1 Experimental setup

Figure 5.28 presents the information flow across the complete system. The robot builds a model of the task constraints by observing the user demonstrating a manipulation skill using different initial positions of the objects. After each demonstration, the robot reproduces a generalized version of the task by probabilistically combining the various extracted constraints. Watching the robot reproduce the task after each demonstration helps the teacher evaluate the reproduction and generalization capabilities of the robot. The teacher can thus detect the portions of the task that require refinements or that have not been correctly understood by the robot, helping the teacher prepare further demonstrations. The model is refined incrementally after each demonstration, and the user decides to stop the interaction when the robot has correctly learned the skill.

By extracting the variations and correlations through GMR (Section 2.5), we detect at each time step along the trajectory what the relevant variables composing the task are and how the different variables are correlated. For some tasks,


relevant and irrelevant variables are clearly separated (Alissandrakis, Nehaniv, Dautenhahn, & Saunders, 2005), defined a priori, or are constant throughout the task. Here, we consider the most general case where different levels of constraints are allowed, which can freely change during the skill. We believe that representing the task constraints in a binary manner (relevant versus irrelevant features) is not appropriate for continuous movements. Indeed, some goals require different precisions, that is, they can be described with different degrees of invariance. For example, the movement used to drop a piece of sugar in a tiny cup of coffee is more constrained than the movement to drop a bouillon cube in a large pan.

Trajectory constraints are defined for each object considered in the scenario, which allows the system to model simultaneously gestures and actions on objects, using the direct computation method described in Section 2.6.1. This probabilistic representation of the trajectory constraints on each object is then used to reproduce the task in new conditions, i.e., with new positions of objects that have not been used to demonstrate the skill. Extracting the constraints of a movement is important to determine which parts are relevant, which ones allow variability, and what kind of correlations there are among the different variables. Similarly to human teaching, an efficient human-robot teaching process should thus encourage variability in the different demonstrations provided to the robot, e.g., varied practice conditions, varied demonstrators or varied exposures. Thus, extracting not only a generalized movement from the demonstrations, but also variability and correlation information, may allow the robot to use its experience in changing environmental conditions, i.e., where the task demands can change during the skill. Using GMMs, this can be set up in an adaptive way without drastically increasing the complexity of the system when new experiences are provided.
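As a rough stand-in for such an adaptive update (the actual direct update method is described in Section 2.8.3), a Gaussian constraint can be refined online from each new datapoint with Welford-style running statistics, without storing historical data; everything below is an illustrative assumption, not the thesis implementation:

```python
import numpy as np

class IncrementalGaussian:
    """Running estimate of a Gaussian constraint, refined with each
    new observation without keeping the past data in memory."""
    def __init__(self, dim):
        self.n, self.mu, self.cov = 0, np.zeros(dim), np.eye(dim)

    def update(self, x):
        # Welford's online update of the sample mean and covariance.
        self.n += 1
        delta = x - self.mu
        self.mu += delta / self.n
        if self.n > 1:
            self.cov = ((self.n - 2) * self.cov
                        + np.outer(delta, x - self.mu)) / (self.n - 1)

rng = np.random.default_rng(3)
g = IncrementalGaussian(2)
for _ in range(200):
    g.update(rng.normal([1.0, -1.0], 0.1))
print(np.round(g.mu, 1))  # converges to the underlying mean [1, -1]
```

The same idea, applied per Gaussian component with responsibility weights, keeps the model size constant while new demonstrations arrive.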

The experiments are conducted using the HOAP-3 robot with 28 degrees of freedom (DOFs), as presented in Section 3.2, of which only the 16 DOFs of the upper torso are required in the experiments. Two webcams in its head are used to track objects in 3D Cartesian space based on color information (Section 3.3). The objects to track are predefined in a calibration phase. Alternatively, the initial positions of the objects can also be recorded by a moulding process where the teacher grabs the robot's arm, moves it toward the object and puts the robot's palm around the object. When the object touches its palm, the robot feels the object by using a force sensor. It then briefly grasps and releases the object while registering its position in 3D Cartesian space (Section 3.2).

Two different modalities are used to convey the demonstrations (Figure 5.27). First we use motion sensors attached to different body parts of the user (Section 3.4). The user's movements are recorded by 8 X-Sens motion sensors attached to the torso, upper-arms, lower-arms, hands (at the level of the fingers) and back of the head. We then use the motor encoders of the robot to record information while the teacher moves the robot's arms. The teacher selects which motors to control manually by slightly moving the corresponding motors just a few milliseconds before the reproduction starts. The selected motors are set to passive mode, which allows the user to freely move the corresponding degrees of freedom while the robot executes the task.

Figure 5.29: Illustration of the use of motion sensors to acquire gestures from a human model. A simulation of the robot is projected behind the user to show how the robot collects the joint angles during the process. The gesture is similar to the one used in Experiment 1, except that in these snapshots the user does not face the robot and only mimics the grasping of the object.

We present three experiments to show that the method is generic and can be used with datasets representing different modalities. For the first experiment, the robot collects the joint angle trajectories of the two arms of the user, which are the lowest-level data that the robot can collect. We show with this experiment that the system can efficiently encode high-dimensional data by projecting them into a latent space of motion. For the second experiment, the robot collects the 3D Cartesian path of the right hand (using direct kinematics). The aim is to show that manipulation skills can be learned by the system and generalized to various initial positions of objects. The third experiment extends this manipulation skill to the use of different objects, whose specific properties are learned through demonstrations. The first two experiments were presented in Calinon and Billard (2007d) and the last experiment was presented in Calinon and Billard (2007a).

5.4.2 Experiment 1: learning bimanual gestures

This experiment shows how a bimanual skill can be taught incrementally to the robot in joint angle space using observation and scaffolding. The task consists of grasping and moving a large foam die (Figures 5.27 and 5.29). Starting from a rest posture, the left arm is moved first to touch the left side of the die, with the head following the motion of the hand. Then a symmetrical gesture is performed with the right arm. When both hands grasp the object, it is lifted and pulled back on its base, with the robot's head turned toward the object (Figure 5.29).

Figure 5.30: Reproduction of the task after the first, third and sixth demonstration (panels: first attempt, third attempt, sixth attempt). In the first attempt, the robot hits the die when approaching it. In the third attempt, the robot's skill gets better but the grasp is still unstable. In the sixth attempt, the die is correctly grasped. The trajectories of the hands are plotted in the first row, and the corresponding snapshots of the reproduction attempts are represented in the second row.

The user wearing the motion sensors performs the first demonstration of the complete task, which is rescaled to T = 100 datapoints. This allows him/her to demonstrate the full gesture by controlling 16 joint angles simultaneously. These joint angles are then projected onto a subspace of lower dimensionality using PCA. After observation, the robot reproduces a first generalized version of the motion. This motion is then refined by kinesthetically helping the robot perform the gesture, that is, by physically moving its limbs during the reproduction attempt. The gesture is refined partially by guiding the desired DOFs while the robot controls the remaining DOFs. With this method, the teacher can only move a limited subset of DOFs using his or her two arms. This means that the user can move the two shoulders and the two elbows of the robot simultaneously, but the remaining DOFs (head, wrists and hands) are controlled by the robot. The user first selects the DOFs to be controlled by moving the corresponding joints. The robot detects the motion and sets the corresponding DOFs to passive mode. Then, the robot reproduces the movement while recording the movement of the limbs controlled by the user.

Results of the experiment are presented in Figures 5.30 and 5.31. We see

that the resulting paths of the hands are similar to the ones demonstrated

by the user (see Figure 5.29), even if training is performed in a subspace of

motion where joint angle trajectories have been projected. The system finds five

principal components and five Gaussian components to efficiently represent the

148

Page 161: lasa.epfl.chlasa.epfl.ch/publications/uploadedFiles/EPFL_TH3814.pdf · POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES PAR ingénieur en microtechnique diplômé EPF de nationalité

Figure 5.31: Snapshots of the sixth reproduction attempt.

Figure 5.32: Reproduction of actions on objects. The purpose of the task is to grasp the red cylinder and to bring it on top of the yellow cube. The first snapshot shows the 3D Cartesian frame of reference used in the experiment.

trajectories in this latent space. After the first demonstration of the movement

while wearing motion sensors, the robot can only reproduce a smoothed version

of the joint angles produced by the user. Because the user’s and robot’s bodies

differ (the robot is smaller than the user, but the size of the die does not change),

the robot approaches the die with its hands too close to grasp it. When trying

to reproduce the skill, the robot hits the die by moving its left hand first,

making the die fall before moving its right hand. Observing this, the teacher

progressively refines the model by providing appropriate scaffolds, that is, by

controlling the shoulders and the elbows of the robot while reproducing the

skill so that it may grasp the die correctly. In the third reproduction attempt,

the robot lifts the die awkwardly. In the sixth attempt, the robot skillfully

reproduces the task by itself (Figure 5.31). Therefore, the user decides to stop

the teaching process.

5.4.3 Experiment 2: learning to stack objects

This experiment shows how the system can learn incrementally manipulation

skills in 3D Cartesian space. Through the teacher’s support, the robot extracts


and combines constraints related to different objects in the environment. The

aim of the task is to grasp a red cylinder with the right hand and place it on a yel-

low cube (Figure 5.32). This experiment also shows that teaching manipulation

skills to the robot can be achieved through a scaffolding process, where the user

gradually highlights the affordances of different objects in its environment and

the effectivities of the body required to perform grasping and moving actions on

these objects, drawing insights from experiments in developmental psychology

where a caregiver teaches manipulation skills to a child (Zukow-Goldring, 2004).

In the proposed setup, the robot learns simultaneously the affordances of the

two objects (i.e., the red cylinder has the property that it can be stacked on

the yellow cube) and the associated effectivities (i.e., how the robot should use

its body to grasp and bring the cylinder on top of the cube without hitting any

object).

The specific constraints related to the two objects are extracted by vary-

ing the demonstrations provided to the robot, that is, by starting with differ-

ent initial positions of the objects. Thus, the efficiency of the reproduction

mainly relies on the ability of the teacher to provide appropriate scaffolds when

demonstrating the task by using sufficient and appropriate variability across the

demonstrations. By doing so, the system is able to generalize and reproduce

the skill in new situations that have not been used to demonstrate the task.

After each demonstration, the robot tries to reproduce the skill in two different

situations. This helps the teacher evaluate how well the robot can generalize in

new situations.

The system simultaneously learns the 3D Cartesian trajectories relative to

the different objects in the environment and the actions used to move these ob-

jects, as suggested in Section 2.6.1. The robot’s arm is kinematically redundant,

as discussed throughout the experiment in Section 5.3. We use in this experi-

ment a much simpler strategy to deal with the redundancies in the joint angle

configuration and hand path, one sharing similarities with the strategy

described in Shimizu, Yoon, and Kitagaki (2007). We define a posture by the

position of the right hand in a 3D Cartesian space and by an additional angular

parameter α (Figure 5.34) defining the elevation of the elbow (angle formed by

the elbow and a vertical plane). For each demonstration, the hand path and

a trajectory of angle α fully describe the gesture. A generalized version of the

hand paths and of the α trajectories are then used to compute (by geometrical

inverse kinematics) the joint angle trajectories required to control the robot.

The advantage of this method over other inverse kinematics algorithms is that a

constraint based on the observation of the user performing the skill is used as

an optimization factor, which produces movements that look natural.
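A geometric inverse kinematics of this kind can be sketched as follows: the elbow of a two-link arm is constrained to a circle around the shoulder-hand axis, and the elevation angle α selects a point on that circle, measured here from the vertical plane. The link lengths, frames and function names are illustrative assumptions, not the robot's actual implementation:

```python
import numpy as np

def elbow_position(shoulder, hand, l_upper, l_fore, alpha):
    """Place the elbow of a 2-link arm: it lies on a circle whose axis is the
    shoulder-hand line; alpha selects the point on that circle, measured from
    the direction closest to vertical (alpha = 0: arm in a vertical plane)."""
    shoulder, hand = np.asarray(shoulder, float), np.asarray(hand, float)
    u = hand - shoulder
    d = np.linalg.norm(u)
    assert abs(l_upper - l_fore) < d < l_upper + l_fore, "target unreachable"
    u /= d
    # distance from shoulder to circle centre along u, and circle radius
    a = (l_upper**2 - l_fore**2 + d**2) / (2.0 * d)
    r = np.sqrt(l_upper**2 - a**2)
    centre = shoulder + a * u
    # basis of the circle plane; n1 is the projection of the vertical onto it
    v = np.array([0.0, 0.0, 1.0])
    n1 = v - np.dot(v, u) * u
    n1 /= np.linalg.norm(n1)
    n2 = np.cross(u, n1)
    return centre + r * (np.cos(alpha) * n1 + np.sin(alpha) * n2)

# Example: shoulder at origin, hand ~29 cm away, 20 cm links, elbow lowered.
elbow = elbow_position([0, 0, 0], [0.25, 0.10, -0.10], 0.20, 0.20, alpha=-0.4)
```

Given the elbow position, the shoulder and elbow joint angles follow from simple vector geometry, which is why a generalized α trajectory suffices to disambiguate the redundant arm posture.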

There are n = 6 demonstrations of the task where each trajectory is rescaled

to T = 100 time steps (after six demonstrations, the user estimated that the

robot understood the task correctly). The total number of observations is thus

given by N = nT. For each demonstration i ∈ {1, . . . , n}, the collected data


consists of the initial positions of the M objects {o^(h)_i}_{h=1}^M and of the set of variables {x_{i,j}, α_{i,j}}_{j=1}^T corresponding to the absolute right hand paths and to the evolution of the parametric angle α. After each demonstration, the trajectories of the right hand relative to the M different objects are computed as

x^(h)_{i,j} = x_{i,j} − o^(h)_i ,   ∀h ∈ {1, . . . , M}, ∀i ∈ {1, . . . , n}, ∀j ∈ {1, . . . , T}.

The constraints of the task are extracted from {x^(1), . . . , x^(M), α}, to which encoding and generalization are applied separately. By using two objects, a generalized version of the trajectories is thus given by {x^(1), x^(2), α} and the associated covariance matrices {Σ^(1), Σ^(2), Σ^α}.
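The encoding and generalization step can be sketched as follows: a GMM is fit on the joint variable (t, x), and Gaussian Mixture Regression conditions it on time to recover the generalized trajectory and its covariance envelope. This is a minimal one-dimensional sketch; the synthetic data and the number of components are illustrative, not those of the experiment:

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gmr(gmm, t_query):
    """Gaussian Mixture Regression: condition a GMM over (t, x) on t,
    returning the mean and variance of x | t for each query time."""
    means, covs, w = gmm.means_, gmm.covariances_, gmm.weights_
    K = len(w)
    xs, vs = [], []
    for t in t_query:
        # responsibilities of each component for this t (time marginal)
        h = np.array([w[k] * norm.pdf(t, means[k, 0], np.sqrt(covs[k, 0, 0]))
                      for k in range(K)])
        h /= h.sum()
        # per-component conditional mean and variance of x given t
        mu = np.array([means[k, 1] + covs[k, 1, 0] / covs[k, 0, 0]
                       * (t - means[k, 0]) for k in range(K)])
        var = np.array([covs[k, 1, 1] - covs[k, 1, 0]**2 / covs[k, 0, 0]
                        for k in range(K)])
        xs.append(np.dot(h, mu))
        vs.append(np.dot(h, var + mu**2) - np.dot(h, mu)**2)
    return np.array(xs), np.array(vs)

# Synthetic "demonstrations": x follows 2t with small noise, n = 6, T = 100.
rng = np.random.default_rng(1)
t = np.tile(np.linspace(0, 1, 100), 6)
x = 2 * t + 0.01 * rng.normal(size=t.size)
gmm = GaussianMixture(n_components=4, random_state=0).fit(np.column_stack([t, x]))
xhat, vhat = gmr(gmm, np.linspace(0, 1, 100))
```

`xhat` plays the role of the generalized trajectory and `vhat` the role of the constraint envelope: a small variance marks a highly constrained part of the task.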

Results for encoding, generalization and extraction of constraints are pre-

sented in Figure 5.33 (see also Figure 5.32 for the x1, x2, x3 directions). For the

first object, we see that the trajectory x(1) is highly constrained when grasp-

ing the object between time steps 20 and 40 (a tight envelope representing the

constraints around the generalized trajectory). The constraints are tighter for

the first two variables x^(1)_1 and x^(1)_2, defining the movement with respect to the

surface of the table. This is consistent with the shape of the object to grasp (a

cylinder placed vertically on the table). Indeed, the form and orientation of the

object enable it to be grasped with more variability on the third axis x3, defining

the vertical movement. When observing the constraints associated with x^(1)_3, we also see that between time steps 40 and 70, the generalized trajectory of the hand holding the object follows a bell-shaped path, which is consistent with

the demonstrations provided. For the second object, we see that the trajectory

x(2) is highly constrained at the end of the task, when placing the cylinder on

top of the cube.

The reproduction of the task in two new situations is presented in Figure

5.36. After the first demonstration, the system is not able to generalize the

skill (a similar motion is used for the two reproduction conditions). After the

third demonstration, the constraints are already roughly extracted. We see that

the robot correctly grasps the cylinder, but still has some difficulty in placing

it on the cube. By watching the robot reproduce the skill in the two different

situations, the teacher detects at the third reproduction attempt that the second

part of the task is not fully understood. The teacher then provides appropriate

scaffolds by kinesthetically demonstrating the task with an increased variability

in the initial positions of the cube. Thus, the constraint of placing the object

on top of the cube becomes more salient, and at the sixth attempt, the different

constraints are finally fulfilled for the different reproduction conditions; namely,

grabbing the cylinder, moving it with an appropriate bell-shaped movement,

and placing it on top of the cube.


Figure 5.33: Generalization and extraction of the constraints for the Cartesian trajectories relative to the two objects, after six demonstrations of the skill. For each object: left, the six demonstrations observed consecutively; middle, the Gaussian Mixture Model (GMM) of four components used to incrementally build the model; right, a representation of the generalized trajectories and associated constraints extracted by Gaussian Mixture Regression (GMR).


Figure 5.34: Left and middle: Angle parameter α used to define the gesture when using a geometrical inverse kinematics algorithm. Right: Extraction of the generalized trajectory (thick line) and associated constraints (shaded envelope) from six observations (thin line) of angle parameter α.

Figure 5.35: Reproduction of the task for 2 different situations and after 6 demonstrations of the skill. Even if the initial positions of the two objects have not been used in the demonstrations, the robot is able to generalize and fulfill the task requirements, namely grasping the cylinder and bringing it on top of the cube.


Figure 5.36: Reproduction of the task after the first, third, and sixth demonstrations for two new initial positions of the objects. A solid line represents the hand path with a point marking the starting position. A cross and a plus sign represent the cylinder and the cube, respectively.

Figure 5.37: Experimental setup. Left: kinesthetic demonstration. Right: reproduction and frame of reference.

5.4.4 Experiment 3: learning to move chess pieces

This experiment uses a chess game to show how a robot can learn manipulation

skills in 3D Cartesian space. The chess paradigm was also used by Alissandrakis

et al. (2002) to explore the correspondence problem in imitation learning, that

is, to know how to reproduce a particular motion on the chessboard by consid-

ering different chess pieces (different embodiments).

Through the teacher’s support, the robot extracts and combines constraints

related to different objects in the environment. The setup of the experiment is

presented in Figure 5.37. Six consecutive kinesthetic demonstrations are pro-

vided to show how to grasp a Rook, a Bishop or a Knight from a chessboard

and how to bring the grasped chess piece to the opponent's King; see

Figures 5.38, 5.39 and 5.40. Each demonstration is rescaled to T = 100 datapoints.

Figure 5.38: Trajectories for the 6 consecutive kinesthetic demonstrations for the Rook (only x1 and x2 are represented, corresponding to the plane of the chessboard). The cross and the plus sign represent respectively the Rook and the King.

Figure 5.39: Trajectories for the 6 consecutive kinesthetic demonstrations for the Bishop (only x1 and x2 are represented, corresponding to the plane of the chessboard). The cross and the plus sign represent respectively the Bishop and the King.

Figure 5.40: Trajectories for the 6 consecutive kinesthetic demonstrations for the Knight (only x1 and x2 are represented, corresponding to the plane of the chessboard). The cross and the plus sign represent respectively the Knight and the King.

Three different models are created for the Rook, the Bishop and the

Knight. The setup is simplified by using only two chess pieces at the same time

(attentional scaffolding can be used to let the robot recognize only these two

chess pieces), and we do not take into account the generalization across different

frames of reference. This means that, in our setup, the Rook is moved only in a

forward linear direction, the Bishop is moved only in a forward-left diagonal

direction, and the Knight is moved only two squares forward followed by one

square to the left, forming an inverse "L" shape.7

Similarly to the previous experiment, the robot learns simultaneously the

affordances of the different objects (i.e., the Rook/Bishop/Knight are moved

differently to be brought to the King) and the associated effectivities (i.e., how

the robot should use its body to grasp and bring the Rook/Bishop/Knight to

the King without hitting any object). After each demonstration, the robot tries

to reproduce the skill with the chess pieces placed in a new situation that has

not been observed during the demonstrations. Thus, the user can test the ability

of the robot to generalize the skill over different situations.

The constraints extracted after the sixth demonstration are presented in

Figures 5.41, 5.42 and 5.43, showing the displacement constraints with respect

to the first object (respectively the Rook, the Bishop and the Knight) and the

second object (the King). For the Rook, we see in Figure 5.41 that the trajectory

is highly constrained for x^(1)_1 from time step 30 while the hand grasps the Rook,

7. Note that the basic rules remain the same and that the directions only depend on the frame of reference considered.


Figure 5.41: Extraction of the constraints relative to the Rook (1st object) and to the King (2nd object). In each cell, the first column represents the right hand paths collected in a 3D Cartesian space, the second column represents the GMM encoding of the data, and the last column represents the GMR used to extract a generalized trajectory and associated constraints. The parts that show a thin envelope around the generalized trajectory are locally highly constrained, while the parts presenting a large envelope allow a loose reproduction of the trajectory.


Figure 5.42: Extraction of the constraints relative to the Bishop (1st object) and to the King (2nd object). In each cell, the first column represents the right hand paths collected in a 3D Cartesian space, the second column represents the GMM encoding of the data, and the last column represents the GMR used to extract a generalized trajectory and associated constraints. The parts that show a thin envelope around the generalized trajectory are locally highly constrained, while the parts presenting a large envelope allow a loose reproduction of the trajectory.


Figure 5.43: Extraction of the constraints relative to the Knight (1st object) and to the King (2nd object). In each cell, the first column represents the right hand paths collected in a 3D Cartesian space, the second column represents the GMM encoding of the data, and the last column represents the GMR used to extract a generalized trajectory and associated constraints. The parts that show a thin envelope around the generalized trajectory are locally highly constrained, while the parts presenting a large envelope allow a loose reproduction of the trajectory.


Figure 5.44: Extraction of the constraints on the gesture (modeled by angle parameter α) used to move the 3 different chess pieces.

Figure 5.45: Reproduction of the task for the Rook, the Bishop and the Knight after observation of the first, third and sixth demonstration. The cross and the plus sign represent respectively the chess piece to grasp and where the grasped piece must be brought.


i.e., the hand-object relationship allows only low variability during this part

of the skill. The Rook is then moved in a straight line, i.e., the direction in

x1 remains constant after grasping of the Rook. However, its final position

can change in amplitude, i.e., the Rook is moved along a straight line but its

final position on this straight line can vary, which is reflected by the constraints

extracted for x^(1)_2 (larger envelope after time step 70). For the Bishop, we see in Figure 5.42 that the generalized trajectories (relative to Object 1) follow

a diagonal. The direction is highly constrained but the final position is not.

For the Knight, we see in Figure 5.43 that the generalized trajectories (relative

to Object 1) are more constrained. Indeed, for a given initial position of the

Knight, only one final position is allowed in the proposed setup. This is reflected

by the constraints for x^(1)_1 and x^(1)_2 (and, complementarily, for x^(2)_1 and x^(2)_2), where

the path with respect to the initial position of the Knight is highly constrained

(the followed path is quasi invariant across all demonstrations). For the three

chess pieces, the constraints for the vertical axis x^(1)_3 share similarities, showing

that the user grasps the chess piece from above, and displaces the chess piece

following a bell-shaped trajectory in a vertical plane. We observe that for each

chess piece the constraints for the first object are correlated with the constraints

for the second object. Indeed, the positions of the two objects have important

dependencies for the skill (it is important to reach the King of the opponent

with the chess piece).

By considering the constraints relative to the two objects, the reproduction

of the hand path is then computed for new initial positions of the objects. To do

so, the absolute constraint for each object is computed by adding the new initial

position of each object to the corresponding relative constraint (represented as

a varying mean and associated covariance matrix along the motion). The hand

path used for reproduction is then computed by multiplying at each time step

the two Gaussian distributions characterizing the absolute constraints for the

two objects. This hand path is converted to joint angles by using a geomet-

rical inverse kinematics algorithm, parameterized by the generalized version of

the angle α between the elbow and a vertical plane, which makes it possible to reproduce natural-looking gestures by providing an additional constraint to the redundant inverse kinematics problem. Figure 5.44 shows the generalized trajectories

and associated constraints on the angle parameter α (see also Figure 5.34). For

the three chess pieces, we see that the gestures used to reach for the chess piece share similarities. Indeed, the α trajectories start with a negative value and progressively converge to zero. This means that the elbow is first elevated away from the body and is progressively lowered, approaching the body until the arm and

forearm are almost in a vertical plane. This allows the user (and the robot) to

approach the chess piece carefully without hitting the other chess piece. Indeed,

when experiencing the skill together with the robot through a scaffolding pro-

cess, the user quickly notices that when the robot is close to the chess piece, its

elbow has to be lowered to grasp the chess piece correctly. This is mainly due to


the missing DOF at the level of the wrist, which constrains the grasping posture

of HOAP-3 to a value of angle α close to zero. When the chess piece is grasped,

we see that two different movement strategies are adopted depending on the

path to follow. As the Rook is moved forward, the angular configuration does

not need to be changed after grasping. As the Bishop is moved in a diagonal

(forward-left), the user helps the robot adopt a correct posture to avoid hitting

its own body when performing the move (i.e., learning effectivities). This is

reflected by the decreasing negative value of angle α. A similar strategy is em-

ployed for the Knight, but the amplitude of change in angle α is lower due to the

shorter path followed by the Knight. We also see that there is a slight tendency

to first keep the arm in a vertical plane (α close to zero), and to finally use a

posture with a slight elevation of the elbow (negative α value). This behaviour

is probably due to the inadvertent decomposition of the "L" shape by the user when kinesthetically helping the robot displace the Knight.
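The reproduction step described above, which translates each relative constraint by the object's new position and multiplies the resulting Gaussian distributions at each time step, can be sketched as follows. This is a minimal two-dimensional illustration with hand-picked numbers; in the experiment, the means and covariances come from GMR and the space is 3D:

```python
import numpy as np

def fuse_constraints(mu1, cov1, mu2, cov2):
    """Product of two Gaussian constraints: the result has precision equal to
    the sum of the precisions, and a precision-weighted mean, pulling the hand
    toward whichever constraint is tighter at this time step."""
    p1, p2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    cov = np.linalg.inv(p1 + p2)
    mu = cov @ (p1 @ mu1 + p2 @ mu2)
    return mu, cov

# One time step.  The constraint relative to object 1 is tight (the hand must
# reach the piece); the constraint relative to object 2 is loose.
o1, o2 = np.array([0.2, 0.1]), np.array([0.5, 0.4])  # new object positions
mu1 = o1 + np.array([0.00, 0.05])    # absolute constraint = position + relative
mu2 = o2 + np.array([-0.25, -0.20])
cov1 = np.diag([1e-4, 1e-4])         # tight constraint
cov2 = np.diag([1e-1, 1e-1])         # loose constraint
mu, cov = fuse_constraints(mu1, cov1, mu2, cov2)
```

Because the tight constraint dominates, the fused mean lands essentially on the first constraint; repeating this at every time step yields the hand path used for reproduction.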

Finally, Figure 5.45 shows different reproduction attempts for initial configurations that have not been observed by the robot during the teaching process.

We see that the robot's ability increases, i.e., each demonstration helps the robot refine its model of the skill and its ability to generalize across different

situations.

Through these experiments, we have shown the importance of designing

an RbD framework not only by considering the learning system itself but also the complete interaction scenario. We have used an incremental and interactive

RbD framework in three experiments to teach manipulation skills to a humanoid

robot, and have emphasized the active role of the user in the teaching process

by putting him/her in the loop of the robot’s learning.


6 Discussion and further work

6.1 Outline of the chapter

We present in this chapter various discussions about the proposed RbD frame-

work, as well as collaborative work and further work. The chapter is organized

as follows:

• Section 6.2 discusses the advantages of considering a teaching scenario

that puts the user in the loop of the robot’s learning. Several insights

concerning social learning from different fields of research are presented

in this section, used as hints to design the teaching scenario presented in

Section 5.4.

• Section 6.3 discusses the use of motion sensors to represent human ges-

tures, where the advantages and drawbacks are presented in Section 6.3.1.

Section 6.3.2 discusses the different ways of decomposing the rotation ma-

trices provided by the motion sensors into joint angle trajectories that can

be used by the robot.

• Section 6.4 presents the advantages of the HMM representation (previously

used in our RbD system) for imitation learning, and the similarities shared

by this probabilistic representation with other frameworks of imitation.

• Section 6.5 highlights the similarities that the GMR approach shares with

other regression techniques.

• Section 6.6 discusses the limitations and failures of the current system.

Section 6.6.1 shows how PCA can lose important information by reduc-

ing the dimensionality inadequately for some specific tasks. Section 6.6.2

shows how the incremental learning process can fail to update the GMM parameters correctly when the trajectories are too inconsistent with the model. Section 6.6.3 shows the problems of generalizing a skill to a situation that is too different from the situations encountered during

the demonstrations.

• Section 6.7 presents the work in collaboration with other researchers em-

ploying the RbD framework developed in this thesis.

• Section 6.8 finally suggests directions of research for the further work

following this thesis.


6.2 Roles of an active teaching scenario

In this section, we discuss how the active and incremental teaching methods

used in the scenario presented in Section 5.4 have been designed, and how

RbD benefits from this approach. We also discuss how the pedagogical skills

of the user can contribute to an efficient transfer of the skill. In RbD, even if

the efficiency of the learning system is highly relevant, we highlight here the

importance of building a teaching scenario that helps the user as much as possible to provide useful data for the robot.

To design efficient teaching systems and appropriate benchmarks, we empha-

size the importance of taking psychological and social factors into consideration.

Even if the learning efficiency can be measured quantitatively as presented in

Section 2.7.1, that is, by measuring how well the system reproduces the skill and

generalizes it with respect to a specific dataset, the performance also directly

depends on the quality of the dataset, and inherently on the teaching abilities

of the user. It is thus important to consider the skill transfer process not only

at the algorithmic level but also at the user level, by designing an HRI scenario

that makes an efficient use of the teaching abilities of the user. To do so, we

present several insights that guided us to the current setup, from various fields of

research covering psychology, pedagogy, developmental sciences, sociology and

sports science.

6.2.1 Insights from psychology

Considering the robot as a peer learner and watching the evolution of its under-

standing of the skill are important psychological factors for a successful interac-

tion. By drawing parallels with a caregiver-child interaction (and the associated

human social aptitude at transmitting culture), we see that by watching the evo-

lution and the outcomes of teaching, the teacher can feel psychologically more

involved in the teaching teamwork. In psychology, benchmarks such as autonomy, moral accountability and reciprocity have been proposed to evaluate

HRI setups (Kahn et al., 2006). For teaching applications, the robot’s capacity

to generalize over different contexts depends on the number of demonstrations

provided to the robot, but more importantly on the pedagogical quality of these

demonstrations (gradual variability of the situations and exaggerations of the

key features to reproduce). To succeed, it is therefore crucial that the teacher

feels involved in the teamwork.

Indeed, an essential feature of human beings is our natural ability and desire

to transmit skills to the next generation. This transfer problem is complex and

involves a combination of social mechanisms such as speech, gestures, imitation,

observational learning, instructional scaffolding, as well as physical interaction

such as molding or kinesthetic teaching (Salter et al., 2006). The humanoid

shape of the robot and the humanlike properties of the teaching system aim at

helping the teacher consider the robot as a peer. As noted by MacDorman and


Page 177: lasa.epfl.chlasa.epfl.ch/publications/uploadedFiles/EPFL_TH3814.pdf · POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES PAR ingénieur en microtechnique diplômé EPF de nationalité

Cowley (2006), a humanoid robot is the best equipped platform for reciprocal

relationships with a human being. When considering a robot conforming to

humanlike appearance and learning behaviors, a psychological factor related to

altruism and reciprocity naturally appears. This may be particularly important

when considering teaching interactions. Creating a humanlike teaching scenario

could reinforce the user’s feeling that his or her role is to pass on knowledge

to the robot, just as another human might help him or her out in a similar

situation. In other words, the user would act as he or she would like to be

treated.

The active teaching process presented in Section 5.4 probably evokes some

of the feelings normally attributed to a caregiver-child dyad, in which the robot

would elicit a certain form of attention. Indeed, watching the increased under-

standing of the skill by progressively lowering the scaffolds may be an important

psychological factor for the teacher, where the user might feel self-esteem when

teaching the less-knowledgeable robot.

As discussed in the introduction, B. Lee et al. (2006) showed that behav-

ioral synchronization is a powerful communicative experience for HRI. Through

their experiments, the authors show that in human dyadic interactions, close

relationship partners are more engaged in joint attention to objects in the en-

vironments and to their respective body parts than strangers. Moreover, they

are more engaged in discovering and showing activities, sharing experience and

exploring the environment together. Indeed, by contrasting a humanoid robot

learning system with a computer language interpreting commands executed by

a programmer, we see that both processes involve a transfer of information to

the machine by incrementally watching the outcome of the transfer to help the

user refine his or her teaching strategy. However, in the computer language sit-

uation, the machine considered is more of a tool to process the user’s strategy

rather than a peer learner what the user might wish to instruct. In robotics,

Thomaz et al. (2005) presented HRI experiments in which the user can guide

the robot’s understanding of the objects in its environment. The interaction

creates a form of closeness where the user explores together with the robot its

environment. In our system, using a kinesthetic teaching approach also provides

such a close relationship with the robot by physically guiding the robot’s arms

to let it collect kinesthetic experiences.

6.2.2 Insights from pedagogy

Gergely and Csibra (2005) contrasted observational learning to pedagogy. The

authors noted that although animals such as chimpanzees can use tools to

achieve a goal, they tend to discard the tools after using them. Humans however

have developed the ability to recursively use tools (use of a tool to create an-

other tool), attaching functions to objects that do not directly involve outcomes.

This could have led to the development of pedagogical skills as powerful social


learning mechanisms to enable the transmission of not just observable behaviors

but also unobservable knowledge. Unlike observational learning, pedagogy re-

quires an active participation of the teacher which is achieved by a special type

of communication that aims at manifesting the relevant knowledge of a skill.

The teacher does not simply use his or her knowledge but engages in an activity

that benefits the learner. To highlight which features of a skill are relevant,

the teacher must first recognize the features by analyzing his or her knowledge

relative to the knowledge of the learner. Indeed, one does not need to be aware

of knowledge content to generate an appropriate behavior (e.g., riding a bike

may appear easy for a person, even when describing and teaching the skill to

another person is not obvious). These natural pedagogical skills could thus be

called upon to contribute to the design of teaching interaction scenarios such as

the ones presented in Section 5.4.

6.2.3 Insights from developmental sciences

Zukow-Goldring (2004) presented experiments in developmental psychology to

understand the methods used by caregivers to assist infants as they gradually

learn new skills by engaging in new activities. By analyzing how a child learns

manipulation and assembling skills (using pop beads), the authors show that

observation alone is not sufficient for the child to learn a new activity. The

caregiver first focuses the attention of the child to the affordances of the objects

(the possibility to assemble the two objects). Because of the lack of information

concerning the paths to follow to correctly assemble the objects, the child first

fails at reproducing the actions. The caregiver then shows what the child’s body

should do to assemble the objects with the required orientations and paths to

follow, demonstrating the effectivities required to assemble the objects. The

caregiver assists the child by partially demonstrating the action and lets the child

work out the skill progressively by herself. Thus, the caregiver's gestures gradually

provide perceptual information guiding the infant to perform the skill. The task

is first simplified and the child is then progressively put in various situations to

experience and generalize the skill. Embodying and putting the child through

the motions draws her attention to the coordination of affordances of the two

objects and effectivities of the body required to connect the two objects. Thus,

teaching is structured to let the child gather information on the characteristics

of the objects and actions specific to each of them.

The experiments presented in Section 5.4 are inspired from this teaching

process, and the different probabilistic components of the model naturally fit

with the representation of affordances and effectivities, which are learned si-

multaneously. In Zukow-Goldring (2004), by providing bodily experience, the

caregiver provides the infant the opportunity to see and feel the solutions to

the correspondence problem (i.e., detecting the match between self and other).

Similarly to the teaching process presented in these developmental psychology


experiments, our proposed human-robot teaching scenario starts with the robot

first observing the task performed by the user (through motion sensors). The

robot learner then begins to reproduce the task through the teacher’s assistance,

gradually performing parts of the task independently (by selecting the motors

to control manually, embodying the robot and putting it through the motion).

In our experiment, the affordances and effectivities respectively refer to the cor-

rect way of assembling the objects and the correct movements that the robot’s

body must adopt to manipulate the objects appropriately.

Rohlfing et al. (2006) also highlighted the importance of having multimodal

cues to reduce the complexity of human-robot skill transfer. In their work, they

consider multimodal information as an essential element to structure the demon-

strated tasks. Through experiments, the authors show that humans transfer

their knowledge in a social interaction by recognizing what current knowledge

the learner lacks. They are thus sensitive to the cognitive abilities of their in-

teraction partner. The authors then suggest taking insights from these studies

to reduce the learning complexity of current RbD frameworks; thus, sharing

human adaptability with the less knowledgeable becomes a central issue when

designing social robots. Therefore, they hypothesize that a human teacher can

also adapt naturally to a robot equipped with specific abilities. We adopted a

similar strategy in our learning framework and designed a skill transfer process

accordingly, i.e., that can benefit from the user’s capacity to adapt his or her

teaching strategies to the particular context.

6.2.4 Insights from sociology

Vygotsky (1978) introduced the zone of proximal development (ZPD) as a gen-

eral term to define the gap between what the learner already knows, namely, the

learner’s zone of current development (ZCD), and what the learner can acquire

through the teacher’s assistance (Wood & Wood, 1996). An efficient teaching

strategy consists of exploring and being familiar with the ZPD of the learner

to evaluate what the learner is able to acquire given his or her current ability.

This general paradigm can also be applied to human-robot interaction, where

the role of the teacher is to ascertain the current ZCD of the robot learner (by

testing the robot’s current understanding of the skill), and where its ZPD lies

(i.e., to ascertain what can be achieved by the robot with assistance). Search-

ing for what the robot already knows can appear time-consuming, but it may

also be an important psychological factor for the teacher, by helping him/her feel

involved in collaborative human-robot teamwork.

When designing a human-robot teaching scenario, it is thus important to

allow the teacher to acquire the robot’s ZPD through interaction to provide

individualized support to the robot. This is achieved by anticipating the prob-

lems that the robot might encounter, providing the appropriate scaffolds and

gradually dismantling these scaffolds as the learner progresses (and eventually


constructing further scaffolds for the next stage of learning). Similar to molding
behaviors, moving the robot kinesthetically in its own environment, as pre-

sented in Section 3.2, provides a social way of feeling the robot’s capacities and

limitations when interacting with the environment.

6.2.5 Insights from sports science

To transfer a skill between two human partners, different ways of performing

demonstrations can be used depending on the motor skill that must be trans-

ferred. Several methodologies have been investigated for skill acquisition in

sports with the aim of providing advice to sport coaches on how to transfer a

motor skill efficiently and how to measure success depending on the capacities

of individual athletes (Horn & Williams, 2004).

Coaching is the learning support aimed at improving the performance of the

learner to carry out the task by providing directions and feedback; it is highly

interactive and requires that one continuously analyzes the learner's performance.

To do so, the coach needs to be receptive to the learner’s current level of perfor-

mance. Different modalities are traditionally used by sport coaches to help the

athletes acquire the skill (Horn & Williams, 2004), where the visual observation

of a skilled model completing the entire task provides a good basis for move-

ment production. This also applies to other types of skills; see, for example,

training by expert surgeons (Custers, Regehr, McCulloch, Peniston, & Reznick,

1999). Similarly, in our setup, the principal aim of observing the performance

of an expert through motion sensors (Section 3.4) is to provide a complete and

temporally continuous demonstration of the skill. By using motion sensors, the

human expert can freely perform the task while the robot observes the full-body

motion. However, as noted by Hodges and Franks (2002), optimal movement

templates do not always generalize well across individuals. Thus, individually-

based templates may be more appropriate in refining and achieving consistency

in a skill. This supports the further use of kinesthetic learning to guide the

robot’s arms physically to let the robot experience the skill (Section 3.2).

This body of work also states that multiple exposures provide the oppor-

tunity to discern the structure of the modeled task. It allows the teacher to

organize and verify what the learner knows and to focus attention on problem-

atic aspects in subsequent exposures. This is achieved until a saturation point

is reached, determined by the coach, at which additional demonstrations of the

task do not yield learning benefits. Similarly, after each reproduction by the

robot, the user decides whether the robot has correctly acquired the skill or

whether further refinement is required. Several factors affect the amount of

additional benefit that can be derived from multiple observations of a model.

Among them is the isomorphism of the various demonstrations. Indeed, repe-

tition of identical demonstrations may be of limited utility, whereas extremely

diverse demonstrations may generate conflicts or confusion. To encourage flex-


ibility and adaptability, the coaches often manipulate task properties (e.g., by

using different situations) to provide appropriate variability depending on the

learner’s capacities. Similarly, in our teaching scenario, the user is instructed to

progressively displace the objects after each demonstration to provide variability

in the exposures of the skill.

The different insights presented throughout this section come from various

fields of research but highlight the importance of having a multimodal and

incremental learning system to help the robot experience various situations and

refine its skill gradually. This is exploited in our RbD framework by using

kinesthetic teaching, vision and motion sensors (Chapter 3) to demonstrate a

skill, and by using incremental learning algorithms to update a probabilistic

representation of the skill (Section 2.8.3).

6.3 Observational learning using motion

sensors and decomposition of the gesture

into joint angles

Modelling and generating a movement that looks similar to a movement pro-

duced by a human is not an easy task. People can very easily distinguish a

natural motion from an artificially generated motion, but it is difficult to ex-

press that in a mathematical form and find criteria to determine which motion

can be considered as human-like. An approach proposed by Gams and Lenarcic

(2006) is to consider minimum torque in the joints as a constraint, which implies
having a dynamical model of the arm. Another approach is to apply statistical

comparison with a large database of human poses. However, as the range of

tasks that can be achieved by using two arms is huge, and as humans move in

different manners depending on the task and environment, such an approach

usually requires task-specific criteria. We proposed in this thesis to follow a dif-

ferent strategy, by using a reduced number of gestures captured by a human user

to extract the key elements of a skill and by adapting the skill to new situations.

Thus, the system leverages the human's expertise to learn a task. As the user
has optimized his/her motion skills throughout life, we take the perspective that
he/she can provide essential hints to a kinematically redundant robot for the
reproduction of the skill, even if the reproduced movement cannot be exactly
the same as the ones demonstrated, due to dissimilar embodiments.

6.3.1 Advantages and drawbacks of motion sensors to

track gestures

Observational learning in RbD is usually performed using vision systems because

cameras are the interfaces technically most similar to the human sensory
system. However, vision is sometimes incorrectly considered to be the


only choice of sensory system for observation. In this thesis, we suggest using
motion sensors as an alternative to vision to observe the user's

gestures. The main advantage of motion sensors is that they usually provide

a more robust measure of the joint angles characterizing a body posture. The

main advantage of vision compared to motion sensors is that there is no need to

wear special devices (Shon et al., 2005).1 Vision can also provide a low-cost
solution to the tracking issue (as in our setup, where two webcams are used),
but a precise and complete visual motion capture system can cost as much as
or more than a set of motion sensors. The drawback of vision

is that occlusion and lighting issues must be taken into account by the tracking

algorithms, which is usually not the case for motion sensors. Indeed, motion

sensors provide a continuous measure of the body gesture without interruption.

Bluetooth communication also allows the user to move freely when performing

the demonstration. The problem is more complicated for vision systems because

the field of view (or the actuation of the head) must also be taken into account

by the tracking system. By using motion sensors, the measure of the joint angles

between two segments can be easily extended to different body parts, which is

often not the case for vision.2 Note however that the use of motion sensors is not

restrictive in our framework, as the algorithms can treat any kind of continuous

trajectories including vision data.

In Calinon and Billard (2006), we explored the use of motion sensors attached

to the body of the demonstrator to convey information about human body

gestures, and demonstrated that this method can offer an alternative to vision
systems for recording human motion data, allowing the capture of natural
movements with smooth velocity profiles characteristic of human motion.

Another way of demonstrating a gesture is to use the kinesthetic teaching

process presented in Section 3.2. Compared to the use of motion sensors at-

tached to the body of the demonstrator, the joint angle trajectories recorded by

kinesthetic teaching usually present sharper turns and an unnatural decorrela-

tion of the different DOFs. Indeed, when physically grasping the end-effectors

of the robot and displacing the different joints in such a way, the motors of the

arms tend to move sequentially in a decoupled way. We showed in the exper-

iments in Section 5.4 that by combining information from the motion sensors

and kinesthetic teaching, it was possible to generate natural-looking trajectories
and, at the same time, tackle the correspondence problem. Indeed, due to

the different embodiment between the user and the robot, it is not possible to

directly copy the joint angle trajectories demonstrated through motion sensors.

Efficiently transferring such gestures often requires refinement of the trajectories

with respect to the robot capabilities and with respect to its environment. On

the other hand, demonstrating a gesture only by kinesthetic teaching is limited

1Note that many vision systems also require wearing special patches to be tracked more easily by vision processing algorithms (Ude, Atkeson, & Riley, 2004; Shon et al., 2005).

2Vision usually requires dedicated tracking algorithms depending on the body parts, e.g., legs and arms are usually processed by different tracking algorithms (or tuned differently).


by the naturalness of the motion and by the number of limbs that the user can

control simultaneously. Combining both methods provides a social approach to

teaching the humanoid robot (Section 6.2). Thus, the use of motion sensors pro-

vides a model of the entire movement for the robot, while kinesthetic teaching

offers a way of refining the demonstrated movement. It adds a social compo-

nent to the interaction as the user helps the robot acquire the skill by physically

manipulating its arms. By doing so, he/she implicitly feels the characteristics

and limitations of the robot in its own environment.

6.3.2 Extraction of joint angle information from the

motion sensors

We now discuss the issues concerning the decomposition of the rotation
matrices provided by the motion sensors into joint angles, as presented in Section

3.4. Joints in a human arm are not ideal joints and do not hold a reference

location to the segments they connect. In most robotic systems, it is usually
assumed, for simplification, that centers of rotation exist despite the fact that

no joint in a human arm meets this criterion (Gams & Lenarcic, 2006). Our

model of the human arm is defined as a kinematic chain, where the glenohumeral

joint (or shoulder joint) connects the girdle and the upper arm (3 DOFs), the

elbow connects the upper arm and the forearm (1 DOF), the wrist connects the

forearm and the hand (1 DOF) and the fingers are connected to the hand (1

DOF).

Each X-Sens motion sensor provides an orientation matrix expressed in a

fixed frame of reference, which is a complete description of the orientation with

no gimbal locks.3 However, due to the redundancy of the representation, the

main disadvantages are: (1) the orientation is defined by a large number of

parameters (it requires 9 parameters); and (2) it is difficult to interpret the

data and to process them efficiently (there is no physical interpretation for

the parameters in the rotation matrix and it is not possible to handle velocity

directly). To analyze human motion data, the biomechanical approach consists

of defining a Joint Coordinate System (JCS) and an associated angle convention,

which makes it possible to analyze joint movements in terms of anatomical rotations.

Joint angles are defined as the orientation of one segment coordinate system

relative to another segment coordinate system. Thus, by defining a hierarchy of

the different joints, the joint angles can be processed in a similar way for each

body part, as presented in Section 3.4.

The rotation matrix can be constructed by applying 3 consecutive rotations

around specified axes. By considering different sequences of rotations, different
triplets of angles are extracted. The selected sequence of successive rotations

mainly depends on the analysis to perform. There are two typical types of

successive rotations: Eulerian and Cardanian. The Eulerian type involves rep-

3Configuration defined by Euler angles where one loses a dimension of rotation.


etition of rotations about one particular axis: XYX, XZX, YXY, YZY, ZXZ,

ZYZ. The Cardanian type is characterized by the rotations about all three axes:

XYZ, XZY, YZX, YXZ, ZXY, ZYX. In the experiments reported in this thesis,

we have suggested using an XYZ decomposition order, but this sequence of suc-

cessive rotations is only one of the 12 different combinations. Even though the

Cardanian type is different from the Eulerian type in terms of the combination

of rotations, they both use a very similar approach to compute the orientation

angles. We have seen in Section 3.4 that the Cardanic convention is defined by

3 consecutive rotations Rx(φ) → Ry(θ) → Rz(ψ) around the axes X, Y′ and Z′′,
where the rotation angles φ, θ, ψ respectively define the roll, pitch and yaw
angles. The Eulerian convention is in contrast defined by 3 consecutive rotations
Rz(φ) → Rx(θ) → Rz(ψ) around the axes Z, X′ and Z′′, where the rotation angles
φ, θ, ψ respectively define the precession, nutation and spin angles. The three
Cardanic/Eulerian joint angles can then be retrieved from a rotation matrix,
considering one of the 12 axis order conventions.
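To make this decomposition concrete, the following sketch recovers the XYZ Cardan (roll, pitch, yaw) triplet from a rotation matrix and rebuilds the matrix from the angles. It is an illustrative sketch under the intrinsic X-Y′-Z′′ convention, not the exact implementation used in this thesis, and the function names are ours.

```python
import numpy as np

def cardan_xyz_to_matrix(phi, theta, psi):
    """Compose R = Rx(phi) @ Ry(theta) @ Rz(psi) (intrinsic X-Y'-Z'')."""
    cx, sx = np.cos(phi), np.sin(phi)
    cy, sy = np.cos(theta), np.sin(theta)
    cz, sz = np.cos(psi), np.sin(psi)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def cardan_xyz_from_matrix(R, eps=1e-8):
    """Recover (roll, pitch, yaw) from R = Rx(phi) @ Ry(theta) @ Rz(psi)."""
    # For this composition, R[0, 2] = sin(theta).
    theta = np.arcsin(np.clip(R[0, 2], -1.0, 1.0))
    if np.cos(theta) > eps:
        phi = np.arctan2(-R[1, 2], R[2, 2])  # -R[1,2] = sin(phi)cos(theta)
        psi = np.arctan2(-R[0, 1], R[0, 0])  # -R[0,1] = cos(theta)sin(psi)
    else:
        # Gimbal lock: pitch = +/-90 deg, only phi +/- psi is observable.
        phi = np.arctan2(R[2, 1], R[1, 1])
        psi = 0.0
    return phi, theta, psi
```

A round trip such as `cardan_xyz_from_matrix(cardan_xyz_to_matrix(0.3, -0.5, 1.1))` recovers the original triplet away from the singularity at θ = ±90°.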

Unfortunately, for joints such as the shoulder, there is no single definition of

the joint angles that is anatomically meaningful for the full range of motion of

this joint. There is no standard sequence of rotations for describing the shoul-

der motion, and every representation suffers from gimbal lock.4 The choice of

convention order is thus important to place the singularity out of the motion

ranges typically used during the recording. To do so, the International Shoul-

der Group (ISG) recently published recommendations aimed at standardizing

the reporting of kinematic data to allow easier comparison of results among

researchers (Wu et al., 2005). This group recommended that the shoulder an-

gle (upper arm relative to torso) should be described by an Euler sequence

instead of the Cardanic representation flexion-extension, abduction-adduction

and humeral (or axial) rotation used in our setup. Indeed, the XYZ Cardan

sequence representation is usually not recommended for defining reaching or

throwing gestures (when the arm is abducted from the body), which are ges-

tures sharing similarities with the ones considered in our experiments. For gait

data in which the arm swings predominantly in flexion-extension, the XYZ Car-

dan sequence is however preferred. As the range of motion of the shoulder is

considerable, ISG also noted that there is no sequence of rotations for defining

the shoulder angle which is anatomically meaningful throughout the workspace

of the shoulder, and thus recommended to select the rotation sequence for the

task being analyzed. Despite their recommendation on the Eulerian sequence

to use for reaching movements, we nevertheless selected the Cardanic order convention

due to the XYZ joint hierarchy of the robots considered in the experiments.5

4This can be seen by observing the components of the matrix in Equation (3.1), expressed with cosines and sines, which leads one to expect non-unique solutions.

5Note that it would still be possible to decompose the rotation matrix into another joint angle sequence, perform the joint angle analysis using this decomposition order, and convert the computed joint angles back into another sequence order (by computing the rotation matrix), so that the motion can be run on the robot.
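The re-sequencing mentioned in footnote 5 can be illustrated as follows: compose the rotation matrix from one angle sequence, then re-extract the angles in another, here the ZXZ Euler sequence recommended by the ISG for the shoulder. This is an illustrative sketch with function names of our own choosing, valid away from the singularity sin(θ) = 0.

```python
import numpy as np

def euler_zxz_to_matrix(phi, theta, psi):
    """Compose R = Rz(phi) @ Rx(theta) @ Rz(psi) (intrinsic Z-X'-Z'')."""
    def rz(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    def rx(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    return rz(phi) @ rx(theta) @ rz(psi)

def euler_zxz_from_matrix(R):
    """Recover (precession, nutation, spin); valid when sin(theta) != 0."""
    theta = np.arccos(np.clip(R[2, 2], -1.0, 1.0))  # R[2,2] = cos(theta)
    phi = np.arctan2(R[0, 2], -R[1, 2])  # sin(phi)sin(theta), cos(phi)sin(theta)
    psi = np.arctan2(R[2, 0], R[2, 1])   # sin(theta)sin(psi), sin(theta)cos(psi)
    return phi, theta, psi
```

Any rotation expressed in one of the 12 conventions can thus be converted to another by going through the matrix, as suggested in the footnote.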


Figure 6.1: C.M. Heyes and E.D. Ray's Associative Sequence Learning (ASL) model. Schema inspired from Heyes (2001).

Figure 6.2: C.L. Nehaniv and K. Dautenhahn's algebraic framework to map the states, effects and actions of the demonstrator and the imitator. Schema inspired from Nehaniv and Dautenhahn (2002).

6.4 Advantages of the HMM representation for

imitation learning

In Calinon and Billard (2007c), we presented an implementation of an HMM-

based system (that we used prior to the development of our current GMM/GMR

framework) to encode, generalize, recognize and reproduce gestures, with repre-

sentation of the data in visual and motor coordinates. In this work, we described

how different models of imitation learning led us to the use of an HMM

representation of the data, and how it fitted the models of imitation proposed

by other groups or in other fields of research. Indeed, some of the features

contained in the HMM representation bear similarities with those of theoretical

models of imitation. By using HMM to encode gestures, we showed in this book

chapter that it was possible to keep the theoretical aspects and architectures of

these models, offering at the same time a probabilistic description of the data

that is more suitable for a real-world robotic application. We briefly summarize

here the models of imitation considered.

C.M. Heyes and E.D. Ray’s Associative Sequence Learning (ASL) mecha-

nism suggests that imitation requires a vertical association between a model’s

action, as viewed from the imitator’s point of view, and the corresponding im-

itator’s action (Heyes & Ray, 2000; Heyes, 2001). The vertical links between

the sensory representation of the observed task and the motor representation


are part of a repertoire, where elements can be added or refined. ASL sug-

gests that the links are created essentially by experience, with a concurrent

activation of sensory and motor representations. In their model, the mapping

between the sensory and motor representation can be associated with a higher

level representation (boxes depicted in Figure 6.1). Indeed, the model assumes

that data are correctly discretized, segmented and classified, similarly to the

imitation model in ethology proposed by Byrne (1999). In the ASL model, the

horizontal links model the successive activation of sensory inputs to learn the

skill, activating simultaneously the corresponding motor representation to copy

the observed behavior. Repetitive activation of the same sequence strengthens

the links and helps motor learning. Depending on the complexity of task orga-

nization, numerous demonstrations may be needed to provide sufficient data to

extract the regularities. The more data are available, the more evident is the

underlying structure of the task, clarifying which elements are essential, which

are optional, and which are variations in response to changing circumstances.

Any action that the imitator is able to perform can also be recognized by ob-

serving a model demonstrating the task. This imitation model stresses the need

for a probabilistic framework in a robotic application that can extract invariants

across multiple demonstrations. We see that such a model is in agreement with

an HMM decomposition of the task, where the underlying structure is learned by
using multiple demonstrations. If a given pattern appears frequently, its corresponding
transition weights are strengthened automatically by the Baum-Welch

learning algorithm presented in Section 2.8.2.
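To give an intuition of how frequent patterns strengthen transitions, the sketch below estimates a transition matrix from state sequences by counting. This is a simplification of Baum-Welch, which uses soft expected counts from the forward-backward pass rather than the hard counts used here; the function name is ours.

```python
import numpy as np

def transition_matrix_from_sequences(seqs, n_states):
    """Count-based estimate of HMM transition probabilities from state
    sequences; rows with no outgoing transitions default to uniform."""
    counts = np.zeros((n_states, n_states))
    for seq in seqs:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1  # repeated patterns accumulate here
    row = counts.sum(axis=1, keepdims=True)
    # Normalize each row; leave unvisited rows at the uniform prior.
    return np.divide(counts, row,
                     out=np.full_like(counts, 1.0 / n_states),
                     where=row > 0)
```

Sequences that repeat the same pattern (e.g., 0 → 1 → 2) drive the corresponding transition probabilities toward 1, mirroring how repeated demonstrations strengthen the links in the ASL model.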

Each hidden state in an HMM can output multimodal data. In Calinon

and Billard (2007c), we used this property to model multiple variables in dif-

ferent frames of reference. Thus, we encoded a skill by using motion sensors

to record joint angle data (motor representation) and hand paths tracked by a

stereoscopic vision system (visual representation). In this application, the role

of HMM was to make a link between the different datasets, where each state

can be considered as a label (or as a higher level representation) common to the

visual and motor data. Indeed, in an HMM, the sequence of states is not observed

directly and generates the visual or motor representation. Thus, similarly to the

ASL mechanism, the architecture of a multivariate HMM also has a horizontal

process to associate the elements in a sequential order (by learning transition

probabilities between the hidden states), and a vertical process to associate

each sensory representation to appropriate motor representation (which is done

through the hidden state). If data are missing from one part or the other (visual

or motor representation), it is still possible to recognize a task, and retrieve a

generalization of the task in the other representation.
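This bidirectional recovery can be sketched with a toy model in which each hidden state stores one mean per modality (the states, values, and nearest-mean decoding below are illustrative assumptions; the actual system would decode the state sequence with HMM inference rather than this shortcut):

```python
# Each "state" carries a mean for the visual and the motor modality.
states = [
    {"visual": 0.0, "motor": 10.0},
    {"visual": 1.0, "motor": 20.0},
    {"visual": 2.0, "motor": 30.0},
]

def decode(observations, modality):
    """Map each observation to the index of the nearest state mean
    (a shortcut standing in for Viterbi decoding in this toy example)."""
    return [min(range(len(states)),
                key=lambda k: abs(obs - states[k][modality]))
            for obs in observations]

def retrieve(state_seq, modality):
    """Generate the other modality's representation from the state sequence."""
    return [states[k][modality] for k in state_seq]

# Recognize from visual data alone, then retrieve the motor representation
visual_obs = [0.1, 0.9, 2.2]
motor_traj = retrieve(decode(visual_obs, "visual"), "motor")
```

The hidden state thus acts as the common label: recognition can start from either modality, and the missing representation is generated from the decoded state sequence.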

C.L. Nehaniv and K. Dautenhahn also suggested the use of a general alge-

braic framework to address the correspondence problem in both natural and

artificial systems (Nehaniv & Dautenhahn, 2002, 2001). They introduced in

their framework the notion of correspondence problem to refer to the problem of


creating an appropriate mapping between what is performed by the demonstra-

tor and what is reproduced by the imitator, where the correspondences can be

constructed at various levels of granularity, reflecting the choice of a sequence

of subgoals. Indeed, the two agents considered as a demonstrator and an im-

itator may not share the same embodiments (e.g., difference in limb lengths,

sensors or actuators). In their framework, the authors interpret the ASL model

by representing the behavior performed by the demonstrator and imitator as

two automata structures, with states and transitions. The states are the basic

elements segmented from the whole task, that can produce effects or responses

in the environment (e.g., object displaced), and where an action is the transi-

tion from one state to another state (Figure 6.2). An imitation process is then

defined as a partial mapping process between the demonstrator’s and imitator’s

states, effects, and actions. In their model, an observer (an external observer,

the demonstrator or the imitator) decides which of the states, actions or effects

are the most important ones to imitate, by fixing a metric of imitation. Different metrics are then used to yield solutions to different correspondence problems, and also allow the success of an imitation to be quantified formally.

This representation is closely related to the characteristics of HMM. In-

deed, HMM offers a direct extension of the automata depicted in this algebraic

framework, where the main advantage lies in its stochastic representation of the

transitions and observations, which is more suitable for encoding real-world data such as the data collected by the robot in an RbD framework. To draw links

with their algebraic model, the hidden states of an HMM output probabilisti-

cally distributed values that are similar to the notion of effects, while transitions

between hidden states (also probabilistically defined) are equivalent to actions.

Note that encoding data in HMM does not resolve the correspondence prob-

lem, but provides a suitable representation to treat the what-to-imitate and the

how-to-imitate issues in a common framework. An HMM framework thus of-

fers a stochastic method to model the process underlying gesture imitation, by

making a link between theoretical concepts and practical applications. In par-

ticular, it stresses the fact that the observed elements of a demonstration and

the organization of these elements should be stochastically described to have

a robust robot application that takes into account the high variability and the

discrepancies across demonstrator and imitator points of view.

The GMM/GMR framework also follows such a trend, if one considers the

sequential order of the different Gaussian distributions used to represent the

trajectories. The clear advantage over HMM is that a generalized version of

the skill and associated constraints can be continuously represented along the

motion (Section 2.5).


6.5 Similarities shared by GMR and other

regression methods

We have presented in Section 2.5 a method to retrieve a smooth generalized

version of the trajectories through regression, by using a probabilistic represen-

tation of the data encoded in GMM. We discuss here the similarities shared

by GMR with other regression frameworks proposed in the literature. In Pro-

jection Pursuit Regression (PPR), high-dimensional data are iteratively inter-

preted in lower-dimensional projections where a smoothing method is applied

(Friedman, Jacobson, & Stuetzle, 1981). In Multivariate Adaptive Regression

Splines (MARS), high-dimensional data are split into subspaces where splines

are fitted to the data (Friedman, 1991). The use of local regression techniques

is another approach, which is probably the most extensively studied technique

used initially to smooth data, see Cleveland and Loader (1996) for a review. By

considering a dataset of input and output variables {x_i, y_i}_{i=1}^N, the aim of local regression is to estimate E(y_i) = f(x_i) locally, by considering a subset of points x_j in the neighborhood of x_i. Cleveland (1979) introduced Locally Weighted

Regression (LWR) where the fit also depends on the distance between xj and

xi. The influence of xj is thus weighted, giving more weight to the points near

the point xi whose response is being estimated and less weight to the points

further away. For each point x_i whose response is being estimated, a linear or quadratic polynomial function is usually used to fit the data locally. By computing the regression function values for each of the N datapoints, a smooth curve

is then retrieved, with the smoothness determined by the weighting function.

By considering a Gaussian distribution centered on xi as weighting function,6

the amount of data contributing to the local estimation can be parameterized

by variance Σi.7 The selection of Σi is thus a bias-variance trade-off (Geman,

Bienenstock, & Doursat, 1992), i.e., large Σi produces a smooth estimate (small

variance), while small Σi conforms more to the data (small bias), see the end of

Section 2.5. Similarly, the choice of the polynomial degree is also a bias-variance

tradeoff. Methods have been proposed to automatically select these parameters,

namely an adaptive local bandwidth selection and an adaptive local selection of

the polynomial degree (Cleveland & Loader, 1996).
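As a concrete sketch, a minimal one-dimensional LWR with a Gaussian weighting function and a local linear fit can be written as follows (the bandwidth value and dataset are arbitrary illustrative choices):

```python
import math

def lwr(xs, ys, x_query, bandwidth=1.0):
    """Locally weighted linear regression at a single query point.

    Each datapoint is weighted by a Gaussian kernel centered on the
    query, and the weighted linear fit is solved in closed form via the
    normal equations.
    """
    w = [math.exp(-0.5 * ((x - x_query) / bandwidth) ** 2) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * x for wi, x in zip(w, xs))
    s2 = sum(wi * x * x for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    slope = (s0 * t1 - s1 * t0) / (s0 * s2 - s1 * s1)
    intercept = (t0 - slope * s1) / s0
    return intercept + slope * x_query

xs = [i * 0.1 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]     # noiseless line for the sketch
estimate = lwr(xs, ys, x_query=2.5)  # recovers 2*2.5 + 1 = 6.0 exactly
```

Varying `bandwidth` exposes the bias-variance trade-off discussed above: a large value smooths heavily, while a small value conforms closely to the data.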

As LWR combines the simplicity of linear least squares regression and the

flexibility of nonlinear regression, the approach soon appeared very appealing for

robot control (Atkeson, 1990; Atkeson et al., 1997; Moore, 1992). As discussed

in Section 1.2, further work mainly concentrated on moving on from a memory-

based approach to a model-based approach, and moving on from a batch learn-

ing process to an incremental learning strategy. Schaal and Atkeson (1998)

6. Usually, the distribution is computed only on a bounded interval for computational efficiency.

7. To draw a link with the generalization approach used in our framework, LWR would be defined as the estimate of a response ξ_s at time step ξ_t, computed by weighting the spatial observations ξ_s with respect to their closeness to the temporal input ξ_t.


introduced Receptive Field Weighted Regression (RFWR) as a non-parametric

approach to learn incrementally the fitting function without the need of storing

the whole training data (i.e., without using historical data). The method is

based on a receptive field approach, where each receptive field has the form of

a Gaussian kernel which is updated independently. The process allows receptive fields to be added and pruned, and deals with irrelevant inputs. Within each receptive field, a local linear function models the relationships between input data X and output data Y. The strategy is close to Gaussian mixture modelling,

where each Gaussian would be trained separately. Indeed, by encoding {X, Y} as a joint distribution in a GMM of parameters {µ_k, Σ_k}_{k=1}^K, the local Gaussian model of X is described by µ_{X,k}, Σ_{X,k}, which is closely related to the notion of a Gaussian unit in the receptive field. In receptive fields, the parameters are {c_k, M_k}_{k=1}^K, representing respectively the centers (which are not updated) and the distance metrics of the Gaussians (which are similar to Σ_X). The local Gaussian model of Y is parameterized by µ_{Y,k}, Σ_{Y,k}, which can also be related to the local linear model in the receptive field. In receptive fields, the parameters of the linear model are {β_k}_{k=1}^K, which are learned either incrementally or in batch mode. In batch mode, a leave-one-out cross-validation scheme can be used to learn {M_k}_{k=1}^K by a gradient descent method, by considering a penalty factor

to regulate the smoothness. An incremental version can then be derived by a

stochastic approximation. When considering GMM, an incremental version of

the well-known Expectation-Maximization (EM) algorithm can be used to in-

crementally learn the parameters (Section 2.8.3), where the iterative estimation

process is regulated by bounding the covariance matrices (Section 2.11.1).8

Xu, Jordan, and Hinton (1995) also proposed a similar scheme by modeling

the data with a mixture of experts trained by the EM algorithm, where the experts take the form of polynomial functions (linear models in that case), resulting in a

piecewise polynomial approximation of the data. The method shares similarities

with RFWR but does not use the same training methods.

To resolve the curse of dimensionality issue (Scott, 2004) in RFWR, Vijayakumar et al. (2005) suggested improving the approach so that it can operate

efficiently in high dimensional space, and proposed the use of Locally Weighted

Projection Regression (LWPR). Indeed, for high-dimensional data, considering

the distance between a set of points in a receptive field approach is not op-

timal because the distance among points does not separate the points well in

high dimensions. By detecting locally redundant or irrelevant input dimensions, the approach locally reduces the dimensionality of the input data.

The curse of dimensionality can then be avoided by finding local projections

of the input data using Partial Least Squares (PLS) regression (Wold, 1966).

Projection regression decomposes multivariate regressions into a superposition

of univariate regressions along the projections in the input space. Multiple

8. Note that the purpose is very similar to the use of a penalty factor to regulate the smoothness in the RFWR gradient descent method.


methods have been proposed to select which are the efficient projections (Vi-

jayakumar & Schaal, 2000), and PLS was found to be the most suitable method

for reducing the dimensionality in such a regression framework.9

In this thesis, we follow a similar strategy by considering that a set of tra-

jectories can be represented locally by a Gaussian distribution. We first use

PCA as a global projection method to reduce, if possible, the dimensionality of

the dataset, and to decorrelate the dataset globally. Then, by using GMM, the

non-linearities of the trajectories are modeled by piecewise local information

in the form of Gaussian distributions representing locally the variability and

the correlations of the data. GMM can be viewed as an in-between local and

global model, in the sense that it fits locally the joint input and output data,

using a global learning algorithm.10 GMR is then used as a regression process

to retrieve smooth generalized trajectories and associated constraints by using

the temporal components as inputs and retrieving an estimate of multivariate

spatial output components, which take the form of Gaussian distributions. At

each time step, the center of the retrieved distribution is then used as a gen-

eralized signal with associated covariance matrices describing the constraints

on this generalized signal. The principal advantage of using GMM/GMR for

our application is that GMM can be used easily to recognize and classify trajectories, while GMR uses the same representation to extract constraints and

retrieve a generalized version of the trajectories. The other advantage is that

the associated EM learning methods are easy to implement and provide a sim-

ple estimation method known to have good convergence properties if correctly

initialized. As GMM is a popular encoding scheme used in various fields of

research, we can also benefit from recent work using this probabilistic representation, and easily combine this encoding scheme with other probabilistic approaches (see Section 6.8.3 further).
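The regression step itself can be sketched for the simplest case of a one-dimensional temporal input and a one-dimensional spatial output (the two-component model below is a made-up toy; the full framework also propagates the conditional covariances that encode the constraints):

```python
import math

# Toy GMM over joint (t, s) data: priors, means and covariance entries
components = [
    {"prior": 0.5, "mu_t": 0.0,  "mu_s": 0.0, "s_tt": 1.0, "s_ts": 0.0},
    {"prior": 0.5, "mu_t": 10.0, "mu_s": 1.0, "s_tt": 1.0, "s_ts": 0.0},
]

def gmr(t):
    """Conditional expectation E[s | t] under the joint GMM (GMR)."""
    # Responsibilities h_k(t) computed from the temporal marginals
    h = [c["prior"] * math.exp(-0.5 * (t - c["mu_t"]) ** 2 / c["s_tt"])
         / math.sqrt(2 * math.pi * c["s_tt"]) for c in components]
    total = sum(h)
    # Per-component conditional means, blended by the responsibilities
    return sum((hk / total) *
               (c["mu_s"] + c["s_ts"] / c["s_tt"] * (t - c["mu_t"]))
               for hk, c in zip(h, components))
```

Querying `gmr` at successive time steps yields the smooth generalized trajectory; repeating the same blending for the conditional covariances yields the constraint envelope.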

6.6 Failures and limitations of the proposed

RbD system

6.6.1 Loss of important information through PCA

In the proposed RbD framework, we suggested to use PCA as a pre-processing

step to reduce the dimensionality of the data and to decorrelate the data by

projecting them in a latent space of motion. When considering a small dataset

of high dimensionality, this reduction of dimensionality is useful for the further

encoding in GMM, due to the sparsity of data in high-dimensional space (Scott,

2004). To do so, the projections showing the lowest variability are discarded.

9. Note that in our framework, the curse of dimensionality is concerned with output data instead of input data.

10. Note that we consider in our regression framework unidimensional temporal inputs and multidimensional spatial outputs.


Figure 6.3: Failed attempt at reproducing a skill when reducing the dimensionality through a pre-processing phase using PCA to project the dataset into a latent space of lower dimensionality. Top: original joint angle data space (grey line) of a toy problem that consists in chopping parsley with a knife; the last component consists of a small oscillation of the wrist used to mince the parsley. Bottom: 3-dimensional latent space of motion extracted by PCA. The data projected in this latent space are then projected back into the original data space (black line). The oscillatory component contained in x8 is lost when re-projecting the data.


For a large set of signals, these dimensions often encapsulate the noise inherent

to the demonstrations and to the errors of the tracking system. Discarding these

dimensions thus becomes a good opportunity to denoise the signals. However,

some interesting features of the signals can also be lost during the projection.

For example, if we consider the motion of the hand while wiping a blackboard, we

could decompose the gesture into a small circular motion (aiming at wiping the

local part of the blackboard) and a large motion (aiming at covering the whole

surface of the blackboard). In this example, using PCA could fail to extract the small circular motion, because PCA retains in priority the dimensions presenting large variability. PCA would thus fail to represent the purpose of the task efficiently. Similarly, when recording the joint angles of a cook chopping parsley with a knife, processing the data in a lower-dimensional latent space of motion could also fail to extract the rapid mincing gesture of the hand if PCA is used as the projection criterion. A similar

toy problem is presented in Figure 6.3, where variable x8 is said to represent an

important sinusoidal signal for the task. Due to its small variance, this variable

is not taken into account when reducing the dimensionality of the latent space

of motion. This particular characteristic of the task (which is not due to noise

in this example) disappears when reproducing the skill (black line).
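This failure mode is straightforward to reproduce numerically. The sketch below (restricted to two dimensions, with made-up amplitudes) keeps only the first principal component of a dataset combining a large-variance ramp and a small-amplitude oscillation; the oscillation essentially vanishes in the back-projection:

```python
import math

# Dimension 1: large-variance ramp; dimension 2: small oscillation
data = [(0.1 * t, 0.01 * math.sin(t)) for t in range(100)]

m1 = sum(p[0] for p in data) / len(data)
m2 = sum(p[1] for p in data) / len(data)
centered = [(p[0] - m1, p[1] - m2) for p in data]

# 2x2 covariance matrix entries
n = len(data)
a = sum(x * x for x, _ in centered) / n
b = sum(x * y for x, y in centered) / n
c = sum(y * y for _, y in centered) / n

# Leading eigenvector of [[a, b], [b, c]] (closed form for the 2x2 case)
lam = 0.5 * ((a + c) + math.sqrt((a - c) ** 2 + 4 * b * b))
v = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
norm = math.hypot(v[0], v[1])
v = (v[0] / norm, v[1] / norm)

# Project on the first principal component only, then back-project
recon = [(m1 + (x * v[0] + y * v[1]) * v[0],
          m2 + (x * v[0] + y * v[1]) * v[1]) for x, y in centered]

# Peak-to-peak amplitude of the oscillatory dimension, before and after
ptp_orig = max(p[1] for p in data) - min(p[1] for p in data)
ptp_recon = max(p[1] for p in recon) - min(p[1] for p in recon)
```

The ramp survives the projection almost unchanged, while the reconstructed oscillation keeps only a small fraction of its original amplitude, just as the wrist oscillation is lost in Figure 6.3.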

6.6.2 Failure at learning incrementally the GMM

parameters

We have seen throughout this thesis that the user must demonstrate a task in

different situations to help the robot extract the task constraints and general-

ize the skill to various contexts. However, the situations encountered should

not be too dissimilar from one demonstration to the next one, i.e., the skill

should be demonstrated with an increased variability in the contexts where it

can be applied. Note that this issue also applies for humans. When a teacher

presents different examples of a skill, these examples must be sufficiently con-

nected together to show how the skill can be extended to different situations. In

other words, the consecutive demonstrations should show a progression in the

generalization perspective.

In our system, if the user does not show appropriate variability in his/her

consecutive demonstrations, the learning system can also fail at connecting the

different examples and converging to an optimal solution. This failure is partially prevented by updating the model only with trajectories that have previously been recognized by the model, that is, with demonstrations that are not too dissimilar from the situations already encountered, namely those whose likelihood under the current model is above a given threshold. When this threshold is too low, a trajectory that is very dissimilar from the others can be used to update the model. As only this last observation is used to incrementally update the model, the update can then fail to estimate the set of parameters correctly.
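The gating mechanism can be sketched as follows (the single-Gaussian model and threshold value are illustrative assumptions; the actual system applies the threshold to the likelihood under the full GMM before the direct update of Section 2.8.3):

```python
import math

class GatedModel:
    """Running Gaussian model that accepts a new trajectory only if its
    average log-likelihood under the current model exceeds a threshold."""

    def __init__(self, mean, var, threshold):
        self.mean, self.var, self.threshold = mean, var, threshold
        self.n = 1  # number of trajectories already incorporated

    def avg_loglik(self, traj):
        return sum(-0.5 * (x - self.mean) ** 2 / self.var
                   - 0.5 * math.log(2 * math.pi * self.var)
                   for x in traj) / len(traj)

    def update(self, traj):
        """Direct update of the mean, gated by the likelihood threshold."""
        if self.avg_loglik(traj) < self.threshold:
            return False  # too dissimilar: do not corrupt the model
        m_new = sum(traj) / len(traj)
        self.mean = (self.n * self.mean + m_new) / (self.n + 1)
        self.n += 1
        return True

model = GatedModel(mean=0.0, var=1.0, threshold=-5.0)
accepted = model.update([0.1, -0.2, 0.3])   # close to the model: accepted
rejected = model.update([9.0, 10.0, 11.0])  # far from the model: rejected
```

Setting `threshold` too low reproduces the failure described above: the dissimilar trajectory would pass the gate and pull the parameter estimate away from the previously learned solution.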


Figure 6.4: Illustration of the importance of the variability between two consecutive demonstrations when learning the data incrementally using the direct-update method (top: low variability between two consecutive demonstrations; bottom: high variability). The same dataset is used in the two situations, but presented in a different order. The graphs show the consecutive encoding of the data in GMM, after convergence of the EM algorithm. Only the latest observed trajectory (black line) is used to update the models (no historical data are used). The last graphs show the generalization through GMR resulting from the GMM parameter estimation.


Figure 6.5: Illustration of the importance of the variability between two consecutive demonstrations when learning the data incrementally using the generative method (top: low variability between two consecutive demonstrations; bottom: high variability). The same dataset is used in the two situations, but presented in a different order. The graphs show the consecutive encoding of the data in GMM, after convergence of the EM algorithm. Only the latest observed trajectory (black line) is used to update the models (no historical data are used). The last graphs show the generalization through GMR resulting from the GMM parameter estimation.


Figure 6.4 illustrates this issue, where two models are built upon the same

set of demonstrations but presented in a different order. The models are learned

incrementally by a direct-update method (see Section 2.8.3). When the skill is

progressively demonstrated, i.e., when the consecutive examples provided share

similarities with the previous demonstrations, encoding and generalization of the

skill provide satisfying results (top). However, when two consecutive demonstra-

tions are presented in very different contexts, i.e., by trying to generalize the

skill too rapidly, the system fails to learn the parameters optimally (bottom).

In this example, we see that after the second demonstration, the system is not

able to combine the last part of the first trajectory with the last part of the

second trajectory (only a poor local solution is found by EM). The two final

parts of the first and second trajectories thus tend to be encoded separately,

as we can see after the sixth demonstration. One of the Gaussian components remains stuck encoding only one specific part of the dataset (the

end of the first demonstration), that is, the component is not updated optimally

with respect to the other demonstrations. This is not very surprising, since the

assumption of the algorithm is not satisfied at least for the first two demon-

strations (to produce a correct update of the parameters, the set of posterior

probabilities must remain similar enough when new data are used to update the

model). Thus, the GMM parameter update provides a poor estimate after the second demonstration, and the estimation then remains stuck in this poor local optimum.

Figure 6.5 presents a similar example by learning incrementally the data

using a generative method (see Section 2.8.3). For this particular dataset, we

see that the generative method is better at capturing the essential constraints

of the task in terms of variability. However, relying on a stochastic process still

has the disadvantage that it cannot guarantee that two consecutive learning

processes will find the same estimation of the parameters.

6.6.3 Failure at extending a skill to a context that is too

dissimilar to the ones encountered

When the constraints of a task have been extracted, we have seen in Section 2.5

that the system can reproduce the task in new situations that have not been

directly observed during the demonstrations. However, the extent of this gener-

alization capability is limited to the variability of the demonstrations provided.

Figure 6.6 presents an example of failure for the Bucket Task presented in the

experiments in Section 5.3. We see in inset (1) that the essential characteristics

of the task are correctly reproduced, namely grasping the bucket and bringing

it to a specific place. However, in insets (2) and (3), where the initial positions

of the bucket are far from the demonstrated trajectories, we see that the system

fails at reproducing correctly the skill (only a small deviation of the trajectories

in the direction of the bucket is observed). Thus, in a situation that is too


Figure 6.6: Failed attempt at reproducing the task. Bottom left: mean values taken by the cost function H for the Bucket Task with a varying initial position of the bucket on the table. 1, 2, 3: reproduction for the corresponding three locations on the map.

dissimilar to the ones encountered during the demonstrations, the system fails

at reproducing the skill. This is mainly due to the restriction of the learning

system to generalize a skill only in the range of variability introduced during the

demonstrations. One way to prevent this would be to provide demonstrations

with more variability on the initial positions of the object, or to allow the robot

to extend its search space by adding self-learning capabilities to the robot (see

also Section 6.7.2 further).

6.7 Collaborative work

We present here two applications considering two different learning approaches

that are complementary to our framework. These two applications highlight the

generality of the proposed approach and the efficiency of representing the task

constraints in a probabilistic framework.

6.7.1 Toward the use of a dynamic approach to be

robust to perturbation during reproduction

In the experiments reported in this thesis, we implicitly assumed that kinematics

information was sufficient to describe the task and that dynamical information

was less important for the tasks considered. Indeed, the system proposed in


Figure 6.7: Demonstrations of velocity profiles using different initial configurations.

Figure 6.8: Handling of dynamic perturbations during the reproduction attempts, where the perturbations are produced by displacing the box during the reproduction of the skill by the robot.

Figure 6.9: Illustration of the joint use of a dynamical system and velocity profiles learned through demonstrations. The point represents the target to reach. The straight line represents the trajectory followed by the simple dynamical system. The thin curved line represents the trajectory followed by the generalized velocity profile (learned through demonstrations). The thick curved line finally represents the trajectory followed by the dynamical system modulated with the generalized velocity profile.


this thesis is open-loop and is aimed at providing a solution to the reproduction

of a task by generalizing over the different demonstrations produced, that is,

it restrains the search space of the possible solutions that the robot can use to

achieve a task.

To deal with dynamical information, the framework was also applied successfully in a collaborative work where the model presented in this thesis was used jointly with a dynamical system that handles perturbations during the reproduction attempts (Hersch & Billard, 2006). The idea is that robustness to changes in the environment and to external perturbations can be achieved by: (1) generalizing over a set of demonstrated velocity profiles; and (2) using the generalized velocity profile to modulate a dynamical system. This work thus investigated the use of GMR to provide a range of velocity profile solutions for the reproduction of a task, which are then used in a dynamical system to achieve stable solutions to a reproduction attempt in case of perturbations.

The method combines the use of a dynamical system to reach a specific tar-

get with the GMR representation of the velocity profiles demonstrated by the

user. The reproduction of the skill under dynamical perturbations can then be

achieved by weighting the influences of the different controllers. The dynamical

system used in this experiment is described as

ξ̈ = β1 (−ξ̇ + β2 (ξg − ξ)),

where ξ, ξ̇ and ξ̈ refer to the position, velocity and acceleration (in joint space and Cartesian space), ξg refers to the position of an attractor (the goal to reach), and β1 and β2 are the parameters of the dynamical system.
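The convergent behavior of this attractor can be checked with a short simulation (Euler integration; the gains, time step and goal are arbitrary illustrative values):

```python
def simulate(goal, beta1=4.0, beta2=1.0, dt=0.01, steps=2000):
    """Integrate xi_dd = beta1 * (-xi_d + beta2 * (goal - xi)) with Euler steps."""
    xi, xi_d = 0.0, 0.0  # start at rest at the origin
    for _ in range(steps):
        xi_dd = beta1 * (-xi_d + beta2 * (goal - xi))
        xi_d += xi_dd * dt
        xi += xi_d * dt
        # A modulated version would blend xi_d with a GMR velocity
        # profile at this point, as done in the collaborative work.
    return xi

final = simulate(goal=1.0)  # converges to the attractor at 1.0
```

With these gains the system is critically damped and settles on the attractor, which is the baseline trajectory that the learned velocity profiles then modulate.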

Figure 6.7 shows the different demonstrations performed in different initial

configurations for a task consisting of putting an object into a box. Figure 6.8

shows the reproduction of the skill by considering dynamical perturbations pro-

duced here by displacing the box during the reproduction attempt. Figure 6.9

represents the influence for the reproduction of the different elements (dynami-

cal system and generalized velocity profiles through GMR).

Different issues still have to be investigated, notably concerning the use of

multiple attractors, which can be linked to the issue concerning the extraction

of specific events to characterize subtasks in the global motion (Section 6.8.1).

6.7.2 Toward a self-learning approach using GMR to

explore the search space

The RbD framework presented in this thesis assumed that the skill was learned

solely through imitation-based approaches. It is however known from human development that imitation is not the only means of learning new skills. Indeed,

as shown in Section 6.2, self-learning is also a crucial component in the learning

paradigm. As presented in Chapter 1, several works in robotics adopted the


Figure 6.10: Adaptation of a skill learned by imitation through the use of Reinforcement Learning (RL) techniques, where the skill consists of putting an object into a box. The generalized trajectory learned through demonstrations is represented as a dark thick line. When a new obstacle is placed in the way, the robot needs to adapt its model to avoid the obstacle. The adaptation of the learned trajectory through RL is represented as a light thin line.


Figure 6.11: Illustration of the joint use of imitation and self-experimentation processes to explore the search space for new solutions based on the examples provided by the user. Left: Original model of the skill learned through multiple demonstrations. Middle: Search for new solutions, initiated by displacing the Gaussian centers randomly with respect to the learned covariance information and using Reinforcement Learning to find a correct adaptation of the skill. Right: New trajectory generated through GMR by using the modified model (the original model is represented in light color).



Figure 6.12: Encoding of Model A in GMM.

perspective that self-experimentation and practice can be used jointly to learn a skill, where observations are used to initiate the further search of a controller by repeated reproduction attempts (Atkeson & Schaal, 1997; Bentivegna et al., 2004). This combination of observation and practice aims at developing more robust and autonomous controllers, leveraging new demonstrations when the robot is faced with a situation that is too dissimilar to the ones previously demonstrated (see also Section 6.6.3).

The framework presented here was successfully applied in collaborative work considering both imitation and self-experimentation to learn and adapt a skill to new situations using Reinforcement Learning (RL) techniques (Guenter, Hersch, Calinon, & Billard, 2007). Figure 6.10 illustrates the principles of this joint work. The search process is illustrated in Figure 6.11. The GMM representations learned during the demonstrations are used as hints for the search of new solutions during the self-exploration phase. The exploration of new solutions is initiated by first displacing the centers µk of the GMM (with respect to Σk) and by using GMR to retrieve new trajectories that are smooth enough to be considered as plausible solutions for the robot. In Guenter et al. (2007), the authors showed that this joint work allows the skill to be extended to a larger context than the ones demonstrated.
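The exploration step described above can be sketched as follows; the sampling scheme and the scale parameter are illustrative assumptions, not the exact procedure of Guenter et al. (2007):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_centers(mu, sigma, scale=0.1):
    # Displace each Gaussian center mu_k randomly with respect to its
    # covariance Sigma_k, so that exploration stays within the variability
    # observed in the demonstrations (scale is a hypothetical gain)
    return np.array([rng.multivariate_normal(m, scale * s)
                     for m, s in zip(mu, sigma)])

# Toy GMM with 3 components over (time, position)
mu = np.zeros((3, 2))
sigma = np.tile(np.eye(2), (3, 1, 1))
new_mu = perturb_centers(mu, sigma)
```

Running GMR on the perturbed model then yields a smooth candidate trajectory that the RL search can evaluate.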

6.8 Further work

6.8.1 Toward a joint use of discrete and continuous constraints

Throughout this thesis, we have suggested representing the constraints of a skill in a continuous form, taking the perspective that a continuous model is generic and allows different kinds of movements to be encoded by carrying the degree



Figure 6.13: Encoding of Model B in GMM.


Figure 6.14: Top: Encoding of Model A and Model B in a Model AB using a mixture of GMMs. Bottom: Regression using Model AB to retrieve a smooth transition between Model A and Model B.


of invariance along the trajectories. In further work, a segmentation of these constraints could also be considered to detect particular events in the flow of motion. Indeed, as seen in the experiments in Chapter 5, some parts of the tasks are constrained by particular high-level events such as grasping or dropping an object. It could thus be useful to treat these constraints as higher-level events by segmenting the trajectories into a discrete set of important events, i.e., by considering the subparts of the motion showing strong invariance as specific events in the continuous flow of motion.

Moreover, the BMM representation of binary signals presented in Section 2.14.1 could also be used as a hint to segment the whole motion into specific events. Indeed, a BMM could model probabilistically the temporal occurrence of particular events in the trajectory (e.g., when a keyword is detected by a speech recognition system). As these events would not occur at the same time in each demonstration, the probabilistic representation of the BMM could be used to encode this variability efficiently.
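As a toy sketch of the idea, a per-time-step Bernoulli estimate (a simplified stand-in for the full BMM of Section 2.14.1) can capture when an event tends to occur across demonstrations:

```python
import numpy as np

def bmm_event_profile(binary_demos):
    # Bernoulli parameter per time step, estimated from several
    # demonstrations of the same binary event signal
    return np.mean(binary_demos, axis=0)

# Three toy demonstrations of an event occurring around the same time
demos = np.array([[0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0],
                  [0, 0, 1, 1, 0]])
profile = bmm_event_profile(demos)
generalized = (profile > 0.5).astype(int)  # thresholded generalized signal
```

Thresholding the profile yields a generalized binary signal while the profile itself encodes the temporal variability across demonstrations.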

Discretization would thus allow a motion to be represented as a set of discrete events that are invariant across multiple demonstrations, with transitions from one event to the other described in a continuous form using GMM. For reproduction, the advantage of decomposing a motion into a set of separate subtasks is that each subtask can be refined separately, i.e., by producing a demonstration for only one of the subtasks instead of the whole motion. Moreover, if the robot fails at reproducing only one part of the skill, it would have the possibility to repeat only the failed subpart and then continue with the remaining parts, without having to start again from the beginning of the motion. By using this segmentation process, the different subtasks could also be combined in a different order by taking advantage of the smoothing properties of GMR. To illustrate this, a toy example is presented in Figures 6.12, 6.13 and 6.14, presenting respectively a subtask A encoded in a GMM, a subtask B encoded in a GMM, and a task AB encoded in a mixture of GMMs. By doing so, the transitions between the two subtasks A and B can be computed by: (1) modifying the temporal components of the GMM encoding subtask B; (2) merging the Gaussian components of the two GMMs into a mixture of GMMs (top of Figure 6.14); (3) performing regression through GMR on this new model (bottom of Figure 6.14).
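A minimal sketch of these three steps, assuming single-Gaussian toy models for subtasks A and B (all values and helper functions are illustrative, not taken from the thesis):

```python
import numpy as np

def gauss_pdf(x, mean, var):
    # Univariate Gaussian density over the time dimension
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmr(priors, mus, sigmas, t):
    # Gaussian Mixture Regression: time (dim 0) is the input, the
    # remaining dimensions are the output
    h = np.array([p * gauss_pdf(t, m[0], s[0, 0])
                  for p, m, s in zip(priors, mus, sigmas)])
    h = h / h.sum()
    out = np.zeros(len(mus[0]) - 1)
    for hk, m, s in zip(h, mus, sigmas):
        out += hk * (m[1:] + s[1:, 0] / s[0, 0] * (t - m[0]))
    return out

# Toy subtasks A and B, each a single (time, x) Gaussian
mu_a, sig_a = np.array([25.0, 0.0]), np.array([[50.0, 0.0], [0.0, 0.01]])
mu_b, sig_b = np.array([25.0, 1.0]), np.array([[50.0, 0.0], [0.0, 0.01]])

mu_b2 = mu_b + np.array([50.0, 0.0])  # (1) shift B's temporal component
priors = [0.5, 0.5]                    # (2) merge the components
mus, sigs = [mu_a, mu_b2], [sig_a, sig_b]

x_mid = gmr(priors, mus, sigs, t=50.0)  # (3) regression at the junction
```

At the junction (t = 50) both components contribute equally, so the retrieved output blends smoothly from A's level toward B's.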

Extracting specific events automatically in the trajectory is closely related to the when-to-imitate issue and to the automatic segmentation of the data. In the current system presented in this thesis, only the user has the ability to partition the data trace into segments that are used in further processing. In Calinon and Billard (2004), we tried to use simple methods to segment multiple demonstrations automatically from a continuous flow of motion, with an approach based on a moving window checking for segments of low velocities along the motion. By staying in a fixed posture a few milliseconds before and after each demonstration, the system could then detect the start and the end


of a demonstration. We noticed through this experiment that the process was robust when used by a trained user but that it required adaptation from the user when used for the first time, showing that the process is not very natural. This when-to-imitate issue could be explored in further work, where a possible way to solve the problem could be to detect the specific events in the continuous flow of motion.
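The moving-window idea can be sketched as follows; the window length and velocity threshold are hypothetical values, not those of Calinon and Billard (2004):

```python
import numpy as np

def segment_by_pauses(positions, dt=0.01, window=10, v_thresh=0.05):
    # Flag low-velocity stretches as pauses and return the (start, end)
    # index pairs of the motion segments found between them
    vel = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    kernel = np.ones(window) / window
    speed = np.convolve(vel, kernel, mode='same')  # moving-average smoothing
    moving = speed > v_thresh
    segments, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i
        elif not m and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(moving)))
    return segments

# Example: pause - motion - pause, as when the demonstrator holds a fixed
# posture before and after the demonstration
pos = np.concatenate([np.zeros(50), np.linspace(0.0, 1.0, 50), np.ones(50)])
segs = segment_by_pauses(pos.reshape(-1, 1))
```

The detected segment boundaries then give the start and end of the demonstration.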

Another crucial point to consider in further work is the who-to-imitate issue, or how to automatically select good examples to learn a task. Indeed, the current system does not cope with outliers and relies on the quality of the examples provided by the user. This is currently left as an open question in our framework, and further work could explore how to give the robot the ability to select who is the good teacher for a given situation, and how to cope with bad demonstrations, see e.g. (Ting, D'Souza, & Schaal, 2007).

6.8.2 Toward a hierarchical learning approach

If the whole motion is segmented into a sequence of subtasks, we have seen in the previous section that it could be useful to reorganize these elements in a different order to perform new skills. One way to do so is to extend the proposed framework to a hierarchical model. For example, further work could explore the use of hierarchical GMR to reproduce a skill. By taking inspiration from the issues in RbD depicted in Muench et al. (1994), we could consider the generalization issue both at a trajectory level and at a process level by considering a hierarchical architecture. Indeed, GMMs could be used to encode basic trajectory elements represented as states in a Dynamic Bayesian Network (DBN) or in a Hidden Markov Model. The role of the DBN/HMM would be to learn the skill in terms of the organization of trajectory elements. A generalized sequence of elements could then be retrieved by the model. After this organization of the basic elements, GMR could then be used to connect smoothly the trajectory segments representing the different elements, as illustrated in Figure 6.14. This would produce a low-level controller for the robot characterized by smooth transitions between different motion primitives. Generalization could thus take place simultaneously at a trajectory level (to extract the relevant features of an action) and at a program level (to organize the different components of a behaviour). Open questions here concern the discretization of the trajectory elements and the hierarchical learning of the model. Note that such an architecture would be very close to the theoretical model of imitation proposed by Byrne (1999) in ethology, where an imitation process is described both at a program level and at an action level.
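The sequencing role of the DBN/HMM can be sketched with a simple transition matrix over trajectory elements; the matrix values below are hypothetical, standing in for statistics learned from observed orderings:

```python
import numpy as np

# Hypothetical transition probabilities over three trajectory elements
# (0, 1, 2), as would be learned from demonstrated orderings
trans = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.8, 0.1, 0.1]])

def generalized_sequence(start, steps):
    # Greedy retrieval of the most likely sequence of elements,
    # following the highest-probability transition at each step
    seq = [start]
    for _ in range(steps):
        seq.append(int(np.argmax(trans[seq[-1]])))
    return seq
```

Each retrieved element index would then select a GMM of that trajectory segment, with GMR smoothing the junctions as in Figure 6.14.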



Figure 6.15: Use of prior information within the probabilistic framework, using information coming from the gaze and pointing directions.

6.8.3 Toward the use of attentional scaffolding as priors in the probabilistic framework

In Calinon and Billard (2006), we investigated the use of attentional scaffolding to extract information about the relevant objects to consider in the task. In Calinon, Epiney, and Billard (2005), we investigated the use of communicative gestures and speech for attentional scaffolding and to guide the whole interaction scenario. These social cues aim at understanding the intent of the user and can be used and combined differently depending on the demonstrator's and imitator's personalities. For humans, the combination of these cues involves complex emotional and cultural interaction aspects. Some of these hints are subtle and can lead to misunderstandings (e.g., facial expression), while others are more explicit (e.g., speech). Depending on who they are interacting with, humans naturally select the appropriate means by which to transfer their intention (e.g., use of gestures with deaf people). A humanoid robot also has its own particular sensory and working memory capabilities, and current humanoid robot sensors are still not able to capture subtle social cues in the way humans do. Throughout this thesis, we have thus suggested that one of the most robust and appropriate ways of transferring information to the robot is by using imitation learning techniques based on statistics. We have seen that learning the



Figure 6.16: Left: Estimation of the gazing (or pointing) direction as the intersection of a cone of vision with a plane (the surface of a table). Right: Representation of the intersection as a covariance matrix to estimate probabilistically where the user is looking (or pointing) at.

correlations and invariance across a set of sensory variables can be carried out quite efficiently by the robot. Indeed, compared to humans, keeping track of a large amount of information is not a bottleneck for the robot, due to its high working memory capacity.

However, extracting the constraints through statistics is not the only means of conveying information. Indeed, we have seen in the experiment presented in Section 5.2 that the use of priors could speed up learning. One direction for further work is to explore the use of priors such as gazing/pointing direction or vocal information to convey additional information on the relevant features or events to detect during the demonstrations (Figures 6.15 and 6.16). Concerning vocal information, speech recognition could be considered, where a BMM could model probabilistically the temporal occurrence of the different keywords to recognize. More generically, without looking into the linguistic content of speech, prosodic patterns could provide useful information on the user's communicative intent (Breazeal & Aryananda, 2002). By taking insights from these characteristics, a direction for further work would be to consider the energy and pitch of the vocal traces, which could be encoded in a GMM and be jointly used as priors in the proposed probabilistic framework.

6.8.4 Improving GMR by considering multiple inputs

Regression using a GMM representation of the data can be extended in several ways. Throughout this thesis, we have considered the use of GMR by taking a temporal variable as input to retrieve smooth spatial trajectories. However, GMR could also be used to tackle the more generic issue of sensory-motor learning as in (Atkeson, 1990; Ghahramani & Jordan, 1994).

Another direction for further work would be to consider the use of GMR to improve the scaffolding process by guiding the robot throughout the task without being constrained to demonstrate the skill in a given period of time. Indeed,



Figure 6.17: Example of GMR using multi-dimensional input variables. First line: 4-dimensional data where only the couplings x1, x2 (cross signs) and x3, x4 (plus signs) are represented. Second line: Encoding of the data in a GMM. Third line: Regression by using x1, x2 as input variables (dashed line) and retrieving x3, x4 as output variables (solid line), with associated constraints (shaded areas). This process could be used for scaffolding by replacing for example x1, x2 with the joint angles of the left arm and x3, x4 with the joint angles of the right arm. If the motors of the left arm are moved by the user, the robot can adapt its motion by moving the motors of the right arm accordingly, through the GMR process.


in the scenario proposed in the first experiment of Section 5.4, the user produced a first demonstration of the skill, which was then reproduced by the robot at the same speed as the demonstration. During reproduction, the user provided scaffolds by simultaneously moving a subpart of the robot's joints through the motion. The user is then constrained to provide the scaffolds at the correct timing, i.e., the robot still guides the reproduction in terms of speed. It means that the user must follow the gesture as closely as possible, and does not have the ability to deform the demonstration temporally (e.g., by showing some parts of the motion more slowly). A possible solution to tackle this drawback would be to use the multidimensional GMM representation of the gesture (already acquired by the robot) to retrieve one part of the gesture while the user controls the remaining part. It means that the DOFs that are manually controlled by the user could be used as input variables for the GMR process to retrieve the remaining DOFs for the motors activated by the robot. By doing so, the user could provide more natural scaffolds by letting the robot automatically synchronize the movements of its active motors with the passive motors controlled by the user. The role of GMR here would thus concern the selection of the most likely joint angle values for the active motors, knowing the joint angle values of the passive motors. Similarly, regression using the BMM representation could be explored in further work, by considering multivariate inputs and outputs.11

A toy example used to illustrate how GMR could be used within a scaffolding context is presented in Figure 6.17, where the GMR process considers multivariate inputs and outputs.
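The core of such multi-input GMR is Gaussian conditioning; a minimal single-component sketch (the index sets and values below are illustrative, not from the thesis):

```python
import numpy as np

def condition_gaussian(mu, sigma, in_idx, out_idx, x_in):
    # Condition a multivariate Gaussian on the input dimensions (e.g. the
    # joints moved by the user) to predict the output dimensions (the
    # joints actuated by the robot)
    gain = sigma[np.ix_(out_idx, in_idx)] @ np.linalg.inv(
        sigma[np.ix_(in_idx, in_idx)])
    mu_cond = mu[out_idx] + gain @ (x_in - mu[in_idx])
    sigma_cond = sigma[np.ix_(out_idx, out_idx)] - gain @ sigma[
        np.ix_(in_idx, out_idx)]
    return mu_cond, sigma_cond

# Toy 2-D Gaussian with correlation 0.5 between input x1 and output x2
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
m, s = condition_gaussian(mu, sigma, in_idx=[0], out_idx=[1],
                          x_in=np.array([1.0]))
```

In a full GMR, this conditioning would be applied per component and blended by the components' responsibilities.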

6.8.5 Improving the latent space projection by considering non-linear or local models

Throughout this thesis, the latent spaces of motion were defined by a global linear projection into a subspace of lower dimensionality. Then, GMMs were used to encode the data locally by extracting the variance and correlations across the different variables. Thus, the global linear projection acts here as a rough process to discard the additional dimensions which are redundant with respect to the entire motion. A possible improvement of the system would be to use local projection models instead of global projection models, by exploring for example the use of Probabilistic Principal Component Analysis (PPCA) instead of PCA to decompose the data (Tipping & Bishop, 1999).12
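A minimal sketch of the global linear projection used as the baseline here (plain PCA, not PPCA), with back-projection relying on the linear transformation properties of Gaussians:

```python
import numpy as np

def pca_latent(data, n_components):
    # Global linear projection into a latent space via an
    # eigendecomposition of the covariance matrix
    mean = data.mean(axis=0)
    centered = data - mean
    eigval, eigvec = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigval)[::-1][:n_components]
    A = eigvec[:, order]            # projection matrix (orthonormal columns)
    latent = centered @ A
    return latent, A, mean

# Toy 4-D data lying on a 2-D linear subspace
rng = np.random.default_rng(1)
z = rng.standard_normal((100, 2))
W = rng.standard_normal((4, 2))
data = z @ W.T
latent, A, mean = pca_latent(data, n_components=2)
```

A Gaussian (µ, Σ) estimated in the latent space maps back to the original space as µ' = Aµ + mean and Σ' = AΣAᵀ, which is what makes the linear projection convenient for the GMM encoding.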

Another direction for the improvement of the current setup would be to

11 Note that only a threshold on the parameter value of the BMM is used in Section 5.3 to retrieve a generalized version of the binary signals, which could be extended to a more efficient regression approach.

12 Note that this representation can be viewed as an extension of the GMM representation where the principal components of each covariance matrix would be computed to discard the eigenvectors whose corresponding eigenvalues are low, similarly to the process of applying PCA locally.


consider non-linear transformations such as Kernel PCA (Scholkopf, Smola, & Muller, 1998) or Gaussian Process based approaches (Lawrence, 2005).13 It would also be interesting to see whether a polynomial approach could be used to decompose the data, where higher order transformations could be approximated by a Taylor expansion. By doing so, it would still be possible to transform a multivariate Gaussian distribution from the latent space to the original data space (and vice versa) by considering only the linear transformation properties of Gaussian distributions.

6.8.6 Predicting the outcome of a demonstration

Another important direction of research concerns the ability of the robot to understand the intention of the user and to learn a skill by observing a failed attempt from the demonstrator. Indeed, even by watching a failed attempt, people usually learn the desired result that the teacher intended to achieve. It would thus be interesting to explore this ability in the current framework, and see how the system could be modified to handle this type of paradigm.

The extraction of intention is also related to the dual role of perception and prediction explored in robotics (Demiris & Hayes, 2002). Demiris and Khadhouri (2006) suggested using building blocks consisting of pairs of inverse and forward models that compete in parallel and in a hierarchical manner to both plan and execute an action. As discussed in Section 1.2, the architecture allows both perception and actions to be treated in a unified model, and is able to provide an estimate of the upcoming outcomes when observing an action performed by the demonstrator. A hypothetical solution to deal with such an issue in our framework would be to take advantage of the prediction properties of the GMM. Indeed, after observation of a set of successful attempts, the model has the ability to recognize a skill on the fly when performed by the user, by computing the likelihood of the model when faced with a partial observation of the skill. Using this on-line recognition property, it could then be possible, even by watching a failed attempt, to estimate what the purpose of a task was (or at least to estimate which were the specific task constraints at the end of or during the motion). By doing so, it would be possible to predict the outcomes of a task at a trajectory level, assuming that correct demonstrations of the skill have previously been provided. Prior information in the form of attentional scaffolding (Figures 6.15 and 6.16) could also be of high relevance to extract the intent of the user more robustly, as discussed in Section 6.8.3.
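The on-line recognition property mentioned above can be sketched as a likelihood comparison between learned models; the models and the partial observation below are toy assumptions:

```python
import numpy as np

def gauss(x, mean, cov):
    # Multivariate Gaussian density
    d = len(mean)
    diff = x - mean
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / np.sqrt(
        (2 * np.pi) ** d * np.linalg.det(cov))

def gmm_loglik(points, priors, mus, sigmas):
    # Average log-likelihood of a (possibly partial) observation under a
    # GMM -- the quantity used to recognize on the fly which skill is
    # being performed
    ll = 0.0
    for x in points:
        ll += np.log(sum(p * gauss(x, m, s)
                         for p, m, s in zip(priors, mus, sigmas)))
    return ll / len(points)

# Two hypothetical single-component skill models
model_a = ([1.0], [np.zeros(2)], [np.eye(2)])
model_b = ([1.0], [np.full(2, 5.0)], [np.eye(2)])
obs = np.zeros((5, 2)) + 0.1  # partial observation close to skill A
ll_a = gmm_loglik(obs, *model_a)
ll_b = gmm_loglik(obs, *model_b)
```

The model with the higher likelihood identifies the skill being attempted, even before the motion is complete.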

13 Note that such a decomposition could suffer from the lack of data or from the slow convergence of the learning algorithms.


6.8.7 Toward a lifelong learning approach

Finally, a similar direction for further work concerns the plasticity of the learning process. Indeed, we have presented in this thesis two incremental learning algorithms in Section 2.8.3 that weigh equally the different demonstrations provided to the robot (see for example the linear learning rate considered in Equation (2.15)). As this linearity assumption does not match the learning rates observed in humans, the incorporation of the notion of plasticity (or alternatively of forgetting factors) in the current learning framework could be explored in further work. Plasticity could be considered by fixing a behaviour in a repertoire as soon as the skill has been learned sufficiently, which would be a crucial element for lifelong robot learning (Thrun & Mitchell, 1995). The early experiences of a behaviour could also be toned down by forgetting the oldest examples and focusing on the new demonstrations only, which would also allow the robot to adapt its behaviour when the constraints change substantially. A direction for further work is thus to explore this trade-off and to select an optimal learning rate function defining how the learning system should use the experience acquired through time.
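A forgetting factor of the kind discussed can be sketched as an exponentially weighted update; alpha is a hypothetical plasticity parameter, and this is not a reproduction of the thesis's incremental algorithms:

```python
def incremental_mean(old_mean, x_new, alpha):
    # Exponentially weighted update: recent demonstrations dominate and
    # the oldest examples are progressively forgotten
    return (1.0 - alpha) * old_mean + alpha * x_new

# A behaviour whose constraints changed: new demonstrations all sit at 1.0
mean = 0.0
for _ in range(100):
    mean = incremental_mean(mean, 1.0, alpha=0.2)
```

With a fixed alpha the estimate tracks the recent demonstrations; letting alpha decay toward zero would instead freeze the behaviour once it has been learned sufficiently.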


7 Conclusion

Throughout this thesis, we explored the use and combination of different machine learning techniques in a Robot Programming by Demonstration (RbD) framework to probabilistically encode a task represented as a trace of sensory variables collected by the robot. By encapsulating the local properties of these multivariate trajectories in a probabilistic model, we showed that this representation of the data satisfies the recognition, generalization and reproduction properties required by the RbD process, and demonstrated how the model could extract constraints in a continuous and probabilistic form, which allowed multiple constraints to be combined for the reproduction of the task.

In Chapter 1, we presented an overview of the different issues tackled by RbD, and how these issues were influenced by the introduction of humanoid robots in robotics. We then highlighted the importance and novelty of our approach, and situated it within current research.

In Chapter 2, we described the different models that we used during this PhD, starting from a Hidden Markov Model (HMM) representation of the skill. We then described different methods to retrieve data using this representation, and showed how this led us to the joint use of Gaussian Mixture Model (GMM) and Gaussian Mixture Regression (GMR) to encode, recognize and reproduce a skill. The principal advantage over HMM is to represent the constraints continuously along the trajectory, which also allows several constraints to be combined and a smooth generalized trajectory to be retrieved for the reproduction of the skill. We then showed how this GMM/GMR model could be incrementally trained, and proposed different regularization and pre-processing methods for the particular use of these statistical algorithms in an RbD framework. We concluded the chapter by showing the technical advantages of the GMM/GMR approach adopted in the latest RbD framework developed over the initial attempts at reproducing a skill through the use of an HMM representation.

In Chapters 2, 4 and 5, we presented several experiments to test the different algorithms of the framework and to illustrate its use through various HRI applications, gradually increasing the challenges tackled by the proposed RbD framework. Indeed, we first started in Section 5.2 by describing one of our first experiments using HMM to show how to reproduce a skill using a single controller (by selecting either a direct controller or an inverse kinematics controller). Then, the problem of combining the different controllers was presented


in Section 5.3. Finally, Section 5.4 presented the most up-to-date experiments considering an incremental learning of the skill, multiple constraints on different objects, and the use of different modalities to convey the demonstrations.

In Chapter 6, we discussed several issues concerning the teaching methods employed throughout this thesis, from a technical perspective (use of motion sensors and kinesthetic teaching) and from a social perspective (use of active teaching methods in a human-robot interactive scenario). We then discussed different cases of failure of the current system, and finally proposed several directions for further research from various perspectives, noting that some of these propositions are currently ongoing work.

This thesis contributed to RbD by proposing a generic probabilistic framework to deal with recognition, generalization, reproduction and evaluation issues, and more specifically with the automatic extraction of task constraints and with the determination of a controller satisfying several constraints simultaneously to reproduce the skill in a different context. The generality of the approach was demonstrated through various experiments and through collaborative work showing that the suggested probabilistic framework can be used jointly with other approaches. It also contributed to Human-Robot Interaction (HRI) by proposing active teaching methods that put the human teacher in the loop of the robot's learning, by incrementally learning a skill through scaffolding methods and by using different modalities to demonstrate gestures.


A Additional results

A.1 Additional results for the search of an optimal latent space for HMM


Figure A.1: Decomposition by PCA and reconstruction of the trajectories for drawing the alphabet letter C. The training data with additional synthetic noise nt, rt, ns, rs = 100%, 100%, 100%, 100% are represented as thin lines. Superimposed on those, we show the reconstructed trajectories (thick lines). First column: Joint angle trajectories. Second column: Hand paths. Third column: 2 components resulting from PCA for the joint angle trajectories. Fourth column: 2 components resulting from PCA for the hand paths.

This appendix presents additional results of the experiment presented in Section 4.2. Figures A.1-A.3 show examples of the encoding and reproduction results when applying PCA preprocessing. We observe that the signals extracted by PCA and ICA (Figure 4.4) present many similarities. As expected, the principal and independent components for both joint angle trajectories and hand paths bear the same qualitative characteristics, highlighting the correlations between the two datasets.



Figure A.2: Decomposition by PCA and reconstruction of the trajectories for the waving gesture. The training data with additional synthetic noise nt, rt, ns, rs = 10%, 20%, 10%, 20% are represented as thin lines. Superimposed on those, we show the reconstructed trajectories (thick lines). First column: Joint angle trajectories. Second column: Hand paths. Third column: 2 components resulting from PCA for the joint angle trajectories. Fourth column: 2 components resulting from PCA for the hand paths.

A.2 Additional results for the search of an optimal latent space for GMM

This appendix presents additional results of the experiment presented in Section 4.3 for Gestures 2-10 (Figures A.4-A.12).


Figure A.3: Decomposition by PCA and reconstruction of the trajectories for the knocking gesture. The training data with additional synthetic noise nt, rt, ns, rs = 100%, 100%, 100%, 100% are represented in thin line. Superimposed on those, we show the reconstructed trajectories (in thick line). First column: Joint angle trajectories. Second column: Hand paths. Third column: 2 components resulting from PCA for the joint angle trajectories. Fourth column: 2 components resulting from PCA for the hand paths.


Figure A.4: Encoding of Gesture 2 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Figure A.5: Encoding of Gesture 3 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Figure A.6: Encoding of Gesture 4 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Figure A.7: Encoding of Gesture 5 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Figure A.8: Encoding of Gesture 6 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Figure A.9: Encoding of Gesture 7 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Figure A.10: Encoding of Gesture 8 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Figure A.11: Encoding of Gesture 9 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


Figure A.12: Encoding of Gesture 10 in a latent space of motion. Top cell: Generalized trajectories represented in a latent space of motion by using a linear transformation defined by PCA (solid line), CCA (dotted line) and ICA (dash-dotted line). Bottom cell: The generalized trajectories from PCA, CCA and ICA are projected back in the original data space (they are superimposed).


A.3 Additional results for the optimal selection of the number of parameters in the GMM


Figure A.13: Segmentation of Gesture 2 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).


Figure A.14: Segmentation of Gesture 3 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).

This appendix presents additional results of the experiment presented in Section 4.4 for Gestures 2-10 (Figures A.13-A.21).
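The curvature-based initialization illustrated in these figures can be sketched as follows, under simplifying assumptions: the curvature trajectory Ω is approximated here by the norm of the discrete second derivative, its local maxima define segment boundaries, and one Gaussian per segment is initialized from the segment statistics (to be refined by EM afterwards). The toy trajectory, threshold and gap parameter are invented for illustration.

```python
import numpy as np

def curvature_segments(xi, min_gap=5):
    """Split a trajectory xi (T x D) at local maxima of a curvature
    measure Omega, approximated here by the norm of the discrete
    second derivative."""
    omega = np.linalg.norm(np.diff(xi, n=2, axis=0), axis=1)
    thr = 0.5 * omega.max()
    peaks = [i for i in range(1, len(omega) - 1)
             if omega[i] >= thr
             and omega[i] >= omega[i - 1] and omega[i] > omega[i + 1]]
    cuts, last = [], -min_gap            # keep peaks min_gap samples apart
    for p in peaks:
        if p - last >= min_gap:
            cuts.append(p + 1)           # offset introduced by the double diff
            last = p
    bounds = [0] + cuts + [len(xi)]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]

def init_gmm_from_segments(xi, segments):
    """One Gaussian per segment (empirical mean and covariance),
    to be refined afterwards by EM."""
    params = []
    for a, b in segments:
        chunk = xi[a:b]
        mu = chunk.mean(axis=0)
        sigma = np.cov(chunk.T) + 1e-6 * np.eye(xi.shape[1])
        params.append((mu, sigma))
    return params

# Toy trajectory: two straight legs joined by a sharp corner
t = np.linspace(0.0, 1.0, 50)
xi = np.vstack([np.stack([t, t], axis=1),
                np.stack([1 + t, 1 - t], axis=1)])
segments = curvature_segments(xi)
gmm_init = init_gmm_from_segments(xi, segments)
print(segments)
```

On this toy trajectory the single curvature peak at the corner yields two segments, hence two initial Gaussians, which an EM pass would then refine as in the right columns of the figures.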


Figure A.15: Segmentation of Gesture 4 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).


Figure A.16: Segmentation of Gesture 5 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).


Figure A.17: Segmentation of Gesture 6 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).


Figure A.18: Segmentation of Gesture 7 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).


Figure A.19: Segmentation of Gesture 8 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).


Figure A.20: Segmentation of Gesture 9 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).


Figure A.21: Segmentation of Gesture 10 by the curvature-based method (left column), used to initialize the GMM parameters (right column). In the left column, the trajectory Ω (last row) containing curvature information is used to define local maxima to segment the trajectory ξ. In the right column, the GMM parameters initialized by the curvature segmentation (dashed line) are then updated by EM (solid line).


A.4 Additional results for the robustness evaluation of the incremental learning process


Figure A.22: Reproduction using Gaussian Mixture Regression (GMR) in a latent space of motion (Gestures 3-4), with models trained with the batch (B) and incremental training methods (IA and IB), representing respectively the direct update and generative method.

This appendix presents additional results of the experiment presented in Section 4.5, for Gestures 3-10 (Figures A.22-A.25).
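The reproductions in these figures rely on Gaussian Mixture Regression: a joint model over temporal and spatial values is conditioned on the temporal value to retrieve an expected spatial value. A minimal NumPy sketch is given below, with an invented two-component model standing in for the trained GMMs (this is not the model or data of the experiment):

```python
import numpy as np

def gmr(priors, means, covs, t_query, in_dim=1):
    """Condition a joint GMM over (t, x) on the input t and return the
    expected output E[x | t] (Gaussian Mixture Regression)."""
    K, D = means.shape
    inp, out = slice(0, in_dim), slice(in_dim, D)
    xs = []
    for t in np.atleast_1d(t_query):
        h = np.empty(K)
        cond = np.empty((K, D - in_dim))
        for k in range(K):
            mu_t, mu_x = means[k, inp], means[k, out]
            S_tt = covs[k][inp, inp]
            S_xt = covs[k][out, inp]
            diff = t - mu_t
            sol = np.linalg.solve(S_tt, diff)
            # Responsibility of component k for this input value
            h[k] = priors[k] * np.exp(-0.5 * diff @ sol) / \
                np.sqrt((2 * np.pi) ** in_dim * np.linalg.det(S_tt))
            # Conditional mean of component k
            cond[k] = mu_x + S_xt @ sol
        h /= h.sum()
        xs.append(h @ cond)
    return np.array(xs)

# Invented two-component joint model over (time, one latent dimension)
priors = np.array([0.5, 0.5])
means = np.array([[0.25, 0.0],
                  [0.75, 1.0]])
covs = np.array([np.diag([0.01, 0.05]),
                 np.diag([0.01, 0.05])])
x_hat = gmr(priors, means, covs, np.array([0.25, 0.75]))
print(x_hat)
```

Querying the model near each component's temporal center returns that component's spatial mean, which is how a smooth generalized trajectory is retrieved by sweeping t over the duration of the gesture.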



Figure A.23: Reproduction using Gaussian Mixture Regression (GMR) in a latent space of motion (Gestures 5-6), with models trained with the batch (B) and incremental training methods (IA and IB), representing respectively the direct update and generative method.



Figure A.24: Reproduction using Gaussian Mixture Regression (GMR) in a latent space of motion (Gestures 7-8), with models trained with the batch (B) and incremental training methods (IA and IB), representing respectively the direct update and generative method.



Figure A.25: Reproduction using Gaussian Mixture Regression (GMR) in a latent space of motion (Gestures 9-10), with models trained with the batch (B) and incremental training methods (IA and IB), representing respectively the direct update and generative method.


B Publications of the author

This appendix lists the publications of the author with a brief comment on how they are connected to this thesis. The publications are listed according to the date of their first submission.

Calinon, S. and Billard, A. (2003). PDA interface for humanoid robots. In Proceedings of the IEEE International Conference on Humanoid Robots (Humanoids).

This paper presented one of our early works on the use of vision to let a robot imitate basic movements of the user's head and arms, and the use of speech recognition to associate a continuous movement with a high-level description of the behaviour (represented as keywords). The keywords could define either which body part was used (e.g., head, arm) or which affordances this body part has with the body (e.g., left, right). The keywords could also represent the effectivities required to perform the behaviour correctly (e.g., move up, move down). Using a simple associative memory (a 1-layer Artificial Neural Network trained by Hebbian learning), the robot was then able to associate and combine the high-level labels with the corresponding features. By observing multiple demonstrations, the robot thus statistically extracted the relevant features corresponding to the keywords, and was then able to reproduce and combine the corresponding features by activation and association of the corresponding keywords. The implementation used a Personal Digital Assistant (PDA) or Pocket-PC with a built-in camera and microphone to process multimodal information (vision and speech) on-board. The aim was to provide the small-sized humanoid robot Robota (Billard, 2003) with simple imitation and learning capabilities by attaching the PDA to the robot's belly, making it fully autonomous in terms of processing and power supply.
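The associative memory described above can be illustrated with a minimal sketch: binary keyword units and feature units linked by a Hebbian outer-product weight matrix. The codes and sizes below are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical binary codes; rows are (keyword, feature) training pairs
keywords = np.array([[1, 0, 0],    # e.g. "head"
                     [0, 1, 0],    # e.g. "left arm"
                     [0, 0, 1]])   # e.g. "right arm"
features = np.array([[1, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1]])

# Hebbian learning: accumulate outer products of co-active units
W = np.zeros((3, 4))
for k, f in zip(keywords, features):
    W += np.outer(k, f)

# Recall: activating a keyword retrieves the associated feature pattern
recalled = (keywords[0] @ W > 0).astype(int)
print(recalled)
```

Activating the first keyword retrieves the feature pattern it co-occurred with during training, which is the mechanism by which labels and movement features were combined.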

Billard, A., Epars, Y., Calinon, S., Cheng, G., and Schaal, S. (2004). Discovering optimal imitation strategies. Robotics and Autonomous Systems, 47(2-3):69-77.

This article developed a general policy for learning the relevant features of an imitation task, by defining a probabilistic metric of imitation as a weighted sum of dissimilarity measures. The metric was defined as a combination of hierarchical control strategies for different subsets of data (levels of imitation). One of these levels of imitation concerned the reproduction of joint angle trajectories similar to the ones observed during the demonstrations. This was tackled by using an HMM to encode the gestures and to retrieve a generalized version of the trajectories.
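The idea of a metric of imitation as a weighted sum of dissimilarity measures can be sketched as follows. This is only a hedged illustration: the weights, the mean-squared dissimilarity terms and the random data are assumptions, not the cost function of the article.

```python
import numpy as np

def imitation_cost(candidate, demo_joint, demo_hand, weights):
    """Weighted sum of dissimilarity measures between a candidate
    reproduction and the demonstration, at two levels of imitation
    (joint angles and hand path)."""
    d_joint = np.mean((candidate["joint"] - demo_joint) ** 2)
    d_hand = np.mean((candidate["hand"] - demo_hand) ** 2)
    return weights[0] * d_joint + weights[1] * d_hand

rng = np.random.default_rng(1)
demo_joint = rng.normal(size=(100, 4))   # demonstrated joint angles
demo_hand = rng.normal(size=(100, 3))    # demonstrated hand path
perfect = {"joint": demo_joint, "hand": demo_hand}
shifted = {"joint": demo_joint + 0.1, "hand": demo_hand + 0.1}
w = (0.7, 0.3)   # invented weights favouring the joint-angle level
print(imitation_cost(perfect, demo_joint, demo_hand, w),
      imitation_cost(shifted, demo_joint, demo_hand, w))
```

Minimizing such a cost over candidate reproductions selects the controller that best balances the different levels of imitation, with the weights encoding their relative importance.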

Calinon, S. and Billard, A. (2004). Stochastic gesture production and recognition model for a humanoid robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2769-2774.

This paper explored the use of HMMs to deal simultaneously with the encoding, recognition, generalization and reproduction issues required by RbD. The aim was to treat these different issues in a single probabilistic framework and to use the generalization capabilities of HMM to extract which were the relevant features to imitate by using the keypoints encoding approach presented in Section 2.4.3. This paper also suggested the use of an automatic segmentation algorithm to extract relevant trajectories from a continuous flow of gestures. An experiment using X-Sens motion sensors to acquire movement data was presented, consisting in drawing the alphabet letters A, B, C and D on a board using the DB robot to reproduce the skill.¹

Calinon, S. and Billard, A. (2007). Learning of gestures by imitation in a humanoid robot. In K. Dautenhahn and C.L. Nehaniv, editors, Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions, pages 153-177. Cambridge University Press.

This book chapter improved the previous approach based on HMM by en-

coding the task simultaneously in joint space (motor representation) and in task

space (visual representation), using the continuous encoding scheme presented

in Section 2.4.4 and the PCA pre-processing presented in Section 2.9.1. The cor-

respondence between the motor and visual representation of the task was then

performed in a latent space of motion through the hidden states of the HMM,

as discussed in Section 6.4. We showed in this book chapter that the proposed

computational framework shared similarities with works on imitation in devel-

opmental psychology and ethology. We also presented in this book chapter the

automatic selection of an optimal controller for the skill through a metric of

imitation. The metric was used to determine whether the joint angles or hand

paths were the most suitable representation for the task, similarly as in the

experiment presented in Section 5.2. In the case of joint angles, the generalized

1DB is an hydraulic anthropomorphic robot designed by the Kawato Dynamic Brain Re-search Project and by the Sarcos company, see, e.g., (Atkeson et al., 2000).

222

Page 235: lasa.epfl.chlasa.epfl.ch/publications/uploadedFiles/EPFL_TH3814.pdf · POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES PAR ingénieur en microtechnique diplômé EPF de nationalité

trajectories were directly replayed on the robot. In the case of hand paths, an inverse kinematics algorithm was used to compute the joint angles corresponding to the hand paths using an energy minimization criterion. We presented an experiment with HOAP-2 using X-Sens motion sensors to acquire joint angle trajectories (Section 3.4), and using an external stereoscopic vision system to acquire hand paths (Section 3.3). The tasks considered were a waving gesture, a knocking gesture, a drinking gesture and writing different alphabet letters on a board, similarly to the gestures used in the experiment presented in Section 4.2.

Calinon, S. and Billard, A. (2005). Recognition and reproduction of

gestures using a probabilistic framework combining PCA, ICA and

HMM. In Proceedings of the International Conference on Machine

Learning (ICML), pages 105–112.

This paper contrasted the use of PCA and ICA to pre-process gestures for further encoding in HMM, and evaluated the recognition and reproduction performance of the model, as presented in Section 4.2. The paper also discussed the stability of the proposed learning system.

Billard, A., Calinon, S., and Guenter, F. (2006). Discriminative and

adaptive imitation in uni-manual and bi-manual tasks. Robotics and

Autonomous Systems, 54(5):370–384.

This article improved the previous framework by finding an optimal control strategy using simultaneously the joint angle and hand path representations relative to an object. The controller was found by minimizing a metric of imitation balancing the reproduction of either joint angles or hand paths, depending on the variability observed during the demonstrations. We presented in this article experiments with HOAP-2 using kinesthetic teaching to acquire joint angle trajectories, and an external stereoscopic vision system to track the initial position of the object. The dataset consisted of gestures collected while "stirring the fondue", "hammering a nail", and "grabbing a ball with two hands".

Calinon, S., Epiney, J., and Billard, A. (2005). A humanoid robot

drawing human portraits. In IEEE-RAS International Conference

on Humanoid Robots, pages 161–166.

This paper presented an entertaining HRI application developed to demonstrate the motor capabilities of the robot and to explore the use of different sensory modalities to interact with it: a scenario where a humanoid robot draws the portrait of the user through an interactive process involving speech, vision and simple communication gestures.


Calinon, S., Guenter, F., and Billard, A. (2005). Goal-directed imita-

tion in a humanoid robot. In Proceedings of the IEEE International

Conference on Robotics and Automation (ICRA), pages 299–304.

This paper presented a more detailed version of the experiments in Section 5.2 (the dots experiment), highlighting the advantages of including prior knowledge in a goal-directed imitation learning framework.

Calinon, S., Guenter, F., and Billard, A. (2006). On learning the statistical representation of a task and generalizing it to various contexts.

In Proceedings of the IEEE International Conference on Robotics and

Automation (ICRA), pages 2978–2983.

This paper presented an early version of the experiments in Section 5.3 (the Chess Task and Sugar Task), where the data were processed in the original data space instead of in a latent space of motion.

Calinon, S., Guenter, F., and Billard, A. (2007). On learning, representing and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man and Cybernetics, Part B. Special issue on

robot learning by observation, demonstration and imitation, 37(2),

pages 286–298.

This article presented a more detailed version of the experiments introduced

in Section 5.3 (the Bucket Task, Chess Task and Sugar Task).

Calinon, S. and Billard, A. (2006). Teaching a humanoid robot to

recognize and reproduce social cues. In Proceedings of the IEEE

International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 346–351.

This paper explored the use of simple communicative gestures and attentional scaffolding in an RbD framework, where gazing and pointing directions were estimated probabilistically through the use of motion sensors, and where simple gestures (learned by HMM through an imitation game) were used to guide a teaching scenario and provide feedback to the robot. This paper also suggested that the probabilistic representation (using a Gaussian distribution) of attentional scaffolding could be used in further work as priors in the RbD framework proposed in this thesis (see also the discussion in Section 6.8.3).

Calinon, S. and Billard, A. (2007). Incremental learning of gestures

by imitation in a humanoid robot. In Proceedings of the ACM/IEEE

International Conference on Human-Robot Interaction (HRI).


This paper introduced the two incremental learning strategies presented in

Section 2.8.3, and tested both algorithms as presented in Section 4.5 (Basketball

gestures).

Hersch, M., Guenter, F., Calinon, S., and Billard, A. (2006). Learning dynamical system modulation for constrained reaching tasks. In

Proceedings of the IEEE-RAS International Conference on Humanoid

Robots (Humanoids), pages 444–449.

This paper presented the collaborative work discussed in Section 6.7.1. It

introduced the use of a GMM/GMR representation to encode velocity profiles

demonstrated by a human user, where a generalized version of the velocity

profiles was used to modulate a dynamical system. This joint work allowed the robot to learn a skill by imitation and to deal with perturbations during the reproduction of the skill.

Calinon, S. and Billard, A. (2007). What is the teacher’s role in

robot programming by demonstration? - Toward benchmarks for improved learning. Interaction Studies. Special Issue on Psychological Benchmarks in Human-Robot Interaction, 8(3).

This article presented the first two experiments in Section 5.4 (the die task and the stacking object task). We highlighted the active role of the user in the teaching process, and presented several insights from other fields of research (Section 6.2) that led us to the design of the current RbD setup. The article also discussed the further evaluation of the current setup within the COGNIRON consortium, and the importance of considering the pedagogical skills of the human user when evaluating an RbD framework.

Calinon, S. and Billard, A. (2007). Active teaching in robot programming by demonstration. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN).

This paper discussed the active role of the user in an RbD paradigm using

an incremental teaching scenario, and presented a more detailed version of the

Chess experiment in Section 5.4.

Guenter, F., Hersch, M., Calinon, S., and Billard, A. (2007). Reinforcement learning for imitating constrained reaching movements.

Advanced Robotics, Special Issue on Imitative Robots. In press.

This article presented the collaborative work discussed in Section 6.7.2, which proposed to link the framework presented in this thesis to a Reinforcement Learning approach. The experiments demonstrated that combining these two learning approaches allows the robot to learn by imitation and to extend the learned skill through self-experimentation.


C Source codes available on the internet

The source codes for the most important algorithms presented in this thesis

are available for download on the Learning Algorithms and Systems Laboratory

(LASA) website at the address:

http://lasa.epfl.ch/sourcecode/

Most of these source codes were made available online simultaneously with the publication of the corresponding papers. The principal aim was to allow interested readers to run and modify several examples and to test the proposed algorithms with their own datasets. The second aim was to gather as much feedback as possible for further improvement of the proposed learning system. Together with the release of the source codes, the use of commercial products for the robots and sensory hardware was also intended to give other researchers the opportunity to reproduce the experiments presented throughout this thesis and to build new applications upon them. The source codes and executable functions are listed below.

GMM/GMR v1.0

Demo1

Demonstration of the generalization process using Gaussian Mixture Regression (GMR). This software loads a 3D dataset, trains a Gaussian Mixture Model (GMM), and retrieves a generalized version of the dataset with associated constraints through GMR. Each datapoint has 3 dimensions, consisting of 1 temporal value and 2 spatial values (e.g., drawing on a 2D Cartesian plane). Each temporal value is used as a query point to retrieve a sequence of expected spatial distributions through GMR.
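The regression step behind this demo can be sketched as follows: fit a GMM on joint (t, x) datapoints, then condition each Gaussian on the temporal query value and mix the conditionals. This is a minimal Python/NumPy illustration on made-up toy data, not the distributed source code; all names and values are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy dataset: 200 datapoints (t, x), one temporal and one spatial value.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * t) + 0.05 * rng.standard_normal(200)
gmm = GaussianMixture(n_components=4, random_state=0).fit(np.column_stack([t, x]))

def gmr(gmm, t_query):
    """Expected spatial distribution (mean, variance) of x given t."""
    w, means, covs = gmm.weights_, gmm.means_, gmm.covariances_
    # Responsibility of each Gaussian component for the query point t
    h = np.array([wk * np.exp(-0.5 * (t_query - m[0]) ** 2 / c[0, 0]) / np.sqrt(c[0, 0])
                  for wk, m, c in zip(w, means, covs)])
    h /= h.sum()
    # Per-component conditional Gaussian of x given t
    mu = np.array([m[1] + c[1, 0] / c[0, 0] * (t_query - m[0])
                   for m, c in zip(means, covs)])
    var = np.array([c[1, 1] - c[1, 0] ** 2 / c[0, 0] for c in covs])
    mean = h @ mu
    return mean, h @ (var + mu ** 2) - mean ** 2
```

Querying `gmr(gmm, t)` for increasing t traces out the generalized trajectory, with the returned variance playing the role of the extracted constraint.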

Demo2

Demonstration of the use of Gaussian Mixture Regression (GMR) using spatial

variables of arbitrary dimensions as query points. This software loads a 4D

dataset of spatial variables, trains a Gaussian Mixture Model (GMM), and uses

query points of 2 dimensions to retrieve a generalized version of the data for

the remaining 2 dimensions, with associated constraints, through GMR. Each


datapoint has 4 dimensions, consisting of 2 × 2 spatial values (e.g., drawing

on a 2D Cartesian plane simultaneously with the right and left hands). A new

sequence of 2D spatial values (i.e., data for left hand) is loaded and used as

query points to retrieve a sequence of expected spatial distributions for the

remaining dimensions (i.e., data for right hand).

Demo3

Demonstration of the smooth transition properties of data retrieved by Gaussian Mixture Regression (GMR). This software loads two 3D datasets, trains two separate Gaussian Mixture Models (GMMs), and retrieves a generalized version of the two datasets concatenated in time, with associated constraints, through GMR. Each datapoint has 3 dimensions, consisting of 1 temporal value and 2 spatial values (e.g., drawing on a 2D Cartesian plane). A sequence of temporal values is used as query points to retrieve a sequence of expected spatial distributions through GMR. Even if the position of the last datapoint of the first dataset does not coincide with that of the first datapoint of the second dataset, the regression using a concatenation of the components of the two GMMs in a single model produces a trajectory with a smooth transition between the two datasets.

GMM latent space v1.0

Demo1

This software loads a dataset consisting of several examples of trajectories, and finds a latent space of lower dimensionality encapsulating the important characteristics of these trajectories using Principal Component Analysis (PCA). The projected dataset is then encoded in a Gaussian Mixture Model (GMM), and the components of this GMM are re-projected into the original data space.
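The projection and re-projection steps can be sketched as below: means go through the PCA inverse transform, and each latent covariance is mapped back through the PCA basis W as Sigma_x = W Sigma_z W^T. A Python sketch on hypothetical toy data, not the distributed source code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Toy 3-D datapoints that actually lie close to a 2-D plane.
rng = np.random.default_rng(1)
latent = rng.standard_normal((100, 2))
mixing = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])  # hypothetical
X = latent @ mixing.T + 0.01 * rng.standard_normal((100, 3))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                               # latent-space dataset
gmm = GaussianMixture(n_components=3, random_state=0).fit(Z)

# Re-project the GMM components into the original data space.
W = pca.components_.T                              # PCA basis, shape (3, 2)
means_orig = pca.inverse_transform(gmm.means_)
covs_orig = np.array([W @ S @ W.T for S in gmm.covariances_])
```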

GMM/GMR multiple constraints v1.0

Demo1

Demonstration of the reproduction of a generalized trajectory through Gaussian Mixture Regression (GMR), when considering two independent constraints represented separately in two Gaussian Mixture Models (GMMs). Through regression, a smooth generalized trajectory satisfying the constraints encapsulated in both GMMs is extracted, with associated constraints represented as covariance matrices.
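At a single time step, combining two Gaussian constraints amounts to taking the product of the two Gaussians, i.e., a precision-weighted average of their means. A self-contained sketch with made-up toy values (not the distributed source code):

```python
import numpy as np

def gaussian_product(mu1, S1, mu2, S2):
    """Mean and covariance of the (unnormalized) product N(mu1,S1)*N(mu2,S2)."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    S = np.linalg.inv(P1 + P2)
    return S @ (P1 @ mu1 + P2 @ mu2), S

# Constraint 1 is tight in x and loose in y; constraint 2 the other way round.
mu1, S1 = np.array([0.0, 0.0]), np.diag([0.01, 1.0])
mu2, S2 = np.array([1.0, 1.0]), np.diag([1.0, 0.01])
mu, S = gaussian_product(mu1, S1, mu2, S2)
# The result is pulled toward x of the first constraint and y of the second.
```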


GMM incremental learning v1.0

Demo1

Demonstration of an incremental learning process for a Gaussian Mixture Model (GMM), where the update mechanism only uses the latest observed trajectory to update the model (no historical data is used). The dataset consists of several trajectories presented one by one to update the GMM parameters using an incremental version of the Expectation-Maximization (EM) algorithm (direct update method).

Demo2

Demonstration of an incremental learning process for a Gaussian Mixture Model (GMM), where the update mechanism only uses the latest observed trajectory to update the model (no historical data is used). The dataset consists of several trajectories presented one by one. The GMM parameters are updated by stochastically generating a new dataset from the current model, adding the new trajectory to this dataset, and re-estimating the GMM parameters on the resulting dataset through a standard Expectation-Maximization (EM) algorithm (generative method).
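The generative method can be sketched as follows: replace the historical data by samples drawn from the current model, append the new trajectory, and refit with standard EM. A Python sketch on 1-D toy data with hypothetical names and values, not the distributed source code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Current model, trained once on an initial batch of observations.
initial = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
gmm = GaussianMixture(n_components=2, random_state=0).fit(initial.reshape(-1, 1))

def generative_update(gmm, new_traj, n_samples=400):
    """Stand in for the stored history with samples from the current model."""
    replayed, _ = gmm.sample(n_samples)
    data = np.vstack([replayed, np.asarray(new_traj).reshape(-1, 1)])
    return GaussianMixture(n_components=gmm.n_components, random_state=0).fit(data)

# Only the latest observed trajectory is needed for the update.
gmm = generative_update(gmm, rng.normal(5.5, 1.0, 100))
```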

Cone-plane intersection v1.0

Demo1

Demonstration of a cone-plane intersection to extract attentional scaffolding information (gaze or pointing direction), where the intersection is represented as a Gaussian distribution defined by its center and associated covariance matrix. The algorithm can then be used to probabilistically extract information concerning gazing or pointing direction. By representing the visual field as a cone and the table as a plane, the Gaussian distribution can be used to estimate whether an object on the table is being observed or pointed at by the user.
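The closed-form cone-plane intersection of the demo is geometry-heavy, but the same idea can be illustrated with a Monte-Carlo sketch: sample gaze rays in a cone around a central direction, intersect each with the table plane z = 0, and fit a Gaussian to the hit points. Head pose and cone aperture below are made up for illustration; this is not the distributed source code.

```python
import numpy as np

rng = np.random.default_rng(3)
origin = np.array([0.0, 0.0, 1.0])        # head position, 1 m above the table
center = np.array([0.5, 0.0, -1.0])       # central gaze direction
center /= np.linalg.norm(center)

hits = []
for _ in range(2000):
    d = center + 0.1 * rng.standard_normal(3)   # ray inside a narrow cone
    d /= np.linalg.norm(d)
    if d[2] < 0.0:                              # pointing toward the plane z = 0
        s = -origin[2] / d[2]                   # solve origin_z + s * d_z = 0
        hits.append((origin + s * d)[:2])
hits = np.array(hits)

# Gaussian approximation of the cone-plane intersection on the table
mu, Sigma = hits.mean(axis=0), np.cov(hits.T)
```

An object at table position p can then be scored by its Mahalanobis distance to mu under Sigma to decide whether it is being looked at or pointed at.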


References

Aleotti, J., & Caselli, S. (2005, August). Trajectory clustering and stochastic approximation for robot programming by demonstration. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS).

Aleotti, J., & Caselli, S. (2006a, May). Grasp recognition in virtual reality for robot pregrasp planning by demonstration. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 2801–2806).

Aleotti, J., & Caselli, S. (2006b). Robust trajectory learning and approximation for robot programming by demonstration. Robotics and Autonomous Systems, 54(5), 409–413.

Aleotti, J., Caselli, S., & Reggiani, M. (2004). Leveraging on a virtual environment for robot programming by demonstration. Robotics and Autonomous Systems, 47(2-3), 153–161.

Alissandrakis, A., Nehaniv, C., & Dautenhahn, K. (2002). Imitation with ALICE: Learning to imitate corresponding actions across dissimilar embodiments. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 32(4), 482–496.

Alissandrakis, A., Nehaniv, C., & Dautenhahn, K. (2007). Correspondence mapping induced state and action metrics for robotic imitation. IEEE Transactions on Systems, Man and Cybernetics, Part B. Special issue on robot learning by observation, demonstration and imitation, 37(2), 299–307.

Alissandrakis, A., Nehaniv, C., Dautenhahn, K., & Saunders, J. (2005). An approach for programming robots by demonstration: Generalization across different initial configurations of manipulated objects. In Proceedings of the IEEE international symposium on computational intelligence in robotics and automation (pp. 61–66).

Arandjelovic, O., & Cipolla, R. (2006). Incremental learning of temporally-coherent Gaussian mixture models. Technical Papers - Society of Manufacturing Engineers (SME).

Arbib, M., Billard, A., Iacoboni, M., & Oztop, E. (2000). Special issue on synthetic brain imaging: Grasping, mirror neurons and imitation. Neural Networks, 13, 975–997.

Ardizzone, E., Chella, A., & Pirrone, R. (2000, July). Pose classification using support vector machines. In Proceedings of the IEEE-INNS-ENNS international joint conference on neural networks (IJCNN) (Vol. 6, pp. 317–322).

Arimoto, S., Hashiguchi, H., Sekimoto, M., & Ozawa, R. (2005). Generation of natural motions for redundant multi-joint systems: A differential-geometric approach based upon the principle of least actions. Journal of Robotic Systems, 22(11), 583–605.

Asfour, T., Gyarfas, F., Azad, P., & Dillmann, R. (2006, December). Imitation learning of dual-arm manipulation tasks in humanoid robots. In Proceedings of the IEEE-RAS international conference on humanoid robots (Humanoids).

Atkeson, C. (1990). Using local models to control movement. In Advances in neural information processing systems (Vol. 2, pp. 316–323). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Atkeson, C., Hale, J., Pollick, F., Riley, M., Kotosaka, S., Schaal, S., et al. (2000). Using humanoid robots to study human behavior. IEEE Intelligent Systems, 15(4), 46–56.

Atkeson, C., Moore, A., & Schaal, S. (1997). Locally weighted learning for control. Artificial Intelligence Review, 11(1-5), 75–113.

Atkeson, C., & Schaal, S. (1997). Learning tasks from a single demonstration. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (Vol. 2).

Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., & MacIntyre, B. (2001). Recent advances in augmented reality. IEEE Computer Graphics and Applications, 21(6), 34–47.

Bejczy, A., Kim, W., & Venema, S. (1990, May). The phantom robot: predictive displays for teleoperation with time delay. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 546–551).

Bekkering, H., Wohlschlaeger, A., & Gattis, M. (2000). Imitation of gestures in children is goal-directed. Quarterly Journal of Experimental Psychology, 53A(1), 153–164.

Bentivegna, D., Atkeson, C., & Cheng, G. (2004). Learning tasks from observation and practice. Robotics and Autonomous Systems, 47(2-3), 163–169.

Billard, A. (2003). Robota: Clever toy and educational tool. Robotics and Autonomous Systems, 42, 259–269.

Billard, A., Calinon, S., & Guenter, F. (2006). Discriminative and adaptive imitation in uni-manual and bi-manual tasks. Robotics and Autonomous Systems, 54(5), 370–384.

Billard, A., & Dillmann, R. (2006). Special issue on the social mechanisms of robot programming by demonstration. Robotics and Autonomous Systems, 54(5), 351–352.

Billard, A., & Schaal, S. (2006). Special issue on the brain mechanisms of imitation learning. Neural Networks, 19(3).

Bilmes, J. (1998). A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report, University of Berkeley, ICSI-TR-97-021.

Borga, M. (1998). Learning multidimensional signal processing. PhD thesis, Linköping University, Sweden.

Bradski, G. (1998, October). Real time face and object tracking as a component of a perceptual user interface. In Proceedings of the IEEE workshop on applications of computer vision (WACV) (pp. 214–219).

Brand, M., & Hertzmann, A. (2000, July). Style machines. In Proceedings of the ACM international conference on computer graphics and interactive techniques (SIGGRAPH) (pp. 183–192).

Breazeal, C., & Aryananda, L. (2002). Recognition of affective communicative intent in robot-directed speech. Autonomous Robots, 12(1), 83–104.

Breazeal, C., Brooks, A., Gray, J., Hoffman, G., Kidd, C., Lee, H., et al. (2004). Tutelage and collaboration for humanoid robots. Humanoid Robots, 1(2), 315–348.

Breazeal, C., Buchsbaum, D., Gray, J., Gatenby, D., & Blumberg, B. (2005). Learning from and about others: Towards using imitation to bootstrap the social understanding of others by robots. Artificial Life, 11(1-2).

Brethes, L., Menezes, P., Lerasle, F., & Hayet, J. (2004, April). Face tracking and hand gesture recognition for human-robot interaction. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (Vol. 2, pp. 1901–1906).

Burdea, G. (1999). Invited review: the synergy between virtual reality and robotics. IEEE Transactions on Robotics and Automation, 15(3), 400–410.

Byrne, R. (1999). Imitation without intentionality. Using string parsing to copy the organization of behaviour. Animal Cognition, 2, 63–72.

Calinon, S., & Billard, A. (2004, September). Stochastic gesture production and recognition model for a humanoid robot. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 2769–2774).

Calinon, S., & Billard, A. (2005, August). Recognition and reproduction of gestures using a probabilistic framework combining PCA, ICA and HMM. In Proceedings of the international conference on machine learning (ICML) (pp. 105–112).

Calinon, S., & Billard, A. (2006, September). Teaching a humanoid robot to recognize and reproduce social cues. In Proceedings of the IEEE international symposium on robot and human interactive communication (RO-MAN) (pp. 346–351).

Calinon, S., & Billard, A. (2007a, August). Active teaching in robot programming by demonstration. In Proceedings of the IEEE international symposium on robot and human interactive communication (RO-MAN). (In press)

Calinon, S., & Billard, A. (2007b, March). Incremental learning of gestures by imitation in a humanoid robot. In Proceedings of the ACM/IEEE international conference on Human-Robot Interaction (HRI) (pp. 255–262).

Calinon, S., & Billard, A. (2007c). Learning of gestures by imitation in a humanoid robot. In Imitation and social learning in robots, humans and animals: Behavioural, social and communicative dimensions (K. Dautenhahn and C.L. Nehaniv ed., pp. 153–177). Cambridge University Press.

Calinon, S., & Billard, A. (2007d). What is the teacher's role in robot programming by demonstration? - Toward benchmarks for improved learning. Interaction Studies. Special Issue on Psychological Benchmarks in Human-Robot Interaction, 8(3).

Calinon, S., Epiney, J., & Billard, A. (2005, December). A humanoid robot drawing human portraits. In Proceedings of the IEEE-RAS international conference on humanoid robots (Humanoids) (pp. 161–166).

Calinon, S., Guenter, F., & Billard, A. (2005, April). Goal-directed imitation in a humanoid robot. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 299–304).

Cannon, D., & Thomas, G. (1997). Virtual tools for supervisory and collaborative control of robots. Presence: Teleoperators and Virtual Environments, 6(1), 1–28.

Carreira-Perpinan, M. (2002). Continuous latent variable models for dimensionality reduction and sequential data reconstruction. PhD thesis, Dept. of Computer Science, University of Sheffield, UK.

Caurin, G., Albuquerque, A., & Mirandola, A. (2004, September-October). Manipulation strategy for an anthropomorphic robotic hand. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1656–1661).

Chalodhorn, R., Grimes, D., Maganis, G., Rao, R., & Asada, M. (2006, May). Learning humanoid motion dynamics through sensory-motor mapping in reduced dimensional spaces. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 3693–3698).

Chestnutt, J., Michel, P., Nishiwaki, K., Kuffner, J., & Kagami, S. (2006, May). An intelligent joystick for biped control. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 860–865).

Chiu, C., Chao, S., Wu, M., & Yang, S. (2004). Content-based retrieval for human motion data. Visual Communication and Image Representation, 15, 446–466.

Cleveland, W. (1979). Robust locally weighted regression and smoothing scatterplots. American Statistical Association, 74(368), 829–836.

Cleveland, W., & Loader, C. (1996). Smoothing by local regression: Principles and methods. In W. Haerdle & M. G. Schimek (Eds.), Statistical theory and computational aspects of smoothing (pp. 10–49). New York: Springer.

Cohn, D., Ghahramani, Z., & Jordan, M. (1996). Active learning with statistical models. Artificial Intelligence Research, 4, 129–145.

Custers, E., Regehr, G., McCulloch, W., Peniston, C., & Reznick, R. (1999). The effects of modeling on learning a simple surgical procedure: See one, do one or see many, do one? Advances in Health Sciences Education, 4(2), 123–143.

Cypher, A. (1993). Watch what I do: Programming by demonstration (A. Cypher, Ed.). The MIT Press.

Dasgupta, S., & Schulman, L. (2000). A two-round variant of EM for Gaussian mixtures. In Proceedings of the conference on uncertainty in artificial intelligence (UAI) (pp. 152–159).

Dautenhahn, K. (1997). I could be you - the phenomenological dimension of social understanding. Cybernetics and Systems, Special Issue on Epistemological Aspects of Embodied AI, 28(5), 417–453.

Dautenhahn, K. (2003). Roles and functions of robots in human society: Implications from research in autism therapy. Robotica, 21(4), 443–452.

Decety, J., Chaminade, T., Grezes, J., & Meltzoff, A. (2002). A PET exploration of the neural mechanisms involved in reciprocal imitation. Neuroimage, 15, 265–272.

Delson, N., & West, H. (1996). Robot programming by human demonstration: Adaptation and inconsistency in constrained motion. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 30–36).

Demiris, Y., & Hayes, G. (2002). Imitation as a dual route process featuring predictive and learning components: a biologically-plausible computational model. In K. Dautenhahn & C. Nehaniv (Eds.), Imitation in animals and artifacts (pp. 327–361). MIT Press.

Demiris, Y., & Khadhouri, B. (2006). Hierarchical attentive multiple models for execution and recognition of actions. Robotics and Autonomous Systems, Special Issue on the Social Mechanisms of Robot Programming from Demonstration, 54(5), 361–369.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1), 1–38.

Dillmann, R. (2004). Teaching and learning of robot tasks via observation of human performance. Robotics and Autonomous Systems, 47(2-3), 109–116.

Dominey, P., Alvarez, M., Gao, B., Jeambrun, M., Cheylus, A., Weitzenfeld, A., et al. (2005, December). Robot command, interrogation and teaching via social interaction. In Proceedings of the IEEE-RAS international conference on humanoid robots (Humanoids) (pp. 475–480).

Dubach, C. (2004). Stereoscopic vision and trajectory recognition. Technical report, Autonomous Systems Lab, Ecole Polytechnique Federale de Lausanne.

Dufay, B., & Latombe, J.-C. (1984). An approach to automatic robot programming based on inductive learning. The International Journal of Robotics Research, 3(4), 3–20.

Ekvall, S., & Kragic, D. (2005, April). Grasp recognition for programming by demonstration. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 748–753).

Ekvall, S., & Kragic, D. (2006, September). Learning task models from multiple human demonstrations. In Proceedings of the IEEE international symposium on robot and human interactive communication (RO-MAN) (pp. 358–363).

FIBA Central Board. (2006). The official basketball rules. International Basketball Federation (FIBA).

Fodor, I. (2002). A survey of dimension reduction techniques. Technical Report, Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, UCRL-ID-148494.

Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.

Friedman, J., Jacobson, M., & Stuetzle, W. (1981). Projection pursuit regression. American Statistical Association, 76, 817–823.

Friedrich, H., Muench, S., Dillmann, R., Bocionek, S., & Sassin, M. (1996). Robot programming by demonstration (RPD): Supporting the induction by human interaction. Machine Learning, 23(2), 163–189.

Fritsch, F., & Carlson, R. (1980). Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2), 238–246.

Fujie, S., Ejiri, Y., Nakajima, K., Matsusaka, Y., & Kobayashi, T. (2004, September). A conversation robot using head gesture recognition as para-linguistic information. In IEEE international workshop on robot and human interactive communication (RO-MAN) (pp. 159–164).

Furui, S. (1986). Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1), 52–59.

Gams, A., & Lenarcic, J. (2006). Human arm kinematic modeling and trajectory generation. In IEEE/RAS-EMBS international conference on biomedical robotics and biomechatronics (BioRob).

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.

Gergely, G., & Csibra, G. (2005). The social construction of the cultural mind: Imitative learning as a mechanism of human pedagogy. Interaction Studies, 6, 463–481.

Ghahramani, Z., & Jordan, M. (1994). Supervised learning from incomplete data via an EM approach. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems (Vol. 6, pp. 120–127). Morgan Kaufmann Publishers, Inc.

Gilloux, M. (1994). Hidden Markov models in handwriting recognition. In S. Impedovo (Ed.), Fundamentals in handwriting recognition. Berlin, Heidelberg: Springer.

Goldenberg, E., & Feurzeig, W. (1987). Exploring language with LOGO. Cam-bridge, MA, USA: MIT Press.

Grochow, K., Martin, S., Hertzmann, A., & Popovic, Z. (2004). Style-basedinverse kinematics. In Proceedings of the ACM international conferenceon computer graphics and interactive techniques (SIGGRAPH) (pp. 522–531).

Gross, R., & Shi, J. (2001, June). The CMU motion of body (MoBo) database(Tech. Rep. No. CMU-RI-TR-01-18). Pittsburgh, PA: Robotics Institute,Carnegie Mellon University.

Guenter, F., Hersch, M., Calinon, S., & Billard, A. (2007). Reinforcement learn-ing for imitating constrained reaching movements. Advanced Robotics,Special Issue on Imitative Robots. (In press)

Gupta, A., & O’Malley, M. (2006, June). Design of a haptic arm exoskeleton fortraining and rehabilitation. IEEE/ASME Transactions on Mechatronics,11 (3), 280–289.

Hafner, V., & Kaplan, F. (2005). Learning to interpret pointing gestures:experiments with four-legged autonomous robots. In S. Wermter, G. Palm,& M. Elshaw (Eds.), Biomimetic neural learning for intelligent robots.intelligent systems, cognitive robotics, and neuroscience. Springer Verlag.

Heinzmann, J., & Zelinsky, A. (1999). A safe-control paradigm for human-robot interaction. Journal of Intelligent and Robotic Systems, 25 (4), 295–310.

Heise, R. (1989). Demonstration instead of programming: Focussing attention in robot task acquisition. PhD thesis, University of Calgary, Alberta, Canada.

Hersch, M., & Billard, A. (2006). A biologically-inspired model of reaching movements. In Proceedings of the IEEE/RAS-EMBS international conference on biomedical robotics and biomechatronics (BioRob).

Heyes, C. (2001). Causes and consequences of imitation. Trends in Cognitive Sciences, 5, 253–261.

Heyes, C., & Ray, E. (2000). What is the significance of imitation in animals? Advances in the Study of Behavior, 29, 215–245.

Hightower, J., & Borriello, G. (2001, August). Location systems for ubiquitous computing. IEEE Computer, 34 (8), 57–66.

Hodges, N., & Franks, I. (2002). Modelling coaching practice: the role of instruction and demonstration. Sports Sciences, 20 (10), 793–811.

Hollerbach, J., & Jacobsen, S. (1996, October). Anthropomorphic robots and human interactions. In Proceedings of the international symposium on humanoid robots (HURO) (pp. 83–91).

Horn, R., & Williams, A. (2004). Observational learning: Is it time we took another look? In A. Williams & N. Hodges (Eds.), Skill acquisition in sport: Research, theory and practice (pp. 175–206). London, UK: Routledge.

Hwang, Y., Choi, K., & Hong, D. (2006, May). Self-learning control of cooperative motion for a humanoid robot. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 475–480).

Hyvarinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10 (3), 626–634.

Iba, S., Weghe, J., Paredis, C., & Khosla, P. (1999). An architecture for gesture-based control of mobile robots. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (Vol. 2, pp. 851–857).

Ikeuchi, K., & Suchiro, T. (1992, May). Towards an assembly plan from observation, part I: Assembly task recognition using face-contact relations (polyhedral objects). In Proceedings of the IEEE international conference on robotics and automation (ICRA) (Vol. 3, pp. 2171–2177).

Inamura, T., Kojo, N., & Inaba, M. (2006, October). Situation recognition and behavior induction based on geometric symbol representation of multimodal sensorimotor patterns. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 5147–5152).

Inamura, T., Kojo, N., Sonoda, T., Sakamoto, K., Okada, K., & Inaba, M. (2005). Intent imitation using wearable motion capturing system with on-line teaching of task attention. In Proceedings of the IEEE-RAS international conference on humanoid robots (Humanoids).

Ishiguro, H. (2006). Interactive humanoids and androids as ideal interfaces for humans. In Proceedings of the international conference on intelligent user interfaces (IUI) (pp. 2–9). ACM Press.

Ishiguro, H., Ono, T., Imai, M., & Kanda, T. (2003). Development of an interactive humanoid robot Robovie - an interdisciplinary approach. Springer Tracts in Advanced Robotics, 6, 179–192.

Ito, M., Noda, K., Hoshino, Y., & Tani, J. (2006). Dynamic and interactive generation of object handling behaviors by a small humanoid robot using a dynamic neural network model. Neural Networks, 19 (3), 323–337.

Ito, M., & Tani, J. (2004). Joint attention between a humanoid robot and users in imitation game. In International conference on development and learning (ICDL).

Jansen, B., & Belpaeme, T. (2006). A computational model of intention reading in imitation. Robotics and Autonomous Systems, 54 (5), 394–402.

Jenkins, O., & Mataric, M. (2004). A spatio-temporal extension to isomap non-linear dimension reduction. In Proceedings of the international conference on machine learning (ICML) (pp. 56–63).

Jolliffe, I. (1986). Principal component analysis. Springer-Verlag.

Kahn, P., Freier, N., Friedman, B., Severson, R., & Feldman, E. (2004, September). Social and moral relationships with robotic others? In Proceedings of the IEEE international workshop on robot and human interactive communication (RO-MAN) (pp. 545–550).

Kahn, P., Ishiguro, H., Friedman, B., & Kanda, T. (2006, September). What is a human? - Toward psychological benchmarks in the field of human-robot interaction. In Proceedings of the IEEE international symposium on robot and human interactive communication (RO-MAN) (pp. 364–371).

Kajita, S., Nagasaki, T., Yokoi, K., Kaneko, K., & Tanie, K. (2002, May). Running pattern generation for a humanoid robot. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (Vol. 3, pp. 2755–2761).

Kambhatla, N. (1996). Local models and Gaussian mixture models for statistical data processing. PhD thesis, Oregon Graduate Institute of Science and Technology, Portland, OR.

Kanda, T., Hirano, T., Eaton, D., & Ishiguro, H. (2004). Interactive robots as social partners and peer tutors for children: A field trial. Human Computer Interaction, 19 (1-2), 61–84.

Kaplan, F. (2004). Who is afraid of the humanoid? Investigating cultural differences in the acceptance of robots. International Journal of Humanoid Robotics, 1 (3), 465–480.

Kapoor, A., & Picard, R. (2001). A real-time head nod and shake detector. In Workshop on perceptive user interfaces (PUI) (pp. 1–5).

Kato, I. (1973). Development of WABOT 1. Biomechanism, 2, 173–214.

Kehtarnavaz, N., & deFigueiredo, J. (1988). A 3-D contour segmentation scheme based on curvature and torsion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10 (5), 707–713.

Khatib, O., Sentis, L., Park, J.-H., & Warren, J. (2004). Whole body dynamic behavior and control of human-like robots. International Journal of Humanoid Robotics, 1 (1), 29–43.

Kheddar, A. (2001). Teleoperation based on the hidden robot concept. IEEE Transactions on Systems, Man and Cybernetics, Part A, 31 (1), 1–13.

Kozima, H., & Yano, H. (2001). A robot that learns to communicate with human caregivers. In International workshop on epigenetic robotics.

Kullback, S., & Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22 (1), 79–86.

Kuniyoshi, Y., Inaba, M., & Inoue, H. (1994). Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 10 (6), 799–822.

Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Machine Learning Research, 6, 1783–1816.

Lee, B., Hope, G., & Witts, N. (2006, September). Could next generation androids get emotionally close? Relational closeness from human dyadic interactions. In Proceedings of the IEEE international symposium on robot and human interactive communication (RO-MAN) (pp. 475–479).

Lee, C., & Xu, Y. (1996, April). Online, interactive learning of gestures for human/robot interfaces. In IEEE international conference on robotics and automation (ICRA) (Vol. 4, pp. 2982–2987).

Lee, D., & Nakamura, Y. (2006, October). Stochastic model of imitating a new observed motion based on the acquired motion primitives. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 4994–5000).

Lee, D., & Nakamura, Y. (2007, April). Mimesis scheme using a monocular vision system on a humanoid robot. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 2162–2168).

Lenarcic, J., & Stanisic, M. (2003). A humanoid shoulder complex and the humeral pointing kinematics. IEEE Transactions on Robotics and Automation, 19 (3), 499–506.

Lieberman, H. (2001). Your wish is my command: Programming by example (H. Lieberman, Ed.). M. Kaufmann.

Lozano-Perez, T. (1983). Robot programming. Proceedings of the IEEE, 71 (7), 821–841.

MacDorman, K., Chalodhorn, R., & Asada, M. (2004, August). Periodic non-linear principal component neural networks for humanoid motion segmentation, generalization, and generation. In Proceedings of the international conference on pattern recognition (ICPR) (Vol. 4, pp. 537–540).

MacDorman, K., & Cowley, S. (2006, September). Long-term relationships as a benchmark for robot personhood. In Proceedings of the IEEE international symposium on robot and human interactive communication (RO-MAN) (pp. 378–383).

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the symposium on mathematical statistics and probability (pp. 281–297). University of California Press.

Massie, T., & Salisbury, J. (1994, November). The PHANToM haptic interface: A device for probing virtual objects. In Proceedings of the ASME winter annual meeting, symposium on haptic interfaces for virtual environment and teleoperator systems (pp. 295–300).


Matsumoto, T., Konno, A., Gou, L., & Uchiyama, M. (2006, October). A humanoid robot that breaks wooden boards applying impulsive force. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 5919–5924).

Mayer, H., Nagy, I., Knoll, A., Braun, E., Lange, R., & Bauernschmitt, R. (2007, April). Adaptive control for human-robot skill transfer: Trajectory planning based on fluid dynamics. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 1800–1807).

McLachlan, G., & Peel, D. (2000). Finite mixture models. Wiley.

Meltzoff, A., & Moore, M. (1977). Imitation of facial and manual gestures by human neonates. Science, 198 (4312), 75–78.

Milgram, P., Rastogi, A., & Grodski, J. (1995, July). Telerobotic control using augmented reality. In Proceedings of the IEEE international workshop on robot and human communication (RO-MAN) (pp. 21–29).

Mitchell, T., Caruana, R., Freitag, D., McDermott, J., & Zabowski, D. (1994). Experience with a learning personal assistant. Communications of the ACM, 37 (7), 80–91.

Moore, A. (1992). Fast, robust adaptive control by learning only forward models. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems (Vol. 4). San Francisco, CA, USA: Morgan Kaufmann.

Mori, M. (1970). Bukimi no tani (the uncanny valley). Energy, 7, 33–35.

Muench, S., Kreuziger, J., Kaiser, M., & Dillmann, R. (1994). Robot programming by demonstration (RPD) - Using machine learning and user interaction methods for the development of easy and comfortable robot programming systems. In Proceedings of the international symposium on industrial robots (ISIR) (pp. 685–693).

Nadel, J., Guerini, C., Peze, A., & Rivet, C. (1999). The evolving nature of imitation as a format for communication. In Imitation in infancy (pp. 209–234). Cambridge University Press.

Nakanishi, J., Morimoto, J., Endo, G., Cheng, G., Schaal, S., & Kawato, M. (2004). Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47 (2-3), 79–91.

Naksuk, N., & Lee, C. (2005, April). Utilization of movement prioritization for whole-body humanoid robot trajectory generation. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 1079–1084).

Nam, Y., & Wohn, K. (1996). Recognition of space-time hand-gestures using hidden Markov model. In ACM symposium on virtual reality software and technology (pp. 51–58).

Nehaniv, C., & Dautenhahn, K. (2000). Of hummingbirds and helicopters: An algebraic framework for interdisciplinary studies of imitation and its applications. In J. Demiris & A. Birk (Eds.), Interdisciplinary approaches to robot learning (Vol. 24, pp. 136–161). World Scientific Press.

Nehaniv, C., & Dautenhahn, K. (2001). Like me? - Measures of correspondence and imitation. Cybernetics and Systems: An International Journal, 32 (1-2), 11–51.

Nehaniv, C., & Dautenhahn, K. (2002). The correspondence problem. In Imitation in animals and artifacts (pp. 41–61). MIT Press.

Nguyen, L., Bualat, M., Edwards, L., Flueckiger, L., Neveu, C., Schwehr, K., et al. (2001). Virtual reality interfaces for visualization and control of remote vehicles. Autonomous Robots, 11 (1), 59–68.

Nickel, K., & Stiefelhagen, R. (2003). Pointing gesture recognition based on 3d-tracking of face, hands and head orientation. In International conference on multimodal interfaces (ICMI) (pp. 140–146).

Nicolescu, M., & Mataric, M. (2003). Natural methods for robot task learning: Instructive demonstrations, generalization and practice. In Proceedings of the international joint conference on autonomous agents and multiagent systems (AAMAS) (pp. 241–248).

Ogawara, K., Takamatsu, J., Kimura, H., & Ikeuchi, K. (2003). Extraction of essential interactions through multiple observations of human demonstrations. IEEE Transactions on Industrial Electronics, 50 (4), 667–675.

Otmane, S., Mallem, M., Kheddar, A., & Chavand, F. (2000). Active virtual guides as an apparatus for augmented reality based telemanipulation system on the internet. In Proceedings of the IEEE simulation symposium (p. 185).

Pardowitz, M., Zoellner, R., Knoop, S., & Dillmann, R. (2007). Incremental learning of tasks from user demonstrations, past experiences and vocal comments. IEEE Transactions on Systems, Man and Cybernetics, Part B. Special issue on robot learning by observation, demonstration and imitation, 37 (2), 322–332.

Park, H., Kim, E., Jang, S., Park, S., Park, M., & Kim, H. (2005). HMM-based gesture recognition for robot control. Pattern Recognition and Image Analysis, 607–614.

Peshkin, M., Colgate, J., & Moore, C. (1996, April). Passive robots and haptic displays based on nonholonomic elements. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (Vol. 1, pp. 551–556).

Peters, J., Vijayakumar, S., & Schaal, S. (2003, October). Reinforcement learning for humanoid robotics. In Proceedings of the IEEE international conference on humanoid robots (Humanoids).

Petersen, K. B., & Pedersen, M. S. (2007). The matrix cookbook. Technical Report. University of Denmark.

Pfeifer, R., & Scheier, C. (2001). Understanding intelligence. Cambridge, MA, USA: MIT Press.

Piaget, J. (1962). Play, dreams and imitation in childhood. New York, USA: Norton.

Popovic, M., & Popovic, D. (2001). Cloning biological synergies improves control of elbow neuroprostheses. IEEE Engineering in Medicine and Biology Magazine.

Rabiner, L. (1989, February). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77 (2), 257–285.

Righetti, L., & Ijspeert, A. (2006, May). Programmable central pattern generators: an application to biped locomotion control. In Proceedings of the IEEE international conference on robotics and automation (ICRA).

Riley, M., Ude, A., Atkeson, C., & Cheng, G. (2006, December). Coaching: An approach to efficiently and intuitively create humanoid robot behaviors. In Proceedings of the IEEE-RAS international conference on humanoid robots (Humanoids) (pp. 567–574).

Rizzolatti, G., Fadiga, L., Fogassi, L., & Gallese, V. (2002). From mirror neurons to imitation: Facts and speculations. In A. Meltzoff & W. Prinz (Eds.), The imitative mind: Development, evolution and brain bases (pp. 163–182). Cambridge University Press.

Rizzolatti, G., Fogassi, L., & Gallese, V. (2001). Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2, 661–670.


Rohlfing, K., Fritsch, J., Wrede, B., & Jungmann, T. (2006). How can multimodal cues from child-directed interaction reduce learning complexity in robots? Advanced Robotics, 20 (10), 1183–1199.

Rolland, J., Davis, L., & Baillot, Y. (2001). A survey of tracking technology for virtual environments. In Augmented reality and wearable computers (chap. 3). Mahwah, NJ.

Rosheim, M. (2006). Leonardo’s lost robots. New York, USA: Springer.

Rossi, D. D., Bartalesi, R., Lorussi, F., Tognetti, A., & Zupone, G. (2006). Body gesture and posture classification by smart clothes. In IEEE/RAS-EMBS international conference on biomedical robotics and biomechatronics (BioRob).

Sakamoto, D., Kanda, T., Ono, T., Ishiguro, H., & Hagita, N. (2007, March). Android as a telecommunication medium with a human-like presence. In Proceedings of the ACM/IEEE international conference on human-robot interaction (HRI) (pp. 193–200).

Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26 (1), 43–49.

Salter, T., Dautenhahn, K., & Boekhorst, R. te. (2006). Learning about natural human-robot interaction styles. Robotics and Autonomous Systems, 54 (2), 127–134.

Sato, T., Genda, Y., Kubotera, H., Mori, T., & Harada, T. (2003). Robot imitation of human motion based on qualitative description from multiple measurement of human and environmental data. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (Vol. 3, pp. 2377–2384).

Saunders, J., Nehaniv, C., & Dautenhahn, K. (2006, March). Teaching robots by moulding behavior and scaffolding the environment. In Proceedings of the ACM SIGCHI/SIGART conference on Human-Robot Interaction (HRI) (pp. 118–125).

Scassellati, B. (1999). Imitation and mechanisms of joint attention: A developmental structure for building social skills on a humanoid robot. Lecture Notes in Computer Science, 1562, 176–195.

Schaal, S., & Atkeson, C. (1996). From isolation to cooperation: An alternative view of a system of experts. In Advances in neural information processing systems (Vol. 8, pp. 605–611). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Schaal, S., & Atkeson, C. (1998). Constructive incremental learning from only local information. Neural Computation, 10 (8), 2047–2084.

Scholkopf, B., Smola, A., & Muller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10 (5), 1299–1319.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Scott, D. (2004). Multivariate density estimation and visualization. In Handbook of computational statistics: Concepts and methods (pp. 517–538).

Shimizu, M., Yoon, W.-K., & Kitagaki, K. (2007, April). A practical redundancy resolution for 7 DOF redundant manipulators with joint limits. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 4510–4516).

Shin, H., Lee, J., Shin, S., & Gleicher, M. (2001). Computer puppetry: An importance-based approach. ACM Transactions on Graphics, 20 (2), 67–94.

Shon, A., Grochow, K., & Rao, R. (2005). Robotic imitation from human motion capture using Gaussian processes. In Proceedings of the IEEE/RAS international conference on humanoid robots (Humanoids).

Shon, A., Storz, J., & Rao, R. (2007, April). Towards a real-time Bayesian imitation system for a humanoid robot. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 2847–2852).

Smith, D., Cypher, A., & Spohrer, J. (1994). Kidsim: Programming agents without a programming language. Communications of the ACM, 37 (7), 54–67.

Song, M., & Wang, H. (2005). Highly efficient incremental estimation of Gaussian mixture models for online data stream clustering. In Proceedings of SPIE: Intelligent computing - theory and applications III.

Starner, T., Weaver, J., & Pentland, A. (1998). Real-time American sign language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Steels, L. (2001). Language games for autonomous robots. Intelligent Systems, 16 (5), 16–22.

Steil, J., Rothling, F., Haschke, R., & Ritter, H. (2004). Situated robot learning for multi-modal instruction and imitation of grasping. Robotics and Autonomous Systems, 47 (2-3), 129–141.

Sturman, D., & Zeltzer, D. (1994, January). A survey of glove-based input. IEEE Computer Graphics and Applications, 14 (1), 30–39.

Sun, W. (1999). Spectral analysis of Hermite cubic spline collocation systems. SIAM Journal of Numerical Analysis, 36 (6).

Sung, H. (2004). Gaussian mixture regression and classification. PhD thesis, Rice University, Houston, Texas.

Svinin, M., Goncharenko, I., Luo, Z., & Hosoe, S. (2006, October). Modeling of human-like reaching movements in the manipulation of flexible objects. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 549–555).

Tachi, S., Maeda, T., Hirata, R., & Hoshino, H. (1994). A construction method of virtual haptic space. In Proceedings of the international conference on artificial reality and tele-existence (pp. 131–138).

Takano, W., Yamane, K., & Nakamura, Y. (2007, April). Capture database through symbolization, recognition and generation of motion patterns. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 3092–3097).

Takubo, T., Inoue, K., Arai, T., & Nishii, K. (2006, October). Wholebody teleoperation for humanoid robot by marionette system. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 4459–4465).

Thomaz, A., Berlin, M., & Breazeal, C. (2005, July). Robot science meets social science: An embodied computational model of social referencing. In Workshop toward social mechanisms of android science (CogSci) (pp. 7–17).

Thrun, S. (2004). Toward a framework for human-robot interaction. Human-Computer Interaction, 19 (1-2), 9–24.

Thrun, S., & Mitchell, T. (1993). Integrating inductive neural network learning and explanation-based learning. In Proceedings of the international joint conference on artificial intelligence (IJCAI) (pp. 930–936).

Thrun, S., & Mitchell, T. (1995). Lifelong robot learning. Robotics and Autonomous Systems, 15, 25–46.

Tian, T., & Sclaroff, S. (2005). Handsignals recognition from video using 3D motion capture data. In Proceedings of the IEEE workshop on motion and video computing (pp. 189–194).

Ting, J.-A., D’Souza, A., & Schaal, S. (2007, April). Automatic outlier detection: A Bayesian approach. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 2489–2494).

Tipping, M., & Bishop, C. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61 (3), 611–622.

Ude, A. (1993). Trajectory generation from noisy positions of object features for teaching robot paths. Robotics and Autonomous Systems, 11 (2), 113–127.

Ude, A., Atkeson, C., & Riley, M. (2004). Programming full-body movements for humanoid robots by observation. Robotics and Autonomous Systems, 47 (2-3), 93–108.

Urtasun, R., Glardon, P., Boulic, R., Thalmann, D., & Fua, P. (2004). Style-based motion synthesis. Computer Graphics Forum, 23 (4), 1–14.

Vijayakumar, S., D’souza, A., & Schaal, S. (2005). Incremental online learning in high dimensions. Neural Computation, 17 (12), 2602–2634.

Vijayakumar, S., & Schaal, S. (2000). Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional spaces. In Proceedings of the international conference on machine learning (ICML) (pp. 288–293).

Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13 (2), 260–269.

Vlassis, N., & Likas, A. (2002). A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15 (1), 77–87.

Vygotsky, L. (1978). Mind in society - The development of higher psychological processes (M. Cole, V. John-Steiner, S. Scribner, & E. Souberman, Eds.). Cambridge, MA, USA: Harvard University Press.

Waldherr, S., Romero, R., & Thrun, S. (2000). A gesture based interface for human-robot interaction. Autonomous Robots, 9 (2), 151–173.

Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P. Krishnaiaah (Ed.), Multivariate analysis (pp. 391–420). New York, USA: Academic Press.

Wood, D., & Wood, H. (1996). Vygotsky, tutoring and learning. Oxford Review of Education, 22 (1), 5–16.

Wu, G., Helm, F. van der, Veeger, H., Makhsous, M., Van Roy, P., Anglin, C., et al. (2005). ISB recommendation on definitions of joint coordinate systems of various joints for the reporting of human joint motion - part II: shoulder, elbow, wrist and hand. Biomechanics, 38, 981–992.

Xu, L., Jordan, M., & Hinton, G. (1995). An alternative model for mixtures of experts. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems (Vol. 7, pp. 633–640). The MIT Press.

Xuan, G., Zhang, W., & Chai, P. (2001). EM algorithms of Gaussian mixture model and hidden Markov model. In Proceedings of the IEEE international conference on image processing (Vol. 1, pp. 145–148).

Yamane, K., & Nakamura, Y. (2003). Dynamics filter - concept and implementation of online motion generator for human figures. IEEE Transactions on Robotics and Automation, 19 (3), 421–432.

Yang, J., Xu, Y., & Chen, C. (1997). Human action learning via hidden Markov model. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, 27 (1), 34–44.

Yokoi, K., Nakashima, K., Kobayashi, M., Mihune, H., Hasunuma, H., Yanagihara, Y., et al. (2006). A tele-operated humanoid operator. International Journal of Robotics Research, 25 (5-6), 593–602.

Yoon, W.-K., Suehiro, T., Onda, H., & Kitagaki, K. (2006, May). Task skill transfer method using a bilateral teleoperation. In Proceedings of the IEEE international conference on robotics and automation (ICRA) (pp. 3250–3256).

Yoshikai, T., Otake, N., Mizuuchi, I., Inaba, M., & Inoue, H. (2004, September-October). Development of an imitation behavior in humanoid Kenta with reinforcement learning algorithm based on the attention during imitation. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1192–1197).

Yoshikawa, Y., Shinozawa, K., Ishiguro, H., Hagita, N., & Miyamoto, T. (2006, August). Responsive robot gaze to interaction partner. In Proceedings of Robotics: Science and Systems (RSS). Philadelphia, USA.

Young, J., Xin, M., & Sharlin, E. (2007). Robot expressionism through cartooning. In Proceedings of the ACM/IEEE international conference on human-robot interaction (HRI) (pp. 309–316).

Zen, H., Tokuda, K., & Kitamura, T. (2007). Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech and Language, 21 (1), 153–173.

Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (11), 1330–1334.

Zoellner, R., Rogalla, O., Dillmann, R., & Zoellner, M. (2002, September). Understanding users intention: programming fine manipulation tasks by demonstration. In IEEE/RSJ international conference on intelligent robots and system (IROS) (Vol. 2, pp. 1114–1119).

Zukow-Goldring, P. (2004). Caregivers and the education of the mirror system. In Proceedings of the international conference on development and learning (ICDL) (pp. 96–103).


Sylvain Calinon
T +41 21 693 59 12
B [email protected]
Learning Algorithms and Systems Laboratory (LASA)
Ecole Polytechnique Fédérale de Lausanne (EPFL)
STI-I2S-LASA, Station 9, CH-1015 Lausanne, Switzerland
http://lasa.epfl.ch

Curriculum

2007–now Post-doc at the Learning Algorithms and Systems Laboratory (LASA). Ecole polytechnique fédérale de Lausanne (EPFL), Switzerland.

2003–2007 PhD in Robot Programming by Demonstration obtained at the LASA Laboratory. EPFL, Switzerland.

2001–2003 MSc in Microengineering, specialization in Robotics. EPFL, Switzerland.

1998–2001 BSc in Microengineering. EPFL, Switzerland.

1980 Born in Yverdon, Switzerland. Swiss and French citizen. Single.

Research interests

• Robot Programming by Demonstration • Learning by Imitation
• Machine Learning • Human-Robot Interaction

Publications

Book chapters Calinon, S. and Billard, A. (2007) "Learning of Gestures by Imitation in a Humanoid Robot". Dautenhahn, K. and Nehaniv, C.L. (eds.). Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions. Cambridge University Press.

Journals Calinon, S. and Billard, A. (2007) "What is the Teacher’s Role in Robot Programming by Demonstration? - Toward Benchmarks for Improved Learning". Interaction Studies. Special Issue on Psychological Benchmarks in Human-Robot Interaction, 8:3.

Calinon, S., Guenter, F. and Billard, A. (2007) "On Learning, Representing and Generalizing a Task in a Humanoid Robot". IEEE Transactions on Systems, Man and Cybernetics, Part B. Special issue on robot learning by observation, demonstration and imitation, 37:2.

Guenter, F., Hersch, M., Calinon, S. and Billard, A. (2007) "Reinforcement Learning for Imitating Constrained Reaching Movements". Advanced Robotics, Special Issue on Imitative Robots.

Billard, A., Calinon, S. and Guenter, F. (2006) "Discriminative and Adaptive Imitation in Uni-Manual and Bi-Manual Tasks". Robotics and Autonomous Systems, 54:5.

Billard, A., Epars, Y., Calinon, S., Cheng, G. and Schaal, S. (2004) "Discovering Optimal Imitation Strategies". Robotics and Autonomous Systems, Special Issue: Robot Learning from Demonstration, 47:2-3.


Conference papers

Calinon, S. and Billard, A. (2007) "Active Teaching in Robot Programming by Demonstration". In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Jeju, Korea.

Calinon, S. and Billard, A. (2007) "Incremental Learning of Gestures by Imitation in a Humanoid Robot". In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), Arlington, Virginia, USA.

Calinon, S. and Billard, A. (2006) "Teaching a Humanoid Robot to Recognize and Reproduce Social Cues". In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Hertfordshire, UK.

Calinon, S., Guenter, F. and Billard, A. (2006) "On Learning the Statistical Representation of a Task and Generalizing it to Various Contexts". In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Orlando, USA.

Calinon, S., Epiney, J. and Billard, A. (2005) "A Humanoid Robot Drawing Human Portraits". In Proceedings of the IEEE-RAS International Conference on Humanoid Robots (Humanoids), Tsukuba, Japan.

Calinon, S. and Billard, A. (2005) "Recognition and Reproduction of Gestures using a Probabilistic Framework combining PCA, ICA and HMM". In Proceedings of the International Conference on Machine Learning (ICML), Bonn, Germany.

Calinon, S., Guenter, F. and Billard, A. (2005) "Goal-Directed Imitation in a Humanoid Robot". In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Barcelona, Spain.

Calinon, S. and Billard, A. (2004) "Stochastic Gesture Production and Recognition Model for a Humanoid Robot". In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sendai, Japan.

Calinon, S. and Billard, A. (2003) "PDA Interface for Humanoid Robots". In Proceedings of the IEEE International Conference on Humanoid Robots, Karlsruhe, Germany.

Hersch, M., Guenter, F., Calinon, S. and Billard, A. (2006) "Learning Dynamical System Modulation for Constrained Reaching Tasks". In Proceedings of the IEEE-RAS International Conference on Humanoid Robots (Humanoids), Genova, Italy.

Conference abstracts and posters

Calinon, S. and Billard, A. (2005) "Learning of Gestures by Imitation in a Humanoid Robot". International Symposium on Imitation in Animals and Artifacts (AISB), Hatfield, UK.

Calinon, S. and Billard, A. (2004) "Gesture Recognition and Reproduction for a Humanoid Robot using Hidden Markov Models". AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms, Martigny, Switzerland.

Technical report

Calinon, S. (2003) "PDA Interface for Humanoid Robots using Speech and Vision Processing". MSc thesis.


Peer-reviewed videos

Calinon, S. and Billard, A. (2007) "Incremental Learning of Gestures by Imitation in a Humanoid Robot". ACM/IEEE International Conference on Human-Robot Interaction (HRI), Arlington, Virginia, USA.

Hersch, M., Guenter, F., Calinon, S. and Billard, A. (2006) "Learning Dynamical System Modulation for Constrained Reaching Tasks". IEEE-RAS International Conference on Humanoid Robots (Humanoids), Genova, Italy.

Calinon, S. and Billard, A. (2005) "Goal-Directed Imitation in a Humanoid Robot". IEEE International Conference on Robotics and Automation (ICRA), Barcelona, Spain.

Calinon, S. and Billard, A. (2005) "A Humanoid Robot Drawing Human Portraits". IEEE-RAS International Conference on Humanoid Robots (Humanoids), Tsukuba, Japan.

Calinon, S. and Billard, A. (2004) "Stochastic Gesture Production and Recognition Model for a Humanoid Robot". IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sendai, Japan.

Calinon, S. and Billard, A. (2003) "PDA Interface for Humanoid Robots". IEEE-RAS International Conference on Humanoid Robots (Humanoids), Karlsruhe, Germany.

Invited talks

2007 "Extraction of Task Constraints in a Robot Programming by Demonstration Framework". Workshop on Concept Learning for Embodied Agents. International Conference on Robotics and Automation (ICRA), Rome, Italy.

2006 "Learning and Probabilistic Representation of a Task in a Humanoid Robot". Workshop on Collaborative Human-Robot Teamwork. International Conference on Robotics and Automation (ICRA), Orlando, Florida, USA.

Affiliations to funded research projects

COGNIRON European Integrated Project COGNIRON ("The Cognitive Companion"), funded under Contract FP6-IST-002020. This four-year project, started in 2004, involves 8 university partners, with a project cost of 8 million euros. Its overall objective is to study the perceptual, representational, reasoning and learning capabilities of embodied robots in human-centered environments.

SNF Swiss National Science Foundation, through grant 620-066127 of the SNF Professorships program.

Refereeing

2007 Reviewer for IEEE Transactions on Systems, Man, and Cybernetics - Part B.

2007 Reviewer for Advanced Robotic Systems.

2006 Reviewer for International Conference on Robotics and Automation (ICRA).

2005 Reviewer for IEEE Transactions on Systems, Man, and Cybernetics - Part C.

2005 Reviewer for Neural Networks.


Research supervision

Graduate level

"Object tracking for a camera designed to be worn by children". MSc thesis, EPFL, 2005.

"Control of the humanoid robot HOAP-2". MSc thesis, EPFL, 2004.

"Artificial Neural Network (ANN) - Hidden Markov Model (HMM) Comparison for a gesture recognition application". MSc thesis, EPFL, 2004.

"Person recognition for the humanoid robot Robota". MSc thesis, EPFL, 2003.

Undergraduate level

"A Humanoid Robot that can paint". EPFL, 2005.

"Gaze direction detection using gyroscopic sensors". EPFL, 2005.

"Motion data representation using motion sensors". EPFL, 2004.

"Recognition of trajectories using a stereoscopic vision system". EPFL, 2004.

"Gestures analysis using gyroscopic sensors". EPFL, 2004.

"Development of an auditory system running on a Pocket-PC". EPFL, 2003.

"Robust visual arms and head tracking running on a Pocket-PC". EPFL, 2003.

"Interactive game for a pair of mini-humanoid robots Robota". EPFL, 2003.

Lausanne, 31/05/2007