
VISION-BASED SCENE UNDERSTANDING WITH SPARSITY PROMOTING PRIORS

THÈSE No 5070 (2011)

PRÉSENTÉE LE 10 Mai 2011
À LA FACULTÉ SCIENCES ET TECHNIQUES DE L'INGÉNIEUR

Laboratoire de traitement des signaux 2
PROGRAMME DOCTORAL EN INFORMATIQUE ET COMMUNICATIONS

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

POUR L’OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES

PAR

Alexandre Alahi

Ingénieur diplômé École Polytechnique Fédérale de Lausanne, Suisse
De nationalité française

proposée au jury:

Prof. Thiran Jean-Philippe, président du jury
Prof. Vandergheynst Pierre, directeur de thèse

Prof. Kunt Murat, co-directeur de thèse
Prof. Aghajan Hamid, Stanford, rapporteur

Prof. Bierlaire Michel, EPFL, rapporteur
Prof. Cavallaro Andrea, QMUL, rapporteur


2011


Chance and coincidence do not exist; there is a cause and a reason for everything.

– Ostad Elahi –


Abstract

Human beings are interested in understanding their environments and the dynamic content that fills their surroundings. For applications ranging from security to marketing, people have installed networks of cameras to capture the dynamic elements of scenes. In this thesis, we propose a complete real-time system to automatically analyze human behavior from any network of cameras. The proposed system leverages mixed networks of fixed and mobile cameras to locate people, track them, and analyze their trajectories. The mathematical frameworks underlying our proposed methods are based on the following claim:

The dynamics of a scene are based on a small set of causes, and therefore can be parameterized by a few degrees of freedom.

Every processing block of our system is driven by sparsity promoting priors, i.e., just a few elements are sufficient to capture the scene dynamics. We first present our multi-view people localization algorithm, which is designed for a network of fixed cameras. An inverse problem with a sparsity constraint is formulated to detect people using the degraded foreground silhouettes extracted by the cameras. To solve this sparsity-driven formulation in a manner appropriate for a real-time implementation, we then propose an approach called "Set Covering Object Occupancy Pursuit" (SCOOP) that outperforms the state-of-the-art. Next, we tackle the data association problem of finding correspondences between located people across time. We implement a graph-based greedy approach to reach real-time tracking performance.

Unlike the fixed camera networks considered in the first part of this thesis, mobile cameras are uncalibrated and often monitor fields of view that do not overlap with those of other cameras. We propose a "Cascade of Grids of Image Descriptors" (CaGID) with a sparse search to accurately detect and track objects across uncalibrated cameras with non-overlapping fields of view. We evaluate the ability of such mixed networks of cameras to alert drivers to a potential collision with pedestrians. For this application, a camera mounted in a vehicle collaborates with a network of fixed cameras installed in a city.

Finally, the proposed system is evaluated for coaching and marketing purposes. The behavior of people in sports games and stores is analyzed in real time with a graph-based algorithm coined "SpotRank". A probability map inspired by the PageRank algorithm is proposed to rank the most salient 'hot spots' based upon mutual flows.

Several public data sets have been used to quantitatively and qualitatively evaluate the performance of our system. To our knowledge, it is the first system to capture the behavior of people in crowded environments and analyze this behavior in real time with sparsity priors.


Keywords: People detection, object tracking, object matching, people behavior analysis, sparsity, convex optimization, multi-view, mobile camera, fixed camera, non-overlapping, spotrank, scoop, cagid, cascade, image descriptor, master-slave, foreground extraction, dictionary.


Résumé

La nature humaine a besoin d'assimiler son environnement et ses causes. Où se trouvent les êtres ? Comment se comportent-ils ? Les hommes sont intéressés à capturer l'essence d'une scène pour des besoins de sécurité, de protection, d'études et même d'analyses marketing. C'est principalement la raison pour laquelle des caméras ont été installées en tous lieux. Malheureusement, ces énormes flux de données ne peuvent être analysés manuellement. Les caméras doivent devenir intelligentes et extraire automatiquement la sémantique d'une scène. Dans cette thèse, nous proposons un système complet de capture et d'analyse en temps réel du comportement des hommes à l'aide d'un réseau de caméras. Les personnes sont automatiquement localisées, suivies, et leurs trajectoires analysées grâce à un réseau de caméras fixes et mobiles. Nous avons proposé des modèles mathématiques basés sur l'hypothèse suivante :

La dynamique d’une scène est basée sur un petit ensemble de causes et peut donc être paramé-trée par un nombre minimal d’éléments.

Les solutions que nous cherchons sont parcimonieuses, en d'autres termes, très peu d'éléments sont actifs. Notre système est composé de deux étapes : i) la collecte des données et ii) leur analyse. La première étape, la plus difficile, est basée sur une méthode "subordonné-subordonnant". Les caméras fixes, les subordonnants, collaborent pour localiser les personnes grâce à une politique centralisée. Un problème de minimisation avec une contrainte parcimonieuse est résolu pour localiser les hommes au sol grâce aux silhouettes binaires extraites par les caméras. Une résolution en temps réel est proposée avec des performances supérieures à l'état de l'art. Nous l'avons intitulée "Set Covering Object Occupancy Pursuit" (SCOOP). Les caméras mobiles, les subordonnées, détectent les objets grâce aux observations capturées par les subordonnants. Une cascade de grilles de descripteurs d'images (CaGID) est utilisée afin de suivre les hommes à travers un réseau de caméras mobiles.

La localisation et le suivi des hommes ont été mis en pratique dans deux applications. Premièrement, comme système de sécurité routière, où les conducteurs sont avertis automatiquement face à une collision potentielle avec des piétons. En effet, une caméra est embarquée dans leur véhicule et collabore avec les caméras installées dans les lieux urbains. Deuxièmement, notre approche est évaluée à des fins d'études sportives ou marketing. Le comportement des individus est analysé en temps réel grâce à un algorithme basé sur la théorie des graphes intitulé "SpotRank". Une carte de probabilité est proposée pour classer les zones les plus saillantes.

Plusieurs bases de données publiques ont été utilisées pour évaluer la performance de notre système. À notre connaissance, nous avons développé le premier système complet de capture et d'analyse de comportements en temps réel en utilisant des contraintes parcimonieuses.

Mots clés : Détection de piétons, suivi d'objet, identification d'objet, analyse de comportement, parcimonie, optimisation convexe, réseau de caméras, caméra mobile, caméra fixe, spotrank, scoop, descripteur d'images, dictionnaire.


Acknowledgments

My deepest and foremost gratitude is to my great-grandfather Ostad Elahi, who guides my life through his thought. He has been the source of my inspiration and always will be. While pursuing my research, I have always tried to respect one of his sayings:

"More ignorant than the ignorant is one who says ’I know’ - a true human being always seeksto learn from others."

As a result, my thesis was a collaborative work and could not have been performed individually. I would like to first sincerely thank my advisor Prof. Pierre Vandergheynst for his perfect supervision. He engaged me in fruitful discussions and gave me invaluable guidance. I would like to also express my gratitude to Prof. Murat Kunt for his co-supervision. He has always believed in my work and trusted me. He gave me the honor of being his last PhD student, and shaped with Prof. Vandergheynst the best environment to conduct my research. I would like to especially thank Prof. Michel Bierlaire for very inspiring discussions and contributions to this work.

I would like to also thank the members of my thesis committee, Prof. Hamid Aghajan for his effort in reviewing this dissertation, and Prof. Andrea Cavallaro for the meaningful comments and research materials he gave me during my Ph.D. studies.

I would next like to thank my co-authors for considerably contributing to this thesis: Dr Laurent Jacques, Prof. Yannick Boursier, Dr David Marimon, and especially Mohammad Golbabaei, who wrote the proofs presented in the appendix. Special thanks go to the anonymous reviewers for their valuable comments and suggestions about our published papers.

I would also like to thank Prof. Pascal Frossard and Prof. Jean-Philippe Thiran for their precious comments and suggestions during my Ph.D. studies. In addition, I thank my colleagues and friends Luigi, Emmanuel and David for challenging my ideas and reviewing my work, as well as the members of the LTS2 and signal processing laboratories. Great thanks to my office-mates Emmanuel, Hyunggon, Mourad, Nawal, Francois and Xavier. I would like to also thank the LTS staff: Rosemary De Pietro, Marianne Marion, and Gilles Auric.

Special thanks to my supervised students: Damien Matti, Hadrien Mottaz, Thomas Frendo, Samuel Zare, Daniel Domingues, Samuel Egli, Hind Hannouch, Romain Jandard, Karim Mardambey, Mohamad Hammoud, Johan Paratte, Hamed Izadi, Parmeet Bhatia, William Guicquero, and Onur Gultekin.

I would like to express my special recognition to Prof. Farhad Rachidi and his wife for all their help and support during my stay in Switzerland. Farhad has been very close to me and I will never forget his immense generosity. I would like to also thank Massoud, Soussan, and Malika for making Switzerland feel like home.

I want to pay tribute to all my friends in Switzerland, in particular to Sam, Yashar, Alok, Hamed, Emmanuel, Julien, Jeremie, Cedric, Dali, Vincent, and Hedi, who have been close to me over these years. I would like to also thank my cousin Jacques for supporting me during the writing of this thesis.

A very special gratitude goes to my heartfelt friend Jeremie Ezri and his family, Pierre, Mirelle, and Jessica, who always make my cousin and me feel like part of their family.

My deepest thanks go to Safa, whom I consider my second brother. He has always supported, encouraged, and helped me in all circumstances.

Finally, I want to profoundly thank my family, and especially my parents and my adorable brother Omid, for their encouragement. They have offered me the best environment to pursue my research. This work undoubtedly owes a lot to their presence, love and support.


Contents

1 Introduction 1

1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Network of fixed cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Heterogeneous network of fixed and mobile cameras . . . . . . . . . . . . . . . . 4

1.4 In praise of sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.6 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Sparsity-Driven People Localization 9

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Multi-view people localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 Dictionary-based framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.1 Conventions and problem formulation . . . . . . . . . . . . . . . . . . . . 12

2.3.2 Forward model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3.3 Semi-supervised dictionary construction . . . . . . . . . . . . . . . . . . . 14

2.4 Sparsity-driven formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4.1 Ideal formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4.2 Linearized and re-weighted optimizations . . . . . . . . . . . . . . . . . . 19

2.4.3 Occupancy Lasso (O-Lasso) . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.4.4 Computational complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5 Experimental results and comparisons . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


2.5.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Real-Time Scalable People Localization 33

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.2 Greedy algorithm for sparse approximation . . . . . . . . . . . . . . . . . . . . . 33

3.2.1 Forward model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.2.2 Uniqueness of the solution . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2.3 Recovery algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.2.4 Set Covering Object Occupancy Pursuit: SCOOP . . . . . . . . . . . . . . 36

3.2.5 Experimental results and comparisons . . . . . . . . . . . . . . . . . . . . 38

3.3 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.3.1 Dimensionality reduction on the observations . . . . . . . . . . . . . . . . 38

3.3.2 Dimensionality reduction in the search space . . . . . . . . . . . . . . . . 41

3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4 People Behavior Analysis 45

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Tracking people . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2.1 Multi-view tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2.2 Dijkstra Pursuit Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.3 Sparse probability map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3.1 Preliminary remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3.2 Point Of Interest detection . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.3.3 SpotRank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.3.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5 Foreground Silhouette Extraction Based on Stereo Imaging 57

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.2 Depth estimation given stereo cameras . . . . . . . . . . . . . . . . . . . . . . . . 58

5.2.1 Algorithms and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


5.2.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.3 Foreground silhouettes extraction given stereo cameras . . . . . . . . . . . . . . . 61

5.3.1 Temporal depth variation . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.3.2 Disparity mismatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5.3.3 Proposed Total Variation disparity-based extraction . . . . . . . . . . . . . 62

5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6 Object Matching Across Uncalibrated Cameras 71

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.2 Existing image descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6.3 Object descriptor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

6.3.1 Collection of grids of image descriptors . . . . . . . . . . . . . . . . . . . 74

6.3.2 Several observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.4 Object matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.4.1 Preliminary remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.4.2 Dense scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.4.3 Sparse selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.4.4 Cascade of coarse to fine descriptors . . . . . . . . . . . . . . . . . . . . . 77

6.5 Experimental results and comparisons . . . . . . . . . . . . . . . . . . . . . . . . 77

6.5.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6.5.3 Image descriptors evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.5.4 GID/CoGID/CaGID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.5.5 People recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

7 Master-Slave Framework for Pedestrian Detection 95

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

7.2 Pedestrian detection with mobile cameras . . . . . . . . . . . . . . . . . . . . . . 96

7.3 A master-slave object detection approach . . . . . . . . . . . . . . . . . . . . . . 97

7.3.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7.3.2 Detect, track, and validate . . . . . . . . . . . . . . . . . . . . . . . . . . 99


7.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7.5 Experimental results and evaluations . . . . . . . . . . . . . . . . . . . . . . . . . 102

7.5.1 Performance of the detect-track-validate . . . . . . . . . . . . . . . . . . . 102

7.5.2 Master-slave detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

8 Conclusions 115

Appendix 117

A.1 Proximal methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

A.2 Projection onto a `1 weighted ball . . . . . . . . . . . . . . . . . . . . . . . . . . 119

A.3 Proof of Proposition 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

A.4 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


List of Figures

1.1 Our system to analyze human behavior consists of three steps: detect, track, and analyze. . . . . . . 2

1.2 People localization based on extracted foreground silhouettes from fixed cameras needs to address the following challenges: (a) noisy extraction with missing and false positive pixels, (b) a single connected silhouette for a group of people, and (c) no information about occluded people. . . . . . . 3

2.1 A scene observed by four planar cameras and one omnidirectional camera (extracted from the APIDIS data set). The green contours represent the degraded foreground silhouettes extracted, and the bounding boxes correspond to the output of our proposed detection algorithm. . . . . . . 10

2.2 People localization with a single camera. Left-hand side: contour (in white) of the foreground silhouette extracted by the camera. Right-hand side: people located by our proposed algorithm given the extracted silhouette. . . . . . . 12

2.3 To each point p(i) corresponds a silhouette modeling the presence of a person in a camera view. . . . . . . 13

2.4 Illustration of the atoms modeling the given foreground silhouettes. The grid is only for visual purposes. . . . . . . 15

2.5 Training stage: extracted foreground silhouettes are attached to the ground plane points using a planar homography warping the silhouettes onto the ground. The center of mass of the intersected warped silhouettes corresponds to the ground plane point of interest. . . . . . . 15

2.6 Dictionary construction: after the training stage, each ground plane point has a set of clusters of silhouettes representing the various shapes of all the objects of interest. The presented silhouettes correspond to people, vehicles, trucks, and bicycles. . . . . . . 17

2.7 Objects detected by a fixed camera running the sparsity-driven algorithm given the multi-class dictionary. . . . . . . 17

2.8 Objects detected by a network of two fixed cameras running the sparsity-driven algorithm given the multi-class dictionary. . . . . . . 18


2.9 Detecting and tracking people given the PETS data set. White contours represent the degraded foreground silhouettes used. . . . . . . 24

2.10 Locating people with either 3 cameras (left-hand side) or a single camera (right-hand side) given the PETS data set. White contours represent the degraded foreground silhouettes used. . . . . . . 25

2.11 Precision and recall rate on the APIDIS data set given four cameras monitoring the scene (camera ids 2, 4, 7, and 1). Our proposed approaches (RW-BPDN, RW-Lasso, and O-Lasso) are compared with the state-of-the-art probability of occupancy (POM) by Fleuret et al. in [1]. . . . . . . 26

2.12 Precision and recall rate with the synthetic data given four cameras. Our proposed approaches (RW-BPDN, RW-Lasso, and O-Lasso) are compared with the state-of-the-art probability of occupancy (POM) by Fleuret et al. in [1]. . . . . . . 27

2.13 Regions considered to report the number of detected people in the PETS 2009 data set. . . . . . . 27

2.14 Precision and recall rate when the number of cameras monitoring the scene is increased. The number in each bubble represents the number of cameras used. (The sequence of camera ids 7, 2, 4, 1 is used for planar cameras, and camera id 5 for the omnidirectional one). . . . . . . 28

2.15 Recall rate with respect to the PO present for the people given the O-Lasso approach. . . . . . . 29

2.16 Illustration of our dictionary-based framework applied to the problem of counting cells in microscopic images. The proposed O-Lasso is able to detect cells even if they partially overlap or are connected with each other. . . . . . . 29

2.17 Recall rate with respect to the percentage of missing silhouette region extracted for each person given the adaptive O-Lasso approach. . . . . . . 30

2.18 Illustration of the detected people with various numbers of cameras given the APIDIS data set. The green contours represent the degraded foreground silhouettes used. . . . . . . 31

3.1 Precision and recall rate with the synthetic data given four cameras. Our proposed approach SCOOP is compared with other sparsity-driven formulations and the probability of occupancy (POM) approach presented by Fleuret et al. in [1]. . . . . . . 39

3.2 Precision and recall rate with the synthetic data given four cameras. Our proposed approach SCOOP is compared with other sparsity-driven formulations and the probability of occupancy (POM) approach presented by Fleuret et al. in [1]. . . . . . . 39

3.3 Recall rate with respect to the degradation on the foreground silhouettes. . . . . . . 40

3.4 Illustration of the detected players with the SCOOP algorithm given the APIDIS data set. . . . . . . 40

3.5 Illustration of the adaptive sampling process. Top row: sample points given a regularly spaced grid. Bottom row: proposed non-regular grid. . . . . . . 41

3.6 Overview of the adaptive sampling process. . . . . . . 42


3.7 Illustration of the sample points used given the three strategies. (a) Camera view examples. (b) Corresponding foreground silhouettes. (c) People's exact locations (top view). Sampled points are given in (d) for the "Foreground pixels only" assumption (top view), in (e) for "Intersecting foreground pixels" and in (f) for "Least significant silhouette". . . . . . . 44

3.8 Precision and recall rate of our proposed algorithm (O-Lasso) given four cameras monitoring the scene (camera ids 2, 4, 5, and 7 in the APIDIS data set) and various search space reduction assumptions. . . . . . . 44

4.1 Spatio-temporal graph-based tracking. . . . . . . 48

4.2 Illustration of people's trajectories in a retail store. . . . . . . 50

4.3 Given the captured trajectories of people navigating in the store, a sparse number of salient 'hot spots' (POIs) have been detected. . . . . . . 51

4.4 (a) Trajectories between four POIs, also referred to as spots. (b) The corresponding SpotRank coefficients illustrating the most salient POIs. . . . . . . 52

4.5 Automatically ranked salient spots in a retail store given dozens of people's trajectories. . . . . . . 54

4.6 Output of the SpotRank algorithm over various captured trajectories. . . . . . . 55

5.1 Disparity algorithms applied on the Middlebury data set [2]. . . . . . . 65

5.2 Disparity algorithms applied on a typical scene made of walking people. . . . . . . 66

5.3 Foreground silhouettes obtained by depth variation, i.e. absolute subtraction of the background depth from the estimated depth in the presence of people. For comparison purposes, we also present in (h) the output of our TV-based approach described in Section 5.3.3. . . . . . . 67

5.4 Two examples of disparity mismatch given three algorithms. . . . . . . 67

5.5 Illustration of the three schemes to extract foreground silhouettes given stereo cameras. . . . . . . 68

5.6 Illustration of our proposed TV disparity-based foreground extraction algorithm. . . . . . . 68

5.7 Corresponding mask given our TV disparity-based foreground silhouette extraction approach. . . . . . . 69

5.8 The Kinect device. . . . . . . 70

5.9 Output of our real-time people detection and tracking algorithm with the Kinect cameras (SCOOP and Dijkstra Pursuit Tracker). . . . . . . 70

6.1 A collection of grids of descriptors. Left column is the object of interest. Right column is another object to compute similarity. Colored blobs are kept to compute the global distance (β = 50%). . . . . . . 74


6.2 Illustration of an object described by three IPs. The most similar IPs in the target camera lead to 3*6 candidate regions only. . . . . . . 76

6.3 A three-stage cascade of coarse-to-fine descriptors. . . . . . . 77

6.4 Illustration of the most similar regions after each stage of the algorithm (in Jet format, white regions are the least similar and black ones the most). . . . . . . 78

6.5 Recall for various image descriptors. . . . . . . 80

6.6 Recall for various image descriptors with 3 different schemes to describe an object based on a dense scan. . . . . . . 81

6.7 Recall with respect to the number of image descriptors needed. . . . . . . 82

6.8 Mean recall of the cascade of HOG and covariance descriptors for various β values. . . . . . . 83

6.9 Left-hand side: the objects observed in the fixed camera. Right-hand side: the image plane of the mobile camera to be searched. . . . . . . 84

6.10 Recall for various image descriptors with 3 different schemes to describe an object based on a sparse search. . . . . . . 85

6.11 Recall with respect to the number of best matches kept. . . . . . . 86

6.12 1st and 3rd columns: objects of interest seen in a fixed camera. 2nd and 4th columns: corresponding detected and matched objects in a mobile camera. . . . . . . 89

6.13 1st and 3rd columns: objects of interest seen in a fixed camera. 2nd and 4th columns: corresponding detected and matched objects in a mobile camera. . . . . . . 90

6.14 Given the left-hand side image, the same pedestrian is detected regardless of changes in viewpoint and scale. . . . . . . 90

6.15 Examples of images randomly selected from a data set. Left column: manually selected regions. Right column: the corresponding regions detected and matched by our proposed approach. . . . . . . 91

6.16 The CMC curve for color histogram given all three schemes. YCrCb color space with 64 bins per channel is used. . . . . . . 92

6.17 The CMC curve for color histogram given various β values. . . . . . . 92

6.18 The CMC curve given 50 pedestrians present in the scene. . . . . . . 92

6.19 Examples of objects observed by a camera (left side), and the corresponding most similar pedestrians found by the cascade of grids of color histograms (YCrCb space) within another camera (right side). The images are sorted (most similar are the left ones) and the green bounding box is the correct match. . . . . . . 93

7.1 Left column: objects of interest highlighted in a fixed (master) camera. Right column: corresponding objects detected and tracked in a mobile (slave) camera by our proposed approach. . . . . . . 96


7.2 Feature-based object detection framework and the proposed system to detect a pedestrian in a mobile camera. . . . . . . 98

7.3 Illustration of the Φ operator. (a) An object x, highlighted in the master camera, is mapped to the best 3 regions in the slave camera. (b) Then, each region yi is mapped back to 3 regions in the master camera. If those regions coincide with x, there is a match. . . . . . . 99

7.4 Illustration of the detect, track, and validate process. Only one object is validated and tracked across frames. . . . . . . 101

7.5 Illustration of the bounding boxes x (in red) and xi where C ≈ 0.75, O ≈ 0.4. . . . . . . 101

7.6 x-axis represents C or O; y-axis represents its contribution to ς and ςl. It can be seen that for values of C or O close to 1, the contribution also remains almost 1 (full) for the logistic operator. . . . . . . 101

7.7 The linear ςl gives 0.63 and the proposed ς gives 0.86. . . . . . . 102

7.8 Recall/precision graph for various Ns and Nm. . . . . . . 103

7.9 Histogram of the similarity measurements φ(x, yi) for a set of TP and FP. . . . . . . 104

7.10 Histogram of the similarity measurements in the validation process. . . . . . . 104

7.11 Overview of the recall/precision graph for various thresholding strategies. . . . . . . 105

7.12 Correct detections and no false positives. First and third columns: objects detected by a master. Second and fourth columns: corresponding objects detected and matched with a slave. . . . . . . 108

7.13 Some false positives and missed true positives. First and third columns: objects detected by a master. Second and fourth columns: corresponding objects detected and matched with a slave. . . . . . . 109

7.14 Performance of a slave running the SPD proposed by Tuzel et al. (2008) and collaborating with other slaves or masters with either a poor or good detection rate. . . . . . . 109

7.15 Performance of a slave using only the appearance of the detected people in the masters given various schemes and detection rates. . . . . . . 110

7.16 First column: people detected by a master. Second column: corresponding people detected with a slave. Third column: output of the SPD proposed by Tuzel et al. (2008). . . . . . . 110

7.17 First column: people detected by a master. Second column: corresponding people detected with a slave. Third column: output of the SPD proposed by Tuzel et al. (2008). . . . . . . 111

7.18 First column: people detected by a master. Second column: corresponding people detected with a slave. Third column: output of the SPD proposed by Tuzel et al. (2008). . . . . . . 112


7.19 First column: people detected by a master. Second column: corresponding people detected with a slave. Third column: output of the SPD proposed by Tuzel et al. (2008). . . . . . . 113

7.20 First column: people detected by the SPD. Second column: remaining people kept after the proposed validation scheme. Third column: people detected by the master. . . . . . . 114


List of Tables

2.1 Greedy RSS Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2 Occupancy Lasso (O-Lasso) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3 Performance of people counting on the scenarios proposed by the PETS organizers given 1 camera (id=1) and 3 cameras (id=1,2,3): GT = Ground Truth, PC = People Counted by our O-Lasso algorithm, AFE = Average Frame Error. The data set "S1" is used with various levels of difficulty ("L1": medium density crowd, "L2" and "L3" correspond to high density of people). . . . . . . 26

2.4 Precision and recall rate using various shapes in the Forward Model given the O-Lasso method and 4 planar cameras. . . . . . . 30

3.1 ’Set Covering Object Occupancy Pursuit’ (SCOOP) algorithm . . . . . . . . . . . 37

3.2 Performance of people counting on the scenarios proposed by the PETS organizers given 1 camera (id=1) and 3 cameras (id=1,2,3): GT = Ground Truth, PC = People Counted by our algorithm, AFE = Average Frame Error. The data set "S1" is used with various levels of difficulty ("L1": medium density crowd, "L2" and "L3" correspond to high density of people). . . . . . . 41

4.1 Dijkstra Pursuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2 Impact of the Dijkstra Pursuit Tracker on the detection rates. . . . . . . . . . . . . 49

5.1 Parameters used for the algorithm proposed by Domke et al. [3]. . . . . . . . . . . 60

5.2 Parameters used for the Markov Random Field (MRF) [4] method. . . . . . . . . . 60

5.3 Parameters used for the Zitnick and Kanade algorithm [5]. . . . . . . . . . . . . . 60

5.4 Parameters used for the TV algorithm [6]. . . . . . . . . . . . . . . . . . . . . . . 60

5.5 Parameters used for the Birchfield algorithm. . . . . . . . . . . . . . . . . . . . . 60

5.6 Parameters used for the block matching algorithm [7]. . . . . . . . . . . . . . . . . 61


5.7 Parameters used for the Graph Cuts algorithm [8]. . . . . . . . . . . . . . . . . . . 61

5.8 Time for each tested algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.1 Summary of the best performing image descriptors evaluated for the CaGID. x and y are the pixel coordinates, I the grayscale value, Ix and Iy the 1st order derivatives, Ixx and Iyy the 2nd order derivatives, mg and o the gradient magnitude and angle. . . . . . . 79

6.2 Recall rate and computation cost of various approaches . . . . . . . . . . . . . . . 86


1 Introduction

1.1 Motivations

The falling cost of digital cameras and the progress in processing large data sets such as digital videos have promoted the installation of cameras everywhere, on fixed and moving platforms. There are already 50 million video surveillance cameras installed in the world [9]. In 2002, approximately four million cameras were installed in UK cities [10]. More than a billion cell phones were equipped with cameras in 2010 [11]. The overwhelming volume of data captured raises the question of how to monitor and analyze this information. Humans cannot efficiently perform these tasks over such large amounts of information. Cameras need to be "smart", i.e. capable of automatically extracting application-specific information, such as the presence of an object, from the captured images. One can see smart cameras as a tool to help humans perform time-consuming tasks. Moreover, several cameras often monitor the same scene. How can we fuse the information from multiple cameras to better detect objects? The cameras can be fixed or mobile. What is the best way to deal with the dynamics of the cameras? The goal of this thesis is to make heterogeneous networks of cameras "smarter" in order to automatically detect objects, track them, and analyze their behavior (see Figure 1.1).

The fields of application are almost unlimited. Automatic people detection, counting, and tracking can address the following sectors:

Security - Many industries and public facilities such as airports are interested in locating and tracking people. Are humans present in forbidden areas? Are they crossing borders? Are they walking in the wrong direction? Suspicious activities can be automatically detected and shared with security agents.

Safety - Public areas and rooms such as lounges or bars often have a limited occupancy for safety purposes in case of fire. Entry should be denied once the limit is reached.

Public Statistics - Institutions and news corporations are interested in estimating crowd sizes. For example, the year 2011 started with a few paradigm shifts in the history of several countries, where people were protesting against repressive regimes. International media were interested in estimating the number of protesters.

Figure 1.1: Our system to analyze human behavior consists of three steps: (a) detect objects, (b) track them, and (c) analyze their behavior.

Comfort - In museums, a restricted number of visitors is allowed in galleries to guarantee the best experience.

Monitoring - Several events use visitor counts as evidence when requesting sponsors. In cases where tickets are not sold, a vision-based system is a key asset for counting the number of people present at the event.

Marketing - Retailers and brand suppliers are interested in understanding customer behavior, also known as the 'path to purchase'. Typically, two stores carrying the same merchandise can have a 500% difference in sales due to the customer behavior induced by the store layout [12].

Optimization - Statistics on people's behavior can be used to optimize flows in railways. How many people are waiting? For how long? Where are they usually going? Peak times can be determined to optimize facility management.

Sport - Collecting statistics on players is also of great interest, especially in team games such as basketball. Analyzing players' behavior is valuable for coaching, to evaluate team tactics and measure players' performance.

Transportation safety - Car manufacturers and institutions are interested in detecting potential collisions between cars and pedestrians in urban areas [13]. In the EU, more than 150 000 pedestrians are injured yearly and more than 8000 are killed (numbers for 2004, [14]). Many accidents originate from the momentary distraction of the driver. A driver assistance system that detects potential collisions with pedestrians would reduce the number of pedestrians killed on roads. For that purpose, car manufacturers have mounted cameras on cars. Those cameras could collaborate with the fixed cameras installed in cities to better detect pedestrians or any moving objects (such as bicycles or dogs). Different challenges arise depending on the dynamics of the cameras.

Figure 1.2: People localization based on foreground silhouettes extracted from fixed cameras needs to address the following challenges: (a) noisy extraction with missing and false positive pixels, (b) a single connected silhouette for a group of people, and (c) no information about occluded people.

1.2 Network of fixed cameras

One of the best-known and most widely used analytics in fixed cameras is motion detection. Nowadays, most IP cameras extract foreground silhouettes, i.e. binary masks representing the connected pixels belonging to the foreground of the scene. Fixed cameras can model the background of a scene by collecting statistics on every pixel. Foreground silhouettes are then extracted with standard approaches; Porikli presents some of the methods in a brief survey [15]. Typically, a decade ago, Stauffer and Grimson modeled each pixel as a mixture of Gaussians with an on-line approximation for the update [16] (a minimal sketch of such a background-subtraction step is given after the list below). Although a lot of effort has been dedicated to enhancing these algorithms [17, 18], exact detection and localization of people given those foreground silhouettes remain difficult. Extracted silhouettes are:

Noisy - They contain either missing foreground pixels, due to objects having an appearance similar to the background, or false positive pixels such as shadows and reflections of people on specific materials.

Grouped - A dense crowd of people is represented as a single foreground silhouette. A segmentation process is needed to precisely locate people given a single connected binary mask.

Occluded - Given a single viewpoint, people can occlude each other. A single camera is not enough to locate a group of people. Figure 1.2 illustrates the challenges.
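Per-pixel background models of this kind are available in common libraries; the following is a minimal sketch of the silhouette extraction step using OpenCV's mixture-of-Gaussians implementation. The file name and parameter values are illustrative only, not the settings used in this thesis.

import cv2

# Per-pixel mixture-of-Gaussians background model (Stauffer-Grimson family).
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)

capture = cv2.VideoCapture("fixed_camera.avi")   # hypothetical input sequence
while True:
    ok, frame = capture.read()
    if not ok:
        break
    mask = subtractor.apply(frame)   # 255 = foreground, 127 = shadow, 0 = background
    mask = cv2.medianBlur(mask, 5)   # light clean-up; silhouettes remain noisy
    # 'mask' is the binary foreground silhouette fed to the localization step.
capture.release()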

Note that fixed cameras are calibrated to map 3D points across image planes. Object correspondence across cameras is performed given the calibration data. What if cameras are not fixed? How can we track objects across moving cameras?


1.3 Heterogeneous network of fixed and mobile cameras

Detecting objects with a network of mobile cameras is more challenging than with fixed cameras. Neither foreground silhouettes nor calibration data can be used to map 3D points across cameras. The only available information is image features such as the appearance and shape of the objects. Usually, a training set made of thousands of images of the objects is used to train a classifier to detect specific classes of objects [19–21]. What if thousands of images are not available for our objects of interest?

In this thesis, we also tackle the problem of detecting objects given only a single observation, robust to the following variations:

Size - Object size or distance to the camera can vary.

Deformable - Objects can be rigid (e.g. vehicles) or deformable, such as pedestrians.

Lighting - Brightness and lighting conditions can change dramatically with mobile cameras in outdoor environments.

Sensing - The observation of the object can be captured by different sensing devices with a different color distribution and lower resolution.

View point - Objects can change pose or be captured by cameras with different viewing angles.

Occlusion - Objects can be partially occluded.

Out of scene - Finally, objects can temporarily leave the scene. In other words, objects can disappear from the field of view of the cameras and come back.

Every chapter of this thesis tackles one of the presented challenges by briefly describing previous work and the proposed solution. The tremendous amount of data generated by networks of cameras makes it almost impossible to tackle these challenges directly. This is why we propose a sparsity-driven framework that dramatically reduces the degrees of freedom with no obvious loss of accuracy.

1.4 In praise of sparsity

Recently, there has been a new trend in the signal processing community to reconstruct signals given a sparsity constraint with respect to a suitable basis. Interestingly, many signals and phenomena are driven by a sparsity model [22–24], i.e. only a few elements of the signals are active (non-zero). Applications such as geophysics and biomedical imaging [25], image recovery and enhancement [26, 27], statistical estimation [28, 29], deconvolution and super-resolution problems [30–32] have successfully been addressed with sparsity-driven formulations. If the signal we are looking for is sparse in a given basis, very efficient algorithms exist to reconstruct the signal in ways that can typically beat the Shannon/Nyquist sampling theorem, which states that to preserve a signal's information, we should uniformly sample at a rate at least two times faster than its Fourier bandwidth [33–35]. Mathematical frameworks exist to solve inverse problems with sparsity constraints in convex optimization schemes [36, 37]. Greedy approximations are available, leading to real-time performance [38, 39]. One question is left: how sparse is our world? How sparse are the solutions we are looking for? Can we model any problem as a sparse combination of the atoms of a given dictionary? According to a contemporary philosopher, O. Elahi [40]:


"Creation is based on a small number of primary, indivisible elements that combine with oneanother according to a few simple patterns".

The leitmotiv of this thesis is based on this belief. We formulate the following claim:

Claim 1 The dynamics of a scene are based on a small set of causes and therefore can be parameterized by a few degrees of freedom.

According to our Claim 1, the solutions we are looking for are precisely sparse. We show in the next chapters how this principle allows us to outperform the state-of-the-art.
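To make the connection concrete, the prototypical sparsity-promoting recovery problem mentioned above, Basis Pursuit DeNoise, can be written in its standard form as

\hat{x} \;=\; \arg\min_{x} \, \|x\|_{1} \quad \text{subject to} \quad \|y - A x\|_{2} \leq \epsilon,

where $y$ denotes the observations, $A$ the sensing matrix (or dictionary), and $\epsilon$ a bound on the noise level; the $\ell_1$-norm acts as a convex surrogate for the number of non-zero entries of $x$. The scene-specific instantiation of this generic form is developed in Chapters 2 and 3.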

1.5 Contributions of the thesis

In this thesis, we propose to detect, track and analyze people's behavior with sparsity promoting priors given any network of fixed and mobile cameras, in real time. The contributions can be categorized as follows:

Multi-view people localization We propose a novel framework to locate people given severely degraded foreground silhouettes extracted by a network of fixed cameras, with the following novelties:

Dictionary-based - A dictionary of atoms is used to approximate the ideal silhouettes of interest. A generative model can be used to model a single class, and a semi-supervised approach can create a multi-class dictionary given training data. Therefore, our framework is generic enough to detect any class of objects.

Sparsity-driven - An inverse problem is formulated to deduce people's location points given a sparsity constraint on the ground occupancy grid. We show how to solve it theoretically and practically by a modified re-weighted Lasso formulation. The performances of several reconstruction schemes for sparse approximation are evaluated. Finally, a modified version of the Matching Pursuit algorithm is presented with a new selection criterion, coined 'Set Covering Object Occupancy Pursuit' (SCOOP), that outperforms the state-of-the-art with real-time performance (a minimal sketch of this greedy family is given after this list).

Scalable - The proposed approach is (i) scalable to any number of cameras and already works with a single camera, (ii) versatile with respect to heterogeneous camera networks, i.e. able to merge specific camera geometries such as planar and omnidirectional cameras, and (iii) does not impose any constraint on the scene surface to be monitored, thanks to the proposed non-uniform adaptive sampling process.
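As an illustration of the greedy family SCOOP belongs to, the sketch below iteratively selects the dictionary atom that best explains the remaining foreground pixels. The plain correlation rule used here is an assumption for illustration only; the actual SCOOP selection criterion is a set-covering rule, detailed in Chapter 3.

import numpy as np

def greedy_occupancy_pursuit(y, D, max_people=20, tol=1e-3):
    """Greedy sparse approximation of the observed foreground silhouettes.

    y : flattened binary foreground silhouettes (all views stacked).
    D : dictionary whose columns are the silhouettes generated by a person
        standing at each ground-plane point.
    Returns the indices of the selected ground-plane points.
    Plain matching-pursuit rule; SCOOP uses a set-covering criterion instead.
    """
    residual = y.astype(float).copy()
    selected = []
    for _ in range(max_people):
        scores = D.T @ residual              # correlation with every atom
        best = int(np.argmax(scores))
        if scores[best] <= tol:
            break                            # nothing left to explain
        selected.append(best)
        residual = np.clip(residual - D[:, best], 0.0, None)  # remove covered pixels
    return selected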

Foreground silhouette extraction Monocular foreground silhouette extraction techniques fail during sudden changes of lighting conditions. As a result, we propose an algorithm based on stereo imaging:

Total Variation Disparity-based Foreground Extraction method - An inverse problem with a sparsity constraint on the gradient is proposed to extract foreground masks given the background disparity map.


People behavior analysis To analyze people's behavior, we propose:

Graph-based Tracking - We link people's location points with an iterative method coined Dijkstra Pursuit.

SpotRank - Instead of considering the full trajectories of people, a few 'hot spots' are segmented and ranked to identify the most salient points of interest and paths. In other words, the trajectories of people are summarized by a sparse number of ranked spots (a rough sketch of this ranking idea is given after this list).
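Assuming a matrix of mutual flows between detected spots, a PageRank-style power iteration already illustrates how such a ranking can be obtained; the exact SpotRank construction, described in Chapter 4, may differ from this sketch.

import numpy as np

def rank_spots(flow, damping=0.85, iters=100):
    """Rank 'hot spots' from a matrix of mutual flows.

    flow[i, j] : number of trajectories going from spot i to spot j.
    Returns a probability vector over spots (PageRank-style power iteration).
    Illustration only; the SpotRank construction of Chapter 4 may differ.
    """
    flow = np.asarray(flow, dtype=float)
    n = flow.shape[0]
    out = flow.sum(axis=1, keepdims=True)    # total outgoing flow per spot
    out[out == 0] = 1.0                      # avoid division by zero for dead ends
    P = (flow / out).T                       # P[i, j] = Prob(next = i | current = j)
    r = np.full(n, 1.0 / n)                  # start from a uniform distribution
    for _ in range(iters):
        r = damping * P @ r + (1.0 - damping) / n
    return r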

Object matching A Cascade of Grids of Image Descriptors (CaGID) is proposed to detect and match objects across cameras. A detailed comparison is conducted to gain insight into the performance of a large number of existing image descriptors. To reach satisfying matching performance, the following methods are proposed:

Grid-based object descriptor - An object is described by a collection of grids of coarse-to-fine descriptors, robust to partial occlusion or partial changes in the object's appearance.

Sparse scan - Instead of scanning the full image plane of a camera to find the object of interest, a sparse number of points driven by the interest point locations is used to considerably reduce the search space.

Cascade of descriptors - A cascade of coarse-to-fine grids is proposed to quickly discard regions unlikely to contain a matching object, leading to real-time performance (a sketch of this control flow is given after this list).
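The control flow of such a cascade can be sketched as follows: after each stage, only the most promising candidate regions are kept before the next, more expensive, descriptor is evaluated. The describe and distance functions and the keep_ratio parameter below are hypothetical placeholders; the actual pruning rule of the thesis, based on a threshold β, is specified in Chapter 6.

def cascade_match(query, candidates, stages, keep_ratio=0.25):
    """Return the candidate region most similar to the query object.

    stages : list of (describe, distance) pairs ordered from coarse/cheap to
             fine/expensive. All functions here are hypothetical placeholders.
    """
    survivors = list(candidates)
    for describe, distance in stages:
        q = describe(query)
        survivors.sort(key=lambda c: distance(q, describe(c)))  # most similar first
        keep = max(1, int(len(survivors) * keep_ratio))
        survivors = survivors[:keep]         # discard regions unlikely to match
    return survivors[0] if survivors else None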

Master-slave detection Finally, we compare the performance of our full system, made of fixed and mobile cameras, to detect people with a more traditional approach based on a large training set. We show how to outperform the state-of-the-art method with a master-slave system.

1.6 Thesis structure

This thesis is organized as follows. Chapter 2 reviews the state-of-the-art in people localization with a network of fixed cameras. Then, the dictionary-based framework is described. An inverse problem is formulated to deduce people's location points on the ground given a sparsity constraint. We show how to solve it by a modified re-weighted Lasso formulation. The performances of several reconstruction schemes are evaluated.

Chapter 3 presents the proposed real-time approach to deduce people's location points given a network of fixed cameras. A modified version of the Matching Pursuit algorithm is presented with a new selection criterion coined 'Set Covering Object Occupancy Pursuit' (SCOOP) to outperform the state-of-the-art. A non-uniform adaptive sampling process is presented to monitor large areas.

Chapter 4 describes the state-of-the-art in tracking and analyzing people with respect to our detection scheme. Then, our graph-based tracking algorithm, coined Dijkstra Pursuit, is presented. People's trajectories are studied by our SpotRank algorithm: points of interest are automatically identified and ranked according to their mutual flows.

Chapter 4 ends the description of our system to analyze human behavior in a network of fixed cameras. Before addressing the detection and tracking process across mobile cameras, a brief method is presented in Chapter 5 to extract foreground silhouettes robust to sudden changes of illumination. A Total Variation (TV) framework is used to better extract silhouettes with stereo imaging.

Chapter 6 presents the core method to track objects across uncalibrated cameras. The Cascade of Grids of Image Descriptors (CaGID) is described, and a detailed comparison of image descriptors is performed.

Chapter 7 presents the state-of-the-art in detecting people with a single moving camera. Then, it evaluates the ability to improve the detection rate of a mobile camera with the help of fixed cameras present in the scene. The master-slave detection framework is presented to track objects across cameras with non-overlapping fields of view.

Finally, this thesis ends with concluding remarks. Experimental results have supported our initial claim:

A sparse set of elements is enough to capture and understand human behavior.


2 Sparsity-Driven People Localization

2.1 Introduction

Isolated people in an uncluttered scene can be successfully detected with a single camera using pattern recognition techniques. A set of features such as Haar wavelet coefficients [41, 42], histograms of oriented gradients [19, 43] or covariance matrices of a set of features [20, 44] can be extracted from a large number of training samples to train a classifier with a support vector machine [41, 45] or boosting approaches [20, 46]. Given a fixed camera, a moving object can also be detected by modeling the background, and tracking simply becomes an object correspondence problem across frames. A detailed survey on object tracking is presented by Yilmaz et al. in [47]. However, those algorithms fail to detect a group of people due to their mutual occlusions. For instance, in sports such as basketball, players can strongly occlude each other and have abrupt changes of behavior. In order to handle the occlusion problem, several cameras should be fused to correctly locate all people present in the scene.
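For reference, this single-camera feature-plus-classifier route (here, HOG features with a linear SVM [19]) is exposed by common libraries; the minimal OpenCV sketch below uses illustrative parameters and a hypothetical input frame. It detects isolated pedestrians reasonably well but, as noted above, degrades quickly under mutual occlusion.

import cv2

# HOG descriptor with OpenCV's pre-trained linear SVM for pedestrians.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("frame.png")              # hypothetical input frame
# Sliding-window detection over an image pyramid.
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)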

In this chapter, a novel framework is proposed to use severely degraded foreground silhouettes from a set of calibrated pseudo-synchronized cameras to robustly detect moving people occluding each other (see Figure 2.1). The only features extracted from the cameras are indeed the binary masks, or foreground silhouettes, representing the connected pixels belonging to the foreground of the scene. Such features suffer from three flaws. First, silhouettes are usually made of many false positive pixels (e.g. shadows, reflections) and false negative ones (i.e. missing foreground pixels). Second, a single silhouette can correspond to several people due to their dense spatial distribution in crowded environments. Finally, occluded people are not a linear combination of foreground silhouettes. We propose a sparsity-driven mathematical framework to handle such challenges.

Figure 2.1: A scene observed by four planar cameras and one omnidirectional camera (extracted from the APIDIS data set). The green contours represent the degraded foreground silhouettes extracted, and the bounding boxes correspond to the output of our proposed detection algorithm.

Our approach relies on an inverse problem formulation regularized by the assumed "sparsity" of people's location points on the ground floor. It relies on deducing an occupancy vector, i.e. the discretized occupancy of people on the ground, from the foreground silhouettes. Reconstruction methods based on the Basis Pursuit DeNoise (BPDN) [48] and the Least Absolute Shrinkage and Selection Operator (Lasso) [29] algorithms are evaluated. The sparsity measure is reinforced by iteratively re-weighting the ℓ1-norm of the occupancy vector to better approximate its ℓ0 "norm" (referred to as RW-BPDN and RW-Lasso in this chapter). A new kind of "repulsive" sparsity is used to further adapt the Lasso procedure to the occupancy reconstruction (referred to as O-Lasso), outperforming other methods. A dictionary made of atoms representing the silhouettes viewed by the camera network is used within the formulation. We locate people's location points on the ground and propagate the detection results in each camera view.

The proposed approach is (i) generic to any scene of people and any sensing modality, (ii) versatile with respect to heterogeneous camera networks, i.e. able to merge specific camera geometries such as planar and omnidirectional cameras, (iii) scalable to any number of cameras and already working with a single camera, and (iv) robust to people having similar appearances and to abrupt changes of behavior (as for sport players).

The rest of the chapter is structured as follows. First, we recall some important previous works about people detection with multiple cameras. In Section 2.3.1, our approach is formulated as the inverse problem of deducing an occupancy vector from the noisy binary silhouettes observed as foreground pixels in each camera. We show how this problem can be solved theoretically by regularization, i.e. by using a sparsity prior on the occupancy grid. In Section 2.3.2, the dictionary involved in the corresponding forward (or generative) model, i.e. the generation of the observed silhouettes from the occupancy vector, is detailed. A multi-class dictionary is proposed in the next section to detect people in an urban environment. Section 2.4 explains the particular simplifications we bring to the theoretical methods of Section 2.3.1 to achieve a solvable people localization. First, a re-weighted ℓ1-norm is presented. Then, a repulsive spatial sparsity constraint is considered with a dynamic update. Finally, the performance of our approach is evaluated quantitatively and qualitatively on synthetic and real data, in comparison with the state-of-the-art technique.

2.2 Multi-view people localization

In order to deal with a dense spatial distribution of people and their mutual occlusions, the outputs of several cameras are used to detect the objects of interest. Robustness with respect to the appearance variability between views is achieved by estimating the object coordinates in a common reference (e.g. the ground plane where people are located). The unique world coordinates, i.e. the coordinates of the object on the ground plane, are linked to the view coordinates by a planar homography. The planar homography is a 3 × 3 matrix transformation obtained by matching at least four points from two different coordinate systems. Most systems compute the homographies at an initial calibration step [49]. Stauffer and Tieu's method in [50] relies on tracking data to estimate the homography from one camera to another (correspondence between trajectories). Note that instead of projecting each view onto a reference ground plane, some works compute planar homographies between camera views [49–51]. However, those approaches struggle to solve the occlusion problem.
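To make the homography step concrete, the sketch below estimates a 3 × 3 planar homography from point correspondences with the standard DLT construction and uses it to map an image point (e.g. a silhouette's foot point) onto the ground plane. The correspondences, coordinates, and function names are illustrative placeholders, not values taken from the thesis or its datasets.

```python
import numpy as np

def estimate_homography(src_pts, dst_pts):
    """Estimate the 3x3 planar homography H with dst ~ H @ src (DLT, >= 4 correspondences)."""
    A = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A, dtype=float)
    # The homography is the right singular vector of A with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def to_ground(H, image_point):
    """Map an image point (pixels) to ground-plane coordinates using H."""
    p = np.array([image_point[0], image_point[1], 1.0])
    q = H @ p
    return q[:2] / q[2]

# Hypothetical calibration: four image points and their known ground-plane positions (in meters).
image_pts  = [(120, 460), (540, 455), (600, 250), (80, 240)]
ground_pts = [(0.0, 0.0), (5.0, 0.0), (5.0, 8.0), (0.0, 8.0)]
H = estimate_homography(image_pts, ground_pts)
print(to_ground(H, (330, 350)))  # ground position of a detected foot point
```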

After projecting all detected objects into a common reference, Mueller et al. in [49] mark with the same label the nearest object with the same size and center of gravity. Orwell et al. in [52] and Caspi et al. in [53] match objects by fusing the estimated trajectories obtained by each camera. However, special care should be applied when using such methods. A point from the object region in the image coordinates is projected into other coordinates. Ideally, the foot region should be used. However, some works consider that the center of gravity of the detected object is a reliable approximation. If objects are very far from the cameras, then the approximation is correct. Otherwise, such an approximation will lead to poor matching performance. In addition, object segmentation should be perfect. If a person is extracted with its shadow, again, the matching procedure will be affected.

Kim and Davis in [54] take special care to extract the feet region of the foreground people by computing the center vertical axes of the people across views. The axes are mapped to the top-view plane by homography and their intersection point is estimated as the ground point. However, such approaches do not take full advantage of the multi-view infrastructure, as each camera detects the objects independently without helping each other.

Relevant works have decided to neither detect nor track objects from each camera, but preferred to gather evidence from all the views and locate objects in a reference plane. The problem is reformulated as determining the occupied points in the occupancy grid defined by Elfes in [55]. The occupancy grid can be 2-D [1], or even 3-D [56]. It is usually the ground plane or planes parallel to the ground. Yang et al. in [57] compute the occupancy grid with a standard visual hull procedure given an upper and lower bound constraint.

Some works locate people's head positions instead of their ground plane locations. Zhao and Nevatia in [58] locate the head locations given a single camera calibration and a head detector. Eshel and Moses in [59] use a set of cameras to better handle occlusions. Those approaches require a good observation of the heads or a good foreground extraction at the head level.

Khan and Shah in [60] pay attention to extracting the feet region of the foreground people. Each point of the foreground likelihood (foreground silhouettes) from all views is mapped to the ground plane given a planar homography. Multiplying the mapped points segments the pixels corresponding to the feet of the people. Their approach cannot be applied to an object viewed by one camera. In addition, poor foreground segmentation - people detected with their shadow or missing foreground pixels - affects the performance of their system. To handle such noisy segmentation, they apply their approach to multiple planes parallel to the reference plane in [61]. Delannay et al. in [62] use the same approach, i.e. they also project the foreground likelihood on multiple planes parallel to the ground. They combine such a process with a heuristic step to handle the non-linearity induced by occlusion. However, warping the foreground silhouettes on reference planes does not allow locating groups of people, especially when a single camera is used, as in Figure 2.2.

Figure 2.2: People localization with a single camera. Left-hand side: contour (in white) of the foreground silhouette extracted by the camera. Right-hand side: people located by our proposed algorithm given the extracted silhouette.

More recently, Reddy et al. in [63] use compressed sensing to detect and track people in a multi-view setup. They use the sparsity of the observations, i.e. the foreground silhouettes extracted from the cameras. However, their sparsity constraint depends on the distance of the objects to the cameras. Objects close to the cameras will unfortunately generate large foreground silhouettes with poor sparsity. To accurately estimate the position of the objects on the ground plane, multiple cameras are needed. No dictionary is used to model the presence of a person.

Fleuret et al. in [1, 64] develop a mathematical framework to estimate the probabilities of occupancy of the ground plane at each time frame, with dynamic programming to track people over time. They approximate the occupancy probabilities as the marginals of a product law minimizing the Kullback-Leibler divergence from the true conditional posterior distribution (referred to as the Fixed Point Probability Field algorithm). They are able to detect people occluding each other given noisy observations. We will consider their work as the state-of-the-art and compare it with our proposed algorithm in Section 3.2.5, since both approaches have a generative model and try to minimize the difference between a synthetic image and the observed image. However, their mathematical framework does not explicitly consider the sparsity of the desired solution, leading to a potentially high false positive rate.

We propose a framework to cope with the limitations of previous works. It scales to any number of cameras. A single camera can also be used, whereas previous multi-view approaches could not be applied to groups of people viewed by a single camera. We do not have any constraint on the surface to monitor. Omnidirectional cameras can also be integrated into the system. We use severely degraded foreground silhouettes representing realistic scenarios. Foreground silhouettes are made of many false negative and false positive pixels. Finally, we explicitly consider the sparsity present in the desired solution during the detection process, similar to other sparsity-based algorithms used for localization [22–24]. The strength of the proposed approach is quantitatively and qualitatively presented in Section 3.2.5.

2.3 Dictionary-based framework

2.3.1 Conventions and problem formulation

The objective of this chapter is to deduce the ground plane points occupied by the people present in the scene given the foreground silhouettes provided by a set of C calibrated cameras (planar or omnidirectional).

Figure 2.3: To each point p(i) corresponds a silhouette modeling the presence of a person in a camera view.

To simplify notations, we will often refer to two-dimensional objects, e.g. the grid of occupancy or a given camera view, as 1-D vectors, i.e. the vectors obtained for instance by the concatenation of the columns of these 2-D objects. This will allow us to easily model the construction and the action of some important linear operators such as the Multi-Silhouette Dictionary described in Section 2.3.2.

Up to the selection of an appropriate background subtraction method, we assume that, at a given time, each camera is the source of a binary silhouette image $y_c \in \{0, 1\}^{M_c}$, where $M_c \in \mathbb{N}$ is the number of pixels (resolution) of each camera indexed by $1 \le c \le C$. Stacking all these vectors gives the Multi-Silhouette Vector (MSV)
$$y = (y_1^T, \cdots, y_C^T)^T \in \{0, 1\}^M, \quad \text{with } M = \sum_{c=1}^{C} M_c.$$

The continuous ground plane is discretized into a 2-D grid of N sub-areas (or cells). The presence (or occupancy) of people on the ground is therefore represented by the binary vector $x \in \{0, 1\}^N$, coined the occupancy vector, with $x_i = 1$ meaning that the i-th cell is occupied by somebody. The index i of each component $x_i$ of x is actually linked to a particular position $p(i) \in \mathbb{R}^2$ on the ground plane at the center of one cell. For simplicity, we assume that one and only one observed person is exactly supported by one sub-area of this grid.

Assuming that a person is represented by an invariant volume, it is clear that any configuration of x will correspond to a particular configuration of silhouettes in y. For instance, if x contains only one non-zero component, all $y_c$ observing the object will contain one silhouette (i.e. a connected area of non-zero pixels) with size and location related to the particular projective geometry combining the scene and the cameras (see Figure 2.3).

Our inverse problem is thus to find x from y assuming that x is a sparse vector, i.e. it is composed of few non-zero components compared to N. However, the difficulty in the resolution of this problem arises from its non-linearity, i.e. the vector y is binary and it does not contain any information about possible occlusions between persons. In addition, the silhouettes provided by the background subtraction method are severely degraded (e.g. by light reflections, shadows, and noise).

To bypass these two difficulties, we propose to handle them both as noise on a linear observation obtained by the generative (forward) model described hereafter.


2.3.2 Forward model

Our forward (or generative) model, which associates to the occupancy vector $x \in \mathbb{R}^N$ a certain configuration of silhouettes in the cameras, is provided by the quantization of a linear operator applied on x. As developed hereafter, we obtain it from the one-bit quantization of a dictionary $D \in \mathbb{R}^{M \times N}$ multiplied by x.

The Multi-Silhouette (MS) dictionary D is one of the key ingredients of our approach. It is made of atoms modeling the presence of a single person at a given location. By construction, it maps non-empty locations of the occupancy vector to a linear approximation of the multiple silhouettes viewed by the camera network. In other words, each atom approximates the silhouette generated by a single person in all the camera views. The columns of D, i.e. the atoms, thus live in the same space as the observed Multi-Silhouette Vector (MSV), i.e. in a space of $M = \sum_{c=1}^{C} M_c$ dimensions.

Mathematically, the Forward Model generating silhouettes is thus the application of D on the occupancy vector x, i.e. Dx. Of course, by linearity, the components of Dx are not binary. They are higher than one each time two or more silhouettes occlude. A more faithful, but non-linear, forward model is therefore achieved by applying a quantization operator $Q : \mathbb{R}^M \to \{0, 1\}^M$ on Dx, with $(Q[v])_i = 1$ if $v_i \neq 0$ and 0 otherwise. We will develop further the use of these two forward models in Section 2.4.
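As a toy illustration of the two forward models, the sketch below builds a small synthetic dictionary of rectangular atoms for a single camera, applies it to an occupancy vector x, and quantizes the linear output with Q. The scene size, atom shapes, and occupied cells are arbitrary assumptions chosen only for illustration, not the dictionary actually generated from the camera calibration.

```python
import numpy as np

H, W = 48, 64          # toy single-camera resolution (M = H * W pixels)
N = 16                 # toy number of ground-plane cells

def atom(cell):
    """Hypothetical atom: a rectangular silhouette whose position/size depend on the cell index."""
    img = np.zeros((H, W), dtype=np.uint8)
    col = 4 * cell                      # silhouettes slide horizontally with the cell index
    height = 20 + cell                  # and grow slightly (closer to the camera)
    img[H - height:H, col:col + 6] = 1
    return img.ravel()

# Multi-silhouette dictionary D in {0,1}^(M x N): one column per ground-plane cell.
D = np.stack([atom(i) for i in range(N)], axis=1)

# Occupancy vector with two people; overlapping silhouettes simply add up in D @ x.
x = np.zeros(N)
x[3], x[4] = 1, 1
linear = D @ x                            # components can exceed 1 where silhouettes overlap
y_model = (linear != 0).astype(np.uint8)  # quantization Q: 1 wherever at least one atom is active

print(linear.max(), y_model.max())        # e.g. 2 (overlap) vs. 1 (binary silhouette)
```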

The dictionary $D \in \{0, 1\}^{M \times N}$ can also be seen as the merging of all the sub-dictionaries $D_c \in \{0, 1\}^{M_c \times N}$ made of the index restriction of the atoms of D to the pixel range of each camera c, for $1 \le c \le C$. Therefore,
$$D = (D_1^T, D_2^T, \ldots, D_C^T)^T \qquad (2.1)$$
meaning implicitly that there is no theoretical constraint on the number or on the type of cameras used, e.g. planar or omnidirectional.

Practically, as explained in [6], the atoms of each $D_c$ are generated (i) thanks to the homographies mapping points in the 3-D scene to their 2-D coordinates in the planar view, and (ii) thanks to the approximation of the silhouettes by simple shapes (e.g. rectangular or elliptical shapes). Indeed, to cope with the various poses and shapes a person can generate in a camera view, a half-cylinder-half-spherical shape is used to approximate the silhouette of a person in the views (see Figure 2.3). Figure 2.4 illustrates an example of severely degraded foreground silhouettes (made of shadows, people's reflections, missed regions) and the silhouettes used to model their presence in the set of planar and omnidirectional cameras.

Note that the shape used to generate the atoms does not affect the computational complexity of the approach, since the dictionary is computed off-line. We do not need to use a rectangular shape as in [1] to take advantage of integral images.

2.3.3 Semi-supervised dictionary construction

The proposed forward model fits well a basketball game, or any scene where the extracted foreground silhouettes correspond to people only. However, foreground silhouettes often represent several classes of moving objects, such as vehicles or bicycles in an urban environment. Therefore, the observed foreground silhouettes depend on the object class (i.e. pedestrian, vehicle, bicycle, or truck), the object's orientation, and the camera viewing angle. If a generative forward model is used, all possible observed silhouettes of all the objects at all given locations should be considered. Typically, a car generates various silhouettes depending on its orientation and the camera viewing angle. With a generative model, the size of the dictionary is colossal, leading to a high computational complexity. Therefore, we propose a semi-supervised dictionary construction to learn the silhouettes of interest off-line, as opposed to a generative model.

Figure 2.4: Illustration of the atoms modeling the given foreground silhouettes: (a) camera views; (b) degraded foreground silhouettes y; (c) y = Q(Dx). The grid is only for visual purposes.

Figure 2.5: Training stage: extracted foreground silhouettes are attached to ground plane points using a planar homography warping the silhouettes onto the ground. The center of mass of the intersected warped silhouettes corresponds to the ground plane point of interest.

A training stage extracts foreground silhouettes for a defined period of time. Extracted silhouettes from the cameras need to be attached to a given ground plane point. The algorithm works as follows:

1. Each image point of the foreground silhouettes from all views is mapped to the ground plane given a planar homography computed at the calibration stage [49]. Multiplying the mapped points segments the pixels corresponding to the region of the object lying on the ground. The center of mass of that region represents the desired ground plane point of the corresponding silhouettes (see Figure 2.5).

2. For each ground plane point, a set of silhouettes is extracted over time. Those silhouettes are clustered based on their shape, given the minor and major axes of the ellipse fitting the silhouettes. Similar aspect ratios are grouped together. The resulting clusters are manually labeled with the object class.

3. For each cluster, the average silhouette is computed to generate the atom of the dictionary. All the silhouettes in the cluster are summed and normalized by the number of silhouettes. Each average silhouette represents an atom. Figure 2.6 illustrates the different clusters. (A minimal sketch of this clustering and averaging step is given after the list.)
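As referenced in step 3 above, a minimal sketch of the clustering and averaging step follows: the silhouettes collected at one ground-plane point are grouped by the aspect ratio of a moment-based ellipse fit, and each group is averaged into one atom. The equal-width binning, the fitting method, and all names are illustrative assumptions; the manual labeling of the clusters is omitted.

```python
import numpy as np

def aspect_ratio(mask):
    """Minor/major axis ratio of the ellipse fitted to a binary silhouette (via second moments)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=0).astype(float)
    cov = np.cov(pts)
    eigvals = np.sort(np.linalg.eigvalsh(cov))
    return np.sqrt(eigvals[0] / max(eigvals[1], 1e-9))

def build_atoms(silhouettes, n_bins=4):
    """Group the silhouettes observed at one ground-plane point by aspect ratio,
    then average each group to produce one dictionary atom per cluster."""
    ratios = np.array([aspect_ratio(s) for s in silhouettes])
    bins = np.linspace(ratios.min(), ratios.max() + 1e-9, n_bins + 1)
    labels = np.digitize(ratios, bins[1:-1])
    atoms = []
    for c in range(n_bins):
        members = [s for s, l in zip(silhouettes, labels) if l == c]
        if members:
            atoms.append(np.mean(members, axis=0))  # average silhouette = atom
    return atoms
```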

The potential ground plane points that are not accessed during the training stage are interpolated with neighboring points, i.e. the same set of silhouettes is picked from neighboring points. The training period sketches the search space. Often, in an urban environment, some objects, such as vehicles, have a constrained behavior. They cannot be located at every ground location and cannot have all kinds of orientations. Therefore, such a training process automatically captures the scene dynamics to construct a dictionary of relevant atoms only. Each atom of the dictionary represents the silhouettes observed in all cameras of an object at a given location. We can see in Figure 2.6 that some ground plane points generate fewer atoms since not all objects can be present at their location. Such a semi-supervised approach does not require any prior on the objects of interest to be detected in a scene.

Figures 2.7 and 2.8 illustrate all detected objects given either one or a network of fixed cameras. Given the presented data set, the training stage segmented four classes of objects: people, cars, bicycles, and trucks. A ground plane point can be occupied by those four different instances, and even more in some of the ground regions where several vehicle orientations are possible. Often, atoms cover each other, i.e. the silhouette of an atom describing a pedestrian is fully covered by the silhouette of an atom describing a vehicle. In fact, to match the silhouette of a truck, the atoms of several pedestrians can be used. We impose a sparsity constraint to solve our inverse problem so that the reconstruction algorithm prefers to pick a single atom instead of several ones to reduce the fidelity term. Therefore, our algorithm correctly detects several classes of objects even if the silhouettes cover each other.

2.4 Sparsity-driven formulation

2.4.1 Ideal formulations

Figure 2.6: Dictionary construction: after the training stage, each ground plane point has a set of clusters of silhouettes representing the various shapes of all the objects of interest. The presented silhouettes correspond to people, vehicles, trucks, and bicycles.

Figure 2.7: Objects detected by a fixed camera running the sparsity-driven algorithm given the multi-class dictionary.

The problem of reconstructing the occupancy vector x from the observed data y can of course be ill-posed in certain configurations, e.g. if the discretization of the ground plane leads to a higher number of sample points than the sum of all the camera resolutions. However, even in an overcomplete scenario (more observations than sample points), the presence of noise in y and the non-linear forward model described in the previous section require us to regularize the reconstruction with the a priori spatial sparsity of x.

A first approach is to use the following theoretical optimization problem, i.e. the ℓ0-regularized problem:
$$\arg\min_{x \in \{0,1\}^N} \|x\|_0 \quad \text{s.t.} \quad \|y - Q(Dx)\|_2^2 < \epsilon \qquad (2.2)$$
where $\|x\|_0 = \#\{i : x_i \neq 0\}$, ε is the desired residual error, and the quantization Q is defined in Section 2.3.2.

Figure 2.8: Objects detected by a network of two fixed cameras running the sparsity-driven algorithm given the multi-class dictionary.

Such a formulation does not require knowing the sparsity of the vector x but needs an upper bound on the residual error for the fidelity term. Since the degradations of the foreground likelihood are not predictable, another alternative is to bound the sparsity, i.e. reformulate the reconstruction method as the following ℓ0-constrained problem, which can be compared to the Lasso problem [29]:
$$\arg\min_{x \in \{0,1\}^N} \|y - Q(Dx)\|_2^2 \quad \text{s.t.} \quad \|x\|_0 < \epsilon_p \qquad (2.3)$$
where $\epsilon_p$ is the maximum number of people to be detected.

These optimizations are non-convex and also NP-hard [66], i.e. the numerical complexity is combinatorial in the dimension of the space. This is due to the use of the ℓ0 sparsity term and to the discrete nature of the binary space $\{0, 1\}^N$. We therefore need to simplify these formulations, which we do in the next sections.


2.4.2 Linearized and re-weighted optimizations

In order to linearize (2.2) and (2.3), we first remove the quantization operator Q from their fidelity terms. As a result, occluded people lead to correlated atoms and hence are considered as additional noise on the measured silhouettes, therefore increasing the value of the desired residual error ε or the maximum number of people $\epsilon_p$ in (2.2) and (2.3) respectively. Second, the vector x is now considered in $\mathbb{R}^N_+$ and not in the binary space $\{0, 1\}^N$, while the reconstructed vector is subsequently one-bit quantized to form the binary occupancy vector. An adaptive threshold T, driven by the maximum value of x, is used:
$$x_i = \begin{cases} 0 & \text{if } x_i < T \cdot \max(x), \\ 1 & \text{otherwise}, \end{cases} \qquad (2.4)$$

where $x_i$ is the value of the occupancy vector x at index i. The solution x can converge to any real value given our re-weighted scheme. In addition, the missing quantization operator affects our formulation. Typically, when two people occlude each other, the contribution of each atom should be halved in order to best minimize the fidelity term. Therefore, in order to accept the contribution of two to three atoms occluding each other, we can keep one-fourth of the maximum value (T = 0.25). Figures 2.11 and 2.12 illustrate the performance of our algorithm for various values of T. A small threshold T leads to a high recall rate with a low precision.
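A minimal sketch of the one-bit quantization of Equation (2.4), assuming the reconstructed occupancy vector is available as a NumPy array; the numerical values are arbitrary.

```python
import numpy as np

def binarize_occupancy(x, T=0.25):
    """One-bit quantization of Eq. (2.4): keep cells whose value reaches T times the maximum."""
    x = np.asarray(x, dtype=float)
    if x.max() <= 0:
        return np.zeros_like(x, dtype=np.uint8)
    return (x >= T * x.max()).astype(np.uint8)

print(binarize_occupancy([0.02, 0.9, 0.5, 0.31, 0.0], T=0.25))  # -> [0 1 1 1 0]
```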

Interestingly, in the minimizations (2.2) and (2.3), the ℓ0-norm can be approximated with an iteratively re-weighted ℓ1-norm [67, 68]. The weights used for the next iteration are computed from the value of the current solution, as introduced by Candès et al. in [67]. Without the re-weighted scheme, the solution is not sparse enough. It leads to a very high number of false positives, as presented in [6, 7].

Explicitly, (2.2) leads to the Re-Weighted Basis Pursuit DeNoise (RW-BPDN) algorithm, i.e.
$$x^{(l+1)} = \arg\min_{u \in \mathbb{R}^N_+} \|W^{(l)} u\|_1 \quad \text{s.t.} \quad \|y - Du\|_2 < \epsilon, \qquad (2.5)$$
while (2.3) provides the Re-Weighted Lasso (RW-Lasso), i.e.
$$x^{(l+1)} = \arg\min_{u \in \mathbb{R}^N_+} \|y - Du\|_2^2 \quad \text{s.t.} \quad \|W^{(l)} u\|_1 < \epsilon_p, \qquad (2.6)$$
where, for both equations, the diagonal weighting matrix is defined at each iteration $l > 0$ by $W^{(l)}_{ii} = (|x^{(l)}_i| + \eta)^{-1}$, for $1 \le i \le N$, with $W^{(0)} = \mathrm{Id}$ and the corresponding previous solution $x^{(l)}$. The parameter η is added to assure stability and guarantees that a zero-valued component in x does not strictly prohibit a nonzero estimate at the next iteration. We set $\eta = 10^{-7}$.

Practically, as explained in Appendices A.1 and A.2, at each iteration of the re-weighted process, (2.5) and (2.6) are solved by monotone operator splitting and proximal methods [36, 37].

2.4.3 Occupancy Lasso (O-Lasso)

In this section, we further specialize the Re-Weighted Lasso procedure (2.6) to the particularities of our occupancy reconstruction. As explained below, this involves the addition of two processing steps in the re-weighting loop.


Table 2.1: Greedy RSS Projection

Input: A sparse vector z.
Output: An approximation of Proj_{R_τ} z.
Program:
1. Initialize: r ← z, p ← 0 ∈ R^N.
2. Pick the index i* = arg max_i |r_i|.
3. Set p_{i*} ← r_{i*}, and then r_{i*} ← 0.
4. For all j ∈ supp{r}, if Δ_{i*j} < τ, set r_j ← 0.
5. If supp{r} = ∅, return p and stop; else, return to Step 2.

Repulsive Spatial Sparsity (RSS)

Although the re-weighted ℓ1-norm provides a sparse solution close to the one that would have been obtained with the true ℓ0 "norm", it does not enforce a certain form of spatial sparsity desired in our application. Indeed, taking a simple example, the linearity of our formulation allows two (or more) neighboring points in x to have non-zero values so that the generated silhouette Dx fits a single person with a shape slightly larger than what is prescribed in the dictionary model.

We want however to avoid such a situation and allow the occupancy reconstruction to spend more effort on the reconstruction of other isolated persons in x. We therefore impose that two detected points, i.e. with non-zero components in x, are never closer than a minimum spatial distance related to the minimum surface occupied by a person on the ground¹. This is what we call the concept of Repulsive Spatial Sparsity (RSS).

Mathematically, if $j, k \in \mathrm{supp}\{x\} = \{i : x_i \neq 0\}$ with $j \neq k$, we should have
$$\Delta_{jk} \triangleq \|p(j) - p(k)\|_2 > \tau \qquad (2.7)$$
where p(k) is the location of the k-th cell in the discrete ground plane. We choose a typical value of 70 cm, i.e. the average width of a standing person, for the minimal spatial distance τ between two occupied ground points.

We achieve this result by inserting in the iterative algorithm the non-convex projection
$$p = \mathrm{Proj}_{R_\tau} z \triangleq \arg\min_{x \in R_\tau} \|x - z\|_2,$$
of the current solution z onto the Repulsive Spatial Sparsity (RSS) set
$$R_\tau = \{x \in \mathbb{R}^N : \forall j, k \in \mathrm{supp}\{x\} \text{ s.t. } j \neq k, \ \Delta_{jk} > \tau\}.$$
Practically, we approximate $p = \mathrm{Proj}_{R_\tau} z$ by the suboptimal greedy method detailed in Table 2.1.
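The greedy method of Table 2.1 translates almost line-for-line into code. The sketch below is an unoptimized NumPy transcription, assuming the cell positions p(i) are stored in an N × 2 array `positions` expressed in the same units as τ; it does not include the sorting speed-up mentioned later in this section.

```python
import numpy as np

def rss_projection(z, positions, tau):
    """Greedy approximation of the projection onto the Repulsive Spatial Sparsity set R_tau
    (Table 2.1): keep the strongest components while discarding any non-zero component
    closer than tau to an already selected one."""
    r = np.array(z, dtype=float)
    positions = np.asarray(positions, dtype=float)
    p = np.zeros_like(r)
    while np.any(r != 0):
        i_star = int(np.argmax(np.abs(r)))        # Step 2: strongest remaining component
        p[i_star] = r[i_star]                     # Step 3: keep it ...
        r[i_star] = 0.0                           # ... and remove it from the remainder
        dists = np.linalg.norm(positions - positions[i_star], axis=1)
        r[(dists < tau) & (r != 0)] = 0.0         # Step 4: suppress close-by competitors
    return p
```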

This method converges in at most $n = \|z\|_0$ iterations, i.e. it stops immediately when we already have $z \in R_\tau$, and by construction the output belongs to $R_\tau$.

1. Notice that this approach applies to the detection of other objects, e.g. cars or traffic, with a non-zero ground surface.

Table 2.2: Occupancy Lasso (O-Lasso)

Inputs:
– The Multi-Silhouette Vector y = (y_1^T, ..., y_C^T)^T ∈ R^M.
– A set of ground point locations (cells): {p(j) : 1 ≤ j ≤ N}.
– A stopping tolerance Tol (e.g. Tol = 10^-4).

Output: The occupancy vector x ∈ R^N.

Program:
1. Initialize: l = 0, x^(0) = 0, W^(0) = Id, ε^(0) = N.
2. Solve: z^(l+1) = arg min_{u ∈ R^N_+} ‖y − Du‖_2^2 s.t. ‖W^(l) u‖_1 < ε^(l).
3. RSS projection: x^(l+1) = Proj_{R_τ} z^(l+1).
4. Re-weight: define the diagonal matrix W^(l+1) by W^(l+1)_{ii} = (|x^(l+1)_i| + η)^{-1}, 1 ≤ i ≤ N.
5. Dynamic constraint: ε^(l+1) = ‖x^(l+1)‖_0.
6. Stop if ‖x^(l+1) − x^(l)‖_2 / ‖x^(l+1)‖_2 < Tol; else, set l ← l + 1 and return to Step 2.

In addition, this greedy method implicitly preserves the highest non-zero component of z amongst two or more non-zero components within a distance τ of the highest. This guarantees a sub-optimal minimization of the distance ‖z − p‖_2. Notice that Step 2 can be efficiently realized by sorting the non-zero elements of z by decreasing magnitude, a process that takes at most O(N log N) operations. Indeed, Steps 3 and 4 in Table 2.1 merely insert zeros in the sorted z, so the selection of the maximum in Step 2 can be realized sequentially, taking each time the next non-zero element in the sequence.

Adaptive sparsity level

The Lasso formulation requires the knowledge of the number of people present in the scene. In order to make the algorithm generic enough, we propose an adaptive selection of the sparsity constraint in Equation (2.6). Since the sparsity constraint $\epsilon_p$ bounds the potential number of detected people, an iterative approach is proposed to choose the constraint independently of the number of people present in the scene. First, $\epsilon_p$ is set high enough in order to have a first approximate solution. Typically, $\epsilon_p$ is initialized to the dimension of x (see Section 3.3). Then, $\epsilon_p$ is set to the number of non-zero values obtained after each iteration. Experiments show that the algorithm converges towards the right number of people present in the scene when the Repulsive Spatial Sparsity constraint is used.

The final iterative algorithm, including both the Repulsive Spatial Sparsity and the adaptive sparsity level, is summarized in Table 2.2. It is coined the Occupancy Lasso (O-Lasso). We will see in the next section that it outperforms RW-BPDN, RW-Lasso, and the state-of-the-art method.
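For orientation, the outer loop of Table 2.2 can be sketched as below. The weighted Lasso solve of Step 2 is passed in as a placeholder function `weighted_lasso` (in the thesis it is solved with proximal/forward-backward methods, see Appendix A), and `rss_projection` refers to the sketch given after Table 2.1; only the re-weighting, RSS projection, and adaptive sparsity logic are spelled out. This is a schematic reading of the table, not the optimized implementation used in the experiments.

```python
import numpy as np

def o_lasso(y, D, positions, tau, weighted_lasso, rss_projection,
            tol=1e-4, eta=1e-7, max_iter=50):
    """Schematic O-Lasso loop (Table 2.2); weighted_lasso(y, D, w, eps) is assumed to
    return argmin_u ||y - D u||_2^2 s.t. ||diag(w) u||_1 < eps with u >= 0."""
    N = D.shape[1]
    x = np.zeros(N)
    w = np.ones(N)                                 # W^(0) = Id
    eps = float(N)                                 # adaptive sparsity bound, initialized to the grid size
    for _ in range(max_iter):
        z = weighted_lasso(y, D, w, eps)           # Step 2: weighted Lasso solve
        x_new = rss_projection(z, positions, tau)  # Step 3: repulsive spatial sparsity
        w = 1.0 / (np.abs(x_new) + eta)            # Step 4: re-weighting
        eps = float(np.count_nonzero(x_new))       # Step 5: dynamic sparsity constraint
        if np.linalg.norm(x_new - x) < tol * max(np.linalg.norm(x_new), 1e-12):
            return x_new                           # Step 6: stopping criterion
        x = x_new
    return x
```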

2.4.4 Computational complexity

The computational complexity is mostly dominated by the re-weighted minimization step. Given the number of points N in the occupancy grid and the size M of the MSV vector y, the complexity of O-Lasso (Table 2.2) is $C_{OL} = O(L\,C_{iter})$, where:

– L stands for the number of O-Lasso iterations (indexed by l); typically, the tolerance Tol > 0 set in the stopping criterion in Step 6 leads to about L = 30 iterations in our experiments;

– $C_{iter}$ is the complexity of each iteration, that is, the total complexity of Steps 2 to 5.

From Table 2.2, the value $C_{iter}$ is mainly the sum of:

– the Lasso complexity $C_{Lasso}$ (in Step 2);
– the RSS projection (in Step 3), whose complexity is dominated by that of the vector component ordering in O(N log N) (see Section 2.4.3);
– the re-weighting in O(N) (in Step 4).

The complexity $C_{Lasso}$ of the Lasso solve in Step 2 is related to the complexity $C_D$ of applying D* or D to vectors, times the number $N_{FB}$ (two orders of magnitude) of Forward-Backward iterations described in Appendix A. The complexity related to computing Dx and D*y for x ∈ R^N and y ∈ R^M is bounded by O(NM). In reality, the various dimensionality reductions explained in Section 3.3 reduce the size of the occupancy grid by restricting the search to the visible part of the ground (according to three possible assumptions explained there). In order to have an estimation of the reduction effect, which depends on the foreground silhouettes extracted, we observed that each person induces about N/100 sampling points in the occupancy grid (under Assumption 3). Therefore, for reasonable crowds of about 10 persons, the reduction is of order 10. This can significantly reduce the proportionality factor integrated in $C_D = O(NM)$.

We can finally conclude that $C_{Lasso} = O(N_{FB} M N)$ and, considering only the most complex tasks above, we deduce that $C_{OL} = O(L\, N_{FB} M N)$.

2.5 Experimental results and comparisons

2.5.1 Experiments

Both synthetic and challenging real data are used to evaluate the proposed framework.

Real data have been obtained from the APIDIS dataset¹ and from the PETS 2009 Benchmark dataset².

1. The dataset is publicly available at http://www.apidis.org/Dataset/
2. http://winterpets09.net/


The APIDIS dataset consists of seven cameras monitoring a basketball game, including one omnidirectional camera. The dataset has the following challenges:

– Basketball players have abrupt changes of behavior, e.g. they run, jump, crouch, suddenly change their motion path, etc.

– Players on the same team have the same appearance.
– In some camera views, players greatly occlude each other.
– Some cameras have very similar viewpoints, affecting the resolution of the ambiguities arising with the occlusion problem.
– The reflection of the players on the floor and their strong shadows lead to severely degraded foreground silhouettes. Many false positive silhouettes are extracted with a standard background subtraction algorithm (e.g. the work of Stauffer and Grimson [16]).
– Players interact strongly with each other, and their spatial distribution on the ground can be very dense and compact or spatially scattered.

All videos are scaled to a QVGA resolution at approximately 25 fps. Performance over the left half of the basketball court is measured since it is the side monitored by the largest number of cameras, i.e. cameras with id 1, 2, 4, 5, and 7.

The PETS 2009 Benchmark datasets are multisensor sequences containing different crowd activities filmed from multiple cameras and involve up to approximately forty actors [8]. We evaluate our algorithm on a sparse crowd, as well as on medium and high density crowds (Figure 2.9). Our detection scheme is able to count people in high density crowds given multiple or even a single camera (Figure 2.10)¹.

Synthetic data are constructed given the same scene geometry as the APIDIS dataset. Random vectors x are created given a spatial sparsity constraint, e.g. location points have a minimum spatial distance with respect to each other (> 70 cm). Five to fifteen people are randomly triggered for each frame (a few hundred frames are generated). The synthetic data allow us to evaluate the performance of our algorithms with controlled foreground silhouettes, hence measuring how well the noise or occlusion problem is solved.

The performance of the detection process is quantitatively evaluated by computing the precision and the recall measures given by the ratios TP/(TP + FP) and TP/(TP + FN) respectively, where TP, FP and FN are the numbers of True Positives, False Positives and False Negatives. A true positive is when a person is correctly located on the ground plane.
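A small helper computing these two measures from the counts of matched detections; the matching of a detection to a ground-truth person (e.g. within a fixed ground-plane radius) is assumed to have been done beforehand, and the numbers in the example are made up.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 45 correct localizations, 5 false alarms, 10 missed people.
print(precision_recall(45, 5, 10))   # -> (0.9, 0.818...)
```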

The foreground silhouettes are extracted using the work of Stauffer and Grimson [16]. The outcome of the background subtraction algorithm is noisy. The silhouettes are severely degraded: only part of each person is extracted, shadows and reflections are included, and random false positives are generated due to lighting conditions and camera noise.

2.5.2 Results

1. Videos are available at the following website: http://lts2www.epfl.ch/~alahi/pets.htm

Figure 2.9: Detecting and tracking people on the PETS data set: (a) sparse crowd; (b) medium crowd; (c) high density crowd. White contours represent the degraded foreground silhouettes used.

All three reconstruction methods, i.e. RW-BPDN, RW-Lasso, and O-Lasso, are compared with the state-of-the-art POM approach proposed by Fleuret et al. in [1]. Their algorithm depends on two parameters: the maximum number of iterations and a constant σ accounting for the quality of the background subtraction. We set the maximum number of iterations to 1500 and measure the performance of their algorithm for various σ.

Figure 2.11 illustrates all performances given four cameras monitoring the APIDIS dataset (cameras with id 2, 4, 5, and 7). The proposed Occupancy Lasso (O-Lasso) clearly outperforms the other approaches in terms of both recall and precision. The Re-Weighted Lasso (RW-Lasso) with a fixed sparsity bound outperforms the RW-BPDN with various residual errors ε. The performance of the BPDN formulation is affected by the difficulty of estimating the residual error, i.e. the degradation occurring in the foreground silhouettes. Finally, the state-of-the-art approach (POM) has a much lower precision for a given recall than the sparsity-driven methods. Considering the sparsity of the desired solution allows us to reduce the false positive rate considerably.

Figure 2.10: Locating people with either 3 cameras (left-hand side) or a single camera (right-hand side) given the PETS dataset. White contours represent the degraded foreground silhouettes used.

Figure 2.12 presents all performances given four planar cameras on the synthetic data. It is interesting to notice that both formulations, BPDN with ε = 0 and Lasso with ε_p = ppl, i.e. the number of people present in the scene, lead to the same performance. Since they are equivalent problems and the foreground silhouettes are not noisy, it coincides with our expectations to obtain the same performance. However, relaxing the formulations influences the performances accordingly. Relaxing the fidelity term with the BPDN increases the precision and reduces the recall rate: a high residual bound allows some people to be missed, hence reducing the recall rate. With the Lasso formulation, increasing ε_p increases the number of false positives, hence reducing the precision rate. Interestingly, the state-of-the-art outperforms the RW-BPDN and RW-Lasso on the synthetic data. Since noise is not present in the data, better performances are achieved with POM. However, our proposed O-Lasso again outperforms the other methods. The recall rate is above 90% with a roughly perfect precision rate (> 98%).

The strength of our proposed formulation is emphasized with real data, i.e. when noise is present in the data. BPDN is useful when we can bound the noise. However, in our application, the noise can severely degrade the observations. Hence, the Lasso formulation suits our problem best. Note that the adaptive formulation (O-Lasso) does not need any prior on the number of people present in the scene. It correctly updates the constraint to reach the right number of people. We are able to approximately count and locate high density crowds given multiple or even a single camera. Table 2.3 presents the performance of counting people in high density crowds given the PETS data set. Figure 2.13 illustrates the regions (R1, R2, and R3) that are monitored. The Average Frame Error (AFE) represents the number of people missed (or wrongly counted) per frame. For levels 1 and 2 ("L1" and "L2"), we locate fewer people than the ground truth, whereas for the most difficult level ("L3"), more people are detected. This is due to the very high density crowd of people present in the scene.

2.5.3 Validation

Given the proposed approach based on the O-Lasso, we analyze its performance when (i) the number of cameras is increased, (ii) people occlude each other, and (iii) the extracted silhouettes are degraded.


Figure 2.11: Precision and recall rate on the APIDIS dataset given four cameras monitoring the scene (cameras with id 2, 4, 7, and 1). Our proposed approaches (RW-BPDN, RW-Lasso, and O-Lasso) are compared with the state-of-the-art probability of occupancy (POM) by Fleuret et al. in [1]. (Precision-recall curves shown for RW-BPDN with ε ∈ {0, 10, 20}, RW-Lasso with ε ∈ {11, 13, 15}, O-Lasso with T ∈ {0.25, 0.4, 0.5}, and POM with σ ∈ {0.003, 0.005, 0.007}.)

Sequence   R0 (GT / PC / AFE)            R1 (GT / PC / AFE)              R2 (GT / PC / AFE)
S1-L1-2    2861 / 2186|2305 / 4.2|3.8    1237 / 908|981 / 2.3|1.9        1130 / 848|895 / 1.8|1.8
S1-L2-1    -                             1279 / 558|710 / 10.0|8.0       1622 / 814|940 / 6.8|6.0
S1-L3-1    -                             230 / 163|196 / 2.0|1.2         -
S1-L3-2    -                             2632 / 2084|1550 / 10.1|10.5    -

Table 2.3: Performance of people counting on the scenarios proposed by the PETS organizers, given 1 camera (id=1) and 3 cameras (id=1,2,3); values are reported as "1 camera | 3 cameras". GT = Ground Truth, PC = People Counted by our O-Lasso algorithm, AFE = Average Frame Error. The dataset "S1" is used with various levels of difficulty ("L1": medium density crowd; "L2" and "L3": high density crowd).

One of the advantages of our framework is that it scales to any number of cameras. Therefore, the performance of the system with various numbers of cameras is compared in Figure 2.14. Note that a precise temporal window is selected having all players in the field of view of all cameras. It is interesting to see that even when a single camera is used, we can locate as many people as when using multiple cameras. Nevertheless, adding cameras reduces the number of false positives due to degraded foreground silhouettes. Shadows and players reflected on the ground have a strong impact on the precision rate. In addition, merging an omnidirectional camera with other planar cameras gives the best performance. Surprisingly, if the omnidirectional camera is monitoring the scene alone, a poor detection is achieved due to the severely degraded foreground silhouettes: R = 0.47 and P = 0.55 (not shown in Figure 2.14). Most of the time, a person's shadow is much bigger than his silhouette, considerably affecting the performance. In addition, in some areas, people are almost missed by the background subtraction algorithm since they occupy only a few pixels (small surface). Finally, due to the small bounding box of the people, a small offset in the detection considerably affects the performance and leads to a missed person. Figure 2.18 presents the foreground silhouettes extracted and the detected people given various numbers of cameras.

Figure 2.12: Precision and recall rate with the synthetic data given four cameras. Our proposed approaches (RW-BPDN, RW-Lasso, and O-Lasso) are compared with the state-of-the-art probability of occupancy (POM) by Fleuret et al. in [1]. (Precision-recall curves shown for RW-BPDN with ε ∈ {0, 10, 20}, RW-Lasso with ε ∈ {ppl, ppl+2, ppl+4}, O-Lasso with T ∈ {0.25, 0.4, 0.5}, and POM with σ ∈ {0.003, 0.005, 0.01}.)

Figure 2.13: Regions considered to report the number of detected people in the PETS 2009 data set.

One of the main challenges we want to handle is the mutual occlusion generated by the people. The proposed relaxation step (Section 2.4.2) wrongly considers the occlusion problem as a linear operator. Indeed, according to our model, overlapping silhouettes are not quantized but summed (Dx). Hence, whenever people occlude each other, their silhouettes overlap and generate noise in our fidelity term. We evaluate the performance of our algorithm with respect to the Percentage of Overlap (PO) present in each MSV:
$$\mathrm{PO} = \frac{\text{Number of Overlapping Pixels}}{\text{Number of Visible Pixels}} \qquad (2.8)$$


Figure 2.14: Precision and recall rate when the number of cameras monitoring the scene is increased. The number in each bubble represents the number of cameras used; two configurations are compared: planar cameras only, and planar cameras plus one omnidirectional camera. (The sequence of cameras with id 7, 2, 4, 1 is used for the planar cameras, and camera id 5 for the omnidirectional one.)

We cluster the people present in our synthetic dataset according to their PO with other people and measure how well we have detected them. Figure 2.15 illustrates the recall rate with respect to the PO of the people¹. When half of the MSV overlaps with other MSVs, the recall rate is still higher than 90%. Then, it decreases to reach, in the worst case (when the overlap is more than roughly 90%), a recall of 65%, which is very satisfying. Therefore, we have a generic measurement of the performance of the algorithm in detecting occluded people regardless of the scene geometry and the people's density.

Finally, we measure the performance of our approach with respect to the percentage of degradation present in the foreground silhouettes. The degraded foreground silhouettes are made of false positive or false negative pixels. Setting a percentage of the MSV to zero is equivalent to degrading the foreground silhouettes with false negative pixels. Figure 2.17 illustrates the recall rate when we degrade the silhouettes accordingly. Note that it also indicates the percentage of foreground silhouette needed to trigger a positive detection when the silhouettes are only made of false positive pixels. The recall rate remains higher than 90% although 30% of the silhouettes are removed. Then, the performance decreases considerably with respect to the degradation applied to the silhouettes. Such information tells us about the sensitivity of our Forward Model generating silhouettes: 30% of the generated silhouette can be discarded. In other words, if our silhouettes fit a person with a height of 1 m 70, we can still detect people in the range of 1 m 20 to 2 m 20 with the same performance. Moreover, we can also say that only 70% of the silhouette model is relevant. Hence, the proposed approximated shape, i.e. half rectangular and half ellipsoid, can handle 30% of 'shape' approximation error. Note that other shapes can be used. Table 2.4 illustrates the performance of using other shapes. The rectangular shape is used by Fleuret et al. in [1] to efficiently compare their generative model with the foreground silhouettes based on integral images. However, in our framework, the dictionary is computed off-line and therefore its atoms can correspond to any shape.

1. Remark that the PO cannot be defined for false positives. This prevents a plot of the precision as a function of the PO.

!"

!#$"

!#%"

!#&"

!#'"

!#("

!#)"

!#*"

!#+"

!#,"

$"

$!" %!" &!" '!" (!" )!" *!" +!" ,!" $!!"

!"#$%%&

'"(#")*$+"&,-&,."(%$/&

Figure 2.15: Recall rate with respect to the PO present for the people given the O-Lasso approach.

Figure 2.16: Illustration of our dictionary-based framework applied to the problem of counting cells in microscopic images: (a) original microscopic image of cells; (b) cells detected by labeling connected blobs: 140 detected cells; (c) proposed dictionary-based result: 158 detected cells (12% more). The proposed O-Lasso is able to detect cells even if they partially overlap or are connected with each other.

Typically, our dictionary-based framework can be used to locate cells in microscopic images. Ellipsoid shapes of different sizes and orientations can be used to locate and count cells in medical imaging applications, as in Figure 2.16. The proposed O-Lasso correctly segments cells even if they partially overlap with each other or form a single connected blob (after binarization).


!"

!#$"

!#%"

!#&"

!#'"

!#("

!#)"

!#*"

!#+"

!#,"

$"

$!" %!" &!" '!" (!" )!" *!" +!" ,!" $!!"

!"#$%%&

'"(#")*$+"&,-&./00/)+&0/%1,2"3"&

Figure 2.17: Recall rate with respect to percentage of missing silhouette region extracted for each people giventhe adaptive O-Lasso approach.

Shape                                  Recall   Precision
Rectangular                            0.64     0.87
Ellipsoid                              0.7156   0.85
Half rectangular and half ellipsoid    0.6938   0.9001

Table 2.4: Precision and recall rate using various shapes in the Forward Model, given the O-Lasso method and 4 planar cameras.

2.6 Conclusions

The problem of multi-view people localization has been studied for decades. Researchers have proposed accurate algorithms in well-defined environments such as low-density crowds. However, once people are grouped together and occlude each other, previous works often fail to correctly locate them. In this chapter, we presented a dictionary-based framework to detect and count people in high-density crowds in places such as exhibitions, streets, and airports. For example, it enables institutions to provide a very precise estimation of the number of people attending an event. Our approach relies on the Lasso algorithm with a modified re-weighted scheme reinforcing spatial sparsity. The proposed framework is generic enough to locate several classes of objects and can be applied to other localization problems. The sparsity-driven constraint clearly increases both recall and precision rates, outperforming the state-of-the-art. The obtained performance improvement supports our Claim 1: a few simple elements are enough to locate people using a network of fixed cameras. The approach is generic enough to be used with any calibrated camera, e.g. planar or omnidirectional. However, the computational complexity of the reconstruction approach, and specifically of the re-weighted scheme, is too high for a real-time implementation. In the next chapter, we will present our optimized framework to run the system in real-time.


Figure 2.18: Illustration of the detected people with various numbers of cameras given the APIDIS data set. The green contours represent the degraded foreground silhouettes used.


Chapter 3

Real-Time Scalable People Localization

3.1 Introduction

In the previous chapter, we presented our dictionary-based framework with a sparsity constraint to detect people in a network of fixed cameras. In order to solve the problem, we relaxed the ideal NP-hard problem and proposed a re-weighted scheme (referred to as O-Lasso) that is computationally expensive. Therefore, in this chapter, we present a sparsity-driven formulation based on a greedy algorithm to solve the relaxation of the NP-hard problem in Section 3.2. The proposed approach is 100 times faster with a similar detection rate. A modified version of the Matching Pursuit algorithm is presented with a new selection criterion coined "Set Covering Object Occupancy Pursuit" (SCOOP).

In addition, to enable our multi-view system to scale and monitor large scenes, we need to reduce the search space, i.e. the number of potential ground plane points to classify as occupied. Related works locating people on the ground discretize the ground plane in a uniform sampling process, leading to a limited area to be monitored [1, 63, 64]. In Section 3.3, we present an adaptive process to sample the ground plane as a function of both the cameras' topology and the scene activity. We reduce the search space by two orders of magnitude while maintaining the same spatial precision.

3.2 Greedy algorithm for sparse approximation

3.2.1 Forward model

Let a single person occupy the scene. This corresponds to an occupancy vector x with only one nonzero element, depending on his location. Moreover, each of the cameras will capture only one silhouette (if they have a view over the position), whose size and position (and possibly shape) depend on the location. In general, for a given configuration of x (depending on the number of people and their positions), there exists a Multi-Silhouette Vector (MSV) which may not necessarily be unique. This non-uniqueness comes from the occlusions in the camera views, which highly depend on the density of the crowd and the positions of the cameras.

Here we define the following forward model, which describes the underlying correspondence between each occupancy vector and its resulting MSV:
$$y = D \cdot x \oplus z. \qquad (3.1)$$

Note that here the operations are Boolean, i.e. products and sums correspond to AND and OR respectively, and ⊕ denotes the bitwise XOR operation between two Boolean vectors. The dictionary $D \in \{0, 1\}^{M \times N}$ is a very large matrix of silhouettes. Each column of D, say $d_i$, indicates the corresponding MSV of an average person who is standing at position i in the scene. Finally, $z \in \{0, 1\}^M$ denotes the noise vector that corrupts the MSV with both missing and extra foreground pixels. This may occur for several reasons, e.g. non-ideal silhouette extraction, non-ideal modeling of the dictionary atoms, shadows, reflections, etc.
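A toy transcription of the Boolean model (3.1): the columns selected by x are OR-ed together (Boolean sum) and the result is XOR-ed with a noise pattern. The random dictionary, noise level, and dimensions are placeholders chosen only to make the snippet self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 200, 40                       # toy MSV length and number of ground cells
D = (rng.random((M, N)) < 0.1)       # hypothetical Boolean dictionary of silhouettes

x = np.zeros(N, dtype=bool)
x[[5, 17]] = True                    # two occupied cells
z = (rng.random(M) < 0.02)           # noise: missing and extra foreground pixels

y_clean = D[:, x].any(axis=1)        # Boolean product/sum: OR of the selected columns
y = np.logical_xor(y_clean, z)       # observed MSV, Eq. (3.1): y = D·x ⊕ z
```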

Given the definitions above, the problem of detecting and localizing individuals in a scene is equivalent to recovering the support of an occupancy vector from a set of inaccurate, noisy MSVs, provided with the knowledge of the dictionary D.

3.2.2 Uniqueness of the solution

Given the forward model defined in Equation 3.1, there can be many realizations of the occupancy vector leading to the same MSV y, which makes any decoding scheme hopeless to recover the original x. In camera detection applications, this ambiguity arises for those positions in the scene that are hidden from the camera views due to possible occlusions. Therefore, any decoder fails to decide correctly whether there are some individuals present at those positions. These situations are related to the setup of the cameras as well as to the distribution of the people in the scene, and bring uncertainty into the localization problem. Let us define the notion of disjunct matrices:

Definition 1: A Boolean matrix D with N columns $d_1, \ldots, d_N$ is (k, e)-disjunct if for every subset $S \subseteq [N]$ with cardinality $|S| \le k$, and every $i \notin S$, we have
$$\Big|\, \mathrm{supp}(d_i) \setminus \mathrm{supp}\Big(\bigcup_{j \in S} d_j\Big) \Big| > e,$$
where $|\cdot|$ represents the $\ell_1$ norm, and $\setminus$ denotes set difference.
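To make Definition 1 concrete, the brute-force checker below enumerates all subsets S of size at most k; it is exponential in k and therefore only usable on tiny dictionaries, and is not part of the proposed system.

```python
import numpy as np
from itertools import combinations

def is_disjunct(D, k, e):
    """Brute-force test of the (k, e)-disjunctness of a Boolean matrix D (columns = atoms):
    for every subset S of at most k columns and every column i not in S, the silhouette d_i
    must keep more than e pixels outside the union of the silhouettes in S."""
    D = D.astype(bool)
    N = D.shape[1]
    for size in range(0, k + 1):
        for S in combinations(range(N), size):
            union = D[:, list(S)].any(axis=1)        # union of the selected silhouettes
            for i in range(N):
                if i in S:
                    continue
                uncovered = np.count_nonzero(D[:, i] & ~union)
                if uncovered <= e:                   # a counterexample to disjunctness
                    return False
    return True
```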

Assuming the occupancy vector to be k-sparse (i.e., x has less than k nonzero elements), the following proposition guarantees the uniqueness of the solution to Equation 3.1:


Proposition 1: Having a (k, 2e)-disjunct dictionary implies a one-to-one mapping between all k-sparse x and their corresponding y. Conversely, if every k-sparse x leads to a distinguishable y, then D must be (k−1, 2e)-disjunct.

The proof of Proposition 1 is presented in the Appendix. In our application, having a (k, 2e)-disjunct dictionary provides a robust guarantee of no full occlusion for any k silhouettes, which avoids uncertainty at the decoder side. Intuitively, this condition implies that any person at any position must have a silhouette with enough distinguishable pixels to be robust against the noise and not be merged (occluded) into the silhouettes of other people.

As the number of people in the scene increases, occlusions (counterexamples to disjunctness) become more probable; thus, Proposition 1 can also be interpreted as an upper bound on the number of individuals that can be reliably localized by a fixed camera setup. Moreover, by changing the camera setup, e.g., increasing the number of cameras and selecting their positions well with respect to the scene, one can design optimal dictionaries that are disjunct for a larger number of people.

Note that verifying whether a matrix is (k, e)-disjunct is a combinatorially hard problem, which makes Proposition 1 impractical for large-scale setups. Moreover, Proposition 1 provides a worst-case analysis for Equation 3.1, since it guarantees uniqueness for all occupancy vectors and is robust against any adversarial noise setup. In practice, simulation results indicate reliable recovery under much milder conditions than in Proposition 1, because the worst-case situations are unlikely.

3.2.3 Recovery algorithms

The Thresholding algorithm [71] can be used to recover the occupancy vector x from the corresponding MSV y.

Thresholding: Select the columns of D that satisfy the following inequality:

|supp(d_i) \ supp(y)| ≤ e, (3.2)

and indicate their corresponding indices as the support of x.
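A minimal sketch of this Thresholding decoder, assuming Boolean numpy arrays as in the forward-model example above; the function name and the noise bound e are illustrative.

```python
import numpy as np

def thresholding_decoder(D, y, e):
    """Return the indices i whose atom has at most e pixels outside supp(y) (Eq. 3.2)."""
    # For each atom, count its pixels that fall outside the observed support.
    uncovered = np.logical_and(D, ~y[:, None]).sum(axis=0)
    return np.flatnonzero(uncovered <= e)

# Example (with D and y from the forward-model sketch above):
# support_estimate = thresholding_decoder(D, y, e=1)
```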

The following theorem characterizes the performance of Thresholding and shows its optimality:

Theorem 2 Thresholding successfully recovers any k-sparse occupancy vector, if D is (k, 2e)-disjunct.

Theorem 2 shows that Thresholding reaches the optimal bound of Proposition 1 as long as there is a unique solution to Equation 3.1 (the proof is presented in the Appendix). However, note that if the dictionary is not (k, 2e)-disjunct, then there exist column indices i ∉ S that also satisfy (3.2), and therefore the recovery is not exact but contains the support of the original occupancy vector, i.e., supp(x) ⊂ supp(x̂), where x̂ denotes the recovered occupancy vector.

In practical setups, dictionaries are normally disjunct only for very small values of k. The value k is directly related to the number of people that we can locate. In our application, the highest value k for which the dictionary is (k, 2e)-disjunct is much smaller than the number of potential individuals in the scene. As a result, for typical populated scenes occlusions become more probable, and therefore many realizations of the occupancy vector solve Equation 3.1 (no unique solution). In this case, Thresholding recovers a solution supported on a superset containing the supports of all possible solutions; by its nature, it recovers the union of all possible solutions of our forward model. Therefore, it leads to a high number of false positives. We need an additional prior to better recover the exact solution. According to our claims, the desired solution is sparse. Among all possible solutions, we set our problem to the recovery of the sparsest occupancy vector x̂:

x̂ = arg min_{x ∈ {0,1}^N} |supp(x)|   s.t.   |supp(y ⊕ D · x)| ≤ e. (3.3)

The solution of Equation 3.3 can be approximated with greedy algorithms. In the research literature, many greedy algorithms exist to reconstruct sparse signals [38, 39]. Such algorithms follow the heuristic of making the locally optimal choice iteratively, with the hope of finding the global optimum. The Matching Pursuit (MP) algorithm [72] works iteratively and recovers one element of the support set per iteration. More precisely, at each iteration, MP selects the atom i of the dictionary D that contributes the most to the energy of the signal y, i.e., the atom most correlated with the signal, and removes its contribution to update the signal (remainder). This procedure continues until a stopping criterion is met. This criterion can be either a priori knowledge of the sparsity level or a threshold on the energy of the remainder. For the first criterion the algorithm performs k iterations, and for the latter it runs until the remainder energy falls below a limit ε.

3.2.4 Set Covering Object Occupancy Pursuit: SCOOP

We modify the MP algorithm to better suit our dictionary properties and the non-linearity of Equation 3.3. The latter is an extension of the well-known set cover problem [73], where, in the noiseless case (e = 0), it recovers the set with minimal cardinality that covers the support of y. In the noisy case, however, we relax the constraint (fidelity term) by e, since we are not interested in approximating the noise. Set cover is known as one of Karp's 21 NP-complete problems; thus, designing feasible algorithms that can approximate the solution with polynomial time complexity is of high importance.

Our proposed algorithm is presented in Table 3.1 and coined 'Set Covering Object Occupancy Pursuit' (SCOOP). The proposed localization problem involves non-normalized dictionaries, i.e., the columns corresponding to objects at the far back of the scene have much less energy than the ones in front. As a result, a selection criterion based on maximal coherence (as in MP) often chooses high-energy columns that strongly cover the signal but also many pixels outside the signal support. For example, a person at the back of the scene with a small silhouette is mistakenly approximated by a person in front with a much larger, but well-covering silhouette. This indicates that some high-energy columns of D, despite their good covering, do not fit the silhouettes well, which calls for modifications of the selection criterion to avoid such mistakes. We address this problem so that, at each iteration, the selection criterion searches for the column i that has the minimum difference with the remainder (i.e., the MSV at the initial step). Thus, the selected


Table 3.1: 'Set Covering Object Occupancy Pursuit' (SCOOP) algorithm

Input: MSV signal y, the regularization factor w.
Output: Support set Λ, hence the occupancy vector x̂.
Program:

1. Initialize: Λ = {}, R ⇐ y, ŷ ⇐ 0

2. while (ε > 0) do
     for (i = 1 : N) do
        Compute the sequence of statistics {f_i}, i = 1, ..., N:

           f_i ⇐ w · W(D_i ∧ R) / W(D_i) + (1 − w) · W(D_i ∧ R) / W(y).   (3.4)

     j ⇐ arg max_i (f_i)
     Updates:
        Recovered support: Λ ⇐ Λ ∪ {j}
        RSS projection: x̂ = Proj_{Rτ}(Λ)
        Recovered MSV: ŷ ⇐ ŷ + D_j
        Remainder: R ⇐ R \ ŷ
        Error: ε ⇐ y ⊕ ŷ

where W(·) and W(·, ·) are respectively the Hamming weight and Hamming distance of Boolean vectors.

column, in addition to having a good covering (to minimize the difference with the signal), must fit the MSV well, i.e., not contain many extra pixels outside the MSV support:

i ⇐ arg min_i { w · |supp(y \ d_i)| / |supp(y)| + (1 − w) · |supp(d_i \ y)| / |supp(d_i)| },

where w is a regularization factor (w ∈ [0, 1]) that balances the penalty on uncovered pixels in the signal support and on extra covered pixels outside the signal support. This gives the algorithm a degree of freedom to emphasize either the covering or the fitting factor of the columns. Such a measure better respects the sparsity constraint. Several atoms with poor coherence can lead to the same energy reduction as one atom with high coherence. In practice, when cameras are located at head level, atoms approximating far-away people can be covered by atoms corresponding to close-by people. Moreover, in urban scenes, atoms corresponding to cars or trucks can cover several atoms corresponding to people standing at the same locations. The proposed selection criterion promotes the selection of the largest atom in case of ambiguity. Typically, it prefers to select a single atom of a truck rather than several people in the scene. Likewise, it prefers to select one single close-by person instead of several far-away people. As a result, we can say that it promotes a sparse solution.
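A minimal sketch of the SCOOP-style greedy selection loop built on this criterion, again assuming Boolean numpy arrays; the stopping rule, the maximum number of iterations, and the exact normalization are simplifications of the procedure in Table 3.1, not the thesis implementation.

```python
import numpy as np

def scoop(D, y, w=0.5, max_iter=20):
    """Greedy set-covering pursuit approximating Eq. 3.3 (illustrative sketch)."""
    support = []
    y_hat = np.zeros_like(y)
    R = y.copy()                                  # remainder (uncovered MSV pixels)
    for _ in range(max_iter):
        if not R.any():
            break
        # Fraction of the remainder left uncovered by each atom (covering term) ...
        miss = np.logical_and(R[:, None], ~D).sum(axis=0) / max(int(R.sum()), 1)
        # ... and fraction of each atom falling outside the observed MSV (fitting term).
        extra = np.logical_and(D, ~y[:, None]).sum(axis=0) / np.maximum(D.sum(axis=0), 1)
        j = int(np.argmin(w * miss + (1 - w) * extra))
        support.append(j)
        y_hat = np.logical_or(y_hat, D[:, j])     # recovered MSV
        R = np.logical_and(R, ~y_hat)             # update remainder
    x_hat = np.zeros(D.shape[1], dtype=bool)
    x_hat[support] = True
    return x_hat, support
```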


3.2.5 Experimental results and comparisons

We performed the same experiments as in the previous chapter (Section 2.5.1). The APIDIS data set¹ and the PETS 2009 Benchmark data set² are used to compare the performance of the proposed SCOOP algorithm with the state-of-the-art O-Lasso approach.

First, we compare SCOOP with the sparsity driven formulations presented in [1] and the work of Fleuret et al. in [1] (referred to as POM). Figure 3.1 illustrates the performance over the APIDIS data set, where the foreground silhouettes are severely degraded. It can be seen that the performance of SCOOP is similar to the state-of-the-art O-Lasso, i.e., a similar recall rate for the same precision. Given synthetic data, we also obtain similar performance (see Figure 3.2). SCOOP outperforms other sparsity driven formulations such as RW-BPDN and RW-Lasso presented in [1], as well as POM [1].

Table 3.2 illustrates the performance over the PETS data set. The proposed SCOOP algorithm performs better than O-Lasso since the Average Frame Error (AFE) is reduced. Such a gain in performance is explained by the Boolean operator used with the greedy approach. The non-linear phenomena occurring with occlusions are better handled by SCOOP. Therefore, crowded scenes are better handled with such an approach.

Figure 3.3 compares the robustness of SCOOP and O-Lasso given degraded foreground silhouettes. We measure the recall rate with respect to the percentage of removed foreground silhouette (MSV) per person. Clearly, degrading the silhouettes degrades the recall rate. However, it is interesting to notice that SCOOP is more robust to the degradation, i.e., for a given degradation, the recall rate is higher than with O-Lasso. Even after removing half of the foreground silhouettes, we can still locate people 90% of the time.

Finally, we measure the computation time of SCOOP as opposed to O-Lasso. Given a non-optimized Matlab implementation of both algorithms, SCOOP locates the basketball players in 0.1 sec/frame on average, whereas O-Lasso takes 10 sec/frame given 5 cameras³.

3.3 Dimensionality reduction

If many cameras are used, the dimensions of y and x become an issue in Equations (2.5) and (2.6). These sizes indeed define both the dimensionality of D, which requires a large memory storage, and the total computational time of the algorithms. There exist, however, some possibilities for dimensionality reduction that we detail below.

3.3.1 Dimensionality reduction on the observations

The dimension of the observation vector y is by default equal to the sum of the camera resolutions. To reduce the computation cost, all images are first downscaled to QVGA resolution (320 × 240). A background subtraction algorithm extracts foreground silhouettes at the QVGA

1. The data set is publicly available at http://www.apidis.org/Dataset/
2. http://winterpets09.net/
3. A C/C++ version of SCOOP has been implemented, leading to a high frame rate (>25 frames/sec).


!"#$%&'()*+(

!"#$%&'()*,+(

!"#$%&'()*-+(

!"#./001(()*,,(

!"#./001(()*,2(

!"#./001(()*,3(

4#./001((5*+63(

%47((8*+6++2(

%47((8*+6++3(

%47((8*+6++9(

4#./001((5*+6-3(4#./001((5*+6:(

!"##$%&'()*%!"##$%&'()+% !"##$%&'(),%

+623(

+6:3(

+633(

+6;3(

+693(

+6<3(

+623( +6:3( +633( +6;3( +693( +6<3( +6=3(

!>?/@@(

%A>?B0B1C(

Figure 3.1: Precision and recall rate with the synthetic data given four cameras. Our proposed approachSCOOP is compared with other sparsity driven formulation and the probability of occupancy (POM) approachpresented by Fleuret et al. in [1].

resolution. Then the image plane of each camera view is cropped to the region where people can occur. Finally, all images y_c are normalized to the same size (107 × 80).
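A minimal sketch of this per-camera preprocessing chain (downscale to QVGA, foreground extraction, crop, normalize), assuming OpenCV; the GMM-based background subtractor, the crop box, and the output size handling are illustrative choices, not the exact implementation used in the thesis.

```python
import cv2
import numpy as np

# One background subtractor per camera (a GMM-based model, as an example).
bg_model = cv2.createBackgroundSubtractorMOG2()

def observation_vector(frame_bgr, crop_box, out_size=(107, 80)):
    """Reduce one camera frame to a normalized boolean silhouette image y_c."""
    x0, y0, x1, y1 = crop_box                      # region where people can occur
    small = cv2.resize(frame_bgr, (320, 240))      # QVGA resolution
    fg = bg_model.apply(small)                     # foreground mask (0/127/255)
    fg = (fg > 200).astype(np.uint8) * 255         # drop shadow label, keep foreground
    fg = fg[y0:y1, x0:x1]                          # crop to the region of interest
    fg = cv2.resize(fg, out_size, interpolation=cv2.INTER_NEAREST)
    return fg > 0                                  # boolean silhouette, normalized size
```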

!"#$%&'()*+(!"#$%&'()*,+(

!"#$%&'()*-+(

!"#./001(()*223(

!"#./001(()*2234-(

!"#./001(()*22345(

6#./001((7*+89(

%6:((;*+8+,(

%6:((;*+8++<(

%6:(;*+8++9((

6#./001((7*+85(6#./001((7*+8-9(!"##$%&'()*%

!"##$%&'()+% !"##$%&'(),%

+89(

+899(

+8=(

+8=9(

+8>(

+8>9(

+8?(

+8?9(

+8@(

+8@9(

,(

+8@,( +8@-( +8@<( +8@5( +8@9( +8@=( +8@>( +8@?( +8@@( ,(

!AB/33(

%CABD0D1E(

Figure 3.2: Precision and recall rate with the synthetic data given four cameras. Our proposed approachSCOOP is compared with other sparsity driven formulation and the probability of occupancy (POM) approachpresented by Fleuret et al. in [1].


!"

!#$"

!#%"

!#&"

!#'"

!#("

!#)"

!#*"

!#+"

!#,"

$"

$!" %!" &!" '!" (!" )!" *!" +!" ,!" $!!"

-./011"

2.3/.4506."78"9:;;:46"<:1=7>.?."

@AB0;;7"

<C@@2"

Figure 3.3: Recall rate with respect to the degradation on the foreground silhouettes.

Figure 3.4: Illustration of the detected players with the SCOOP algorithm given the APIDIS data set (frames 550, 650, 750, and 850).


S1-L1-2 (GT: R0 = 2861, R1 = 1237, R2 = 1130)
  O-Lasso:  R0: PC 2186|2305, AFE 4.2|3.8    R1: PC 908|981, AFE 2.3|1.9      R2: PC 848|895, AFE 1.8|1.8
  SCOOP:    R0: PC 2669|2753, AFE 1.6|2.4    R1: PC 1113|1147, AFE 1.2|1.2    R2: PC 1065|1070, AFE 0.9|1.3
S1-L2-1 (GT: R1 = 1279, R2 = 1622)
  O-Lasso:  R1: PC 558|710, AFE 10.0|8.0     R2: PC 814|940, AFE 6.8|6.0
  SCOOP:    R1: PC 832|914, AFE 6.5|5.7      R2: PC 1008|1104, AFE 5.3|4.7
S1-L3-1 (GT: R1 = 230)
  O-Lasso:  R1: PC 163|196, AFE 2.0|1.2
  SCOOP:    R1: PC 258|231, AFE 0.9|1.0
S1-L3-2 (GT: R1 = 2632)
  O-Lasso:  R1: PC 2084|1550, AFE 10.1|10.5
  SCOOP:    R1: PC 2717|1647, AFE 8.8|9.2

Table 3.2: Performance of people counting on the scenarios proposed by the PETS organizers given 1 camera (id=1) and 3 cameras (id=1,2,3): GT = Ground Truth, PC = People Counted by our algorithm, AFE = Average Frame Error (each PC and AFE entry reports a pair of values, one per camera configuration). The data set "S1" is used with various levels of difficulty ("L1": medium density crowd; "L2" and "L3": high density of people).

Figure 3.5: Illustration of the adaptive sampling process. Top row: sample points given a regularly spaced grid. Bottom row: proposed non-regular grid.

3.3.2 Dimensionality reduction in the search space

The complexity cost depends on the number N of ground plane points to be classified as occupied or not. Fleuret et al. in [1] discretize the visible part of the ground into a fixed number of regularly spaced points. They consider neither the resolution of the cameras nor the sparsity of the people present in the scene when discretizing the ground. In this work, we address these considerations.

Two different ground plane points can correspond to the same pixel in the image plane of a camera. A translation of one pixel in the image plane can be equivalent to a translation of a few meters on the ground plane for far-away regions, just as a translation of a few centimeters on the ground plane can correspond to a shift of several pixels for closer regions, depending on the


Figure 3.6: Overview of the adaptive sampling process (image points are mapped to the ground and quantized).

resolution of the camera and the distance of the objects to the camera. Therefore, we propose a non-regularly spaced sampling process to discretize the ground (see Figure 3.6). Points regularly spaced in the image plane of all cameras are mapped to the ground to form the sample points. The mapped location points are quantized to avoid points spaced by less than a few centimeters. Figure 3.5 compares a regularly spaced grid with our proposed non-regular grid. Although our proposed grid has fewer points, regions of interest have a higher density of points, i.e., a higher spatial resolution. In order to have the same spatial resolution in the region of interest with a regular grid, 42777 points are needed, compared to 5644 with our proposed non-regular grid. A further reduction in the search space can be achieved by measuring the activity of a sample point according to three possible assumptions.
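Before detailing these assumptions, the following is a minimal sketch of the mapping-and-quantization step described above, assuming a ground-plane homography is available for each camera; the grid step and quantization cell are illustrative values.

```python
import numpy as np

def adaptive_ground_samples(homographies, image_size=(320, 240), step=8, cell=0.05):
    """Build a non-regular ground grid from regularly spaced image points.

    homographies : list of 3x3 arrays mapping image pixels to ground coordinates (metres).
    step         : spacing (pixels) of the regular grid in each image plane.
    cell         : quantization cell size on the ground (metres).
    """
    w, h = image_size
    u, v = np.meshgrid(np.arange(0, w, step), np.arange(0, h, step))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(u.size)])   # homogeneous pixels
    points = set()
    for H in homographies:
        g = H @ pix
        g = g[:2] / g[2]                                      # ground coordinates
        # Quantize to avoid sample points closer than a few centimetres.
        q = np.round(g / cell).astype(int)
        points.update(map(tuple, q.T))
    return np.array(sorted(points)) * cell                    # back to metres
```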

Assumption 1 (Foreground pixels only)
Sample points are ground plane points belonging to the foreground pixels of at least one camera.

Under this assumption, each foreground pixel represents the potential feet location of a person. Each image plane is downscaled to reduce the number of sample points, as explained in Section 3.3.1. Given the calibration data, each point of each camera is mapped to a ground plane point sampling x. In order not to miss a potential ground point, each foreground pixel is also considered as the upper limit (the head) of a person. Therefore, missing the feet region in the foreground will not affect the sampling process.

Assumption 2 (Intersecting foreground pixels)


Sample points are ground plane points belonging to the foreground pixels of all the cameras observing the corresponding points.

Assumption 2 is similar to the work of Khan and Shah in [61]. However, such a step may be affected by degraded foreground silhouette extraction, e.g., including shadows (see Figure 3.7). Missing foreground pixels in some views can lead to missed people, whereas false positive foreground pixels induce high false positive rates.

Assumption 3 (Least significant silhouette)
Sample points are ground plane points corresponding to a significant foreground silhouette in all the cameras observing the corresponding points.

In this last assumption, the sample points x are kept if

∀c ∈ C : y_c^T Q(D_c x) / ‖D_c x‖_0 > δ. (3.5)

Typically, δ is set to 20%. The operator Q is the one-bit quantizer defined in Section 2.3.2. The constraint δ represents the minimum fraction of foreground pixels needed to keep a sample point.
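A minimal sketch of this per-point test (Equation 3.5), assuming that for a candidate ground point we are given, for each observing camera c, the observed silhouette y_c and the expected silhouette D_c x as Boolean arrays; names and the default δ are illustrative.

```python
import numpy as np

def keep_sample_point(ys, atoms, delta=0.2):
    """Assumption 3: keep a candidate ground point only if, in every camera
    observing it, at least a fraction `delta` of its expected silhouette
    (the atom D_c x) is covered by the observed foreground y_c.
    """
    for y_c, atom_c in zip(ys, atoms):        # loop over observing cameras
        expected = max(int(atom_c.sum()), 1)  # ||D_c x||_0
        covered = np.logical_and(y_c, atom_c).sum()
        if covered / expected <= delta:
            return False
    return True
```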

Figure 3.7 presents the three strategies used to reduce the search space. As expected from their definitions, we observe that these assumptions have different impacts on the dimensionality reduction, i.e., we have approximately Assumption 3 ⊂ Assumption 2 ⊂ Assumption 1. When an assumption further reduces the search space, it comes with the drawback of potentially removing correct locations. However, reducing the search space increases the likelihood of better localizing people. Figure 3.8 shows that for each proposed reduction step, the recall and precision rates increase with O-Lasso (given camera ids 2, 5, and 7 in the APIDIS dataset).

3.4 Conclusions

In the previous chapter, we showed the benefits of our sparsity driven formulation and of algorithms solving it with convex optimization programs. It was computationally expensive. Therefore, in this chapter, we presented our greedy algorithm, "Set Covering Object Occupancy Pursuit" (SCOOP). It is 100 times faster, achieves better detection rates in crowded environments, and is also more robust to noisy observations. In addition, we presented a non-uniform adaptive sampling process that does not impose any constraints on the scene surface to be monitored. As a result, we developed real-time people detection software that runs on a standard laptop (2.4 GHz), merging the contributions of 3 cameras (webcams) connected to the USB ports. Most of the CPU usage is taken by the foreground extraction process based on the GMM algorithm [16]. In the next chapter, we complete the full picture by adding a tracking algorithm to link the detected people across frames, and propose a probability map to understand people's behaviors.


Figure 3.7: Illustration of the sample points used given the three strategies. (a) Camera view examples. (b) Corresponding foreground silhouettes. (c) People's exact locations (top view). Sampled points are given in (d) for the "Foreground pixels only" assumption (top view), in (e) for "Intersecting foreground pixels", and in (f) for "Least significant silhouette".

Figure 3.8: Precision and recall rate of our proposed algorithm (O-Lasso) given four cameras monitoring the scene (camera ids 2, 4, 5, and 7 in the APIDIS dataset) and the various search space reduction assumptions ("Foreground pixels only", "Intersecting foreground pixels", "Least significant silhouette").


People Behavior Analysis

4.1 Introduction

In the previous chapters, we focused on techniques to robustly detect and locate people in a fixed time frame. In this chapter, we aim to capture the dynamics of moving people by tracking them across time frames. The tracking problem has been extensively studied in the research community [47, 75]. We are interested in tracking algorithms addressing the temporal evolution of our occupancy vectors presented in Section 2.3.1. Once people are located on the ground, how can we link the detections across time? In other words, each active point in our occupancy vector needs to find a correspondence in the next vector; this is often referred to as the data association problem. In the next section, we present tracking algorithms that can fit into our sparsity driven framework. To pursue our initial goal of developing a full real-time system, we present a graph-based tracking algorithm called the Dijkstra Pursuit Tracker (DPT). Finally, to complete the system, we present in Section 4.3 a sparse probability map to understand the most salient locations and paths of people in a given scene.

4.2 Tracking people

4.2.1 Multi-view tracking

In this chapter, we focus on tracking algorithms fitting our detection framework policy. Taj and Cavallaro in [76] cluster the multi-view tracking problem into three categories: 'track first', 'fuse first', and 'manifold-based'. 'Track first' supposes that people are detected and tracked by each camera and fused afterwards [77]. 'Manifold-based' approaches correspond to methods where camera calibration information is not available and features are projected into a manifold [78]. Such a category is addressed in Chapter 6. Our framework policy matches the second category, i.e., 'fuse first' approaches, similar to the following works:

Mittal and Davis in [79] use the well-known Kalman filter to solve the tracking problem. However, their approach does not deal with non-linear and non-Gaussian systems. Focken and Stiefelhagen in [80] present a Multi-hypothesis Probabilistic Tracker to outperform the best-hypothesis approach based on the Kalman filter. Using multiple hypotheses copes with some of the limitations of Kalman filtering. As a result, particle filtering is a quite popular alternative [81]. Kim and Davis in [54] use particle filtering to link detected people. Bayesian filtering techniques are very popular. However, these approaches are based on short time intervals, due to the computational complexity of monitoring a large time window, and hence can still assign wrong identities. In order to consider large time windows (several hundreds of frames), graph-based methods are used. Khan and Shah propose a graph-based algorithm where every node corresponds to a point in a space-time volume [61]. All the occupancy maps are stacked together to form the given volume. They segment people's trajectories with a graph cut algorithm. In [82], Shafique and Shah use a directed graph where nodes correspond to the detected points only. They use a greedy algorithm that prunes unlikely motions. Fleuret et al. use dynamic programming based on the Viterbi algorithm [1], similar to the previous work of Wolf et al. [83]. In order to scale with the size of the problem and reduce the complexity of segmenting several trajectories simultaneously, Fleuret et al. propose a sequential approach. However, their approach is sensitive to false negatives and poor detection results. It mixes trajectories in dense crowds, according to Berclaz et al. in [84]. Leibe et al. in [85] couple detection and tracking into a Quadratic Boolean Program solved by the EM algorithm. The thresholding step to localize people can be seen as a bottleneck for tracking the resulting detections. Taj and Cavallaro in [86] present a 'track-before-detect' Bayesian approach to track people with lower signal strength based on particle filtering. However, it is still computationally expensive.

4.2.2 Dijkstra Pursuit Tracker

We propose a simple tracking algorithm that can process large temporal windows and is robust to the errors produced by the thresholding step, such as missed detections and false positives. The proposed approach is similar to [64], where Viterbi is used to estimate the trajectories whereas we use Dijkstra. With the Dijkstra method, shortest paths (used as track candidates in the Dijkstra pursuit) can integrate connections between close but non-consecutive temporal frames. This allows tracking even when a few frames have missing or corrupted information. This could somehow be interpreted as a second-order Hidden Markov Model in the Viterbi formalism. That is why we picked Dijkstra as opposed to Viterbi. Our objective is simply to prove that the non-empty locations of the occupancy vectors detected at each time of a video, i.e., the positions of the spatio-temporal occupancy vector, can be tracked across time according to a simple spatio-temporal connectivity criterion. The output of this procedure is also a sorting of people's trajectories by decreasing tracking period.

Spatio-Temporal Graph

Our tracking method relies on the definition of a directed graph on the spatio-temporal occupancy vector x(t) ∈ R^N, where t is taken in the discrete time interval {t_1, ..., t_{Nt}} composed of Nt instants t_j < t_{j+1}.


The graph G = (V,E, dG) of interest corresponds to:

(i) a set of spatio-temporal vertices

    V = { (q, t_j) ∈ R^3 : 1 ≤ j ≤ Nt, q ∈ p(x(t_j)) },

    with p(u) = {p(i) : 1 ≤ i ≤ N, u_i ≠ 0}, where p(i) ∈ R^2 is, as before, the location of the i-th cell in the discrete ground plane,

(ii) an edge set E = V × V defining the connectivity between vertices in V,

(iii) and a distance dG : E → R+ weighting these edges.

In this graph G, the length |P| of a given connected path P = v_1 v_2 · · · v_K of K − 1 "hops" between K distinct vertices v_j ∈ V, following K − 1 edges (v_j, v_{j+1}) ∈ E, is simply defined as the sum |P| = Σ_{j=1}^{K−1} dG(v_j, v_{j+1}).

The vertex set V contains at most N·Nt elements. However, in our case the graph is essentially empty compared to its potential dimensionality. Indeed, as a result of our methods, for each time t_j the occupancy vector x(t_j) is spatially sparse, inducing a small coordinate set p(x(t_j)).

Moreover, as described hereafter, the proposed connectivity follows a particular causal geometry that prevents too long or unrealistic connections. First, we consider a directed connectivity where the vertex v = (q, t) is connected to v′ = (q′, t′) only if t′ > t. Second, given a prior maximal speed Vmax > 0 for the people motion, two vertices v and v′ in V cannot be connected if ‖q − q′‖2 > Vmax (t′ − t) > 0. This induces a causality in the connectivity, preventing connections between events that cannot result from a valid people motion. Third, for vertices respecting this causality, the corresponding edge (v, v′) is weighted by

dG(v, v′) = ‖q − q′‖2 + γ ϕ(t′ − t), (4.1)

for a certain factor γ > 0 balancing the influence of the spatial and time domains, and given a particular increasing function ϕ.

The role of this function ϕ is to allow a certain flexibility in the selection of paths of minimal length in G, i.e., in what will define our tracking procedure described in Section 4.2.2. For instance, for ϕ(t) = t, a direct path v_1 v_3 joining v_1 = (q, t_j) to v_3 = (q, t_{j+2}), with both vertices sharing the same spatial coordinate q, will always have a smaller length, i.e., |v_1 v_3| = γ (t_{j+2} − t_j), than an indirect path v_1 v_2 v_3 with v_2 = (q′, t_{j+1}), of length |v_1 v_2 v_3| = γ (t_{j+2} − t_j) + 2‖q − q′‖2. We consider however that the indirect path is valid in our tracking if the causality between v_1 and v_2 is respected.

It is easy to prove that taking a ϕ that increases faster than linearly prevents such a case. We took ϕ(t) = exp t, a choice that also penalizes temporally too long "1-hop" paths. The factor γ is set as a function of Vmax so that the indirect path in the example above is selected over the direct one as soon as the causality between the vertices is respected.
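A minimal sketch of this causal connectivity and of the edge weight of Equation 4.1, assuming vertices stored as (position, time) pairs; the values of Vmax and γ are illustrative placeholders.

```python
import numpy as np

V_MAX = 2.0    # prior maximal walking speed (m/s), illustrative
GAMMA = 0.5    # spatial/temporal balance factor, illustrative

def edge_weight(v, v_prime):
    """Return d_G(v, v') as in Eq. 4.1, or None if the edge violates causality."""
    (q, t), (q2, t2) = v, v_prime
    if t2 <= t:                                   # directed, forward in time only
        return None
    dist = np.linalg.norm(np.asarray(q2) - np.asarray(q))
    if dist > V_MAX * (t2 - t):                   # unreachable at the maximal speed
        return None
    return dist + GAMMA * np.exp(t2 - t)          # phi(t) = exp(t)
```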

Dijkstra Pursuit

Given the directed graph defined above, our tracking method iteratively uses the well-known fast Dijkstra algorithm, which computes the shortest paths in a graph, or geodesics, between one source v ∈ V and all the other vertices of V [87].

More precisely, let us first define P(v) ⊂ V as the longest geodesic initiated from v ∈ V in G,


Figure 4.1: Spatio-temporal graph-based tracking. (a) Located people on the ground in (x, y) coordinates at every time frame. (b) Graph construction. (c) 1st iteration of Dijkstra Pursuit. (d) 2nd iteration of Dijkstra Pursuit. (e) 3rd iteration of Dijkstra Pursuit. (f) Final trajectories.

i.e. the longest of all the shortest paths between v and any other point v′ ∈ V.

At the first iteration, our method removes from the initial graph G the vertices of the path P(v*), with v* the vertex providing the longest P, i.e.

v* = arg max_v |P(v)|.

The next iterations are then defined by applying the same process on the residual graphs until reaching an empty one. This iterative procedure, coined Dijkstra Pursuit, is summarized in Table 4.1 and Figure 4.1.

Table 4.2 illustrates the impact of our DPT algorithm. First, the located points obtained with the severely degraded foreground silhouettes are used, given the same experiment presented in Section


Table 4.1: Dijkstra Pursuit

Input: A graph G = (V, E, dG).
Output: A set of tracks T of decreasing length.
Program:

1. Initialize: R ← V, T ← ∅.
2. Pick v* = arg max_{v∈R} |P(v)|.
3. Store: T ← T ∪ P(v*).
4. Update: R ← R \ P(v*), and recompute the connectivity.
5. If #R = 0, return T and stop; else, return to Step 2.
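A minimal sketch of the Dijkstra Pursuit loop of Table 4.1, using the edge-weight function sketched above and networkx for the shortest-path computation; this is only an illustration of the procedure, not the thesis implementation.

```python
import networkx as nx

def dijkstra_pursuit(vertices, edge_weight):
    """Iteratively extract the longest geodesics from the spatio-temporal graph.

    vertices    : list of (position, time) tuples (the nonzero occupancy points).
    edge_weight : function returning d_G(v, v') or None if v and v' are not connected.
    """
    remaining, tracks = list(vertices), []
    while remaining:
        # Rebuild the directed graph on the residual vertex set.
        G = nx.DiGraph()
        G.add_nodes_from(range(len(remaining)))
        for a, va in enumerate(remaining):
            for b, vb in enumerate(remaining):
                w = edge_weight(va, vb)
                if w is not None:
                    G.add_edge(a, b, weight=w)
        # For every source, take the longest of its shortest paths; keep the overall longest.
        best_len, best_path = -1.0, [0]
        for s in G.nodes:
            dist, paths = nx.single_source_dijkstra(G, s, weight="weight")
            far = max(dist, key=dist.get)
            if dist[far] > best_len:
                best_len, best_path = dist[far], paths[far]
        tracks.append([remaining[i] for i in best_path])
        chosen = set(best_path)
        remaining = [v for i, v in enumerate(remaining) if i not in chosen]
    return tracks
```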

Located points given     | No tracking           | Tracking
                         | Recall    Precision   | Recall    Precision
Severely degraded FS     | 0.82      0.92        | 0.88      0.91
Synthetic FS             | 0.93      0.965       | 0.96      0.967

Table 4.2: Impact of the Dijkstra Pursuit Tracker on the detection rates.

2.5.1. Then, the located points obtained with the synthetic foreground silhouettes (i.e., perfect silhouettes) are tested. In both scenarios, the recall rate is increased without degrading the precision rate.

4.3 Sparse probability map

4.3.1 Preliminary remarks

Once people are detected and tracked over a long period of time, there is a need to summarize their trajectories. Presenting all the raw trajectories in a centralized interface is not helpful, as illustrated in Figure 4.2. Typically, market researchers cannot identify the most salient spots and paths from such a visualization summary.

Recently, a lot of work has been dedicated to automatically understanding and analyzing people's behavior using their trajectories. Morris and Trivedi in [88] present a survey of trajectory-based activity analysis for video surveillance. They present the POI/AP framework, which consists of defining points of interest (POIs) and analyzing activity paths (AP) between POIs. Makris and Ellis in [89] propose a graph-based approach where nodes represent the POIs and edges the APs. POIs are learned by clustering tracking points with speeds less than a threshold. Brandle et al. in [90] keep points that remain in a circle of radius R for more than a time threshold. Andrienko et al. in [91] determine the POIs by counting the time objects spend at each position. POIs are ranked by simply comparing the times people spend in the POIs. Such a method is referred to as "Naive Ranking" in the remainder of this chapter.


Figure 4.2: Illustration of people’s trajectories in a retail store

Once POIs are established, most of the research effort is on learning activity paths (APs). Morris and Trivedi segment the learning into 3 steps: preprocessing, clustering, and modeling. The preprocessing step involves a trajectory representation suitable for clustering: zero-padding [92], smoothing [93], Hidden Markov models [94], or dimensionality reduction with PCA [95]. Clustering computes similarity measures to compare tracks based on the Euclidean distance, Dynamic Time Warping (DTW), or the Hausdorff distance. Once trajectories are clustered, a more compact representation is used to model trajectories into subpaths [96]. Finally, the modeled trajectories are usually used to classify an event as being abnormal or not. Most of the works related to trajectory analysis concern abnormal situation detection [97, 98] and prediction [99]. Note that the more general activity recognition problem has been addressed in various applications such as home monitoring [100, 101].

In this work, we focus on ranking POIs. Once POIs are located, previous works considered them as transition points to several APs to compute likelihoods. Here, we are interested in better modeling the importance of the POIs, and we propose a novel approach to rank them. We consider the likelihood for a random walker to arrive at a given POI. Such a strategy is used by Google to rank websites: it uses the PageRank algorithm [102] to assign weights to websites. We propose a similar algorithm, named 'SpotRank', in the next section, which allows us to estimate the probabilities of someone going from one spot (i.e., POI) to another as well as their relevance. The output of our SpotRank algorithm is a sparse probability map where POIs are identified, linked, and ranked. Recently, several works have also proposed algorithms similar to PageRank to rank places of interest in a large-scale data flow [103, 104]. However, to our knowledge, researchers have not yet applied such algorithms to analyze people's walking behavior in scenes such as retail stores.

4.3.2 Point Of Interest detection

Given the computed trajectories, POIs need to be identified. Typically, market researchers can manually label the POIs in retail stores. We use an automatic detection scheme similar to the approaches presented in the previous section. The ground is segmented into cells and the number of people standing in each of them is computed. A person is considered as standing in a cell if he


Figure 4.3: Given the captured trajectories of people navigating in the store, a sparse number of salient 'hot spots' (POIs) have been detected. (a) Typical customers' trajectories in a retail store. (b) The corresponding 'hot spots'.

stays there for a period of time longer than a threshold (e.g., more than 2 seconds). The average number of people standing over the cells, µ, and the standard deviation, σ, are computed. According to our Claim 2 below, we are interested in those few cells with more than µ + 2σ people standing. These cells will be our Points of Interest (POIs).

Claim 2 We hypothesize that a scene is made of a sparse number of Points Of Interest (POIs).

Figure 4.3 illustrates the POIs detected given dozens of people's trajectories captured in a retail store. It can be seen that the detected POIs have a semantic meaning: one is at the cashier, another in front of the main aisle, and the remaining ones in front of specific products. We are now interested in ranking these spots (POIs) to better understand people's behavior, i.e., identify the most salient spots by ranking the POIs.
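A minimal sketch of this POI detection rule (cells whose count of standing people exceeds µ + 2σ), assuming trajectories are given as arrays of (x, y, t) ground samples; the cell size and dwell threshold are illustrative, and the statistics are computed over visited cells only.

```python
import numpy as np

def detect_pois(trajectories, cell=0.5, dwell=2.0):
    """Return ground cells where significantly more people stand than on average.

    trajectories : list of (T, 3) arrays with columns (x, y, t) in metres/seconds.
    cell         : cell size of the ground grid (m); dwell : minimum standing time (s).
    """
    counts = {}
    for traj in trajectories:
        cells = np.floor(traj[:, :2] / cell).astype(int)
        # Accumulate, per person, the time spent in each cell.
        stay = {}
        for (cx, cy), t0, t1 in zip(cells[:-1], traj[:-1, 2], traj[1:, 2]):
            stay[(cx, cy)] = stay.get((cx, cy), 0.0) + (t1 - t0)
        # A person 'stands' in a cell if the accumulated time exceeds the dwell threshold.
        for c, duration in stay.items():
            if duration > dwell:
                counts[c] = counts.get(c, 0) + 1
    if not counts:
        return []
    occupancy = np.array(list(counts.values()), dtype=float)
    mu, sigma = occupancy.mean(), occupancy.std()
    return [c for c, n in counts.items() if n > mu + 2 * sigma]
```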

4.3.3 SpotRank

A naive procedure to rank POIs is to simply compare the number of people occupying each of them (referred to as 'Naive Ranking') [91]. Such an approach suffers from one main drawback: it considers the POIs as independent, i.e., the flows between POIs are not taken into account. Although people's behavior can be random, it often respects some patterns. Typically in retail stores, people do implicitly follow some schemes, such as going from one product to its complement, standing in front of the main aisle, browsing in the sales area, and finally ending at the cashier. We propose to handle these flows explicitly as edges of a directed graph whose vertices are given by the POIs.

A modified version of the PageRank algorithm, coined SpotRank, is proposed to rank POIs given their connectivity. PageRank follows the model of a random walker who will, from an initial web page, randomly choose one of the websites it has a link to. A crucial difference is that SpotRank explicitly considers the number of times a link is taken. Typically, in Figure 4.4(a), there are 3 trajectories going from C to D. The usual PageRank algorithm only considers whether or not there is a link between two points. In our case, the 'popularity' of a link is also measured, by simply counting the number of observed trajectories between the POIs.


!" #"

$" %"

(a) Graph of POIs

!"#$ %"#$

%&#$ '&#$

(b) SpotRank

Figure 4.4: (a) Trajectories between four POIs, also referred to as spots. (b) The corresponding SpotRankcoefficients illustrating the most salient POIs.

The random walker initially starts at a POI selected uniformly at random. It then randomly goes to any POI linked to the current one, with a probability proportional to the edge weight. If the random walker is trapped in a sink (i.e., a POI with no outgoing edge), its next step is chosen uniformly at random over all POIs. In Figure 4.4(a), D is a sink. Therefore, we suppose that D has a trajectory going to A, B, and C. The SpotRank coefficients for Figure 4.4(a) are computed as follows:

– SpotRank coefficient of point A: Only one trajectory goes to it, from D. As 3 uniformly weighted paths leave D, the probability of going to A is one third; hence its SpotRank is:

SP(A) = SP(D) / 3, (4.2)

where SP(·) is the SpotRank coefficient of a given POI.
– SpotRank coefficient of point B: Of the 4 paths leaving A, 2 go to B; 1 of the 4 trajectories leaving C goes to B; and again 1 of the 3 trajectories leaving D:

SP(B) = 2 SP(A)/4 + SP(C)/4 + SP(D)/3. (4.3)

We can therefore generalize the SpotRank coefficients with the following equation:

SP(u) = Σ_{v ∈ B_u} SP(v) · k_u(v) / L(v), (4.4)

where u and v are two distinct POIs, B_u is the set of all the POIs having a path to u, L(v) is the total number of paths leaving v, and k_u(v) is the number of paths going from v to u.

Similar to PageRank, we now introduce a damping factor d, which is supposed to represent the "boring" effect of going from one POI to another: the random walker is eventually supposed to stop at one POI. For our analysis, we use d = 0.85, according to the tests reported in [105]. Equation (4.4) can therefore be rewritten as:

SP(u) = (1 − d)/N + d · Σ_{v ∈ B_u} SP(v) · k_u(v) / L(v), (4.5)

where N is the number of POIs.


We can therefore write equation (4.5) in matrix form:

R = [SP(u_1), SP(u_2), ..., SP(u_N)]^T, (4.6)

R = (1 − d)/N · 1 + d · M · R, with M = [l(u_i, u_j)]_{1 ≤ i, j ≤ N}, (4.7)

where:

l(u_i, u_j) = 0 if u_j does not link to u_i, and l(u_i, u_j) = k_{u_i}(u_j) / L(u_j) if u_j links to u_i. (4.8)

Therefore Σ_{i=1}^N l(u_i, u_j) = 1. From Equation (4.7) we can easily compute the solution for R:

R = (I − d M)^{−1} (1 − d)/N · 1, (4.9)

where M is the modified adjacency matrix used in Equation (4.7). Note that if the number of POIs, N, is small (which is typically the case in retail applications), R can be directly computed from Equation 4.9. If N is large, though, one could still rephrase Equation 4.9 as an eigenvalue problem and use the power method. Figure 4.4(b) illustrates the SpotRank coefficients given the configuration presented in Figure 4.4(a).
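A minimal sketch of the SpotRank computation via the direct linear solve of Equation 4.9, assuming the observed POI-to-POI transitions are summarized in a count matrix; the sink handling and the toy counts (only partly given in the text for Figure 4.4(a)) are illustrative.

```python
import numpy as np

def spotrank(K, d=0.85):
    """SpotRank coefficients via the direct linear solve of Eq. 4.9 (illustrative sketch).

    K[i, j] = number of observed trajectories going from POI j to POI i.
    """
    N = K.shape[0]
    K = K.astype(float).copy()
    # A sink (POI with no outgoing trajectory) is linked uniformly to the other POIs.
    for j in np.flatnonzero(K.sum(axis=0) == 0):
        K[:, j] = 1.0
        K[j, j] = 0.0
    # Column-normalized modified adjacency matrix, M[i, j] = l(u_i, u_j).
    M = K / K.sum(axis=0, keepdims=True)
    # Solve (I - d M) R = ((1 - d)/N) 1.
    return np.linalg.solve(np.eye(N) - d * M, np.full(N, (1 - d) / N))

# Toy example loosely following Figure 4.4(a) (counts partly assumed): POIs A, B, C, D.
K = np.array([[0, 0, 0, 0],   # into A
              [2, 0, 1, 0],   # into B
              [1, 1, 0, 0],   # into C
              [1, 1, 3, 0]],  # into D (3 trajectories from C; D itself is a sink)
             dtype=float)
print(spotrank(K))
```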

4.3.4 Discussions

We analyzed the trajectories of dozens of people in the retail store illustrated in Figure 4.3(a). Using our interpretation and the knowledge of the store's configuration, we can comment on Figure 4.5. The 'Naive Ranking' gives much less importance to the cashier's POI than our SpotRank algorithm, although the cashier zone is known to be salient. In addition, as expected, the main aisle in the middle of the store is the most salient POI. Globally, it can clearly be seen that considering the connectivity of POIs with our SpotRank algorithm leads to a different ranking than the naive ranking, which justifies its use.

Another alternative would be to only count how many tracks go to a POI and simply set the rank of a POI as the number of tracks going towards it divided by the number of tracks in the set of trajectories. SpotRank takes advantage of more information provided by the trajectories than such a description. Indeed, by simply counting the number of tracks going to a POI, the relevance of the source POI is not taken into consideration.

Figure 4.6 presents the results of the SpotRank algorithm over different scenarios. Take the example of the basketball field. Figure 4.6(d) clearly shows the typical zones occupied by a single


Figure 4.5: Automatically ranked salient spots in a retail store given dozens of people's trajectories. (a) Naive Ranking. (b) SpotRank.

player, as well as the transitions between the zones. Figure 4.6(f) could illustrate a typical game play of the entire team.

Finally, a strength of our proposed SpotRank algorithm is also its low computational cost. On an Intel processor at 2.53 GHz, our implementation computes the SpotRank in 10 ms for trajectories containing 10,000 points. When trajectories are made of 500,000 points, the time increases only slightly, to 50 ms. As a result, the SpotRank algorithm can be incorporated into the SCOOP detection scheme, leading to a complete real-time system to capture and analyze people's motion behavior.

4.4 Conclusions

In this chapter, we completed the full picture by presenting a system to accurately detect, track, and analyze people's trajectories in a network of fixed cameras in real-time. Our Dijkstra Pursuit Tracker achieves this goal. The aim was not to present the best tracking algorithm, since a tremendous number of algorithms have already been proposed in this field, but rather to address the tracking problem as a graph-based greedy approach with real-time performance. Future work can compare the performance of the Dijkstra Pursuit Tracker with the tracking algorithms presented in Section 4.2.1. In addition, the A* algorithm can be evaluated as an alternative to Dijkstra to reduce the computational complexity even further.

We studied the behavior of people in accordance with Claim 1: the dynamics of a scene are based on a small set of causes. In other words, people's motion behavior is driven by a small set of points of interest in a given environment. As a result, we studied people's behavior with the POI/AP framework, where POIs are automatically identified and ranked according to their mutual flows by our SpotRank algorithm. The proposed system could also be used to quantitatively study human psychology over large-scale data about specific factors that influence behavior, such as color or layout.


Figure 4.6: Output of the SpotRank algorithm over various captured trajectories. (a) Dozens of customers' trajectories; (b) SpotRank. (c) One basketball player's trajectory; (d) SpotRank. (e) Team trajectories (5 players); (f) SpotRank.


Foreground Silhouette Extraction Based on Stereo Imaging

5.1 Introduction

The performance of most multi-view people detection algorithms depends on the quality of the extracted foreground silhouettes [49, 54, 59, 62]. The latter are the backbone of more advanced systems to detect, track, and analyze people's behavior [106, 107]. Our dictionary-based framework handles degraded silhouettes to some extent. Foreground silhouettes are binary masks representing the connected pixels belonging to the foreground of the scene. Static cameras can model the background of a scene by collecting statistics on every pixel. Porikli presents some of the methods in a brief survey [15]. Typically, a decade ago, Stauffer and Grimson modeled each pixel as a mixture of Gaussians with an on-line approximation for the update [16]. Although a lot of effort has been dedicated to enhancing these algorithms [17, 18], sudden changes in lighting conditions dramatically affect the performance of the extraction process. Monocular background subtraction algorithms model the variation of image intensity and hence have a limited domain of application. They do not work in environments where the lighting conditions are constantly changing, such as in sports games, shows, exhibitions with spot light effects, or when videos are projected in the background.

In this chapter, we propose a solution to tackle this problem with stereo imaging. There have been many attempts to do segmentation using stereo cameras, mainly for background subtraction applications [108–110]. At first sight, it seems to be an interesting idea, because stereo vision is of great interest for computer vision. Three strategies are exploited to extract foreground silhouettes:

1. Temporal depth variation – compute the difference between the background depth and the estimated depth at every time instant.

2. Disparity mismatch – measure the disparity mismatch given the background depth.

3. Total Variation extraction – extract foreground silhouettes with variational techniques given the background disparity map.

The first approach depends on the performance of the depth estimation algorithms. Although a lot of effort has been dedicated to accurately estimating depth maps, the performance is not consistent across scenes. Algorithms often perform poorly on uniform and non-textured regions. Some examples of the estimated depth maps on various scenes are illustrated in the next section.

Disparity mismatch algorithms compute the background depth only. Even if the background depth is correctly estimated, the extracted foreground silhouettes are not sufficiently accurate to be used within our people detection algorithm. Therefore, we propose an algorithm leveraging the advantages of both methods to extract foreground silhouettes with stereo imaging. A Total Variation (TV) disparity-based extraction algorithm is presented in Section 5.3.

The rest of this chapter is structured as follows: first, we recall some works on disparity estimation with stereo cameras. In the following section, we present the strategies to extract foreground silhouettes with the estimated disparity map, followed by our proposed TV disparity-based extraction. Experimental results are presented to illustrate the advantage of our method (see Figure 2.18). The chapter ends with concluding remarks.

5.2 Depth estimation given stereo cameras

5.2.1 Algorithms and methods

The problem of estimating depth maps using stereo cameras has been investigated for many years. The efficiency of a system depends on the choice of the stereo camera and the correspondence algorithm. The latter has to compute the correspondences between pixels in the left and right image, referred to as the disparity map. For a current overview and ranking of the different methods, the Middlebury test-bench [2] is an excellent reference. According to Arce and Marroquin [111], there are two kinds of approaches to solve the stereo correspondence problem:

Local optimization methods A pixel-wise process is adopted to look for the most similar pixels across images according to a correspondence metric. Correspondences can be established between selected features computed on both images. As a consequence, results are usually sparse and thus require little computation [112–114]. These methods are interesting when it is not possible to compute a dense disparity map at a satisfying frame rate. Windows can be used to avoid all-to-all computation. Correspondences between two images are found by computing the correlation over small areas. The correlation can be measured using different metrics such as cross-correlation [115] or squared intensity differences [116]. The result is not globally consistent with respect to some important features (edges, discontinuities) because of the locality of the computation. Those techniques are especially efficient for textured areas. The phase of the Fourier Transform can also be estimated to match two patches on the images in order to find the correspondence. This is a sort of gradient-based optical flow if we consider the difference of the Fourier phases of the images as the time derivative. A hierarchical approach can be exploited to avoid the local minima problem [117]. Birchfield et al. in [118] use dynamic programming to minimize the dissimilarity of pixel intensities with a specific cost function. Each scanline is processed independently to give more global results than window-based techniques, and faster than energy-based solutions. Indeed, the latter can be NP-hard, whereas dynamic programming can compute disparities in polynomial time. Furthermore, partial occlusions can be handled by matching multiple pixels from one image to a single pixel in the other, which is not possible with common window-based methods [2]. Finally, cooperative algorithms process local information using nonlinear functions, with a behavior similar to global optimization methods. Such techniques are inspired by the human visual system [119, 120].

Global optimization methods The disparity map is obtained as the solution of a minimization process which usually involves an energy functional. Generally speaking, the functional includes a data term to enforce the brightness consistency between the images and a smoothness term on the disparity map D:

J(D) = Jdata(D) + Jsmooth(D), (5.1)

The data term J_data basically ensures that the disparity map D is consistent with both the left image I_L and the right image I_R, and can be written as:

J_data = Σ_x Ψ(I_L(x), I_R(x), D), (5.2)

where Ψ is a cost function which is usually chosen to be robust to outliers and illumination changes. A typical choice of the data term is the following:

J_data = Σ_x |I_L(x) − I_R(x − D)|, (5.3)

The smoothness term J_smooth usually depends on an a priori model of the disparity map. Many different functions have been proposed over the past 20 years, starting from the seminal work of Horn and Schunck [121] and later on with [122, 123]. All of them try to impose spatial smoothness on the disparity, possibly trying to preserve discontinuities. It usually produces the following term:

Jsmooth = ρ(∇D), (5.4)

where ρ is typically a monotonically increasing function of the discrete disparity gradient ∇D.

The disparity map D is then obtained as the solution of the minimization problem:

D = arg min_D J. (5.5)

Over the years, several techniques have been developed to solve the problem described in Equation (5.1) (see [2] for a full description). We classify these into two main classes: discrete and continuous ones. The discrete class of methods performs the optimization on a predetermined set of disparities. Among the many algorithms, probably the most representative are those based on Graph-Cuts [8, 124] and belief propagation [125]. Continuous methods, on the other hand, treat disparities as continuous functions and make no assumptions on the disparity range [126]. These techniques are usually referred to as variational and, among their other advantages, they are easily implemented on GPUs [127]. In the next section, we evaluate the performance of several of the mentioned algorithms to estimate disparity maps.

5.2.2 Experimental results

It is out of the scope of this work to compare all disparity estimation algorithms, since a tremendous number of methods have been proposed in the past decades. Nevertheless, we qualitatively evaluate the performance of some well-known algorithms on the public Middlebury data set [2] and then on a typical scene in which we extract foreground silhouettes. The following algorithms have been evaluated: the old but fast algorithm proposed by Birchfield et al. [118], the Block Matching (BM) method [7], the Graph Cuts (GC) algorithm [8], the cooperative stereo algorithm proposed by Zitnick et al. [5], the energy minimization method for Markov Random Fields (MRF) [4], the probabilistic framework proposed by Domke et al. [3], and the TV-ℓ1 formulation framework [6]. Satisfying results are obtained on such a public data set made of piece-wise planar textured scenes. Tables 5.1 to 5.7 present the parameters used for all algorithms. Figure 5.1 presents the estimated depths. The running time of the algorithms is also compared in Table 5.8. However, once the presented algorithms are applied to more challenging scenes, i.e., poorly textured regions with fewer piece-wise planar regions, the performance of the depth algorithms severely degrades (see Figure 5.2). As a result, foreground silhouette extraction is almost unfeasible with such noisy estimated depth maps if a regularization method is not used.

Domke et al. [3]   | Minimum Disparity | Maximum Disparity | Alpha
small images       | 5                 | 80                | 10
big images         | 0                 | 90                | 5

Table 5.1: Parameters used for the algorithm proposed by Domke et al. [3].

MRF                | Disparity Level | Birchfield and Tomasi cost | Squared difference | Algorithm
small images       | 16              | active                     | active             | ICM
big images         | 80              | active                     | active             | ICM

Table 5.2: Parameters used for the Markov Random Field (MRF) [4] method.

ZKS                | Minimum Disparity | Maximum Disparity | # Iterations
small images       | 0                 | 20                | 5
big images         | 40                | 60                | 10

Table 5.3: Parameters used for the Zitnick and Kanade algorithm [5].

TV                 | Alpha | # Levels
small images       | 10    | 5
big images         | 10    | 6

Table 5.4: Parameters used for the TV algorithm [6].

Birchfield         | Max disparity
small images       | 100
big images         | 100

Table 5.5: Parameters used for the Birchfield algorithm.


Block-Matching     | SAD Window Size | Minimum Disparity | # Of Disparities | Uniqueness Ratio
small images       | 41              | -16               | 64               | 5
big images         | 41              | -74               | 64               | 5

Table 5.6: Parameters used for the block matching algorithm [7].

Graph Cut          | # Of Disparities | # Iterations | Minimum Disparity
small images       | 16               | 2            | 0
big images         | 75               | 2            | 10

Table 5.7: Parameters used for the Graph Cuts algorithm [8].

Algorithm name     | Testbench1 [s] | Testbench2 [s]
BM                 | 0.08           | 0.18
Birchfield         | 0.22           | 0.43
TV                 | 1.8            | 4.6
GC                 | 5.08           | 85.09
MRF                | 7.44           | 136.88
Domke              | 9.56           | 24.5
ZKS                | 10.89          | 36.69

Table 5.8: Running time of each tested algorithm.

5.3 Foreground silhouettes extraction given stereo cameras

5.3.1 Temporal depth variation

The estimated depths computed across time can be used to model the background and foreground of the scene. The foreground image F(t) at time t is computed as follows:

F(t) = 1 if |D(0) − D(t)| > θ, and 0 otherwise, (5.6)

where D(t) is the estimated disparity at frame t, and θ the threshold corresponding to the minimum depth variation. Note that the disparity D is inversely proportional to the estimated depth and hence is enough to extract foreground objects. Figure 5.3 presents the silhouettes extracted with such an approach. It can be seen that the estimated depths are too noisy and not consistent across time. As mentioned previously, the performance of depth estimation algorithms depends on the content of the scene. The background disparity, D(0), corresponds to an empty scene with poorly textured and uniform regions, which is one of the most challenging scenes for depth estimation. Therefore, the resulting depth is noisy and leads to useless foreground silhouettes, as illustrated in Figure 5.3.
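A minimal sketch of this per-pixel test (Equation 5.6), assuming some disparity estimator is available for the current stereo pair (any of the algorithms compared above); the estimator is passed in as a placeholder function and the threshold value is illustrative.

```python
import numpy as np

def temporal_depth_variation(D_background, left, right, estimate_disparity, theta=3.0):
    """Foreground mask from the change of disparity w.r.t. the background (Eq. 5.6).

    D_background       : disparity map of the empty scene, D(0).
    left, right        : current stereo pair.
    estimate_disparity : function returning the disparity map of a stereo pair (placeholder).
    theta              : minimum disparity variation (in disparity units).
    """
    D_t = estimate_disparity(left, right)          # disparity at the current frame
    return np.abs(D_background - D_t) > theta      # boolean foreground image F(t)
```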


5.3.2 Disparity mismatch

The temporal depth variation technique requires estimating depth at every frame in real-time. Advanced high-complexity algorithms cannot be used. In order to reduce the computational complexity of disparity-based segmentation, stereo mismatches are used [108–110, 129, 130]. In such a case, the background disparity is computed once at the beginning. Any high computational complexity algorithm can be used since the depth is computed off-line. Then, for every frame, the background disparity is used to warp every right image onto the left one to measure similarity. The foreground silhouettes are computed as follows:

F(t) = \begin{cases} 1 & \text{if } |I_L(t) - I_W(t)| > \theta, \\ 0 & \text{otherwise}, \end{cases}   (5.7)

where I_W(t) = I_R(x + D(0), t),   (5.8)

with I_L(t) and I_R(t) being respectively the left and right images of the stereo camera at time t. The warping operation, I_R(x + D(0), t), maps the image plane of the right image onto the left one. By definition, if the background disparity map D(0) is perfect, we have:

I_L(0) − I_W(0) = I_L(0) − I_R(x + D(0), 0) = 0.   (5.9)

Equation 5.9 is equivalent to the brightness consistency assumption. To compute the disparity maps, we suppose that the brightness does not change between the left and right images. This can be done without loss of generality by supposing identical calibration or by applying histogram matching methods.
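A minimal sketch of the disparity mismatch of Equations 5.7 and 5.8, assuming grayscale images and a dense horizontal background disparity map; the warping is a simple nearest-neighbor lookup and the threshold is illustrative.

```python
import numpy as np

def warp_right_to_left(I_right, D0):
    """Sample the right image at x + D0(x, y) for every left-image pixel (Eq. 5.8)."""
    h, w = I_right.shape
    xs = np.arange(w)[None, :] + D0                     # x + D(0)
    xs = np.clip(np.round(xs).astype(int), 0, w - 1)
    return I_right[np.arange(h)[:, None], xs]

def foreground_from_mismatch(I_left, I_right, D0, theta=20.0):
    """Eq. 5.7: threshold the residual between the left and warped right images."""
    I_warped = warp_right_to_left(I_right, D0)
    residual = np.abs(I_left.astype(float) - I_warped.astype(float))
    return (residual > theta).astype(np.uint8)
```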

Once foreground objects are present in the scene, the pixel correspondences given by the disparity D(0) are no longer correct over the objects' regions. If the objects have a non-uniform color distribution, then their silhouettes are correctly extracted. However, for objects of uniform intensity, such an approach extracts the contour of the foreground silhouettes instead of the filled shapes required (see Figure 5.4). The reason is that although pixels are not mapped correctly, they are compared with neighboring pixels having similar intensity values. Objects with uniform regions induce intensity differences on their boundaries only. A solution would have been a method to fill these contours, but as they are not stable and almost never closed, some heavy and slow processing would be needed. In addition, the disparity method induces a lot of noise. Therefore, we propose a TV framework to extract foreground silhouettes based on the disparity mismatch, and hence promote filled silhouettes.

5.3.3 Proposed Total Variation disparity-based extraction

We formulate the foreground extraction problem as a correspondence problem from the left image I_L to a warped one, I_W. The latter is obtained by applying the background disparity to the right image as in Equation 5.8. The correspondence problem is solved within the total variation framework with the `1 regularization term, to specifically penalize a sketch-like solution made of contours and promote a solution with filled shapes. In other words, our approach to improve the foreground silhouette extraction is to replace the subtraction operator of the disparity mismatch by


a disparity computation with a TV `1 constraint. Figure 5.5 summarizes our approach with respect to the presented schemes.

We first compute the background disparity between the left and right images, D(0). Then, for every frame, the foreground image, F(t), is computed by measuring the disparity between the warped and the left image. To get a mask, a simple threshold is applied to the output.

Here is the complete algorithm:

1. We first compute the background disparity D(0). Any algorithm can be used. Since it is an off-line process, there is no constraint on the computational complexity. We use the TV `1 framework proposed in [6] to compute the background disparity.

2. For each new frame, we warp the right image given the background disparity:

   I_W = I_R(x + D(0)).   (5.10)

3. The foreground disparity D_F is computed by measuring the disparity between the warped and left images:

   D_F = \arg\min_{D_F} \sum_x |I_L(x) - I_W(x + D_F)| + \lambda |\nabla D_F|,   (5.11)

   where λ is the regularization factor, and ∇ represents the discrete gradient operator. The fidelity term |I_L(x) - I_W(x + D_F)| is similar to the disparity mismatch of Equation 5.7 with a correcting factor D_F. As mentioned previously, in the presence of objects, the disparity mismatch is no longer correct and hence needs to be rectified by D_F. To solve Equation 5.11, we relax the problem as follows:

   D_F = \arg\min_{D_F} \sum_x |I_L(x) - I_W(x) - \langle \nabla I_W(x), D_F \rangle| + \lambda |\nabla D_F|,   (5.12)

   where ⟨·,·⟩ is a scalar product.

4. Finally, the solution of Equation 5.12 needs to be thresholded. Ideally, any non-zero value should be considered as foreground. However, since a residual shift exists after the minimization process, we set a small threshold β:

   F = \begin{cases} 1 & \text{if } D_F > \beta, \\ 0 & \text{otherwise}. \end{cases}   (5.13)
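The overall structure can be sketched as follows. This is a crude first-order illustration of the relaxed problem of Equation 5.12, not the solver used in this work: both absolute values are smoothed with a Charbonnier approximation and minimized by plain gradient descent, the term ⟨∇I_W, D_F⟩ is assumed to reduce to the horizontal derivative of I_W times D_F, images are assumed normalized to [0, 1], and the step size, iteration count, and threshold are illustrative. The thesis relies on the TV `1 solver of [6] instead.

```python
import numpy as np

def grad(u):
    # Forward differences with zero padding at the borders.
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    # Negative adjoint of the forward-difference gradient.
    d = np.zeros_like(px)
    d[:, 0] = px[:, 0]; d[:, 1:] = px[:, 1:] - px[:, :-1]
    d[0, :] += py[0, :]; d[1:, :] += py[1:, :] - py[:-1, :]
    return d

def tv_foreground_disparity(I_left, I_warped, lam=0.1, step=0.2, iters=200, eps=1e-3):
    """Crude gradient descent on a smoothed version of Eq. 5.12."""
    r = I_left - I_warped                        # I_L(x) - I_W(x)
    g, _ = grad(I_warped)                        # horizontal derivative of I_W
    D = np.zeros_like(r)
    for _ in range(iters):
        res = r - g * D                          # linearized data residual
        data_grad = -g * res / np.sqrt(res ** 2 + eps ** 2)
        dx, dy = grad(D)
        mag = np.sqrt(dx ** 2 + dy ** 2 + eps ** 2)
        D -= step * (data_grad - lam * div(dx / mag, dy / mag))
    return D

def foreground_mask(D_F, beta=0.5):
    return (D_F > beta).astype(np.uint8)         # thresholding of Eq. 5.13
```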

Figures 5.6 and 5.7 illustrate the performance of our TV disparity-based foreground silhouette extraction algorithm. The TV framework clearly promotes filled shapes, as opposed to the disparity mismatch techniques. Indeed, when looking for the foreground disparity D_F that minimizes the fidelity term in Equation 5.11, candidates made of contours only are not picked since their TV norm is higher than that of candidates made of filled shapes. In other words, the piece-wise smooth candidate is selected, leading to foreground silhouettes suitable for our multi-view people detection framework.

5.4 Conclusions

We presented a TV disparity-based approach to extract foreground silhouettes in a manner robust to sudden changes of illumination. It outperforms other approaches based on temporal depth variation or disparity mismatch with stereo imaging. However, the proposed method remains computationally expensive. Foreground silhouette extraction is the first step of a more global system to detect, track and analyze object behavior. As a result, limited computational resources should be allocated to this step. In the last quarter of 2010, there was a paradigm shift in the cost of active depth cameras. Low-cost depth cameras have been proposed based on an infrared projector and camera 1. The Microsoft Kinect camera is one of them (see Figure 5.8). To build the final depth map, the infrared projector projects a specific pattern to be compared with a pattern memorized for a predefined depth. The depth offered by these cameras is stable enough to accurately extract foreground silhouettes based on the simple temporal depth variation technique. Therefore, we developed a real-time system based on these Kinect cameras to detect people that is unaffected by sudden changes of lighting conditions, as illustrated in Figure 5.9.

This chapter ends the description of our complete real-time system to capture people's behavior and extract high-level analysis with a network of fixed cameras. What if cameras are not fixed and not calibrated? In the remaining chapters, we tackle the problem of detecting similar objects with moving uncalibrated cameras with non-overlapping fields-of-view.

1. By the PrimeSense company.


Panels (three scenes): (a) Original, (b) Ground truth, (c) Domke et al. [3], (d) Block Matching [7], (e) Graph Cuts [8], (f) TV framework [6, 128], (g) Zitnick et al. [5], (h) Birchfield et al. [118].

Figure 5.1: Disparity algorithms applied on the Middlebury data set [2].


Panels (two scenes): (a) Original, (b) Domke et al. [3], (c) Block Matching [7], (d) Graph Cuts [8], (e) TV framework [6], (f) Zitnick et al. [5], (g) Birchfield et al. [118], (h) MRF method [4].

Figure 5.2: Disparity algorithms applied on a typical scene made of walking people.


Panels: (a) Domke et al. [3], (b) Block Matching [7], (c) Graph Cuts [8], (d) TV framework [6, 128], (e) Zitnick et al. [5], (f) Birchfield et al. [118], (g) MRF method [4], (h) Our TV disparity-based method.

Figure 5.3: Foreground silhouettes obtained by depth variation, i.e. absolute subtraction of the background depth from the depth estimated in the presence of people. For comparison purposes, we also present in (h) the output of our TV-based approach described in Section 5.3.3.

Panels (frames 1 and 2): (a) Domke et al. [3], (b) Zitnick et al. [5], (c) MRF method [4].

Figure 5.4: Two examples of disparity mismatch given three algorithms.


!"

#$%&'%$()*"+,-.$(&/&+"

0$12(3&"*,+24%,35"""6789"

:;789" :<789"

0$12(3&"*,+24%,35"""""6739"

:;739" :<739"

(a) Temporal depth variation

!"#$%!&'()

*+",-"+./0)1234+.,5,1)

%6&'()

7)

%8&'()

9+:$.',)021$#"2';))<&=()

%8&=() %6&=()

(b) Disparity mismatch

!"#$%!&'()

*+",-"+./0)1234+.,5,1)

%6&'()

78)l9)

%:&'()

;+<$.',)021$#"2'=))>&?()

%:&?() %6&?()

(c) Proposed TV disparity-based extraction

Figure 5.5: Illustration of the three schemes to extract foreground silhouettes given stereo cameras.

Panels: frames #100 to #1200, in steps of 100.

Figure 5.6: Illustration of our proposed TV disparity based foreground extraction algorithm.


Panels (four examples): (a) Original image, (b) TV disparity extraction, (c) Mask (thresholded).

Figure 5.7: Corresponding mask given our TV disparity-based foreground silhouettes extraction approach.


(a) The KINECT device (b) Uncovered KINECT

Figure 5.8: The KINECT Device

(a) Day time detections

(b) Night time detections

Figure 5.9: Output of our real-time people detection and tracking algorithm with the Kinect cameras (SCOOP and Dijkstra Pursuit Tracker).


Chapter 6. Object Matching Across Uncalibrated Cameras

6.1 Introduction

Often, networks of cameras are uncalibrated. The correspondences between 3D points and the image planes of the cameras are not available due to their dynamics. Cameras have been integrated into many devices such as phones or vehicles. Such moving cameras cannot extract foreground silhouettes and cannot project extracted points onto a common reference given by a single homography or fundamental matrix. Matching objects across cameras can only be performed based on image features such as appearance and shape.

In this chapter, we present an image-based object descriptor to represent objects and find similar ones in the image plane of other cameras. An Object Descriptor (OD) is proposed to handle the deformations occurring in the targeted applications (e.g. safety, surveillance, or robot navigation), such as: (i) photometric deformation, (ii) viewpoint changes (i.e. rotation around the vertical or horizontal axes), (iii) object deformation (e.g. walking pedestrians), and (iv) partial occlusions. The OD is a collection of features describing the full image region of an object of interest. The OD is made of a cascade of coarse to fine grids of image descriptors (CaGID). The matching process is sped up by the interest point locations, leading to a sparse search space. State-of-the-art image descriptors such as the covariance matrix of various features [131], the histogram of colors [132], the histogram of oriented gradients [43], the scale-invariant feature transform (SIFT) [133], the speeded-up robust features (SURF) descriptor [134], and the color interest points [135] are evaluated to provide insights on the object matching problem.

To evaluate the performance of the proposed OD, objects detected by a fixed camera are searched for in a mobile camera to find a match. In the next section, we briefly present the existing image descriptors. In Section 6.3, the proposed object descriptor is presented along with the matching strategy. A detailed evaluation of the object matching strategies is presented in Section 6.5.


6.2 Existing image descriptors

A wide assortment of image descriptors has been proposed in the literature to address specific goals. Ranging from monocular or multi-view tracking problems to image retrieval, simple and complex descriptors have been used.

The most basic high-dimensional descriptor is the vector of pixel intensities [136]. Cross-correlation can be used to compute the distance between the descriptors. Its high dimensionality leads to high computational complexity without being robust to geometric deformation. A natural alternative is to describe the distribution of the pixel intensities by histograms, which copes with translations and rotations. Stricker and Orengo [132] quantize the HSV color space instead of the RGB. They use 16 bins for Hue and 4 for Saturation and Value to match images. The log-RGB [137], YCrCb [138], or Lab color spaces can also be used. A color histogram can be sufficient for monocular tracking [139] but leads to poor performance in a multi-view system. It is vulnerable to bad camera calibration and illumination changes. The inter-camera illumination change can be modeled to reduce such an effect [140]. Nevertheless, in many applications, color histograms are not discriminative enough to match or detect objects.

Another efficient-to-compute descriptor is the Histogram of Oriented Gradients (HOG). It is based on the first-order derivatives of the image intensity with respect to x and y (denoted by I_x and I_y). From these derivatives, a gradient field is computed, assigning to each pixel a magnitude m_g(x, y) and an angle o(x, y):

m_g(x, y) = \sqrt{I_x^2(x, y) + I_y^2(x, y)},   (6.1)

o(x, y) = \arctan\left(\frac{I_y(x, y)}{I_x(x, y)}\right).   (6.2)

The angle values o ∈ [0, 360[ are quantized to N discrete levels o_i. A histogram is formed where each bin is the sum of all magnitudes with the same orientation o_n in a given region. HOG has been extensively used to detect pedestrians in static images [19, 43, 141]. It is also the key component of the descriptor proposed by Lowe in [133]. Wang et al. in [142] use HOGs that incorporate the detailed spatial distribution of objects' color across different body parts. Likewise, Gheissari et al. in [143] segment the body into salient parts and combine color and edgel histograms for appearance representation and person re-identification.

To compare the histograms, any distance measure can be used: correlation, `1-norm, `2-norm, intersection, Chi-square, Bhattacharyya. We compared all those distances and concluded that the Bhattacharyya distance [139] performs either better than or similarly to the other distances:

σ_r(H_1, H_2) = \sqrt{1 - \frac{\sum_i \sqrt{H_1(i) \cdot H_2(i)}}{\sqrt{\sum_i H_1(i) \cdot \sum_i H_2(i)}}},   (6.3)

where H(i) is the histogram value of bin i.
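The two ingredients above can be sketched with plain NumPy: a simple HOG accumulation following Equations 6.1 and 6.2, and the Bhattacharyya distance of Equation 6.3. This omits the interpolation and block normalization that a production HOG implementation would add; the bin count is illustrative.

```python
import numpy as np

def hog_histogram(region, n_bins=8):
    """Orientation histogram of a grayscale region (Eqs. 6.1-6.2)."""
    region = region.astype(float)
    Ix = np.gradient(region, axis=1)
    Iy = np.gradient(region, axis=0)
    mag = np.sqrt(Ix ** 2 + Iy ** 2)                    # Eq. 6.1
    ang = np.degrees(np.arctan2(Iy, Ix)) % 360.0        # Eq. 6.2, mapped to [0, 360)
    bins = np.minimum((ang / 360.0 * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())          # sum magnitudes per orientation
    return hist

def bhattacharyya(h1, h2):
    """Eq. 6.3: distance between two (possibly unnormalized) histograms."""
    num = np.sum(np.sqrt(h1 * h2))
    den = np.sqrt(np.sum(h1) * np.sum(h2)) + 1e-12
    return np.sqrt(max(0.0, 1.0 - num / den))
```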

Lowe presents a method to extract feature points invariant to scale, rotation, a substantial range of affine distortion, 3D viewpoint, illumination, and addition of noise: the scale-invariant feature transform (SIFT) [133]. Scale-space extrema are detected with a difference-of-Gaussians function. Histograms of gradient direction are assigned to keypoints and used to create the descriptors. Bay et al. propose an interest point detector and descriptor outperforming SIFT in terms of speed and accuracy: speeded-up robust features (SURF) [134]. Their descriptor is based on the distribution of the Haar-wavelet responses within the interest point neighborhood. Their detector and descriptor do not use color information. Gabriel et al. in [135] consider color interest points. The R, G, B values and first-order derivatives of the (R, G, B) channels are considered to describe each interest point. Similarity between two regions is computed by summing the distances between IPs with the shortest Mahalanobis distance. However, interest point based matching performs poorly with noisy low-resolution images (see Section 3.2.5).

A more complex descriptor is the covariance descriptor. It was first presented by Tuzel et al. [131] and shown to outperform histogram descriptors. For each pixel, a feature vector f_n is extracted. The grayscale intensity, the RGB values, the norms of the first and second order derivatives, and the gradient magnitude and its angle can be considered [9, 10]. Typically,

f_n = (x, y, I, I_x, I_y)^T,   (6.4)

where I is the grayscale intensity, and I_x and I_y are the norms of the first-order derivatives. The pixel coordinates (x, y) are included in the feature vector to account for the spatial information of the features. The covariance of a region is computed as:

C_i = \frac{1}{N-1} \sum_{n=1}^{N} (f_n - m)(f_n - m)^T,   (6.5)

where N is the number of points in the region, and m the mean of all the feature vectors. With covariance matrices, several features can be fused into a lower dimensionality without any weighting or normalization. They describe how features vary together. Similarity between two regions is given by the distance proposed by Förstner and Moonen [146], which sums the squared log generalized eigenvalues of the covariances:

σ_r(C_1, C_2) = \sqrt{\sum_i \ln^2 \lambda_i(C_1, C_2)},   (6.6)

where λ_i(C_1, C_2) are the generalized eigenvalues of the covariance matrices C_1 and C_2. Although a fast method based on integral images exists to compute the covariance matrices [131], the similarity measurement is time-consuming.
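A minimal sketch of the covariance descriptor of Equations 6.4 and 6.5 and the distance of Equation 6.6, using the feature vector f = (x, y, I, |I_x|, |I_y|). The generalized eigenvalues are obtained with scipy.linalg.eigh, and a small regularization term (an assumption, not from the thesis) is added for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def covariance_descriptor(region):
    """Covariance of the per-pixel feature vectors over a grayscale region (Eqs. 6.4-6.5)."""
    region = region.astype(float)
    h, w = region.shape
    ys, xs = np.mgrid[0:h, 0:w]
    Ix = np.abs(np.gradient(region, axis=1))
    Iy = np.abs(np.gradient(region, axis=0))
    F = np.stack([xs, ys, region, Ix, Iy], axis=-1).reshape(-1, 5)
    return np.cov(F, rowvar=False)                       # (N-1)-normalized, as in Eq. 6.5

def covariance_distance(C1, C2, reg=1e-6):
    """Förstner-Moonen metric of Eq. 6.6 via generalized eigenvalues."""
    n = C1.shape[0]
    lam = eigh(C1 + reg * np.eye(n), C2 + reg * np.eye(n), eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))
```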

Other descriptors exist, such as steerable filters [147], Gaussian derivatives [148], complex filters [149], phase-based local features [150], and moment invariants [151]. However, according to Mikolajczyk and Schmid [152], their proposed descriptor, called gradient location-orientation histogram (GLOH), as well as the SIFT descriptor, outperforms these descriptors. GLOH is a variant of SIFT computing the HOGs in a log-polar location grid and reducing the final descriptor size with principal component analysis (PCA). Nevertheless, it is computationally more demanding than SIFT. In Section 6.5, the performances of the presented descriptors are compared.



Figure 6.1: A collection of grids of descriptors. The left column is the object of interest. The right column is another object to compute similarity with. Colored blobs are kept to compute the global distance (β = 50%).

6.3 Object descriptor

6.3.1 Collection of grids of image descriptors

Describing an object with a single image descriptor is not discriminative enough. Such a global representation is not relevant in many matching problems, as detailed in Section 6.5. Therefore, Grids of Image Descriptors (GID) exist to represent objects in a more detailed fashion. They segment the object into a number of sub-rectangles of equal size (also referred to as blobs). In this chapter, we evaluate the performance of a Collection of Grids of Image Descriptors (CoGID). The local and global information is described given different grid segmentations. Grids with a finer blob size describe local information whereas grids with a coarse blob size describe a more global behavior.

Similarity between two objects, φ(x, y_i), is computed by summing the distances σ_i between corresponding blobs of the grids. Since many objects are deformable, and some partially occluded, only the most similar blobs are kept: the best β percent. In this way, blobs belonging to the background can potentially be discarded as well (see Figure 6.1).

Let σ be the set of all distances:

σ = {σ1, σ2, ..., σN}, (6.7)

where N is the total number of blobs over all the grids. We define the sorted set:

σs = {σa, σb, σc, ...}, (6.8)

where σa < σb < σc < ... and | σs |= N. Hence, the final similarity measurement is :

φ(x, y_i) = \frac{1}{p} \sum_{k=1}^{p} σ_s\{k\},   (6.9)

where σ_s{k} is the kth distance of the sorted set σ_s, and p = ⌈βN⌉. Therefore, the final similarity measurement is the average of a sparse measurement of isolated blobs of various sizes and positions.
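A minimal sketch of the sparse blob similarity of Equations 6.7 to 6.9; blob_distances is assumed to hold one distance per blob, computed for instance with Equation 6.3 or Equation 6.6.

```python
import math

def cogid_similarity(blob_distances, beta=0.75):
    """Average the best beta fraction of blob distances (Eqs. 6.7-6.9)."""
    sigma_sorted = sorted(blob_distances)       # Eq. 6.8
    p = math.ceil(beta * len(sigma_sorted))     # p = ceil(beta * N)
    return sum(sigma_sorted[:p]) / p            # Eq. 6.9
```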


6.3.2 Several observations

An object of interest can have several observations. Typically, the appearance of a moving object can change across time even from the same viewpoint. Therefore, the φ operator can use several observations of an object in the matching process. Each observation leads to an OD. To compute the similarity of a region in the given image, the minimum distance, σ_r, between each blob of the grids is selected among all ODs, leading to a distance map.

In order to cover the most different appearances of an object, the most dissimilar observations are kept. As a result, if an object does not have an appearance similar to the current observation, it might have a better similarity with an older observation.

Let D be the set of observations of an object, and m the number of observations to keep:

D = {OD_1, OD_2, ..., OD_m}.   (6.10)

We define the "similarity" distance σ_set as the sum of all pairwise distances between the ODs of a set:

σ_set(D) = \sum_{\forall k, l \in D} σ(OD_k, OD_l).   (6.11)

Initially, the set D corresponds to the first m observations of the object. Then, given a new observation OD_n, m + 1 choices of the set D are possible, referred to as D_p:

D_p = {D_1, ..., D_{m+1}} = \{ \{OD_n, OD_2, ..., OD_m\}, \{OD_1, OD_n, ..., OD_m\}, ..., \{OD_1, OD_2, ..., OD_n\}, \{OD_1, OD_2, ..., OD_m\} \}.   (6.12)

The set with the most dissimilarity (highest σset) is kept:

D_u = \arg\max_{D_i \in D_p} σ_set(D_i),   (6.13)

where Du is the new updated set of observations.
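A minimal sketch of the observation-set update of Equations 6.10 to 6.13: among the m + 1 candidate sets obtained by substituting the new observation for each stored one, or discarding it, keep the set with the largest pairwise dissimilarity. The function od_distance stands for the descriptor distance σ(OD_k, OD_l) and is an assumed callable.

```python
from itertools import combinations

def sigma_set(observations, od_distance):
    """Eq. 6.11: sum of pairwise distances within a set of observations."""
    return sum(od_distance(a, b) for a, b in combinations(observations, 2))

def update_observation_set(D, new_od, od_distance):
    """Eqs. 6.12-6.13: keep the most dissimilar set among the m+1 candidates."""
    candidates = [D[:i] + [new_od] + D[i + 1:] for i in range(len(D))] + [list(D)]
    return max(candidates, key=lambda S: sigma_set(S, od_distance))
```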

6.4 Object matching

6.4.1 Preliminary remarks

Given the proposed OD, i.e. the collection of grids of image descriptors, we are interested in localizing the most similar objects in the target image plane. Two strategies are evaluated to select the candidate regions in the target image: a dense and a sparse approach.


Panels: (a) Detected IPs on the object of interest, (b) Matched IPs in the target image (master camera on the left, desired camera on the right).

Figure 6.2: Illustration of an object described by three IPs. The most similar IPs in the target camera lead to only 3 × 6 candidate regions.

6.4.2 Dense scan

All possible regions in the target image are compared with the OD (similar to a brute-force search): a window of size proportional to the object bounding box scans the image plane at different scales. Six scales are used, with a 25% scaling factor between two consecutive scales and a jumping step equivalent to 15% of the window size.

A basic pruning technique is applied to discard regions with very low similarity: the difference between the proportions of edges in two regions is used to give a quick indication of their similarity. If the proportions of edges are not similar, the region is discarded. As a result, fewer regions remain to be analyzed, which increases the likelihood of detecting the right object by reducing the search space. However, this pruning technique does not reduce the search space sufficiently; a further reduction is needed.
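A minimal sketch of the dense scan with its edge-based pruning. The text only fixes the 25% scale factor and the 15% step; the scale grid, the edge detector, and the pruning tolerance below are illustrative assumptions, and grayscale input is assumed.

```python
import numpy as np

def edge_proportion(region, thresh=30.0):
    """Fraction of pixels with a strong intensity gradient."""
    g = region.astype(float)
    gx = np.abs(np.gradient(g, axis=1))
    gy = np.abs(np.gradient(g, axis=0))
    return np.mean((gx + gy) > thresh)

def dense_candidates(image, obj_w, obj_h, obj_edges, tol=0.2):
    """Yield (x, y, w, h) windows over six scales, pruned by edge proportion."""
    H, W = image.shape
    for s in range(6):
        scale = 1.25 ** (s - 3)                          # six scales, 25% apart
        w, h = int(obj_w * scale), int(obj_h * scale)
        step_x, step_y = max(1, int(0.15 * w)), max(1, int(0.15 * h))
        for y in range(0, H - h, step_y):
            for x in range(0, W - w, step_x):
                region = image[y:y + h, x:x + w]
                if abs(edge_proportion(region) - obj_edges) < tol:
                    yield (x, y, w, h)
```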

6.4.3 Sparse selection

A dense scan of the target image leads to thousands of regions to evaluate. In order to reduce the cardinality of such a set, a sparse selection given by the interest points (IP) extracted from the object of interest is proposed. All the interest points found on the object are matched to the most similar IPs in the image. Any existing detector and descriptor can be used. In this work, SURF [134] is used to detect and describe the IPs due to its low computational cost.

Each IP extracted from the object is represented by its coordinates with respect to the center of the bounding box. Therefore, a matched IP corresponds to a bounding box with the same spatial coordinates with respect to the center of the candidate region (up to a scale 1). Figure 6.2 illustrates the approach.

In Section 3.2.5, both strategies, i.e. the dense and sparse selection of the candidate regions, are compared.

1. Six different scales are also used.
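A minimal sketch of the sparse candidate generation. The IP detection and matching (SURF in this work) are abstracted away; the scale list is an illustrative assumption covering the six scales mentioned above, and each match contributes one box per scale (hence 3 × 6 candidates for three IPs, as in Figure 6.2).

```python
def sparse_candidates(matches, obj_w, obj_h,
                      scales=(0.5, 0.66, 0.85, 1.0, 1.25, 1.5)):
    """matches: list of ((dx, dy), (tx, ty)) where (dx, dy) is the IP offset
    from the object's box centre and (tx, ty) the matched IP in the target image.
    Returns candidate boxes (x, y, w, h)."""
    boxes = []
    for (dx, dy), (tx, ty) in matches:
        for s in scales:
            cx, cy = tx - dx * s, ty - dy * s        # candidate box centre
            w, h = obj_w * s, obj_h * s
            boxes.append((cx - w / 2, cy - h / 2, w, h))
    return boxes
```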


Figure 6.3: A three-stage cascade of coarse to fine descriptors.

6.4.4 Cascade of coarse to fine descriptors

Comparing the CoGID is computationally costly. Some regions can be easily discarded without knowing the local information. Therefore, an approach similar to a cascade of classifiers is proposed, called Cascade of Grids of Image Descriptors (CaGID). "Easy regions" are discarded with coarse grids (i.e. grids with a small number of blobs). More challenging regions require the use of finer grids (i.e. a larger number of blobs). The detection process is divided into several stages. At each stage, a finer grid is used (see Figure 6.3). After each stage, only the best candidates remain, i.e. the regions with highest similarity, the top ρ% of the evaluated regions. The parameter ρ can be fixed (typically 30%) or chosen such that the same percentage is kept after each stage and one region remains after N stages:

N_r × ρ^N = 1,   (6.14)

ρ = N_r^{-1/N},   (6.15)

where N_r is the total number of regions in the image plane to compare with the object descriptor, and N is the total number of stages to use. Figure 6.4 illustrates the remaining regions with their similarity after each stage.
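A minimal sketch of the cascade with the adaptive ρ of Equation 6.15. The function grid_similarity stands for the CoGID blob-distance computation of Section 6.3.1 (smaller means more similar) and is an assumed callable; the stage sizes follow the 1-4-9-16-25 grids used in this chapter.

```python
def cascade_match(candidates, grid_similarity, stages=(1, 4, 9, 16, 25), rho=None):
    """Keep the top rho fraction of candidates after each coarse-to-fine stage."""
    n_stages = len(stages)
    if rho is None:                                   # adaptive rho, Eq. 6.15
        rho = len(candidates) ** (-1.0 / n_stages)
    remaining = list(candidates)
    for n_blobs in stages:                            # coarse to fine grids
        scored = sorted(remaining, key=lambda c: grid_similarity(c, n_blobs))
        keep = max(1, int(round(rho * len(scored))))
        remaining = scored[:keep]                     # most similar candidates survive
    return remaining[0]
```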

6.5 Experimental results and comparisons

6.5.1 Data Sets

Indoor and outdoor data sets have been used to evaluate the proposed CaGID. Each data set is composed of video sequences captured concurrently by a fixed and a mobile camera observing the same scene 1.



Figure 6.4: Illustration of the most similar regions after each stage of the algorithm (in Jet format, white regions are the least similar and black ones the most).

Fixed cameras are located at a height equivalent to the first floor of a building. Mobile cameras are held by pedestrians walking in the scene. The images are recorded at 25 fps with a resolution of 320 × 240. The data sets exhibit meaningful changes in viewpoint, illumination, and color distribution between the fixed and mobile cameras. The sensing devices are also different: mobile cameras have a low-quality capturing device and hence provide noisy images. A rough temporal synchronization of the cameras is used (a few frames of delay), similar to the delay that can occur in real-world applications.

To further evaluate the performance of the proposed descriptor in tracking object identity, the VIPeR data set is used 2. It contains hundreds of pedestrian image pairs taken from arbitrary viewpoints (45 to 180 degree view difference) under varying illumination conditions [153]. Hence, pedestrian recognition or re-identification can be evaluated given our CaGID matching process.

6.5.2 Experiments

To evaluate the proposed CaGID, thousands of objects are selected within the fixed camera to find correspondences in the mobile ones. Pedestrians and random rigid objects in the scene are selected to prove the generality of the approach. The performance of the system is quantitatively evaluated by computing the precision (i.e. the number of true positives divided by the sum of true positives and false positives) and recall (i.e. the number of true positives divided by the sum of true positives and false negatives) measures. A true positive is an object correctly detected in a mobile camera and correctly matched to the corresponding object in the fixed camera. Only objects present in the view of the mobile camera are searched, i.e. the cameras have overlapping fields-of-view (Section 6.5.4). Hence, the number of false positives is equal to the number of false negatives, leading to identical recall and precision measures.

Finally, to measure the performance of the proposed descriptors in recognizing or re-identifying pedestrians on the VIPeR data set, we measure the cumulative matching characteristic (CMC) curve, similar to Gray et al. in [138]. The CMC curve is calculated by selecting pedestrians in the first camera view and finding the ranking of their match in the collection of pedestrians observed by the second, target camera. All the pedestrians observed by the target camera are sorted given their distance to the probe pedestrian in the first camera.

1. The video sequences with their ground truth data (in xml format) are available at: http://lts2www.epfl.ch/~alahi/data.htm

2. The VIPeR data set is available at: http://vision.soe.ucsc.edu/node/178


Histogram of Color:
    64 bins for RGB: H(64R, 64G, 64B)
    32 bins for RGB: H(32R, 32G, 32B)
    64 bins for log-RGB: H(log 64R, 64G, 64B)
    32 bins for log-RGB: H(log 32R, 32G, 32B)
    64 bins for YCrCb: H(64Y, 64Cr, 64Cb)
    8 bins for YCrCb: H(8Y, 8Cr, 8Cb)
    64 bins for HSV: H(64H, 64S, 64V)
    32 bins for H, 8 bins for S and V: H(32H, 8S, 8V)
    16 bins for H, 4 bins for S and V: H(16H, 4S, 4V)
    64 bins for Lab: H(64L, 64a, 64b)

HOG:
    8 bins: HOG 8
    12 bins: HOG 12
    16 bins: HOG 16

Haar-wavelet responses:
    SURF distribution [134]: Haar(SURF)
    SURF distribution tuned: Haar(SURFtuned)

Covariance:
    C(x, y, Ix, Iy)
    C(x, y, Ixx, Iyy)
    C(x, y, mg, o)
    C(x, y, I, Ix, Iy)
    C(x, y, I, Ix, Iy, Ixx, Iyy)
    C(x, y, I, Ix, Iy, mg, o)
    C(x, y, I, Ixx, Iyy, mg, o)
    C(x, y, I, Ix, Iy, Ixx, Iyy, mg, o)
    C(x, y, R, G, B, Ix, Iy, Ixx, Iyy)
    C(x, y, R, G, B, Ix, Iy, mg, o)
    C(x, y, H, S, V, Ix, Iy, Ixx, Iyy)
    C(x, y, Y, Cr, Cb, Ix, Iy, Ixx, Iyy)
    C(x, y, Y, Cr, Cb, Ix, Iy, mg, o)

Table 6.1: Summary of the best performing image descriptors evaluated for the CaGID. x and y are the pixel coordinates, I the grayscale value, Ix and Iy the 1st order derivatives, Ixx and Iyy the 2nd order derivatives, mg and o the gradient magnitude and angle.

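A minimal sketch of the CMC protocol described above, assuming a precomputed probe-to-gallery distance matrix.

```python
import numpy as np

def cmc_curve(dist_matrix, true_ids):
    """dist_matrix[i, j]: distance between probe i and gallery pedestrian j.
    true_ids[i]: gallery index of the correct match for probe i.
    Returns the CMC value at every rank 1..n_gallery."""
    n_probes, n_gallery = dist_matrix.shape
    ranks = np.empty(n_probes, dtype=int)
    for i in range(n_probes):
        order = np.argsort(dist_matrix[i])              # most similar first
        ranks[i] = int(np.where(order == true_ids[i])[0][0])
    return np.array([np.mean(ranks < r) for r in range(1, n_gallery + 1)])
```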

6.5.3 Image descriptors evaluation

After studying the literature and considering the relevant results, the state-of-the-art image descriptors are compared for the object descriptor.

First, the color histogram is evaluated as a benchmark of the simplest low-cost descriptor. Various color spaces and bin partitions are studied. The RGB, log-RGB, HSV, YCrCb, Lab and opponent color spaces are evaluated. The Bhattacharyya distance is used to compare histograms since it performs better than other distances (such as the `1-norm, `2-norm, intersection, and Chi-square).


Figure 6.5: Recall for various image descriptors.

Then, the HOG descriptor is considered since Mikolajczyk and Schmid conclude that gradient-based descriptors (i.e. GLOH, SIFT) outperform other descriptors such as steerable filters [147], Gaussian derivatives [148], complex features [149], phase-based local features [150], and moment invariants [151]. Eight to sixteen bin partitions are compared.

Haar-wavelet responses are also analyzed since Bay et al. obtained better results with such a descriptor than with HOG-based ones. Experimental results showed that Haar-wavelet responses are very sensitive to the choice of the filter size and the sampling grid. First, the same choices as Bay et al. are tested. Then, by changing the parameters to a finer grid size and a bigger filter size, we reach better performance (referred to as Haar(SURFtuned)).

Finally, the covariance descriptor is exhaustively evaluated for various feature vectors since Tuzel et al. [131] introduced such a descriptor to outperform histogram descriptors. All the presented color channels, the first and second order derivatives, and the magnitude and angle of the gradient are used to form the feature vector. Many combinations of features are tested.

All these descriptors are extensively studied for various schemes and parameters. Table 6.1 lists the best performing ones.

6.5.4 GID/CoGID/CaGID

For the sake of clarity, only the best performing descriptors from Table 6.1 are presented in the remaining study. Nevertheless, the performance of some descriptors is presented in Figure 6.5 for the simplest scheme: an object is described by a single descriptor with a dense scan of the candidate regions. Color features perform poorly with both histogram and covariance descriptors. Since the sensing devices are different, the color distribution changes as well. Hence, color is not the right feature to use. Increasing the number of features increases the performance of the covariance descriptor. The HOGs perform almost as well as the best covariances. However, it is clear that describing an object with a single descriptor leads to very poor performance. Local information is lost in the global behavior. In this work, the cascade of grids of descriptors is proposed to tackle this problem. In order to validate such an approach, the proposed CaGID is compared with the different schemes of GID and CoGID (Figures 6.6(a) to 6.6(c)).

First, an object is described by a GID (Figure 6.6(a)). Various numbers of sub-regions per grid are considered. Increasing the number of sub-regions increases the performance with the histogram of color, HOG, and covariance descriptors. The color histogram still performs poorly compared to the others. Interestingly, the performance of the descriptor based on Haar-wavelet responses increases for a few coarse grids and decreases for much finer grids. The filter size and sampling grid are proportional to the sub-rectangle size. As mentioned previously, changing the filter size and sampling grid affects the performance. Hence, such a decrease of performance can happen with fine grids (i.e. a high number of small sub-rectangles).


Panels: (a) One grid per object, (b) A collection of grids per object, (c) A cascade of grids per object. Each panel plots the recall against the number of blobs per grid (1, 4, 9, 16, 25) for HOG 8, HOG 12, HOG 16, C(I,Ix,Iy), C(I,Ix,Iy,mg,o), C(I,Ix,Iy,Ixx,Iyy,mg,o), Haar(SURFtuned), H(64H,64S,64V), and H(64R,64G,64B).

Figure 6.6: Recall for various image descriptors with 3 different schemes to describe an object, based on a dense scan.


Panel: recall against the number of image descriptors per region, for one grid, a collection of grids, and cascades of grids with ρ = 30, ρ = 50, ρ = 70, and adaptive ρ.

Figure 6.7: Recall with respect to the number of image descriptors needed.


Second, an object is described by a CoGID (Figure 6.6(b)). The final similarity measurement is the sum of the distances over all the grids. Considering both global and local information increases the performance of all the descriptors, up to a limit.

Finally, Figure 6.6(c) shows that the proposed CaGID leads to a performance very similar to the CoGID but with a much lower computational cost. The number of descriptors to compute is much smaller than for the GID and CoGID. Figure 6.7 presents the performance of the CaGID for various ρ (refer to Section 6.4.4) with respect to the number of image descriptors needed.

Similarity between two regions is computed by summing the β most similar blobs (see Section 6.3.1) within the grids of descriptors, leading to a sparse similarity measurement. Figure 6.8 illustrates the impact of β on the performance of the cascades of HOG and covariance descriptors. The mean performance of the two descriptors is plotted. The impact of β depends on the amount of occlusion, object deformation, viewpoint and photometric changes usually present in the data set. In our application, keeping 75% of the blobs to compute the overall similarity leads to the best performance.

All three strategies describe an object in a dense manner (grids of image descriptors). However, an object can also be described by a sparse representation obtained from the detected interest points. The state-of-the-art interest point detectors and descriptors, i.e. SIFT [133] and SURF [134], are evaluated for comparison purposes. Figure 6.9 presents the matched interest points found across cameras with both approaches. The matched interest points do not correspond to the same objects, whereas our proposed cascade of covariances correctly matches the objects across cameras. Some objects, made of smooth regions, have very few interest points, leading to an unfeasible matching process. In addition, the poor image quality affects the detection process. Gabriel et al. in [135] compare IPs within the region of interest whereas SIFT and SURF match the IPs over the whole image. By comparing the IPs of two regions [135], the performance increases slightly.


Panel: recall at each cascade stage (1, 1-4, 1-4-9, 1-4-9-16, 1-4-9-16-25) for β = 100%, 75%, 50%, and 25%.

Figure 6.8: Mean recall of the cascades of HOG and covariance descriptors for various β values.

Various parameters are evaluated for SIFT, SURF, and the color interest points proposed by [135]. They all lead to poor results. The best configuration leads to a recall of less than 15%. Therefore, the proposed dense representation of an object outperforms the sparse representation made by interest points. Nevertheless, the matched interest points can be used to reduce the search space in the image plane, as explained in Section 6.4.3. A sparse selection of the candidate regions is evaluated in Figure 6.10.

The proposed sparse selection of the candidate regions combined with the dense descriptor outperforms the approach based on a dense selection (see Figure 6.10). The regions proposed by the interest points are good candidates. The reduced search space increases the likelihood of correctly detecting and matching the objects. The number of regions to keep after each stage of the cascade approach, ρ, can be increased with the sparse selection since few candidates are examined. With both selections, dense and sparse, 30% of the regions are kept after each stage. Yet, increasing ρ can lead to better recall measures at a still low computational cost.

The computational cost of the different approaches to detect and match objects is also a crucial point. Table 6.2 summarizes the performance of the presented approaches when searching for one object in the mobile camera. Note that the full cost of each approach is measured, i.e. the cost of allocating memory, computing descriptors, comparing them, and creating and sorting lists of distances. The implementation is written in C/C++, without any optimization, and runs on a 2.8 GHz processor. Therefore, the absolute cost of an approach is not informative since it can be reduced, but the relative costs are interesting. The proposed sparse selection combined with the cascade of dense descriptors outperforms the other approaches in terms of recall rate and computational cost. The cascade of covariances has the best recall rate, closely followed by the cascade of HOGs. However, HOG has a lower computational complexity. Although integral images are not used to compute the HOG descriptors, as opposed to the covariances, they still run faster. Hence, if computational complexity is an issue, the proposed cascade of HOGs might be a viable alternative.

Qualitative results are given in Figures 6.12, 6.13, and 7.4. Objects with severe changes of viewpoint or partial occlusion are correctly detected and matched. Furthermore, a set of images has been randomly selected from a data set to illustrate the strength of the object localization operator on challenging images (see Figure 6.15). It can be seen that very low resolution images made of smooth areas can also be detected and matched.


(a) Matched SIFT interest points

(b) Matched SURF interest points

(c) Proposed approach (cascade of covariances)

Figure 6.9: Left-hand side: objects observed in the fixed camera. Right-hand side: the image plane of the mobile camera to be searched.


Panels: (a) One grid per object (GID), (b) A collection of grids per object (CoGID), (c) A cascade of grids per object (CaGID). Each panel plots the recall against the number of blobs per grid for HOG 8, C(I,Ix,Iy,Ixx,Iyy,mg,o), and Haar(SURFtuned), each with a dense and a sparse scan.

Figure 6.10: Recall for various image descriptors with 3 different schemes to describe an object, based on a sparse search.


Panel: recall when keeping the best 1 to 5 matches, for HOG 8, HOG 12, HOG 16, C(I,Ix,Iy), C(I,Ix,Iy,mg,o), and C(I,Ix,Iy,Ixx,Iyy,mg,o).

Figure 6.11: Recall with respect to the number of best matches kept.

Object descriptors                       Recall    Cost
SIFT detector and descriptor [133]       < 0.15    250 ms
SURF detector and descriptor [134]       < 0.15    31 ms
Covariance descriptor [131]              0.20      4350 ms
Dense selection combined with
    Collection of HOGs                   0.65      5588 ms
    Cascade of HOGs                      0.64      520 ms
    Collection of Covariances            0.68      30 703 ms
    Cascade of Covariances               0.69      2324 ms
Sparse selection combined with
    Collection of HOGs                   0.72      558 ms
    Cascade of HOGs                      0.66      75 ms
    Collection of Covariances            0.74      1042 ms
    Cascade of Covariances               0.74      291 ms

Table 6.2: Recall rate and computation cost of various approaches

Also, faces are correctly matched across images, encouraging the use of the descriptor for other applications such as object identification. In Section 6.5.5, we explicitly address the intra-category recognition problem, showing how the cascade of grids of descriptors outperforms the other schemes.

Figure 6.11 presents the performance of the approach when several regions in the mobile camera are kept to locate the object of interest. Considering two or three regions is enough to increase the performance. The validation operator then classifies those candidate regions as either matching the object of interest or not by evaluating the dual problem.


6.5.5 People recognition

To further evaluate the strength of the proposed cascade of image descriptors, we evaluate its performance on the people recognition problem. Note that neither tuning nor training is performed for this specific problem. The CaGID (1-4-9-16-25) is compared with the CoGID (5 grids of 1, 4, 9, 16, and 25 image descriptors) and the GID (1, 4, 9, 16, or 25 image descriptors) of color histograms in Figure 6.16. All the color spaces and bin partitions are compared. The YCrCb color space with 64 bins per channel performs best with the Bhattacharyya distance. Using a finer grid size increases the performance. For this specific recognition problem, the CoGID and CaGID schemes also outperform the single GID. Although the cascade and collection of descriptors perform similarly, the CaGID outperforms the other schemes in terms of computation cost since a much smaller number of image descriptors is computed and compared (refer to Figure 6.7). Figure 6.19 illustrates the most similar pedestrians found in a camera given an observation from another camera.

The sparse similarity measurement also influences the recognition rate (see Figure 6.17). If all the blobs are kept (β = 1), the worst performance is achieved. However, keeping half of the blobs increases the overall performance by roughly 25%. Note that even if β is low (e.g. β = 0.25), we still gain in performance.

In addition, all four image descriptors are compared in Figure 6.18 given the CaGID (1-4-9-16-25). Interestingly, the color feature performs better than the other features on this problem. Indeed, in order to identify pedestrians across a set of pedestrian images (i.e. when only pedestrians are present in the search space), the color feature is a relevant cue. This matches our intuition: since all the objects have similar shapes (all are pedestrians) and differ mainly in their clothes, the color feature is the most discriminative feature to match a pedestrian against other pedestrians. However, when a set of random objects is present in the search space, as in our presented application (object matching across uncalibrated camera views), features considering the shape of the objects (e.g. gradients) are more relevant, and they still perform well on the re-identification problem (e.g. the covariance descriptor).

6.6 Conclusions

A Cascade of Grids of Image Descriptors (CaGID) was presented to match objects across uncalibrated cameras. Only the appearances of the objects are used. We performed a detailed comparison to gain insights into the performance of a great number of image descriptors. Using them "as is", i.e. a single descriptor per object, leads to very poor performance. Using a grid of image descriptors, similar to Lowe [133] or Bay et al. [134], increases the performance. The proposed CaGID outperforms the other approaches. The use of coarse to fine grids combined with the sparse similarity measurement allows the detection and tracking of deformable objects with partial occlusion and changes of viewpoint. The reduced search space driven by the detected interest points leads to real-time performance. In fact, a demo has been developed to track any object in a camera view with the CaGID in real-time. The object can change scale, be partially occluded, and be temporarily out of the field-of-view. Camera manufacturers are interested in integrating such an algorithm to dynamically track the object of focus. In applications where the objects of interest are tracked within the same camera viewpoint, high performance is reached. However, when objects of interest need to be tracked across cameras, the performance degrades due to the severe changes of viewpoint and image quality. The proposed CaGID computes the intra-grid distances, i.e. the distances between the grids' blobs of the observed object and of the candidate regions. Future work could also consider the inter-grid blob distances, i.e. the distances between blobs of the same grid within the object of interest, to compare the blob variations of the objects of interest.


Figure 6.12: 1st and 3rd columns: objects of interest seen in a fixed camera. 2nd and 4th columns: corresponding detected and matched objects in a mobile camera.


Figure 6.13: 1st and 3rd columns: objects of interest seen in a fixed camera. 2nd and 4th columns: corresponding detected and matched objects in a mobile camera.

Figure 6.14: Given the left-hand side image, the same pedestrian is detected regardless of changes in viewpoint and scale.


Figure 6.15: Examples of images randomly selected from a data set. The left column shows manually selected regions, and the right column the corresponding regions detected and matched by our proposed approach.


Panel: recognition percentage against rank score for 1, 4, 9, 16, and 25 region descriptors, the collection 1+4+9+16+25, and the cascade 1-4-9-16-25.

Figure 6.16: The CMC curve for the color histogram given all three schemes. The YCrCb color space with 64 bins per channel is used.

Panel: recognition percentage against rank score for β = 1, 0.75, 0.5, and 0.25.

Figure 6.17: The CMC curve for the color histogram given various β values.

Panel: recognition percentage against rank score for H(64Y,64Cr,64Cb), H(64R,64G,64B), C(Y,Cr,Cb,Ix,Iy,mg,o), C(I,Ix,Iy,Ixx,Iyy,mg,o), HOG 8, Haar(SURFtuned), and C(R,G,B,Ix,Iy,mg,o).

Figure 6.18: The CMC curve given 50 pedestrians present in the scene.


Figure 6.19: Examples of objects observed by a camera (left side), and the corresponding most similar pedestrians found by the cascade of grids of color histograms (YCrCb space) within another camera (right side). The images are sorted (the most similar are on the left) and the green bounding box is the correct match.


Chapter 7. Master-Slave Framework for Pedestrian Detection

7.1 Introduction

We have presented algorithms to accurately detect people and match them across cameras in the previous chapters. We now propose to combine all the pieces of the puzzle and present a complete framework to understand a scene given any network of cameras. A master-slave approach is presented where master cameras collaborate to detect objects with a centralized policy, whereas slave cameras detect objects in a distributed fashion, leading to a hybrid system. In order to validate our master-slave framework, we study its performance on the specific task of detecting pedestrians in an urban environment.

Chan et al. in [154] present an overview of existing technologies to detect pedestrians and illustrate the concepts of infrastructure-to-vehicle (I2V), vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) operation. The increasing popularity of wireless networks in urban domains and the development of dedicated short-range communication (DSRC) enable the concept of vehicle-infrastructure integration. The United States Federal Communications Commission recently allocated radio frequencies for DSRC in the 5.9 GHz band. They propose infrastructure sensors to detect pedestrians, such as ultrasonic sensors and laser scanners, with a wireless signal sent to a vehicle to alert the driver. Often, only an approximate position is detected. Mainly, people standing in front of the vehicle are of interest in order to avoid dangerous collisions. As a result, mounting cameras in front of vehicles to detect pedestrians can trigger the key alarms with fewer false positives.

This chapter focuses on presenting a novel system to alert drivers given an existing network of cameras. A single camera is mounted in each vehicle and benefits from the network of fixed cameras present in urban areas to best detect dangerous situations. The underlying motivation is that, nowadays, fixed cameras are installed in major cities at urban intersections. Typically, approximately four million fixed cameras were installed in the UK in 2002 (according to [10]).


Figure 7.1: Left column: objects of interest highlighted in a fixed (master) camera. Right column: corresponding objects detected and tracked in a mobile (slave) camera by our proposed approach.

A master-slave system is presented where the masters, i.e. the fixed cameras, collaborate to detect people as presented in Chapter 3, and help the slaves (mobile cameras) to detect similar objects as presented in Chapter 6. Objects are detected with mobile cameras mounted in vehicles given the contribution of the fixed cameras present in the scene. Note that the full video is not transferred across cameras; only extracted features are communicated. The foreground pixels are used as features for the masters, i.e. a binary mask of silhouettes. A collection of covariance descriptors is transferred to the slaves. The features are of low dimensionality and hence do not affect the network capacity, i.e. the network bandwidth needed to communicate between the masters and slaves.

In the previous chapter, we supposed that the cameras have overlapping fields of view (FOV). In the next sections, we address the case where objects may not be observed in the FOV of the target camera. A detected object in the fixed master camera is not necessarily present in the FOV of a slave camera. A validation step is needed to classify an object as present in the FOV of the slave cameras. In the rest of the chapter, "masters" refer to the fixed cameras whereas "slaves" correspond to the mobile cameras mounted on the vehicles.

To evaluate the strength of our master-slave framework, we compare its performance with a stand-alone pedestrian detector in Section 7.5. Current approaches to detect pedestrians with mobile cameras are briefly described in the next section. The state-of-the-art algorithm is used as a reference to compare our master-slave method against.

7.2 Pedestrian detection with mobile cameras

Pattern recognition techniques are usually used to detect pedestrians when cameras are moving. Expensive vision-based systems to detect pedestrians based on mobile stereo cameras exist [155–158]. Suard et al. in [43] combine stereo cameras with other sensors. Although most accidents happen in daytime conditions, Far Infra-Red (FIR) thermal images are used to detect pedestrians at night time by Bertozzi et al. [159]. FIR systems exploit the radiation emitted by any warm object. Warm objects are bright whereas the rest is dark. Nonetheless, since FIR images depend on the temperature of the objects, an outdoor scene has a number of factors that affect the images. Strong sun heating can introduce texture due to the different thermal behavior of different materials. Moreover, temperature variations (due to clouds, humidity, etc.) make the road scene difficult to interpret. Near Infra-Red (NIR) sensors detect the radiation reflected by objects in the infra-red range, which is close to visible light. NIR images have a higher spatial resolution than images formed by FIR. Broggi et al. in [160] evaluate the pedestrian body and legs using geometrical moments.

Low-cost vision-based systems, i.e. a single low-resolution mobile camera (320×240), do not perform well enough to detect people. The variability in the appearance of pedestrians (e.g. clothing), their articulated structure, their non-rigid kinematics, and cluttered scenes affect the performance of existing systems. The pedestrian detection problem is addressed as a pattern classification one. A set of features is extracted from a large number of training samples to train a classifier (see Figure 7.2). Haar wavelet coefficients of a set of normalized pedestrian images can be used [41, 42]. Images are classified with a support vector machine (SVM) and a "bootstrapping" method. Gavrila in [161] uses a template matching technique based on a hierarchical representation of the templates. Shape matching is based on distance transforms (chamfer distance), so a reasonable shape extraction is needed. Katz and Aghajan in [162] use the latter in a multi-camera setup. Pedestrians can also be detected without any training [155, 163]. Morphological characteristics of pedestrians (size and aspect ratio), vertical linear filters, and the strong vertical symmetry of the human shape are possible cues. However, with such approaches, multiple detections of the same pedestrian occur, and pedestrians with monochrome clothing are hardly detected. The histogram of oriented gradients (HOG) is an efficient and robust shape-based cue [19, 43, 141]. Tuzel et al. in [20] obtain the best performance with a novel object descriptor based on covariance matrices. The local spatial variation and correlation of gradient-based features are encoded in the covariance matrix descriptors, increasing the robustness toward illumination changes. The classification is performed using LogitBoost classifiers combined with a rejection cascade designed to accommodate points lying on a Riemannian manifold. They outperform previous approaches. In Section 7.5, the performance of their work is compared with our proposed system.

7.3 A master-slave object detection approach

7.3.1 Problem formulation

Given an observation x of an object O in a master camera, we wish to detect its presence in the view of a slave camera and, if present, locate it in its image plane. No calibration or training data should be used.

Let yi be a potential region in the slave; x and yi are rectangular subsets of an image. An "object localization" operator Φ is defined, which maps a region x to the N most similar regions in a given image I:

Φ(x, I,N) = {y1, y2, ..., yN} = Y (7.1)



Figure 7.2: Feature-based object detection framework and the proposed system to detect a pedestrian in a mobile camera.

First, the operator Φ is used to match an observation x from the master to the most similar regions in the slave:

Φ(x, Is,Ns) = {y1, y2, ..., yNs} = Yx (7.2)

The same operator Φ is further used to map any yi to a set of regions xi, referred to in this chapter as the dual problem:

Φ(yi, Im,Nm) = {x1, ..., xNm} = Xi (7.3)

where Im is the image plane of the master.

In order to validate whether a detected region in the slave really matches the same object in the master, the dual problem is evaluated. If a region xi coincides with x, then the corresponding yi should be the region bounding object O in the slave (see Figure 7.3). If none of the xi coincides with x, object O is probably not present in the view of the slave. Hence, an operator ϑ validates whether a region yi matches x:

ϑ(yi | x, Φ(yi, Im, Nm)) = ϑ(yi | x, x1, ..., xNm) ∈ [0, 1]. (7.4)

Moreover, the dynamics of the system can be considered to increase the performance. If results from previous frames are available, they help the decision at the current frame. Two types of prior are useful. First, an object moving in a scene can have different appearances across time, even from a fixed viewpoint. A set of relevant observations, {x^t, x^{t−i}, ..., x^{t−j}}, can be kept to detect the same object with a slave camera. The object localization operator becomes:

Φ({x^t, x^{t−i}, ..., x^{t−j}}, Is, Ns) = Y^t_x (7.5)

Second, the results of a detected object in the slave at previous frames, {y^{t−1}, y^{t−2}, ..., y^{t−k}}, can be used to detect the same object at the current frame, corresponding to a tracking approach:

Φ({y^{t−1}, y^{t−2}, ..., y^{t−k}}, Is, Ns) = Y^t_{y^{t−1}} (7.6)


Figure 7.3: Illustration of the Φ operator. (a) Φ(x, Is, 3) = {y1, y2, y3} = Yx: an object x, highlighted in the master camera, is mapped to the best 3 regions in the slave camera. (b) For i = 1:3, Φ(yi, Im, 3) = {x1, x2, x3} = Xi: each region yi is mapped back to 3 regions in the master camera. If those regions coincide with x, there is a match.

As a result, the problem can be formulated as follows: find the region y^t_x in the slave that maximizes ϑ(y^t_i | x^t, Φ(y^t_i, Im, Nm)) for all y^t_i ∈ {Y^t_x, Y^t_{y^{t−1}}}:

y^t_x = arg max_{y^t_i ∈ {Y^t_x, Y^t_{y^{t−1}}}} ϑ(y^t_i | x^t, Φ(y^t_i, Im, Nm)) (7.7)

If such a y^t_x does not exist (all ϑ < T), it means that the object is not present in the image plane of the slave camera. The choice of the threshold T will be discussed in Section 7.5.
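To make the formulation concrete, the following Python sketch (illustrative only; `phi`, `vartheta`, and the bounding-box convention are assumed placeholders for the matching operator of Chapter 6 and the validation operator of Section 7.4) implements the selection rule of Equation (7.7):

```python
from typing import Callable, List, Optional, Sequence, Tuple

Region = Tuple[int, int, int, int]   # (left, top, width, height) bounding box, an assumed convention

def select_region(queries: Sequence[Region], x_ref: Region,
                  phi: Callable[..., List[Region]],       # object localization operator Phi
                  vartheta: Callable[..., float],         # validation operator (Section 7.4)
                  slave_img, master_img,
                  Ns: int, Nm: int, T: float) -> Optional[Region]:
    """Sketch of Eq. (7.7): among the Ns best matches of `queries` in the slave image,
    return the candidate maximizing the dual-problem validation score, or None if all scores < T."""
    best, best_score = None, T
    for y in phi(queries, slave_img, Ns):                      # candidate set Y (Eqs. 7.2, 7.5, 7.6)
        score = vartheta(y, x_ref, phi([y], master_img, Nm))   # dual problem (Eq. 7.3)
        if score >= best_score:
            best, best_score = y, score
    return best        # None means the object is declared absent from the slave view
```

The same routine covers the forthcoming Equations (7.8)–(7.10): detection corresponds to querying with master observations, and tracking to querying with the previously found slave region.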

7.3.2 Detect, track, and validate

In order to solve the formulated problem, the approach can be summarized as follows. First, an object observed by a master is searched in the image plane of the slave with the Φ operator. The dual problem is evaluated to validate the candidates. Then, at the next frames, the prior from the slave is first used to search the new frames. If the tracked region validates the dual problem, the corresponding object is not searched again given the observation from the master. However, if none of the candidates matches the initial object, the process is repeated without considering the prior from the slave. Algorithm 1 summarizes the approach and Figure 7.4 illustrates an example of a single object detected and tracked in the slave camera.

7.4 Validation

The validation operator, ϑ, evaluates the likelihood that object x matches region yi in the slave camera. It considers the dual problem by analyzing the set obtained by Φ(yi, Im, Nm) = {x1, x2, ..., xNm}. In the next section, the choice of Nm will be studied.

A similarity measure ς between the original x and each xi is estimated based on the spatial arrangement of their bounding boxes:


Algorithm 1: Overview of the approach "detect, track, and validate"
Input: A set of objects {x1, x2, ..., xp} observed in the master camera
Output: Locations {y_{x1}, ..., y_{xq}} of the corresponding objects in the image plane of the slave camera
for each object x in the master do
  1. At t = 1, detect and validate:
     y^1_x = arg max_{yi ∈ Φ(x^1, Is, Ns)} ϑ(yi | x^1, Φ(yi, Im, Nm)) (7.8)
  2. At t = 2, if y^1_x exists, track and validate:
     y^2_x = arg max_{yi ∈ Φ(y^1_x, Is, Ns)} ϑ(yi | x^2, Φ(yi, Im, Nm)) (7.9)
     If y^2_x or y^1_x do not exist, detect given the prior from the master and validate:
     y^2_x = arg max_{yi ∈ Φ({x^1, x^2}, Is, Ns)} ϑ(yi | x^2, Φ(yi, Im, Nm)) (7.10)
  3. Repeat step 2 while object x is present in the master
end
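For illustration, the main loop of Algorithm 1 could be sketched as follows. This is an assumption-laden sketch, not the thesis implementation: it reuses the hypothetical `select_region` helper introduced after Equation (7.7), and the default parameter values are placeholders.

```python
def detect_track_validate(master_obs, slave_frames, master_frames,
                          phi, vartheta, Ns=2, Nm=1, T=0.5):
    """Sketch of Algorithm 1 for a single object x observed by the master.
    master_obs[t] is the observation x^t in the master at frame t, or None once x has left."""
    kept_obs, prev_y, trajectory = [], None, []
    for t, x_t in enumerate(master_obs):
        if x_t is None:
            break                                    # step 3: only while x is present in the master
        kept_obs.append(x_t)                         # master prior {x^1, ..., x^t}
        y_t = None
        if prev_y is not None:                       # track and validate (Eq. 7.9)
            y_t = select_region([prev_y], x_t, phi, vartheta,
                                slave_frames[t], master_frames[t], Ns, Nm, T)
        if y_t is None:                              # detect given the master prior (Eqs. 7.8, 7.10)
            y_t = select_region(kept_obs, x_t, phi, vartheta,
                                slave_frames[t], master_frames[t], Ns, Nm, T)
        trajectory.append(y_t)                       # None: object absent from the slave view
        prev_y = y_t
    return trajectory
```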

ςl(x, xi) = 1 − ( (1 − O)/(1 − t1) · wo + (1 − C)/(1 − t1) · wc + (Dc/t2) · wd ) (7.11)

where
– C is the percentage representing how much of the original bounding box of x is covered by the bounding box of xi. Likewise, O is the percentage representing how much of xi is covered by x (see Figure 7.5).
– Dc measures the similarity of the centers of the two bounding boxes: the smaller the Euclidean distance between the centers, the higher Dc.
– t1 is the minimum amount of O and C required, and t2 the threshold on Dc.
Note that we choose ς(x, xi) > 0 if and only if C and O > t1 and Dc < t2; t1 = 0.3 and t2 = 0.75 · max(widthx, heightx) lead to satisfactory results.

A weight w• is associated with each factor to emphasize its priority. In this work, the focus is first on a high coverage of x, then on a similar center of mass, and finally xi should not be too big with respect to x (decent O).¹

A linear ςl may be too sensitive to small differences. A logistic operator is therefore used to reduce the sensitivity when two regions overlap with only a slight difference:

ς(x, xi) = 1/(1 + c1 e^(−c2·O)) · wo + 1/(1 + c1 e^(−c2·C)) · wc + 1/(1 + c1 e^(−c2·Dc)) · wd (7.12)

1. wc = 0.5, wd = 1/3 and wo = 1/6
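As an illustration only (not code from the thesis), the geometric quantities C, O, and the center distance, together with the gated logistic combination of Equation (7.12), could be computed as below for axis-aligned boxes given as (left, top, width, height). The logistic parameters c1 and c2 are left as arguments since their values are not given here, feeding the raw center distance into the logistic as Dc is an assumption (the text does not fully specify how Dc is normalized), and the default weights follow footnote 1.

```python
import math

def box_similarity(bx, bxi, c1, c2, wo=1/6, wc=0.5, wd=1/3, t1=0.3):
    """Illustrative sketch of Eqs. (7.11)-(7.12) for boxes given as (left, top, width, height).
    C: fraction of bx covered by bxi; O: fraction of bxi covered by bx; Dc: center distance."""
    def inter_area(a, b):
        w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
        h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
        return max(0.0, w) * max(0.0, h)

    inter = inter_area(bx, bxi)
    C = inter / (bx[2] * bx[3])                 # coverage of the original box x
    O = inter / (bxi[2] * bxi[3])               # coverage of the candidate box x_i
    Dc = math.hypot((bx[0] + bx[2] / 2) - (bxi[0] + bxi[2] / 2),
                    (bx[1] + bx[3] / 2) - (bxi[1] + bxi[3] / 2))
    t2 = 0.75 * max(bx[2], bx[3])               # distance threshold from the text
    if C <= t1 or O <= t1 or Dc >= t2:
        return 0.0                              # similarity is positive only inside the thresholds

    def logistic(v):
        return 1.0 / (1.0 + c1 * math.exp(-c2 * v))

    return logistic(O) * wo + logistic(C) * wc + logistic(Dc) * wd   # Eq. (7.12)
```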


Figure 7.4: Illustration of the detect, track, and validate process. Only one object is validated and tracked across frames.

Figure 7.5: Illustration of the bounding boxes x (in red) and xi where C ≈ 0.75, O ≈ 0.4.

c1 and c2 are the parameters of the logistic function.

Figure 7.6 plots the behavior of the logistic operator as opposed to the linear one. Figure 7.7 presents an example of the values obtained with ς and ςl.

Finally, ϑ(yi | x, Φ(yi)) is computed as follows:

ϑ(yi | x, Φ(yi)) = max_{xi ∈ Φ(yi)} ς(x, xi) × w(yi) (7.13)

Figure 7.6: The x-axis represents C or O; the y-axis represents its contribution to ς and ςl. It can be seen that for values of C or O close to 1, the contribution also remains almost 1 (full) for the logistic operator.


Figure 7.7: The linear ςl gives 0.63 whereas the proposed ς gives 0.86.

where w(yi) weights region yi with respect to the other yj based on the similarity measurement computed by Φ(x) (see Section 6.3.1):

w(yi) = φ(x, yi) / max_{yj ∈ Φ(x)} φ(x, yj) (7.14)

where φ(x, yi) is the similarity measurement defined in Section 6.3.1.
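Combining the two previous equations, the validation operator could be sketched as follows; this is illustrative only. `phi_score` stands for the similarity measurement φ of Section 6.3.1 and `sigma` for a bounding-box similarity such as the sketch given after Equation (7.12); both are passed in as callables since their exact implementations are not reproduced here.

```python
def validation_score(y_i, x, dual_regions, slave_candidates, phi_score, sigma):
    """Illustrative sketch of Eqs. (7.13)-(7.14).
    dual_regions     : x_1..x_Nm returned by Phi(y_i, I_m, Nm) (the dual problem, Eq. 7.3)
    slave_candidates : y_1..y_Ns returned by Phi(x, I_s, Ns) in the slave image
    phi_score(a, b)  : similarity measurement phi of Section 6.3.1 (assumed callable)
    sigma(a, b)      : bounding-box similarity, e.g. the sketch given after Eq. (7.12)"""
    w = phi_score(x, y_i) / max(phi_score(x, y_j) for y_j in slave_candidates)   # Eq. (7.14)
    return max(sigma(x, x_i) for x_i in dual_regions) * w                        # Eq. (7.13)
```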

7.5 Experimental results and evaluations

7.5.1 Performance of the detect-track-validate

We performed the same experiments as in the previous chapter. Masters are the fixed cameras and slaves the mobile cameras in the scene. Objects observed by masters are searched in all the slaves at all time frames, even if the object might not be present in their views; the validation operator classifies the presence of the object. The performance of the latter depends on two parameters: the number of regions to keep in the searched image plane, Ns, and the number of regions to keep in the dual problem, Nm. Figure 7.8 presents the recall/precision graph for various Ns and Nm. They are compared with the greedy approach that considers the best match proposed by the object localization operator as the matched object (labeled as "best match"), without any validation process. With the proposed validation operator, setting Nm = Ns = 2, the number of false positives is decreased by 70% while the true positive rate decreases by only ∼2%. In other words, almost all the objects present in the view of the slave are correctly classified as present, while the others are correctly discarded with a success rate of 70%. For Nm = Ns = 3, the number of false positives is reduced by half while the precision is reduced by less than 1%. Higher values of Nm and Ns do not necessarily lead to higher performance. Considering Ns = 2 and Nm = 1 is the best tradeoff for our application in terms of cost and precision rate.

In addition, a possible approach to reduce the false positive rate is to threshold the similarity measurements φ. However, if the validation scheme is not used, it is not interesting to threshold φ(x, yi), obtained between the object descriptor from the master and the regions in the slave camera. Figure 7.9 illustrates the histogram of the values obtained when the regions are correctly matched (TP) and those of the false positives (FP). There is no clear decision boundary. Typically, setting the threshold to 4.4 reduces the FP rate by 9% and the TP rate by 11%. However, it is possible to threshold the similarity measurement φ(yi, xi), or the sum φ(x, yi) + φ(yi, xi), obtained in the validation process.

Figure 7.8: Recall/precision graph for various Ns and Nm.

Figure 7.10 shows the histograms for the two cases. Now, an interesting decision boundary exists: if we keep yi such that φ(yi, xi) < 4.1 or φ(x, yi) + φ(yi, xi) < 8.2, the remaining FP rate is reduced by 50% while the TP rate is reduced by only 5%. Therefore, the proposed approach can globally reduce the number of false positives by 75%–85% for a decrease of 5–7% of the precision rate. This is feasible only because of the validation approach considering the dual problem. Without the validation scheme proposed in this work, a reduction of the false positive rate by 80% (with thresholding) would require a reduction of the precision rate by 50%. Figure 7.11 summarizes the overall performance with the different thresholding strategies.

When priors are available, the performance of the system increases. The gain in performance depends on the behavior of the objects. By keeping three observations from the master, the global performance increases by 7%; moving objects are much better detected. Considering the prior from the slave increases the recall rate by 12% and decreases the precision rate by 6%.

Qualitative results are given in Figures 7.12 and 7.17. It can be seen that objects are successfully detected even if the cameras have significant changes in image quality, illumination, and viewpoint. In addition, highlighted objects in the master which are not present in the view of the slave do not generate false positives. Figure 7.18 presents some missed detections and a few false positives.

7.5.2 Master-slave detection

We compare our master-slave system with the state-of-the-art feature-based pedestrian detection system proposed by Tuzel et al. in [20] (referred to as SPD in this section). The performance of the system is quantitatively evaluated by computing the precision P (i.e. the number of true positives divided by the sum of true positives and false positives) and the recall R (i.e. the number of true positives divided by the sum of true positives and false negatives). A true positive is a pedestrian correctly detected. In order to evaluate the performance of the master-slave system in various scenarios, we consider two cases: first, the impact of a good detection performance with the masters, i.e. all objects are correctly detected (referred to as "Masters with Good Detection"), then a poor people detection performance, R = 0.72 and P = 0.76 (referred to as "Masters with Poor Detection").

Figure 7.9: Histogram of the similarity measurements φ(x, yi) for a set of TP and FP.

Figure 7.10: Histograms of the similarity measurements in the validation process: (a) φ(yi, xi); (b) φ(x, yi) + φ(yi, xi).

Figure 7.11: Overview of the recall/precision graph for various thresholding strategies.

First, we evaluate the performance of slaves running the SPD proposed by Tuzel et al. (2008) and collaborating with other slaves or masters (see Figure 7.14). Running only the SPD without any collaboration leads to poor detection performance: a poor precision rate, P = 0.22, with a decent recall of 0.6. Such a system represents the traditional one presented in Figure 7.2, referred to as feature-based object detection. In the remainder of this section, we evaluate the gain in performance when other cameras are considered.

Slaves running the SPD can benefit from other slaves or masters. If pedestrians detected by another slave are also searched for with our appearance-based matching approach, the recall rate slightly increases (referred to as "Slave collaboration with poor SPD" in Figure 7.14). The increase depends on the percentage of objects correctly detected in the other slave. The increase is small because the detection rate in the other slave is poor (R = 0.72, P = 0.76). In addition, both cameras miss the same objects most of the time, which limits the gain in performance. If we suppose that a master is observing the scene, i.e. all the pedestrians are detected by the master, the regions detected by the slave with the SPD can be filtered out with the proposed validation step described in Section 7.4. If poor detection rates are reached with the master, e.g. R = 0.72 and P = 0.76, the precision rate is almost doubled: P = 0.39. The recall rate is affected (R = 0.5) since a poor detection is achieved in the master. Nevertheless, if a good detection is obtained in the master, the precision and the recall rate are clearly increased: R = 0.62 and P = 0.45, validating the effectiveness of our system.

The strength of our master-slave system depends on the objects detected by the cameras. When the cameras detect different objects, the master-slave approach allows for the detection of the missed objects without reducing the precision rate, i.e. without increasing the number of false positives. Figure 7.15 illustrates such performance. When slaves are not running an SPD, i.e. all pedestrians are missed, we are able to reach high recall and precision rates using only the people detected by the masters, although the cameras have different viewpoints and image quality. When a poor detection is achieved by the masters (R = 0.72, P = 0.76), 83% of the pedestrians detected by an SPD are detected in the desired camera with far fewer false positives, i.e. R = 0.5 and P = 0.45.


If people are correctly detected in the masters, the slaves detect people better in terms of recall and precision rate than the traditional system with SPD only: R = 0.62 and P = 0.6. First, a strict validation scheme is used, where only two matched regions are kept in the dual problem to validate the detections (see Figure 7.15). Then, a soft validation is tested, where almost all matched regions in the dual problem are kept. We can see that using only the observations from a master, even with poor detection, leads to better performance than using an SPD: higher precision for the same recall, R = 0.6 and P = 0.27. When a good detection is achieved in the master, higher recall and precision are observed: R = 0.67 and P = 0.29, verifying the advantage of our system.

Interestingly, by comparing Figures 7.14 and 7.15, we can see that globally it is better not to run the SPD in the slave and to search only for the detected objects observed by the masters. In other words, adding an additional source of detection degrades the performance. The underlying reason is that the false positives detected by the SPD add noise to the master-slave approach: they can wrongly be matched across cameras, degrading the performance.

Figures 7.16 to 7.18 illustrate the output of our proposed master-slave system compared with the SPD. Our proposed system only uses detected objects from a master; no SPD is running in the slave. Extracting features from the master enables the detection of pedestrians in challenging situations. Figure 7.16 presents the success of our approach when the pedestrians viewed by the slave are of very low resolution and image quality: the slave is as far away from the pedestrians of interest as the master. When pedestrians are much closer to the slave, they are still correctly detected (Figure 7.17); the approach scales correctly. Figure 7.18 shows that even when the viewpoint is severely different, the algorithm succeeds in detecting the pedestrians. Finally, Figures 7.19 and 7.20 evaluate the strength of the validation operator. None of the objects present in the master are wrongly found in the slave (Figure 7.19). In addition, pedestrians detected in the slave are correctly filtered out given the observation from the master (Figure 7.20).

7.5.3 Discussion

Given our challenging data sets, the proposed master-slave system improves on a stand-alone detector, i.e. it reaches a higher detection rate with fewer false positives. The gain in performance depends on the objects detected by each camera. If two cameras detect and miss the same objects, there will not be any gain in performance with our proposed system. However, if they detect and miss disjoint objects, our proposed system will help to detect the missed objects within the cameras. That is why the gain in performance can vary across scenarios.

One of the main advantages of such a system is its generalization to any object of interest. Regarding the driver assistance application, all potential collisions are of interest. A stroller, a bicycle, or an animal (e.g. a dog) passing in front of a vehicle can be detected with our system since they are detected in the masters given our multi-class dictionary.

Also, some applications require a correspondence between the detected objects across the cameras. Typically, if an object observed by a fixed camera is correctly detected and matched in a mobile camera, higher resolution features can be extracted, leading to a better analysis of the object (such as facial expression analysis, for instance). Techniques to detect an object with a camera cannot be used to match the objects across cameras: by definition, they remove the discriminative parts between two objects of the same category. Our proposed master-slave system not only detects the objects but also matches them across cameras.


Recently, there has been a significant increase of interest in the domain of pedestrian modeling in general ([164]). Various applications, ranging from building design to emergency evacuation, require models with different levels of detail: walking models ([165, 166]), route choice models ([167]), destination choice models ([168, 169]), or activity location models ([170]), to cite a few. The calibration and validation of these models require more and more data about the position, movements, and group behavior of pedestrians. Our system can produce data for such pedestrian modeling, similar to previous data collection processes ([171, 172]).

Finally, the use of mathematical models describing the content of the scene can be combined with the proposed method to make the results even more robust. In particular, recent research developments on pedestrian walking behavior ([166, 173]) have proved to be highly relevant and efficient in the context of pedestrian tracking and detection ([174, 175]).

7.6 Conclusions

In this chapter, we presented a proof-of-concept of our detection method for a mixed network of fixed and mobile cameras. We evaluated its performance in the context of pedestrian detection with a single camera mounted in a vehicle. Previous works detected objects using training data consisting of thousands of observations. In this chapter, objects were detected using a single observation from another viewpoint. A master-slave approach was presented where the collaboration of fixed and mobile cameras adds significant value. It is flexible, as any state-of-the-art detection method can be included in the framework. We showed evidence of the value added by our approach, as well as of its robustness with respect to the quality of the detection algorithm. To the best of our knowledge, this is the first time that such a framework has been proposed. Although the proposed method increases the detection rate of stand-alone detectors, it is still below the practical requirements of a driver assistance system. Nevertheless, such a master-slave framework can be useful in surveillance applications. Face recognition needs high-resolution features, and mobile cameras benefit from their proximity to people to capture high-resolution images. Pan-tilt-zoom (PTZ) cameras can even be used as slaves, collaborating with fixed master cameras to track objects across cameras.


Figure 7.12: Correct detections and no false positives. First and third columns: objects detected by a master. Second and fourth columns: corresponding objects detected and matched with a slave.


Figure 7.13: Some false positives and missed true positives. First and third columns: objects detected by a master. Second and fourth columns: corresponding objects detected and matched with a slave.

Figure 7.14: Performance of a slave running the SPD proposed by Tuzel et al. (2008) and collaborating with other slaves or masters with either a poor or good detection rate.


Figure 7.15: Performance of a slave using only the appearance of the detected people in the masters, given various schemes and detection rates.

Figure 7.16: First column: people detected by a master. Second column: corresponding people detected with a slave. Third column: output of the SPD proposed by Tuzel et al. (2008).


Figure 7.17: First column: people detected by a master. Second column: corresponding people detected with a slave. Third column: output of the SPD proposed by Tuzel et al. (2008).


Figure 7.18: First column: people detected by a master. Second column: corresponding people detected with a slave. Third column: output of the SPD proposed by Tuzel et al. (2008).


Figure 7.19: First column: people detected by a master. Second column: corresponding people detected with a slave. Third column: output of the SPD proposed by Tuzel et al. (2008).


Figure 7.20: First column: people detected by the SPD. Second column: remaining people kept after the proposed validation scheme. Third column: people detected by the master.


8 Conclusions

We have presented a real-time system enabling any network of cameras to detect, track, and analyze people's behavior with sparsity promoting priors. Challenges present in real-world systems have been addressed: extracted features are severely degraded, sensing devices are noisy, and calibration data are not always available. The algorithms we proposed are robust to such challenges. A master-slave framework is described where master cameras locate a dense set of people given severely degraded extracted features. Slave cameras detect objects given noisy, partially occluded observations from the masters' viewpoints. Experimental results have supported our initial belief:

A sparse set of elements is enough to capture and understand people’s behavior.

This conclusion is confirmed over all the processing blocks of our system. From low-level foreground extraction given stereo cameras to behavior summarization, sparsity priors have enabled our system to outperform previous works in terms of detection rates and even computational complexity:

– Foreground silhouettes are better extracted with respect to a regularization term promoting sparsity (TV Disparity-based extraction);

– A small number of primary elements, i.e. binary masks, is enough to locate people on the ground (O-Lasso and SCOOP);

– Trajectories of people across time are reconstructed using sparsity priors on the number of people (Dijkstra Pursuit Tracker);

– A few image descriptors are enough to discard most candidate windows in order to detect and track objects across uncalibrated cameras with non-overlapping fields-of-views (CaGID);

– And finally, a small set of causes drives people's behavior (SpotRank).

The above conclusions and results promote the following future works:
– It is first valuable to develop an algorithm to automatically calibrate the network of fixed cameras with respect to a sparsity constraint. In this thesis, we supposed that fixed cameras are calibrated and hence constructed our dictionary accordingly. It would be very interesting to learn the dictionary automatically from the data. The extracted foreground silhouettes can be used to find the best dictionary that minimizes the fidelity term in Equation 2.5. As a result, a self-configured system could be distributed to capture human behavior.

– The proposed dictionary is made of atoms approximating the ideal foreground silhouettes. Future work can consider other features, such as object contours, to approximate the object of interest. As a result, the dictionary-based framework could be applied to mobile cameras running a simple edge detector.

– The CaGID can further be used to develop a content-based image search engine. The image of an object can be searched within a large database of images. The database of images needs to be pre-processed (offline) to optimize the query (on-line). The grids of image descriptors need to be pre-computed in an optimized manner to reduce the computational cost of both the offline and on-line processes. The proposed CaGID can be combined with hierarchical index structures or hash-based indexing to enable content-based image retrieval in large databases.

The vision-based system presented in this thesis has already raised the interest of many institutions and organizations, such as airports, trade shows, retailers, media, and industries, for security and marketing purposes. It can further be used to automatically study human psychology over large-scale data in order to discover new influences on people's behavior. Our system enables the collection of large amounts of behavioral data via a non-intrusive vision-based system.


Appendix


A.1 Proximal methods

In order to solve our different minimization problems, we use powerful proximal operator-based iterations and the monotone operator splitting theory introduced by Moreau in 1962 [176] and brought to light in the signal processing community by Combettes [36, 177].

Thanks to this theory, very efficient methods can be designed to solve general convex optimization problems of the form

arg min_x f(x) + g(x).

In the case of the Re-Weighted BPDN (2.5), f = ‖W · ‖1 is convex, and g = ι_C = ι_{B²_ε} ∘ A(·), where ι is the indicator function and A is the affine function such that A = D · − y. The convex sets B²_ε and C are such that B²_ε = {x ∈ R^N : ‖x‖2 ≤ ε} and C = {x ∈ R^N : ‖Ax‖2 = ‖y − Dx‖2 ≤ ε}. As f and g are both non-differentiable, the Douglas-Rachford splitting method [37, 177] is used. The Douglas-Rachford recursion to solve the reweighted ℓ1-BPDN can be written in the compact form

x^(t+1) = x^(t) + µ_t [S_W ∘ (2 P_C − Id) − P_C](x^(t)), (A.1)

where S_W is the component-wise soft-thresholding operator with threshold vector W, and P_C is the orthogonal projection onto the closed convex set C. When D is a bounded linear operator with a bound 0 ≤ c < ∞ such that 0 ≤ DD* ≤ c Id, the numerical implementation of this projection is defined as described in [37]. Let {β_t}_{t∈N} be a sequence with 0 < inf_t β_t ≤ sup_t β_t < 2/c, and define the two sequences {u^(t)}_{t∈N} and {p^(t)}_{t∈N} by

u^(t+1) = β_t (Id − P_{B²_ε}) (β_t^{−1} u^(t) + D(x − D* u^(t)) − y),
p^(t+1) = x − D* u^(t+1). (A.2)

Then from [37] we get that u^(t) → u and p^(t) → P_C(x).

In the case of the Re-Weighted Lasso problem (2.6), f = ‖y − D · ‖₂² is convex and differentiable with a β-Lipschitz gradient, and g = ι_{B¹_{W,εp}} with B¹_{W,εp} = {x ∈ R^N : ‖Wx‖1 ≤ εp}. More precisely, as the ℓ1-norm is non-differentiable, the Forward-Backward splitting is used [37, 177]. Forward-Backward (FB) splitting is essentially a generalization of the classical gradient projection method for constrained convex optimization. It can be written in the compact form

x^(t+1) = P_{B¹_{W,εp}} ∘ (Id − µ_t ∇f)(x^(t)), (A.3)

where 0 < inf_t µ_t ≤ sup_t µ_t < 2/β for the iteration to converge (weakly in general), ∇ is the gradient operator, and P_{B¹_{W,εp}} is the orthogonal projection onto the convex set B¹_{W,εp}. This projection can be efficiently computed thanks to the developments of Appendix A.2. From [36], one can show that in both cases presented above, the sequence (x^(t))_{t∈N} converges to some point x*, which is the solution of the problem.

A.2 Projection onto a weighted ℓ1 ball

We present in this section an algorithm and its numerical implementation that solves the problem

y = arg min_{u∈R^n} ‖u − x‖₂²  s.t.  ‖Wu‖1 ≤ ε, (A.4)


for a non-negative diagonal matrix W ∈ R^{n×n}. This problem can be seen as the projection of the point x ∈ R^n onto the weighted ℓ1 ball B¹_{W,ε} = {u : ‖Wu‖1 ≤ ε}.

If ‖Wx‖1 ≤ ε, there is nothing to do and y = x. In the other case, the solution is clearly on the surface of B¹_{W,ε}, so that we must solve

y = arg min_{u∈R^n} ‖u − x‖₂²  s.t.  ‖Wu‖1 = ε. (A.5)

In addition, since this ball is convex and centered on the origin, it is clear that sign xi = sign yi; therefore, up to the appropriate flipping of some coordinate axes in R^n, we can assume xi, yi ≥ 0.

The Lagrangian form of problem (A.5) is

L(u, θ) = ½ ‖x − u‖₂² + θ ( ∑_{i=1}^{n} wi ui − ε ), (A.6)

where θ ∈ R is a Lagrange multiplier. For a given θ, the minimum of L is reached when

ui = xi − wi θ + ζi = xi − wi θ = wi (xi/wi − θ), (A.7)

if xi/wi > θ, and ui = 0 otherwise. In other words,

ui = S_{wi θ}(xi),

where S_λ(v) = sign(v) (|v| − λ)+ is the soft-thresholding of v ∈ R by the threshold λ > 0, with (v)+ = v if v ≥ 0 and 0 otherwise. The minimum of L with respect to θ is thus reached when ‖W S_{wi θ}(xi)‖1 = ε, i.e. when

∑_i wi² (|xi/wi| − θ)+ = ε. (A.8)

This can be computed very efficiently by the following simple method.

Let the vector z ∈ R^n be obtained by sorting the values |xi/wi| in decreasing order, a process that can be realized in O(n log n) operations.

If we set θ = z_{n−k+1} for some index k, we get ∑_{i=1}^{n} wi² (zi − θ)+ = ∑_{i=1}^{n−k} wi² (zi − z_{n−k+1}). Let the index i* be such that

i* = max{ 1 ≤ k ≤ n : ∑_{i=1}^{n−k} wi² (zi − z_{n−k+1}) ≥ ε }.

Therefore, by construction, the θ satisfying (A.8) belongs to the interval I = [z_{n−i*+1}, z_{n−i*}] ⊂ R. In that range, (A.8) becomes

∑_{i=1}^{n−i*} wi² (zi − θ) = ε,

so that finally,

θ = θ(i*) = ( ∑_{i=1}^{n−i*} wi² zi − ε ) / ( ∑_{i=1}^{n−i*} wi² ).


A.3 Proof of Proposition 1

Having a (k, 2e)-disjunct dictionary implies a one-to-one map between all k-sparse x and their corresponding y. Conversely, if every k-sparse x leads to a distinguishable y, then D must be (k−1, 2e)-disjunct.

Proof. Assume two different k-sparse boolean vectors x and x′ that are supported on sets S and S′. Let y and y′ be, correspondingly, the bitwise OR of the columns of D indexed by S and S′, i.e., y = D·x and y′ = D·x′ with boolean arithmetic. Choose a column di of the dictionary so that i ∈ S′ but not in S. Since D is (k, 2e)-disjunct, the support of di has 2e + 1 elements that are not included in supp(y). Therefore, considering noises that flip e bits of di and e bits of y, there will still be at least one element of supp(di) that is not included in the support of the noisy MSV y ⊕ z, which makes x and x′ distinguishable from their noisy MSVs.

For the converse, assume D is not a (k−1, 2e)-disjunct matrix and pick a pair consisting of a set S ⊆ [n] with |S| ≤ k, and an index i ∉ S, that is a counterexample to the (k−1, 2e)-disjunctness. Assume vectors x and x′ that are supported on S and S ∪ {i}, respectively. The noise can be configured in an adversarial way so that, by flipping e zero bits of y to one, and e one bits of di to zero, the MSV outcomes of x and x′ become indistinguishable.

A.4 Proof of Theorem 2

Thresholding successfully recovers any k-sparse occupancy vector, if D is (k, 2e)-disjunct.

Proof. Assume a boolean vector x, supported on the set S ⊆ [n] with |S| ≤ k. Since the noise has flipped at most e bits of y, obviously every i ∈ S satisfies (3.2).

In addition, define ȳ to be the bitwise OR of the columns of D indexed by S, i.e., the noiseless version of the MSV. Again, by the assumption on the noise power, for any column of D we have

|supp(di)\supp(ȳ)| ≤ |supp(di)\supp(y)| + e.

Now, if any i ∉ S satisfies (3.2), it implies |supp(di)\supp(ȳ)| ≤ 2e, which violates the assumption that the dictionary is (k, 2e)-disjunct. Therefore, thresholding recovers exactly the support of x.


Bibliography

[1] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multicamera people tracking with a probabilistic occupancy map,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267–282, 2008.

[2] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International Journal of Computer Vision, vol. 47, no. 1, pp. 7–42, 2002.

[3] J. Domke and Y. Aloimonos, “A probabilistic notion of correspondence and the epipolar constraint,” in 3DPVT, pp. 41–48, IEEE Computer Society, 2006.

[4] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother, “A comparative study of energy minimization methods for Markov random fields with smoothness-based priors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1068–1080, 2007.

[5] C. L. Zitnick and T. Kanade, “A cooperative algorithm for stereo matching and occlusion detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 7, pp. 675–684, 2000.

[6] L. Bagnato, P. Frossard, and P. Vandergheynst, “A variational framework for structure from motion in omnidirectional image sequences,” Journal of Mathematical Imaging and Vision, pp. 1–12, 2009.

[7] K. Mühlmann, D. Maier, J. Hesser, and R. Männer, “Calculating dense disparity maps from color stereo images, an efficient implementation,” International Journal of Computer Vision, vol. 47, no. 1-3, pp. 79–88, 2002.

[8] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” in ICCV, pp. 377–384, 1999.

[9] “From Keeneo website: http://www.keeneo.com/vs1.html,” 2010.

[10] M. McCahill and C. Norris, “CCTV in London,” 2002.

[11] “http://www.economist.com/node/15865270.”

[12] “By Herb Sorensen, Ph.D., Scientific Advisor, TNS Retail and Shopper,” 2010.

[13] “www.watchover-eu.org.”


[14] “Statistics of road traffic accidents in Europe and North America.” Economic Commission for Europe, United Nations, Geneva, 2007.

[15] F. Porikli, “Achieving real-time object detection and tracking under extreme conditions,” Journal of Real-Time Image Processing, vol. 1, no. 1, pp. 33–40, 2006.

[16] C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time tracking,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 1999.

[17] H. Lee, C. Wu, and H. Aghajan, “Nonstationary Background Removal via Multiple Camera Collaboration,” in ACM/IEEE International Conference on Distributed Smart Cameras, pp. 321–327, IEEE, 2007.

[18] L. Tessens, M. Morbée, W. Philips, R. Kleihorst, and H. Aghajan, “Efficient approximate foreground detection for low-resource devices,” in ACM/IEEE International Conference on Distributed Smart Cameras, pp. 1–8, IEEE, 2009.

[19] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. I: 886–893, 2005.

[20] O. Tuzel, F. Porikli, and P. Meer, “Pedestrian detection via classification on Riemannian manifolds,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 1713–1727, 2008.

[21] B. Leibe, N. Cornelis, K. Cornelis, L. Van Gool, and E. Zurich, “Dynamic 3D scene analysis from a moving vehicle,” in Proc. IEEE International Conference on Computer Vision and Pattern Recognition, 2007.

[22] I. Gorodnitsky and B. Rao, “Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm,” IEEE Transactions on Signal Processing, vol. 45, no. 3, pp. 600–616, 1997.

[23] D. Malioutov, M. Cetin, and A. Willsky, “A sparse signal reconstruction perspective for source localization with sensor arrays,” IEEE Transactions on Signal Processing, vol. 53, no. 8, Part 2, pp. 3010–3022, 2005.

[24] V. Cevher, M. Duarte, and R. Baraniuk, “Distributed Target Localization via Spatial Sparsity,” in Proc. European Signal Processing Conference, 2008.

[25] I. Loris, G. Nolet, I. Daubechies, and F. Dahlen, “Tomographic inversion using l1-norm regularization of wavelet coefficients,” Geophysical Journal International, vol. 170, no. 1, pp. 359–370, 2007.

[26] I. Daubechies and G. Teschke, “Variational image restoration by means of wavelets: Simultaneous decomposition, deblurring, and denoising,” Applied and Computational Harmonic Analysis, vol. 19, no. 1, pp. 1–16, 2005.

[27] M. Elad, J. Starck, P. Querre, and D. Donoho, “Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA),” Applied and Computational Harmonic Analysis, vol. 19, no. 3, pp. 340–358, 2005.


[28] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Annals of Statistics, vol. 32, no. 2, pp. 407–451, 2004.

[29] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.

[30] D. Donoho, “Superresolution via sparsity constraints,” SIAM Journal on Mathematical Analysis, vol. 23, p. 1309, 1992.

[31] D. Donoho et al., “Nonlinear solution of linear inverse problems by wavelet-vaguelette decomposition,” Applied and Computational Harmonic Analysis, vol. 2, no. 2, pp. 101–126, 1995.

[32] J. Starck, M. Nguyen, and F. Murtagh, “Wavelets and curvelets for image deconvolution: a combined approach,” Signal Processing, vol. 83, no. 10, pp. 2279–2283, 2003.

[33] D. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.

[34] E. Candès, “Compressive sampling,” in Proceedings of the International Congress of Mathematicians, vol. 3, pp. 1433–1452, Citeseer, 2006.

[35] R. Baraniuk, “Compressive sensing,” IEEE Signal Processing Magazine, vol. 24, no. 4, p. 118, 2007.

[36] P. Combettes, “Solving monotone inclusions via compositions of nonexpansive averaged operators,” Optimization, vol. 53, no. 5, pp. 475–504, 2004.

[37] M. Fadili and J.-L. Starck, “Monotone operator splitting for fast sparse solutions of inverse problems,” SIAM Journal on Imaging Sciences, pp. 2005–2006, 2009.

[38] T. Blumensath and M. Davies, “Iterative thresholding for sparse approximations,” Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 629–654, 2008.

[39] M. Fornasier and H. Rauhut, “Iterative thresholding algorithms,” Applied and Computational Harmonic Analysis, vol. 25, no. 2, pp. 187–208, 2008.

[40] B. Elahi, Spirituality is a science: foundations of natural spirituality. Associated University Presses, 1999.

[41] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, “Pedestrian detection using wavelet templates,” Proc. IEEE International Conference on Computer Vision and Pattern Recognition, vol. 97, pp. 193–199, 1997.

[42] C. Papageorgiou and T. Poggio, “Trainable pedestrian detection,” in Proc. IEEE International Conference on Image Processing, vol. 4, 1999.

[43] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, “Pedestrian detection using infrared images and histograms of oriented gradients,” in Proc. IEEE Symposium on Intelligent Vehicles, (Tokyo, Japan), pp. 206–212, June 2006.

[44] A. Alahi, M. Bierlaire, and M. Kunt, “Cascade of descriptors to detect and track objects across any network of cameras,” Computer Vision and Image Understanding, 2009.


[45] H. Cheng, N. Zheng, and J. Qin, “Pedestrian detection using sparse Gabor filter and support vector machine,” Proc. IEEE Symposium on Intelligent Vehicles, pp. 583–587, June 2005.

[46] P. Viola, M. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” International Journal of Computer Vision, vol. 63, no. 2, pp. 153–161, 2005.

[47] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.

[48] S. S. Chen, D. Donoho, and M. Saunders, “Atomic decomposition by basis pursuit,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1998.

[49] K. Mueller, A. Smolic, M. Droese, P. Voigt, and T. Wienand, “Multi-texture modeling of 3D traffic scenes,” in International Conference on Multimedia and Expo, vol. 1, pp. 657–660, 2003.

[50] C. Stauffer and K. Tieu, “Automated multi-camera planar tracking correspondence modeling,” in Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. I: 259–266, 2003.

[51] J. Black, T. Ellis, and P. Rosin, “Multi view image surveillance and tracking,” Proc. IEEE Workshop on Motion and Video Computing, p. 169, 2002.

[52] J. Orwell, S. Massey, P. Remagnino, D. Greenhill, and G. A. Jones, “A multi-agent framework for visual surveillance,” in Proc. IEEE International Conference on Image Analysis and Processing, (Washington, DC, USA), p. 1104, IEEE Computer Society, 1999.

[53] Y. Caspi, D. Simakov, and M. Irani, “Feature-based sequence-to-sequence matching,” International Journal of Computer Vision, vol. 68, pp. 53–64, June 2006.

[54] K. Kim and L. Davis, “Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering,” in Proc. European Conference on Computer Vision, pp. III: 98–109, 2006.

[55] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989.

[56] J. Franco and E. Boyer, “Fusion of multiview silhouette cues using a space occupancy grid,” in Tenth IEEE International Conference on Computer Vision (ICCV 2005), vol. 2, 2005.

[57] D. Yang, H. Gonzalez-Banos, and L. Guibas, “Counting people in crowds with a real-time network of simple image sensors,” in Proc. IEEE International Conference on Computer Vision, p. 122, 2003.

[58] T. Zhao and R. Nevatia, “Bayesian human segmentation in crowded situations,” in Proc. IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, 2003.

[59] R. Eshel and Y. Moses, “Homography based multiple camera detection and tracking of people in a dense crowd,” in Proc. IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–8, 2008.


[60] S. Khan and M. Shah, “A multiview approach to tracking people in crowded scenes using a planar homography constraint,” in Proc. European Conference on Computer Vision, pp. IV: 133–146, 2006.

[61] S. M. Khan and M. Shah, “Tracking multiple occluding people by localizing on multiple scene planes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 505–519, 2009.

[62] D. Delannay, N. Danhier, and C. D. Vleeschouwer, “Detection and recognition of sports(wo)man from multiple views,” in Proc. ACM/IEEE International Conference on Distributed Smart Cameras, (Como, Italy), 30 Aug. - 2 Sep. 2009.

[63] D. Reddy, A. Sankaranarayanan, V. Cevher, and R. Chellappa, “Compressed sensing for multi-view tracking and 3-D voxel reconstruction,” in Proc. IEEE International Conference on Image Processing, pp. 221–224, 2008.

[64] J. Berclaz, F. Fleuret, and P. Fua, “Robust people tracking with global trajectory optimization,” in Conference on Computer Vision and Pattern Recognition, 2006.

[65] A. Alahi, Y. Boursier, L. Jacques, and P. Vandergheynst, “A Sparsity Constrained Inverse Problem to Locate People in a Network of Cameras,” in 16th International Conference on Digital Signal Processing, (Aegean island of Santorini, Greece), 2009.

[66] B. Natarajan, “Sparse Approximate Solutions to Linear Systems,” SIAM Journal on Computing, vol. 24, p. 227, 1995.

[67] E. Candes, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted ℓ1 minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877–905, 2008.

[68] R. Chartrand and W. Yin, “Iteratively reweighted algorithms for compressive sensing,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pp. 3869–3872, IEEE, 2008.

[69] A. Alahi, Y. Boursier, L. Jacques, and P. Vandergheynst, “Sport Players Detection and Tracking With a Mixed Network of Planar and Omnidirectional Cameras,” in Third ACM/IEEE International Conference on Distributed Smart Cameras, Challenge Prize Winner, (Como), 2009.

[70] A. Alahi, L. Jacques, Y. Boursier, and P. Vandergheynst, “Sparsity-driven People Localization Algorithm: Evaluation in Crowded Scenes Environments,” in Proc. IEEE Int’l Workshop on Performance Evaluation of Tracking and Surveillance, (Snowbird, Utah), 2009.

[71] K. Schnass and P. Vandergheynst, “Average performance analysis for thresholding,” IEEE Signal Processing Letters, vol. 14, no. 11, pp. 828–831, 2007.

[72] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.

[73] V. Vazirani, Approximation algorithms. Springer Verlag, 2001.

[74] A. Alahi, L. Jacques, Y. Boursier, and P. Vandergheynst, “Sparsity driven people localization with a heterogeneous network of cameras,” Journal of Mathematical Imaging and Vision, pp. 1–20, 2011.


[75] A. C. E. Maggio, Video tracking: Theory and Practice. Wiley, 2011.

[76] M. Taj and A. Cavallaro, “Multi-view Multi-object Detection and Tracking,” Computer Vision, pp. 263–280, 2010.

[77] N. Pham, W. Huang, and S. Ong, “Probability hypothesis density approach for multi-camera multi-object tracking,” in Proceedings of the 8th Asian Conference on Computer Vision - Volume Part I, pp. 875–884, Springer-Verlag, 2007.

[78] V. I. Morariu and O. I. Camps, “Modeling correspondences for multi-camera tracking using nonlinear manifold learning and target dynamics,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 545–552, 2006.

[79] A. Mittal and L. Davis, “M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene,” International Journal of Computer Vision, vol. 51, no. 3, pp. 189–203, 2003.

[80] D. Focken and R. Stiefelhagen, “Towards vision-based 3-D people tracking in a smart room,” IEEE International Conference on Multimodal Interfaces, pp. 400–405, 2002.

[81] K. Smith, D. Gatica-Perez, and J. Odobez, “Using particles to track varying numbers of interacting people,” IEEE Conference on Computer Vision and Pattern Recognition, 2005.

[82] K. Shafique and M. Shah, “A noniterative greedy algorithm for multiframe point correspondence,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 51–65, 2005.

[83] J. Wolf, A. Viterbi, and G. Dixon, “Finding the best set of K paths through a trellis with application to multitarget tracking,” IEEE Transactions on Aerospace and Electronic Systems, vol. 25, no. 2, pp. 287–296, 1989.

[84] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple Object Tracking using K-Shortest Paths Optimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

[85] B. Leibe, K. Schindler, and L. Van Gool, “Coupled detection and trajectory estimation for multi-object tracking,” in IEEE International Conference on Computer Vision, pp. 1–8, IEEE, 2007.

[86] M. Taj and A. Cavallaro, “Multi-camera track-before-detect,” in ACM/IEEE International Conference on Distributed Smart Cameras, pp. 1–6, IEEE, 2009.

[87] E. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, vol. 1, no. 1, pp. 269–271, 1959.

[88] B. Morris and M. Trivedi, “A survey of vision-based trajectory learning and analysis for surveillance,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 8, pp. 1114–1127, 2008.

[89] D. Makris and T. Ellis, “Path detection in video surveillance,” Image and Vision Computing, vol. 20, no. 12, pp. 895–903, 2002.

[90] N. Brandle, D. Bauer, and S. Seer, “Track-based finding of stopping pedestrians - a practical approach for analyzing a public infrastructure,” in Intelligent Transportation Systems Conference (ITSC ’06), pp. 115–120, IEEE, 2006.


[91] G. Andrienko, N. Andrienko, and S. Wrobel, "Visual analytics tools for analysis of movement data," SIGKDD Explorations Newsletter, vol. 9, pp. 38–46, December 2007.

[92] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank, "A system for learning statistical motion patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1450–1464, 2006.

[93] X. Li, W. Hu, and W. Hu, "A coarse-to-fine strategy for vehicle motion trajectory clustering," Pattern Recognition, vol. 1, pp. 591–594, 2006.

[94] F. Porikli, "Learning object trajectory patterns by spectral clustering," in IEEE International Conference on Multimedia and Expo, vol. 2, pp. 1171–1174, IEEE, 2005.

[95] F. Bashir, W. Qu, A. Khokhar, and D. Schonfeld, "HMM-based motion recognition system using segmented PCA," in IEEE International Conference on Image Processing, vol. 3, IEEE, 2006.

[96] C. Piciarelli and G. Foresti, "On-line trajectory clustering for anomalous events detection," Pattern Recognition Letters, vol. 27, no. 15, pp. 1835–1842, 2006.

[97] A. Naftel and S. Khalid, "Classifying spatiotemporal object trajectories using unsupervised learning in the coefficient feature space," Multimedia Systems, vol. 12, pp. 227–238, 2006.

[98] N. Brandle, D. Bauer, and S. Seer, "Track-based finding of stopping pedestrians - a practical approach for analyzing a public infrastructure," in IEEE Intelligent Transportation Systems Conference (ITSC '06), pp. 115–120, 2006.

[99] M. Brand and V. Kettnaker, "Discovery and segmentation of activities in video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 844–851, Aug. 2000.

[100] H. Aghajan and C. Wu, "From distributed vision networks to human behavior interpretation," in Proceedings of the Behaviour Monitoring and Interpretation Workshop at the 30th German Conference on Artificial Intelligence, 2007.

[101] C. Wu, A. Khalili, and H. Aghajan, "Multiview activity recognition in smart homes with spatio-temporal features," in ACM/IEEE International Conference on Distributed Smart Cameras, pp. 142–149, ACM, 2010.

[102] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," 1998.

[103] B. Jiang, "Ranking spaces for predicting human movement in an urban environment," International Journal of Geographical Information Science, vol. 23, no. 7, pp. 823–837, 2009.

[104] A. El-Geneidy and D. M. Levinson, "Place rank: Valuing spatial interactions," Working Papers 000026, University of Minnesota: Nexus Research Group, 2010.

[105] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," in Seventh International World-Wide Web Conference (WWW 1998), 1998.

[106] A. Alahi, L. Jacques, Y. Boursier, and P. Vandergheynst, "Sparsity driven people localization with heterogeneous network of cameras," Submitted to Journal of Mathematical Imaging and Vision, 2010.


[107] A. Alahi, M. Golbabaee, and P. Vandergheynst, "Real-time sparsity driven people behavior analysis," Submitted to Journal on Computer Vision, 2011.

[108] Y. A. Ivanov, A. F. Bobick, and J. Liu, "Fast lighting independent background subtraction," International Journal of Computer Vision, vol. 37, no. 2, pp. 199–207, 2000.

[109] S. Lim, A. Mittal, L. Davis, and N. Paragios, "Fast illumination-invariant background subtraction using two views: Error analysis, sensor placement and applications," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1071–1078, IEEE Computer Society, 2005.

[110] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother, "Bi-layer segmentation of binocular stereo video," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 407–414, IEEE Computer Society, 2005.

[111] E. Arce and J. Marroquin, "High-precision stereo disparity estimation using HMMF models," Image and Vision Computing, vol. 25, no. 5, pp. 623–636, 2007.

[112] J. H. McIntosh and K. M. Mutch, "Matching straight lines," Computer Vision, Graphics, and Image Processing, vol. 43, no. 3, pp. 386–408, 1988.

[113] A. T. Brint and M. Brady, "Stereo matching of curves," Image and Vision Computing, vol. 8, no. 1, pp. 50–56, 1990.

[114] W. E. L. Grimson, "Computational experiments with a feature based stereo algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 7, pp. 17–34, January 1985.

[115] R. Bolles, H. Baker, and M. Hannah, "The JISCT stereo evaluation," in DARPA Image Understanding Workshop, pp. 263–274, 1993.

[116] P. Anandan, "A computational framework and an algorithm for the measurement of visual motion," International Journal of Computer Vision, vol. 2, no. 3, pp. 283–310, 1989.

[117] D. J. Fleet and A. D. Jepson, "Computation of component image velocity from local phase information," International Journal of Computer Vision, vol. 5, no. 1, pp. 77–104, 1990.

[118] S. Birchfield and C. Tomasi, "A pixel dissimilarity measure that is insensitive to image sampling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 4, pp. 401–406, 1998.

[119] J. Schmidt, H. Niemann, and S. Vogt, "Dense disparity maps in real-time with an application to augmented reality," IEEE Workshop on Applications of Computer Vision, p. 225, 2002.

[120] P. Fua, "A parallel stereo algorithm that produces dense depth maps and preserves image features," Machine Vision and Applications, vol. 6, pp. 35–49, 1993.

[121] B. Horn and B. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, no. 1-3, pp. 185–203, 1981.

[122] T. Poggio, V. Torre, and C. Koch, "Computational vision and regularization theory," Nature, vol. 317, no. 6035, pp. 314–319, 1985.


[123] S. Lee, Y. Kanatsugu, and J. Park, "A hierarchical method of MAP-based stochastic diffusion and disparity estimation," in IEEE International Conference on Image Processing, vol. 2, pp. II-541, IEEE, 2002.

[124] V. Kolmogorov and R. Zabih, "Multi-camera scene reconstruction via graph cuts," in European Conference on Computer Vision, pp. 8–40, Springer, 2002.

[125] T. Yu, R. Lin, B. Super, and B. Tang, "Efficient message representations for belief propagation," in IEEE International Conference on Computer Vision, pp. 1–8, IEEE.

[126] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," in European Conference on Computer Vision, pp. 25–36, Springer, 2004.

[127] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers, "An improved algorithm for TV-L1 optical flow," Statistical and Geometrical Approaches to Visual Motion Analysis, pp. 23–45, 2009.

[128] A. Weishaupt, L. Bagnato, and P. Vandergheynst, "Fast Structure from Motion for Planar Image Sequences," in Proceedings of Eusipco 2010, 2010.

[129] L. Wang, C. Zhang, R. Yang, and C. Zhang, "TofCut: Towards robust real-time foreground extraction using a time-of-flight camera,"

[130] W. Sun and S. P. Spackman, "Multi-object segmentation by stereo mismatch," Machine Vision and Applications, vol. 20, no. 6, pp. 339–352, 2009.

[131] O. Tuzel, F. Porikli, and P. Meer, "Region covariance: A fast descriptor for detection and classification," in Proc. European Conference on Computer Vision, 2006.

[132] M. Stricker and M. Orengo, "Similarity of color images," in Proc. SPIE Storage and Retrieval for Image and Video Databases, vol. 2420, pp. 381–392, San Jose, CA, USA, 1995.

[133] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91–110, November 2004.

[134] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Lecture Notes in Computer Science, vol. 3951, p. 404, 2006.

[135] P. Gabriel, J. Hayet, J. Piater, and J. Verly, "Object tracking using color interest points," Proc. IEEE International Conference on Advanced Video and Signal based Surveillance, pp. 159–164, 2005.

[136] A. Rosenfeld and G. Vanderbrug, "Coarse-fine template matching," IEEE Transactions on Systems, Man and Cybernetics, vol. 7, pp. 104–107, 1977.

[137] B. Funt and G. Finlayson, "Color constant color indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 522–529, 1995.

[138] D. Gray and H. Tao, "Viewpoint Invariant Pedestrian Recognition with an Ensemble of Localized Features," in Proc. European Conference on Computer Vision, pp. 262–275, Springer, 2008.


[139] D. Comaniciu and V. Ramesh, "Real-time tracking of non-rigid objects using mean shift," US Patent 6,590,999, 2003.

[140] B. Prosser, S. Gong, and T. Xiang, "Multi-camera matching under illumination change over time," in European Conference on Computer Vision, 2008.

[141] A. Shashua, Y. Gdalyahu, and G. Hayun, "Pedestrian detection for driving assistance systems: Single-frame classification and system level performance," in International Symposium on Intelligent Vehicles, pp. 1–6, 2004.

[142] X. Wang, G. Doretto, T. Sebastian, J. Rittscher, and P. Tu, "Shape and appearance context modeling," in Proc. IEEE International Conference on Computer Vision, 2007.

[143] N. Gheissari, T. Sebastian, and R. Hartley, "Person reidentification using spatiotemporal appearance," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2006.

[144] A. Alahi, D. Marimon, M. Bierlaire, and M. Kunt, "A Master-Slave Approach for Object Detection and Matching with Fixed and Mobile Cameras," in 15th IEEE International Conference on Image Processing, (San Diego), pp. 1712–1715, 2008.

[145] A. Alahi, M. Bierlaire, and M. Kunt, "Object Detection and Matching with Mobile Cameras Collaborating with Fixed Cameras," in The 10th European Conference on Computer Vision, (Marseilles, France), pp. 1542–1550, 2008.

[146] W. Forstner and B. Moonen, "A metric for covariance matrices," Quo vadis geodesia, pp. 113–128, 1999.

[147] W. Freeman and E. Adelson, "The design and use of steerable filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 9, pp. 891–906, 1991.

[148] L. Florack, B. Ter Haar Romeny, J. Koenderink, and M. Viergever, "General intensity transformations and differential invariants," Journal of Mathematical Imaging and Vision, vol. 4, no. 2, pp. 171–187, 1994.

[149] A. Baumberg, "Reliable feature matching across widely separated views," in Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 131–137, IEEE Computer Society, 2000.

[150] G. Carneiro and A. Jepson, "Multi-scale phase-based local features," in Proc. IEEE International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 171–187, 2003.

[151] F. Mindru, T. Tuytelaars, L. Van Gool, and T. Moons, "Moment invariants for recognition under changing viewpoint and illumination," Computer Vision and Image Understanding, vol. 94, no. 1-3, pp. 3–27, 2004.

[152] K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1615–1630, 2005.

[153] D. Gray, S. Brennan, and H. Tao, "Evaluating Appearance Models for Recognition, Reacquisition, and Tracking," in IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2007.


[154] C. Chan, F. Bu, and S. Shladover, Experimental Vehicle Platform for Pedestrian Detection. California PATH Program, Institute of Transportation Studies, University of California at Berkeley, 2006.

[155] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi, "Shape-based pedestrian detection," Proc. IEEE Symposium on Intelligent Vehicles, pp. 215–220, 2000.

[156] L. Zhao and C. Thorpe, "Stereo- and neural network-based pedestrian detection," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 3, pp. 148–154, 2000.

[157] A. Ess, B. Leibe, and L. Van Gool, "Depth and Appearance for Mobile Scene Analysis," in Proc. IEEE International Conference on Computer Vision, pp. 1–8, 2007.

[158] S. Bota and S. Nedevschi, "Multi-feature walking pedestrians detection for driving assistance systems," IET Intelligent Transport Systems, vol. 2, no. 2, pp. 92–104, 2008.

[159] M. Bertozzi, A. Broggi, M. Felisa, G. Vezzoni, and M. Del Rose, "Low-level pedestrian detection by means of visible and far infra-red tetra-vision," in Proc. IEEE Symposium on Intelligent Vehicles, (Tokyo, Japan), pp. 231–236, June 2006.

[160] A. Broggi, R. I. Fedriga, A. Tagliati, T. Graf, and M. Meinecke, "Pedestrian detection on a moving vehicle: an investigation about near infra-red images," in Proc. IEEE Symposium on Intelligent Vehicles, (Tokyo, Japan), pp. 431–436, June 2006.

[161] D. Gavrila, "Pedestrian detection from a moving vehicle," in Proc. European Conference on Computer Vision, pp. II: 37–49, 2000.

[162] I. Katz and H. Aghajan, "Multiple camera-based chamfer matching for pedestrian detection," in ACM/IEEE International Conference on Distributed Smart Cameras, pp. 1–5, IEEE, 2008.

[163] M. Bertozzi, A. Broggi, R. Chapuis, F. Chausse, A. Fascioli, and A. Tibaldi, "Shape-based pedestrian detection and localization," in Proc. International IEEE Conference on Intelligent Transportation Systems 2003, (Shanghai, China), pp. 328–333, Oct. 2003.

[164] M. Schreckenberg and S. Sharma, eds., Pedestrian and Evacuation Dynamics, 2002.

[165] D. Helbing and P. Molnar, "Social force model for pedestrian dynamics," Physical Review E, vol. 51, no. 5, pp. 4282–4286, 1995.

[166] T. Robin, G. Antonini, M. Bierlaire, and J. Cruz, "Specification, estimation and validation of a pedestrian walking behavior model," Transportation Research Part B: Methodological, vol. 43, no. 1, pp. 36–56, 2009.

[167] S. Hoogendoorn and P. Bovy, "Pedestrian route-choice and activity scheduling theory and models," Transportation Research Part B, vol. 38, pp. 169–190, 2004.

[168] B. G. Dellaert, T. A. Arentze, M. Bierlaire, A. W. Borgers, and H. J. Timmermans, "Investigating consumers' tendency to combine multiple shopping purposes and destinations," Journal of Marketing Research, vol. 35, pp. 177–188, May 1998.


[169] W. Zhu and H. Timmermans, "Exploring heuristics underlying pedestrian shopping decision processes," in Innovations in Design and Decision Support Systems in Architecture and Urban Planning, vol. 2, Springer Netherlands, 2006.

[170] A. Borgers and H. Timmermans, "A model of pedestrian route choice and demand for retail facilities within inner-city shopping areas," Geographical Analysis, vol. 18, pp. 115–128, April 1986.

[171] K. Teknomo, Microscopic Pedestrian Flow Characteristics: Development of an Image Processing Data Collection and Simulation Model. PhD thesis, Tohoku University, Sendai, Japan, 2002.

[172] S. Hoogendoorn, W. Daamen, and P. Bovy, "Extracting microscopic pedestrian characteristics from video data: results from experimental research into pedestrian walking behavior," in 82nd Annual Meeting of the Transportation Research Board, (Washington D.C.), pp. 20–30, 2003.

[173] G. Antonini, M. Bierlaire, and M. Weber, "Discrete choice models of pedestrian walking behavior," Transportation Research Part B: Methodological, vol. 40, no. 8, pp. 667–687, 2006.

[174] G. Antonini, S. Venegas, M. Bierlaire, and J.-P. Thiran, "Behavioral priors for detection and tracking of pedestrians in video sequences," International Journal of Computer Vision, vol. 69, no. 2, pp. 159–180, 2006.

[175] S. Venegas, G. Antonini, J.-P. Thiran, and M. Bierlaire, "Automatic pedestrian tracking using discrete choice models and image correlation techniques," in Machine Learning for Multimodal Interaction (S. Bengio and H. Bourlard, eds.), vol. 3361 of Lecture Notes in Computer Science, pp. 341–348, Springer, 2005.

[176] J. Moreau, "Fonctions convexes duales et points proximaux dans un espace hilbertien," Bulletin de la Société Mathématique de France, vol. 255, pp. 2897–2899, 1962.

[177] P. Combettes and V. Wajs, "Signal Recovery by Proximal Forward-Backward Splitting," Journal on Multiscale Modeling and Simulation, vol. 4, no. 4, p. 1168, 2006.


Alexandre Alahi
2 ch. de la gravière, 1008 Prilly, Switzerland
(+41) 76 539 34 21

OBJECTIVE My long-term goal is to become a professor with entrepreneurial skills

EDUCATION Ecole Polytechnique Fédérale de Lausanne, Switzerland

- PhD candidate in the Signal Processing Institute, May 06 - present
- MS Electrical and Computer Engineering, Oct 00 - Apr 06

Carnegie Mellon University, Pittsburgh, PA, USA

Electrical and Computer Engineering Exchange Program

Aug 02 - May 03

PROFESSIONAL EXPERIENCE

VisioSafe, Switzerland Founder & CEO May 09 - present

The 1st Hosted Video Surveillance Service with social capabilities and avant-garde video analytics

Mitsubishi Electric Research Laboratories, Cambridge – MA Sep 05 - Mar 06

Developed scheduling algorithms to control PTZ cameras and measured the performance of PTZ-based video surveillance systems

Logitech, Silicon Valley – CA Feb 04 - Sep 04

Optimized color processing algorithms using parallel programming: SIMD technology from the Pentium 4 (SSE2) and cache memory optimization; increased the resolution of their webcams

Cyberonix, Silicon Valley – CA Implemented pattern recognition algorithms Summer 03

Implemented US license plate recognition from videos

Ecole Polytechnique Fédérale de Lausanne, Switzerland Teaching, Department of Electrical Engineering

May 06 - present

- Taught half of the course "Multidimensional Signal Processing" (Master-level)
- Taught the full course "Advanced Image Processing and Analysis" (PhD-level)

Teaching Assistant, Department of Electrical Engineering

- "Introduction to Information theory and communication" (Bachelor-level)
- "Multidimensional Signal Processing" (Master-level)
- "Advanced Lab on Signal Processing" with OpenCV (Master-level)
- "Advanced Image Processing and Analysis" (PhD-level)

Teaching Assistant, Department of Computer Science

Nov 03 - Feb 04

UNIX / C / Perl (Bachelor-level).

ACADEMIC PROJECTS

Computer Vision, May 06 - present: Thesis on "Vision-based Scene Understanding With Sparsity Promoting Priors"

Signal Processing / Computer Vision / Pattern Recognition: Object detection/tracking/recognition, image retargeting, gesture recognition, content-based music recommendation.

COMPUTER SKILLS

C, C++, Assembly, SIMD technology (MMX, SSE, SSE2), Java, C#, Pascal, and Perl.

LANGUAGES Bilingual: French, English. Intermediate: Spanish

ACTIVITIES Music, Sports (golf, tennis, basketball)

AWARDS Venture Leaders 2010 winner Summer 10

A jury selected 20 Swiss entrepreneurs for a business training program in Boston

Swissnex Boston Global Pitchfest winner

An audience of investors and entrepreneurs selected the best startup pitch among the 20 Swiss Venture Leaders 2010

ICDSC challenge prize Summer 09

A jury selected our work as the best algorithm to detect and track people

MEDIA Research featured on TV: Euronews, TSR | on Radio: RSR

in Press: 20minutes, Le Temps | on Web: Wall Street Journal, acm.org, Engadget 2011



Personal Publications

[1] A. Alahi, L. Jacques, Y. Boursier, and P. Vandergheynst, "Sparsity-Driven People Localization with a Heterogeneous Network of Cameras," Journal of Mathematical Imaging and Vision, 2011.

[2] A. Alahi, P. Vandergheynst, M. Bierlaire, and M. Kunt, "Cascade of Descriptors to Detect and Track Objects Across Any Network of Cameras," Journal on Computer Vision and Image Understanding, 2010.

[3] A. Alahi, M. Golbabaee, and P. Vandergheynst, "Real-Time Sparsity Driven People Behavior Analysis," submitted to International Journal on Computer Vision, 2011.

[4] A. Alahi, M. Bierlaire, and P. Vandergheynst, "Collaborative Network of Fixed and Mobile Cameras for Automatic Pedestrians Detection," submitted to Journal on Transportation Research Part C, 2011.

[5] A. Alahi, L. Bagnato, D. Matti, and P. Vandergheynst, "Total Variation Disparity-based Foreground Silhouette Extraction," in preparation for submission to IEEE Transactions on Image Processing, 2011.

[6] A. Alahi, Y. Boursier, L. Jacques, and P. Vandergheynst, "A Sparsity Constrained Inverse Problem to Locate People in a Network of Cameras," 16th International Conference on Digital Signal Processing, Santorini, Greece, 2009.

[7] A. Alahi, Y. Boursier, L. Jacques, and P. Vandergheynst, "Sport Players Detection and Tracking with a Mixed Network of Planar and Omnidirectional Cameras," Third ACM/IEEE International Conference on Distributed Smart Cameras, Challenge Prize Winner, Como, 2009.

[8] A. Alahi, L. Jacques, Y. Boursier, and P. Vandergheynst, "Sparsity-Driven People Localization Algorithm: Evaluation in Crowded Scenes Environments," Proc. IEEE Int'l Workshop on Performance Evaluation of Tracking and Surveillance, Snowbird, 2009.

[9] A. Alahi, D. Marimon, M. Bierlaire, and M. Kunt, "A Master-Slave Approach for Object Detection and Matching with Fixed and Mobile Cameras," 15th IEEE International Conference on Image Processing, San Diego, 2008.

[10] A. Alahi, M. Bierlaire, and M. Kunt, "Object Detection and Matching with Mobile Cameras Collaborating with Fixed Cameras," 10th European Conference on Computer Vision, Marseilles, 2008.



[11] A. Azarbayejani, A. Alahi, and M. Erdem, "System and method for measuring performances of surveillance systems," US Patent App. 2008/0126031.

[12] A. Alahi, M. Golbabaee, and P. Vandergheynst, "Method and system for automatic objects localization," US Patent App. 12/779,547.