STUDY OF PEDESTRIAN DETECTION IN VIDEO STREAMS: Feature Extraction and Machine Learning Methods

Master's Thesis, academic year 2015

PRESENTED ON 7 SEPTEMBER 2015

TO THE FACULTY OF THE INTERNATIONAL MASTER PROGRAMME IN COMPUTER SCIENCE AND EMBEDDED SYSTEMS

École Supérieure d'Ingénieurs en Électronique et Électrotechnique

TO COMPLETE THE E4 ACADEMIC YEAR

BY

ZIHAO WANG

accepted on the recommendation of the jury:

Prof. Xavier HILAIRE, thesis director

Prof. __________________ , reviewer

Paris, France, 2015

Keep skeptical, think critically, and assume nothing.

To the memory of my grandfather

Master's Thesis, academic year 2015, ESIEE Paris: Feature Extraction and Machine Learning Methods

Acknowledgements

First and foremost, I would like to thank my advisor, Xavier HILAIRE. Throughout the study and research period, Professor Xavier inspired me a great deal. I appreciate his contribution and teaching, his time, ideas and energy. As head of the IMC department, Professor Xavier also gave us a lot of help with our studies, and I am very grateful for his teaching and guidance over the past year.

I would also like to thank all the professors and teachers who taught me during the past year. Thanks to their patient teaching, I was able to complete this thesis with fluency and enthusiasm.

Thanks to my classmates FINOCHIETTO Jose Mariano, SANYA, Waheed Muhammad, Zeina TLAISS and David OSICKA; without them this two-month project would have been lonely and slow. Thanks to my roommate Zhifei XU, who made my studies in France varied and colourful. Thanks to my friends Jinru ZHANG, Ce ZHANG, Dazhen SHI and Shuang LIU for their help in my daily life and studies.

Thanks to my grandparents, who brought me up and educated me. Thanks to my grandfather, who was an electrical engineer; he taught me to be an honest man and passed on to me his attitude towards knowledge. Thanks to my parents and family, who support my studies abroad.

It is still difficult for me to choose between an academic career and an industrial career. I like exploring frontier areas and enjoy the feeling of solving problems. On the other hand, I cannot yet adapt to a simple life, which may be the most fatal point for research, and I cannot fully live up to the motto on the title page, although I use it as a caution to myself. I admire those who hold a pointer at the blackboard. This is the influence of my family: my grandfather was a teacher before entering industry, and my mother still works in the theatre.

I am afraid that I will not have the chance to write an academic thesis during my master's studies, because I may choose to work in industry for my E5 year internship. So I wrote this thesis to conclude my 2014-2015 school year. In China, a thesis defence is necessary to obtain a master's degree; as our school is a Grande École, the six-month research thesis is replaced by a six-month internship. To share the same experience as my Chinese colleagues and to commemorate my academic career, I carried out this project to thesis standard.

One year of study in France made me more mature, and one year of study at ESIEE made me more independent in life. I am thankful for this lofty academic environment and much honoured to have had the chance to study here.

Zihao WANG

Paris, 23 June 2015

Abstract

Pedestrian detection is a key research topic in Computer Vision and Artificial Intelligence. With the application of intelligent technology to surveillance, driver assistance systems, information retrieval and business, people detection and counting has become a very active research task.

Since the main goal of this work is practical application, we focus on implementing the algorithms and verifying their precision. The main tasks of this thesis can be summarised as follows:

Digital Video Signal Processing

Image noise and noise filtering, image scale transforms, and morphological processing. Video signal processing is in practice image signal processing. Because the feature extraction algorithms are sensitive to noise, removing noise from the image is necessary before image analysis.

Pedestrian Detection

We tested earlier methods and analysed MOG motion detection, the SIFT algorithm and the HOG method. We analysed the different characteristics of these detection methods and chose the most suitable one for our project.

Features Extraction

After detecting people, we tested different feature detection methods: Scale-Invariant Feature Transform (SIFT), Harris keypoint detection, FAST keypoint detection, SUSAN keypoint detection, and body segmentation based on the Distance Transform (DT). We compared the feature extraction algorithms and finally chose the most efficient one.

Pedestrian Recognition Based on Machine Learning

SVR (Support Vector Regression) uses support vectors to solve the regression problem. The number of persons in the image is regarded as the label of the regression problem, and the features extracted from the image form the input vector. All inputs together make up the feature matrix, and all labels make up the label vector.

Keywords

Pedestrian Detection, HOG, SIFT, Image Segmentation, SVR


Contents

Acknowledgements
Abstract
Keywords
Introduction
  1.1 People Counting and Pedestrian Detection
    1.1.1 The State-of-the-art
    1.1.2 The research background
Image Process and Background Modelling
  2.1 Image Process Methods
  2.2 Background subtraction
    2.2.1 Naive frame subtraction
    2.2.2 Adaptive Gaussian Mixture Model
Features Extraction
  3.1 Pedestrian features analysis
  3.2 Keypoints detection
    3.2.1 Scale-invariant feature transform
    3.2.2 Harris corner detection
  3.3 Histograms of Oriented Gradients
    3.3.1 HOG method
  3.4 Histograms of Comprehensive Describe Vectors
    3.4.1 Describer of Pedestrian by Distance Transform
    3.4.2 Comprehensive Describe Vectors
People Counting by Machine Learning Approach
  4.1 Support Vector Regression
    4.1.1 Linear Regression
  4.2 Image Data process and Training of the Machine
    4.2.1 Image Data process
    4.2.2 10-fold cross validation
  4.3 Evaluation of the performance
Conclusion
  5.1 Achieved results
  5.2 Several other approach and extra work
References

Introduction

"To know what is where by looking" [1]: David Marr used this plain phrase to explain the principle of vision. In 1965, Roberts [2] explained how a computer can perceive the 3D world, and Guzman used a combination of topological and geometric rules to decode the projections of complex three-dimensional scenes [3]. The term "Computer Vision" appeared in the proceedings edited by Winston. In 1982, Ballard and Brown published the first monograph describing the theory of computer vision.

1.1 People Counting and Pedestrian Detection

Pedestrian detection is a very active research topic in computer science. One of its most advanced application areas is the intelligent car. People counting is becoming more useful in modern society: especially with the development of Big Data, people counting has a strong influence in finance, market analysis and sales prediction.

1.1.1 The State-of-the-art

Google's autonomous vehicle obtained its first licence in 2012, and the performance of the system is very satisfactory. However, the autonomous car is very expensive, which has led Google to look for lower-cost methods to replace the costly sensor system. At the International Conference on Robotics and Automation 2015, Google announced that it is trying to use machine learning methods and cameras to replace laser radar. The accuracy of modern neural networks can exceed 99.5%, but detection is not yet real-time.

Figure 1-1 Google's deep learning method for people detection

Honda (本田技研工業株式会社) developed an infrared person detection system. This new system is named "Honda Sensing". It mainly consists of two sensors: a millimetre-wave radar and a camera located at the front of the car. Universities such as CMU, MIT and Tsinghua have also contributed a lot to this area.

Figure 1-2 Honda automatic steering anti-pedestrian collision system

Today, people detection methods can be roughly divided into two groups: background subtraction and statistical learning.

1.1.2 The research background

There are various approaches to distinguishing pedestrians; they can be roughly summarised as in Table 1-1.

Table 1-1 [4] Comparison of Evaluated Pedestrian Detectors

Pedestrian detectors are quite diverse, but they are normally based on a sliding window. From Table 1-1 we can see that all seven feature types are used by different detectors and that gradient histograms are clearly the most commonly used. Normally, more features help a detector, but more features do not necessarily mean more accuracy: some features can degrade the performance of a detector. We discuss this problem in the following chapters.

AdaBoost and SVM are the two most common methods for automatic recognition, and in terms of log-average miss rate on the INRIA set AdaBoost clearly outperforms the SVM methods.


Image Process and Background Modelling

In signal processing, it is often desirable to perform some kind of noise reduction on an image or signal. The median filter is a nonlinear digital filtering technique that is often used to remove noise.

Figure 2-1 Comparison of the original picture and the picture with salt-and-pepper noise

2.1 Image Process Methods

With a median filter, the noisy points are removed and the image also looks smoother. The median filter is a very important step before edge detection, because noisy points would otherwise be amplified by the edge detection algorithm. The median filter is non-linear. This means that for two images A(x) and B(x), in general:

$$\operatorname{median}[A(x) + B(x)] \neq \operatorname{median}[A(x)] + \operatorname{median}[B(x)] \tag{2-1}$$

In the example of a 3×3 neighbourhood whose central pixel value of 150 is rather unrepresentative of the surrounding pixels, the central value is replaced with the median value, 124. A 3×3 square neighbourhood is used here; larger neighbourhoods produce more severe smoothing [16].

Figure 2-2 Comparison of the picture with salt-and-pepper noise and the picture after median filtering
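The 3×3 median replacement described above can be sketched in a few lines of standard C++. This is a minimal illustration, not the code used in the project: the function names `median9` and `medianFilter3x3` are hypothetical, the image is assumed to be a row-major vector of grey values, and border pixels are simply copied.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Median of a 3x3 neighbourhood given as nine grey values.
int median9(std::array<int, 9> values) {
    // nth_element places the 5th smallest value at index 4.
    std::nth_element(values.begin(), values.begin() + 4, values.end());
    return values[4];
}

// 3x3 median filter on a row-major grey image (hypothetical layout).
// Border pixels are copied unchanged because they lack a full neighbourhood.
std::vector<int> medianFilter3x3(const std::vector<int>& img, int rows, int cols) {
    std::vector<int> out(img);
    for (int r = 1; r + 1 < rows; ++r) {
        for (int c = 1; c + 1 < cols; ++c) {
            std::array<int, 9> window{};
            int k = 0;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc)
                    window[k++] = img[(r + dr) * cols + (c + dc)];
            out[r * cols + c] = median9(window);
        }
    }
    return out;
}
```

As an illustration with assumed values, a neighbourhood containing 124, 126, 127, 120, 150, 125, 115, 119 and 123 has a central value of 150 and a median of 124, which matches the replacement described in the text.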


2.2 Background subtraction

2.2.1 Naive frame subtraction

The naive frame subtraction method is the simplest method for background subtraction. The main idea is to obtain a background frame that contains no moving objects and then subtract it from each object frame, so that the objects are extracted.

The main problem of the naive method is that it is strongly affected by lighting changes. Moreover, once an object enters the scene and stays there, the naive method will classify it wrongly.

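    // Needs OpenCV core and <cstdlib>. mf is the background frame and m2 the
    // current frame, both assumed to be cv::Mat_<cv::Vec3b> in BGR order.
    // m2 is overwritten with a mask: black = background, white = object.
    int threshold = 40;
    for (int row = 0; row < frame.rows; row++)
    {
        for (int col = 0; col < frame.cols; col++)
        {
            // Per-channel difference between the current frame and the background.
            int b = (int)m2(row, col)[0] - (int)mf(row, col)[0];
            int g = (int)m2(row, col)[1] - (int)mf(row, col)[1];
            int r = (int)m2(row, col)[2] - (int)mf(row, col)[2];
            // Background if the three channel differences agree with each
            // other to within the threshold.
            if (abs(b - r) < threshold && abs(g - r) < threshold && abs(b - g) < threshold)
            {
                m2(row, col) = Vec3b(0, 0, 0);
            }
            else
            {
                m2(row, col) = Vec3b(255, 255, 255);
            }
        }
    }

Algorithm 2-1 Naive method for background subtraction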

Figure 2-3 The background picture (ignoring the small crowd)

Figure 2-4 The object picture (threshold = 20 and threshold = 40)

Figure 2-4 shows the influence of different thresholds on the RGB values. We tested different values: the threshold controls the result well, and good results are normally obtained over a range of values rather than at one specific value.

2.2.2 Adaptive Gaussian Mixture Model

Friedman and Russell proposed modelling each background pixel with a mixture of Gaussians (MOG), fitted essentially with the EM algorithm. The MOG method is an improvement on the naive subtraction method: each pixel is compared with each Gaussian and classified according to the Gaussian it matches, to determine whether it is an object pixel.

Considering a pixel, the probability of observing the pixel value $x_t$ at time $t$ is given by the formula below [5]:

$$p(x_t) = \sum_{i=1}^{k} w_{i,t}\, \eta(x_t, \mu_{i,t}, \Sigma_{i,t}) \tag{2-2}$$

Here $k$ is the number of distributions and $w_{i,t}$ is the weight of the $i$-th Gaussian at time $t$, which has mean $\mu_{i,t}$ and covariance matrix $\Sigma_{i,t}$. $\eta$ is the Gaussian probability density function:

$$\eta(x_t, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x_t - \mu)^{T}\Sigma^{-1}(x_t - \mu)\right) \tag{2-3}$$

$$\Sigma_{i,t} = \sigma_{i,t}^{2} I \tag{2-4}$$

Formula 2-4 is the covariance matrix.
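To make the mixture probability above concrete, the following sketch evaluates it for one pixel value under the isotropic covariance σ²I. It is only an illustration in standard C++: the `Component` struct and the function names are hypothetical, and the online update of the weights, means and variances is not shown.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One Gaussian component with isotropic covariance sigma^2 * I.
struct Component {
    double weight;            // w_{i,t}
    std::vector<double> mean; // mu_{i,t}
    double sigma;             // sigma_{i,t}
};

// Gaussian density eta(x; mu, sigma^2 I) for an n-dimensional pixel value.
double gaussianDensity(const std::vector<double>& x, const Component& c) {
    const double kPi = 3.14159265358979323846;
    const std::size_t n = x.size();
    double squaredDistance = 0.0;
    for (std::size_t d = 0; d < n; ++d) {
        const double diff = x[d] - c.mean[d];
        squaredDistance += diff * diff;
    }
    const double variance = c.sigma * c.sigma;
    // For Sigma = sigma^2 I the determinant term |Sigma|^(1/2) equals sigma^n.
    const double norm = std::pow(2.0 * kPi, static_cast<double>(n) / 2.0) *
                        std::pow(c.sigma, static_cast<double>(n));
    return std::exp(-0.5 * squaredDistance / variance) / norm;
}

// p(x_t): weighted sum of the k component densities.
double mixtureProbability(const std::vector<double>& x,
                          const std::vector<Component>& mixture) {
    double p = 0.0;
    for (const Component& c : mixture) p += c.weight * gaussianDensity(x, c);
    return p;
}
```

For a grey-level pixel the vectors have length 1; for a colour pixel they would have length 3.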

P. KaewTraKulPong and R. Bowden proposed a real-time improvement in their paper [6]. Their algorithm runs within the framework of the real-time robust tracker proposed by Grimson, and the MOG algorithm in OpenCV is based on this paper. KaewTraKulPong and Bowden improved the online EM for the MOG: they begin estimating the Gaussian mixture model with the expected sufficient statistics update equations and then switch to the L-recent window version once the first L samples have been processed [6].

The online EM update equations based on the L-recent window are given in [6].

(a) Original frame. (b) Detection area.

(c) Object detected without shadow removal. (d) Object detected after processing.

Figure 2-5 Object detection with MOG.

Figure 2-5 shows MOG used to detect people in a corridor of ESIEE Paris. Figure 2-5 (a) is the frame obtained directly from the camera, 2-5 (b) is the monitored area of the frame, and 2-5 (c) and 2-5 (d) show the moving object extracted from the specified area. The shape of the person is well recognised. The advantage of MOG is that, compared with the naive method, it is more robust to changes in the environment.

Features Extraction

3.1 Pedestrian features analysis

Historically, the pedestrian detection methods for static image processing can be summarised as follows:

1. Gavrila used a global model method: based on contour matching at different levels, he constructed almost 2500 contours to match people. The problem with this method is that the number of contours is very large, which reduces the performance of the detector.

2. Broggi used local models to match pedestrians: binary models of different sizes represent the head and shoulders of people; the edge image detected from the original image is then compared with the binary models to detect the person.

3. Lipton used an optical flow method: computing the residual optical flow in the motion area.

4. Heisele used a motion detection method: detecting the movement features of people's legs.

5. Wohler used a neural network method: constructing an adaptive time-delay neural network to judge whether the movement is caused by people.

Today, the methods can generally be divided into two branches:

1. Based on background modelling: extract the foreground, extract the moving objects, then extract the features of the objects and make a decision.

2. Based on statistical learning: train a pedestrian classifier on a large number of samples.

Dalal [7] proposed Histograms of Oriented Gradients (HOG) for human detection at CVPR 2005. His method is based on the gradients of local image areas and uses histograms to describe the local gradients. The feature vectors are constructed from all the histograms. The steps of HOG are summarised in Figure 3-1 [7].

Figure 3-1. An overview of our feature extraction and object detection chain. The detector window is tiled with a grid of overlapping blocks in which Histogram of Oriented Gradient feature vectors are extracted. The combined vectors are fed to a linear SVM for object/non-object classification. The detection window is scanned across the image at all positions and scales, and conventional non-maximum suppression is run on the output pyramid to detect object instances, but this paper concentrates on the feature extraction process. [7]

Professor Xavier HILAIRE and I had several different ideas for people detection. Over the project period the method went through three major revisions, motivated by implementation difficulty, time constraints and the quality of the results.

1. Extract the features of the different parts of the body, screen the keypoints and extract the keypoints of the head, hands and body. Count the keypoints and calculate the number of persons.

2. Extract the features of the different parts of the body and use a description tree to match the keypoints of the body. Use a support vector machine to classify whether the matched tree is a person or not.

3. Perform a distance transform of the MOG image. Calculate the distance transform of the bodies, use a vector to describe the distance transform of the image, and then feed the vector to the SVR.

Method 1 is simpler to implement. Method 2 is complex because of the difficulty of constructing the description tree and the problem of overlapping bodies in the image.

Sections 3.2 and 3.3 introduce methods 1 and 2; section 3.4 and the remaining chapters discuss method 3.

Finally, Professor Xavier proposed method 3, and this is the method we verified in the end.

3.2 Keypoints detection

3.2.1 Scale-invariant feature transform

SIFT (Scale-Invariant Feature Transform) is a computer vision algorithm whose main use is to detect and describe local features. The algorithm was proposed by David Lowe [7] in 1999 and was patented by the University of British Columbia.

The SIFT process is very complex. Given the length of this document, and since SIFT was in the end excluded from our project, we only introduce SIFT here and describe its effect in our project. SIFT feature extraction can be described by the following steps:

- Build the scale space and detect the extreme points, which gives scale invariance.

- Calculate the orientations of the feature points.

- Compute the keypoint descriptors.

- Filter the feature points and obtain their accurate positions.

Algorithm 3-1 Scale-invariant feature transform

(a) Detection area. (b) Object detected without shadow removal.

(c) Object detected after processing. (d) SIFT features extracted.

Figure 3-2 SIFT features extraction.

3.2.2 Harris corner detection

Chris Harris and Mike Stephens [9] proposed the Harris corner detection method:

1. Full directional response

$$E_{x,y} = \sum_{u,v} w_{u,v}\,[I_{x+u,y+v} - I_{u,v}]^2 = \sum_{u,v} w_{u,v}\,[xX + yY + O(x^2, y^2)]^2 \tag{3-1}$$

The gradients are given by

$$X = I \otimes (-1, 0, 1) \approx \frac{\partial I}{\partial x} \tag{3-2}$$

$$Y = I \otimes (-1, 0, 1)^{T} \approx \frac{\partial I}{\partial y} \tag{3-3}$$

Thus,

$$E(x, y) = Ax^2 + 2Cxy + By^2, \qquad A = X^2 \otimes w, \quad B = Y^2 \otimes w, \quad C = (XY) \otimes w \tag{3-4}$$

2. Gaussian window

$$w_{u,v} = \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right) \tag{3-5}$$

3. Reformulate the corner measure to make use of the variation of E with the direction of shift [9]

$$E(x, y) = (x, y)\, M\, (x, y)^{T}, \qquad M = \begin{bmatrix} A & C \\ C & B \end{bmatrix} \tag{3-6}$$

Let α and β be the eigenvalues of M. Three cases are considered:

1. Both α and β are small: the image inside the Gaussian window is flat.

2. One of α and β is large and the other is small: an edge is detected inside the Gaussian window.

3. Both α and β are large: a corner is detected inside the Gaussian window.

Figure 3-3. Auto-correlation principal curvature space: heavy lines give corner/edge/flat classification, fine lines are equi-response contours. [9]
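The three eigenvalue cases can be sketched as follows. This is an illustrative example in standard C++, not the detector used in the project: the function names are hypothetical, and the single `threshold` separating "small" from "large" eigenvalues is an assumption, since the text does not give a numerical criterion.

```cpp
#include <cmath>
#include <string>
#include <utility>

// Eigenvalues (first >= second) of the symmetric matrix M = [A C; C B].
std::pair<double, double> eigenvalues(double A, double B, double C) {
    const double mean = 0.5 * (A + B);
    const double radius = std::sqrt(0.25 * (A - B) * (A - B) + C * C);
    return {mean + radius, mean - radius};
}

// Classify a window from A, B and C. "threshold" is a hypothetical cut-off
// between small and large eigenvalues.
std::string classifyWindow(double A, double B, double C, double threshold) {
    const std::pair<double, double> ev = eigenvalues(A, B, C);
    const double alpha = ev.first;  // larger eigenvalue
    const double beta = ev.second;  // smaller eigenvalue
    if (alpha < threshold) return "flat";   // both eigenvalues are small
    if (beta < threshold) return "edge";    // one large, one small
    return "corner";                        // both eigenvalues are large
}
```

In practice Harris and Stephens use a response function of the determinant and trace of M instead of computing the eigenvalues explicitly; the explicit form is kept here because it mirrors the three cases in the text.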

3.3 Histograms of Oriented Gradients

HOG (Histogram of Oriented Gradients) is a descriptor used in computer vision for object detection. The basic idea of HOG is to compute the oriented gradients of local areas of the image and use them to construct the feature space. HOG combined with a support vector machine is widely used for pattern recognition. The advantages of HOG are its geometric and photometric invariance and its insensitivity to small changes in pose: since pedestrians are mostly upright, movements of the arms and feet have little influence on the detection result.

3.3.1 HOG method

Normalise the gamma and colour space. This step clearly reduces the influence of lighting.

$$I(x, y) = I(x, y)^{\gamma}, \qquad \gamma = \frac{1}{2} \tag{3-7}$$

Calculate the gradient of the image with forward differences:

$$G_x(x, y) = I(x+1, y) - I(x, y), \qquad G_y(x, y) = I(x, y+1) - I(x, y)$$

The gradient magnitude and orientation at position (x, y) are then

$$G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2} \tag{3-8}$$

$$\alpha(x, y) = \arctan\!\left(\frac{G_y(x, y)}{G_x(x, y)}\right)$$

Divide the image into parts; each part is called a cell.

Figure 3-4 One image divided into 16 cells

For each cell we build a corresponding histogram. As the figure shows, the cell is divided into 8 directions, which means that the histogram of each cell has 8 bins.

Figure 3-5 One cell divided into 8 directions. The amplitude is the amount that contributes to the corresponding direction.

Figure 3-6 The amplitude in the different directions.

Figure 3-7 The histograms of all the cells constitute the cell space of the whole image.
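The per-cell histogram can be sketched as below. This is a simplified illustration in standard C++ and not the OpenCV HOG descriptor: the function name `cellHistogram` is hypothetical, the cell is assumed to be a small grid of grey values, gamma normalisation and block normalisation are omitted, and `atan2` is used in place of arctan(Gy/Gx) so that the orientation covers the full 0 to 360 degree range of the 8 directions.

```cpp
#include <array>
#include <cmath>
#include <vector>

// Orientation histogram of one cell: 8 bins of 45 degrees over [0, 360),
// each pixel voting with its gradient magnitude.
std::array<double, 8> cellHistogram(const std::vector<std::vector<double>>& cell) {
    const double kPi = 3.14159265358979323846;
    std::array<double, 8> hist{};
    const int rows = static_cast<int>(cell.size());
    const int cols = rows > 0 ? static_cast<int>(cell[0].size()) : 0;
    // Forward differences need a right and a lower neighbour,
    // so the last row and the last column are skipped.
    for (int y = 0; y + 1 < rows; ++y) {
        for (int x = 0; x + 1 < cols; ++x) {
            const double gx = cell[y][x + 1] - cell[y][x];
            const double gy = cell[y + 1][x] - cell[y][x];
            const double magnitude = std::sqrt(gx * gx + gy * gy);
            if (magnitude == 0.0) continue; // flat pixel: no orientation
            double angle = std::atan2(gy, gx) * 180.0 / kPi;
            if (angle < 0.0) angle += 360.0;
            int bin = static_cast<int>(angle / 45.0);
            if (bin > 7) bin = 7; // guard in case rounding yields exactly 360
            hist[bin] += magnitude;
        }
    }
    return hist;
}
```

Concatenating the histograms of all 16 cells would then give the 4 x 4 cell space of the image described above.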

3.4 Histograms of Comprehensive Describe Vectors

3.4.1 Describer of Pedestrian by Distance Transform

The distance transform is commonly used in binary image processing. Figure 3-8 shows the process of the distance transform.

Figure 3-8 (a) The original image; (b) forward raster sweep; (c) backward raster sweep; (d) final distance transform image.
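The forward and backward raster sweeps can be sketched as a two-pass algorithm. The example below is a minimal illustration in standard C++ that uses the city-block metric for simplicity; this is an assumption, since the radii reported later in this chapter suggest that the project used Euclidean distances. The names `Grid` and `distanceTransform` are hypothetical.

```cpp
#include <algorithm>
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Two-pass city-block distance transform.
// Input: binary image, non-zero = foreground.
// Output: distance of each pixel to the nearest background pixel
// (0 on the background). Pixels outside the image are not treated as
// background.
Grid distanceTransform(const Grid& binary) {
    const int rows = static_cast<int>(binary.size());
    const int cols = rows > 0 ? static_cast<int>(binary[0].size()) : 0;
    const int kInf = rows + cols; // exceeds any city-block distance in the image
    Grid dist(rows, std::vector<int>(cols, 0));
    // Forward raster sweep: propagate distances from the top and the left.
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            if (binary[y][x] == 0) continue; // background stays 0
            int best = kInf;
            if (y > 0) best = std::min(best, dist[y - 1][x] + 1);
            if (x > 0) best = std::min(best, dist[y][x - 1] + 1);
            dist[y][x] = best;
        }
    }
    // Backward raster sweep: propagate distances from the bottom and the right.
    for (int y = rows - 1; y >= 0; --y) {
        for (int x = cols - 1; x >= 0; --x) {
            if (y + 1 < rows) dist[y][x] = std::min(dist[y][x], dist[y + 1][x] + 1);
            if (x + 1 < cols) dist[y][x] = std::min(dist[y][x], dist[y][x + 1] + 1);
        }
    }
    return dist;
}
```

After the two sweeps, the largest value in the result marks the centre of the largest inscribed shape, which is what the disk extraction in this section relies on.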

Because the head, body, feet and arms have different geometric features in the image, Professor Xavier proposed counting people using the distance transform of the image. Algorithm 3-2 is the detection by the DT method, and Figure 3-9 shows the algorithm at work.

- 1. Compute the distance transform of the binary image.

- 1.1 Find the largest disk; process it if it is large enough.

- 1.2 Put the distance value (radius) of the disk into the descriptor vector.

- 1.3 Remove the disk from the image, then go to 2.

- 2. Repeat 1 until there is no disk larger than the threshold.

Algorithm 3-2 Detection by the DT method
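The loop of Algorithm 3-2 can be sketched as follows. This is an illustrative version in standard C++ under stated assumptions: the distance to the background is computed by brute force instead of a raster-sweep distance transform, the disk is removed by clearing every pixel closer to the centre than the radius, and the names `Image`, `distanceToBackground` and `extractDisks` are hypothetical.

```cpp
#include <cmath>
#include <vector>

using Image = std::vector<std::vector<int>>; // non-zero = foreground

// Euclidean distance from pixel (y, x) to the nearest background pixel,
// computed by brute force (for illustration only). Background pixels, and
// images without any background pixel, give 0.
double distanceToBackground(const Image& img, int y, int x) {
    if (img[y][x] == 0) return 0.0;
    double best = -1.0;
    for (int by = 0; by < static_cast<int>(img.size()); ++by) {
        for (int bx = 0; bx < static_cast<int>(img[by].size()); ++bx) {
            if (img[by][bx] != 0) continue;
            const double d = std::hypot(static_cast<double>(by - y),
                                        static_cast<double>(bx - x));
            if (best < 0.0 || d < best) best = d;
        }
    }
    return best < 0.0 ? 0.0 : best;
}

// Repeatedly extract the largest disk until none reaches minRadius.
// Returns the radii in extraction order, i.e. the descriptor vector.
std::vector<double> extractDisks(Image img, double minRadius) {
    std::vector<double> radii;
    for (;;) {
        // Step 1.1: find the pixel with the largest distance value.
        double bestRadius = 0.0;
        int cy = -1, cx = -1;
        for (int y = 0; y < static_cast<int>(img.size()); ++y) {
            for (int x = 0; x < static_cast<int>(img[y].size()); ++x) {
                const double r = distanceToBackground(img, y, x);
                if (r > bestRadius) { bestRadius = r; cy = y; cx = x; }
            }
        }
        // Step 2: stop when no disk is large enough.
        if (cy < 0 || bestRadius < minRadius) break;
        // Step 1.2: store the radius in the descriptor vector.
        radii.push_back(bestRadius);
        // Step 1.3: remove the disk from the image.
        for (int y = 0; y < static_cast<int>(img.size()); ++y)
            for (int x = 0; x < static_cast<int>(img[y].size()); ++x)
                if (std::hypot(static_cast<double>(y - cy),
                               static_cast<double>(x - cx)) < bestRadius)
                    img[y][x] = 0;
    }
    return radii;
}
```

Because removing a disk can only lower the remaining distance values, the radii come out in non-increasing order, with large disks (bodies) first and small disks (limbs, noise) last.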


(a) Original image. (b) Distance transform.

(c) First body detected. (d) All bodies detected and extracted.

(e) Remove the bodies and then do algorithm again. (f) Heads detection and legs detection.

Figure 3-9 The principle of the DT process in people detection.

3.4.2 Comprehensive Describe Vectors

After all the disks that satisfy the extraction threshold have been extracted, their radii form the distance transform vector, and from this vector we obtain the corresponding DT feature histogram of the image, as shown in Figures 3-10 and 3-11.

(a) (b)

Figure 3-10 (a) The original image; (b) the image after extraction of the DT feature vector

(21.6333 17.4642 15.1327 15.0000 9.0554 8.5440 7.0711 7.0711 7.0000 7.0000 7.0000 6.0828 6.0000 5.6569 5.3852 5.0990 5.0000 5.0000 5.0000 5.0000 4.4721 4.4721 4.1231 4.1231 4.0000 4.0000 4.0000 3.6056 3.6056 3.6056 3.1623)

Figure 3-11 The DT feature vector and its histogram. The red dimensions give information about the bodies of the persons, the yellow dimensions give information about the arms and heads, and the remaining dimensions contain other information or noise.
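Turning the radius vector into a fixed-length histogram can be sketched as below. The binning rule is an assumption made for illustration: a radius r is counted in bin floor(r), with bins numbered from 1, and radii outside the range are ignored. The exact rule and threshold used in the project are not stated, and the function name `radiusHistogram` is hypothetical.

```cpp
#include <cmath>
#include <vector>

// Bin disk radii into a histogram with numBins entries.
// Assumption: radius r is counted in bin floor(r) (1-based); radii below 1
// or at least numBins + 1 are ignored.
std::vector<int> radiusHistogram(const std::vector<double>& radii, int numBins) {
    std::vector<int> hist(numBins, 0);
    for (double r : radii) {
        const int bin = static_cast<int>(std::floor(r));
        if (bin >= 1 && bin <= numBins) ++hist[bin - 1];
    }
    return hist;
}
```

Applied to the 31 radii listed above with 35 bins, this rule would count, for example, five disks in bin 7 and two disks in bin 15.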


People Counting by Machine Learning Approach

Statistical learning is the most popular method for pedestrian detection, and it is similarly useful for people counting. The application of machine learning to pedestrian detection is in practice the application of a classifier. The classifiers used for people detection include neural networks, support vector machines, AdaBoost [10] and deep learning [11].

Since the 1990s, Support Vector Machines (SVM) have been a major theme in theoretical development and applications. The theory of SVM combines contributions from optimisation theory, statistical learning, kernel theory and algorithmics. [13]

4.1 Support Vector Regression

Support Vector Machines can be applied to regression problems by introducing an ε-insensitive loss function [12].

4.1.1 Linear Regression

The main idea of SVR is to find a function which fits the data with a deviation less than a given quantity ε for every single pair (X_i, Y_i). At the same time, we want the solution to have a minimum norm ‖w‖. This means that SVR does not minimise errors smaller than ε, but only larger errors. [13]
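The statement that errors below ε are not penalised corresponds to the ε-insensitive loss, which can be written as a short function. This is a minimal sketch in standard C++; the function name is hypothetical.

```cpp
#include <cmath>

// Epsilon-insensitive loss: zero while |observed - predicted| <= epsilon,
// and growing linearly with the excess beyond epsilon.
double epsilonInsensitiveLoss(double observed, double predicted, double epsilon) {
    const double error = std::fabs(observed - predicted);
    return error <= epsilon ? 0.0 : error - epsilon;
}
```

For people counting, this means that a prediction of 3.05 for a frame with 3 persons costs nothing when ε = 0.1, while a prediction of 3.5 is penalised by the excess over ε.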

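Given the training set $T = \{(X_1, Y_1), \dots, (X_l, Y_l)\} \in (\mathcal{X} \times \mathcal{Y})^{l}$, with $X_i \in \mathcal{X} = \mathbb{R}^n$, $Y_i \in \mathcal{Y} = \mathbb{R}$, $i = 1, \dots, l$.

Choose suitable ε and C, and find a suitable kernel K(x, x').

The linear SVR problem can be formulated as below,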
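For reference, a standard primal form of the linear ε-SVR problem is sketched here. It is written under the assumption that the penalty term is scaled as C/l, which is consistent with the multiplier interval (0, C/l) used later in this section; the slack variables ξ_i and ξ_i* are the usual notation and do not come from the original text.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \frac{1}{2}\lVert w\rVert^{2} + \frac{C}{l}\sum_{i=1}^{l}\left(\xi_i + \xi_i^{*}\right) \\
\text{s.t.} \quad & \left(\langle w, X_i\rangle + b\right) - Y_i \le \varepsilon + \xi_i, \\
& Y_i - \left(\langle w, X_i\rangle + b\right) \le \varepsilon + \xi_i^{*}, \\
& \xi_i \ge 0,\ \ \xi_i^{*} \ge 0, \qquad i = 1, \dots, l.
\end{aligned}
```

Deviations inside the ε tube need no slack and therefore cost nothing, while larger deviations are paid for linearly through ξ_i or ξ_i*.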

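Find the support vectors based on the KKT conditions [14], which imply that only the support vectors have non-zero Lagrange multipliers.

Choose different kernel functions $K(X_i, X_j)$ to obtain different decision functions.

Construct the decision function:

$$f(x) = \sum_{i=1}^{l} \beta_i K(x_i, x) + \bar{b}$$

To compute $\bar{b}$, choose $\bar{\alpha}_j$ or $\bar{\alpha}_k^{*}$ in the interval $(0, C/l)$. If $\bar{\alpha}_j$ is chosen,

$$\bar{b} = y_j - \sum_{i=1}^{l} \beta_i K(x_i, x_j) + \varepsilon$$

otherwise,

$$\bar{b} = y_k - \sum_{i=1}^{l} \beta_i K(x_i, x_k) - \varepsilon$$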

LIBSVM [15] is a library for Support Vector Machines (SVMs) developed by Chih-Chung Chang and Chih-Jen Lin. We use it as the machine learning tool in our project. Version 3.20 has been released; all the code is open source, and packages exist for both OpenCV and MATLAB.

4.2 Image Data process and Training of the Machine

4.2.1 Image Data process

As proposed in section 3.4.2, each frame is processed into a histogram, and each histogram has a corresponding feature descriptor vector, as shown below. Forms 4-1 to 4-4 are the representations of four different frames with different numbers of persons in the image.

Bin:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Count:   0  0  0  4  1  1  1  1  1  1  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Form 4-1 One of the feature vectors of a one-person frame

Bin:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Count:   0  0  0  9  4  6  4  4  1  1  1  1  2  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Form 4-2 One of the feature vectors of a two-person frame

Bin:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Count:   0  0  0 25  9  7  4  2  2  1  2  1  1  3  0  1  1  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Form 4-3 One of the feature vectors of a three-person frame

Bin:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Count:   0  0  0 37 14  3 11  4  4  2  2  1  1  1  2  1  2  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Form 4-4 One of the feature vectors of a four-person frame

For all the frames with different number of people from 0 to 7, we do this transfer and then we get a feature space of different

number of persons. From figure 4-1 we can find that the feature distribution of different number of persons are different; and the

figure 4-2 given a more clear describtion of features between disk radius 5 to 20 of different person number of frames.


20

Figure 4-1 the distribution of feature space of different number of person. (A) Distribution of one person. (B) Distribution of two person. (C) Distribution of

three person (D) Distribution of four person.

Figure 4-2 the distribution of feature space of different number of person on feature section 5~20.


21

(a) Traning Set. (b) Validation Set.

Figure 4-3 the data set of the feature space. Set (a) is the traning set; Set (b) is the validation set. Almost 5500 frames was setted as traning set, feature vec-

tors’ dimension was 35. Z axis is the volume of correspond disk radius features.

(a) (b)

Figure 4-4 Training set information. (a) Distribution of the features over the dimensions of the training set. (b) Values of all features over all frames.



(a) (b)

Figure 4-5 Validation set information. (a) Distribution of the features over the dimensions of the validation set. (b) Values of all features over all frames.

4.2.2 10-fold cross validation

Figure 4-6 Cross Validation of the algorithm

Here we use 10-fold cross validation to verify the accuracy of the algorithm. We divide the training data set into 10 subsets, use 9 of them for training and the remaining one for testing, and record the resulting accuracy. We repeat this ten times so that each subset serves once as the test set; the average of the ten accuracies is the final validated accuracy of the algorithm.

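% Grid search over C and gamma, scored by 10-fold cross validation (LIBSVM).
% With the '-v' option, svmtrain returns the cross-validation score
% instead of a model.
bestcv = 0;
for log2c = -5:5
    for log2g = -5:5
        % '-v 10' requests 10 folds, matching the procedure described above
        cmd = ['-v 10 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
        cv = svmtrain(label, matrix, cmd);
        if cv >= bestcv
            % keep the best parameter pair found so far
            bestcv = cv;
            bestc = 2^log2c;
            bestg = 2^log2g;
        end
        fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
    end
end

Algorithm 4-1 Grid search over C and gamma with 10-fold cross validation.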
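The fold rotation that LIBSVM performs internally when '-v' is given can be sketched independently of any SVM library. The following is a minimal Python sketch, not the code used for the experiments: the helper names (k_fold_indices, cross_validate) and the toy majority-label model are hypothetical and serve only to show how each subset is held out once and how the scores are averaged.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train_fn, score_fn, k=10, seed=0):
    """Hold out each fold once, train on the rest, and average the scores."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held_out_set]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [features[i] for i in held_out],
                               [labels[i] for i in held_out]))
    return sum(scores) / len(scores)


# Toy stand-ins (hypothetical): the "model" is the most frequent training label.
def train_majority(_features, labels):
    return max(set(labels), key=labels.count)


def accuracy(model, _features, labels):
    return sum(1 for y in labels if y == model) / len(labels)


if __name__ == "__main__":
    toy_labels = [1] * 70 + [2] * 30
    toy_features = [[float(y)] for y in toy_labels]
    print(cross_validate(toy_features, toy_labels, train_majority, accuracy))
```

In a real run, train_fn and score_fn would wrap the SVM training and prediction calls; the rotation logic stays the same.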



4.3 Evaluation of the performance

Figure 4-7 Performance of the SVR regression on the test set.

Figure 4-8 Performance of the SVR regression on the validation set.

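If $\hat{Y}$ is a vector of $n$ predictions, and $Y$ is the vector of observed values corresponding to the inputs to the function which generated the predictions, then the MSE of the predictor can be estimated by

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{Y}_i - Y_i\right)^2 \qquad (4\text{-}5)$$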

The squared correlation coefficient (SCC) is a good indicator of whether the model fits well. On our test set the MSE is 0.117377 and the SCC is 0.946681, which means that predictions and observations are strongly correlated. On the validation set the MSE is 0.47941 and the SCC is 0.835084; this result also indicates that the model works and that the regression fits well. The results of the MATLAB experiment are shown in Form 4-5.

optimization finished, #iter = 104756

nu = 0.392805

obj = -6179.471339, rho = -4.063571

nSV = 3332, nBSV = 1369



Mean squared error = 0.117377 (regression)

Squared correlation coefficient = 0.946681 (regression)

Mean squared error = 0.47941 (regression)

Squared correlation coefficient = 0.835084 (regression)

Form 4-5 Regression results on the test set (first MSE/SCC pair) and on the validation set (second pair).
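The two reported quantities can be recomputed from any pair of prediction and observation vectors. The sketch below is in Python and is an illustration only: mean_squared_error follows equation 4-5, and squared_correlation computes the square of the Pearson correlation coefficient, which we assume is what LIBSVM reports as "Squared correlation coefficient". The function names are hypothetical.

```python
def mean_squared_error(predicted, observed):
    """Average of the squared differences between predictions and observations."""
    n = len(observed)
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / n


def squared_correlation(predicted, observed):
    """Square of the Pearson correlation between predictions and observations.

    Undefined (division by zero) when either vector is constant.
    """
    n = len(observed)
    sum_p = sum(predicted)
    sum_y = sum(observed)
    sum_py = sum(p * y for p, y in zip(predicted, observed))
    sum_pp = sum(p * p for p in predicted)
    sum_yy = sum(y * y for y in observed)
    numerator = (n * sum_py - sum_p * sum_y) ** 2
    denominator = (n * sum_pp - sum_p ** 2) * (n * sum_yy - sum_y ** 2)
    return numerator / denominator


if __name__ == "__main__":
    # Made-up person counts, for illustration only.
    observed = [1, 2, 3, 4]
    predicted = [1.1, 1.9, 3.2, 3.8]
    print(mean_squared_error(predicted, observed))
    print(squared_correlation(predicted, observed))
```

A low MSE together with an SCC close to 1 indicates that the predicted counts both lie close to and vary together with the observed counts.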



Conclusion

5.1 Achieved results

In this document, we proposed a new approach to counting persons. It is based on features of the entire frame: the distance transform is used to describe the features, and the number of persons is estimated from the resulting feature vectors. The method is effective for sparse crowds and for crowds with little occlusion.

During the research and study period with Professor Xavier, we discussed a lot and modified our methods and code many times. While following the main line of our model, I also thought about the industrial application of our project. Unfortunately, the system is still far from real-world application. The results indicate that this approach performs well when there are fewer than four persons per detection window. The first problem is accuracy: as our tests show, it is not yet satisfactory for application. The other problem is the speed and cost of the system.

A complete people counting system must be able to count continuously and to accumulate the count over time. However, the analysis of frame sequences is itself a large academic task and is out of scope for a two-month research project. So far, our project only solves the problem of counting per frame; counting over a whole frame sequence requires further research.

5.2 Other approaches and additional work

The world is diverse, and so are the methods for exploring it.

Considering the accuracy problem of the system, and following an early idea I discussed with Professor Xavier, I spent part of my time studying how to count persons by their heads. This simple idea is easy to implement and works well in most situations, although it is limited by lighting and environment. If we only consider the speed of the system and, more importantly, embedded use in industry, this method would be much better.

For various reasons we abandoned some counting methods without testing them. One method Professor Xavier proposed is to describe and match persons with a tree structure. The idea can be summarized as follows:

Find the center of the object and, starting from the position of the head, grow a tree that describes the object well. Then label the person and learn the tree frame by frame.

This method gives a good description of the person, and we believe that an accurate description can substantially improve counting accuracy. However, building the tree that describes the person can be very complex, and the difficulty is not only a matter of project duration. Considering the continuity of the entire project, I abandoned this method and chose the distance transform feature description instead.

Prof. Xavier and I also discussed the influence of the camera position: the camera is mounted above the crowd, and its viewing direction makes an angle with the ground, as shown below.

Figure 5-1 The position of the camera*

*The figure was drawn by Xavier HILAIRE, from the project instructions.

Prof. Xavier derived a correction for the image based on the angle between the camera and the ground and on the height of the camera. Persons in the image should be rescaled by a specific ratio, whose calculation is given by formula 5-1.

(5-1)

While implementing this idea, I found a key problem: to correct the scale of the persons in a frame, one must first detect the persons to be corrected, which is a paradox, since detection is what we are trying to achieve. The solution is to correct the whole image, based on the derivation by Prof. Xavier. I finally used the “projection transformation” to solve the problem.
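To make the idea of correcting the whole image concrete, the following Python sketch maps image points through a 3x3 projective matrix. It is an illustration under stated assumptions only: the matrix H_EXAMPLE is hypothetical, chosen so that points lower in the image are scaled down more, and it is not derived from formula 5-1 or from the actual camera geometry. The function name apply_homography is also our own.

```python
def apply_homography(H, point):
    """Map an image point (x, y) through a 3x3 projective matrix H."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point is mapped to infinity")
    # Dividing by w is what makes the mapping projective rather than affine.
    return (xh / w, yh / w)


# Hypothetical matrix: the scale factor 1 / (1 + 0.002 * y) shrinks content
# more strongly as the row coordinate y increases.
H_EXAMPLE = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.002, 1.0],
]


def mapped_width(H, x_left, x_right, y):
    """Width of a horizontal segment at row y after applying H."""
    left = apply_homography(H, (x_left, y))
    right = apply_homography(H, (x_right, y))
    return right[0] - left[0]


if __name__ == "__main__":
    # The same 40-pixel-wide segment at the top row and at row 500.
    print(mapped_width(H_EXAMPLE, 100.0, 140.0, 0.0))
    print(mapped_width(H_EXAMPLE, 100.0, 140.0, 500.0))
```

Because every pixel is mapped by the same matrix, no person has to be detected beforehand. The position-dependent scaling also stretches or compresses shapes, which is the distortion mentioned for the rescaled frames.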

(a) (b)

Figure 5-2 (a) The frame before rescaling; (b) the frame after rescaling.

We can see that the people are adjusted to the same scale. The projection transformation solves the scale problem; however, the correction also introduces distortion. For reasons of length, the details of the projection transformation and the derivation of the correction ratio are not presented in this text. Readers interested in the projection transformation can find the mathematical details in reference [17].

With engineering application and user friendliness in mind, I also designed an IDE based on MFC for the system, as shown in Figure 5-3.

Figure 5-3 IDE for the people counting system



References

[1] Marr D. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. New York: Freeman, 1982.

[2] Lawrence G. Roberts. Machine Perception of Three-Dimensional Solids. Massachusetts Institute of Technology, Lincoln Laboratory, 22 May 1963, reissued May 1965.

[3] Jean Ponce, Andrew Zisserman, Martial Hebert. Object Representation in Computer Vision II. ECCV'96 International Workshop, Cambridge, UK, April 1996, pp. 79-80.

[4] Piotr Dollar, Christian Wojek, Bernt Schiele, Pietro Perona. Pedestrian Detection: An Evaluation of the State of the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Thierry Bouwmans, Fida El Baf, Bertrand Vachon. Background Modeling using Mixture of Gaussians for Foreground Detection - A Survey. Recent Patents on Computer Science, Bentham Science Publishers, 2008, 1 (3), pp. 219-237.

[6] P. KaewTraKulPong, R. Bowden. An Improved Adaptive Background Mixture Model for Realtime Tracking with Shadow Detection. Brunel University, Middlesex, UB8 3PH, UK, Sept 2001.

[7] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Computer Science Department, University of British Columbia, Vancouver, B.C., Canada, January 5, 2004.

[8] Navneet Dalal, Bill Triggs. Histograms of Oriented Gradients for Human Detection. INRIA Rhône-Alpes, 655 avenue de l'Europe, Montbonnot 38334, France.

[9] Chris Harris, Mike Stephens. A Combined Corner and Edge Detector. Plessey Research Roke Manor, United Kingdom. © The Plessey Company plc, 1988.

[10] David Gerónimo, Angel D. Sappa, Antonio López, Daniel Ponsa. Pedestrian Detection using AdaBoost Learning of Features and Vehicle Pitch Estimation. Computer Vision Center and Universitat Autònoma de Barcelona, Spain.

[11] Wanli Ouyang, Xiaogang Wang. Joint Deep Learning for Pedestrian Detection. Department of Electronic Engineering, the Chinese University of Hong Kong.

[12] A. J. Smola. Regression Estimation with Support Vector Learning Machines. Master's thesis, Technische Universität München, 1996.

[13] Tarik AL ANI. Machine learning and pattern recognition, Part 2: Classifiers. Département Informatique et Télécommunication, ESIEE-Paris, pp. 38, 75-77, 2013.

[14] Xu Yitian. The application of SVR in the food industry. China Agriculture University, 2004:8.

[15] Chih-Chung Chang, Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. National Taiwan University, Taipei, Taiwan. Initial version: 2001; last updated: March 4, 2013.

[16] R. Fisher, S. Perkins, A. Walker, E. Wolfart. Image Process Learning Resource. 2003. http://homepages.inf.ed.ac.uk/rbf/HIPR2/median.htm

[17] Unknown author. Cameras, Projection, and Image Formation. UW CSE vision faculty. https://courses.cs.washington.edu/courses/cse455/09wi/Lects/lect5.pdf
