
Face Recognition for Annotation in Media Asset Management Systems

Sebastian Gröhn

Master's Thesis in Computing Science

30 ECTS credits

June 1, 2013

Supervisor: Frank Drewes
Examiner: Fredrik Georgsson

Umeå University

Department of Computing Science, SE-901 87 Umeå, Sweden


Abstract

The goal of this thesis was to evaluate alternatives to the Wawo face recognition (FR) library, used by the company CodeMill AB in an application for video-based FR implemented as a plugin to the media asset management system Vidispine. The aim was to improve the FR performance of the application, and the report compares recent versions of Open Source Biometrics Recognition (OpenBR) and Open Source Computer Vision (OpenCV) against Wawo.

ROC curves and the area under the ROC curve (AUC) were used as comparison metrics. Two different test videos were used: a simpler webcam recording and an excerpt from a TV music show. The results are somewhat inconclusive; the Wawo system had technical difficulties with the largest test case. However, it performed better than OpenBR in the two other cases (comparing AUC values), which suggests that Wawo would have outperformed the other systems in all test cases had it worked. Finally, the comparison shows that OpenBR is better than OpenCV in two of the three test cases.


Contents

1 Introduction
  1.1 Background
  1.2 Project Aim and Goal
  1.3 Working Method
  1.4 Related Work
  1.5 Thesis Structure

2 Face Recognition in Video
  2.1 Face Recognition Workflow
  2.2 Techniques for Face Recognition
    2.2.1 Eigenfaces and Principal Component Analysis
    2.2.2 Fisherfaces and Linear Discriminant Analysis
    2.2.3 Local Binary Patterns
    2.2.4 Joint Multiple Hidden Markov Models
  2.3 Evaluated Face Recognition Systems
    2.3.1 Wawo
    2.3.2 OpenBR
    2.3.3 OpenCV's Face Recognition Library

3 Comparison
  3.1 Criteria for Evaluation
  3.2 Test Material
  3.3 Introduction to ROC
    3.3.1 Tables of Confusion
    3.3.2 ROC Plots
    3.3.3 Confusion Matrices
    3.3.4 Area under ROC Curves
  3.4 Method

4 Results
  4.1 Problems
  4.2 Interpretation

5 Conclusions
  5.1 Limitations
  5.2 Future work

Acknowledgements

References

A Detailed Results


Figures

2.1 Local Binary Patterns Illustration
2.2 HMM Illustration
3.1 Test Video Sample Frames
3.2 Table of Confusion Illustration
3.3 ROC Plot Example
3.4 Confusion Matrix Illustration
4.1 Sample Frames after FR
4.2 ROC Results for codemill
4.3 ROC Results for SA_SKA_DET_LATA-long
4.4 ROC Results for SA_SKA_DET_LATA-short
A.1 Detailed ROC Results for codemill
A.2 Detailed ROC Results for SA_SKA_DET_LATA-long (part 1)
A.3 Detailed ROC Results for SA_SKA_DET_LATA-long (part 2)
A.4 Detailed ROC Results for SA_SKA_DET_LATA-short (part 1)
A.5 Detailed ROC Results for SA_SKA_DET_LATA-short (part 2)


Tables

2.1 Face Recognition Algorithms
2.2 Evaluated Face Recognition Systems
3.1 Test Videos Used in the Comparison
4.1 AUC Results
A.1 Test Video Details
A.2 Detailed AUC Results


Chapter 1

Introduction

In the media industry, media asset management systems (MAM systems) are increasingly important, as companies start publishing video and audio material on multiple platforms in multiple formats. The responsibility of MAM systems is to catalog, store, and distribute media assets (digital files and their accompanying usage rights).

There are many attributes that are worth storing about a media file, and emerging among them is harder-to-extract time-coded information, such as

Who is present at different times in a video? or

What do people say in a video and when do they say it?

Only recently has functionality for (semi-)automatic extraction of such information been introduced into MAM systems, making this a field with potential for large improvements.

Of interest for this thesis project is the area of automated face recognition (FR) in video material. The task is to answer which people are present in a video, at which times they are present, and how they move in the picture. Below are two examples of why MAM system users are interested in FR:

1. An organization handles recordings of board meetings with a MAM system and needs to know which meetings a given board member has attended.

2. The editorial staff of a public service broadcasting corporation wants to compose a montage of a newly deceased celebrity using archive footage. The video archive is managed by a MAM system.

This chapter gives an introduction to the project, its aim and goal, and how and where it was conducted.


1.1 Background

The thesis project was conducted at CodeMill AB (http://www.codemill.se/), a Umeå-based information and communications technology (ICT) consulting and application development company. It focuses on medical-resource and media-production applications and currently maintains an application for face recognition (FR).

CodeMill works closely with Vidispine AB (http://vidispine.com/), another Swedish company, which has been developing a MAM framework since 2009. The Vidispine product is composed of several modules, of which the transcoder is the most interesting for this work. It processes video streams in a multitude of ways: scaling resolution, changing frame rate, extracting parts of streams, merging several streams, etc. It also supports extracting time-coded metadata via a plugin architecture.

CodeMill's FR application therefore works as a plugin to the Vidispine transcoder, leveraging its broad support for different video decoders and demuxers. It uses the Wawo library (presented in more detail in Chapter 2) for the actual face recognition.

1.2 Project Aim and Goal

The aim of the thesis project was to improve the performance of CodeMill's FR application, as it had room for improvement. Earlier thesis projects at the company had looked at face detection and video object tracking; this project was to focus on face recognition. More specifically, the goal was to evaluate alternatives to Wawo for performing FR.

The original problem statement was to "evaluate alternatives, both proprietary and free/open source software (FOSS) systems, comparing them with the currently used system for face recognition (Wawo)." The evaluation was to be based on "the reliability of the different systems in matching faces, the amount of training required, and the amount of user interaction needed."

As a next step, the implementation of an existing open-source system for face recognition would be analyzed and improved with available algorithms. It would then be compared to the original implementation and the other systems to measure the impact of the improvements. However, this step was removed from the project plan because of time constraints.



1.3 Working Method

The thesis project was divided into three broad phases:

1. Preparation: Apart from concretizing the project idea into a thesis proposal and planning the work, a theoretical study of the scientific area of face detection and recognition was conducted, followed by a field study of available FR systems. Comparison criteria were also investigated.

2. Evaluation: A test framework for evaluating the chosen FR systems was developed, as well as plugins to use them with the Vidispine transcoder. Then, test material for the evaluation was prepared by semi-manually marking up faces in video streams to use as reference data. Finally, the comparison was made.

3. Documentation: The work was documented in this thesis.

1.4 Related Work

This thesis is in some sense a continuation of previous thesis projects conducted at CodeMill, all aimed at improving the face recognition application. Among them are the work of Thomas Rondahl [9], comparing two systems for face detection, and Linus Nilsson [7], adding support for face tracking. Another project, related to the overall aim of extracting time-coded metadata, is Tobias Nilsson's [8], which looks into speech-to-text for audio and video streams.

1.5 Thesis Structure

This thesis is divided into the following chapters:

Chapter 2: A theoretical background to face recognition and common methods and algorithms is given, as well as an introduction to the compared FR systems.

Chapter 3: How the comparison was conducted is explained: first, the comparison criteria and the test material are presented; then, an introduction to the statistical tools of interest is given.

Chapter 4: The results of the comparison are presented and explained.


Chapter 5: The conclusions drawn from the comparison results are given. Furthermore, limitations of the presented work are discussed and future work is proposed.

Appendix A: For interested readers, details of the test videos and detailed comparison results are included here.


Chapter 2

Face Recognition in Video

This chapter describes the process of face recognition from a theoretical perspective. First, the overall workflow of face detection, object tracking, and face recognition is outlined. Then several techniques for performing the actual face recognition are described. Finally, the evaluated FR systems are presented.

2.1 Face Recognition Workflow

Face recognition in video streams can be divided into three steps:

1. Face detection (FD) is the act of discovering potential faces in an image frame from the stream. The image is scanned for areas that contain features of a human face: skin-toned color, eyes, nose, mouth, etc.

2. After that, video object tracking is used to follow the identified "face objects" frame by frame in the video stream. This is done for the duration of the scene, until the video is cut. Then the detection process must be restarted from the beginning of the next scene.

3. Finally, the actual face recognition (FR) step is performed. The images of the identified faces are matched against a database of known ones. Depending on the size of the database and the number of faces in the scene, this process can range from simple (a single person in the scene, where every person occurring in the video is known) to very complex (hundreds of people, choosing among everyone appearing somewhere in a company's video archive).

In the Vidispine transcoder, steps 1 and 3 are performed by one of the FR system plugins, and step 2 is done by an object tracker via another plugin structure.

As explained in Chapter 1, this thesis focuses on the face recognition step.


Algorithm     Description                                                    Sect.
Eigenfaces    Holistic approach, using PCA for dimension reduction.          2.2.1
Fisherfaces   Holistic approach, using (PCA and) LDA for data analysis       2.2.2
              and dimension reduction.
LBP           Local-feature-based approach, developed from the field of      2.2.3
              texture analysis.
JM-HMM                                                                       2.2.4

Table 2.1: Face recognition algorithms used by the evaluated FR systems. Each algorithm is described in more detail in its respective subsection.

2.2 Techniques for Face Recognition

The compared FR systems use algorithms developed specifically for, or adapted to, face recognition; some are common and some are specific to a single product. This section describes the ones used in the evaluated systems, namely

1. the Eigenfaces approach, using Principal Component Analysis (PCA);
2. the Fisherfaces approach, using Linear Discriminant Analysis (LDA);
3. Local Binary Patterns (Histograms) (LBP); and
4. Joint Multiple Hidden Markov Models (JM-HMM).

Each algorithm is described in its respective subsection; Table 2.1 contains a summary. (Below, "(· · ·)" denotes lists or sequences, and "[· · ·]" denotes vectors or matrices. Also, boldface variables, such as X and x, represent lists, sequences, vectors, or matrices, in contrast to scalar variables such as X and x.)

2.2.1 Eigenfaces and Principal Component Analysis

The Eigenfaces algorithm, proposed by MIT researchers Matthew Turk and Alex Pentland in 1991 [10], takes a holistic approach to face recognition: each face image is considered as a whole, rather than trying to extract its local features.

Each face image (converted to gray-scale) of resolution p × q pixels corresponds to a vector in a pq-dimensional vector space, the space of all potential faces. If all vector components are uncorrelated, i.e., contain no overlapping information, it is a huge task to store and compare these vectors; e.g., a set of 100 × 100-pixel training images yields a 10 000-dimensional vector space.

The goal of Principal Component Analysis (PCA) is to find the components with greatest variance, called the principal components, so that the original space can be reduced to one of lower dimensionality by converting correlated vector components into uncorrelated ones. This is done with the following steps:

1. Treat the sample images (a set of representative face images, or the face images of the people to train for) as a list of column vectors

   X = (x_1, x_2, \dots, x_N) ,

   each vector x_i of length K = pq corresponding to the pixel values of image i.

2. Compute the sample mean vector \bar{x} of the images:

   \bar{x} = [\bar{x}_j] = \frac{1}{N} \sum_{i \le N} x_i .   (2.1)

   Each component \bar{x}_j (j \le K) is the mean of the jth components of every x_i \in X.

3. Compute the K \times K (sample) covariance matrix Q from the image vectors x_i, first subtracting the mean \bar{x}:

   Q = [q_{jk}] = \frac{1}{N-1} \sum_{i \le N} (x_i - \bar{x})(x_i - \bar{x})^\top ,   (2.2)

   where each entry q_{jk} is (an estimate of) the covariance between the jth and kth components of every x_i \in X.

4. Now the task is to find a projection V^* that maximizes the variance of the data:

   V^* = \arg\max_{V} \left| V^\top Q V \right| .

   This is equivalent to solving the eigenvalue problem for Q:

   Q v_j = \lambda_j v_j ,   j \le K ,

   with eigenvectors V = (v_1, v_2, \dots, v_K), called eigenfaces since they have the same dimension as the sample images, and their respective eigenvalues \Lambda = (\lambda_1, \lambda_2, \dots, \lambda_K). V^*, then, is the matrix composed of the L \le K eigenfaces with the largest eigenvalues, ordered descending; i.e., V^* = [v^*_j] (j \le L), where v^*_1 \in V has the highest corresponding eigenvalue, v^*_2 \in V the next highest, etc.

5. Solving this eigenvalue problem can, however, be infeasible unless very low-resolution face images are used. Suppose, again, that we have a set of N = 300 training images, each of size 100 × 100 pixels. This corresponds to vectors in X of length K = 100 · 100 = 10 000, which in turn means that the covariance matrix Q is of size K × K = 10 000 × 10 000.

   If, as in this case, N < K, we can use a trick to compute the eigenvectors. First, note that Q can be computed in a different way:

   Q = \frac{1}{N-1} Y Y^\top ,

   where Y is the K \times N matrix composed of column vectors [y_i], each y_i = x_i - \bar{x} (i \le N). The idea is to solve the eigenvalue problem for R = \frac{1}{N-1} Y^\top Y instead of for Q:

   R u_j = \lambda_j u_j ,
   \frac{1}{N-1} Y^\top Y u_j = \lambda_j u_j ,   j \le N ,

   a much easier task, as R is only N \times N = 300 \times 300 in size. By left-multiplying both sides with Y,

   \frac{1}{N-1} Y Y^\top Y u_j = \lambda_j Y u_j ,
   Q Y u_j = \lambda_j Y u_j ,

   we note that if u_j is an eigenvector of R, then v_j = Y u_j is an eigenvector of Q.

6. Finally, the L principal components x^* (L \le K) of an image vector x are given by

   x^* = V^{*\top} (x - \bar{x}) ,   (2.3)

   where \bar{x} is defined as in Equation (2.1). As expected, the vector x^* is a reduced version of x, with only L components instead of the original K.


Now V^*, combined with \bar{x}, acts as a basis for the PCA subspace where the comparisons take place. To initiate a face gallery, all training samples are projected into the subspace using Equation (2.3) (if this was not already done when computing V^*). For face recognition, the query image is also projected into the same space and compared to all samples in the gallery.

Depending on the application, V^* can be computed in advance on a "representative" set of face images, without knowledge of whom to recognize, or computed on the actual training set (or a superset of it) when training the FR system.
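As a concrete illustration of steps 1–6, the following sketch computes an eigenface basis and projects images into the PCA subspace. It assumes NumPy; the function names (train_eigenfaces, project) are hypothetical and not taken from any of the evaluated systems.

import numpy as np

def train_eigenfaces(images, L):
    """images: list of equally sized gray-scale images (2-D arrays).
    Returns (V_star, mean) spanning an L-dimensional PCA subspace."""
    X = np.stack([img.astype(np.float64).ravel() for img in images], axis=1)  # K x N
    mean = X.mean(axis=1, keepdims=True)      # sample mean vector (Eq. 2.1)
    Y = X - mean                              # centered data, K x N
    N = Y.shape[1]
    R = (Y.T @ Y) / (N - 1)                   # small N x N matrix instead of K x K
    eigvals, U = np.linalg.eigh(R)            # eigenvectors u_j of R
    order = np.argsort(eigvals)[::-1][:L]     # keep the L largest eigenvalues
    V_star = Y @ U[:, order]                  # v_j = Y u_j are eigenvectors of Q
    V_star /= np.linalg.norm(V_star, axis=0)  # normalize each eigenface
    return V_star, mean

def project(V_star, mean, image):
    """Project a gray-scale image into the PCA subspace (Eq. 2.3)."""
    x = image.astype(np.float64).reshape(-1, 1)
    return V_star.T @ (x - mean)

A query face projected this way can then be matched by, e.g., a nearest-neighbor search among the projected gallery samples.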

2.2.2 Fisherfaces and Linear Discriminant Analysis

Even though the Eigenfaces approach seems reasonable, it has deficits. Since it does not take into account which class (i.e., which person) a given face belongs to, some information is lost.

The Fisherfaces algorithm was proposed by Peter Belhumeur, João Hespanha, and David Kriegman in 1996 [2]. Instead of using PCA for dimension reduction, it uses the related Linear Discriminant Analysis (LDA), which tries to find a projection of the face data that maximizes the between-class variance while minimizing the within-class variance. The algorithm outline is as follows:

1. As in Section 2.2.1, let X be a list of p \times q-pixel sample images, each image represented as a column vector of length K = pq. However, this time X is partitioned into C classes, each class representing a person:

   X_j = (x_{j1}, x_{j2}, \dots, x_{jN_j}) ,   N_j = |X_j| ,   j \le C ,
   X = X_1 \| X_2 \| \cdots \| X_C .

   Here, \cdot\|\cdot denotes list concatenation, e.g., (a_1, a_2) \| (b_1) = (a_1, a_2, b_1).

2. Compute the per-class sample mean vectors (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_C) and the total sample mean vector \bar{x}:

   \bar{x}_j = \frac{1}{N_j} \sum_{i \le N_j} x_{ji} ,   j \le C ,

   \bar{x} = \frac{1}{N} \sum_{j \le C, \, i \le N_j} x_{ji} ,   N = |X| .

1 For completeness' sake, the formula for reconstructing a projected face image x^* from the PCA subspace into its original form is x = V^* x^* + \bar{x}, where \bar{x} is defined as in Equation (2.1). Note the symmetry with Equation (2.3).


3. Compute the between-class scatter matrix S_B and the within-class scatter matrix S_W (both K \times K):

   S_B = \sum_{j \le C} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^\top ,

   S_W = \sum_{j \le C, \, i \le N_j} (x_{ji} - \bar{x}_j)(x_{ji} - \bar{x}_j)^\top ,

   measuring (estimates of) the non-normalized covariance between and within classes, respectively; i.e., unlike Equation (2.2), they are not normalized by the number of classes C or samples N.

4. The final task is to find a projection W^* that separates the classes as much as possible while keeping the within-class variance as low as possible, formally

   W^* = \arg\max_{W} \frac{\left| W^\top S_B W \right|}{\left| W^\top S_W W \right|} .

   The solution is found by solving the following eigenvalue problem for eigenvectors (v_1, v_2, \dots, v_K) with respective eigenvalues (\lambda_1, \lambda_2, \dots, \lambda_K):

   S_B v_k = \lambda_k S_W v_k ,   k \le K ,
   S_W^{-1} S_B v_k = \lambda_k v_k .
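A compact sketch of steps 1–4 above, assuming NumPy; the helper name fisherfaces and the label encoding are hypothetical. In practice a PCA step is usually applied first so that S_W becomes invertible, consistent with Table 2.1 listing Fisherfaces as using "(PCA and) LDA".

import numpy as np

def fisherfaces(X, labels, L):
    """X: K x N matrix of column image vectors; labels: length-N class ids.
    Returns a K x L projection maximizing between- vs. within-class scatter.
    (Sketch only; a prior PCA step is normally needed so that S_W is invertible.)"""
    labels = np.asarray(labels)
    K, N = X.shape
    mean = X.mean(axis=1, keepdims=True)
    S_B = np.zeros((K, K))
    S_W = np.zeros((K, K))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        S_B += Xc.shape[1] * (mc - mean) @ (mc - mean).T   # between-class scatter
        D = Xc - mc
        S_W += D @ D.T                                     # within-class scatter
    # Solve S_W^{-1} S_B v = lambda v and keep the L largest eigenvalues.
    eigvals, V = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1][:L]
    return V[:, order].real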

2.2.3 Local Binary Patterns

In contrast to the Eigenfaces and Fisherfaces approaches, more recent FR research has moved away from such holistic or global reasoning, focusing instead on local features in face images. One such algorithm, Local Binary Patterns (Histograms) (LBP), was proposed by Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen of the University of Oulu in 2004 [1].

It has its roots in 2D texture analysis, a subfield of computer vision: instead of trying to make sense of the whole image at once, we only encode how each pixel differs from its neighbors. The algorithm follows these steps:

1. Each p \times q-pixel face image is represented as a matrix x of the same size, and for each center pixel position c = (x_c, y_c) with neighboring positions (p_1, p_2, \dots, p_P), its local binary pattern is computed as

   \mathrm{LBP}(c) = \sum_{0 \le i < P} 2^i \cdot \mathrm{sgn}(I(p_i) - I(c)) ,


where P is the number of neighbors of c, I(p) = I(x, y) is the intensity at image position p = (x, y), and

   \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0 , \\ 0 & \text{if } x < 0 . \end{cases}

2. An LBP operator is used to compute which positions constitute the neighborhood of a given center pixel. The original (and simplest) one is exemplified in Figure 2.1, but there are newer, more refined operators. One can use, e.g., the extended LBP operator:

   p_i = (x_c, y_c) + R \cdot \left( \cos\frac{2\pi i}{P} , \, -\sin\frac{2\pi i}{P} \right) ,   0 \le i < P , \; R \ge 1 ,   (2.4)

   where each position p_i is a neighbor of the center position c = (x_c, y_c), R is the radius of the neighborhood circle, and P the number of sample points on the circle. The idea, then, is to vary the value of R to encode features of varying scale. Since p_i will in general not be an integer pixel position, its intensity must be interpolated from the surrounding pixels using, e.g., bilinear or nearest-neighbor interpolation.

3. The final feature vector for the image is computed by

   a) dividing the image matrix into fixed-size cells, e.g., 16 × 16 pixels;
   b) computing a local histogram over each cell; and
   c) concatenating the local histogram data.

Feature vectors can then be compared, matching a query image against a gallery of faces.
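For illustration, here is a sketch of the basic 3 × 3 operator (Figure 2.1) and the per-cell histogram feature vector, assuming NumPy; the names and the bit ordering of the neighbors are hypothetical choices, and the extended operator of Equation (2.4) is not implemented here.

import numpy as np

# Offsets of the 8 neighbors used by the basic 3x3 LBP operator.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_image(img):
    """Compute the basic 3x3 LBP code for every interior pixel of a gray-scale image."""
    img = img.astype(np.int32)
    center = img[1:-1, 1:-1]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbor = img[1 + dy : img.shape[0] - 1 + dy, 1 + dx : img.shape[1] - 1 + dx]
        codes |= (neighbor >= center).astype(np.int32) << bit
    return codes

def lbp_histogram(img, cell=16):
    """Concatenate per-cell histograms of LBP codes into one feature vector."""
    codes = lbp_image(img)
    feats = []
    for y in range(0, codes.shape[0] - cell + 1, cell):
        for x in range(0, codes.shape[1] - cell + 1, cell):
            hist, _ = np.histogram(codes[y:y + cell, x:x + cell], bins=256, range=(0, 256))
            feats.append(hist)
    return np.concatenate(feats)

Two faces can then be compared by a histogram distance, e.g., chi-squared distance, between their feature vectors.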

2.2.4 Joint Multiple Hidden Markov Models

Hidden Markov Models (HMMs) have been used in pattern recognition for, e.g., speech and gesture recognition. A variant for face recognition, called Joint Multiple Hidden Markov Models (JM-HMM), was proposed by Professor Hung-Son Le [5, 6].

2 Using matrix notation, bilinear interpolation is defined as

   I(x, y) \approx \begin{bmatrix} x_2 - x & x - x_1 \end{bmatrix} \begin{bmatrix} I(x_1, y_1) & I(x_1, y_2) \\ I(x_2, y_1) & I(x_2, y_2) \end{bmatrix} \begin{bmatrix} y_2 - y \\ y - y_1 \end{bmatrix} ,

or more compactly as

   I(x^*, y^*) \approx \begin{bmatrix} 1 - x^* & x^* \end{bmatrix} \begin{bmatrix} I(0, 0) & I(0, 1) \\ I(1, 0) & I(1, 1) \end{bmatrix} \begin{bmatrix} 1 - y^* \\ y^* \end{bmatrix}

if we use the unit-square coordinate system where every point is translated by (-x_1, -y_1).


Figure 2.1: Illustration of the original LBP operator. x is a face image, represented as a two-dimensional matrix. The 3 × 3 neighborhood

   1 2 2
   9 5 6
   5 3 1

is thresholded against the center intensity, I(p_i) \ge I(c) = 5?, giving

   0 0 0
   1 c 1
   1 0 0

and the code LBP(c) = 0011 0001_2. The operator can be expressed formally as p_i = (x_c, y_c) + (\lfloor \cos\frac{2\pi i}{8} \rceil, -\lfloor \sin\frac{2\pi i}{8} \rceil), 0 \le i < 8, where each position p_i is a neighboring pixel of the center pixel c = (x_c, y_c) and \lfloor \cdot \rceil denotes rounding to the nearest integer. Adapted from Equation (2.4).

An HMM models the observable outcomes of a hidden stochastic process; i.e., a random process whose state is not directly observable is observed indirectly through another set of processes conditioned on the first. As illustrated in Figure 2.2, an HMM \Lambda = (\Pi, A, B) is defined by

1. the initial state probabilities \Pi = [\pi_i], each element \pi_i the probability of starting in state s_i;

2. its states S = \{s_1, s_2, \dots, s_N\} with transition probability matrix A = [a_{ij}], each entry a_{ij} the probability of transitioning to state s_j when in state s_i; and

3. the set of observation symbols V = \{v_1, v_2, \dots, v_M\}, with emission probability matrix B = [b_{jk}], each entry b_{jk} the probability of emitting the symbol v_k when in state s_j.

A corresponding HMM process, then, is a sequence of emitted observation symbols O = (o_1, o_2, \dots, o_T) and a corresponding sequence of hidden states Q = (q_1, q_2, \dots, q_T), each observation o_t emitted in state q_t.

In the JM-HMM algorithm, two problems are to be solved: a) computing the parameters \Pi, A, and B of a model \Lambda that maximize the probability of emitting a set of training observations (O_1, O_2, \dots, O_N), and b) evaluating the probability that a given observation sequence O was produced by a given model \Lambda.
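Problem b) is classically solved with the forward algorithm; the following is a minimal, generic sketch of it (assuming NumPy and discrete observation symbols used as indices into B), not the JM-HMM implementation itself.

import numpy as np

def sequence_probability(pi, A, B, observations):
    """Forward algorithm: probability that the HMM (pi, A, B) emits the given
    observation sequence. pi: (N,), A: (N, N), B: (N, M); observations are
    symbol indices into B's columns."""
    alpha = pi * B[:, observations[0]]          # P(q_1 = s_i, o_1)
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]           # propagate one step and emit
    return alpha.sum()

# Tiny 2-state, 2-symbol example (hypothetical numbers):
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(sequence_probability(pi, A, B, [0, 1, 0]))

For recognition, one such model could be trained per identity, and a query observation sequence assigned to the model giving the highest probability.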

[Figure 2.2: Illustration of an HMM process. The hidden state sequence Q = (q_1, q_2, \dots, q_T), with q_t \in S = \{s_1, \dots, s_N\}, is observed as a sequence of emitted symbols O = (o_1, o_2, \dots, o_T), with o_t \in V = \{v_1, \dots, v_M\}; each transition q_t \to q_{t+1} and observation q_t \to o_t occurs with probabilities defined by the underlying model.]

Identifier      Name              Comments                                     Sect.
wawo            Wawo              A research spin-off company, based on        2.3.1
                                  research from Umeå University.
openbr          OpenBR            Collaborative research project supported     2.3.2
                                  by the MITRE Corporation.
opencv{1,2,3}   OpenCV's facerec  Experimental contribution to OpenCV's        2.3.3
                                  contrib package.

Table 2.2: Face recognition systems that are part of the comparison. The identifier is used when presenting the results. Each system is described in more detail in its respective subsection.

2.3 Evaluated Face Recognition Systems

This thesis compares the three FR systems summarized in Table 2.2. Their theoretical foundations, design, and public APIs are described here.


2.3.1 Wawo

The Wawo FR library and application is developed by the Umeå-based company Wawo Technology AB (http://www.wawo.com/). The company was co-founded by Prof. Le in 2008 as a way to turn his HMM-based approach [6] to FR into a product.

2.3.2 OpenBR

The OpenBR project (Open Source Biometrics Recognition, http://openbiometrics.org/) is an effort to develop a library of open algorithms in the area of biometrics, as well as a framework to use and evaluate them. The evaluated version is 0.2, released February 23, 2013.

OpenBR's API is based on the pipes-and-filters architectural pattern, where each algorithm, called a transform, is composed of a pipeline of "sub-transforms" or an actual implementation. The base objects manipulated in a pipeline are (lists of) matrices, called templates; each transform step takes a matrix or list of matrices and modifies it in some way, giving its output as input to the next step. A template also stores a map of parameter values that can be used for one-off data related to, but not part of, the matrix or matrices. For example, the FaceRecognition transform is expanded into

   FRRegistration → FRExtraction → FREmbedding → FRQuantization ,

which further expands into

   [ASEFEyes → Affine → DFFS] →
   [Mask → {DenseSIFT, DenseLBP} → ProjectPCA → Normalize] →
   [RndSubspace → ProjectLDA → ProjectPCA] →
   [Normalize → Quantize] .

(A → B means that the result from A is piped to B as input; {A, B} means that A and B are performed independently on the same input, with their respective results concatenated as output. Some details are removed for clarity.) Worth noting is that the transforms are not static; while manipulating templates, they simultaneously learn internal parameters.

As hinted by the transform names, OpenBR uses a combination of three of the described FR techniques:


1. Eigenfaces, with the use of PCA;
2. Fisherfaces, with the use of LDA; and
3. LBP.

These are then weighted together when doing the face recognition.

2.3.3 OpenCV's Face Recognition Library

OpenCV (Open Source Computer Vision, http://opencv.org/) is a software library for computer vision and machine learning, as well as the community surrounding its development. Recent versions of OpenCV contain an experimental library for face recognition in the contrib module. The version evaluated is 2.4.3, released November 2, 2012.

The facerec library contains three separate FR implementations:

1. EigenFaceRecognizer, implementing Eigenfaces and PCA;
2. FisherFaceRecognizer, implementing Fisherfaces and LDA; and
3. LBPHFaceRecognizer, implementing LBP Histograms.

The user selects which one to use through a factory pattern. All recognizers need gray-scale face images for training and classification, and the first two additionally require that the images are normalized to a fixed pixel resolution.

In the comparison, opencv1 identifies the classifier FisherFaceRecognizer, opencv2 identifies LBPHFaceRecognizer, and opencv3 identifies EigenFaceRecognizer.
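A minimal sketch of how the facerec API is typically driven from Python is shown below. The factory names differ between OpenCV releases (the 2.4 series used, e.g., cv2.createLBPHFaceRecognizer(), while the code below uses the newer cv2.face module); the image file names and labels are hypothetical.

import cv2
import numpy as np

# Create one of the three recognizers via the factory API.
recognizer = cv2.face.LBPHFaceRecognizer_create()

# Hypothetical training data: gray-scale face crops and integer identity labels.
faces = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in ["id0_a.png", "id0_b.png", "id1_a.png"]]
labels = np.array([0, 0, 1], dtype=np.int32)

recognizer.train(faces, labels)

# Classify a query face; predict() returns the best label and a distance-like
# confidence value that can be thresholded to output "None/Unknown".
query = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
label, confidence = recognizer.predict(query)
print(label, confidence)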


Chapter 3

Comparison

This chapter describes how the comparison of the FR systems was done: more specifically, which aspects were evaluated, what test data was used, and how the systems were compared.

3.1 Criteria for Evaluation

When evaluating the different FR systems, comparison criteria must be chosen. To make the comparison fair, all systems were given the same set of detected faces in the frames of each test video: instead of each system using its own face detection (if available at all), they were fed the recorded faces from the ground-truth data of the test videos. Likewise, all tracking of face objects was disabled, so that it would not interfere with the results.

The evaluated aspect of the FR systems was the number of correct vs. incorrect classifications of identified faces, summed over all video frames. In discussion with CodeMill, aspects such as training and classification time were deemed less important. The statistical methods for weighing correct and incorrect classifications against each other are presented in Section 3.3.

3.2 Test Material

The comparison used several different test videos, listed in Table 3.1 and exemplified in Figure 3.1. They were prepared by extracting face rectangles from each video frame automatically with OpenCV's face detector, then manually assigning the correct identity (person) to each detected face. False positives (incorrect faces) were removed and false negatives (missing faces) inserted, so that the test material would correspond to reality.

The music quiz show Så ska det låta (b), a long-running entertainment series on Swedish television, was chosen for its complexity: there are many scene changes and camera movements, and several people occur in different poses and constellations, which makes FR difficult. On the other hand, the locally recorded webcam video (a) constitutes a simpler case, while not being trivial: the camera is still, but people are constantly moving in and out of the picture, 1–3 at a time.


Identifier        Description                                        Length        #P
codemill          Video recorded at CodeMill using a webcam.         0:38 @ 7.5    3
                  1–3 people in picture at a time; camera still.
SA_SKA_DET_LATA   First part of an episode from this year's season   8:02 / 0:30   7 / 5
                  of the Swedish-TV music quiz Så ska det låta.      @ 25
                  Multiple people in picture; a lot of camera
                  movements.

Table 3.1: The test videos used in the comparison. The Length column gives the length and frame rate of the video in the format "m:s @ fps"; "#P" is the number of identifiable people appearing in it. Table A.1(a) in the appendix contains more details.

[Figure 3.1: Sample frames from the test videos. (a) codemill frames (© CodeMill AB, used with permission). (b) SA_SKA_DET_LATA frames (Så ska det låta, © Sveriges Television, used with permission).]


Due to technical difficulties with one of the FR systems, a shorter version of the Så ska det låta video was also prepared.

Table A.1(b), (c) in the appendix contains detailed information on how the identities are distributed, i.e., the class skew.

3.3 Introduction to ROC

This section gives an introduction to tables and matrices of confusion, true- and false-positive rates, and Receiver Operating Characteristic (ROC) plots. It can be skipped if the reader is already familiar with these concepts. It is based on the excellent introductory article by Fawcett [3].

3.3.1 Tables of Confusion

Tables of confusion, as illustrated in Figure 3.2, are a statistical tool for visualizing data from binary classification experiments. For example, in testing whether a patient has a given disease, the person is classified as positive if the test indicates the disease and negative otherwise. This may or may not correspond to the truth: the actual condition is positive if the patient really has the disease and negative if not.

A table of confusion summarizes the test results into four categories, depending on actuality and classification:

TP: the number of true-positive samples, correctly classified as positive;

FP: the number of false-positive samples, incorrectly classified as positive;

FN: the number of false-negative samples, incorrectly classified as negative;

TN: the number of true-negative samples, correctly classified as negative.

These values are in turn summed up row- and column-wise:

P = TP + FN , P′ = TP + FP ,

N = FP + TN , N′ = FN + TN .

Hence, P is the total number of positive samples and N the total number of negative ones, whereas P′ and N′ are the total numbers of positively and negatively classified samples, respectively.


                classified
                pos    neg    total
actual   pos    TP     FN     P
         neg    FP     TN     N
total           P′     N′

Figure 3.2: Illustration of a table of confusion, summarizing the result of a binary classification experiment into four categories, depending on actuality and classification (rows vs. columns).

3.3.2 ROC Plots

Tables of confusion can be visualized in Receiver Operating Characteristic plots (ROC plots), as seen in Figure 3.3(a), using the metrics true-positive rate TPR and false-positive rate FPR. They are defined as

   \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{\mathrm{TP}}{\mathrm{P}} ,   \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} = \frac{\mathrm{FP}}{\mathrm{N}} ,   (3.1)

and the (FPR, TPR) pair of each table corresponds to a point in the plot.

When interpreting a ROC plot, points closer to the corner point (0, 1) and farther away from the line of no discrimination y = x (the dashed diagonal line) are better, as they indicate a higher TPR (more correct classifications) and/or a lower FPR (fewer incorrect classifications).

By varying the threshold parameter of a binary classifier we get multiple tables (one for each threshold value); plotting these as a curve shows the trade-off between high TPR and low FPR.
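As a small sketch of how such a curve can be produced by sweeping the threshold, assuming NumPy and hypothetical score/label arrays where a higher score means "more likely positive":

import numpy as np

def roc_points(scores, labels):
    """Sweep a decision threshold over classifier scores and return the
    (FPR, TPR) point for each threshold (Eq. 3.1)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    P = labels.sum()
    N = (~labels).sum()
    points = []
    for t in np.unique(scores)[::-1]:          # one table of confusion per threshold
        predicted = scores >= t
        TP = np.sum(predicted & labels)
        FP = np.sum(predicted & ~labels)
        points.append((FP / N, TP / P))
    return [(0.0, 0.0)] + points + [(1.0, 1.0)]

# Example: scores for 6 samples, 3 of which are actually positive.
pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 0, 0])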

3.3.3 Confusion Matrices

In FR, the interest lies in classifying a given face into one of multiple identities, or classes. Therefore, the table-of-confusion concept must be generalized to more than the two classes positive and negative.

A confusion matrix contains the same information for N classes as a table of confusion does for the binary case. As shown in Figure 3.4(a), each entry c_{jk} (j, k \le N) is the number of samples of actual class j that are classified as belonging to class k. Hence, the N diagonal entries indicate correct classifications, while all other N^2 - N entries indicate errors.

1 TPR is equivalent to sensitivity, and FPR is equivalent to (1 − specificity).


[Figure 3.3: Examples of ROC plots. The horizontal axis shows the false-positive rate FPR and the vertical axis the true-positive rate TPR. (a) ROC curve: the curve of an example classifier, each point corresponding to a different threshold value. (b) Area under the ROC curve in (a): the filled area is the AUC metric of the given example curve.]

(a) Confusion matrix for an N-class classification experiment:

                 classified                          total
   actual    c_11   c_12   · · ·   c_1N              C_1
             c_21   c_22   · · ·   c_2N              C_2
             ...    ...            ...               ...
             c_N1   c_N2   · · ·   c_NN              C_N
   total     C′_1   C′_2   · · ·   C′_N

(b) Class reference table of confusion corresponding to (a):

                 classified          total
   actual    TP_i    FN_i            P_i
             FP_i    TN_i            N_i
   total     P′_i    N′_i

Figure 3.4: Illustration of a confusion matrix, summarizing the result of an N-class classification experiment into a square matrix, and the corresponding class reference table of confusion (pairwise comparison between each class i and all others).


The row sums C_1, \dots, C_N, then, are the actual numbers of samples of each class, and the column sums C'_1, \dots, C'_N are the numbers of samples classified as each class. Formally,

   C_i = \sum_{k \le N} c_{ik} ,   C'_i = \sum_{j \le N} c_{ji} ,   i \le N .

The confusion matrix in itself is hard to visualize because of its high dimensionality. One method of handling this is called class reference formulation, where each class i is compared to all other classes (i \le N) [3, p. 9]. As illustrated in Figure 3.4(b), for each class the matrix is reduced to a table of confusion where the original class i corresponds to the positive class and all other classes form the negative class, formally

   \mathrm{TP}_i = c_{ii} ,   \mathrm{FN}_i = \sum_{k \le N, \, k \ne i} c_{ik} ,
   \mathrm{FP}_i = \sum_{j \le N, \, j \ne i} c_{ji} ,   \mathrm{TN}_i = \sum_{j, k \le N, \, j, k \ne i} c_{jk} ,   (3.2)

and

   \mathrm{P}_i = \mathrm{C}_i ,   \mathrm{P}'_i = \mathrm{C}'_i ,
   \mathrm{N}_i = \sum_{j \le N, \, j \ne i} \mathrm{C}_j = \sum_{j, k \le N, \, j \ne i} c_{jk} ,   \mathrm{N}'_i = \sum_{k \le N, \, k \ne i} \mathrm{C}'_k = \sum_{j, k \le N, \, k \ne i} c_{jk} .   (3.3)
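Equations (3.2) and (3.3) translate directly into code; a sketch assuming NumPy and a hypothetical helper name:

import numpy as np

def class_reference_table(C, i):
    """Reduce an N x N confusion matrix C to the binary table of confusion
    for reference class i (Eqs. 3.2 and 3.3)."""
    C = np.asarray(C)
    TP = C[i, i]
    FN = C[i, :].sum() - C[i, i]
    FP = C[:, i].sum() - C[i, i]
    TN = C.sum() - TP - FN - FP
    return TP, FP, FN, TN

# Example: 3-class confusion matrix, reference class 0.
C = np.array([[5, 1, 0],
              [2, 7, 1],
              [0, 1, 6]])
TP, FP, FN, TN = class_reference_table(C, 0)
TPR, FPR = TP / (TP + FN), FP / (FP + TN)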

3.3.4 Area under ROC Curves

When comparing ROC curves, it is advantageous to reduce the two-dimensional plot to a single scalar value. One common method is to compute the area under the curve (AUC), illustrated in Figure 3.3(b) [3, p. 7]. The AUC is always between 0 and 1, as the ROC curve is contained in the unit square. What is more, a usable classifier should have an area greater than 0.5, as anything less would indicate performance worse than pure chance. The AUC value is computed from the (FPR, TPR) pairs of the ROC curve, using the trapezoidal rule or any other method for numerical integration.

2 Such a badly performing classifier could simply be inverted to produce better-than-random results.


Reasonably, one would like to have a single metric like the AUC also for the multiclass case. A straightforward way is to compute the AUC for each class independently (using class reference formulation) and then combine these using a weighted mean, which gives a simple yet reasonable global AUC metric:

   \overline{\mathrm{AUC}} = \sum_{i \le N} P_i \cdot \mathrm{AUC}_i ,   (3.4)

where \mathrm{AUC}_i is the area under the ROC curve with reference class i and P_i \in [0, 1] is the distribution of class i (\sum_{i \le N} P_i = 1).
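A sketch of the AUC computation with the trapezoidal rule and the weighted mean of Equation (3.4), assuming NumPy; function names are hypothetical:

import numpy as np

def auc(points):
    """Area under a ROC curve given as a list of (FPR, TPR) points,
    integrated with the trapezoidal rule."""
    fpr, tpr = zip(*sorted(points))
    return float(np.trapz(tpr, fpr))

def mean_auc(per_class_auc, class_counts):
    """Weighted mean AUC over all reference classes (Eq. 3.4)."""
    weights = np.asarray(class_counts, dtype=float)
    weights /= weights.sum()                  # class distribution P_i
    return float(np.sum(weights * np.asarray(per_class_auc)))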

3.4 Method

This section describes the comparison method for one test case. The method is then repeated for each test case.

Each test case (one video file) contains a set of identified faces F, each with an actual identity from the set of identities P, defined as

P = {ID1, ID2, . . . , IDN} ∪ {ID∅} ,

where N is the number of (identifiable) people occurring in the video. The identity ID∅ is special in that it has two meanings: either the face does not belong to any of the people 1, ..., N, or the system cannot tell whether it does.

We also define the following subsets of P for convenience later on:

Pi = P \ {i} ,   P∅ = P \ {ID∅} .

For each FR system and test case, the method is as follows:

1. Each detected face in F is classified by the FR system as belonging to an identity in P. The classification data is summarized in a confusion matrix A = [\gamma_{jk}], as defined in Section 3.3.3: each \gamma_{jk} (j, k \in P) is the number of times the FR system classifies a face of actual identity j as having identity k.

2. For each identity i \in P, we then construct a table of confusion A_i using the class reference method described in the same section: the positive class corresponds to identity i, while the negative class corresponds to the identities P_i. By Equation (3.2) we have

   A_i = \begin{bmatrix} \mathrm{TP}_i & \mathrm{FN}_i \\ \mathrm{FP}_i & \mathrm{TN}_i \end{bmatrix} = \begin{bmatrix} \gamma_{ii} & \sum_{k \in P_i} \gamma_{ik} \\ \sum_{j \in P_i} \gamma_{ji} & \sum_{j, k \in P_i} \gamma_{jk} \end{bmatrix} ,

and, by Equation (3.3), the corresponding row sums are

   \mathrm{P}_i = \sum_{k \in P} \gamma_{ik} ,   \mathrm{N}_i = \sum_{j \in P_i, \, k \in P} \gamma_{jk} .

Thus, by Equation (3.1), the true- and false-positive rates for each identity i are

   \mathrm{TPR}_i = \frac{\gamma_{ii}}{\sum_{k \in P} \gamma_{ik}} ,   \mathrm{FPR}_i = \frac{\sum_{j \in P_i} \gamma_{ji}}{\sum_{j \in P_i, \, k \in P} \gamma_{jk}} .

3. The FR system has a classification threshold value, and varying it gives multiple (FPR_i, TPR_i) pairs. Plotting each TPR value (y-axis) against its FPR value (x-axis) produces the ROC curve for reference class i.

4. Finally, for each identity i, the AUC value AUC_i of the ROC curve is computed, as described in Section 3.3.4, as well as the mean AUC value \overline{AUC} following Equation (3.4).


Chapter 4

Results

This chapter shows the results from the comparison of the FR systems. For each test video and FR system, two items are included:

1. ROC plots for one of the identities vs. all others, and
2. the mean AUC value, computed from the individual AUC values.

Figure 4.2 and Table 4.1(a) show the results for the test video codemill, while Figures 4.3 and 4.4 and Table 4.1(b) show the results for SA_SKA_DET_LATA. Also, Figure 4.1 shows examples of the rendering of classified faces in the original video.

More detailed results can be found in Appendix A; see each figure or table caption for specific references.

4.1 Problems

A big setback was that the wawo system kept crashing when running the long version of the SA_SKA_DET_LATA video. To remedy this, a shorter version of the same video was also used in the comparison; this way, at least some results are available for wawo in this setting.

Also, the openbr system refused to accept multiple training images, rendering the computed metrics somewhat misleading. The AUC values marked with * are computed by classifying the identities without any accepted training images as "None/Unknown"; i.e., from the perspective of the FR system, these identities do not exist and are not counted when computing AUC values (and ROC curves).

4.2 Interpretation

For the codemill test case, OpenCV's Eigenfaces implementation (opencv3) surprisingly performed best. Compared with the results from the other test video, this seems like a fluke.


(a) codemill

FR system   Mean AUC
wawo        0.65
openbr      0.62
opencv1     0.61
opencv2     0.49
opencv3     0.67

(b) SA_SKA_DET_LATA

            Mean AUC
FR system   long           short
wawo        –              0.85
openbr      0.64 / 0.88*   0.82 / 0.94*
opencv1     0.54           0.70
opencv2     0.52           0.57
opencv3     0.53           0.71

Table 4.1: AUC results. A detailed version is found in Table A.2 in the appendix, containing the AUC results also for each identity compared to all others. *Disregarding identities for which all face images failed.

For the long version of SA_SKA_DET_LATA, openbr performed best, even though the missing result from wawo makes the conclusion uncertain. However, looking at the results for the shorter version, wawo has the highest mean AUC value. As those results look quite similar to the ones for the longer version, a reasonable assumption is that wawo would have performed best in both test cases.


[Figure 4.1: The codemill sample frames from Figure 3.1(a) after performing FR. (a) Successful (correct identity) classification. (b) Failed ("None/Unknown" identity) classification. © CodeMill AB, used with permission.]

[Figure 4.2: A selection of ROC plots for codemill (TPR vs. FPR; curves for wawo, openbr, opencv1, opencv2, opencv3). (a) Identity 0 vs. all others. (b) Identity 1 vs. all others. The rest of the plots are found in Figure A.1 in the appendix.]


[Figure 4.3: A selection of ROC plots for SA_SKA_DET_LATA-long (TPR vs. FPR; curves for openbr, opencv1, opencv2, opencv3). (a) Identity 0 vs. all others. (b) Identity 1 vs. all others. The rest of the plots are found in Figures A.2 and A.3 in the appendix.]

[Figure 4.4: A selection of ROC plots for SA_SKA_DET_LATA-short (TPR vs. FPR; curves for wawo, openbr, opencv1, opencv2, opencv3). (a) Identity 0 vs. all others. (b) Identity 1 vs. all others. The rest of the plots are found in Figures A.4 and A.5 in the appendix.]


Chapter 5

Conclusions

The project work was not without problems. One third of the way into the project, I discovered that I had not received all relevant code for CodeMill's FR application, making some of my work redundant or ill-fitting. Also, when doing the test runs to generate comparison data, I noticed that the original FR system (wawo) kept crashing on the longer test video. Thus, I had to rethink my test cases, also using a shorter version of that video.

Because of this, my initial time plan was revised for the latter half of the project. Among the changes, the task of improving one of the evaluated FR systems was removed from the plan. Apart from that, my initial goal was met: I evaluated and compared alternatives to wawo. However, the performance of CodeMill's application was not improved, as the evaluated systems did not necessarily do a better job.

5.1 Limitations

The work on the comparison and its results have some limitations:

• As can be seen in some of the ROC plots (e.g., in Figure A.4(d) in the appendix), the curves do not always reach in an arc from (0, 0) to (1, 1) as an ideal ROC curve should, or they are simply missing.

As a consequence, an assumption has been made when computing the AUC values: implicit endpoints (0, 0) and (1, 1) have been inserted when missing. The effect of this is that, e.g., completely empty curves receive an AUC value of 0.5.

One probable explanation for these incomplete curves is the effect of the FR systems' threshold parameters: varying them does not change the ranking among the individual identities, only whether the highest-ranked one is output as the classification or not. This means that for some of the ROC curves we might never obtain a pair (FPR_i, TPR_i) = (1, 1), as at least some face of identity i is never classified correctly, regardless of the threshold value.

• This approach to calculating ROC curves and AUC values is not the most efficient one. For example, Fawcett [3] presents algorithms for ROC and AUC computations. However, they require that the classifier can output some kind of probability for its classifications, rather than just a binary decision. The FR systems openbr and opencv had this capability but wawo did not, making that approach impossible.

Given more accessible source code, one could modify most classifiers to report the probability in addition to the classification. However, the available code for the wawo system was not behaving correctly, so it was not used (or modified).

Solving the first point would require a more in-depth study of multi-class Receiver Operating Characteristics; perhaps more parameters than a single threshold must be introduced.

5.2 Future work

The following points are areas of improvement for FR in video in general and for CodeMill's FR application specifically:

• In combination with face tracking, use a profile face detector together with the frontal detector, and aggregate the results. That way, the end FR result would hopefully improve, as the object tracker could follow faces even when people turn their heads sideways and back.

• Also related to face tracking, perform face recognition on the tracked face objects continuously, not only on the statically detected faces. This should yield more information to integrate into the end result.

• Evaluate "true" video-based face recognition, such as the work of Gorodnichy [4]. In this method, the classifier directly uses the temporal information inherent in video streams, both for training and classification, instead of looking at each video frame independently.

• Explore continuous learning, where a face classifier receives ongoing corrections to its classifications from a user. As soon as a classified face has been "cleared" as correct, it can be used for training. A misclassified face could also be used as a counter-example, if the FR algorithm supports it.


The first two points are just minor adjustments to the core workflow described in Section 2.1. The third one, however, constitutes a different approach to FR in video and would require major changes to the Vidispine architecture and CodeMill's application. The fourth point would require another approach to storing and handling training data (face galleries).


Acknowledgements

It has been a great opportunity to write my Master's thesis at CodeMill this semester. Thanks go to Petter Edblom for giving me this project, and to Tomas Härdin, my supervisor there, for answering my questions about the gory system details and being a tech wizard in general.

I would also like to thank Frank Drewes, my supervisor at the Department of Computing Science, for being supportive and positive when I have hit difficulties, as well as Linus Nilsson, a previous student who did his Master's thesis project before me at CodeMill, whose comparison scripts served as inspiration.

Last but not least, a heartfelt thank you to Linnéa Carlberg for her love and encouragement, and for believing in me.


References

[1] Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen. Face recognition with local binary patterns. In Computer Vision – ECCV 2004, pages 469–481. Springer, 2004. URL http://link.springer.com/chapter/10.1007/978-3-540-24670-1_36.

[2] Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. In Computer Vision – ECCV 1996, pages 43–58. Springer, 1996. URL http://link.springer.com/chapter/10.1007/BFb0015522.

[3] Tom Fawcett. An introduction to roc analysis. Pattern Recognition Letters, 27(8):861–874, 2006. URL http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf.

[4] Dimitry Gorodnichy. Video-based framework for face recognition in video. In Second Workshop on Face Processing in Video (FPiV 2005), Proceedings of the Second Canadian Conference on Computer and Robot Vision (CRV 2005), pages 330–338, 2005. URL http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/ctrl?action=rtdoc&an=5764009.

[5] Hung-Son Le. Face Recognition Using Hidden Markov Models. Lic. thesis, Umeå University, 2005. URL http://libris.kb.se/bib/9941301.

[6] Hung-Son Le. Face Recognition: A Single View Based HMM Approach. PhD thesis, Umeå University, Applied Physics and Electronics, 2008. URL http://umu.diva-portal.org/smash/record.jsf?pid=diva2:141208.

[7] Linus Nilsson. Object Tracking and Face Recognition in Video Streams. Bachelor's thesis, Umeå University, Department of Computing Science, 2012. URL http://umu.diva-portal.org/smash/record.jsf?searchId=4&pid=diva2:546732.

[8] Tobias Nilsson. Speech Recognition Software and Vidispine. Master's thesis, Umeå University, Department of Computing Science, 2013.


[9] Thomas Rondahl. Face Detection in Digital Imagery Using Computer Vision and Image Processing. Bachelor's thesis, Umeå University, Department of Computing Science, 2011. URL http://umu.diva-portal.org/smash/record.jsf?searchId=4&pid=diva2:480900.

[10] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991. URL http://www.mitpressjournals.org/doi/abs/10.1162/jocn.1991.3.1.71.


Appendix A

Detailed Results

This appendix contains more information about the test videos (Table A.1) and the full details of the results of the comparison (Figures A.1–A.5 and Table A.2).


Identifier        Length   fps   Res.      #P      #F      #D   #D/#F

codemill          0:38     7.5   640×480    3     283     352    1.24
SA_SKA_DET_LATA            25.0  512×288
  – long          8:02                      7  12 062   9 435    0.78
  – short         0:30                      5     750     602    0.80

(a) Details of the test videos used in the comparison.

Identity       Gender     #D     %

0              Male       81    23
1              Male       52    15
2              Female     71    20
None/Unkn.               148    42

Total                    352   100

(b) Class skew for codemill.

                            long           short
Identity       Gender     #D     %      #D     %

0              Male     1 316   14      84    14
1              Male     2 703   29     388    65
2              Female     578    6      17     3
3              Female     782    8     110    18
4              Female     558    6       –     –
5              Female   1 223   13       –     –
6              Male     1 838   19       1     0
None/Unkn.                437    5       2     0

Total                   9 435  100     602   100

(c) Class skew for SA_SKA_DET_LATA.

Table A.1: Details (a) and class skew (b), (c) for the test videos. In (a), for each video the columns show (left to right): length, frame rate (frames per second), frame resolution (width times height in pixels), the number of identifiable people, the total number of frames, the number of detected faces, and the number of detected faces per video frame. In (b) and (c), the "#D" columns show the number of face detections belonging to the given identity (i.e., class) and the "%" columns the percentage of total detections. Note that there is no correlation of identities between test videos.


Identity       wawo   openbr   opencv1   opencv2   opencv3

0              0.58   0.68     0.48      0.50      0.59
1              0.65   0.52     0.66      0.46      0.67
2              0.80   0.68     0.76      0.53      0.81
None/Unknown   0.60   0.59     0.60      0.47      0.65

Mean           0.65   0.62     0.61      0.49      0.67

(a) codemill

Identity       wawo   openbr         opencv1   opencv2   opencv3

0              –      0.71 / 0.76*   0.50      0.54      0.50
1              –      0.77 / 0.87*   0.62      0.49      0.48
2              –      0.82 / 0.85*   0.50      0.51      0.50
3              –      0.50           0.57      0.50      0.67
4              –      0.50           0.68      0.50      0.59
5              –      0.50           0.50      0.50      0.50
6              –      0.50           0.50      0.50      0.50
None/Unknown   –      0.75 / 0.91*   0.36      0.91      0.81

Mean           –      0.64 / 0.88*   0.54      0.52      0.53

(b) SA_SKA_DET_LATA-long

Identity       wawo   openbr         opencv1   opencv2   opencv3

0              0.95   0.95 / 0.95*   0.50      0.50      0.50
1              0.83   0.88 / 0.94*   0.80      0.62      0.82
2              0.86   0.90 / 0.91*   0.87      0.38      0.50
3              0.88   0.50           0.50      0.50      0.50
6              0.50   0.50           0.50      0.50      0.50
None/Unknown   0.91   0.88 / 0.92*   0.85      0.89      0.42

Mean           0.85   0.82 / 0.94*   0.70      0.57      0.71

(c) SA_SKA_DET_LATA-short

Table A.2: Detailed auc results (each cell gives the auc for the given system and identity). Note that there is no correlation of identities between the codemill and SA_SKA_DET_LATA test videos.
*Disregarding identities for which all face images failed.


[Figure A.1: four roc plots (TPR vs. FPR), one curve per system (wawo, openbr, opencv1, opencv2, opencv3) in each panel: (a) Identity 0 vs. all other, (b) Identity 1 vs. all other, (c) Identity 2 vs. all other, (d) None/Unknown identity vs. all other.]

Figure A.1: roc plots for codemill, comparing each identity to all others.


[Figure A.2: four roc plots (TPR vs. FPR), one curve per system (openbr, opencv1, opencv2, opencv3) in each panel: (a) Identity 0 vs. all other, (b) Identity 1 vs. all other, (c) Identity 2 vs. all other, (d) Identity 3 vs. all other.]

Figure A.2: roc plots for SA_SKA_DET_LATA-long (part 1), comparing each identity to all others.


[Figure A.3: four roc plots (TPR vs. FPR), one curve per system (openbr, opencv1, opencv2, opencv3) in each panel: (a) Identity 4 vs. all other, (b) Identity 5 vs. all other, (c) Identity 6 vs. all other, (d) None/Unknown identity vs. all other.]

Figure A.3: roc plots for SA_SKA_DET_LATA-long (part 2), comparing each identity to all others.


[Figure A.4: four roc plots (TPR vs. FPR), one curve per system (wawo, openbr, opencv1, opencv2, opencv3) in each panel: (a) Identity 0 vs. all other, (b) Identity 1 vs. all other, (c) Identity 2 vs. all other, (d) Identity 3 vs. all other.]

Figure A.4: roc plots for SA_SKA_DET_LATA-short (part 1), comparing each identity to all others.


[Figure A.5: two roc plots (TPR vs. FPR), one curve per system (wawo, openbr, opencv1, opencv2, opencv3) in each panel: (a) Identity 6 vs. all other, (b) None/Unknown identity vs. all other.]

Figure A.5: roc plots for SA_SKA_DET_LATA-short (part 2), comparing each identity to all others.
