Neural networks: - convolutional - autoencoder. Hugo Larochelle, [email protected], IFT615 - Artificial Intelligence, Université de Sherbrooke


Page 1

Neural networks: - convolutional - autoencoder

Hugo Larochelle, [email protected], IFT615 - Artificial Intelligence, Université de Sherbrooke

Page 2

Artificial neural network (classification)

[Text excerpt shown on the slide:]

the softmax function:

softmax(a)_i = e^{a_i} / Σ_{j=1}^{C} e^{a_j}

where C is the number of classes. Using this function, the vector output of the model f_θ(x) can be interpreted as an estimate of the conditional distribution of the class y to assign to the input x. In the case of a binary classification problem (Y = {0, 1}), given the constraint that the model's output sums to 1, we instead assume that the model estimates the probability p(y = 1 | x) that the input belongs to class 1, by defining

f_θ(x) = sigmoid(w^T x + b)

where we now only need a single bias b rather than a vector of biases, and a single weight vector w rather than a matrix W. In this case, the classifier uses the sigmoid function:

sigmoid(a) = 1 / (1 + e^{−a}).

To train this model, the following loss is used:

L(f_θ(x), y) = − log f_θ(x)_y

or, in the binary case,

L(f_θ(x), y) = − y log f_θ(x) − (1 − y) log(1 − f_θ(x)),

that is, the negative conditional log-likelihood of assigning the input x to the class y under the conditional distribution f_θ(x). As described in Section 1.2.1, this model can then be trained by stochastic gradient descent². Training thus aims to maximize the conditional likelihood, as given by the model, of the classifications observed in the training set. It is interesting [...]

² Many other approaches have also been developed for training a logistic regression classifier. Stochastic gradient descent is considered here for its simplicity and because it generally behaves well on large training sets.
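A small numerical illustration of these definitions (a sketch; the toy values, shapes and class index below are made up, not taken from the slide):

```python
import numpy as np

def softmax(a):
    # subtract the max for numerical stability; does not change the result
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# multiclass case: L(f(x), y) = -log f(x)_y
a = np.array([1.0, 2.0, 0.5])       # pre-activations for C = 3 classes (toy values)
f = softmax(a)
y = 1                               # assumed target class
loss_multiclass = -np.log(f[y])

# binary case: L(f(x), y) = -y log f(x) - (1 - y) log(1 - f(x))
w = np.array([0.5, -1.0]); b = 0.1  # toy weight vector and bias
x = np.array([2.0, 1.0])
f_bin = sigmoid(w @ x + b)
y_bin = 1
loss_binary = -y_bin * np.log(f_bin) - (1 - y_bin) * np.log(1 - f_bin)

print(loss_multiclass, loss_binary)
```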


L(f_θ(x), y) = − log p(y | x)

[Figure 1.7 – Illustration of an artificial neural network: the input x is mapped to a hidden layer h(Vx + c) (weights V, biases c), which is mapped to the output softmax(W h(Vx + c) + b) (weights W, biases b).]

cannot be modeled by a linear classifier. However, the functions OR and AND

OR(x1, x2) =
  0 if x1 = 0 and x2 = 0
  1 if x1 = 1 and x2 = 0
  1 if x1 = 0 and x2 = 1
  1 if x1 = 1 and x2 = 1

AND(x1, x2) =
  0 if x1 = 0 and x2 = 0
  0 if x1 = 1 and x2 = 0
  0 if x1 = 0 and x2 = 1
  1 if x1 = 1 and x2 = 1

can be, as can any variant in which the negation of some of the input components is applied beforehand (see Figure 1.8). Noting that

XOR(x1, x2) = OR(AND(x1, ¬x2), AND(¬x1, x2))

it is therefore possible to model the XOR function with a neural network whose two hidden neurons model the functions AND(x1, ¬x2) and AND(¬x1, x2) respectively, and whose output weights are those of a logistic regression classifier modeling the function OR. Thus, by combining several simple functions (OR and AND in this example), a neural network manages to model a function of greater complexity (XOR). This view of the complexity of computing the output of a neural network also serves as inspiration for the work in this thesis.
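The construction described above can be written out directly. The following sketch hand-picks weights (illustrative values, not from the excerpt) so that two sigmoid hidden units approximate AND(x1, ¬x2) and AND(¬x1, x2), and a sigmoid output unit approximates their OR:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hidden layer: two neurons approximating AND(x1, not x2) and AND(not x1, x2).
V = np.array([[ 20.0, -20.0],    # weights of AND(x1, not x2)
              [-20.0,  20.0]])   # weights of AND(not x1, x2)
c = np.array([-10.0, -10.0])

# Output neuron: a logistic-regression unit approximating the OR of the two hidden units.
w = np.array([20.0, 20.0])
b = -10.0

def xor_net(x):
    h = sigmoid(V @ x + c)        # hidden layer
    return sigmoid(w @ h + b)     # output: close to XOR(x1, x2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(float(xor_net(np.array(x, dtype=float))), 3))
# prints values close to 0, 1, 1, 0
```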

That said, this last example is simplistic compared to the problems one typically wants to solve. Among other things, some of the neural networks that [...]

x (input)

h(x) = sigmoid(Wx + b),   with sigmoid(a)_i = 1 / (1 + e^{−a_i})   (hidden layer, weights W)

f(x) = softmax(U h(x) + c)   (output layer, weights U)
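A minimal forward-pass sketch of this one-hidden-layer classifier (the sizes and random weights are made up for the example):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))    # max subtraction for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_hidden, n_classes = 4, 5, 3           # assumed sizes for the example

W = rng.normal(size=(n_hidden, d))         # hidden-layer weights
b = np.zeros(n_hidden)                     # hidden-layer biases
U = rng.normal(size=(n_classes, n_hidden)) # output-layer weights
c = np.zeros(n_classes)                    # output-layer biases

x = rng.normal(size=d)                     # a toy input
h = sigmoid(W @ x + b)                     # hidden representation h(x)
f = softmax(U @ h + c)                     # estimated p(y | x), sums to 1
print(f, f.sum())
```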

Page 3

Training

• Learning is defined as an optimization problem:

  argmin_Θ (1/n) Σ_{(x_t, y_t)} L(f_Θ(x_t), y_t)

• Stochastic gradient descent (see the sketch after this list):

  1. Pick an example (x_t, y_t) ∈ D
  2. Compute L(f_Θ(x_t), y_t) (forward propagation)
  3. Decrease it with a gradient step (backpropagation):
     θ_i ← θ_i − ε ∂L(f_Θ(x_t), y_t)/∂θ_i,   ∀ θ_i ∈ Θ
  4. Return to 1.
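A schematic version of this loop (a sketch: to stay self-contained it trains a plain logistic-regression classifier, whose gradient has a simple closed form; the dataset, step size and number of iterations are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
# toy binary dataset D = {(x_t, y_t)} (made up for the example)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w, b = np.zeros(3), 0.0            # parameters Theta
eps = 0.1                          # gradient step size (learning rate)

for step in range(1000):
    t = rng.integers(len(X))                               # 1. pick an example (x_t, y_t)
    x_t, y_t = X[t], y[t]
    f = np.clip(sigmoid(w @ x_t + b), 1e-12, 1 - 1e-12)    # 2. forward propagation
    loss = -y_t * np.log(f) - (1 - y_t) * np.log(1 - f)
    grad_a = f - y_t                                       # 3. gradient step (backpropagation);
    w -= eps * grad_a * x_t                                #    dL/da = f - y for this loss
    b -= eps * grad_a
                                                           # 4. return to 1. (next iteration)
print("loss on the last example:", float(loss))
```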

Page 4

Which architecture to use?

• It depends on the application!

  - are there execution-time constraints (real-time application)?

  - is there known structure in the data that can be exploited?

Page 5

Application: computer vision

• The brain as inspiration:

...But the Mammalian Visual Cortex is Hierarchical. Why?

The ventral (recognition) pathway in the visual cortex has multiple stages: Retina - LGN - V1 - V2 - V4 - PIT - AIT ...

[picture from Simon Thorpe]

Page 6

Convolutional network

• Dates back many years:

  - Neocognitron (Fukushima, 1980)

  - LeNet (LeCun, 1989)

• Implements 3 ideas (see the sketch after this list):

  - sparse connectivity

  - local connectivity

  - parameter sharing
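A rough illustration of what local connectivity and parameter sharing buy (a sketch with assumed sizes; the numbers are not from the slides):

```python
# Assumed example: a 256x256 grayscale image mapped to one 256x256 feature map.
in_h = in_w = 256
out_h = out_w = 256
kernel = 9                     # assumed 9x9 local receptive field

fully_connected = (in_h * in_w) * (out_h * out_w)    # one weight per (input, output) pair
local_only      = (out_h * out_w) * kernel * kernel  # local connectivity, no sharing
shared_conv     = kernel * kernel                    # local connectivity + parameter sharing

print(fully_connected)   # 4294967296
print(local_only)        # 5308416
print(shared_conv)       # 81
```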

Page 7

Convolutional network (two hidden layers)

An Old Idea for Image Representation with Distortion Invariance

[Hubel & Wiesel 1962]: simple cells detect local features; complex cells "pool" the outputs of simple cells within a retinotopic neighborhood.

[Figure: image (channels) → convolution layer ("simple cells": multiple convolutions, retinotopic feature maps) → "pooling" layer ("complex cells": pooling, subsampling)]

Page 8

Hubel & Wiesel video

http://www.youtube.com/watch?v=8VdFf3egwfg&feature=related

Page 9

Convolutional network (X hidden layers)

Example of Architecture: Convolutional Network (ConvNet)

[Figure: input 83x83 → Layer 1: 64@75x75 (9x9 convolution, 64 kernels) → Layer 2: 64@14x14 (10x10 pooling, 5x5 subsampling) → Layer 3: 256@6x6 (9x9 convolution, 4096 kernels) → Layer 4: 256@1x1 (6x6 pooling, 4x4 subsampling) → Output: 101. The map sizes can be checked with the arithmetic sketched below.]

Non-linearity: tanh, absolute value, shrinkage function, local whitening, ...

Pooling: average, max, Lp norm, ...
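The spatial sizes in this architecture follow from simple arithmetic (a sketch; it assumes "valid" convolutions and pooling windows applied with the stated subsampling stride):

```python
def conv_size(n, k):
    # "valid" convolution: output size = n - k + 1
    return n - k + 1

def pool_size(n, pool, stride):
    # pooling window `pool` applied every `stride` pixels
    return (n - pool) // stride + 1

n = 83                                        # input 83x83
n = conv_size(n, 9);    print("Layer 1:", n)  # 75  (9x9 convolution)
n = pool_size(n, 10, 5); print("Layer 2:", n) # 14  (10x10 pooling, 5x5 subsampling)
n = conv_size(n, 9);    print("Layer 3:", n)  # 6   (9x9 convolution)
n = pool_size(n, 6, 4); print("Layer 4:", n)  # 1   (6x6 pooling, 4x4 subsampling)
```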

Page 10

Convolutional network (convolution layer)

tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})

H_{i,j,k} = tanh( b_k + Σ_{s=1}^{K} Σ_{t=1}^{K} w_{s,t,k} X_{i−1+s, j−1+t} )

where X is the input, H are the hidden feature maps, w_{·,·,k} is the kernel (filter) of the k-th map, and b_k its bias.

Page 11

Convolutional network (convolution layer)

[Background of the slide: an excerpt from Jarrett et al., "What is the Best Multi-Stage Architecture for Object Recognition?" (ICCV 2009), describing the filter bank layer F_CSG, which computes y_j = g_j tanh(Σ_i k_ij ∗ x_i) with ∗ the 2D discrete convolution operator and g_j a trainable gain, the rectification layer R_abs, the local contrast normalization layer N, and the average/max pooling and subsampling layers P_A / P_M, along with how these modules are combined into one- or two-stage feature extraction hierarchies.]

tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})

H_{·,·,k} = tanh( b_k + w_{·,·,k} ∗ X )

where ∗ is the 2D discrete convolution operator, X is the input, H are the hidden feature maps, and w_{·,·,k} is the kernel (filter) of the k-th map.

Page 12

Convolutional network (convolution operation)

• Example with a single input channel and a single hidden-layer feature map (see the check below):

X (5x5 input):
  0 0 1 0 0
  0 0 1 0 0
  0 0 1 0 0
  0 1 0 0 0
  1 0 0 0 0

w (2x2 kernel):
  0   0.5
  0.5 0

H = w ∗ X (4x4 feature map):
  0 0.5 0.5 0
  0 0.5 0.5 0
  0 1   0   0
  1 0   0   0
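The matrices above can be checked numerically. Note that, as in the formula on the previous pages, the kernel is applied without flipping (i.e. cross-correlation), which corresponds to scipy's correlate2d:

```python
import numpy as np
from scipy.signal import correlate2d

X = np.array([[0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0]], dtype=float)

w = np.array([[0.0, 0.5],
              [0.5, 0.0]])

# H[i, j] = sum_{s,t} w[s, t] * X[i + s, j + t]  ("valid" positions only)
H = correlate2d(X, w, mode="valid")
print(H)
# [[0.  0.5 0.5 0. ]
#  [0.  0.5 0.5 0. ]
#  [0.  1.  0.  0. ]
#  [1.  0.  0.  0. ]]
```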

Page 13

Convolutional network (3D convolution layer)


tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})

H_{i,j,k} = tanh( b_k + Σ_{s=1}^{K} Σ_{t=1}^{K} Σ_{c=1}^{C} w_{s,t,c,k} X_{i−1+s, j−1+t, c} )

where C is the number of input channels (maps), X is the input, H are the hidden feature maps, and w_{·,·,·,k} is the 3D kernel (filter) of the k-th map.
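A direct, loop-based transcription of this formula (a sketch; the sizes and random inputs are made up, and no padding is used):

```python
import numpy as np

def conv_layer(X, w, b):
    """X: (height, width, C) input; w: (K, K, C, n_maps) kernels; b: (n_maps,) biases.
    Returns H with H[i, j, k] = tanh(b[k] + sum_{s,t,c} w[s, t, c, k] * X[i+s, j+t, c])."""
    K = w.shape[0]
    out_h, out_w = X.shape[0] - K + 1, X.shape[1] - K + 1
    H = np.zeros((out_h, out_w, w.shape[3]))
    for k in range(w.shape[3]):
        for i in range(out_h):
            for j in range(out_w):
                H[i, j, k] = np.tanh(b[k] + np.sum(w[:, :, :, k] * X[i:i+K, j:j+K, :]))
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8, 3))        # toy 8x8 input with C = 3 channels
w = rng.normal(size=(2, 2, 3, 4))     # K = 2, 4 feature maps
b = np.zeros(4)
print(conv_layer(X, w, b).shape)      # (7, 7, 4)
```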

Page 14

Convolutional network ("pooling" layer)

• Two steps (see the formula and sketch below):

  1. "max pooling"

  2. "downsampling": keep only a fraction of the neurons


P_{i,j,k} = max_{(i', j') ∈ N(i, j)} H_{i', j', k}

where N(i, j) is a neighborhood around position (i, j).
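A loop-based sketch combining the two steps, max pooling over a local neighborhood followed by keeping only one output per stride (the sizes are assumed for the example):

```python
import numpy as np

def max_pool(H, pool, stride):
    """Max pooling followed by downsampling:
    P[i, j, k] = max of H over the pool x pool window starting at (i*stride, j*stride)."""
    out_h = (H.shape[0] - pool) // stride + 1
    out_w = (H.shape[1] - pool) // stride + 1
    P = np.zeros((out_h, out_w, H.shape[2]))
    for k in range(H.shape[2]):
        for i in range(out_h):
            for j in range(out_w):
                window = H[i*stride:i*stride+pool, j*stride:j*stride+pool, k]
                P[i, j, k] = window.max()
    return P

rng = np.random.default_rng(0)
H = rng.normal(size=(14, 14, 64))           # toy feature maps (assumed sizes)
print(max_pool(H, pool=2, stride=2).shape)  # (7, 7, 64)
```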

Page 15

Convolutional network (classification)

[Background of the slide: the architecture figure and training description from "Deep Learning from Temporal Coherence in Video" (Mobahi, Collobert & Weston, ICML 2009). Figure 1: a Convolutional Neural Network (CNN) performs a series of convolutions and subsamplings given the raw input image until it finally outputs a vector of predicted class labels. Architecture: input image 72x72 → 3x3 convolution C1 (70x70) → 2x2 subsampling S2 (35x35) → 4x4 convolution C3 (32x32) → 2x2 subsampling S4 (16x16) → 5x5 convolution C5 (12x12) → 2x2 subsampling S6 (6x6) → 6x6 convolution C7 (1x1) → fully connected F8, one output per class. The network is trained by stochastic gradient descent on the negative conditional log-likelihood.]


• Stochastic gradient descent:

  1. Pick an example (x_t, y_t) ∈ D
  2. Compute L(f_Θ(x_t), y_t)
  3. Decrease it with a gradient step:
     θ_i ← θ_i − ε ∂L(f_Θ(x_t), y_t)/∂θ_i,   ∀ θ_i ∈ Θ
  4. Return to 1.

• The output layer is a "softmax".

Page 16

Convolutional network

• Several other useful operations exist (see the sketch below):

  - Rectification

  - Contrast normalization


Rectification: computing the absolute value.

Contrast normalization: subtracting a local mean, then dividing by a local standard deviation.
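A rough sketch of these two operations on a single feature map (it uses a uniform local window and a small constant for numerical stability; these are simplifying choices, not the exact normalization of the paper excerpted earlier, and the sizes and values are made up):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def rectify(H):
    # rectification: take the absolute value of every component
    return np.abs(H)

def local_contrast_normalize(H, size=9, eps=1e-6):
    # subtract a local mean, then divide by a local standard deviation
    local_mean = uniform_filter(H, size=size)
    v = H - local_mean
    local_var = uniform_filter(v ** 2, size=size)
    return v / np.maximum(np.sqrt(local_var), eps)

rng = np.random.default_rng(0)
H = rng.normal(size=(32, 32))        # a toy feature map
out = local_contrast_normalize(rectify(H))
print(out.shape, float(out.mean()), float(out.std()))
```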

Page 17

Application: autonomous robot driving

http://www.cs.nyu.edu/~yann/research/dave/index.html

Page 18

Application: autonomous robot driving

http://www.cs.nyu.edu/~yann/research/dave/index.html

Page 19

Other applications in vision

• Face detection

• Google Streetview: face and license plate detection for anonymization

[Figure from "Synergistic Face Detection and Pose Estimation with Energy-Based Models". Figure 9: Some example face detections. Each white box shows the location of a detected face. The angle of each box indicates the estimated in-plane rotation. The black crosshairs within each box indicate the estimated yaw.]