Neural networks: convolutional, autoencoder
Hugo Larochelle (hugo.larochelle@usherbrooke.ca), IFT615 - Intelligence Artificielle, Université de Sherbrooke
Artificial neural network (classification)
the softmax function:

softmax(a)_i = e^{a_i} / Σ_{j=1}^{C} e^{a_j}

where C is the number of classes. With this function, the vector output of the model f_θ(x) can be interpreted as an estimate of the conditional distribution of the class y to assign to the input x. For a binary classification problem (Y = {0, 1}), given that the model's output must sum to 1, we instead assume that the model estimates the probability p(y = 1|x) that the input belongs to class 1, by defining

f_θ(x) = sigmoid(w^T x + b)

where only a single bias b is now needed rather than a vector, and a single weight vector w rather than a matrix W. In this case, the classifier uses the sigmoid function:

sigmoid(a) = 1 / (1 + e^{−a}).

To train this model, we use the loss associated with the following risk:

L(f_θ(x), y) = − log f_θ(x)_y

or, in the binary case,

L(f_θ(x), y) = − y log f_θ(x) − (1 − y) log(1 − f_θ(x)),

that is, the negative conditional log-likelihood of assigning the input x to class y under the conditional distribution f_θ(x). As described in Section 1.2.1, this model can then be trained by stochastic gradient descent². Training thus aims to maximize the conditional likelihood, as given by the model, of the classifications observed in the training set.

² Many other approaches have also been developed for training a logistic regression classifier. Stochastic gradient descent is considered here for its simplicity and because it generally behaves well on large training sets.
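To make the formulas above concrete, here is a minimal NumPy sketch (not from the original text) of the softmax, the sigmoid, and the two negative log-likelihood losses; all function names and numeric values are illustrative only.

```python
import numpy as np

def softmax(a):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_multiclass(probs, y):
    # L(f(x), y) = -log f(x)_y
    return -np.log(probs[y])

def nll_binary(p1, y):
    # L(f(x), y) = -y log f(x) - (1 - y) log(1 - f(x))
    return -y * np.log(p1) - (1 - y) * np.log(1 - p1)

# Multiclass example: 3 classes, made-up linear scores a.
a = np.array([1.2, -0.3, 0.4])
probs = softmax(a)                 # estimate of p(y | x)
print(probs, nll_multiclass(probs, 0))

# Binary example: f(x) = sigmoid(w^T x + b), made-up parameters.
w, b = np.array([0.5, -1.0]), 0.1
x, y = np.array([2.0, 1.0]), 1
p1 = sigmoid(w @ x + b)            # estimate of p(y = 1 | x)
print(p1, nll_binary(p1, y))
```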
L(f_Θ(x), y) = − log p(y | x)
Figure 1.7 – Illustration of an artificial neural network: the input x (input layer) feeds a hidden layer h(Vx + c) with parameters V and c, which feeds an output layer softmax(W h(Vx + c) + b) with parameters W and b.
The XOR function cannot be modeled by a linear classifier. However, the OR and AND functions

OR(x1, x2) =
  0 if x1 = 0 and x2 = 0
  1 if x1 = 1 and x2 = 0
  1 if x1 = 0 and x2 = 1
  1 if x1 = 1 and x2 = 1

AND(x1, x2) =
  0 if x1 = 0 and x2 = 0
  0 if x1 = 1 and x2 = 0
  0 if x1 = 0 and x2 = 1
  1 if x1 = 1 and x2 = 1

can be, as can any variant in which some of the input components are first negated (see Figure 1.8). Noting that

XOR(x1, x2) = OR(AND(x1, ¬x2), AND(¬x1, x2)),

it is therefore possible to model the XOR function with a neural network whose two hidden neurons model the functions AND(x1, ¬x2) and AND(¬x1, x2) respectively, and whose output weights are those of a logistic regression classifier modeling the function OR(x1, x2). Thus, by combining several simple functions (OR and AND in this example), a neural network manages to model a function of greater complexity (XOR). This view of the complexity of computing a neural network's output will also serve as inspiration for the work in this thesis.
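As a concrete illustration of this construction (not part of the original text), the sketch below hard-codes one possible choice of weights: two sigmoid hidden units approximating AND(x1, ¬x2) and AND(¬x1, x2), and an OR-like logistic output unit. The specific weight values are illustrative, not the only ones that work.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hidden layer: unit 1 ~ AND(x1, not x2), unit 2 ~ AND(not x1, x2).
V = np.array([[ 20.0, -20.0],
              [-20.0,  20.0]])
c = np.array([-10.0, -10.0])

# Output unit ~ OR(h1, h2), i.e. a logistic regression on the hidden units.
w = np.array([20.0, 20.0])
b = -10.0

def xor_net(x):
    h = sigmoid(V @ x + c)
    return sigmoid(w @ h + b)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(float(xor_net(np.array(x, dtype=float)))))
# Prints 0, 1, 1, 0: the XOR function.
```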
That said, this last example is simplistic compared to the problems one typically wishes to solve. Among other things, some of the neural networks that
Input x, hidden-layer weights W, output-layer weights U:
h(x) = sigmoid(Wx + b)
sigmoid(a)_i = 1 / (1 + e^{−a_i})
f(x) = softmax(U h(x) + c)
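A minimal sketch of this single-hidden-layer forward pass (the names W, b, U, c follow the slide; the layer sizes and random initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Illustrative sizes: 4 inputs, 5 hidden units, 3 classes.
W, b = rng.normal(size=(5, 4)), np.zeros(5)   # hidden-layer parameters
U, c = rng.normal(size=(3, 5)), np.zeros(3)   # output-layer parameters

def forward(x):
    h = sigmoid(W @ x + b)        # h(x) = sigmoid(Wx + b)
    return softmax(U @ h + c)     # f(x) = softmax(U h(x) + c)

x = rng.normal(size=4)
print(forward(x))                 # a distribution over the 3 classes
```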
Training
• Learning is framed as an optimization problem:
  argmin_Θ (1/n) Σ_{(x_t, y_t)} L(f_Θ(x_t), y_t)
• Stochastic gradient descent (a minimal sketch follows below):
  1. Pick an example (x_t, y_t) ∈ D
  2. Compute L(f_Θ(x_t), y_t) (forward propagation)
  3. Decrease it with a gradient step: θ_i ← θ_i − ε ∂L(f_Θ(x_t), y_t)/∂θ_i, ∀θ_i ∈ Θ (backpropagation)
  4. Return to 1.
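The sketch below runs this loop for the logistic-regression case, where the gradient has a simple closed form; for a deeper network the gradient would come from backpropagation. The toy data and step size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy binary dataset D = {(x_t, y_t)} (made up for illustration).
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
eps = 0.1                                 # step size (epsilon in the slide)

for step in range(1000):
    t = rng.integers(len(X))              # 1. pick an example (x_t, y_t)
    p = sigmoid(w @ X[t] + b)             # 2. forward pass: f_Theta(x_t)
    # 3. gradient step on L = -y log p - (1 - y) log(1 - p);
    #    for the sigmoid/cross-entropy pair, dL/d(wx + b) = p - y.
    g = p - y[t]
    w -= eps * g * X[t]
    b -= eps * g
    # 4. return to 1.

print(w, b)
```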
Which architecture should we use?
• It depends on the application!
  - Are there execution-time constraints (real-time application)?
  - Is there known structure in the data that can be exploited?
Application: computer vision
• The brain as inspiration:
[Reproduced slide: "...But the Mammalian Visual Cortex is Hierarchical. Why?" The ventral (recognition) pathway in the visual cortex has multiple stages: Retina - LGN - V1 - V2 - V4 - PIT - AIT ... (picture from Simon Thorpe)]
Convolutional network
• Dates back many years already:
  - Neocognitron (Fukushima, 1980)
  - LeNet (LeCun, 1989)
• Implements three ideas:
  - sparse connectivity
  - local connectivity
  - parameter sharing
Convolutional network (two hidden layers)
[Reproduced slide: "An Old Idea for Image Representation with Distortion Invariance" (Hubel & Wiesel, 1962): simple cells detect local features; complex cells "pool" the outputs of simple cells within a retinotopic neighborhood. Labels: pooling, subsampling, "simple cells", "complex cells", multiple convolutions, retinotopic feature maps.]
Convolution layer, «pooling» layer, image (channels).
Hubel & Wiesel video: http://www.youtube.com/watch?v=8VdFf3egwfg&feature=related
Convolutional network (X hidden layers)
[Reproduced slide: "Example of Architecture: Convolutional Network (ConvNet)". Input 83x83 → 9x9 convolution (64 kernels) → Layer 1: 64x75x75 → 10x10 pooling, 5x5 subsampling → Layer 2: 64@14x14 → 9x9 convolution (4096 kernels) → Layer 3: 256@6x6 → 6x6 pooling, 4x4 subsampling → Layer 4: 256@1x1 → Output: 101.
Non-linearity: tanh, absolute value, shrinkage function, local whitening, ...
Pooling: average, max, Lp norm, ...]
Convolutional network (convolution layer)
tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})
H_{i,j,k} = tanh( b_k + Σ_{s=1}^{K} Σ_{t=1}^{K} w_{s,t,k} X_{i−1+s, j−1+t} )
Feature maps H, input X, kernel («noyau») / filter w.
Convolutional network (convolution layer)
(Excerpt from the paper page reproduced on the slide:)

Filter Bank Layer - F_CSG: the input of a filter bank layer is a 3D array with n1 2D feature maps of size n2 x n3. Each component is denoted x_ijk, and each feature map is denoted x_i. The output is also a 3D array, y, composed of m1 feature maps of size m2 x m3. A filter in the filter bank k_ij has size l1 x l2 and connects input feature map x_i to output feature map y_j. The module computes:

y_j = g_j tanh(Σ_i k_ij ∗ x_i)   (1)

where tanh is the hyperbolic tangent non-linearity, ∗ is the 2D discrete convolution operator and g_j is a trainable scalar coefficient. Taking the border effects into account, we have m2 = n2 − l1 + 1 and m3 = n3 − l2 + 1. This layer is denoted F_CSG because it is composed of a set of convolution filters (C), a sigmoid/tanh non-linearity (S), and gain coefficients (G). In the following, superscripts are used to denote the size of the filters. For instance, a filter bank layer with 64 filters of size 9x9 is denoted 64F_CSG^9x9.

Rectification Layer - R_abs: this module simply applies the absolute value function to all the components of its input: y_ijk = |x_ijk|. Several rectifying non-linearities were tried, including the positive part, and produced similar results.

Local Contrast Normalization Layer - N: this module performs local subtractive and divisive normalizations, enforcing a sort of local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps. The subtractive normalization operation for a given site x_ijk computes v_ijk = x_ijk − Σ_ipq w_pq · x_{i,j+p,k+q}, where w_pq is a Gaussian weighting window (of size 9x9 in our experiments) normalized so that Σ_ipq w_pq = 1. The divisive normalization computes y_ijk = v_ijk / max(c, σ_jk), where σ_jk = (Σ_ipq w_pq · v_{i,j+p,k+q}^2)^{1/2}. For each sample, the constant c is set to mean(σ_jk) in the experiments. The denominator is the weighted standard deviation of all features over a spatial neighborhood. The local contrast normalization layer is inspired by computational neuroscience models [24, 20].

Average Pooling and Subsampling Layer - P_A: the purpose of this layer is to build robustness to small distortions, playing the same role as the complex cells in models of visual perception. Each output value is y_ijk = Σ_pq w_pq · x_{i,j+p,k+q}, where w_pq is a uniform weighting window ("boxcar filter"). Each output feature map is then subsampled spatially by a factor S horizontally and vertically. In this work, we do not consider pooling over feature types, but only over the spatial dimensions. Therefore, the numbers of input and output feature maps are identical, while the spatial resolution is decreased. Disregarding the border effects in the boxcar averaging, the spatial resolution is decreased by the down-sampling ratio S in both directions, denoted by a superscript, so that an average pooling layer with 4x4 down-sampling is denoted P_A^4x4.

Figure 1. An example of a feature extraction stage of the type F_CSG - R_abs - N - P_A. An input image (or a feature map) is passed through a non-linear filter bank, followed by rectification, local contrast normalization and spatial pooling/sub-sampling.

Max-Pooling and Subsampling Layer - P_M: building local invariance to shift can be performed with any symmetric pooling operation. The max-pooling module is similar to the average pooling, except that the average operation is replaced by a max operation. In our experiments, the pooling windows were non-overlapping. A max-pooling layer with 4x4 down-sampling is denoted P_M^4x4.

2.1. Combining Modules into a Hierarchy

Different architectures can be produced by cascading the above-mentioned modules in various ways. An architecture is composed of one or two stages of feature extraction, each of which is formed by cascading a filtering layer with different combinations of rectification, normalization, and pooling. Recognition architectures are composed of one or two such stages, followed by a classifier, generally a multinomial logistic regression.

F_CSG - P_A: this is the basic building block of traditional convolutional networks, alternating tanh-squashed filter banks with average down-sampling layers [14, 10]. A complete convolutional network would have several sequences of "F_CSG - P_A" followed by a linear classifier.

F_CSG - R_abs - P_A: the tanh-squashed filter bank is followed by an absolute value non-linearity and by an average down-sampling layer.

F_CSG - R_abs - N - P_A: the tanh-squashed filter bank is followed by an absolute value non-linearity, by a local contrast normalization layer and by an average down-sampling layer.

F_CSG - P_M: this is also a typical building block of convolutional networks, as well as the basis of the HMAX and other architectures [28, 25], which alternate tanh-squashed filter banks with max-pooling layers.

3. Training Protocol

Given a particular architecture, a number of training protocols have been considered and tested. Each protocol is identified by a letter R, U, R+, or U+. A single letter (e.g. R) indicates an architecture with a single stage of feature extraction followed by a classifier, while a double letter (e.g. RR) indicates an architecture with two stages of feature extraction followed by a classifier.

Random Features and Supervised Classifier - R and RR: the filters in the feature extraction stages are set to random values and kept fixed (no feature learning takes place), and the classifier stage is trained in supervised mode.
tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})
Feature maps H, input X, kernel («noyau») / filter w.
H_{·,·,k} = tanh(b_k + w_{·,·,k} ∗ X)
Convolutional network (convolution operation)
X (input, single channel):
0 0 1 0 0
0 0 1 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0

w (kernel):
0    0.5
0.5  0

H = w ∗ X:
0  0.5  0.5  0
0  0.5  0.5  0
0  1    0    0
1  0    0    0

• Example with a single input channel and a single feature map in the hidden layer (a numerical sketch follows below).
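The sketch below (not from the slides) reproduces this example numerically, using the indexing H_{i,j} = Σ_{s,t} w_{s,t} X_{i−1+s, j−1+t} of the convolution-layer formula above, i.e. a "valid" cross-correlation; the helper name conv2d_valid is illustrative.

```python
import numpy as np

X = np.array([[0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0]], dtype=float)

w = np.array([[0.0, 0.5],
              [0.5, 0.0]])

def conv2d_valid(X, w):
    # Slide the K x K kernel over every position where it fully overlaps X.
    K = w.shape[0]
    out = np.zeros((X.shape[0] - K + 1, X.shape[1] - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * X[i:i + K, j:j + K])
    return out

H = conv2d_valid(X, w)
print(H)
# [[0.  0.5 0.5 0. ]
#  [0.  0.5 0.5 0. ]
#  [0.  1.  0.  0. ]
#  [1.  0.  0.  0. ]]
```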
Convolutional network (3D convolution layer)
tanh(a) = (e^a − e^{−a}) / (e^a + e^{−a})
Feature maps H, input X (C channels), kernel («noyau») / filter w.
H_{i,j,k} = tanh( b_k + Σ_{s=1}^{K} Σ_{t=1}^{K} Σ_{c=1}^{C} w_{s,t,c,k} X_{i−1+s, j−1+t, c} )
(a sketch of this multi-channel case follows below)
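A minimal sketch of this multi-channel convolution layer: each of the n_k kernels spans all C input channels, and the bias b_k and tanh non-linearity are applied per feature map. The array shapes and the helper name conv_layer are illustrative assumptions.

```python
import numpy as np

def conv_layer(X, w, b):
    """X: (H, W, C) input; w: (K, K, C, nk) kernels; b: (nk,) biases.
    Returns an array of shape (H-K+1, W-K+1, nk), one feature map per kernel."""
    K, _, C, nk = w.shape
    out_h, out_w = X.shape[0] - K + 1, X.shape[1] - K + 1
    H = np.zeros((out_h, out_w, nk))
    for k in range(nk):
        for i in range(out_h):
            for j in range(out_w):
                # Sum over the K x K window and all C channels, plus the bias.
                H[i, j, k] = b[k] + np.sum(w[:, :, :, k] * X[i:i + K, j:j + K, :])
    return np.tanh(H)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8, 3))            # e.g. an 8x8 patch with 3 channels
w = rng.normal(size=(3, 3, 3, 4)) * 0.1   # four 3x3 kernels over 3 channels
b = np.zeros(4)
print(conv_layer(X, w, b).shape)          # (6, 6, 4)
```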
Convolutional network («pooling» layer)
• Two steps:
  1. «max pooling»: take the maximum over a neighborhood
  2. «downsampling»: keep only a fraction of the neurons
P_{i,j,k} = max_{(i′, j′) ∈ N(i,j)} H_{i′, j′, k}
where N(i, j) is a neighborhood around (i, j) (a sketch follows below).
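A minimal sketch of non-overlapping max pooling with downsampling over each feature map; the pooling factor p and the helper name max_pool are illustrative.

```python
import numpy as np

def max_pool(H, p):
    """H: (height, width, n_maps); p: pooling/downsampling factor.
    Keeps the maximum of each non-overlapping p x p neighborhood."""
    h, w, k = H.shape
    h, w = (h // p) * p, (w // p) * p            # drop border leftovers
    H = H[:h, :w, :].reshape(h // p, p, w // p, p, k)
    return H.max(axis=(1, 3))

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 6, 4))
print(max_pool(H, 2).shape)   # (3, 3, 4): one value kept per 2x2 neighborhood
```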
Convolutional network (classification)
Deep Learning from Temporal Coherence in Video
Figure 1. A Convolutional Neural Network (CNN) performs a series of convolutions and subsamplings given the raw input image until it finally outputs a vector of predicted class labels: input image 72x72 → 3x3 convolution C1 (70x70) → 2x2 subsampling S2 (35x35) → 4x4 convolution C3 (32x32) → 2x2 subsampling S4 (16x16) → 5x5 convolution C5 (12x12) → 2x2 subsampling S6 (6x6) → 6x6 convolution C7 (1x1) → fully connected F8, one output per class.
parameters θ of the network:

L(θ) = − Σ_{n=1}^{N} log P_θ(y_n | x_n) = − Σ_{n=1}^{N} log P̃_{θ, y_n}(x_n)   (2)

We use stochastic gradient descent (Bottou, 1991) optimization for that purpose. Random examples (x, y) are sampled from the training set. After computation of the gradient ∂L(θ)/∂θ, a gradient descent update is applied:

θ ← θ − λ ∂L(θ, x, y)/∂θ,   (3)

where λ is a carefully chosen learning rate (e.g., choosing the rate which optimizes the training error).

2.2. Leveraging Video Coherence

As highlighted in the introduction, video coherence ensures that consecutive images in a video are likely to represent the same scene. It is also natural to enforce the representation of input images in the deep layers of the neural network to be similar if we know that the same scene is represented in the input images.

We consider now two images x1 and x2, and their corresponding generated representations z^l_θ(x1) and z^l_θ(x2) in the l-th layer. We exploit the video coherence property by enforcing z^l_θ(x1) and z^l_θ(x2) to be close (in the L1 norm) if the two input images are consecutive video images. If the two input images are not consecutive frames, then we push their representations apart. This corresponds to minimizing the following cost:

L_coh(θ, x1, x2) = ||z^l_θ(x1) − z^l_θ(x2)||_1, if x1, x2 are consecutive
L_coh(θ, x1, x2) = max(0, δ − ||z^l_θ(x1) − z^l_θ(x2)||_1), otherwise   (4)

where δ is the size of the margin, a hyperparameter chosen in advance, e.g. δ = 1.

Algorithm 1: Stochastic Gradient with Video Coherence.
Input: labeled data (x_n, y_n), n = 1, ..., N; unlabeled video data x_n, n = N + 1, ..., N + U
repeat
  Pick a random labeled example (x_n, y_n)
  Make a gradient step to decrease L(θ, x_n, y_n)
  Pick a random pair of consecutive images x_m, x_n in the video
  Make a gradient step to decrease L_coh(θ, x_m, x_n)
  Pick a random pair of images x_m, x_n in the video
  Make a gradient step to decrease L_coh(θ, x_m, x_n)
until stopping criterion is met

In our experiments, we enforced video coherence as described in (4) on the (M−1)-th layer of our M-layer network, i.e. on the representation yielded by the successive layers of the network just before the final softmax layer (1). The reasoning behind this choice is that the L1 distance we use may not be appropriate for the log probability representation in the last layer, although in principle we could apply this coherence regularization at any layer l. In practice, minimizing (4) for all pairs of images is achieved by stochastic gradient descent over a "siamese network" architecture (Bromley et al., 1993): two networks sharing the same parameters θ compute the representation for two sampled images x1 and x2 as shown in Figure 2. The gradient of the cost (4) with respect to θ is then computed and updated in the same way as in (3).

The optimization of the object recognition task (2) and the video coherence (4) is done simultaneously. That is, we minimize:

Σ_{n=1}^{N} L(θ, x_n, y_n) + γ Σ_{m,n} L_coh(θ, x_m, x_n)

with respect to θ.

In order to limit the number of hyper-parameters, we
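A minimal sketch of the coherence cost (4) evaluated on a pair of layer-l representations. The margin δ and the random representations are illustrative; in the paper, the gradient of this cost is backpropagated through a siamese pair of networks sharing θ.

```python
import numpy as np

def coherence_loss(z1, z2, consecutive, delta=1.0):
    """L1-based video-coherence cost of Eq. (4): pull representations of
    consecutive frames together, push non-consecutive ones at least delta apart."""
    d = np.sum(np.abs(z1 - z2))          # ||z(x1) - z(x2)||_1
    if consecutive:
        return d
    return max(0.0, delta - d)

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=16), rng.normal(size=16)
print(coherence_loss(z1, z2, consecutive=True))
print(coherence_loss(z1, z2, consecutive=False))
```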
• Stochastic gradient descent:
  1. Pick an example (x_t, y_t) ∈ D
  2. Compute L(f_Θ(x_t), y_t)
  3. Decrease it with a gradient step: θ_i ← θ_i − ε ∂L(f_Θ(x_t), y_t)/∂θ_i, ∀θ_i ∈ Θ
  4. Return to 1.
«softmax» output
Convolutional network
• Several other useful operations exist:
  - rectification
  - contrast normalization
Computation of the absolute value; subtraction of a local mean; division by a local standard deviation (a minimal sketch of these steps follows below).
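The sketch below applies these two operations to a single feature map: absolute-value rectification, then subtraction of a local Gaussian-weighted mean and division by a local standard deviation, following the excerpt reproduced earlier. The window size, the constant floor c, and all helper names are illustrative assumptions.

```python
import numpy as np

def gaussian_window(size=9, sigma=2.0):
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()                      # weights normalized to sum to 1

def local_mean(X, w):
    # 'Same'-size weighted local average with zero padding, via explicit loops.
    K = w.shape[0]
    pad = K // 2
    Xp = np.pad(X, pad)
    out = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i, j] = np.sum(w * Xp[i:i + K, j:j + K])
    return out

def rectify_and_normalize(X, w, c=1e-2):
    R = np.abs(X)                            # rectification: |x|
    V = R - local_mean(R, w)                 # subtract a local weighted mean
    sigma = np.sqrt(local_mean(V ** 2, w))   # local weighted standard deviation
    return V / np.maximum(c, sigma)          # divide by the local std, floored at c

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 16))
print(rectify_and_normalize(X, gaussian_window()).shape)   # (16, 16)
```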
Application: autonomous driving of a robot
http://www.cs.nyu.edu/~yann/research/dave/index.html
Application: autonomous driving of a robot
http://www.cs.nyu.edu/~yann/research/dave/index.html
Other applications in vision
• Face detection
• Google Streetview: face and license-plate detection for anonymization
SYNERGISTIC FACE DETECTION AND POSE ESTIMATION WITH ENERGY-BASED MODELS
Figure 9: Some example face detections. Each white box shows the location of a detected face. The angle of each box indicates the estimated in-plane rotation. The black crosshairs within each box indicate the estimated yaw.