Le LAR(s) et autres algorithmes de suivi de chemin de régularisation

Introduction Dfinition et principe Exemples `1 le LAR(s) SVR Conclusion

Le LAR(s)

et autres algorithmes de suivi de chemin dergularisation

Stphane Canu, Gilles Gasso et Alain Rakotomamonjy

LITISINSA de Rouen, France

St Etienne du Rouvray, [email protected]

Paris, 21 Mai 2007


Plan de lexpos

1 Dfinition et principe du chemin de rgularisationSlection de modle : choisir Chemin de rgularisation et parcimonie

2 Exemples de critres de type `1

3 le LAR(s)Le principeLalgorithme du LARLARS et LASSO

4 SVRPrincipe


Notations

Notations

Moindres carrs

Cout : S() = y X2 = y 2

prdictions : = X =d

j=1

xjjX =

[x1x2...xj ...xd

]solution : MC = (X>X )1X>y

Ridge rgression

Pnalisation : T () = 2 =d

j=1

2j

minIRd

S() + T ()

Solution : () = (X>X + I)1X>y

(0) = MC() = 0


Dfinition

minIRd

S() + T ()

Chemin de rgularisation de la ridge

{() | [0,]}Calcul du chemin de rgularisation :

(NEW) = ((OLD))

Point de dpart facile : (0) = MC; ou () = 0Avantages ?

efficace pour touver dans certains cas , certains S ou Ttrouver le bon gnrique : cas non convexe


Slection de modle : choisir

Redmarage chaud vs. chemin de rgularisation

minIRd

S() + T ()

- Tantque on a pas touv le bon 1 choisir un 2 min

IRdS() + T ()

3 valuer la solution

- Pour tous les calculer systmatiquement tous les () = = () = 0 = 0 = (0) = MC

Chemin de rgularisation{()

[0,[}



exemples de chemins de rgularisation

0

0

L(a) = ||Xa y||2

P(a

) =

||a|

|2

Pareto frontierLagrangian

0

0

L(a) = ||Xa y||2P

(a)

= |a

|/(1+

|a|)

Pareto frontierLagrangian

Avantages ?efficace pour touver dans certains cas , certains S ou Ttrouver le bon gnrique : cas non convexe



Un autre exemple : le chemin du OC SVM

exampleoneclassalpha.aviMedia File (video/avi)

exampleoneclass.aviMedia File (video/avi)


dans quels cas

cas favorable au chemin de rgularisation

minIRd

S() + T () {() | [0,]}

quand on sait le calculer efficacement

= linaire par morceauNEW = OLD + (NEW OLD)u

linaire par morceau

= un des critres S ou T est `1un critre `1

= parcimonie : plein de j = 0

parcimonie contraintes activesquels choix pour S et T ? Il faut du `1


dans quels cas

Chemin de rgularisation linaire par morceau

cas convexe [Rosset & Zhu, 07] minIRd

S() + T ()

1 linaire par morceau : lim0

(+ ) ()

= constante

2 optimalit S(()) + T (()) = 0S((+ )) + (+ )T ((+ )) = 0

3 developpement de Taylor

(+ ) ()

= (2S(())+2T (()))1T (())+O()

2S(()) = constante et 2T (()) = 0lun quadratique et lautre linaire par morceau


a propos des critres `1

Le monde bouge

The Gaussian Hare and the Laplacian TortoiseComputability of `1 vs. `2 Regression Estimators.

Portnoy & Koenker 1997


Chemin de rgularisation et parcimonie


minIRd

S() + T () {() | [0,]}


= linaire par morceauNEW = OLD + (NEW OLD)u

linaire par morceau

= un des critres S ou T est `1un critre `1

= parcimonie : plein de j = 0

parcimonie contraintes activespourquoi `1 entraine til de la parcimonie ?



Dfinition : Ensemble dhomognit forte (variables)

I0 ={

i {1, ...,d} i = 0}Thorme

Rgulier si S() + T () drivable en 0 et si I0(y) 6=

> 0, y B(y, ) tel que I0(y) 6= I0(y)

Singulier si S() + T () NONdrivable en 0 et si I0(y) 6=

> 0, y B(y, ) on a I0(y) = I0(y)Critre singulier = parcimonie Nikolova, 2000critres `1 ?

la singularit entraine la parcimonie




minIRd

S() + T () {() | [0,]}


= plein de j = 0NEW = OLD + (NEW OLD)u

calculer ufaire voluer lensemble I0 = {j |j = 0}

algorithme de type contraintes actives quelques exemples de critres `1 ?


le Lasso

Lasso et Basis pursuit

minIRd

y X2 + d

i=1

|i |

minIRd

y X2

avecd

i=1

|i | t

minIRdd

i=1

|i |

avec y X2

Tibshirani,1996 ; Chen, Donoho & Saunders, 1999 (Basis pursuit) ;Donoho et al., 2004 - Tropp 2004


le Lasso

LASSO et parcimonie

{minIR2

S() = y X2

avec T () = |1|+ |2| t


SVR

Support vector regression - SVR

Minimise le cout -insensible avec une rgularisation `2

= max(0, |y X| )minIRd

122 + C

ni=1

(+i + i )

avec 1I y X 1I + +et 0 +; 0

Vapnik, 1995

dualit variable - exemplesparcimonie par rapport aux exemples


SVR daprs Smola & Schlkopf

SVR et parcimonie (Smola & Schlkopf)L(, ,) = 122 + C

ni=1

(+i + i )

n

i=1

+i (y X +i )n

i=1

i (y + X i ) + ...

=n

i=1

(+i i )x>i f (x) =n

i=1

i x>i x

i ; +i = i = 0parcimonie par rapport aux exemples



Model selection

Choose the "correct" values of hyperparameters (or ) and

Illustration : tube width = 0.25

Pareto front

0 0.5 1 1.50

1

2

3

4

5

Penalty

Loss

Pareto front

Too smooth solution

0 0.2 0.4 0.6 0.8 11

0.5

0

0.5

1 = 0.25 and = 100



Model selection



Pareto front

0 0.5 1 1.50

1

2

3

4

5

Penalty

Loss

Pareto front

Better solution

0 0.2 0.4 0.6 0.8 11

0.5

0

0.5

1 = 0.25 and = 0.01



Model selection



Pareto front

0 2 4 6 80

2

4

6

8

10

12

14

Penalty

Loss

Pareto front

Too smooth solution

0 0.2 0.4 0.6 0.8 11

0.5

0

0.5

1 = 0.05 and = 72.2081



Model selection



Pareto front

0 2 4 6 80

2

4

6

8

10

12

14

Penalty

Loss

Pareto front

Convenient solution

0 0.2 0.4 0.6 0.8 11

0.5

0

0.5

1 = 0.05 and = 1.2328



Model selection



Pareto front

0 2 4 6 80

2

4

6

8

10

12

14

Penalty

Loss

Pareto front

Not enough smooth solution

0 0.2 0.4 0.6 0.8 11

0.5

0

0.5

1 = 0.05 and = 0.031623


Dantzig selector

Le Dantzig selector

minIRd

1avec

X>(X y) 1ICest de la programmation linaire

minIRd ,uIRd

di=1

ui

avec u u; 0 uet 1I X>(X y) 1I

Candes & Tao, 2004

pnalisation `1 - Fidlit `


Synthse des critres

rgulier vs. singulier

S() T () Rgression Discrimination

sing. sing.`1Dantzig Selector

LP SVRLP SVM

reg. sing.`1LASSOLARS

rgressionlogistique L1

sing. reg. `2 SVR SVM

reg. reg. `2 Ridgergression logistique

Lagrangien SVM

TAB.: SVM et SVR : support vector machine et support vectorregression. LP linear programming, LARS Least angle regressionstagewise.


Synthse des critres

comparaison empirique

Discussion of the Dantzig selector" Efron, Hastie & Tibshirani 2007


Plan de lexpos




4 SVRPrincipe


Notations

Notatons

tout est centr - variables explicatives standardises

ni=1

yi = 0n

i=1

xij = 0n

i=1

x2ij = 1 j = 1,d

Moindres carrs

Cout : S() = y X2 = y 2

prdictions : = X =d

j=1

xj jX =

...

x1x2...xj ...xd...

solution :

S() = 0 X>(y ) corrlations

= 0 MC = (X>X )1X>y


Les mthodes de rfrences

une mthode itrative constructive

le LASSO [Tibshirani 96]minIRd

S() = y X2 = y 2

avec T () =d

j=1

|j | t

Forward Stagewise regression [Weisberg 80]

new old + sign (cj) xjj = argmaxj=1,d |cj | c = X>(y ) = S()

Stagewise = la politique des petits pas



Forward Stagewise regression

1 initialisation : = 0 : rsidus : y X = y2 trouver le prdicteur xj le plus corrl avec le rsidu :

maxj=1,d

|c| avec c = X>(y X)

3 j j + avec = sign(cj)



Chemin de rgularisation (Hastie 05)

LASS0 Forward Stagewise regression


Le principe

LAR et Stagewise [Efron et al. 04]

LAR = Stagewise + optimal et variable


Le principe

une mthode itrative constructive

0 = 00 = X0 = 0

1 0 + 1w11 0 + 1x1

2 1 + 2w22 1 + 2u2

w1 =

0...01 j10...0

w2 =

...0a j10...0b j20...

une approche de type contraintes active A lensemble des variables actives


Le principe

LAR : corrlation et gomtrie (V. Guigue)

volution de la solution dans lespace des variables

X3

f 0

y

X2

X1

A

Dpart : tous les nuls


Le principe



X3

f 0

y

X2

R

uA

A

X1

Projection du rsidu sur la variable active


Le principe



X3

uA

f 1f

0

y

X2A

X1

Calcul du pas, qui-correlation entre le rsidu et deux variables


Le principe



X3

X2R

f 1f 0

y

uAX1

A

Projection du rsidu sur les variables actives


Le principe



X3

uAf 1f 0

y

X2

f 2

X1

A

Calcul du pas


Lalgorithme du LAR

Lalgorithme du LAR

A = lensemble des variables intressantes

k+1 k + wk+1 IRd

A A + wA IRcard(A)d A = k (A)

k+1 k + uk+1 IRn u = Xw = XAwA

le LAR1 trouver la variable ajouter j : A = A {j}2 calculer la direction de descente (w,u = Xw = XAwA)3 calculer le pas de descente


Lalgorithme du LAR

Lalgorithme du LARS

A A + kwAk+1 k + kuk {1, ...,d} = A A

c

1 trouver la variable ajouter j : maximise le gradient

j = argmaxjAc |Ac S()| = |X>Ac (y)| ; A = A j

2 la direction de descente : mthode du second ordre

wA = H1 AS()AS() = X

>A (y ) = c 1IA

H = X>AXA(X>AXA

)wA = 1IA

3 calculer le pas de descente

pour concerver la proprit : X>A (y ) = c 1I


Lalgorithme du LAR

que doit on calculer pour chaque opration

1 trouver la variable ajouter j : O(n(d k))|X >A (y )|

2 la direction de descente : mise jour (Choleski update)

XA = XA sjXj sj = sign(cj)(X>AXA

)wA = 1IA O(nk + k2)

3 calculer le pas de descente , O(n(d k)) trouver la prochaine variable ajouter

X>A (y ) = c 1I c = maxjAc |x>j (y )|


Lalgorithme du LAR

Calcul du pas > 0 qui concerve lquicorrlationDu cot de lensemble des variables actives

X>A (y k+1) = X>A (y k u) (1)= X>A (y k ) X>Au) (2)= c1I 1I (3)

Chaque composante du vecteur des corrlations vaut : c c = max

jAc|x>j (y k+1)| (4)

= maxjAc|x>j (y k )

cj

x>j uaj

| (5)au maximum j

c ={

cj ajcj + aj = minjAc

+

(c cj1 aj ,

c + cj1 + aj

) : le plus petit rel positif pour quune variable intgre A


Lalgorithme du LAR

volution de la corrlation


LARS et LASSO

Adaptation du LAR au LASSOminIRd

S() = y X2 = y 2

avec T () =d

j=1

|j | t

si j tel que j change de signealors on prend tel que j = 0 et cette composante quittelensemble actif AThorme 1Avec la modification dcrite ci-dessus,

le LARS donne les solutions du LASSO

les solutions = le chemin de rgularisation (t)


LARS et LASSO

Le chemin de rgularisation - Thorme 7

(t) = (t0) (t t0)AASAwA1 (t) = argminy XASA2 avec

A

sjj = t

2 L(, ) = y XASA2 + (

A sjj t)

L(, ) = S>AX>A(y XASA

)+ SA1I = 0

3 L(t1)L(t2) = S>AX>AXASA((t2)(t1)

)+(12)SA1I

(t2) (t1) = (1 2)SA(X>AXA

)11I4 sA

((t2)(t1)

)= t2t1 = (12)A1A ; A1A = 1I>

(X>AXA

)11I5 (t2) (t1) = (t1 t2) AA SA

(X>AXA

)11I wA

...est linaire par morceau


Bilan sur le LAR(s)

Bilan sur le LAR(s)

Rsumons nous

LARS est efficace

LARS = LASSO, LARS = stagewise

problmes de stabliti utiliser plutt le LARS stagewise

calcule vite le chemin de rgularisation point de dpart de nouvelles recherches


Plan de lexpos




4 SVRPrincipe


Principe

Support Vector Regression

Classical SVR (-SVR)-insensitive loss + L2 penalty

f = arg minf

1N

Ni=1

max(0, |yi f (xi , )| ) + 2f2

: width of the tubeProblems : choice of and the regularization parameter

3 2 0 + +2 +30

0.5

1

1.5

2

Residuals: r = yf(x)

in

sens

ible

lost

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90.8

0.6

0.4

0.2

0

0.2

0.4

0.6

0.8

Data

True function

Tube


Principe

Support Vector Regression

Repartition of the training pointsL : yi f (xi) < , i = 1, i = 0 hors du tubeR : yi f (xi) > , i = 0, i = 1 hors du tubeC : |yi f (xi)| < , i = 0, i = 0 dans le tubeEL : yi f (xi) = , 0 i 1, i = 0 sur le tubeER : yi f (xi) = , i = 0,0 i 1 sur le tube

Chemin de rgularisation : trouver comment les points xicirculent entre ces ensembles en lien avec et

3 2 0 + +2 +30

0.5

1

1.5

2

Residuals: r = yf(x)

in

sens

ible

lost

L R

CE ERL

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90.8

0.6

0.4

0.2

0

0.2

0.4

0.6

0.8 = 0.1 and = 1.6681


Principe

Derivation of the -path

Regression function :f (x) = 1

(mi=1 (

i i)k(xi , x) + 0

)At step t : f t(x) associated to t and ELt , ERt , Ct , Lt and RtFor < t , suppose the sets unchanged. Thus

f (x) = f (x) t f t(x) + t f t(x)=

iEL tERt

(i i) k(xi , x) + 0 + t f t(x)

= t unknowns variations of the parameters.Sets unchanged two consequences for the points on thetube

j ELt , yj f t(xj) = t and yj f (xj) = j ERt , yj f t(xj) = +t and yj f (xj) = +

Optimality conditions : ()>1=0 and (+)>1=0


Principe

Derivation of the -path

Tiding all these informations, we have a linear square system of|EL|+ |ER|+ 2 unknowns i , with i EL, i with i ER,0 = b tbt and d = tt

A = ( t)z

Piecewise linear variation of the parameters

Parameter variations : = t + ( t)A1z and are piecewise linear in Bias b and width of the tube are piecewise linear in 1/


Principe

Determination of next - Algorithm

Detection of the points move between the setsPoints xi on the tube move inside or outside

From EL to L : i = 1 From ER to R : i = 1From EL or ER to C : i = 0 or i = 0

Points xi inside or outside move on the tubeFrom L to EL : f (xi) = yi + From R to ER : f (xi) = yi From C to EL or ER

Algorithm1 Fix - Initialize and the sets2 Solve the linear system3 Detect the points move and deduce the next value of 4 Update the parameters and the sets5 Run from 2 until a termination criterion is reached


Historique

2000 Homotopy [Osborne et al.]

2003 LARS [Efron et al.]

2004 SVM [Hastie et al., Bach et al.]

2004 cas gnral : linaire par morceau [Rosset et al.]

2004-06 cas non linaire : prdiction correction [Friedman el al.]

2004-06 Traitement de nombreux cas particuliersKernel LARS [Guigue et al.] SVM [Loosli et al.] SVR [Gasso et al.]OC SVM [Rakotomamonjy et al.]Laplcian SVM (semi supervis) [Gasso et al.]


Conclusions & perspectives

Chemin de rgularisationun algorithme efficaceparcimonie - n taille moyennelimites de lapproche

instabilitquand sarrternon convexit

Comressive imaging[Baraniuk et al, 2006]

Perspectivespassage lchelle : approches stochastiqueinstabilit - couplage avec les contraintes activesquand sarrter ? : borne adaptectritres non convexes : linatrisation par morceau


Questions

Questions ?

Des questions ?

LAR(s) (et svmpath) en R www-stat.stanford.edu/~hastieMatlab asi.insa-rouen.fr/~vguigue/LARS.html

kernlab cran.r-project.org/src/contrib/Descriptions/kernlab.htmlSVR Matlab asi.insa-rouen.fr/~arakotom

Danzig selector www.l1-magic.org/


Questions

biblio

Least Angle Regression. Least angle regression, Bradley Efron, Trevor Hastie,Iain Johnstone and Robert Tibshirani ; Annals of statistics, vol. 32 (2),pp.407-499, 2004A tutorial on support vector regression, A. Smola and B. Scholkopf, Statistics andComputing 14(3), pp 199-222, 2004The Dantzig selector : Statistical estimation when p is much smaller than n,Emmanuel Candes and Terence Tao Submitted to IEEE Transactions onInformation Theory, June 2005.Two-Dimensional Solution Path for Support Vector Regression, Gang Wang,Dit-Yan Yeung, Frederick H. Lochovsky, ICML, 2006Piecewise Linear Regularized Solution Paths, Saharon Rosset, Ji Zhu. Annals ofStatistics, to appear, 2007Algorithmic linear dimension reduction in the l1 norm for sparse vectors, A. C.Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin, 2006 (submitted).An iterative thresholding algorithm for linear inverse problems with a sparsityconstraint, I. Daubechies, M. Defrise and C. De Mol, Comm. Pure Appl. Math,57, pp.1413-1541, 2004.Local strong homogeneity of a regularized estimator, Mila Nikolova, SIAMJournal on Applied Mathematics, vol. 61, no. 2, pp. 633-658, 2000.

Introduction

Dfinition et principe du chemin de rgularisationSlection de modle : choisir Chemin de rgularisation et parcimonie

Exemples de critres de type 1

le LAR(s)Le principeL'algorithme du LARLARS et LASSO

SVRPrincipe

Conclusion

Documents

Le LAR(s) et autres algorithmes de suivi de chemin de régularisation