RECOMMANDATION SOCIALE
Patrice Bellot, Aix-Marseille Université - CNRS (LSIS UMR 7296), OpenEdition
[email protected]
LSIS, DIMAG team: http://www.lsis.org/dimag
OpenEdition Lab: http://lab.hypotheses.org

Recommandation sociale : filtrage collaboratif et par le contenu



Page 1: Recommandation sociale : filtrage collaboratif et par le contenu


Page 2: Recommandation sociale : filtrage collaboratif et par le contenu

OpenEdition homepage

> 4 million unique visitors / month

Our partners: libraries and institutions all over the world

Page 3: Recommandation sociale : filtrage collaboratif et par le contenu


Some open questions…

— Is it useful to exploit metadata, the contents themselves, reader comments?

— How can contents be linked to one another?

— How can contents of different natures be exploited jointly?

— How can we "understand" readers' needs? Long queries? Profiles?

— What are the usage patterns? What are the needs?

— How can we go beyond informational relevance? (genre, level of expertise, recent document or not…)

— OpenEdition Lab: a research programme in digital humanities (HN). Detecting trends, emerging topics, the books "to read"…

Page 4: Recommandation sociale : filtrage collaboratif et par le contenu


Outline

— A few examples: setting out the problems and the stakes

— Which resources?

— Some methodological generalities

— Some strategies for evaluating a recommendation

— Around collaborative filtering (= "social" recommendation?)

— Around content analysis and content suggestion

. focus on searching for books with long natural-language queries

Page 5: Recommandation sociale : filtrage collaboratif et par le contenu


Introduction

Objectives of recommendation:

— Recommend "items" (films, books, web pages…)

— Predict the ratings that individuals would give

Different types of recommendation:

— Based on knowledge: characteristics of the target individuals (age, salary…)

— Based on individuals' preferences

— expressed explicitly by the individuals themselves

— inferred by analysing their behaviour (link with classification)

— By crossing individuals' behaviours: collaborative filtering

— By building profiles and comparing them to the contents (see the sketch below)
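To make the last two bullets concrete, here is a minimal R sketch (my own illustration, not material from the talk: the small rating matrix and the two item features are invented). It contrasts the collaborative view, where users are compared through their ratings, with the content-based view, where a user profile is compared to item descriptions.

# Toy data: 4 users x 5 items, NA = not rated (invented for illustration).
R <- matrix(c(5, 4, NA, 1, NA,
              4, 5,  1, 1, NA,
              1, NA, 5, 4,  5,
              NA, 1, 4, 5,  4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), paste0("i", 1:5)))

# Collaborative view: cosine similarity between users u1 and u2
# on the items they both rated.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
both <- !is.na(R["u1", ]) & !is.na(R["u2", ])
sim_u1_u2 <- cosine(R["u1", both], R["u2", both])

# Content-based view: describe items by (hypothetical) features and
# compare a user profile, built from items the user liked, to a new item.
items <- rbind(i1 = c(history = 1, sociology = 0),
               i5 = c(history = 0, sociology = 1))
profile_u1 <- items["i1", ]              # u1 liked item i1
sim_content <- cosine(profile_u1, items["i5", ])

sim_u1_u2; sim_content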

A large number of information sources:

— Information explicitly provided by the individuals

— The contents and their metadata

— The Web and social networks (contents, graphs…)

Page 6: Recommandation sociale : filtrage collaboratif et par le contenu


Page 7: Recommandation sociale : filtrage collaboratif et par le contenu


ACM conferences and workshops

— Conferences:

— Recommender Systems, RecSys (since 2007)

— "Recommendation Systems" sessions at SIGIR, CIKM, …

— Workshops:

— Context-aware Movie Recommendation (2010, 2011)

— Information Heterogeneity and Fusion in Recommender Systems (2010, 2011)

— Large-Scale Recommender Systems and the Netflix Prize Competition (2008)

— Recommendation Systems for Software Engineering (2008-14)

— Recommender Systems and the Social Web (2012)

Page 8: Recommandation sociale : filtrage collaboratif et par le contenu


"Recommender systems" articles

ACM RecSys conference (https://recsys.acm.org)

Page 9: Recommandation sociale : filtrage collaboratif et par le contenu


Page 10: Recommandation sociale : filtrage collaboratif et par le contenu

Overview of the approaches

Page 11: Recommandation sociale : filtrage collaboratif et par le contenu


EXAMPLES

Page 12: Recommandation sociale : filtrage collaboratif et par le contenu

Page 13: Recommandation sociale : filtrage collaboratif et par le contenu

Page 14: Recommandation sociale : filtrage collaboratif et par le contenu

Page 15: Recommandation sociale : filtrage collaboratif et par le contenu

Page 16: Recommandation sociale : filtrage collaboratif et par le contenu

(2015)

https://www.slideshare.net/MrChrisJohnson/interactive-recommender-systems-with-netflix-and-spotify/20-Spotify_in_NumbersStarted_in_2006

Page 17: Recommandation sociale : filtrage collaboratif et par le contenu

Page 18: Recommandation sociale : filtrage collaboratif et par le contenu

Page 19: Recommandation sociale : filtrage collaboratif et par le contenu

Page 20: Recommandation sociale : filtrage collaboratif et par le contenu

Page 21: Recommandation sociale : filtrage collaboratif et par le contenu

Amazon navigation graph: YASIV

http://www.yasiv.com/#/Search?q=orwell&category=Books&lang=US

Page 22: Recommandation sociale : filtrage collaboratif et par le contenu

Page 23: Recommandation sociale : filtrage collaboratif et par le contenu

Page 24: Recommandation sociale : filtrage collaboratif et par le contenu

Page 25: Recommandation sociale : filtrage collaboratif et par le contenu

Page 26: Recommandation sociale : filtrage collaboratif et par le contenu

Many considerations

Bobadilla J, Ortega F, Hernando A, Gutiérrez A. Recommender systems survey. Knowledge-Based Systems. 2013;46(C):109-132. doi:10.1016/j.knosys.2013.03.012.

Page 27: Recommandation sociale : filtrage collaboratif et par le contenu


RESOURCES

Page 28: Recommandation sociale : filtrage collaboratif et par le contenu


Page 29: Recommandation sociale : filtrage collaboratif et par le contenu

Some data collections

…vote is "not voted", which we represent with the symbol •. All the lists have the same number of elements: I.

Example:

$r_a^b \in \{1..5\} \cup \{\bullet\}$

$r_x : (4, 5, \bullet, 3, 2, \bullet, 1, 1)$;  $r_y : (4, 3, 1, 2, \bullet, 3, 4, \bullet)$.

Using standardized values [0..1]:

$r_x : (0.75, 1, \bullet, 0.5, 0.25, \bullet, 0, 0)$;  $r_y : (0.75, 0.5, 0, 0.25, \bullet, 0.5, 0.75, \bullet)$.

We define the cardinality of a list, #l, as the number of elements in the list l different from •.

(1) We obtain the list

$d_{x,y} : (d_{x,y}^1, d_{x,y}^2, d_{x,y}^3, \ldots, d_{x,y}^I)$ where $d_{x,y}^i = (r_x^i - r_y^i)^2 \;\; \forall i \mid r_x^i \neq \bullet \wedge r_y^i \neq \bullet$, and $d_{x,y}^i = \bullet \;\; \forall i \mid r_x^i = \bullet \vee r_y^i = \bullet$,   (10)

in our example:

$d_{x,y} = (0, 0.25, \bullet, 0.0625, \bullet, \bullet, 0.5625, \bullet)$.

(2) We obtain the MSD(x,y) measure by computing the arithmetic average of the values in the list $d_{x,y}$:

$MSD(x,y) = \bar{d}_{x,y} = \dfrac{\sum_{i=1..I,\; d_{x,y}^i \neq \bullet} d_{x,y}^i}{\# d_{x,y}}$,   (11)

in our example:

$\bar{d}_{x,y} = (0 + 0.25 + 0.0625 + 0.5625)/4 = 0.218$

MSD(x,y) (11) tends towards zero as the ratings of users x and y become more similar and tends towards 1 as they become more different (we assume that the votes are normalized in the interval [0..1]).

(3) We obtain the Jaccard(x,y) measure by computing the proportion between the number of positions [1..I] in which there are elements different from • in both $r_x$ and $r_y$ and the number of positions [1..I] in which there are elements different from • in $r_x$ or in $r_y$:

$Jaccard(x,y) = \dfrac{|r_x \cap r_y|}{|r_x \cup r_y|} = \dfrac{\# d_{x,y}}{\# r_x + \# r_y - \# d_{x,y}}$,   (12)

in our example: 4/(6 + 6 - 4) = 0.5.

(4) We combine the above elements in the final equation:

$new\_metric(x,y) = Jaccard(x,y) \cdot (1 - MSD(x,y))$,   (13)

in the running example:

$new\_metric(x,y) = 0.5 \cdot (1 - 0.218) = 0.391$.

If the values of the votes are normalized in the interval [0..1], then:

$(1 - MSD(x,y)),\; Jaccard(x,y),\; new\_metric(x,y) \in [0..1]$.
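A minimal R sketch of equations (10)-(13) on the running example (my own illustration, not the authors' code; NA plays the role of the "not voted" symbol • and the ratings are already normalised to [0..1]):

# Running example, normalised ratings with NA for "not voted".
rx <- c(0.75, 1, NA, 0.5, 0.25, NA, 0, 0)
ry <- c(0.75, 0.5, 0, 0.25, NA, 0.5, 0.75, NA)

common <- !is.na(rx) & !is.na(ry)                 # positions rated by both users
msd <- mean((rx[common] - ry[common])^2)          # Eq. (11)
jaccard <- sum(common) / sum(!is.na(rx) | !is.na(ry))  # Eq. (12)
new_metric <- jaccard * (1 - msd)                 # Eq. (13)
c(MSD = msd, Jaccard = jaccard, new_metric = new_metric)

The printed new_metric value, 0.390625, matches the 0.391 of the running example up to rounding.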

4. Planning the experiments

The RS databases [2,30,32] that we use in our experiments present the general characteristics summarized in Table 1.

The experiments have been grouped in such a way that the following can be determined:

- Accuracy.
- Coverage.
- Number of perfect predictions.
- Precision/recall.

We consider a perfect prediction to be each situation in which the prediction of the rating recommended to one user for one film matches the value rated by that user for that film.

The experiments were carried out, depending on the size of the database, for each of the following k-neighborhood values: MovieLens [2..1500] step 50, FilmAffinity [2..2000] step 100, NetFlix [2..10000] step 100; depending on the size of each particular RS database, a different number of k-neighborhoods is necessary in order to display tendencies in the graphs of results. The precision/recall recommendation quality results have been obtained using a range [2..20] of recommendations and relevance thresholds θ = 5 for MovieLens and NetFlix and θ = 9 for FilmAffinity.

When we use MovieLens and FilmAffinity we use 20% of test users taken at random from all the users of the database; with the remaining 80% we carry out the training. When we use NetFlix, given the huge number of users in the database, we only use 5% of its users as test users. In all cases we use 20% of test items.

Table 2 shows the numerical data exposed in this section.
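As a sketch of this protocol (my own illustration, not the paper's code: the ratings matrix is randomly generated, the similarity is the new metric of Eq. (13), and the prediction is a plain similarity-weighted average over the k nearest neighbours), the following R fragment holds out one rating per user and reports MAE and coverage:

# Toy ratings on a [0..1] scale, NA = not voted (invented data).
set.seed(1)
R <- matrix(runif(200), nrow = 20, ncol = 10)
R[sample(length(R), 120)] <- NA                 # make the matrix sparse

newmetric <- function(a, b) {                   # Eq. (13): Jaccard * (1 - MSD)
  both <- !is.na(a) & !is.na(b)
  if (!any(both)) return(0)
  msd <- mean((a[both] - b[both])^2)
  jac <- sum(both) / sum(!is.na(a) | !is.na(b))
  jac * (1 - msd)
}

predict_rating <- function(R, u, i, k = 5) {    # weighted average over k neighbours
  sims <- sapply(seq_len(nrow(R)), function(v) newmetric(R[u, ], R[v, ]))
  sims[u] <- NA
  cand <- which(!is.na(R[, i]) & !is.na(sims) & sims > 0)
  if (length(cand) == 0) return(NA)             # no neighbour rated i: not covered
  top <- cand[order(sims[cand], decreasing = TRUE)][seq_len(min(k, length(cand)))]
  weighted.mean(R[top, i], sims[top])
}

# Hold out one observed rating per user, then measure MAE and coverage.
errs <- numeric(0); covered <- 0; total <- 0
for (u in seq_len(nrow(R))) {
  rated <- which(!is.na(R[u, ]))
  if (length(rated) < 2) next
  i <- rated[1]; truth <- R[u, i]; R[u, i] <- NA
  p <- predict_rating(R, u, i); R[u, i] <- truth
  total <- total + 1
  if (!is.na(p)) { covered <- covered + 1; errs <- c(errs, abs(p - truth)) }
}
cat("MAE:", mean(errs), " coverage:", covered / total, "\n")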

5. Results

In this section we present the results obtained using the databases specified in Table 1. Fig. 6 shows the results obtained with MovieLens, Fig. 7 shows those obtained with NetFlix and Fig. 8 corresponds to FilmAffinity.

Graph 6A shows the MAE error obtained on MovieLens by applying Pearson correlation (dashed) and the proposed metric (continuous). The new metric achieves significantly fewer errors in practically all the experiments carried out (by varying the number of k-neighborhoods). The average improvement is around 0.2 stars for the most commonly used values of k (50, 100, 150, 200).

Graph 6B shows the coverage. Small values of k produce small percentages in the capacity for prediction, as it is more improbable that the few neighbors of a test user have voted for a film that this user has not voted for. As the number of neighbors increases, the probability that at least one of them has voted for the film also increases, as shown in the graph.

Table 1. Main parameters of the databases used in the experiments.

                      MovieLens    FilmAffinity    NetFlix
Number of users       4382         26447           480189
Number of movies      3952         21128           17770
Number of ratings     1000209      19126278        100480507
Min and max values    1-5          1-10            1-5

Table 2. Main parameters used in the experiments.

               K (MAE, coverage, perfect predictions)    Precision/recall     Test users (%)   Test items (%)
               Range         Step                        N         θ
MovieLens 1M   [2..1500]     50                          [2..20]   5           20               20
FilmAffinity   [2..2000]     100                         [2..20]   9           20               20
NetFlix        [2..10000]    100                         [2..20]   5           5                20

J. Bobadilla et al. / Knowledge-Based Systems 23 (2010) 520-528.

Page 30: Recommandation sociale : filtrage collaboratif et par le contenu


Page 31: Recommandation sociale : filtrage collaboratif et par le contenu

The MovieLens Datasets

Harper, F. M., & Konstan, J. A. (2016). The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), 19.

Page 32: Recommandation sociale : filtrage collaboratif et par le contenu

Page 33: Recommandation sociale : filtrage collaboratif et par le contenu

https://labrosa.ee.columbia.edu/millionsong/lastfm

Page 34: Recommandation sociale : filtrage collaboratif et par le contenu


https://labrosa.ee.columbia.edu/millionsong/lastfm

Page 35: Recommandation sociale : filtrage collaboratif et par le contenu

Page 36: Recommandation sociale : filtrage collaboratif et par le contenu

Page 37: Recommandation sociale : filtrage collaboratif et par le contenu


http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Page 38: Recommandation sociale : filtrage collaboratif et par le contenu


Page 39: Recommandation sociale : filtrage collaboratif et par le contenu

http://files.grouplens.org/datasets/hetrec2011/hetrec2011-delicious-readme.txt

Page 40: Recommandation sociale : filtrage collaboratif et par le contenu


Page 41: Recommandation sociale : filtrage collaboratif et par le contenu


METHODS: GENERALITIES

Page 42: Recommandation sociale : filtrage collaboratif et par le contenu


"State of the art" articles

Page 43: Recommandation sociale : filtrage collaboratif et par le contenu


https://www.slideshare.net/xamat/recommender-systems-machine-learning-summer-school-2014-cmu

Page 44: Recommandation sociale : filtrage collaboratif et par le contenu

"Individuals" and "data"

Chapter 2: Principal Component Analysis

Let T be a table crossing n individuals I (in rows) and K quantitative variables X (in columns). x_{i,k} is the value of variable k for individual i:

                X1        X2        ...    XK        (variables)
(individuals)
    I1          x_{1,1}   x_{1,2}   ...    x_{1,K}
    I2          x_{2,1}   x_{2,2}   ...    x_{2,K}
    ...                   x_{i,k}
    In          x_{n,1}   x_{n,2}   ...    x_{n,K}

One of the objectives of data analysis is to determine profiles of individuals or, put differently, classes of individuals that resemble one another. This resemblance is determined from the values of the variables associated with the individuals.

Another objective concerns the variables themselves: computing the correlations between them (to what extent a change in the values of one entails a change in the values of the other, and in what way), regression between variables (formulating the links between variables)... Principal Component Analysis (PCA) deals with linear relations between variables, as opposed to quadratic, logarithmic or exponential relations, for example. PCA belongs to the family of factor analyses, which determine factors from the values of the variables associated with the individuals.
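In R, such an individuals × variables table is simply a data frame with one row per individual and one quantitative column per variable (a tiny sketch with invented values, not data from the course; the slide calls this table T):

# n = 4 individuals, K = 3 quantitative variables (invented values).
tab <- data.frame(X1 = c(2.1, 3.4, 1.9, 4.0),
                  X2 = c(10, 12, 9, 15),
                  X3 = c(0.5, 0.7, 0.4, 0.9),
                  row.names = paste0("I", 1:4))
tab[2, "X3"]   # x_{2,3}: value of variable 3 for individual 2
cor(tab)       # correlations between the variables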

Page 45: Recommandation sociale : filtrage collaboratif et par le contenu


Study of the individuals / study of the variables

• Data analysis can be conducted according to

• the individuals: looking for resemblance between individuals (as a function of the values of the variables) = automatic clustering of the individuals

• the variables: which variables best explain the data (the differences between individuals)? What are the principal components? Where is the greatest variability?


> temp <- data.frame(temperature[1:12])
> cl <- kmeans(temp, 3, iter.max = 2, nstart = 15)

e) visualise the clusters:
> summary(cl)
> cl$cluster
> summary(cl$cluster)
> cl$centers

f) Add the result of the clustering to the data. Use the cluster package to access the clusplot function:
> library(cluster)
then:
> aggregate(temperature, by = list(cl$cluster), FUN = mean)
> cl2 <- data.frame(temperature, cl$cluster)
> clusplot(temperature, cl$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

5- "Bonus" question: working with the APCluster package

Install the APCluster package

[Figure: Individuals factor map (PCA) and Variables factor map (PCA) for European city temperatures; Dim 1 (86.87%), Dim 2 (11.42%). Individuals: European cities labelled East / North / South / West; variables: monthly temperatures (January to December), Annual, Amplitude, Latitude, Longitude.]

Page 46: Recommandation sociale : filtrage collaboratif et par le contenu

Page 47: Recommandation sociale : filtrage collaboratif et par le contenu

Page 48: Recommandation sociale : filtrage collaboratif et par le contenu

Page 49: Recommandation sociale : filtrage collaboratif et par le contenu


PCA and dimensionality reduction
• A way of representing clouds of individuals in a few dimensions

— preserving the distances between individuals as well as possible
— favouring the dimensions of greatest variability (iterative selection of the factors that maximize variance)
= applying a projection function (see the sketch below)
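A minimal R sketch of this idea (invented data, using the base prcomp function rather than the course's lab material): project the individuals onto the first principal plane and check how much variance the retained dimensions carry.

set.seed(1)
# 30 individuals described by 5 quantitative variables (invented, with some correlation).
X <- matrix(rnorm(30 * 5), nrow = 30)
X[, 2] <- X[, 1] + 0.1 * rnorm(30)
X[, 3] <- -X[, 1] + 0.1 * rnorm(30)

pca <- prcomp(X, center = TRUE, scale. = TRUE)   # centred, standardised PCA
summary(pca)$importance[2, 1:2]   # proportion of variance carried by PC1 and PC2
head(pca$x[, 1:2])                # coordinates of the individuals on the first plane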

Page 50: Recommandation sociale : filtrage collaboratif et par le contenu


Learning methods
• Different forms of learning

• A "student" agent copies the "teacher" agent --> provide examples

• Inductive reasoning (from examples)

• Learning the important characteristics

• Detecting recurring patterns

• Adjusting the important parameters

• Turning information into knowledge
Examples --> Model --> Test --> Correction / enrichment of the examples

Page 51: Recommandation sociale : filtrage collaboratif et par le contenu


Statistical and probabilistic approaches
Machine learning

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the 18th International Conference on Machine Learning (ICML 2001).

[Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.]

…sequence. In addition, the features do not need to completely specify a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

For the remainder of the paper we assume that the dependencies of Y, conditioned on X, form a chain. To simplify some expressions, we add special start and stop states $Y_0 = \mathrm{start}$ and $Y_{n+1} = \mathrm{stop}$. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that $p_\theta(Y \mid X)$ is a CRF given by (1). For each position i in the observation sequence x, we define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(x) = [M_i(y', y \mid x)]$ by

$M_i(y', y \mid x) = \exp(\Lambda_i(y', y \mid x))$

$\Lambda_i(y', y \mid x) = \sum_k \lambda_k f_k(e_i, Y|_{e_i} = (y', y), x) + \sum_k \mu_k g_k(v_i, Y|_{v_i} = y, x)$,

where $e_i$ is the edge with labels $(Y_{i-1}, Y_i)$ and $v_i$ is the vertex with label $Y_i$. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences x, and therefore these matrices can be computed directly as needed from a given training or test observation sequence x and the parameter vector $\theta$. Then the normalization (partition function) $Z_\theta(x)$ is the (start, stop) entry of the product of these matrices:

$Z_\theta(x) = \left(M_1(x)\, M_2(x) \cdots M_{n+1}(x)\right)_{\mathrm{start},\,\mathrm{stop}}$.

Using this notation, the conditional probability of a label sequence y is written as

$p_\theta(y \mid x) = \dfrac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{\left(\prod_{i=1}^{n+1} M_i(x)\right)_{\mathrm{start},\,\mathrm{stop}}}$,

where $y_0 = \mathrm{start}$ and $y_{n+1} = \mathrm{stop}$.
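The matrix form above can be checked numerically with a small R sketch (my own illustration with made-up potentials, not code from the paper): build one M_i matrix per position over a tiny label set augmented with start and stop, take Z(x) as the (start, stop) entry of their product, and score a single label sequence.

set.seed(42)
labels <- c("start", "A", "B", "stop")   # tiny label set with start/stop (invented)
n <- 3                                   # length of the observation sequence

# M_i(y', y | x) = exp(score): random positive potentials, one matrix per position.
M <- lapply(seq_len(n + 1), function(i) {
  matrix(exp(rnorm(length(labels)^2)), length(labels), length(labels),
         dimnames = list(labels, labels))
})

# Partition function: (start, stop) entry of the product M_1 %*% ... %*% M_{n+1}.
Zx <- Reduce(`%*%`, M)["start", "stop"]

# Unnormalized score and conditional probability of one label sequence y = (A, B, A).
y <- c("start", "A", "B", "A", "stop")
score <- prod(sapply(seq_len(n + 1), function(i) M[[i]][y[i], y[i + 1]]))
score / Zx                               # p(y | x), a value in (0, 1)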

4. Parameter Estimation for CRFs

We now describe two iterative scaling algorithms to find the parameter vector $\theta$ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$ for appropriately chosen $\delta\lambda_k$ and $\delta\mu_k$. In particular, the IIS update $\delta\lambda_k$ for an edge feature $f_k$ is the solution of

$\tilde{E}[f_k] \stackrel{\mathrm{def}}{=} \sum_{x,y} \tilde{p}(x,y) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x) = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)\, e^{\delta\lambda_k T(x,y)}$,

where $T(x,y)$ is the total feature count

$T(x,y) \stackrel{\mathrm{def}}{=} \sum_{i,k} f_k(e_i, y|_{e_i}, x) + \sum_{i,k} g_k(v_i, y|_{v_i}, x)$.

The equations for the vertex feature updates $\delta\mu_k$ have a similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because $T(x,y)$ is a global property of $(x,y)$, and dynamic programming will sum over sequences with potentially varying T. To deal with this, the first algorithm, Algorithm S, uses a "slack feature." The second, Algorithm T, keeps track of partial T totals.

For Algorithm S, we define the slack feature by

$s(x,y) \stackrel{\mathrm{def}}{=} S - \sum_i \sum_k f_k(e_i, y|_{e_i}, x) - \sum_i \sum_k g_k(v_i, y|_{v_i}, x)$,

where S is a constant chosen so that $s(x^{(i)}, y) \ge 0$ for all y and all observation vectors $x^{(i)}$ in the training set, thus making $T(x,y) = S$. Feature s is "global," that is, it does not correspond to any particular edge or vertex.

For each index $i = 0, \ldots, n+1$ we now define the forward vectors $\alpha_i(x)$ with base case

$\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \mathrm{start} \\ 0 & \text{otherwise} \end{cases}$

3 Conditional Random Fields

Lafferty et al. [8] define the probability of a particular label sequence y given observation sequence x to be a normalized product of potential functions, each of the form

$\exp\!\left(\sum_j \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k s_k(y_i, x, i)\right)$,   (2)

where $t_j(y_{i-1}, y_i, x, i)$ is a transition feature function of the entire observation sequence and the labels at positions i and i-1 in the label sequence; $s_k(y_i, x, i)$ is a state feature function of the label at position i and the observation sequence; and $\lambda_j$ and $\mu_k$ are parameters to be estimated from training data.

When defining feature functions, we construct a set of real-valued features $b(x, i)$ of the observation to express some characteristic of the empirical distribution of the training data that should also hold of the model distribution. An example of such a feature is

$b(x, i) = \begin{cases} 1 & \text{if the observation at position } i \text{ is the word ``September''} \\ 0 & \text{otherwise.} \end{cases}$

Each feature function takes on the value of one of these real-valued observation features $b(x, i)$ if the current state (in the case of a state function) or previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

$t_j(y_{i-1}, y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_{i-1} = \mathrm{IN} \text{ and } y_i = \mathrm{NNP} \\ 0 & \text{otherwise.} \end{cases}$

In the remainder of this report, notation is simplified by writing

$s(y_i, x, i) = s(y_{i-1}, y_i, x, i)$

and

$F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i)$,

where each $f_j(y_{i-1}, y_i, x, i)$ is either a state function $s(y_{i-1}, y_i, x, i)$ or a transition function $t(y_{i-1}, y_i, x, i)$. This allows the probability of a label sequence y given an observation sequence x to be written as

$p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\!\left(\sum_j \lambda_j F_j(y, x)\right)$.   (3)

$Z(x)$ is a normalization factor.

The Z_score of each term $t_i$ in a class $C_j$ ($t_{ij}$) is computed from its term relative frequency $tfr_{ij}$ in the class $C_j$, the mean ($mean_i$), which is the term probability over the whole corpus multiplied by $n_j$, the number of terms in the class $C_j$, and the standard deviation ($sd_i$) of term $t_i$ according to the underlying corpus (see Eq. (1, 2)).

$Z\_score(t_{ij}) = \dfrac{tfr_{ij} - mean_i}{sd_i}$   (1)

$Z\_score(t_{ij}) = \dfrac{tfr_{ij} - n_j\, P(t_i)}{\sqrt{n_j\, P(t_i)\,(1 - P(t_i))}}$   (2)

A term whose frequency in a class is salient in comparison to the other classes will have a salient Z_score. Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010): they choose a threshold (>2) for selecting the terms whose Z_score exceeds the threshold, then use logistic regression to combine these scores. We use Z_scores as added features for classification because tweets are very short, so many tweets do not contain any word with a salient Z_score. The distribution of Z_score over each class shows that the majority of terms have a Z_score between -1.5 and 2.5 in each class, and the rest are either very frequent (>2.5) or very rare (<-1.5). A negative value means that the term is not frequent in this class in comparison with its frequencies in the other classes. Table 1 shows the first ten terms having the highest Z_score in each class. We tested different values for the threshold; the best results were obtained with a threshold of 3.

Table 1. The first ten terms having the highest Z_score in each class.

  positive   Z_score    negative   Z_score    Neutral     Z_score
  Love       14.31      Not        13.99      Httpbit      6.44
  Good       14.01      Fuck       12.97      Httpfb       4.56
  Happy      12.30      Don't      10.97      Httpbnd      3.78
  Great      11.10      Shit        8.99      Intern       3.58
  Excite     10.35      Bad         8.40      Nov          3.45
  Best        9.24      Hate        8.29      Httpdlvr     3.40
  Thank       9.21      Sad         8.28      Open         3.30
  Hope        8.24      Sorry       8.11      Live         3.28
  Cant        8.10      Cancel      7.53      Cloud        3.28
  Wait        8.05      stupid      6.83      begin        3.17
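A minimal R sketch of this Z_score computation (my own illustration on an invented term-by-class count table, following Eq. (2)): for each term and class, the observed count is compared with what the corpus-wide term probability would predict.

# Toy term counts per class (invented): rows = terms, columns = classes.
counts <- rbind(love = c(positive = 50, negative = 5, neutral = 10),
                bad  = c(positive = 4, negative = 60, neutral = 12),
                nov  = c(positive = 8, negative = 7, neutral = 30))

n_j <- colSums(counts)                  # n_j: number of term occurrences in class C_j
p_i <- rowSums(counts) / sum(counts)    # P(t_i): term probability over the whole corpus

expected <- outer(p_i, n_j)             # n_j * P(t_i) for every (term, class) pair
z <- (counts - expected) / sqrt(expected * (1 - p_i))   # Eq. (2)
round(z, 2)                             # salient positive values = class-characteristic terms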

- Sentiment Lexicon Features (POL): We used two sentiment lexicons, the MPQA Subjectivity Lexicon (Wilson, Wiebe et al. 2005) and Bing Liu's Opinion Lexicon, which was created by (Hu and Liu 2004) and augmented in many later works. We extract the number of positive, negative and neutral words in tweets according to these lexicons. Bing Liu's lexicon only contains negative and positive annotations, whereas the Subjectivity lexicon contains negative, positive and neutral.

- Part Of Speech (POS): We annotate each word in the tweet with its POS tag, and then we compute the number of adjectives, verbs, nouns, adverbs and connectors in each tweet.

4 Evaluation

4.1 Data collection. We used the data set provided in SemEval 2013 and 2014 for subtask B of sentiment analysis in Twitter (Rosenthal, Ritter et al. 2014) (Wilson, Kozareva et al. 2013). The participants were provided with training tweets annotated as positive, negative or neutral. We downloaded these tweets using a given script. Among 9646 tweets, we could only download 8498 of them because of protected profiles and deleted tweets. Then, we used the development set containing 1654 tweets for evaluating our methods. We combined the development set with the training set and built a new model which predicted the labels of the test sets 2013 and 2014.

4.2 Experiments

Official results. The results of our system submitted for the SemEval evaluation gave 46.38% and 52.02% for the test sets 2013 and 2014 respectively. It should be mentioned that these results are not correct because of a software bug discovered after the submission deadline; the correct results are therefore reported as non-official results. In fact the previous results are the output of our classifier trained with all the features in Section 3, but because of an index shifting error the test set was represented by all the features except the terms.

Non-official results. We carried out various experiments using the features presented in Section 3 with a multinomial Naïve Bayes model. We first constructed the feature vector of tweet terms, which gave 49% and 46% for the test sets 2013 and 2014 respectively. Then, we augmented this original vector by the Z_score


Which words are characteristic of a group of documents?

Which significant relations can be drawn from the observed forms alone?

Analogies, correlations

Page 52: Recommandation sociale : filtrage collaboratif et par le contenu

Page 53: Recommandation sociale : filtrage collaboratif et par le contenu

Page 54: Recommandation sociale : filtrage collaboratif et par le contenu

Recommendation and time series

Page 55: Recommandation sociale : filtrage collaboratif et par le contenu

EVALUATION

Page 56: Recommandation sociale : filtrage collaboratif et par le contenu


Evaluation grid

Our overall motivation for this research was to understand the crucial factors that influence the user adoption of recommenders. Another motivation is to come up with a subjective evaluation questionnaire that other researchers and practitioners can employ. However, it is unlikely that a 60-item questionnaire can be administered for a quick and easy evaluation. This has motivated us in proposing a simplified model based on our past research. Between 2005 and 2010, we administered 11 subjective questionnaires to a total of 807 subjects [4,5,6,12,13,14,23,24]. Initial questionnaires covered some of the four categories identified in ResQue. As we conducted more experiments, we became more convinced of the four categories and used all of them in recent studies. On average, between 12 and 15 questions were used. Based on this previous work, we have synthesized and organized a total of 15 questions as a simplified model for the purpose of performing a quick and easy usability and adoption evaluation of a recommender (see questions with the * sign).

5. CONCLUSION AND FUTURE WORK

User evaluation of recommender systems is a crucial subject of study that requires a deep understanding, development and testing of the right dimensions (or constructs) and the standardization of the questions used. The framework described in this paper presents the first attempt to develop a complete and balanced evaluation framework that measures users' subjective attitudes based on their experience towards a recommender. ResQue consists of a set of 13 constructs and 60 questions for a high-quality recommender system from the user point of view and can be used as a standard guideline for a user evaluation. It can also be adapted to a custom-made user evaluation by tailoring it to an individual research context. Researchers and practitioners can use these questionnaires with ease to measure users' general satisfaction with recommenders, their readiness to adopt the technology, and their intention to purchase recommended items and return to the site in the future. After ResQue was finalized, we asked several expert researchers in the community of recommender systems to review the model. Their feedback and comments were then incorporated into the final version of the model. This method, known as the Delphi method, is one of the first validation attempts on the model. Since the work was submitted, we have started conducting a survey to further validate the model's reliability, validity and sensitivity using factor analysis, structural equation modeling (SEM), and other techniques described in [21]. Initial results based on 150 participants indicate how the model can be interpreted and show factors that correspond to the original model. At the same time, the analysis also gives some indications on how to refine the model. More users are expected to participate in the survey and the final outcome will be reported soon.

APPENDIX A. Constructs and Questions of ResQue

The following contains the questionnaire statements that can be used in a survey. They are developed based on the ResQue model described in this paper. Users should be asked to indicate their answers to each of the questions using 1-5 Likert scales, where 1 indicates "strongly disagree" and 5 is "strongly agree."

A1. Quality of Recommended Items

A.1.1 Accuracy
- The items recommended to me matched my interests.*
- The recommender gave me good suggestions.
- I am not interested in the items recommended to me (reverse scale).

A.1.2 Relative Accuracy
- The recommendation I received better fits my interests than what I may receive from a friend.
- A recommendation from my friends better suits my interests than the recommendation from this system (reverse scale).

A.1.3 Familiarity
- Some of the recommended items are familiar to me.
- I am not familiar with the items that were recommended to me (reverse scale).

A.1.4 Attractiveness
- The items recommended to me are attractive.

A.1.5 Enjoyability
- I enjoyed the items recommended to me.

A.1.6 Novelty
- The items recommended to me are novel and interesting.*
- The recommender system is educational.
- The recommender system helps me discover new products.
- I could not find new items through the recommender (reverse scale).

A.1.6 Diversity
- The items recommended to me are diverse.*
- The items recommended to me are similar to each other (reverse scale).*

A.1.7 Context Compatibility
- I was only provided with general recommendations.
- The items recommended to me took my personal context requirements into consideration.
- The recommendations are timely.

A2. Interaction Adequacy
- The recommender provides an adequate way for me to express my preferences.
- The recommender provides an adequate way for me to revise my preferences.
- The recommender explains why the products are recommended to me.*

A3. Interface Adequacy
- The recommender's interface provides sufficient information.
- The information provided for the recommended items is sufficient for me.
- The labels of the recommender interface are clear and adequate.
- The layout of the recommender interface is attractive and adequate.*

A4. Perceived Ease of Use

A.4.1 Ease of Initial Learning

(Proceedings of the ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI), Barcelona, Spain, Sep 30, 2010. Published by CEUR-WS.org, ISSN 1613-0073, ceur-ws.org/Vol-612/paper3.pdf)


Pu P, Chen L. A User-Centric Evaluation Framework of Recommender Systems. In: ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces; 2010:14-22.

Page 57: Recommandation sociale : filtrage collaboratif et par le contenu


Our overall motivation for this research was to understand the crucial factors that influence the user adoption of recommenders. Another motivation is to come up with a subjective evaluation questionnaire that other researchers and practitioners can employ. However, it is unlikely that a 60-item questionnaire can be administered for a quick and easy evaluation. This has motivated us in proposing a simplified model based on our past research. Between 2005 and 2010, we have administered 11 subjective questionnaires on a total of 807 subjects [4,5,6,12,13,14,23,24]. Initial questionnaires covered some of the four categories identified in the ResQue. As we conducted more experiments, we became more convinced of the four categories and used all of them in recent studies. On average, between 12 and 15 questions were used. Based this previous work, we have synthesized and organized a total of 15 questions as a simplified model for the purpose of performing a quick and easy usability and adoption evaluation of a recommender (see questions with * sign).

5. CONCLUSION AND FUTURE WORK User evaluation of recommender systems is a crucial subject of study that requires a deep understanding, development and testing of the right dimensions (or constructs) and the standardization of the questions used. The framework described in this paper presents the first attempt to develop a complete and balanced evaluation framework that measures users’ subjective attitudes based on their experience towards a recommender. ResQue consists of a set of 13 constructs and 60 questions for a high-quality recommender system from the user point of view and can be used as a standard guideline for a user evaluation. It can also be adapted to a custom-made user evaluation by tailoring it in an individual research context. Researchers and practitioners can use these questionnaires with ease to measure users’ general satisfaction with recommenders, their readiness to adopt the technology, and their intention to purchase recommended items and return to the site in the future. After ResQue was finalized, we asked several expert researchers in the community of recommender systems to review the model. Their feedback and comments were then incorporated into the final version of the model. This method, known as the Delphi method, is one of the first validation attempts on the model. Since the work was submitted, we have started conducting a survey to further validate the model’s reliability, validity and sensitivity using factor analysis, structural equation modeling (SEM), and other techniques described in [21]. Initial results based on 150 participants indicate how the model can be interpreted and show factors that correspond to the original model. At the same time, analysis also gives some indications on how to refine the model. More users are expected to participate in the survey and the final outcome will be soon reported.

APPENDIX A. Constructs and Questions of ResQue

The following contains the questionnaire statements that can be used in a survey. They are developed based on the ResQue model described in this paper. Users should be asked to indicate their answers to each of the questions on a 1–5 Likert scale, where 1 indicates "strongly disagree" and 5 indicates "strongly agree" (a scoring sketch is given after the questionnaire).

A1. Quality of Recommended Items

A.1.1 Accuracy
• The items recommended to me matched my interests.*
• The recommender gave me good suggestions.
• I am not interested in the items recommended to me (reverse scale).

A.1.2 Relative Accuracy
• The recommendation I received better fits my interests than what I may receive from a friend.
• A recommendation from my friends better suits my interests than the recommendation from this system (reverse scale).

A.1.3 Familiarity
• Some of the recommended items are familiar to me.
• I am not familiar with the items that were recommended to me (reverse scale).

A.1.4 Attractiveness
• The items recommended to me are attractive.

A.1.5 Enjoyability
• I enjoyed the items recommended to me.

A.1.6 Novelty
• The items recommended to me are novel and interesting.*
• The recommender system is educational.
• The recommender system helps me discover new products.
• I could not find new items through the recommender (reverse scale).

A.1.7 Diversity
• The items recommended to me are diverse.*
• The items recommended to me are similar to each other (reverse scale).*

A.1.8 Context Compatibility
• I was only provided with general recommendations.
• The items recommended to me took my personal context requirements into consideration.
• The recommendations are timely.

A2. Interaction Adequacy
• The recommender provides an adequate way for me to express my preferences.
• The recommender provides an adequate way for me to revise my preferences.
• The recommender explains why the products are recommended to me.*

A3. Interface Adequacy
• The recommender's interface provides sufficient information.
• The information provided for the recommended items is sufficient for me.
• The labels of the recommender interface are clear and adequate.
• The layout of the recommender interface is attractive and adequate.*

A4. Perceived Ease of Use

A.4.1 Ease of Initial Learning


• I became familiar with the recommender system very quickly.
• I easily found the recommended items.
• Looking for a recommended item required too much effort (reverse scale).

A.4.2 Ease of Preference Elicitation
• I found it easy to tell the system about my preferences.
• It is easy to learn to tell the system what I like.
• It required too much effort to tell the system what I like (reverse scale).

A.4.3 Ease of Preference Revision
• I found it easy to make the system recommend different things to me.
• It is easy to train the system to update my preferences.
• I found it easy to alter the outcome of the recommended items due to my preference changes.
• It is easy for me to inform the system if I dislike/like the recommended item.
• It is easy for me to get a new set of recommendations.

A.4.4 Ease of Decision Making
• Using the recommender to find what I like is easy.
• I was able to take advantage of the recommender very quickly.
• I quickly became productive with the recommender.
• Finding an item to buy with the help of the recommender is easy.*
• Finding an item to buy, even with the help of the recommender, consumes too much time.

A5. Perceived Usefulness
• The recommended items effectively helped me find the ideal product.*
• The recommended items influence my selection of products.
• I feel supported to find what I like with the help of the recommender.*
• I feel supported in selecting the items to buy with the help of the recommender.

A6. Control/Transparency
• I feel in control of telling the recommender what I want.
• I don't feel in control of telling the system what I want.
• I don't feel in control of specifying and changing my preferences (reverse scale).
• I understood why the items were recommended to me.
• The system helps me understand why the items were recommended to me.
• The system seems to control my decision process rather than me (reverse scale).

A7. Attitudes
• Overall, I am satisfied with the recommender.*
• I am convinced of the products recommended to me.*
• I am confident I will like the items recommended to me.*
• The recommender made me more confident about my selection/decision.
• The recommended items made me confused about my choice (reverse scale).
• The recommender can be trusted.

A8. Behavioral Intentions

A.8.1 Intention to Use the System
• If a recommender such as this exists, I will use it to find products to buy.

A.8.2 Continuance and Frequency
• I will use this recommender again.*
• I will use this type of recommender frequently.
• I prefer to use this type of recommender in the future.

A.8.3 Recommendation to Friends
• I will tell my friends about this recommender.*

A.8.4 Purchase Intention
• I would buy the items recommended, given the opportunity.*
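To make the scoring step concrete, here is a minimal sketch, not taken from the ResQue paper, of how responses to such a questionnaire might be aggregated: reverse-scale items are flipped on the 1–5 Likert scale and then the items of each construct are averaged. The construct names, item keys and data layout are illustrative assumptions.

# Minimal sketch (illustrative): aggregate 1-5 Likert responses into
# per-construct scores, flipping reverse-scale items before averaging.

def construct_scores(responses, constructs, reverse_items):
    """responses: {item_id: rating in 1..5}
    constructs: {construct_name: [item_id, ...]}
    reverse_items: set of item_ids marked as reverse scale."""
    scores = {}
    for name, items in constructs.items():
        values = []
        for item in items:
            if item not in responses:
                continue
            r = responses[item]
            # Flip reverse-scale items: 1 <-> 5, 2 <-> 4, 3 stays 3.
            values.append(6 - r if item in reverse_items else r)
        if values:
            scores[name] = sum(values) / len(values)
    return scores

# Hypothetical usage with two constructs and one reverse-scale item.
constructs = {"Accuracy": ["acc1", "acc2", "acc3_rev"],
              "Perceived Usefulness": ["use1", "use2"]}
reverse_items = {"acc3_rev"}
answers = {"acc1": 4, "acc2": 5, "acc3_rev": 2, "use1": 4, "use2": 3}
print(construct_scores(answers, constructs, reverse_items))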

6. REFERENCES

[1] Adomavicius, G. and Tuzhilin, A. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734-749.

[2] Beenen, G., Ling, K., Wang, X., Chang, K., Frankowski, D., Resnick, P., et al. 2004. Using social psychology to motivate contributions to online communities. In CSCW '04: Proceedings of the ACM Conference on Computer Supported Cooperative Work. New York: ACM Press.

[3] Castagnos, S., Jones, N., and Pu, P. 2009. Recommenders' Influence on Buyers' Decision Process. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys 2009), 361-364.

[4] Chen, L. and Pu, P. 2006. Trust Building with Explanation Interfaces. In Proceedings of the International Conference on Intelligent User Interfaces (IUI'06), 93-100.

[5] Chen, L. and Pu, P. 2008. A Cross-Cultural User Evaluation of Product Recommender Interfaces. RecSys 2008, 75-82.

[6] Chen, L. and Pu, P. 2009. Interaction Design Guidelines on Critiquing-based Recommender Systems. User Modeling and User-Adapted Interaction Journal (UMUAI), Springer Netherlands, Volume 19, Issue 3, 167-206.

[7] Davis, F.D. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quart. 13, 319-339.

[8] Grabner-Kräuter, S. and Kaluscha, E.A. 2003. Empirical research in on-line trust: a review and critical assessment. Int. J. Hum.-Comput. Stud. (IJMMS) 58(6), 783-812.

[9] Herlocker, J.L., Konstan, J.A., Borchers, A., and Riedl, J. 1999. An algorithmic framework for performing collaborative filtering. In Proc. of ACM SIGIR 1999, ACM Press, 230-237.

[10] Herlocker, J.L., Konstan, J.A., and Riedl, J. 2000. Explaining collaborative filtering recommendations. CSCW 2000, 241-250.


Pu, P. and Chen, L. A User-Centric Evaluation Framework of Recommender Systems. In: ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces; 2010: 14-22.

Page 58: Recommandation sociale : filtrage collaboratif et par le contenu

58


Page 59: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 59

Page 60: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 60

Page 61: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Evaluation measures

— Prediction quality: Mean Absolute Error, Root Mean Squared Error, Coverage

— Recommendation quality: Precision, Recall, F1-Measure

61

[…] subsystems. General publications and reviews also exist which include the most commonly accepted evaluation measures: mean absolute error, coverage, precision, recall and derivatives of these: mean squared error, normalized mean absolute error, ROC and fallout. Goldberg et al. [87] focus on aspects not related to the evaluation; Breese et al. [43] compare the predictive accuracy of various methods in a set of representative problem domains.

The majority of articles discuss attempted improvements to the accuracy of RS results (RMSE, MAE, etc.). It is also common to attempt an improvement in recommendations (precision, recall, ROC, etc.). However, additional objectives should be considered for generating greater user satisfaction [253], such as topic diversification, coverage and serendipity.

Currently, the field has a growing interest in generating algorithms with diverse and innovative recommendations, even at the expense of accuracy and precision. To evaluate these aspects, various metrics have been proposed to measure recommendation novelty and diversity [105,220].

Frameworks aid in defining and standardizing the methods and algorithms employed by RS as well as the mechanisms to evaluate the quality of the results. Among the most significant papers that propose CF frameworks are Herlocker et al. [92], which evaluates similarity weight, significance weighting, variance weighting, neighborhood selection and rating normalization; Hernández and Gaudioso [95], who propose a framework in which any RS is formed by two different subsystems, one to guide the user and the other to provide useful/interesting items; Koutrika et al. [125], a framework which introduces levels of abstraction in the CF process, making modifications to the RS more flexible; and Antunes et al. [12], who present an evaluation framework assuming that evaluation is an evolving process during the system lifecycle.

The majority of RS evaluation frameworks proposed until now present two deficiencies. The first is the lack of formalization: although the evaluation metrics are well defined, there are a variety of details in the implementation of the methods which, if not specified, can lead to different results in similar experiments. The second deficiency is the absence of standardization of the evaluation measures in aspects such as novelty and trust of the recommendations.

Bobadilla et al. [32] provide a complete series of mathematical formalizations based on set theory. The authors provide a set of evaluation measures, which include the quality analysis of the following aspects: predictions, recommendations, novelty and trust.

Presented next is a representative selection of the RS evaluation quality measures most often used in the bibliography.

4.1. Quality of the predictions: mean absolute error, accuracy and coverage

In order to measure the accuracy of the results of an RS, it is usual to compute some of the most common prediction error metrics, amongst which the Mean Absolute Error (MAE) and its related metrics (mean squared error, root mean squared error, and normalized mean absolute error) stand out.

We define $U$ as the set of RS users, $I$ as the set of RS items, $r_{u,i}$ the rating of user $u$ on item $i$, $\bullet$ the lack of rating ($r_{u,i} = \bullet$ means user $u$ has not rated item $i$), and $p_{u,i}$ the prediction of item $i$ for user $u$.

Let $O_u = \{ i \in I \mid p_{u,i} \neq \bullet \wedge r_{u,i} \neq \bullet \}$ be the set of items rated by user $u$ for which prediction values exist. We define the MAE and RMSE of the system as the average of the per-user MAE (resp. RMSE). The absolute difference between prediction and real value, $|p_{u,i} - r_{u,i}|$, informs about the error in the prediction.

$$MAE = \frac{1}{\#U} \sum_{u \in U} \left( \frac{1}{\#O_u} \sum_{i \in O_u} |p_{u,i} - r_{u,i}| \right) \qquad (1)$$

$$RMSE = \frac{1}{\#U} \sum_{u \in U} \sqrt{ \frac{1}{\#O_u} \sum_{i \in O_u} (p_{u,i} - r_{u,i})^2 } \qquad (2)$$

Coverage can be defined as the capacity of predicting from a metric applied to a specific RS. In short, it calculates the percentage of situations in which at least one k-neighbor of each active user can rate an item that has not yet been rated by that active user. We define $K_{u,i}$ as the set of neighbors of $u$ which have rated item $i$, and the coverage of the system as the average of the per-user coverage. Let

$$C_u = \{ i \in I \mid r_{u,i} = \bullet \wedge K_{u,i} \neq \emptyset \}, \qquad D_u = \{ i \in I \mid r_{u,i} = \bullet \}$$

$$coverage = \frac{1}{\#U} \sum_{u \in U} \left( 100 \times \frac{\#C_u}{\#D_u} \right) \qquad (3)$$

4.2. Quality of the set of recommendations: precision, recall and F1

The confidence of users in a certain RS does not depend directly on the accuracy over the set of possible predictions. A user gains confidence in the RS when this user agrees with a reduced set of recommendations made by the RS.

In this section, we define the three most widely used recommendation quality measures: (1) precision, which indicates the proportion of relevant recommended items with respect to the total number of recommended items, (2) recall, which indicates the proportion of relevant recommended items with respect to the number of relevant items, and (3) F1, which is a combination of precision and recall.

Let $X_u$ be the set of recommendations to user $u$, and $Z_u$ the set of $n$ test recommendations to user $u$. We express the precision, recall and F1 evaluation measures for recommendations obtained by making $n$ test recommendations to user $u$, taking a relevancy threshold $\theta$ and assuming that all users accept the $n$ test recommendations:

$$precision = \frac{1}{\#U} \sum_{u \in U} \frac{\#\{ i \in Z_u \mid r_{u,i} \geq \theta \}}{n} \qquad (4)$$

$$recall = \frac{1}{\#U} \sum_{u \in U} \frac{\#\{ i \in Z_u \mid r_{u,i} \geq \theta \}}{\#\{ i \in Z_u \mid r_{u,i} \geq \theta \} + \#\{ i \in Z_u^c \mid r_{u,i} \geq \theta \}} \qquad (5)$$

$$F_1 = \frac{2 \times precision \times recall}{precision + recall} \qquad (6)$$
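A corresponding sketch of equations (4)–(6), averaging precision and recall over users for top-n lists with a relevancy threshold theta; the function name and data layout are illustrative assumptions.

# Sketch of Eqs. (4)-(6): precision/recall at n with relevancy threshold theta,
# averaged over users, followed by F1.

def precision_recall_f1(recommended, ratings, n, theta=4):
    """recommended: {user: [item, ...]} (top-n list Z_u); ratings: {user: {item: value}}."""
    prec_sum, rec_sum, n_users = 0.0, 0.0, 0
    for u, z_u in recommended.items():
        z_u = z_u[:n]
        r_u = ratings.get(u, {})
        relevant_in_list = sum(1 for i in z_u if r_u.get(i, 0) >= theta)
        relevant_total = sum(1 for r in r_u.values() if r >= theta)  # relevant in Z_u and in its complement
        prec_sum += relevant_in_list / n
        rec_sum += relevant_in_list / relevant_total if relevant_total else 0.0
        n_users += 1
    if not n_users:
        return 0.0, 0.0, 0.0
    precision, recall = prec_sum / n_users, rec_sum / n_users
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1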

4.3. Quality of the list of recommendations: rank measures

When the number $n$ of recommended items is not small, users give greater importance to the first items on the list of recommendations. Mistakes on these items are more serious errors than those on the last items of the list. Ranking measures take this situation into account. Among the ranking measures most often used are the following standard information retrieval measures: (a) half-life (7) [43], which assumes an exponential decrease in the interest of users as they move away from the top recommendations, and (b) discounted cumulative gain (8) [17], where the decay is logarithmic.

$$HL = \frac{1}{\#U} \sum_{u \in U} \sum_{i=1}^{N} \frac{\max(r_{u,p_i} - d, 0)}{2^{(i-1)/(\alpha - 1)}} \qquad (7)$$

$$DCG_k = \frac{1}{\#U} \sum_{u \in U} \left( r_{u,p_1} + \sum_{i=2}^{k} \frac{r_{u,p_i}}{\log_2(i)} \right) \qquad (8)$$

(Excerpt from J. Bobadilla et al., Knowledge-Based Systems 46 (2013), 109–132.)
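The rank-based measures (7) and (8) can be sketched as follows for a single user's ranked list; the per-user values are then averaged over all users as in the formulas. The parameter meanings (d as default rating, alpha as the half-life rank) follow the excerpt, while the default values and function names are illustrative assumptions.

import math

# Sketch of Eqs. (7)-(8) for one user: half-life utility (exponential decay)
# and DCG at rank k (logarithmic decay). Average over users for system-level values.

def half_life(true_ratings, d=3.0, alpha=5.0):
    """true_ratings: ratings r_{u,p_i} in ranked order p_1..p_N; d: default rating; alpha: half-life rank."""
    return sum(max(r - d, 0.0) / 2 ** ((i - 1) / (alpha - 1))
               for i, r in enumerate(true_ratings, start=1))

def dcg_at_k(true_ratings, k):
    """DCG_k for one user: first item undiscounted, then r / log2(rank)."""
    ranked = true_ratings[:k]
    if not ranked:
        return 0.0
    return ranked[0] + sum(r / math.log2(i) for i, r in enumerate(ranked[1:], start=2))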


Page 62: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Evaluation measures (2)

— Quality of a list of recommendations (rank-based): DCG at rank k: the gain contributed by an item is inversely related to its position in the list

. computed for each user (u), then averaged over all users

nDCG is the normalized version with respect to the "ideal DCG" (ideal list) — see the sketch below

— Novelty and diversity

62
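As noted above, nDCG divides the DCG of the produced list by the DCG of the ideal ordering of the same relevance values. A minimal, self-contained sketch (function names are illustrative):

import math

# Sketch of nDCG@k: DCG of the recommended ordering divided by the DCG of the
# ideal ordering (same relevance values sorted in decreasing order).

def dcg(relevances, k):
    rel = relevances[:k]
    if not rel:
        return 0.0
    return rel[0] + sum(r / math.log2(i) for i, r in enumerate(rel[1:], start=2))

def ndcg_at_k(relevances, k):
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0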


In equations (7) and (8), $p_1, \ldots, p_n$ represents the recommendation list, $r_{u,p_i}$ the true rating of user $u$ for item $p_i$, $k$ the rank of the evaluated item, $d$ the default rating, and $\alpha$ the rank of the item on the list such that there is a 50% chance the user will review that item.

4.4. Novelty and diversity

The novelty evaluation measure indicates the degree of difference between the items recommended to, and known by, the user. The diversity quality measure indicates the degree of differentiation among recommended items.

Currently, novelty and diversity measures do not have a standard; therefore, different authors propose different metrics [163,220]. Certain authors [105] have used the following:

$$diversity_{Z_u} = \frac{1}{\#Z_u (\#Z_u - 1)} \sum_{i \in Z_u} \sum_{j \in Z_u, j \neq i} \left[ 1 - sim(i,j) \right] \qquad (9)$$

$$novelty_i = \frac{1}{\#Z_u - 1} \sum_{j \in Z_u} \left[ 1 - sim(i,j) \right], \quad i \in Z_u \qquad (10)$$

Here, $sim(i,j)$ indicates an item-to-item memory-based CF similarity measure, and $Z_u$ the set of $n$ recommendations to user $u$.
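A small sketch of equations (9) and (10), given some item-to-item similarity function; the sim callable is an assumption and any CF item similarity returning values in [0, 1] could be plugged in.

# Sketch of Eqs. (9)-(10): list diversity and per-item novelty from pairwise
# item similarities sim(i, j) in [0, 1].

def diversity(z_u, sim):
    """z_u: list of recommended items; sim: callable sim(i, j) -> similarity in [0, 1]."""
    n = len(z_u)
    if n < 2:
        return 0.0
    total = sum(1 - sim(i, j) for i in z_u for j in z_u if j != i)
    return total / (n * (n - 1))

def novelty(item, z_u, sim):
    """Novelty of one recommended item with respect to the rest of the list Z_u."""
    others = [j for j in z_u if j != item]
    return sum(1 - sim(item, j) for j in others) / len(others) if others else 0.0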

4.5. Stability

The stability of the predictions and recommendations influences the users' trust towards the RS. An RS is stable if the predictions it provides do not change strongly over a short period of time. Adomavicius and Zhang [4] propose a quality measure of stability, MAS (Mean Absolute Shift). This measure is defined through a set of known ratings, $R_1$, and a set of predictions of all unknown ratings, $P_1$. Over an interval of time, users of the RS will have rated a subset $S$ of these unknown ratings and the RS can now make new predictions, $P_2$. MAS is defined as follows:

$$stability = MAS = \frac{1}{|P_2|} \sum_{(u,i) \in P_2} |P_2(u,i) - P_1(u,i)| \qquad (11)$$
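Equation (11) can be sketched as a mean absolute shift between two prediction snapshots taken at different times; the dictionary layout keyed by (user, item) pairs is an assumption for illustration.

# Sketch of Eq. (11): Mean Absolute Shift between two snapshots of predictions
# P1 and P2 made at different times for the same (user, item) pairs.

def mean_absolute_shift(p1, p2):
    """p1, p2: {(user, item): predicted rating}; averaged over the pairs present in p2."""
    common = [k for k in p2 if k in p1]
    return sum(abs(p2[k] - p1[k]) for k in common) / len(common) if common else 0.0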

4.6. Reliability

The reliability of a prediction or a recommendation informs about how seriously we may consider this prediction. When an RS recommends an item to a user with prediction 4.5 on a scale {1, ..., 5}, this user hopes to be satisfied by this item. However, this prediction value (4.5 out of 5) does not reflect the degree of certainty with which the RS has concluded that the user will like this item. Indeed, this prediction of 4.5 is much more reliable if it has been obtained by means of 200 similar users than if it has been obtained from only two similar users.

In Hernando et al. [96], a reliability measure is proposed according to the usual notion that the more reliable a prediction, the less liable it is to be wrong. Although this reliability measure is not a quality measure used for comparing different RS techniques through cross validation, it can be regarded as a quality measure associated to a prediction and a recommendation. In this way, the RS provides a pair of values (prediction value, reliability value), through which users may balance their preferences: for example, users would probably prefer the option (4, 0.9) to the option (4.5, 0.1). Consequently, the reliability measure proposed in Hernando et al. [96] provides a new understandable factor, which users may consider when taking their decisions. Nevertheless, the use of this reliability measure is constrained to those RS based on the kNN algorithm.

The definition of the reliability of the prediction $p_{u,i}$ is based on two numeric factors: $s_{u,i}$ and $v_{u,i}$. $s_{u,i}$ measures the similarity of the neighbors used for making the prediction $p_{u,i}$; $v_{u,i}$ measures the degree of disagreement between these neighbors when rating the item $i$. The similarity factor is defined as follows:

$$f_S(s_{u,i}) = 1 - \frac{\bar{s}}{\bar{s} + s_{u,i}}, \qquad s_{u,i} = \sum_{v \in K_{u,i}} sim(u,v) \qquad (12)$$

Fig. 7. Recommender systems evaluation process.

(Excerpt from J. Bobadilla et al., Knowledge-Based Systems 46 (2013), 109–132.)

p1, . . . ,pn represents the recommendation list, ru,pi representsthe true rating of the user u for the item pi, k is the rank of the eval-uated item, d is the default rating, a is the number of the item onthe list such that there is a 50% chance the user will review thatitem.

4.4. Novelty and diversity

The novelty evaluation measure indicates the degree of differ-ence between the items recommended to and known by the user.The diversity quality measure indicates the degree of differentia-tion among recommended items.

Currently, novelty and diversity measures do not have a stan-dard; therefore, different authors propose different metrics[163,220]. Certain authors have [105] used the following:

diversityZu¼

1#Zuð#Zu # 1Þ

X

i2Zu

X

j2Zu ;j–i

½1# simði; jÞ& ð9Þ

noveltyi ¼1

#Zu # 1

X

j2Zu

½1# simði; jÞ&; i 2 Zu ð10Þ

Here, sim(i, j) indicates item to item memory-based CF similar-ity measures. Zu indicates the set of n recommendations to user u.

4.5. Stability

The stability in the predictions and recommendations influ-ences on the users’ trust towards the RS. A RS is stable if the pre-dicitions it provides do not change strongly over a short periodof time. Adomavicius and Zhang [4] propose a quality measure ofstability, MAS (Mean Absolute Shift). This measure is definedthrough a set of known ratings R1 and a set of predictions of all un-known ratings, P1. For an interval of time, users of the RS will haverated a subset S of these unknown ratings and the RS can nowmake new predictions, P2. MAS is defined as follows:

stability ¼MAS ¼ 1jP2j

X

ðu;iÞ2P2

jP2ðu; iÞ # P1ðu; iÞj ð11Þ

4.6. Reliability

The reliability of a prediction or a recommendation informsabout how seriously we may consider this prediction. When RSrecommends an item to a user with prediction 4.5 in a scale{1, . . . ,5}, this user hopes to be satisfied by this item. However, thisvalue of prediction (4.5 over 5) does not reflect with which certaindegree the RS has concluded that the user will like this item (withvalue 4.5 over 5). Indeed, this prediction of 4.5 is much more reli-able if it has obtained by means of 200 similar users than if it hasobtained by only two similar users.

In Hernando et al. [96], a realibility measure is proposed accord-ing the usual notion that the more reliable a prediction, the less lia-ble to be wrong. Although this reliability measure is not a qualitymeasure used for comparing different techniques of RS throughcross validation, this can be regarded as a quality measure associ-ated to a prediction and a recommendation. In this way, the RS pro-vides a pair of values (prediction value, reliability value), throughwhich users may balance its preference: for example users wouldprobably prefer the option (4,0.9) to the option (4.5,0.1). Conse-quently, the reliability measure proposed in Hernando et al. [96]provides a new understandable factor, which users may considerfor taking its decisions. Nevertheless, the use of this reliabilitymeasure is just constrained to those RS based on the kNNalgorithm.

The definition of reliability on the prediction, pu,i, isbased on two numeric factors: su,i and vu,i. su,i measures the similar-ity of the neighbors used for making the prediction pu,i; vu,i

measures the degree of disagreement between these neighborsrating the item i. Finally, the reliablity measure is defined asfollows:

fSðsu;iÞ ¼ 1#!s

!sþ su;i; su;i ¼

X

v2Ku;i

simðu;vÞ ð12Þ

where

fSðsu;iÞ ¼ 1#!s

!sþ su;i; su;i ¼

X

v2Ku;i

simðu;vÞ ð13Þ

Fig. 7. Recommender systems evaluation process.

118 J. Bobadilla et al. / Knowledge-Based Systems 46 (2013) 109–132

p1, . . . ,pn represents the recommendation list, ru,pi representsthe true rating of the user u for the item pi, k is the rank of the eval-uated item, d is the default rating, a is the number of the item onthe list such that there is a 50% chance the user will review thatitem.

4.4. Novelty and diversity

The novelty evaluation measure indicates the degree of difference between the items recommended to and known by the user. The diversity quality measure indicates the degree of differentiation among recommended items.

Currently, novelty and diversity measures do not have a standard; therefore, different authors propose different metrics [163,220]. Certain authors [105] have used the following:

diversity_{Z_u} = \frac{1}{\#Z_u(\#Z_u - 1)} \sum_{i \in Z_u} \sum_{j \in Z_u, j \neq i} [1 - sim(i,j)]   (9)

novelty_i = \frac{1}{\#Z_u - 1} \sum_{j \in Z_u} [1 - sim(i,j)], \quad i \in Z_u   (10)

Here, sim(i, j) indicates an item-to-item memory-based CF similarity measure. Z_u indicates the set of n recommendations made to user u.
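To make Eqs. (9) and (10) concrete, here is a minimal Python sketch that computes the diversity of a recommendation list Z_u and the novelty of each of its items; the item-to-item similarities in SIM are purely illustrative stand-ins for a memory-based CF similarity.

```python
from itertools import permutations

# Toy item-to-item similarities (illustrative, symmetric, in [0, 1]).
SIM = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.3}

def sim(i, j):
    """Item-to-item similarity, 1 for identical items."""
    if i == j:
        return 1.0
    return SIM.get((i, j), SIM.get((j, i), 0.0))

def diversity(Z):
    """Eq. (9): average dissimilarity over all ordered pairs of distinct items in Z."""
    n = len(Z)
    return sum(1 - sim(i, j) for i, j in permutations(Z, 2)) / (n * (n - 1))

def novelty(i, Z):
    """Eq. (10): average dissimilarity between item i and the other items of Z."""
    return sum(1 - sim(i, j) for j in Z if j != i) / (len(Z) - 1)

Zu = ["a", "b", "c"]
print(diversity(Zu))                      # 0.533...
print({i: novelty(i, Zu) for i in Zu})
```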

4.5. Stability

The stability of the predictions and recommendations influences the users' trust towards the RS. An RS is stable if the predictions it provides do not change strongly over a short period of time. Adomavicius and Zhang [4] propose a quality measure of stability, MAS (Mean Absolute Shift). This measure is defined through a set of known ratings R1 and a set of predictions of all unknown ratings, P1. Over an interval of time, users of the RS will have rated a subset S of these unknown ratings and the RS can now make new predictions, P2. MAS is defined as follows:

stability = MAS = \frac{1}{|P_2|} \sum_{(u,i) \in P_2} |P_2(u,i) - P_1(u,i)|   (11)
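A small sketch of Eq. (11): given the predictions P1 made before a time interval and the predictions P2 made after it for the same (user, item) pairs, MAS is the mean absolute shift between the two; the prediction values below are illustrative.

```python
# Predictions at two moments in time, keyed by (user, item); values are illustrative.
P1 = {("u1", "i1"): 4.2, ("u1", "i2"): 3.1, ("u2", "i1"): 2.5}
P2 = {("u1", "i1"): 4.0, ("u1", "i2"): 3.4, ("u2", "i1"): 2.5}

def mean_absolute_shift(p1, p2):
    """Eq. (11): average |P2(u,i) - P1(u,i)| over the pairs predicted at t2."""
    return sum(abs(p2[k] - p1[k]) for k in p2) / len(p2)

print(mean_absolute_shift(P1, P2))  # (0.2 + 0.3 + 0.0) / 3 = 0.1666...
```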

4.6. Reliability

The reliability of a prediction or a recommendation informs us about how seriously we may consider this prediction. When an RS recommends an item to a user with prediction 4.5 on a scale {1, ..., 5}, this user hopes to be satisfied by this item. However, this prediction value (4.5 out of 5) does not reflect the degree of certainty with which the RS has concluded that the user will like this item. Indeed, a prediction of 4.5 is much more reliable if it has been obtained from 200 similar users than if it has been obtained from only two similar users.

In Hernando et al. [96], a reliability measure is proposed according to the usual notion that the more reliable a prediction, the less liable it is to be wrong. Although this reliability measure is not a quality measure used for comparing different RS techniques through cross validation, it can be regarded as a quality measure associated with a prediction and a recommendation. In this way, the RS provides a pair of values (prediction value, reliability value), through which users may balance their preferences: for example, users would probably prefer the option (4, 0.9) to the option (4.5, 0.1). Consequently, the reliability measure proposed in Hernando et al. [96] provides a new understandable factor that users may consider when taking their decisions. Nevertheless, the use of this reliability measure is constrained to those RS based on the kNN algorithm.

The definition of the reliability of the prediction p_{u,i} is based on two numeric factors: s_{u,i} and v_{u,i}. s_{u,i} measures the similarity of the neighbors used for making the prediction p_{u,i}; v_{u,i} measures the degree of disagreement between these neighbors when rating item i. Finally, the reliability measure is defined as follows:

f_S(s_{u,i}) = 1 - \frac{\bar{s}}{\bar{s} + s_{u,i}}, \quad \text{where} \quad s_{u,i} = \sum_{v \in K_{u,i}} sim(u,v)   (12)

Fig. 7. Recommender systems evaluation process.

J. Bobadilla et al. / Knowledge-Based Systems 46 (2013) 109–132

Page 63: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Evaluation measures (3)

— Other, user-oriented measures

— Relevance (accuracy) as perceived by the user

— Familiarity: the items (their existence) are already known to the users

— Novelty: discovery of new items

— Attractiveness: the items attract the users (not always the case for relevant items…)

— Usefulness: the items were appreciated (after use / reading)

— Compatibility with the user's context

— Level of interaction

— Control over the parameters

— Explanations of the recommendation

— Transparency of the method

63

Pu P., Chen L. A User-Centric Evaluation Framework of Recommender Systems. In: ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces; 2010: 14-22.

Page 64: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

COLLABORATIVE FILTERING

64

Page 65: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 65

Page 66: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Collaborative filtering

We are "social" beings

— "Others" dictate / influence our choices

— Our relationships are typed (friends / enemies, family, professional relationships…)

— "Tell me who your friends are, and I will tell you who you are" — homophily

66

Table 3. Sets used in the formalization (name, description): U users; I items; V rating values (min, max); R_u ratings of a user; K_u neighborhoods of a user; P_u predictions for a user; X_u / Z_u top (N) recommended items for a user; A_{x,y} items rated simultaneously by users x and y; G_{u,i} neighbors of a user who have rated item i; B_{u,i} users who have voted for item i (except the user); plus the trust, test and recent-votes sets used in the experiments.

Table 5. Running example: user-to-user similarities (MSD).

Table 4. Running example: RS ratings database (5 users U1..U5, 14 items i1..i14, ratings in {1, ..., 5}, • = not rated).

2.3. Obtaining a user's K-neighbors

2.3.1. Formalization. We define K_u as the set of K neighbors of the user u. The following must be true:

K_u \subset U \;\wedge\; \#K_u = K \;\wedge\; u \notin K_u   (10)

\forall x \in K_u, \forall y \in (U - K_u), \; sim(u,x) \geq sim(u,y)   (11)

2.3.2. Running example Table 6 shows the sets of neighbors using K= 2 and K= 3:

2.4. Prediction of the value of an item

2.4.1. Formalization. Based on the information provided by the K-neighbors of a user u, the CF process enables the value of an item to be predicted as follows:

let P_u = \{(i,p) \mid i \in I, p \in \mathbb{R}\}, the set of predictions to the user u (\mathbb{R}: real numbers)   (12)

We will assign the value of the prediction p made to user u on item i as p_{u,i} = p   (13)


Table 6. Running example: 2 and 3 neighbors of each user.

K=2: {U3,U4}, {U5,U4}, {U1,U4}, {U1,U3}, {U3,U2}
K=3: {U3,U4,U5}, {U5,U4,U1}, {U1,U4,U2}, {U1,U3,U5}, {U3,U2,U4}

Once the set of K users (neighbors) similar to active u has been calculated (Ku), in order to obtain the prediction of item i on user u(12), one of the following aggregation approaches is often used: the average (15), the weighted sum (16) and the adjusted weighted aggregation (Deviation-From-Mean) (17).

Let G_{u,i} = \{n \in K_u \mid \exists\, r_{n,i} \neq \bullet\}   (14)

p_{u,i} = \frac{1}{\#G_{u,i}} \sum_{n \in G_{u,i}} r_{n,i} \iff G_{u,i} \neq \emptyset   (15)

p_{u,i} = \mu_{u,i} \sum_{n \in G_{u,i}} sim(u,n)\, r_{n,i} \iff G_{u,i} \neq \emptyset   (16)

p_{u,i} = \bar{r}_u + \mu_{u,i} \sum_{n \in G_{u,i}} sim(u,n)\,(r_{n,i} - \bar{r}_n) \iff G_{u,i} \neq \emptyset   (17)

where \mu serves as a normalizing factor, usually computed:

\mu_{u,i} = 1 \Big/ \sum_{n \in G_{u,i}} sim(u,n) \iff G_{u,i} \neq \emptyset   (18)

When it is not possible to make the prediction of an item because none of the K-neighbors has voted for this item, we can decide to make use of the average of the ratings given to that item by all the users of the RS who have voted for it; in this case, Eqs. (14)-(18) are complemented with Eqs. (19)-(23):

where B_{u,i} = \{n \in U \mid n \neq u, r_{n,i} \neq \bullet\}   (19)

p_{u,i} = \frac{1}{\#B_{u,i}} \sum_{n \in B_{u,i}} r_{n,i} \iff G_{u,i} = \emptyset \wedge B_{u,i} \neq \emptyset   (20)

p_{u,i} = \mu_{u,i} \sum_{n \in B_{u,i}} sim(u,n)\, r_{n,i} \iff G_{u,i} = \emptyset \wedge B_{u,i} \neq \emptyset   (21)

p_{u,i} = \bar{r}_u + \mu_{u,i} \sum_{n \in B_{u,i}} sim(u,n)\,(r_{n,i} - \bar{r}_n) \iff G_{u,i} = \emptyset \wedge B_{u,i} \neq \emptyset   (22)

\mu_{u,i} = 1 \Big/ \sum_{n \in B_{u,i}} sim(u,n) \iff G_{u,i} = \emptyset \wedge B_{u,i} \neq \emptyset   (23)

Finally, there exist cases in RS in which it is impossible to make predictions on some items because no other user has voted for them:

p_{u,i} = \bullet \iff G_{u,i} = \emptyset \wedge B_{u,i} = \emptyset   (24)

2.4.2. Running example. By using the simplest prediction Eq. (15) we obtain the predictions that the users can receive using K = 3 neighbors. Table 7 shows these predictions.

Explicit ratings vs. implicit ratings (number of accesses or citations, time spent…)

Ratings to be predicted
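As an illustration of the prediction scheme of Eqs. (14)-(18), the sketch below implements a user-based kNN predictor with an MSD-derived similarity and deviation-from-mean aggregation; the rating matrix, the choice of MSD and K = 2 are illustrative assumptions, not the configuration of any particular system.

```python
# Minimal user-based CF sketch: MSD-based similarity, K nearest neighbors,
# and deviation-from-mean aggregation (Eq. 17). Ratings are illustrative.
RATINGS = {  # user -> {item: rating on a 1-5 scale}
    "u1": {"i1": 5, "i3": 3, "i5": 4},
    "u2": {"i1": 1, "i3": 2, "i4": 4, "i5": 1},
    "u3": {"i1": 5, "i2": 2, "i3": 4, "i6": 3},
    "u4": {"i1": 4, "i3": 3, "i7": 5},
}

def sim_msd(x, y, r=RATINGS, scale=4.0):
    """Similarity derived from the mean squared difference on co-rated items."""
    common = set(r[x]) & set(r[y])
    if not common:
        return 0.0
    msd = sum(((r[x][i] - r[y][i]) / scale) ** 2 for i in common) / len(common)
    return 1.0 - msd

def k_neighbors(u, k, r=RATINGS):
    others = [v for v in r if v != u]
    return sorted(others, key=lambda v: sim_msd(u, v), reverse=True)[:k]

def predict(u, i, k=2, r=RATINGS):
    """Deviation-from-mean aggregation over the neighbors who have rated item i."""
    mean = lambda v: sum(r[v].values()) / len(r[v])
    G = [n for n in k_neighbors(u, k) if i in r[n]]           # Eq. (14)
    if not G:
        return None                                            # no prediction (Eq. 24)
    norm = sum(sim_msd(u, n) for n in G) or 1.0                # Eq. (18)
    return mean(u) + sum(sim_msd(u, n) * (r[n][i] - mean(n)) for n in G) / norm

print(predict("u1", "i2"))   # prediction for an unrated item
```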

Page 67: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 67

Page 68: Recommandation sociale : filtrage collaboratif et par le contenu

Collaborative filtering: similarities and neighborhoods

68

Variant: item-to-item

Page 69: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Which similarity functions?

- Pearson correlation

- Spearman rank correlation

- Cosine

- Euclidean distance

- More complex metrics:

- JMSD, to integrate non-numerical information (combination of Pearson and Jaccard)

- "Pareto optimum", to filter out the least representative individuals

- Integration of the scores of other individuals / other items

69

publications and reviews also exist which include the most commonly accepted metrics, aggregation approaches and evaluation measures: mean absolute error, coverage, precision, recall and derivatives of these: mean squared error, normalized mean absolute error, ROC and fallout; Goldberg et al. [13] focuses on the aspects not related to the evaluation, Breese et al. [6] compare the predictive accuracy of various methods in a set of representative problem domains. Candillier et al. [7] and Schafer et al. [36] review the main collaborative filtering methods proposed in the literature.

The rest of the paper is structured as follows:

- In Section 2 we provide the basis for the principles on which the design of the new metric will be based, we present graphs which show the way in which the users vote, we carry out experiments which support the decisions made, we establish the best way of selecting numerical and non-numerical information from the votes and, finally, we establish the hypothesis on which the paper and its proposed metric are based.
- In Section 3 we establish the mathematical formulation of the metric.
- In Sections 4 and 5, respectively, we list the experiments that will be carried out and we present and discuss the results obtained.
- Section 6 presents the most relevant conclusions of the publication.

2. Approach and design of the new similarity metric

2.1. Introduction

Collaborative filtering methods work on a table of U users who can rate I items. The prediction of a non-rated item i for a user u is computed as an aggregate of the ratings of the K most similar users (k-neighborhoods) for the same item i, where K_u denotes the set of k-neighborhoods of u and r_{n,i} denotes the value of the rating of user n on item i (\bullet if there is no rating value).

Once the set of K users (neighborhoods) similar to the active user u has been calculated, in order to obtain the prediction of item i for user u, one of the following aggregation approaches is often used: the average (2), the weighted sum (3) and the adjusted weighted aggregation (deviation-from-mean) (4). We will use the auxiliary set G_{u,i} in order to define Eqs. (2)-(5):

G_{u,i} = \{n \in K_u \mid \exists\, r_{n,i} \neq \bullet\}   (1)

p_{u,i} = \frac{1}{\#G_{u,i}} \sum_{n \in G_{u,i}} r_{n,i} \iff G_{u,i} \neq \emptyset   (2)

p_{u,i} = \mu_{u,i} \sum_{n \in G_{u,i}} sim(u,n)\, r_{n,i} \iff G_{u,i} \neq \emptyset   (3)

p_{u,i} = \bar{r}_u + \mu_{u,i} \sum_{n \in G_{u,i}} sim(u,n)\,(r_{n,i} - \bar{r}_n) \iff G_{u,i} \neq \emptyset   (4)

where \mu serves as a normalizing factor, usually computed:

\mu_{u,i} = 1 \Big/ \sum_{n \in G_{u,i}} sim(u,n) \iff G_{u,i} \neq \emptyset   (5)

The most popular similarity metrics are Pearson correlation (6), cosine (7), constrained Pearson's correlation (8) and Spearman rank correlation (9):

sim(x,y) = \frac{\sum_i (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_i (r_{x,i} - \bar{r}_x)^2 \sum_i (r_{y,i} - \bar{r}_y)^2}}   (6)

sim(x,y) = \frac{\sum_i r_{x,i}\, r_{y,i}}{\sqrt{\sum_i r_{x,i}^2}\, \sqrt{\sum_i r_{y,i}^2}}   (7)

sim(x,y) = \frac{\sum_i (r_{x,i} - r_{med})(r_{y,i} - r_{med})}{\sqrt{\sum_i (r_{x,i} - r_{med})^2}\, \sqrt{\sum_i (r_{y,i} - r_{med})^2}}, \quad r_{med}: \text{median value of the rating scale}   (8)

sim(x,y) = \frac{\sum_i (rank_{x,i} - \overline{rank}_x)(rank_{y,i} - \overline{rank}_y)}{\sqrt{\sum_i (rank_{x,i} - \overline{rank}_x)^2 \sum_i (rank_{y,i} - \overline{rank}_y)^2}}   (9)
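For illustration, here is a small sketch of Pearson (Eq. 6) and cosine (Eq. 7) similarities; as a simplifying assumption, both are computed only on the items co-rated by the two users, and the user means are taken over those co-rated items.

```python
from math import sqrt

def co_rated(rx, ry):
    """Pairs of ratings on the items rated by both users (missing ratings are simply absent)."""
    items = set(rx) & set(ry)
    return [(rx[i], ry[i]) for i in items]

def pearson(rx, ry):
    """Eq. (6), restricted to the co-rated items."""
    pairs = co_rated(rx, ry)
    if len(pairs) < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mx) * (b - my) for a, b in pairs)
    den = sqrt(sum((a - mx) ** 2 for a, _ in pairs) * sum((b - my) ** 2 for _, b in pairs))
    return num / den if den else 0.0

def cosine(rx, ry):
    """Eq. (7), restricted to the co-rated items."""
    pairs = co_rated(rx, ry)
    num = sum(a * b for a, b in pairs)
    den = sqrt(sum(a * a for a, _ in pairs)) * sqrt(sum(b * b for _, b in pairs))
    return num / den if den else 0.0

rx = {"i1": 4, "i2": 5, "i4": 3, "i5": 2}   # illustrative user ratings
ry = {"i1": 4, "i2": 3, "i3": 1, "i4": 2}
print(pearson(rx, ry), cosine(rx, ry))
```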

Although Pearson correlation is the most commonly used metric in the process of memory-based CF (user to user), this choice is not always backed by the nature and distribution of the data in the RS. Formally, in order to be able to apply this metric with guarantees, the following assumptions must be met:

- Linear relationship between x and y.
- Continuous random variables.
- Both variables must be normally distributed.

These conditions are not normally met in real RS, and Pearson correlation presents some significant cases of erroneous operation that should not be ignored in RS.

Despite the deficiencies of Pearson correlation, this similarity measure presents the best prediction and recommendation results in CF-based RS [15,16,31,7,35]; furthermore, it is the most commonly used, and therefore, any alternative metric proposed must improve on its results.

On accepting that Pearson correlation is the metric whose results must be improved, but not necessarily the most appropriate to be taken as a base, it is advisable to focus on the information that is obtained in the different research processes and which can sometimes be overlooked when pursuing objectives other than improving the accuracy of RS (cold-start problem, trust and novelty, sparsity, etc.).

The simplest information to give us an idea of the nature of the RS is to find out what users usually vote: do they always tend to vote for the same values? Do they always tend to vote for the same items? Is there much difference between the votes of some users and others?

Fig. 1 shows the distribution of the votes cast in the MovieLens 1M and NetFlix RS (where one can vote in the interval [1..5]). We can see how, on average, the users focus their votes on the higher levels of the interval, but avoiding the extremes, particularly the lower extremes. The distribution of the votes is not balanced, and particularly negative or particularly positive votes are avoided.

Fig. 2 shows the arithmetic average and the standard deviation of the votes cast in the MovieLens 1M and NetFlix databases. Graphs (A) and (B) of Fig. 2 show the number of items that display the arithmetic average specified on the x axis; we can see that there are hardly any items rated, on average, below 2 or above 4, whereby most of the cases are between the values 3 and 4. Graphs (C) and (D) of Fig. 2 show the number of items that display the standard deviation specified on the x axis; we can see that most of the items have been voted by the users, on average, with a maximum difference of 1.2 votes.

According to the figures analyzed, we find that traditional metrics must often achieve results by operating on a set of discrete ratings with very little variation (majority of votes between 3 and 4) and with the obligation of improving simpler and quicker estimations, such as always predicting with the value of the arithmetic average of the votes of each item (for which we know there is seldom a standard deviation higher than 1.2).

2.2. Basic experimentation

After ascertaining the low diversity of the votes cast by the users, it seems reasonable to consider that the votes mainly tend



vote is "not voted", which we represent with the symbol \bullet. All the lists have the same number of elements: I.

Example:

r_{x,i} \in \{[1..5]\} \cup \{\bullet\};

r_x : (4, 5, \bullet, 3, 2, \bullet, 1, 1);  r_y : (4, 3, 1, 2, \bullet, 3, 4, \bullet).

Using standardized values [0..1]:

r_x : (0.75, 1, \bullet, 0.5, 0.25, \bullet, 0, 0);  r_y : (0.75, 0.5, 0, 0.25, \bullet, 0.5, 0.75, \bullet).

We define the cardinality of a list, \#l, as the number of elements in the list l different to \bullet.

(1) We obtain the list d_{x,y} : (d^1_{x,y}, d^2_{x,y}, d^3_{x,y}, \ldots, d^I_{x,y}) where

d^i_{x,y} = (r^i_x - r^i_y)^2 \;\; \forall i \mid r^i_x \neq \bullet \wedge r^i_y \neq \bullet, \qquad d^i_{x,y} = \bullet \;\; \forall i \mid r^i_x = \bullet \vee r^i_y = \bullet   (10)

in our example: d_{x,y} = (0, 0.25, \bullet, 0.0625, \bullet, \bullet, 0.5625, \bullet).

(2) We obtain the MSD(x,y) measure by computing the arithmetic average of the values in the list d_{x,y}:

MSD(x,y) = \bar{d}_{x,y} = \frac{\sum_{i=1..I,\; d^i_{x,y} \neq \bullet} d^i_{x,y}}{\#d_{x,y}}   (11)

in our example: \bar{d}_{x,y} = (0 + 0.25 + 0.0625 + 0.5625)/4 = 0.218

MSD(x,y) (11) tends towards zero as the ratings of users x and y become more similar and tends towards 1 as they become more different (we assume that the votes are normalized in the interval [0..1]).

(3) We obtain the Jaccard(x,y) measure by computing the proportion between the number of positions [1..I] in which there are elements different to \bullet in both r_x and r_y and the number of positions [1..I] in which there are elements different to \bullet in r_x or in r_y:

Jaccard(x,y) = \frac{r_x \cap r_y}{r_x \cup r_y} = \frac{\#d_{x,y}}{\#r_x + \#r_y - \#d_{x,y}}   (12)

in our example: 4/(6 + 6 - 4) = 0.5.

(4) We combine the above elements in the final equation:

new\_metric(x,y) = Jaccard(x,y) \times (1 - MSD(x,y))   (13)

in the running example: new\_metric(x,y) = 0.5 \times (1 - 0.218) = 0.391.

If the values of the votes are normalized in the interval [0..1], then (1 - MSD(x,y)), Jaccard(x,y) and new\_metric(x,y) all lie in [0..1].
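The sketch below reproduces the metric of Eqs. (10)-(13) on the running example above (ratings normalized to [0..1], with None standing for the "not voted" symbol); it should return the value 0.39 computed in the text.

```python
# Sketch of the metric Jaccard(x, y) * (1 - MSD(x, y)) on the running example.
rx = [0.75, 1.00, None, 0.50, 0.25, None, 0.00, 0.00]
ry = [0.75, 0.50, 0.00, 0.25, None, 0.50, 0.75, None]

def jmsd(rx, ry):
    # Eq. (10): squared differences on the positions rated by both users.
    d = [(a - b) ** 2 for a, b in zip(rx, ry) if a is not None and b is not None]
    if not d:
        return 0.0
    msd = sum(d) / len(d)                                   # Eq. (11)
    nx = sum(a is not None for a in rx)
    ny = sum(b is not None for b in ry)
    jaccard = len(d) / (nx + ny - len(d))                   # Eq. (12)
    return jaccard * (1.0 - msd)                            # Eq. (13)

print(jmsd(rx, ry))   # 0.5 * (1 - 0.218...) = 0.39...
```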

4. Planning the experiments

The RS databases [2,30,32] that we use in our experiments present the general characteristics summarized in Table 1.

The experiments have been grouped in such a way that the following can be determined:

- Accuracy.
- Coverage.
- Number of perfect predictions.
- Precision/recall.

We consider a perfect prediction to be each situation in which the prediction of the rating recommended to one user for one film matches the value rated by that user for that film.

The previous experiments were carried out, depending on the size of the database, for each of the following k-neighborhoods values: MovieLens [2..1500] step 50, FilmAffinity [2..2000] step 100, NetFlix [2..10000] step 100, due to the fact that, depending on the size of each particular RS database, it is necessary to use a different number of k-neighborhoods in order to display tendencies in the graphs that present their results. The precision/recall recommendation quality results have been obtained using a range [2..20] of recommendations and relevance thresholds h = 5 using MovieLens and NetFlix and h = 9 using FilmAffinity.

When we use MovieLens and FilmAffinity we use 20% of test users taken at random from all the users of the database; with the remaining 80% we carry out the training. When we use NetFlix, given the huge number of users in the database, we only use 5% of its users as test users. In all cases we use 20% of test items.

Table 2 shows the numerical data exposed in this section.

5. Results

In this section we present the results obtained using the databases specified in Table 1. Fig. 6 shows the results obtained with MovieLens, Fig. 7 shows those obtained with NetFlix and Fig. 8 corresponds to FilmAffinity.

Graph 6A shows the MAE error obtained on MovieLens by applying Pearson correlation (dashed) and the proposed metric (continuous). The new metric achieves significantly fewer errors in practically all the experiments carried out (by varying the number of k-neighborhoods). The average improvement is around 0.2 stars for the most commonly used values of k (50, 100, 150, 200).

Graph 6B shows us the coverage. Small values of k produce small percentages in the capacity for prediction, as it is more improbable that the few neighbors of a test user have voted for a film that this user has not voted for. As the number of neighbors increases, the probability that at least one of them has voted for the film also increases, as shown in the graph.

Table 1. Main parameters of the databases used in the experiments.

                      MovieLens    FilmAffinity    NetFlix
Number of users       4,382        26,447          480,189
Number of movies      3,952        21,128          17,770
Number of ratings     1,000,209    19,126,278      100,480,507
Min and max values    1-5          1-10            1-5

Table 2. Main parameters used in the experiments.

                K range (MAE, coverage, perfect predictions)   Step    N (precision/recall)   h    Test users   Test items
MovieLens 1M    [2..1500]                                       50      [2..20]                5    20%          20%
FilmAffinity    [2..2000]                                       100     [2..20]                5    20%          20%
NetFlix         [2..10000]                                      100     [2..20]                9    5%           20%


Ortega, F., Sánchez, J.L., Bobadilla, J., & Gutiérrez, A. (2013). Improving collaborative filtering-based recommender systems results using Pareto dominance. Information Sciences, 239, 50-61.

Page 70: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 70

Coverage and Recall plots. CORR: Pearson; COS: cosine; EUC: Euclidean; MSD: mean squared differences

Page 71: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 71

commonly used due to its low capacity to produce new recommendations.

MSD offers both a great advantage and a great disadvantage at the same time; the advantage is that it generates very good general results: low average error, high percentage of correct predictions and low percentage of incorrect predictions; the disadvantage is that it has an intrinsic tendency to choose, as the users similar to a given user, those users who have rated a very small number of items [35]. E.g. if we have 7 items that can be rated from 1 to 5 and three users u1, u2, u3 with the following ratings: u1: (•, •, 4, 5, •, •, •), u2: (3, 4, 5, 5, 1, 4, •), u3: (3, 5, 4, 5, •, 3, •) (• means not rated item), the MSD metric will indicate that (u1,u3) have a total similarity (0), (u1,u2) have a similarity of 0.5 and (u2,u3) have a lower similarity (0.6). This situation is not convincing, as intuitively we realize that u2 and u3 are very similar, whilst u1 is only similar to u2 and u3 on 2 ratings, and, therefore, it is not logical to choose it as the most similar to them; what is worse, if it is chosen it will not provide us with possibilities to recommend new items.

The strategy to follow to design the new metric is to considerably raise the capacity of MSD to generate predictions, without losing along the way its good behavior as regards accuracy and quality of the results.

The metric designed is based on two factors:

- The similarity between two users calculated as the mean of the squared differences (MSD): the smaller these differences, the greater the similarity between the 2 users. This part of the metric enables very good accuracy results to be obtained.
- The number of items on which both one user and the other have made a rating, with regard to the total number of items which have been rated between the two users. E.g. given users u1: (3, 2, 4, •, •, •) and u2: (•, 4, 4, 3, •, 1), a common rating has been made on two items out of a joint total of five rated items. This factor enables us to greatly improve the metric's capacity to make predictions.

An important design aspect is the decision not to use a parameter whose value should be given arbitrarily, i.e. the result provided by the metric should be obtained by only taking the values of the ratings provided by the users of the RS.

By working on the 2 factors with standardized values [0..1], the metric obtained is as follows. Given the lists of ratings of 2 generic users x and y:

(r_x, r_y) : (r^1_x, r^2_x, r^3_x, \ldots, r^I_x), (r^1_y, r^2_y, r^3_y, \ldots, r^I_y) \mid I is the number of items of our RS, where one of the possible values of each

Fig. 5. MAE and coverage obtained with Pearson correlation and by combining Jaccard with Pearson correlation, cosine, constrained Pearson's correlation, Spearman rank correlation and mean squared differences. (A) MAE, (B) Coverage. MovieLens 1M, 20% of test users, 20% of test items, k ∈ [2..1500] step 25.

Fig. 4. Measurements related to the Jaccard metric on MovieLens. (A) Number of pairs of users that display the Jaccard values represented on the x axis. (B) Averaged MAE obtained for the pairs of users with the Jaccard values represented on the x axis. (C) Averaged coverage obtained for the pairs of users with the Jaccard values represented on the x axis.


Bobadilla, J., Serradilla, F., & Bernal, J. (2010). A new collaborative filtering metric that improves the behavior of recommender systems. Knowledge-Based Systems, 23(6), 520-528.

The comparative results in Graph 6B show improvements of up to 9% when applying the new metric as compared with the correlation. This is a very good result, as higher values of accuracy normally imply smaller capabilities for recommendation.

Graph 6C shows the percentage of perfect estimations with regard to the total estimations made. Perfect estimations are those which match the value voted by the test user, taking as an estimation the rounded value of the aggregation of the k-neighborhoods. The values obtained in Graph 6C show us a convincing improvement in the results of the new metric over correlation, even by 15% in some cases.

Graph 6D shows the recommendation quality measure: precision versus recall. Although the prediction results (graphs A and C) of the new metric greatly improve on the Pearson correlation ones, that improvement is not transferred to the same extent to the recommendation quality results (approximate improvement of 0.02). In order to better understand this detachment between prediction quality and recommendation quality we must remember that with

Fig. 6. Pearson correlation and new metric comparative results using MovieLens: (A) accuracy, (B) coverage, (C) percentage of perfect predictions, (D) precision/recall. 20% of test users, 20% of test items, k ∈ [2..1500] step 50, N ∈ [2..20], h = 5.

Fig. 7. Correlation and new metric comparative results using NetFlix: (A) accuracy, (B) coverage, (C) percentage of perfect predictions, (D) precision/recall. 5% of test users, 20% of test items, k ∈ [2..10000] step 100, N ∈ [2..20], h = 9.


Page 72: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 72

Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John T. Riedl. 2011. Rethinking The Recommender Research Ecosystem: Reproducibility, Openness, and LensKit. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). ACM, New York, NY, USA, 133-140. DOI=10.1145/2043932.2043958.

Page 73: Recommandation sociale : filtrage collaboratif et par le contenu

Evaluation by cross-validation

73
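A minimal sketch of this kind of evaluation: hold out a fraction of the known ratings, predict them (here with a simple item-mean baseline, an assumption made only to keep the example short), and report the MAE on the held-out part; the ratings are illustrative.

```python
import random

# Hold-out evaluation sketch: hide 20% of the known ratings, predict them
# with an item-mean baseline, and report the MAE on the hidden part.
ratings = {("u1", "i1"): 5, ("u1", "i2"): 3, ("u2", "i1"): 4,
           ("u2", "i3"): 2, ("u3", "i2"): 4, ("u3", "i3"): 1, ("u4", "i1"): 5}

random.seed(0)
test = dict(random.sample(sorted(ratings.items()), k=max(1, len(ratings) // 5)))
train = {k: v for k, v in ratings.items() if k not in test}

def item_mean(item):
    """Average training rating of an item, with a neutral fallback."""
    vals = [v for (u, i), v in train.items() if i == item]
    return sum(vals) / len(vals) if vals else 3.0

mae = sum(abs(item_mean(i) - v) for (u, i), v in test.items()) / len(test)
print(f"MAE on held-out ratings: {mae:.3f}")
```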

Page 74: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

The cold-start problem

— New application

— Editorial recommendation

— Encourage users to give opinions

— New user

— Exploit as much other information about the user as possible

— forms,

— friends on social networks (= asking for access)

— preferences expressed as tags…

— New item

— Exploit the metadata (for a movie: year, director, actors…)

— Exploit the reviews that can be found elsewhere on the Web

74

Page 75: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 75

Page 76: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 76

Page 77: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 77

Page 78: Recommandation sociale : filtrage collaboratif et par le contenu

Amazon: Organization of the items (categories)

78

Product Advertising API https://aws.amazon.com/

cf. http://www.codediesel.com/libraries/amazon-advertising-api-browsenodes/

Page 79: Recommandation sociale : filtrage collaboratif et par le contenu

Similarities and latent spaces

79

Koren Y., Bell R., Volinsky C. Matrix Factorization Techniques for Recommender Systems. IEEE Computer. July 2009: 42-50.

Page 80: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Projection of the users / items matrix

— Each item I is represented by a vector q of dimension f

— Each user U is represented by a vector p of dimension f

— Each factor represents a latent property that characterizes the items and captures the users' interest in it

— The dot product between q and p is an estimate of the interest of U in I

— Method:

— Singular value decomposition

— Approximation by gradient descent (on training data)

80

actual rating / predicted rating / regularization factor

regularization constant (learned by cross-validation)
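A minimal sketch of this factorization learned by stochastic gradient descent on the regularized squared error between actual and predicted ratings (cf. Koren et al., 2009); the ratings, the number of factors f, the learning rate and the regularization constant are illustrative.

```python
import random

# Matrix factorization by SGD on the regularized squared error.
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4),
           ("u2", "i3", 1), ("u3", "i2", 4), ("u3", "i3", 2)]
f, lr, lam, epochs = 2, 0.05, 0.02, 200
random.seed(0)

p = {u: [random.gauss(0, 0.1) for _ in range(f)] for u, _, _ in ratings}  # user factors
q = {i: [random.gauss(0, 0.1) for _ in range(f)] for _, i, _ in ratings}  # item factors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

for _ in range(epochs):
    random.shuffle(ratings)
    for u, i, r in ratings:
        err = r - dot(p[u], q[i])                       # actual - predicted rating
        for k in range(f):
            pu, qi = p[u][k], q[i][k]
            p[u][k] += lr * (err * qi - lam * pu)       # gradient step + regularization
            q[i][k] += lr * (err * pu - lam * qi)

print(round(dot(p["u1"], q["i1"]), 2))   # should move towards the observed rating 5
```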

Page 81: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Latent spaces (continued)

— Non-convex space: risk of reaching a solution far from the global optimum

— Alternating Least Squares (ALS) approach

. Fix q and solve for p; fix p and solve for q, etc.

. Useful when the (training) ratings are implicit (non-sparse matrix)

— Taking biases into account = adjusting the predicted values

— Some users tend to always give good ratings

— Some items always tend to receive good ratings

— The final score must depend on the average of all scores (starting baseline)

— Integrate the a priori preferences of the users (x: items preferred by u; y: attributes (age…))

— Take dynamics into account

81
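For the Alternating Least Squares variant mentioned in the bullets above, here is a small dense sketch that alternately solves a ridge-regularized least-squares problem for the user factors and for the item factors; the rating matrix and hyperparameters are illustrative, and 0 is simply treated as "not rated".

```python
import numpy as np

# Small ALS sketch: alternately solve regularized least squares for user factors P
# (with item factors Q fixed) and for item factors Q (with P fixed).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)   # illustrative ratings, 0 = not rated
n_users, n_items, f, lam = R.shape[0], R.shape[1], 2, 0.1
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, f))
Q = rng.normal(scale=0.1, size=(n_items, f))

def solve_side(fixed, R, lam, axis):
    """Ridge-regression solve for each row of the free side, using observed ratings only."""
    out = np.zeros((R.shape[axis], fixed.shape[1]))
    for idx in range(R.shape[axis]):
        r = R[idx, :] if axis == 0 else R[:, idx]
        mask = r > 0
        A = fixed[mask]
        if A.size == 0:
            continue
        out[idx] = np.linalg.solve(A.T @ A + lam * np.eye(f), A.T @ r[mask])
    return out

for _ in range(15):
    P = solve_side(Q, R, lam, axis=0)   # fix Q, solve for P
    Q = solve_side(P, R, lam, axis=1)   # fix P, solve for Q

print(np.round(P @ Q.T, 1))             # reconstructed ratings
```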

Page 82: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 82

https://www.slideshare.net/MrChrisJohnson/interactive-recommender-systems-with-netflix-and-spotify/48-Diversity_Scorenote

Page 83: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 83

Koren Y., Bell R., Volinsky C. Matrix Factorization Techniques for Recommender Systems. IEEE Computer. July 2009: 42-50.

Page 84: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 84

Page 85: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 85

Page 86: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 86

Koren Y., Bell R., Volinsky C. Matrix Factorization Techniques for Recommender Systems. IEEE Computer. July 2009: 42-50.

Page 87: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Collaborative filtering aimed at "groups"

87

Page 88: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Integration of context

— Very many definitions of context

— Several integration strategies

88

Adomavicius G., Mobasher B., Ricci F., Tuzhilin A. Context-Aware Recommender Systems. In AAAI 2011: 67-81.

Page 89: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Integration of context (continued)

— A Users x Items x Contexts cube replaces the Users x Items matrix

— Tensor factorization

89

Karatzoglou, A.; Amatriain, X.; Baltrunas, L.; and Oliver, N. 2010. Multiverse Recommendation: N-Dimensional Tensor Factorization for Context-Aware Collaborative Filtering. In Proceedings of the 2010 ACM Conference on Recommender Systems, 79–86.

Page 90: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Exploiting links (social networks)

— The social network as an additional input

90

Yang X., Guo Y., Liu Y., Steck H. A survey of collaborative filtering based social recommender systems. Computer Communications. 2014; 41(C): 1-10. doi:10.1016/j.comcom.2013.06.009.

Page 91: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Exploiting links (social networks) (2)

— Prediction based on the links between individuals (Bayesian inference)

91

individual looking for a rating

individuals who have rated the item

intermediate individuals who gather the ratings

Page 92: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

CONTENT-BASED FILTERING

92

Page 93: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Content-based recommendation

— Strong link with Information Retrieval

— The notion of "user profile" is to be compared with the notion of "query"

93
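As a minimal content-based sketch of this analogy, the code below builds TF-IDF vectors for item descriptions, represents the user profile as the mean vector of the liked items, and ranks the remaining items by cosine similarity with that profile; the item descriptions and the liked set are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative item descriptions and a user's liked items.
items = {
    "b1": "medieval history of france monarchy and church",
    "b2": "introduction to machine learning and statistics",
    "b3": "french revolution history politics",
    "b4": "deep learning for natural language processing",
}
liked = ["b1", "b3"]

ids = list(items)
tfidf = TfidfVectorizer().fit_transform([items[i] for i in ids])

# The "user profile" is the mean TF-IDF vector of the liked items (used as a query).
profile = np.asarray(tfidf[[ids.index(i) for i in liked]].mean(axis=0))

candidates = [i for i in ids if i not in liked]
scores = cosine_similarity(profile, tfidf[[ids.index(i) for i in candidates]])[0]
print(sorted(zip(candidates, scores), key=lambda x: -x[1]))
```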

Page 94: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 94

https://www.slideshare.net/MrChrisJohnson/interactive-recommender-systems-with-netflix-and-spotify/81-81NLP_models_also_work_on

Page 95: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

One word, one thing? Not so simple…

95

Page 96: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Audio content

96

Wang, X., & Wang, Y. (2014, November). Improving content-based and hybrid music recommendation using deep learning. In Proceedings of the 22nd ACM international conference on Multimedia (pp. 627-636). ACM.

Page 97: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

RECOMMENDING READING MATERIAL (BOOKS)

97

Page 98: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 98

Page 99: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Recommending Books vs Searching for Books ?

Very diverse needs :

— Topicality

— With a precise context, e.g. arts in China during the 20th century

— With named entities : locations (the book is about a specific location OR the action takes place at this location), proper names…

— Style / Expertise / Language

— fiction, novel, essay, proceedings, position papers…

— for experts / for dummies / for children …

— in English, in French, in old French, in (very) local languages …

— looking for citations / references

— in what book appears a given citation

— what are the books that refer to a given one

— Authority :

— What are the most important books about… (what does "most important" mean?)

— What are the most popular books about …

99

Page 100: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 100

http://social-book-search.humanities.uva.nl/#/overview

approach to retrieval. We started by using Wikipedia as an external source of information, since many books have their dedicated Wikipedia article [3]. We associate a Wikipedia article to each topic and we select the most informative words from the articles in order to expand the query. For our recommendation runs, we used the reviews and the ratings attributed to books by Amazon users. We computed a "social relevance" probability for each book, considering the amount of reviews and the ratings. This probability was then interpolated with scores obtained by Maximum Likelihood Estimates computed on whole Amazon pages, or only on reviews and titles, depending on the run.

The rest of the paper is organized as follows. The following Section gives an insight into the document collection, whereas Section 3 describes our retrieval framework. Finally, we describe our runs in Section 4 and discuss some results in Sections 5 and 6.

2 The Amazon collection

The document collection used for this year's Book Track is composed of Amazon pages of existing books. These pages consist of editorial information such as ISBN number, title, number of pages, etc. However, in this collection the most important content resides in social data. Indeed, Amazon is social-oriented, and users can comment on and rate products they purchased or they own. Reviews are identified by the <review> fields and are unique for a single user: Amazon does not allow a forum-like discussion. Users can also assign tags of their own creation to a product. These tags are useful for refining the search of other users in the sense that they are not fixed: they reflect the trends for a specific product. In the XML documents, they can be found in the <tag> fields. Apart from this user classification, Amazon provides its own category labels that are contained in the <browseNode> fields.

Table 1. Some facts about the Amazon collection.

Number of pages (i.e. books): 2,781,400
Number of reviews: 15,785,133
Number of pages that contain at least a review: 1,915,336

3 Retrieval model

3.1 Sequential Dependence Model

Like the previous year, we used a language modeling approach to retrieval [4]. We use Metzler and Croft's Markov Random Field (MRF) model [5] to integrate multiword phrases in the query. Specifically, we use the Sequential Dependence

Organizers: Marijn Koolen (University of Amsterdam), Toine Bogers (Aalborg University Copenhagen), Antal van den Bosch (Radboud University Nijmegen), Antoine Doucet (University of Caen), Maria Gaede (Humboldt University Berlin), Preben Hansen (Stockholm University), Mark Hall (Edge Hill University), Iris Hendrickx (Radboud University Nijmegen), Hugo Huurdeman (University of Amsterdam), Jaap Kamps (University of Amsterdam), Vivien Petras (Humboldt University Berlin), Michael Preminger (Oslo and Akershus University College of Applied Sciences), Mette Skov (Aalborg University Copenhagen), Suzan Verberne (Radboud University Nijmegen), David Walsh (Edge Hill University)

Page 101: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 101

http://social-book-search.humanities.uva.nl

SBS Collection: real requests from the LibraryThing forum

Page 102: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 102

Page 103: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 103

The catalogue of the person asking the question

Page 104: Recommandation sociale : filtrage collaboratif et par le contenu

Social Tagging

104

Complement the categories, but a lot of tags!

Page 105: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 105

"User" profiles (catalog, reviews, ratings)

Page 106: Recommandation sociale : filtrage collaboratif et par le contenu

Idea: use the reviews and comments rather than the contents

106

Reviews contain: keywords, topics, sentiment, abstracts, other books

Page 107: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 107

6

Book Recommendation / IR

SBS 2016 – Dataset : Amazon collection of 2.8M records

Index Fields

Université Aix-Marseille Amal Htait

Page 108: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 108

7

Book Recommendation / IR

SBS 2016 – Dataset: LibraryThing collection of 113,490 user profiles

userid      workid    author         booktitle        publication-year   catalogue-date   rating   tags
u3266995    660947    Rosina Lippi   Homestead        1999               2006-06          10.0     fiction
u1885143    2729214   Ellen Hopkins  Glass            2009               2009-05          6.0      drugs
u1885143    133315    Tite Kubo      Bleach, Vol. 1   2004               2009-06          6.0      manga

Index Fields

Université Aix-Marseille Amal Htait

Page 109: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 109

8

Book Recommendation / IR

SBS 2016 - Topics Query: processing the query with the information from the example books

Université Aix-Marseille Amal Htait

Page 110: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 110

9

Book Recommendation / IR

SBS 2016 - Retrieval Model: Method - SDM

Weighting query terms [Metzler2005]

● Unigram matches

● Bigram exact matches

● Bigram matches within an unordered window of 8 terms

Université Aix-Marseille Amal Htait
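To illustrate this weighting, the helper below builds an Indri-style SDM query string combining unigrams, exact bigrams (#1) and bigrams within an unordered window of 8 terms (#uw8); the 0.85 / 0.10 / 0.05 weights are the values commonly used in the literature, given here as an assumption rather than as the exact configuration of these runs.

```python
def sdm_query(text, w=(0.85, 0.10, 0.05)):
    """Build an Indri-style Sequential Dependence Model query:
    unigrams, exact bigrams (#1), and bigrams within an unordered window of 8 terms (#uw8)."""
    terms = text.lower().split()
    bigrams = list(zip(terms, terms[1:]))
    unigrams = " ".join(terms)
    ordered = " ".join(f"#1({a} {b})" for a, b in bigrams)
    unordered = " ".join(f"#uw8({a} {b})" for a, b in bigrams)
    return (f"#weight( {w[0]} #combine({unigrams}) "
            f"{w[1]} #combine({ordered}) "
            f"{w[2]} #combine({unordered}) )")

print(sdm_query("historical novels ancient rome"))
```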

Page 111: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 111

Page 112: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 112

Koolen, M., Bogers, T., Gäde, M., Hall, M., Hendrickx, I., Huurdeman, H., ... & Walsh, D. (2016, September). Overview of the CLEF 2016 Social Book Search Lab. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 351-370). Springer International Publishing.

Page 113: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 113

Page 114: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 114

http://ceur-ws.org/Vol-1609/

Page 115: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Building a Graph of Books

— Nodes = books + properties (metadata, #reviews and ranking, page ranks, ratings…)

— Edges = links between books

— Book A refers to Book B according to: — Bibliographic references and citations (in the book / in the reviews) — Amazon recommendation (People who bought A bought B, People who liked A liked B…)

— A is similar to B — They share bibliographic references — Full-text similarity + similarity between the metadata

115

The graph allows us to estimate — «Book Ranks» (cf. Google's PageRank) — Neighborhood — Shortest paths
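As an illustration of the "Book Rank" idea, here is a tiny PageRank computed by power iteration over an assumed directed graph of books ("A refers to / is similar to B"); the graph and the damping factor 0.85 are illustrative.

```python
# Tiny PageRank sketch over an illustrative directed graph of books.
edges = {
    "bookA": ["bookB", "bookC"],
    "bookB": ["bookC"],
    "bookC": ["bookA"],
    "bookD": ["bookC"],
}
nodes = sorted(set(edges) | {v for targets in edges.values() for v in targets})
d, n = 0.85, len(nodes)
rank = {b: 1.0 / n for b in nodes}

for _ in range(50):
    new = {b: (1 - d) / n for b in nodes}
    for src, targets in edges.items():
        share = d * rank[src] / len(targets)   # distribute rank along outgoing links
        for dst in targets:
            new[dst] += share
    rank = new

print(sorted(rank.items(), key=lambda x: -x[1]))   # "book ranks"
```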

Page 116: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 116

Jeh, G., & Widom, J. (2002, July). SimRank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 538-543). ACM.

Page 117: Recommandation sociale : filtrage collaboratif et par le contenu

Recommending books : IR + graph mining

117

IR: Sequential Dependence Model (SDM) - Markov Random Field (Metzler & Croft, 2004) and/or Divergence From Randomness (InL2) model + Query Expansion with Dependence Analysis
Ratings: the more reviews a book has and the better its ratings, the more relevant it is.
Graph: expanding the retrieved books with Similar Books, then reranking with PageRank

13

● We tested many reranking methods, combining the retrieval model scores with other scores based on social information.

● For each document compute:
– PageRank: an algorithm that exploits the link structure to score the importance of nodes in the graph.
– Likeliness: computed from information generated by users (reviews and ratings). The more reviews and good ratings a book has, the more interesting it is. (A combination sketch follows below.)
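A sketch of one way to combine these scores; the weights, the normalisation and the likeliness formula are illustrative assumptions, not the exact scheme used in the experiments.

```python
# Sketch: combine the retrieval score with PageRank and a review/rating
# "likeliness" score. Retrieval scores are assumed non-negative here.
def likeliness(n_reviews, mean_rating, max_reviews):
    # More reviews and better ratings -> higher score, normalised to [0, 1].
    if max_reviews == 0:
        return 0.0
    return (n_reviews / max_reviews) * (mean_rating / 5.0)

def rerank(candidates, pagerank, w_ir=0.7, w_pr=0.2, w_like=0.1):
    """candidates: dicts with keys doc, score, n_reviews, mean_rating."""
    max_ir = max(c["score"] for c in candidates) or 1.0
    max_rev = max(c["n_reviews"] for c in candidates)
    def combined(c):
        return (w_ir * c["score"] / max_ir
                + w_pr * pagerank.get(c["doc"], 0.0)
                + w_like * likeliness(c["n_reviews"], c["mean_rating"], max_rev))
    return sorted(candidates, key=combined, reverse=True)
```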

Graph Modeling – Reranking Schemes

12

[Diagram: graph-based recommendation pipeline. For a topic ti, documents Dti are retrieved from the collection and used as starting nodes; the set is expanded with their neighbors and shortest-path nodes into Dgraph, duplications are deleted to obtain Dfinal, and the final list is reranked.]

Graph Modeling - Recommendation

PageRank + Similar Products
- Very good results in 2011 (judgements obtained by crowdsourcing) (IR and ratings): P@10 ≈ 0.58
- Good results in 2014 (IR, ratings, expansion): P@10 ≈ 0.23; MAP ≈ 0.44
- In 2015: rank 25/47 (IR + graph, but the graph improved IR): P@10 ≈ 0.2 (best 0.39, which included the price of books)

Page 118: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

A perspective: mining multilayer graphs

— PhD thesis of Mohamed Ettaleb (co-supervised by Prof. C. Latiri, B. Douhar, P. Bellot)

118

Books «similar to»

«bought together» layer

authors layer

tags layer

Question: which frequent subgraphs? how should they be interpreted?

Page 119: Recommandation sociale : filtrage collaboratif et par le contenu

119

And in real life? (for us: OpenEdition)

PBAC²: Provenance-Based Access Control for the Cloud

Crowd4WS: Crowdsourcing for Web Service Discovery / VGOLAP: Volunteered Geographic OLAP

DIMAG - Data, Information and content Management Group

5 full professors, 8 associate professors, 2 associate members, 17 PhD students

http://www.lsis.org/dimag

- develop models and algorithms for information retrieval and mining over large masses of data and natural-language content
- propose architectures for information systems, process modelling (BPM), and approaches for the definition, integration and retrieval of Web services

Figure 6. Example images from the test set and their automatic annotations under CorrLDA and LOC-LDA.

Data mining and integration

Figure 1: Graphical model of the Dynamic Personalization Topic Model (for two time slices t1 and t2).

Latent probabilistic models for personalized recommendation and automatic annotation. Query routing in P2P networks via minimal transversals of a hypergraph.


Information extraction with CRFs

Information retrieval and text mining

BILBO

ÉCHO

AUTOMATIC CLASSIFICATION AND METADATA

RECOMMENDATION

CONTENT GRAPH

questions de communication

vertigo

edc

echogeo

vertigo

quaderni

BILBO - LINKING REVIEWS TO BOOKS

ÉCHO - SENTIMENT ANALYSIS

Langouet, G., (1986), « Innovations pédagogiques et technologies éducatives », Revue française de pédagogie, n° 76, pp. 25-30.

Langouet, G., (1986), « Innovations pédagogiques et technologies éducatives », Revue française de pédagogie, n° 76, pp. 25-30. DOI : 10.3406/rfp.1986.1499

18 Voir Permanent Mandates Commission, Minutes of the Fifteenth Session (Geneva: League of Nations, 1929), pp. 100-1. Pour plus de détails, voir Paul Ghali, Les nationalités détachées de l’Empire ottoman à la suite de la guerre (Paris: Les Éditions Domat-Mont-chrestien, 1934), pp. 221-6.

ils ont déjà édité trois recueils et auxquelles ils ont consacré de nombreux travaux critiques. Leur nouvel ouvrage, intitulé Le Roman véritable. Stratégies préfacielles au XVIIIe siècle et rédigé à six mains par Jan Herman, Mladen Kozul et Nathalie Kremer – chaque auteur se chargeant de certains chapitres au sein

DETECTION

ANNOTATION AND DOI LOOKUP

DISPLAY OF THE JOURNAL DOI

BILBO

LEVEL 1

LEVEL 2

LEVEL 3

<bibl><author><surname>Langouet</surname>, <forename>G.</forename>,</author> (<date>1986</date>), <title level="a">« Innovations pédagogiques et technologies éducatives »</title>, <title level="j">Revue française de pédagogie</title>, <abbr>n°</abbr> <biblScope type="issue">76</biblScope>, <abbr>pp.</abbr> <biblScope type=page>25-30</biblScope>. <idno type="DOI">DOI : 10.3406/rfp.1986.1499</idno></bibl>

Equipex « Digital Library for Open Humanities »

Social IR

[Embedded paper excerpt on dependency parsing: typed dependency graphs label dependencies between individual words with grammatical relations, whereas constituent parsing represents the nesting of multi-word constituents; a collapsed Stanford typed-dependency representation merges dependencies involving prepositions, conjuncts and relative-clause referents into direct dependencies between content words.]

Fig. 2. Graph-based model of the sentence: "Mary is reading a book on Semantic Web".

Information extraction with Inductive Logic Programming

[Diagram: supervised classification pipeline. Training phase: the training corpus is indexed and conceptualized, the document vectors are enriched using a proximity matrix, and a classification model is learned. Classification phase: a test document is conceptualized, indexed and enriched in the same way, and the model outputs the predicted class.]

Semantic classification and information retrieval. Filtering Web documents with temporal language models and learning of meta-features.

OpenEdition

The S3 approach (Semantic support, Semantic descriptors, Search Patterns)

[Diagram of the S3 approach: diverse resources are described by semantic descriptors (semantic mapping, bottom-up strategy, enrichment and storage of the descriptors) built on a Manufacturing Process ontology and a Process Control dictionary; search patterns, which are artefacts of business needs, are defined, stored and selected (top-down strategy) and matched against the descriptors to align manufacturing information resources with user needs.]

SI décisionnels et adaptatifs

Fig. 1. Learning content multimodality contextualization process

The S3 approach: creating semantic descriptors

[Diagram: creation of semantic descriptors in the S3 approach. Inputs (resources, the Process Control dictionary, expert descriptions) are semantically mapped to the manufacturing process ontology using concept similarity (Csim) and concept relations with inference; after validation, the outputs are the semantic descriptors of the resources (SD1, SD2, SD3, SD4, ...).]

Ontology-driven personalization
Logistics system modelling — Application to hospital management

Information resource management
Flexible processes — Control of industrial processes

Conversational agents — Dialogue systems — Multimodality

Multi-agent architectures, behaviour simulation and serious games for crisis management

Virtual Reality for Training Doctors to Break Bad News

Web Services and eLearning

Objectives:

Information systems — Web — Big Data — Documents

Matchmaking and Crowdsourcing for Web Service discovery
Provenance management for a trusted Cloud
The citizen as a human sensor: a contribution to spatial OLAP analysis

Univ. Recife (Brazil)

Information extraction

Find the reviews / Link them to the books

Sentiment analysis / Book recommendation

SVM - Z-score - CRF / Graph scoring

RATINGS / POLARITY

GRAPH

RECOMMENDATION

Citation analysis

Page 120: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Identifying book reviews in blogs

• Supervised "genre" classification

• Features: unigrams, location of named entities, dates

• Feature selection: Z-score threshold + random forest

120

Figure 4: ”Location” named entity distribution

Figure 5: ”Date” named entity distribution

methods to build our classifiers, and evaluate the resulting models on new test cases. The focus of our work has been on comparing the effectiveness of different inductive learning algorithms (Naive Bayes, Support Vector Machines with RBF and Linear kernels) in terms of classification accuracy. We also explored alternative document representations (bag-of-words, feature selection using z-score, Named Entity repartition in the text).

6.1 Naive Bayes (NB)

In order to evaluate different classification models, we have adopted as a baseline the naive Bayes approach (Zubaryeva and Savoy, 2010). The classification system has to choose, between the two possible hypotheses h0 = "It is a Review" and h1 = "It is not a Review", the class that has the maximum value according to Equation (5), where |w| indicates the number of words included in the current document and wj denotes a word that appears in the document.

\[ \arg\max_{h_i} \; P(h_i) \prod_{j=1}^{|w|} P(w_j \mid h_i) \qquad (5) \]

where \( P(w_j \mid h_i) = \frac{tf_{j,h_i}}{n_{h_i}} \).

We estimate the probabilities with Equation (5) and obtain the relation between the lexical frequency of the word \( w_j \) in the whole collection \( T_{h_i} \) (denoted \( tf_{j,h_i} \)) and the size of the corresponding corpus.
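A small sketch of this decision rule; it works in log space and adds a tiny epsilon for unseen words, which is an assumption added on top of Equation (5).

```python
import math
from collections import Counter

# Sketch of the two-hypothesis decision of Equation (5), in log space.
def train(docs_by_class):
    """docs_by_class: {'review': [[tokens], ...], 'not_review': [...]}"""
    model, total = {}, sum(len(d) for d in docs_by_class.values())
    for label, docs in docs_by_class.items():
        tf = Counter(w for doc in docs for w in doc)
        model[label] = (len(docs) / total, tf, sum(tf.values()))  # P(h), tf, n_h
    return model

def classify(model, tokens, eps=1e-9):
    best_label, best_score = None, -math.inf
    for label, (prior, tf, n_h) in model.items():
        score = math.log(prior) + sum(
            math.log((tf[w] + eps) / (n_h + eps)) for w in tokens)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```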

6.2 Support Vector Machines (SVM)

SVM designates a learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems (Vapnik, 1995). The SVM method is based on the Structural Risk Minimization principle (Vapnik, 1995) from computational learning theory. In their basic form, SVMs learn a linear threshold function. Nevertheless, by simply plugging in an appropriate kernel function, they can be used to learn linear classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets (Joachims, 1998). The key in such classifiers is to determine the optimal boundaries between the different classes and use them for classification (Aggarwal and Zhai, 2012). Having the vectors from the different representations presented below, we used the Weka toolkit to learn the models. With the linear kernel and the Radial Basis Function (RBF) kernel, this sometimes reaches a good level of performance at the cost of a fast growth of the processing time during the learning stage (Kummer, 2012).
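For illustration, the same comparison can be reproduced with scikit-learn instead of Weka; the toy documents and labels are placeholders, and the C and gamma values are those reported for the first indexing scheme in Table 4.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy comparison of the linear and RBF kernels (placeholder documents).
docs = ["this review discusses the new monograph in detail",
        "this article reports an experiment and its bibliography"]
labels = ["review", "not_review"]

linear_clf = make_pipeline(CountVectorizer(), SVC(kernel="linear", C=5.0))
rbf_clf = make_pipeline(CountVectorizer(), SVC(kernel="rbf", C=5.0, gamma=0.00185))
linear_clf.fit(docs, labels)
rbf_clf.fit(docs, labels)
print(linear_clf.predict(["a critical review of a recent book"]))
```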

6.3 Results

We have used different strategies to represent each textual unit. First, the unigram model (Bag-of-Words), where all words are considered as features. We also used feature selection based on the normalized z-score, keeping the first 1000 words according to this score (after removing all words that appear less than 5 times). As the third approach, we suggested that the common features of the Review collection can be located in the Named Entity distribution in the text.

Table 4: Results showing the performances of the classification models using different indexing schemes on the test set (R = recall, P = precision, F-M = F-measure); the left block is the Review class, the right block the non-Review class.

Indexing  Model         Review R / P / F-M         non-Review R / P / F-M
1         NB            65.5% / 81.5% / 72.6%      81.6% / 65.7% / 72.8%
1         SVM (Linear)  99.6% / 98.3% / 98.9%      97.9% / 99.5% / 98.7%
1         SVM (RBF)     89.8% / 97.2% / 93.4%      96.8% / 88.5% / 92.5%   (C = 5.0, γ = 0.00185)
2         NB            90.6% / 64.2% / 75.1%      37.4% / 76.3% / 50.2%
2         SVM (Linear)  87.2% / 81.3% / 84.2%      75.3% / 82.7% / 78.8%
2         SVM (RBF)     87.2% / 86.5% / 86.8%      83.1% / 84.0% / 83.6%   (C = 32.0, γ = 0.00781)
3         NB            80.0% / 68.4% / 73.7%      54.2% / 68.7% / 60.6%
3         SVM (Linear)  77.0% / 81.9% / 79.4%      78.9% / 73.5% / 76.1%
3         SVM (RBF)     81.2% / 48.6% / 79.9%      72.6% / 75.8% / 74.1%   (C = 8.0, γ = 0.03125)

the corpus. In order to give more importance to the difference in how many times a term appears in both classes, we used the normalized z-score described in Equation (4) with the measure \( \delta \) introduced in Equation (3):

\[ \delta = \frac{tf_{C_0} - tf_{C_1}}{tf_{C_0} + tf_{C_1}} \qquad (3) \]

The normalization measure \( \delta \) is taken into account to calculate the normalized z-score as follows:

\[ Z_{\delta}(w_i \mid C_j) = \begin{cases} Z(w_i \mid C_j)\,\big(1 + |\delta(w_i \mid C_j)|\big) & \text{if } Z > 0 \text{ and } \delta > 0, \text{ or } Z \le 0 \text{ and } \delta \le 0 \\ Z(w_i \mid C_j)\,\big(1 - |\delta(w_i \mid C_j)|\big) & \text{if } Z > 0 \text{ and } \delta \le 0, \text{ or } Z \le 0 \text{ and } \delta > 0 \end{cases} \qquad (4) \]

In Table 3 we can observe the 30 highest normalized Z scores for the Review and non-Review classes after a unigram indexing scheme was performed. We can see that many of these features relate to the class where they predominate.
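A short sketch of this delta-normalised z-score, assuming "less than or equal" where the comparison signs were lost in extraction.

```python
# Sketch of Equations (3)-(4): the raw z-score of a term in a class is
# amplified when the frequency imbalance delta has the same sign,
# damped otherwise (<= assumed where the original signs were garbled).
def delta(tf_c0, tf_c1):
    return (tf_c0 - tf_c1) / (tf_c0 + tf_c1)

def normalized_z(z, d):
    same_sign = (z > 0 and d > 0) or (z <= 0 and d <= 0)
    return z * (1 + abs(d)) if same_sign else z * (1 - abs(d))
```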

Table 3: Distribution of the 30 highest normalized Z scores across the corpus.

 #  Feature        Z-score    #  Feature          Z-score
 1  abandonne       30.14    16  winter             9.23
 2  seront          30.00    17  cleo               8.88
 3  biographie      21.84    18  visible            8.75
 4  entranent       21.20    19  fondamentale       8.67
 5  prise           21.20    20  david              8.54
 6  sacre           21.20    21  pratiques          8.52
 7  toute           20.70    22  signification      8.47
 8  quitte          19.55    23  01                 8.38
 9  dimension       15.65    24  institutionnels    8.38
10  les             14.43    25  1930               8.16
11  commandement    11.01    26  attaques           8.14
12  lie             10.61    27  courrier           8.08
13  construisent    10.16    28  moyennes           7.99
14  lieux           10.14    29  petite             7.85
15  garde            9.75    30  adapted            7.84

In our training corpus, we have 106,911 words obtained from the Bag-of-Words approach. We selected all tokens (features) that appear more than 5 times in each class. The goal is therefore to design a method capable of selecting terms that clearly belong to one genre of documents. We obtained a vector space that contains 5,957 words (features). After calculating the normalized z-score of all features, we selected the first 1,000 features according to this score.

5.3 Using Named Entity (NE) distribution as features

Most approaches involve removing irrelevant descriptors. In this section, we describe a new approach to better represent the documents in the context of this study. The purpose is to find elements that characterize the Review class.

After a linguistic and statistical corpus analysis, we identified some common characteristics (illustrated in Figures 3, 4 and 5). We identified that the bibliographical reference of the reviewed book, or some of its elements (title, author(s) and date), is often present in the title of the review, as in the following example:

[...]<title level="a" type="main"> Dean R. Hoge,Jacqueline E. Wenger,<hi rend="italic">

Evolving Visions of the Priesthood. Changes from Vatican IIto the Turn of the New Century</hi>

</title>

<title type="sub"> Collegeville (MIN), Liturgical Press,2003, 226 p.</title> [...]

In the non-Review class, we found scientific articles. In those documents, a bibliography section is generally present at the end of the text. As we know, this section contains authors' names, locations, dates, etc. However, in the Review class this section is quite often absent. Based on this analysis, we tagged all documents of each class using the Named Entity Recognition tool TagEN (Poibeau, 2003). We aim to explore the distribution of 3 named entities ("authors' names", "locations" and "dates") in the text after removing all XML/HTML tags. After that, we divided texts into 10 parts (the size of each part = total number of words / 10). The distribution ratio of each named entity in each part is used as a feature to build the new document representation, and we obtained a set of 30 features.
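A sketch of this representation, assuming entity mentions are given as (token position, type) pairs produced by a tagger such as TagEN.

```python
# Named-entity distribution representation: split the text into 10 parts
# and, for each entity type, use the ratio of its mentions falling in each
# part as a feature (10 x 3 = 30 features).
def ne_distribution_features(n_tokens, entities, n_parts=10,
                             types=("person", "location", "date")):
    part_size = max(1, n_tokens // n_parts)
    ratios = {t: [0.0] * n_parts for t in types}
    totals = {t: 0 for t in types}
    for position, ne_type in entities:
        if ne_type not in ratios:
            continue
        part = min(position // part_size, n_parts - 1)
        ratios[ne_type][part] += 1
        totals[ne_type] += 1
    for t in types:  # turn raw counts into distribution ratios
        if totals[t]:
            ratios[t] = [c / totals[t] for c in ratios[t]]
    return [v for t in types for v in ratios[t]]  # 30-dimensional vector
```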

Figure 3: ”Person” named entity distribution

6 Experiments

In this section we describe results from experiments using a collection of documents from Revues.org and the Web. We use supervised learning


Page 121: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Sentiment analysis of the reviews
• Statistical metrics (PMI, Z-score, odds ratio…)

• Combined with linguistic resources

121

We compute a Z_score for each term t_i in a class C_j (t_ij) by calculating its term relative frequency tfr_ij in a particular class C_j, as well as the mean (mean_i), which is the term probability over the whole corpus multiplied by n_j, the number of terms in the class C_j, and the standard deviation (sd_i) of term t_i according to the underlying corpus (see Eq. (1), (2)):

\[ Z_{score}(t_{ij}) = \frac{tfr_{ij} - mean_i}{sd_i} \qquad (1) \]

\[ Z_{score}(t_{ij}) = \frac{tfr_{ij} - n_j \, P(t_i)}{\sqrt{n_j \, P(t_i) \, (1 - P(t_i))}} \qquad (2) \]

A term that has a salient frequency in a class in comparison to the others will have a salient Z_score. Z_score was exploited for SA by (Zubaryeva and Savoy 2010): they chose a threshold (>2) for selecting the terms having a Z_score above the threshold, then used a logistic regression for combining these scores. We use Z_scores as added features for classification because tweets are too short, so many tweets do not contain any word with a salient Z_score. The three following figures 1, 2, 3 show the distribution of Z_score over each class; we remark that the majority of terms have a Z_score between -1.5 and 2.5 in each class, and the rest are either very frequent (>2.5) or very rare (<-1.5). A negative value means that the term is not frequent in this class in comparison with its frequencies in the other classes. Table 1 shows the first ten terms having the highest Z_scores in each class. We tested different values for the threshold; the best results were obtained with a threshold of 3.
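A sketch of Equations (1)-(2); the numbers in the usage line are purely illustrative.

```python
import math

# Equations (1)-(2): tfr_ij is the term frequency in class C_j, n_j the
# number of terms in the class, P(t_i) the term probability over the whole
# corpus, mean_i = n_j * P(t_i), sd_i = sqrt(n_j * P(t_i) * (1 - P(t_i))).
def z_score(tfr_ij, n_j, term_count_corpus, corpus_size):
    p = term_count_corpus / corpus_size
    mean_i = n_j * p
    sd_i = math.sqrt(n_j * p * (1 - p))
    return (tfr_ij - mean_i) / sd_i if sd_i else 0.0

# Illustrative numbers only: a term seen 120 times in a 50,000-token class
# and 150 times in a 400,000-token corpus.
print(z_score(120, 50_000, 150, 400_000))
```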

positive    Z_score     negative    Z_score     neutral     Z_score
Love         14.31      Not          13.99      Httpbit       6.44
Good         14.01      Fuck         12.97      Httpfb        4.56
Happy        12.30      Don't        10.97      Httpbnd       3.78
Great        11.10      Shit          8.99      Intern        3.58
Excite       10.35      Bad           8.40      Nov           3.45
Best          9.24      Hate          8.29      Httpdlvr      3.40
Thank         9.21      Sad           8.28      Open          3.30
Hope          8.24      Sorry         8.11      Live          3.28
Cant          8.10      Cancel        7.53      Cloud         3.28
Wait          8.05      stupid        6.83      begin         3.17

Table 1. The first ten terms having the highest Z_score in each class

- Sentiment Lexicon Features (POL): We used two sentiment lexicons, the MPQA Subjectivity Lexicon (Wilson, Wiebe et al. 2005) and Bing Liu's Opinion Lexicon, which was created by (Hu and Liu 2004) and augmented in many later works. We extract the number of positive, negative and neutral words in tweets according to these lexicons. Bing Liu's lexicon only contains negative and positive annotations, while the Subjectivity lexicon contains negative, positive and neutral ones.

- Part Of Speech (POS): We annotate each word in the tweet with its POS tag, and then we compute the number of adjectives, verbs, nouns, adverbs and connectors in each tweet.

4 Evaluation

4.1 Data collection: We used the data set provided at SemEval 2013 and 2014 for subtask B of sentiment analysis in Twitter (Rosenthal, Ritter et al. 2014) (Wilson, Kozareva et al. 2013). The participants were provided with training tweets annotated as positive, negative or neutral. We downloaded these tweets using a given script. Among 9646 tweets, we could only download 8498 of them because of protected profiles and deleted tweets. Then, we used the development set containing 1654 tweets for evaluating our methods. We combined the development set with the training set and built a new model which predicted the labels of the 2013 and 2014 test sets.

4.2 Experiments

Official Results: The results of the system submitted for the SemEval evaluation were 46.38% and 52.02% for the 2013 and 2014 test sets respectively. It should be mentioned that these results are not correct because of a software bug discovered after the submission deadline; the corrected results are therefore reported as non-official results. In fact, the previous results are the output of our classifier trained with all the features of Section 3, but because of an index-shifting error the test set was represented by all the features except the terms.

Non-official Results: We have done various experiments using the features presented in Section 3 with a Multinomial Naïve Bayes model. We first constructed the feature vector of tweet terms, which gave 49% and 46% for the 2013 and 2014 test sets respectively. Then, we augmented this original vector by the Z_score


features, which improve the performance by 6.5% and 10.9%, then by pre-polarity features, which also improve the f-measure by 4% and 6%, but extending with POS tags decreases the f-measure. We also tested all combinations of these features; Table 2 shows the results of each combination. We remark that POS tags are not useful in any of the experiments, and the best result is obtained by combining Z_score and pre-polarity features. We find that Z_score features improve the f-measure significantly and that they are better than pre-polarity features.

Figure 1 Z_score distribution in positive class

Figure 2 Z_score distribution in neutral class

Figure 3 Z_score distribution in negative class

Features           F-measure 2013   F-measure 2014
Terms                   49.42            46.31
Terms+Z                 55.90            57.28
Terms+POS               43.45            41.14
Terms+POL               53.53            52.73
Terms+Z+POS             52.59            54.43
Terms+Z+POL             58.34            59.38
Terms+POS+POL           48.42            50.03
Terms+Z+POS+POL         55.35            58.58

Table 2. Average f-measures for the positive and negative classes of the SemEval 2013 and 2014 test sets.

We repeated all previous experiments after using a twitter dictionary, where we extend the tweet with the expressions related to each emoticon or abbreviation in tweets. The results in Table 3 show that using this dictionary improves the f-measure over all the experiments; the best results are again obtained by combining Z_scores and pre-polarity features.

Features           F-measure 2013   F-measure 2014
Terms                   50.15            48.56
Terms+Z                 57.17            58.37
Terms+POS               44.07            42.64
Terms+POL               54.72            54.53
Terms+Z+POS             53.20            56.47
Terms+Z+POL             59.66            61.07
Terms+POS+POL           48.97            51.90
Terms+Z+POS+POL         55.83            60.22

Table 3. Average f-measures for the positive and negative classes of the SemEval 2013 and 2014 test sets after using a twitter dictionary.

5 Conclusion

In this paper we tested the impact of using a Twitter dictionary, sentiment lexicons, Z_score features and POS tags for the sentiment classification of tweets. We extended the feature vector of tweets with all these features; we proposed a new type of feature, Z_score, and demonstrated that it can improve the performance. We think that Z_score can be used in different ways for improving sentiment analysis; we are going to test it on other types of corpora and use other methods to combine these features.


[Hamdan, Béchet & Bellot, SemEval 2014]

http://sentiwordnet.isti.cnr.it

Page 122: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 122

Page 123: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 123

Page 124: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 124

http://reviewofbooks.openeditionlab.org

Page 125: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Linking Contents by Analyzing the References. In books: no common stylesheet (or many stylesheets, poorly respected…)

Our proposal :

1) Searching for references in the document / footnotes (Support Vector Machines)

2) Annotating the references (Conditional Random Fields) (a labelling sketch follows below)
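A minimal sketch of this labelling step with sklearn-crfsuite (not BILBO's actual code); the features, labels and the single toy reference are simplified assumptions.

```python
import sklearn_crfsuite

# Sequence labelling of reference fields with a CRF; BILBO's own feature
# set and TEI-like label scheme are richer, so this is only illustrative.
def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

tokens = ["Langouet", ",", "G.", ",", "(", "1986", ")", ",",
          "Revue", "française", "de", "pédagogie"]
labels = ["surname", "c", "forename", "c", "c", "date", "c", "c",
          "title", "title", "title", "title"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])
```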

BILBO : Our (open-source) software for Reference Analysis

125

Google Digital Humanities Research Awards (2012)

Annotation

DOI search (Crossref)

OpenEdition Journals: more than 1.5 million references analyzed

Test: http://bilbo.openeditionlab.org
Sources: http://github.com/OpenEdition/bilbo

Page 126: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 126

Test: http://bilbo.openeditionlab.org
Sources: http://github.com/OpenEdition/bilbo

Page 127: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 127

Ollagnier, A., Fournier, S., & Bellot, P. (2016). A Supervised Approach for Detecting Allusive Bibliographical References in Scholarly Publications. In WIMS (p. 36).

Page 128: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 128

PhD thesis of Anaïs Ollagnier (supervised by P. Bellot / S. Fournier)

Page 129: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 129

http://sentiment-analyser.openeditionlab.org/aboutsemeval

Page 130: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

HYBRID SYSTEMS

130

Page 131: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 131

Page 132: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 132

Page 133: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

CONCLUSION

133

Page 134: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Conclusion — A great many (hybrid) approaches

— Collaborative filtering and exploitation of interaction history

— Content analysis

— Exploitation of behavioral data and of explicit information

— Exploitation of social networks

— -> combine everything in a single learning model? Which objective function should be optimized?

— Strong links with other fields

— Statistical methods, data and graph mining, machine learning…

— Information retrieval (isn't that also recommendation?), natural language processing, image/signal analysis, ergonomics and interaction…

— The approaches must be chosen, but so must the data

— Usage and contexts

— Privacy preservation

134

Page 135: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 135

https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf

Page 136: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 136

http://lenskit.org

Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John T. Riedl. 2011. Rethinking the Recommender Research Ecosystem: Reproducibility, Openness, and LensKit. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). ACM, New York, NY, USA, 133-140. DOI=10.1145/2043932.2043958.

Page 137: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 137

Page 138: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 138

Page 139: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 139

Çoba, L., & Zanker, M. rrecsys: an R-package for prototyping recommendation algorithms, RecSys 2016.

Page 140: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition)

Challenges

140

Page 141: Recommandation sociale : filtrage collaboratif et par le contenu

P.Bellot(AMU-CNRS,LSIS-OpenEdition) 141

http://lab.hypotheses.org

Thank you for your attention :-)