SOCIAL RECOMMENDATION
Patrice Bellot, Aix-Marseille Université - CNRS (LSIS UMR 7296) — OpenEdition
[email protected] — http://www.lsis.org/dimag — OpenEdition Lab: http://lab.hypotheses.org
OpenEdition homepage
> 4 million unique visitors / month
Our partners: libraries and institutions all over the world
Some open questions…
— Is it useful to exploit the metadata, the contents, the comments?
— How can contents be linked to one another?
— How can contents of different kinds be exploited?
— How can we "understand" readers' needs? Long queries? Profiles?
— What are the usages? What are the needs?
— How can we go beyond informational relevance? (genre, level of expertise, recency of a document…)
— OpenEdition Lab: a digital humanities (HN) research program — Detecting trends, emerging topics, the books "to read"…
Outline
— A few examples: posing the problems and the stakes
— Which resources?
— Some methodological generalities
— Some strategies for evaluating a recommendation
— Around collaborative filtering (= "social" recommendation?)
— Around content analysis and content suggestion
. focus on book search with long natural-language queries
Introduction
Goals of recommendation:
— Recommend "objects" (movies, books, Web pages…)
— Predict the ratings that individuals would give
Different types of recommendation:
— Knowledge-based: characteristics of the target individuals (age, income…)
— Based on the individuals' preferences
— expressed explicitly by the individuals themselves
— inferred by analyzing their behavior — link with classification
— By crossing the behaviors of individuals: collaborative filtering (see the toy sketch below)
— By building profiles and comparing them to the contents
A large number of information sources:
— Information explicitly given by the individuals
— The contents and their metadata
— The Web and social networks (contents, graphs…)
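To make collaborative filtering concrete, here is a toy sketch in R; the rating matrix, the users and the choice of Pearson correlation on co-rated items are illustrative assumptions, not material from the slides:

# Toy rating matrix: rows = users, columns = items, NA = not rated (hypothetical data)
R <- rbind(u1 = c(5, 4, NA, 1),
           u2 = c(4, 5, 1, 2),
           u3 = c(1, 2, 5, 4))
# Collaborative filtering compares users through the items they have both rated
sim <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)   # items rated by both users
  cor(a[ok], b[ok])             # Pearson correlation on the common items
}
sim(R["u1", ], R["u2", ])       # close to +1: u1 and u2 have similar tastes
sim(R["u1", ], R["u3", ])       # close to -1: opposite tastes

A neighborhood-based recommender would then predict u1's missing ratings from the ratings of the most similar users.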
ACM conferences and workshops
— Conferences:
— Recommender Systems — RecSys (since 2007)
— "Recommendation Systems" sessions at SIGIR, CIKM, …
— Workshops:
— Context-aware Movie Recommendation (2010 + 2011)
— Information Heterogeneity and Fusion in Recommender Systems (2010 + 2011)
— Large-Scale Recommender Systems and the Netflix Prize Competition (2008)
— Recommendation Systems for Software Engineering (2008-14)
— Recommender Systems and the Social Web (2012)
"Recommender systems" articles
ACM RecSys conference (https://recsys.acm.org)
Overview of the approaches
EXAMPLES
Source (2015): https://www.slideshare.net/MrChrisJohnson/interactive-recommender-systems-with-netflix-and-spotify/20-Spotify_in_NumbersStarted_in_2006
Amazon navigation graph: YASIV
http://www.yasiv.com/#/Search?q=orwell&category=Books&lang=US
Many considerations
Bobadilla, J., Ortega, F., Hernando, A., & Gutiérrez, A. Recommender systems survey. Knowledge-Based Systems. 2013;46(C):109-132. doi:10.1016/j.knosys.2013.03.012
RESOURCES
Some data collections
Excerpt from Bobadilla et al. (Knowledge-Based Systems 23 (2010) 520-528), defining a combined MSD/Jaccard similarity metric:

…vote is "not voted", which we represent with the symbol •. All the lists have the same number of elements, $I$.

Example: $r_x, r_y \in (\{1..5\} \cup \{\bullet\})^I$,

$$r_x = (4, 5, \bullet, 3, 2, \bullet, 1, 1), \qquad r_y = (4, 3, 1, 2, \bullet, 3, 4, \bullet);$$

using standardized values in [0..1]:

$$r_x = (0.75, 1, \bullet, 0.5, 0.25, \bullet, 0, 0), \qquad r_y = (0.75, 0.5, 0, 0.25, \bullet, 0.5, 0.75, \bullet).$$

We define the cardinality of a list, $\#l$, as the number of elements in the list $l$ different from •.

(1) We obtain the list $d_{x,y} = (d^1_{x,y}, d^2_{x,y}, \ldots, d^I_{x,y})$ with

$$d^i_{x,y} = (r^i_x - r^i_y)^2 \ \ \forall i \mid r^i_x \neq \bullet \wedge r^i_y \neq \bullet, \qquad d^i_{x,y} = \bullet \ \ \forall i \mid r^i_x = \bullet \vee r^i_y = \bullet; \qquad (10)$$

in our example: $d_{x,y} = (0, 0.25, \bullet, 0.0625, \bullet, \bullet, 0.5625, \bullet)$.

(2) We obtain the MSD(x, y) measure by computing the arithmetic average of the values in the list $d_{x,y}$:

$$MSD(x, y) = \bar{d}_{x,y} = \frac{\sum_{i=1..I,\ d^i_{x,y} \neq \bullet} d^i_{x,y}}{\# d_{x,y}}; \qquad (11)$$

in our example: $\bar{d}_{x,y} = (0 + 0.25 + 0.0625 + 0.5625)/4 = 0.218$.

MSD(x, y) (11) tends towards zero as the ratings of users x and y become more similar, and tends towards 1 as they become more different (we assume that the votes are normalized in the interval [0..1]).

(3) We obtain the Jaccard(x, y) measure by computing the proportion between the number of positions [1..I] in which there are elements different from • in both $r_x$ and $r_y$, and the number of positions [1..I] in which there are elements different from • in $r_x$ or in $r_y$:

$$Jaccard(x, y) = \frac{r_x \cap r_y}{r_x \cup r_y} = \frac{\# d_{x,y}}{\# r_x + \# r_y - \# d_{x,y}}; \qquad (12)$$

in our example: $4/(6 + 6 - 4) = 0.5$.

(4) We combine the above elements in the final equation:

$$newmetric(x, y) = Jaccard(x, y) \times (1 - MSD(x, y)); \qquad (13)$$

in the running example: $newmetric(x, y) = 0.5 \times (1 - 0.218) = 0.391$.

If the values of the votes are normalized in the interval [0..1], then $(1 - MSD(x, y))$, $Jaccard(x, y)$ and $newmetric(x, y)$ all lie in $[0..1]$.
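As a sanity check, the whole computation fits in a few lines of R; this is a sketch in which NA plays the role of the "not voted" symbol •:

new_metric <- function(rx, ry) {
  both <- !is.na(rx) & !is.na(ry)                      # items voted by both users
  msd <- mean((rx[both] - ry[both])^2)                 # Eq. (10)-(11)
  jaccard <- sum(both) / sum(!is.na(rx) | !is.na(ry))  # Eq. (12)
  jaccard * (1 - msd)                                  # Eq. (13)
}
rx <- c(0.75, 1, NA, 0.5, 0.25, NA, 0, 0)
ry <- c(0.75, 0.5, 0, 0.25, NA, 0.5, 0.75, NA)
new_metric(rx, ry)   # 0.5 * (1 - 0.21875) = 0.390625, the 0.391 of the running example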
4. Planning the experiments

The RS databases [2,30,32] that we use in our experiments present the general characteristics summarized in Table 1.

The experiments have been grouped in such a way that the following can be determined:
• Accuracy
• Coverage
• Number of perfect predictions
• Precision/recall

We consider a perfect prediction to be each situation in which the prediction of the rating recommended to one user for one film matches the value rated by that user for that film.

The experiments were carried out, depending on the size of the database, for each of the following k-neighborhood values: MovieLens [2..1500] step 50, FilmAffinity [2..2000] step 100, NetFlix [2..10000] step 100; depending on the size of each particular RS database, a different number of k-neighborhoods is necessary in order to display tendencies in the result graphs. The precision/recall recommendation quality results have been obtained using a range [2..20] of recommendations and relevance thresholds θ = 5 for MovieLens and NetFlix and θ = 9 for FilmAffinity.

When we use MovieLens and FilmAffinity, we use 20% of test users taken at random from all the users of the database; with the remaining 80% we carry out the training. When we use NetFlix, given the huge number of users in the database, we only use 5% of its users as test users. In all cases we use 20% of test items.

Table 2 shows the numerical data exposed in this section.

5. Results

In this section we present the results obtained using the databases specified in Table 1. Fig. 6 shows the results obtained with MovieLens, Fig. 7 shows those obtained with NetFlix, and Fig. 8 corresponds to FilmAffinity.

Graph 6A shows the MAE error obtained on MovieLens by applying the Pearson correlation (dashed) and the proposed metric (continuous). The new metric achieves significantly fewer errors in practically all the experiments carried out (varying the number of k-neighborhoods). The average improvement is around 0.2 stars for the most commonly used values of k (50, 100, 150, 200).

Graph 6B shows the coverage. Small values of k produce small percentages in the capacity for prediction, as it is less probable that the few neighbors of a test user have voted for a film that this user has not voted for. As the number of neighbors increases, the probability that at least one of them has voted for the film also increases, as shown in the graph.

Table 1. Main parameters of the databases used in the experiments.

                     MovieLens    FilmAffinity    NetFlix
Number of users      4382         26447           480189
Number of movies     3952         21128           17770
Number of ratings    1000209      19126278        100480507
Min and max values   1-5          1-10            1-5

Table 2. Main parameters used in the experiments (K range/step for MAE, coverage and perfect predictions; N and θ for precision/recall).

               K range      K step    N          θ    Test users (%)    Test items (%)
MovieLens 1M   [2..1500]    50        [2..20]    5    20                20
FilmAffinity   [2..2000]    100       [2..20]    9    20                20
NetFlix        [2..10000]   100       [2..20]    5    5                 20

(J. Bobadilla et al., Knowledge-Based Systems 23 (2010) 520-528)
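A minimal R sketch of this train/test protocol on synthetic data; the `ratings` data frame below is a made-up stand-in for a real ratings database:

set.seed(42)
ratings <- data.frame(user = sample(1:100, 2000, replace = TRUE),
                      item = sample(1:50, 2000, replace = TRUE),
                      rating = sample(1:5, 2000, replace = TRUE))
# 20% of users are test users; ~20% of each test user's ratings are hidden
test_users <- sample(unique(ratings$user), round(0.2 * length(unique(ratings$user))))
hidden <- ratings$user %in% test_users &
          as.logical(ave(seq_len(nrow(ratings)), ratings$user,
                         FUN = function(i) runif(length(i)) < 0.2))
train <- ratings[!hidden, ]
test  <- ratings[hidden, ]
# MAE would then be mean(abs(predicted - test$rating)) over the hidden ratings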
The MovieLens Datasets
Harper, F. M., & Konstan, J. A. (2016). The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), 19.
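As an illustration, a minimal sketch for loading the MovieLens 1M ratings into R; the `::`-separated ratings.dat layout (UserID::MovieID::Rating::Timestamp) is the one documented by GroupLens, and the file path is an assumption:

raw <- strsplit(readLines("ml-1m/ratings.dat"), "::", fixed = TRUE)
ratings <- as.data.frame(do.call(rbind, lapply(raw, as.numeric)))
names(ratings) <- c("user", "item", "rating", "timestamp")
length(unique(ratings$user))   # number of distinct users
mean(ratings$rating)           # global rating average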
https://labrosa.ee.columbia.edu/millionsong/lastfm
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
http://files.grouplens.org/datasets/hetrec2011/hetrec2011-delicious-readme.txt
METHODS: GENERAL POINTS
"State of the art" articles
https://www.slideshare.net/xamat/recommender-systems-machine-learning-summer-school-2014-cmu
"Individuals" and "data"
Chapter 2. Principal Component Analysis

Let T be a table crossing n individuals I (in rows) and K quantitative variables X (in columns); $x_{i,k}$ is the value of variable k for individual i:

             X_1       X_2       ...    X_K       (variables)
I_1          x_{1,1}   x_{1,2}   ...    x_{1,K}
I_2          x_{2,1}   x_{2,2}   ...    x_{2,K}
...                    x_{i,k}
I_n          x_{n,1}   x_{n,2}   ...    x_{n,K}
(individuals)

One of the objectives of data analysis is to determine profiles of individuals or, put differently, classes of individuals that resemble each other. This resemblance is determined from the values of the variables associated with the individuals.

Another objective concerns the variables themselves: computing the correlations between them (to what extent a change in the values of one entails a change in the values of the other, and in what way), regressing variables on one another (formulating the links between variables)… Principal Component Analysis (PCA) deals with the linear relationships between variables, as opposed to quadratic, logarithmic or exponential relationships, for example. PCA belongs to the family of factor analyses, which determine factors from the values of the variables associated with the individuals.
Studying the individuals / studying the variables
• Data analysis can be conducted according to
• the individuals: looking for resemblances between individuals (based on the values of the variables) = automatic classification (clustering) of the individuals
• the variables: which variables best explain the data (the differences between individuals)? what are the principal components? where is the greatest variability?
> temp <- data.frame(temperature[1:12])
> cl <- kmeans(temp, 3, iter.max = 2, nstart = 15)

e) Visualize the clusters:
> summary(cl)
> cl$cluster
> summary(cl$cluster)
> cl$centers

f) Add the classification result to the data. Use the cluster package to access the clusplot function:
> library(cluster)
then:
> aggregate(temperature, by = list(cl$cluster), FUN = mean)
> cl2 <- data.frame(temperature, cl$cluster)
> clusplot(temperature, cl$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

5. "Bonus" question: working with the APCluster package
Install the APCluster package
[Figure: Individuals factor map (PCA) — European cities, grouped East/North/South/West, plotted on Dim 1 (86.87%) vs. Dim 2 (11.42%)]
[Figure: Variables factor map (PCA) — the monthly temperature variables (January–December) plus Annual, Amplitude, Latitude and Longitude, plotted on Dim 1 (86.87%) vs. Dim 2 (11.42%)]
PCA and dimensionality reduction
• A way to represent clouds of individuals in a few dimensions
— preserving the distances between individuals as well as possible
— favoring the dimensions of greatest variability (iterative selection of the factors that maximize the variance)
= applying a projection function (a short R sketch follows)
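A minimal R sketch of such a projection, assuming the `temperature` data frame of the lab above (rows = cities, columns = monthly mean temperatures):

pca <- prcomp(temperature[1:12], scale. = TRUE)  # center and scale, then project on the principal axes
summary(pca)                                     # share of the variance carried by each component
head(pca$x[, 1:2])                               # coordinates of the individuals on Dim 1 and Dim 2
biplot(pca)                                      # individuals and variables on the first factorial plane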
Learning methods
• Different forms of learning
• A "student" agent copies a "teacher" agent --> provide examples
• Reasoning by induction (from examples)
• Learning the important characteristics
• Detecting recurring patterns
• Adjusting the important parameters
• Turning information into knowledge:
Examples --> Model --> Test --> Correction / enrichment of the examples
Statistical and probabilistic approaches; machine learning
Lafferty, J., McCallum, A., & Pereira, F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the 18th International Conference on Machine Learning (ICML 2001).
[Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.]
…sequence. In addition, the features do not need to completely specify a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

For the remainder of the paper we assume that the dependencies of $Y$, conditioned on $X$, form a chain. To simplify some expressions, we add special start and stop states $Y_0 = \text{start}$ and $Y_{n+1} = \text{stop}$. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that $p_\theta(Y \mid X)$ is a CRF given by (1). For each position $i$ in the observation sequence $x$, we define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(x) = [M_i(y', y \mid x)]$ by

$$M_i(y', y \mid x) = \exp(\Lambda_i(y', y \mid x)),$$
$$\Lambda_i(y', y \mid x) = \sum_k \lambda_k f_k(e_i, Y|_{e_i} = (y', y), x) + \sum_k \mu_k g_k(v_i, Y|_{v_i} = y, x),$$

where $e_i$ is the edge with labels $(Y_{i-1}, Y_i)$ and $v_i$ is the vertex with label $Y_i$. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences $x$, and therefore these matrices can be computed directly as needed from a given training or test observation sequence $x$ and the parameter vector $\theta$. Then the normalization (partition function) $Z_\theta(x)$ is the (start, stop) entry of the product of these matrices:

$$Z_\theta(x) = (M_1(x)\, M_2(x) \cdots M_{n+1}(x))_{\text{start},\text{stop}}.$$

Using this notation, the conditional probability of a label sequence $y$ is written as

$$p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{\left(\prod_{i=1}^{n+1} M_i(x)\right)_{\text{start},\text{stop}}},$$

where $y_0 = \text{start}$ and $y_{n+1} = \text{stop}$.

4. Parameter Estimation for CRFs

We now describe two iterative scaling algorithms to find the parameter vector $\theta$ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$ for appropriately chosen $\delta\lambda_k$ and $\delta\mu_k$. In particular, the IIS update $\delta\lambda_k$ for an edge feature $f_k$ is the solution of

$$\tilde{E}[f_k] \stackrel{\text{def}}{=} \sum_{x,y} \tilde{p}(x,y) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x) = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)\, e^{\delta\lambda_k T(x,y)},$$

where $T(x,y)$ is the total feature count

$$T(x,y) \stackrel{\text{def}}{=} \sum_{i,k} f_k(e_i, y|_{e_i}, x) + \sum_{i,k} g_k(v_i, y|_{v_i}, x).$$

The equations for the vertex feature updates $\delta\mu_k$ have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because $T(x,y)$ is a global property of $(x,y)$, and dynamic programming will sum over sequences with potentially varying $T$. To deal with this, the first algorithm, Algorithm S, uses a "slack feature." The second, Algorithm T, keeps track of partial $T$ totals.

For Algorithm S, we define the slack feature by

$$s(x,y) \stackrel{\text{def}}{=} S - \sum_i \sum_k f_k(e_i, y|_{e_i}, x) - \sum_i \sum_k g_k(v_i, y|_{v_i}, x),$$

where $S$ is a constant chosen so that $s(x^{(i)}, y) \ge 0$ for all $y$ and all observation vectors $x^{(i)}$ in the training set, thus making $T(x,y) = S$. Feature $s$ is "global," that is, it does not correspond to any particular edge or vertex.

For each index $i = 0, \ldots, n+1$ we now define the forward vectors $\alpha_i(x)$ with base case

$$\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \text{start} \\ 0 & \text{otherwise} \end{cases}$$
3 Conditional Random Fields

Lafferty et al. [8] define the probability of a particular label sequence $y$ given observation sequence $x$ to be a normalized product of potential functions, each of the form

$$\exp\Big(\sum_j \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k s_k(y_i, x, i)\Big), \qquad (2)$$

where $t_j(y_{i-1}, y_i, x, i)$ is a transition feature function of the entire observation sequence and the labels at positions $i$ and $i-1$ in the label sequence; $s_k(y_i, x, i)$ is a state feature function of the label at position $i$ and the observation sequence; and $\lambda_j$ and $\mu_k$ are parameters to be estimated from training data.

When defining feature functions, we construct a set of real-valued features $b(x, i)$ of the observation to express some characteristic of the empirical distribution of the training data that should also hold of the model distribution. An example of such a feature is

$$b(x, i) = \begin{cases} 1 & \text{if the observation at position } i \text{ is the word "September"} \\ 0 & \text{otherwise.} \end{cases}$$

Each feature function takes on the value of one of these real-valued observation features $b(x, i)$ if the current state (in the case of a state function) or previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

$$t_j(y_{i-1}, y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_{i-1} = \text{IN and } y_i = \text{NNP} \\ 0 & \text{otherwise.} \end{cases}$$

In the remainder of this report, notation is simplified by writing

$$s(y_i, x, i) = s(y_{i-1}, y_i, x, i)$$

and

$$F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i),$$

where each $f_j(y_{i-1}, y_i, x, i)$ is either a state function $s(y_{i-1}, y_i, x, i)$ or a transition function $t(y_{i-1}, y_i, x, i)$. This allows the probability of a label sequence $y$ given an observation sequence $x$ to be written as

$$p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big(\sum_j \lambda_j F_j(y, x)\Big). \qquad (3)$$

$Z(x)$ is a normalization factor.
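In R, the observation feature b(x, i) and the transition feature above could be sketched as follows (the word "September" and the POS tags IN/NNP come from the excerpt; the example sentence is made up):

b <- function(x, i) as.numeric(x[i] == "September")    # observation feature
t_j <- function(y_prev, y_cur, x, i) {                 # transition feature function
  if (y_prev == "IN" && y_cur == "NNP") b(x, i) else 0
}
t_j("IN", "NNP", c("arrived", "in", "September"), 3)   # fires: returns 1
t_j("DT", "NNP", c("arrived", "in", "September"), 3)   # does not fire: returns 0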
We compute a Z_score for each term $t_i$ in a class $C_j$ ($t_{ij}$) by calculating its term relative frequency $tfr_{ij}$ in a particular class $C_j$, as well as the mean ($mean_i$), which is the term probability over the whole corpus multiplied by $n_j$, the number of terms in the class $C_j$, and the standard deviation ($sd_i$) of term $t_i$ according to the underlying corpus (see Eq. (1, 2)):

$$Z\_score(t_{ij}) = \frac{tfr_{ij} - mean_i}{sd_i} \qquad (1)$$

$$Z\_score(t_{ij}) = \frac{tfr_{ij} - n_j \cdot P(t_i)}{\sqrt{n_j \cdot P(t_i) \cdot (1 - P(t_i))}} \qquad (2)$$

A term that has a salient frequency in a class in comparison to the others will have a salient Z_score. Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010): they chose a threshold (>2) for selecting the number of terms having a Z_score above the threshold, then used a logistic regression for combining these scores. We use Z_scores as added features for classification because tweets are very short, so many tweets do not have any words with a salient Z_score. Figures 1, 2 and 3 show the distribution of Z_score over each class; we remark that the majority of terms have a Z_score between -1.5 and 2.5 in each class, and the rest are either very frequent (>2.5) or very rare (<-1.5). A negative value means that the term is not frequent in this class in comparison with its frequencies in the other classes. Table 1 shows the first ten terms having the highest Z_scores in each class. We tested different values for the threshold; the best results were obtained with a threshold of 3.

Table 1. The first ten terms having the highest Z_score in each class.

positive (Z_score): Love 14.31, Good 14.01, Happy 12.30, Great 11.10, Excite 10.35, Best 9.24, Thank 9.21, Hope 8.24, Cant 8.10, Wait 8.05
negative (Z_score): Not 13.99, Fuck 12.97, Don't 10.97, Shit 8.99, Bad 8.40, Hate 8.29, Sad 8.28, Sorry 8.11, Cancel 7.53, stupid 6.83
neutral (Z_score): Httpbit 6.44, Httpfb 4.56, Httpbnd 3.78, Intern 3.58, Nov 3.45, Httpdlvr 3.40, Open 3.30, Live 3.28, Cloud 3.28, begin 3.17

- Sentiment Lexicon Features (POL): We used two sentiment lexicons, the MPQA Subjectivity Lexicon (Wilson, Wiebe et al. 2005) and Bing Liu's Opinion Lexicon, which was created by (Hu and Liu 2004) and augmented in many later works. We extract the number of positive, negative and neutral words in tweets according to these lexicons. Bing Liu's lexicon only contains negative and positive annotations, but the Subjectivity lexicon contains negative, positive and neutral ones.

- Part Of Speech (POS): We annotate each word in the tweet with its POS tag, and then compute the number of adjectives, verbs, nouns, adverbs and connectors in each tweet.

4 Evaluation

4.1 Data collection
We used the data set provided in SemEval 2013 and 2014 for subtask B of sentiment analysis in Twitter (Rosenthal, Ritter et al. 2014) (Wilson, Kozareva et al. 2013). The participants were provided with training tweets annotated as positive, negative or neutral. We downloaded these tweets using a given script. Among 9646 tweets, we could only download 8498 of them because of protected profiles and deleted tweets. Then, we used the development set containing 1654 tweets for evaluating our methods. We combined the development set with the training set and built a new model which predicted the labels of the 2013 and 2014 test sets.

4.2 Experiments

Official results: The results of our system submitted for the SemEval evaluation gave 46.38% and 52.02% for the 2013 and 2014 test sets respectively. It should be mentioned that these results are not correct because of a software bug discovered after the submission deadline; the correct results are therefore presented as non-official results. In fact, the previous results are the output of our classifier trained with all the features in Section 3, but because of an index-shifting error the test set was represented by all the features except the terms.

Non-official results: We carried out various experiments using the features presented in Section 3 with a Multinomial Naive Bayes model. We first constructed a feature vector of tweet terms, which gave 49% and 46% for the 2013 and 2014 test sets respectively. Then, we augmented this original vector with the Z_score…
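A one-line R sketch of Eq. (2); the counts are made up for illustration:

# tf: frequency of the term in class C_j; n_j: number of term occurrences in C_j;
# p: probability of the term over the whole corpus
z_score <- function(tf, n_j, p) (tf - n_j * p) / sqrt(n_j * p * (1 - p))
z_score(tf = 120, n_j = 10000, p = 0.005)   # about 9.9, well above the threshold of 3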
Which words are characteristic of a group of documents?
Which significant relations can be derived from the observed forms alone?
Analogies, correlations
Recommendation and time series
EVALUATION
Evaluation grid
Pu, P., & Chen, L. A User-Centric Evaluation Framework of Recommender Systems. In: ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI); 2010: 14-22. Excerpt:

Our overall motivation for this research was to understand the crucial factors that influence the user adoption of recommenders. Another motivation is to come up with a subjective evaluation questionnaire that other researchers and practitioners can employ. However, it is unlikely that a 60-item questionnaire can be administered for a quick and easy evaluation. This has motivated us in proposing a simplified model based on our past research. Between 2005 and 2010, we administered 11 subjective questionnaires to a total of 807 subjects [4,5,6,12,13,14,23,24]. Initial questionnaires covered some of the four categories identified in ResQue. As we conducted more experiments, we became more convinced of the four categories and used all of them in recent studies. On average, between 12 and 15 questions were used. Based on this previous work, we have synthesized and organized a total of 15 questions as a simplified model for the purpose of performing a quick and easy usability and adoption evaluation of a recommender (see questions with the * sign).

5. CONCLUSION AND FUTURE WORK

User evaluation of recommender systems is a crucial subject of study that requires a deep understanding, development and testing of the right dimensions (or constructs) and the standardization of the questions used. The framework described in this paper presents the first attempt to develop a complete and balanced evaluation framework that measures users' subjective attitudes based on their experience towards a recommender. ResQue consists of a set of 13 constructs and 60 questions for a high-quality recommender system from the user point of view and can be used as a standard guideline for a user evaluation. It can also be adapted to a custom-made user evaluation by tailoring it in an individual research context. Researchers and practitioners can use these questionnaires with ease to measure users' general satisfaction with recommenders, their readiness to adopt the technology, and their intention to purchase recommended items and return to the site in the future. After ResQue was finalized, we asked several expert researchers in the community of recommender systems to review the model. Their feedback and comments were then incorporated into the final version of the model. This method, known as the Delphi method, is one of the first validation attempts on the model. Since the work was submitted, we have started conducting a survey to further validate the model's reliability, validity and sensitivity using factor analysis, structural equation modeling (SEM), and other techniques described in [21]. Initial results based on 150 participants indicate how the model can be interpreted and show factors that correspond to the original model. At the same time, the analysis also gives some indications on how to refine the model. More users are expected to participate in the survey and the final outcome will be reported soon.

APPENDIX A. Constructs and Questions of ResQue

The following contains the questionnaire statements that can be used in a survey. They are developed based on the ResQue model described in this paper. Users should be asked to indicate their answers to each of the questions using 1-5 Likert scales, where 1 indicates "strongly disagree" and 5 is "strongly agree."

A1. Quality of Recommended Items

A.1.1 Accuracy
• The items recommended to me matched my interests.*
• The recommender gave me good suggestions.
• I am not interested in the items recommended to me (reverse scale).

A.1.2 Relative Accuracy
• The recommendation I received better fits my interests than what I may receive from a friend.
• A recommendation from my friends better suits my interests than the recommendation from this system (reverse scale).

A.1.3 Familiarity
• Some of the recommended items are familiar to me.
• I am not familiar with the items that were recommended to me (reverse scale).

A.1.4 Attractiveness
• The items recommended to me are attractive.

A.1.5 Enjoyability
• I enjoyed the items recommended to me.

A.1.6 Novelty
• The items recommended to me are novel and interesting.*
• The recommender system is educational.
• The recommender system helps me discover new products.
• I could not find new items through the recommender (reverse scale).

A.1.6 Diversity
• The items recommended to me are diverse.*
• The items recommended to me are similar to each other (reverse scale).*

A.1.7 Context Compatibility
• I was only provided with general recommendations.
• The items recommended to me took my personal context requirements into consideration.
• The recommendations are timely.

A2. Interaction Adequacy
• The recommender provides an adequate way for me to express my preferences.
• The recommender provides an adequate way for me to revise my preferences.
• The recommender explains why the products are recommended to me.*

A3. Interface Adequacy
• The recommender's interface provides sufficient information.
• The information provided for the recommended items is sufficient for me.
• The labels of the recommender interface are clear and adequate.
• The layout of the recommender interface is attractive and adequate.*

A4. Perceived Ease of Use

A.4.1 Ease of Initial Learning
• I became familiar with the recommender system very quickly.
• I easily found the recommended items.
• Looking for a recommended item required too much effort (reverse scale).

A.4.2 Ease of Preference Elicitation
• I found it easy to tell the system about my preferences.
• It is easy to learn to tell the system what I like.
• It required too much effort to tell the system what I like (reverse scale).

A.4.3 Ease of Preference Revision
• I found it easy to make the system recommend different things to me.
• It is easy to train the system to update my preferences.
• I found it easy to alter the outcome of the recommended items due to my preference changes.
• It is easy for me to inform the system if I dislike/like the recommended item.
• It is easy for me to get a new set of recommendations.

A.4.4 Ease of Decision Making
• Using the recommender to find what I like is easy.
• I was able to take advantage of the recommender very quickly.
• I quickly became productive with the recommender.
• Finding an item to buy with the help of the recommender is easy.*
• Finding an item to buy, even with the help of the recommender, consumes too much time.

A5. Perceived Usefulness
• The recommended items effectively helped me find the ideal product.*
• The recommended items influence my selection of products.
• I feel supported to find what I like with the help of the recommender.*
• I feel supported in selecting the items to buy with the help of the recommender.

A6. Control/Transparency
• I feel in control of telling the recommender what I want.
• I don't feel in control of telling the system what I want.
• I don't feel in control of specifying and changing my preferences (reverse scale).
• I understood why the items were recommended to me.
• The system helps me understand why the items were recommended to me.
• The system seems to control my decision process rather than me (reverse scale).

A7. Attitudes
• Overall, I am satisfied with the recommender.*
• I am convinced of the products recommended to me.*
• I am confident I will like the items recommended to me.*
• The recommender made me more confident about my selection/decision.
• The recommended items made me confused about my choice (reverse scale).
• The recommender can be trusted.

A8. Behavioral Intentions

A.8.1 Intention to Use the System
• If a recommender such as this exists, I will use it to find products to buy.

A.8.2 Continuance and Frequency
• I will use this recommender again.*
• I will use this type of recommender frequently.
• I prefer to use this type of recommender in the future.

A.8.3 Recommendation to Friends
• I will tell my friends about this recommender.*

A.8.4 Purchase Intention
• I would buy the items recommended, given the opportunity.*

6. REFERENCES
[1] Adomavicius, G. and Tuzhilin, A. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734-749.
[2] Beenen, G., Ling, K., Wang, X., Chang, K., Frankowski, D., Resnick, P., et al. 2004. Using social psychology to motivate contributions to online communities. In CSCW '04: Proceedings of the ACM Conference on Computer Supported Cooperative Work. New York: ACM Press.
[3] Castagnos, S., Jones, N., and Pu, P. 2009. Recommenders' Influence on Buyers' Decision Process. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys 2009), 361-364.
[4] Chen, L. and Pu, P. 2006. Trust Building with Explanation Interfaces. In Proceedings of the International Conference on Intelligent User Interfaces (IUI '06), 93-100.
[5] Chen, L. and Pu, P. 2008. A Cross-Cultural User Evaluation of Product Recommender Interfaces. RecSys 2008, 75-82.
[6] Chen, L. and Pu, P. 2009. Interaction Design Guidelines on Critiquing-based Recommender Systems. User Modeling and User-Adapted Interaction Journal (UMUAI), Springer Netherlands, Volume 19, Issue 3, 167-206.
[7] Davis, F. D. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quart. 13, 319-339.
[8] Grabner-Kräuter, S. and Kaluscha, E. A. 2003. Empirical research in on-line trust: a review and critical assessment. Int. J. Hum.-Comput. Stud. (IJMMS) 58(6), 783-812.
[9] Herlocker, J. L., Konstan, J. A., Borchers, A., and Riedl, J. An algorithmic framework for performing collaborative filtering. In Proc. of ACM SIGIR 1999, ACM Press (1999), 230-237.
[10] Herlocker, J. L., Konstan, J. A., and Riedl, J. 2000. Explaining collaborative filtering recommendations. CSCW 2000, 241-250.
Pu P, Chen L. A User-Centric Evaluation Framework of Recommender Systems. In: ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces; 2010: 14-22.
58
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 59
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 60
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Evaluation measures

— Prediction quality: Mean Absolute Error, Root Mean Squared Error, Coverage

— Recommendation quality: Precision, Recall, F1-measure
61
subsystems. General publications and reviews also exist which include the most commonly accepted evaluation measures: mean absolute error, coverage, precision, recall, and derivatives of these: mean squared error, normalized mean absolute error, ROC and fallout. Goldberg et al. [87] focuses on the aspects not related to the evaluation; Breese et al. [43] compare the predictive accuracy of various methods in a set of representative problem domains.

The majority of articles discuss attempted improvements to the accuracy of RS results (RMSE, MAE, etc.). It is also common to attempt an improvement in recommendations (precision, recall, ROC, etc.). However, additional objectives should be considered for generating greater user satisfaction [253], such as topic diversification, coverage and serendipity.

Currently, the field has a growing interest in generating algorithms with diverse and innovative recommendations, even at the expense of accuracy and precision. To evaluate these aspects, various metrics have been proposed to measure recommendation novelty and diversity [105,220].

Frameworks aid in defining and standardizing the methods and algorithms employed by RS, as well as the mechanisms to evaluate the quality of the results. Among the most significant papers that propose CF frameworks are Herlocker et al. [92], which evaluates similarity weighting, significance weighting, variance weighting, neighborhood selection and rating normalization; Hernández and Gaudioso [95], which proposes a framework in which any RS is formed by two different subsystems, one to guide the user and the other to provide useful/interesting items; Koutrika et al. [125], a framework which introduces levels of abstraction into the CF process, making modifications of the RS more flexible; and Antunes et al. [12], which presents an evaluation framework assuming that evaluation is an evolving process during the system life cycle.

The majority of RS evaluation frameworks proposed until now present two deficiencies. The first is the lack of formalization: although the evaluation metrics are well defined, there are a variety of details in the implementation of the methods which, if not specified, can lead to different results in similar experiments. The second is the absence of standardization of the evaluation measures for aspects such as novelty and trust of the recommendations.

Bobadilla et al. [32] provides a complete series of mathematical formalizations based on set theory. The authors provide a set of evaluation measures which include the quality analysis of the following aspects: predictions, recommendations, novelty and trust.

Presented next is a representative selection of the RS evaluation quality measures most often used in the bibliography.

4.1. Quality of the predictions: mean absolute error, accuracy and coverage

In order to measure the accuracy of the results of an RS, it is usual to compute some of the most common prediction error metrics, amongst which the Mean Absolute Error (MAE) and its related metrics (mean squared error, root mean squared error, and normalized mean absolute error) stand out.

We define U as the set of RS users, I as the set of RS items, r_{u,i} the rating of user u on item i, \bullet the lack of rating (r_{u,i} = \bullet means user u has not rated item i), and p_{u,i} the prediction of item i for user u.

Let O_u = \{ i \in I \mid p_{u,i} \neq \bullet \wedge r_{u,i} \neq \bullet \}, the set of items rated by user u having prediction values. We define the MAE and RMSE of the system as the average of the users' MAE. The absolute difference between prediction and real value, |p_{u,i} - r_{u,i}|, informs about the error in the prediction.

MAE = \frac{1}{\#U} \sum_{u \in U} \left( \frac{1}{\#O_u} \sum_{i \in O_u} |p_{u,i} - r_{u,i}| \right)   (1)

RMSE = \frac{1}{\#U} \sum_{u \in U} \sqrt{ \frac{1}{\#O_u} \sum_{i \in O_u} (p_{u,i} - r_{u,i})^2 }   (2)

The coverage could be defined as the capacity of predicting from a metric applied to a specific RS. In short, it calculates the percentage of situations in which at least one k-neighbor of each active user can rate an item that has not yet been rated by that active user. We define K_{u,i} as the set of neighbors of u which have rated item i. We define the coverage of the system as the average of the users' coverage. Let

C_u = \{ i \in I \mid r_{u,i} = \bullet \wedge K_{u,i} \neq \emptyset \}, \quad D_u = \{ i \in I \mid r_{u,i} = \bullet \}

coverage = \frac{1}{\#U} \sum_{u \in U} \left( 100 \times \frac{\#C_u}{\#D_u} \right)   (3)

4.2. Quality of the set of recommendations: precision, recall and F1

The confidence of users in a certain RS does not depend directly on the accuracy over the set of possible predictions. A user gains confidence in the RS when this user agrees with a reduced set of recommendations made by the RS.

In this section, we define the three most widely used recommendation quality measures: (1) precision, which indicates the proportion of relevant recommended items over the total number of recommended items; (2) recall, which indicates the proportion of relevant recommended items over the number of relevant items; and (3) F1, which is a combination of precision and recall.

Let X_u be the set of recommendations to user u, and Z_u the set of n recommendations to user u. We represent the precision, recall and F1 evaluation measures for recommendations obtained by making n test recommendations to user u, taking a relevancy threshold \theta. Assuming that all users accept n test recommendations:

precision = \frac{1}{\#U} \sum_{u \in U} \frac{\#\{ i \in Z_u \mid r_{u,i} \geq \theta \}}{n}   (4)

recall = \frac{1}{\#U} \sum_{u \in U} \frac{\#\{ i \in Z_u \mid r_{u,i} \geq \theta \}}{\#\{ i \in Z_u \mid r_{u,i} \geq \theta \} + \#\{ i \in Z_u^c \mid r_{u,i} \geq \theta \}}   (5)

F1 = \frac{2 \times precision \times recall}{precision + recall}   (6)

4.3. Quality of the list of recommendations: rank measures

When the number n of recommended items is not small, users give greater importance to the first items on the list of recommendations. Mistakes on these items are more serious errors than those on the last items of the list. Ranking measures take this situation into account. Among the ranking measures most often used are the following standard information retrieval measures: (a) half-life (7) [43], which assumes an exponential decrease in the interest of users as they move away from the recommendations at the top, and (b) discounted cumulative gain (8) [17], wherein the decay is logarithmic.

HL = \frac{1}{\#U} \sum_{u \in U} \sum_{i=1}^{N} \frac{\max(r_{u,p_i} - d, 0)}{2^{(i-1)/(a-1)}}   (7)

DCG_k = \frac{1}{\#U} \sum_{u \in U} \left( r_{u,p_1} + \sum_{i=2}^{k} \frac{r_{u,p_i}}{\log_2(i)} \right)   (8)

[J. Bobadilla et al., Knowledge-Based Systems 46 (2013) 109-132]
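The user-averaged definitions of Eqs. (1)-(6) translate directly into code. Below is a minimal Python sketch, not the survey authors' implementation: `ratings` and `predictions` are assumed to be dicts mapping each user to an {item: value} dict, and `relevant`/`recommended` are illustrative structures for the top-n evaluation.

```python
import math

# A minimal sketch of the user-averaged measures of Eqs. (1)-(6).
# `ratings` and `predictions` map each user to an {item: value} dict; an item
# absent from a dict plays the role of the "lack of rating" symbol above.

def mae_rmse(ratings, predictions):
    """System MAE and RMSE as the average of the per-user values (Eqs. (1)-(2))."""
    maes, rmses = [], []
    for u, r_u in ratings.items():
        p_u = predictions.get(u, {})
        o_u = [i for i in r_u if i in p_u]  # O_u: rated items having a prediction
        if not o_u:
            continue
        maes.append(sum(abs(p_u[i] - r_u[i]) for i in o_u) / len(o_u))
        rmses.append(math.sqrt(sum((p_u[i] - r_u[i]) ** 2 for i in o_u) / len(o_u)))
    if not maes:
        return 0.0, 0.0
    return sum(maes) / len(maes), sum(rmses) / len(rmses)

def precision_recall_f1(relevant, recommended, n):
    """Precision, recall and F1 at n (Eqs. (4)-(6)).

    `relevant` maps each user to the set of items with r_{u,i} >= theta;
    `recommended` maps each user to a ranked item list (Z_u = its n first items).
    """
    precisions, recalls = [], []
    for u, rel in relevant.items():
        z_u = set(recommended.get(u, [])[:n])
        hits = len(z_u & rel)
        precisions.append(hits / n)
        recalls.append(hits / len(rel) if rel else 0.0)
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```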
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Evaluation measures (2)

— Quality of a list of recommendations (rank-based): DCG at rank k (see the sketch below):
. the gain contributed by an item is inversely related to its position in the list
. computed for each user (u), then averaged over all users

nDCG is the normalized version, relative to the "ideal DCG" (ideal ordering of the list)

— Novelty and diversity
62
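As a concrete illustration of the rank-based measures, here is a small sketch of per-user DCG@k (Eq. (8)) and its normalized version nDCG, computed against the "ideal" reordering of the same list. The function names and the list-of-ratings input are assumptions, not part of the original slides; the system-level figure is then obtained by averaging over all users, as in Eq. (8).

```python
import math

# Per-user DCG@k (Eq. (8)) and its normalized variant nDCG. `rated_list`
# holds the true ratings r_{u,p_i} of the recommended items, in
# recommendation order (rank 1 first).

def dcg_at_k(rated_list, k):
    gains = rated_list[:k]
    if not gains:
        return 0.0
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

def ndcg_at_k(rated_list, k):
    ideal = sorted(rated_list, reverse=True)  # the "ideal list"
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(rated_list, k) / idcg if idcg else 0.0

print(ndcg_at_k([3, 5, 0, 4], k=4))  # < 1.0: the best items are not ranked first
```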
p_1, \ldots, p_n represents the recommendation list, r_{u,p_i} represents the true rating of user u for item p_i, k is the rank of the evaluated item, d is the default rating, and a is the number of the item on the list such that there is a 50% chance the user will review that item.

4.4. Novelty and diversity

The novelty evaluation measure indicates the degree of difference between the items recommended to, and known by, the user. The diversity quality measure indicates the degree of differentiation among recommended items.

Currently, novelty and diversity measures do not have a standard; therefore, different authors propose different metrics [163,220]. Certain authors [105] have used the following:

diversity_{Z_u} = \frac{1}{\#Z_u (\#Z_u - 1)} \sum_{i \in Z_u} \sum_{j \in Z_u, j \neq i} [1 - sim(i,j)]   (9)

novelty_i = \frac{1}{\#Z_u - 1} \sum_{j \in Z_u} [1 - sim(i,j)], \quad i \in Z_u   (10)

Here, sim(i,j) indicates an item-to-item memory-based CF similarity measure. Z_u indicates the set of n recommendations to user u.

4.5. Stability

The stability of the predictions and recommendations influences the users' trust towards the RS. An RS is stable if the predictions it provides do not change strongly over a short period of time. Adomavicius and Zhang [4] propose a quality measure of stability, MAS (Mean Absolute Shift). This measure is defined through a set of known ratings R1 and a set of predictions of all unknown ratings, P1. Over an interval of time, users of the RS will have rated a subset S of these unknown ratings, and the RS can now make new predictions, P2. MAS is defined as follows:

stability = MAS = \frac{1}{|P2|} \sum_{(u,i) \in P2} |P2(u,i) - P1(u,i)|   (11)

4.6. Reliability

The reliability of a prediction or a recommendation informs about how seriously we may consider this prediction. When an RS recommends an item to a user with prediction 4.5 on a scale {1, ..., 5}, this user hopes to be satisfied by this item. However, this prediction value (4.5 out of 5) does not reflect the degree of certainty with which the RS has concluded that the user will like this item. Indeed, a prediction of 4.5 is much more reliable if it has been obtained by means of 200 similar users than if it has been obtained from only two similar users.

In Hernando et al. [96], a reliability measure is proposed according to the usual notion that the more reliable a prediction, the less liable it is to be wrong. Although this reliability measure is not a quality measure used for comparing different RS techniques through cross-validation, it can be regarded as a quality measure associated with a prediction and a recommendation. In this way, the RS provides a pair of values (prediction value, reliability value), through which users may balance their preferences: for example, users would probably prefer the option (4, 0.9) to the option (4.5, 0.1). Consequently, the reliability measure proposed in Hernando et al. [96] provides a new understandable factor which users may consider when taking their decisions. Nevertheless, the use of this reliability measure is constrained to those RS based on the kNN algorithm.

The definition of reliability on the prediction p_{u,i} is based on two numeric factors: s_{u,i} and v_{u,i}. s_{u,i} measures the similarity of the neighbors used for making the prediction p_{u,i}; v_{u,i} measures the degree of disagreement between these neighbors when rating item i. The similarity factor is defined as follows:

f_S(s_{u,i}) = 1 - \frac{\bar{s}}{\bar{s} + s_{u,i}}, \quad s_{u,i} = \sum_{v \in K_{u,i}} sim(u,v)   (12)

Fig. 7. Recommender systems evaluation process.

[J. Bobadilla et al., Knowledge-Based Systems 46 (2013) 109-132]
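A short sketch of how Eqs. (9)-(11) can be computed. `sim(i, j)` is assumed to be any item-to-item similarity returning values in [0, 1], and the prediction snapshots `p1`/`p2` are illustrative dicts mapping (user, item) pairs to predicted values; none of these names come from the paper.

```python
# Intra-list diversity and novelty (Eqs. (9)-(10)) and the MAS stability
# measure (Eq. (11)), in their simplest per-list / per-snapshot forms.

def diversity(z_u, sim):
    """Average pairwise dissimilarity of the recommended items (Eq. (9))."""
    n = len(z_u)
    if n < 2:
        return 0.0
    return sum(1 - sim(i, j) for i in z_u for j in z_u if j != i) / (n * (n - 1))

def novelty(item, z_u, sim):
    """Dissimilarity of one recommended item to the rest of the list (Eq. (10))."""
    others = [j for j in z_u if j != item]
    return sum(1 - sim(item, j) for j in others) / len(others) if others else 0.0

def mas(p1, p2):
    """Mean Absolute Shift between two prediction snapshots (Eq. (11));
    assumes p1 provides a prediction for every (user, item) pair in p2."""
    return sum(abs(p2[k] - p1[k]) for k in p2) / len(p2)
```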
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Evaluation measures (3)

— Other user-oriented measures

— Accuracy as perceived by the user

— Familiarity: the items (their existence) are already known to users

— Novelty: discovery of new items

— Attractiveness: the items appeal to users (not always the case for relevant items…)

— Usefulness: the items were appreciated (after use / reading)

— Compatibility with the user's context

— Level of interaction

— Control over the parameters

— Explanations of the recommendation

— Transparency of the method
63
Pu P, Chen L. A User-Centric Evaluation Framework of Recommender Systems. In: ACM RecSys 2010 Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces; 2010: 14-22.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
COLLABORATIVE FILTERING
64
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 65
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Collaborative filtering

We are "social" beings

— "Others" dictate / influence our choices

— Our relationships are typed (friends / enemies, family, professional relationships…)

— "Tell me who your friends are, and I will tell you who you are" (homophily)
66
Table 3 Sets.
Table 5 Running example : users similarities.
N a m e Sets descr ipt ions Pa ramete r s MSD Ui U2 U3 U, U5
N.e
ft 9 ft h
U Users L I I t ems M V Rating va lues min, max Ru User ra t ings user Ku Ne ighborhoods of t he user user, k P„ Predict ions to t he user user, k Xu Top r e c o m m e n d e d i t ems to t he user user, k, 9 Zu Top N r e c o m m e n d e d i t ems to t he user user, k, Y I t ems voted of the m o s t by y users y T„ User 's ne ighborhoods taking into account ft user, k, Qu Trust users user, k, Hu Trust pairs (user, i t em) user, k, ft Axy I t ems ra ted s imul taneous ly by users x and y use r l ,
user2 Gui User 's ne ighborhoods w h i c h have ra ted i tem / user, k Buj Users w h o have vo ted for i t em /, exceptuser user, item 0„ I t ems tha t the user has vo ted for and on w h i c h user, k
predic t ions exist O Users from w h o m a MAE can be ob ta ined k C„ I tems tha t the user has no t voted for and on w h i c h user, k
predic t ions exist D„ I t ems tha t the user has no t voted for user, k C Pairs (user, i t em) t ha t have no t been voted for and k
accept predic t ions D Pairs (user, i t em) t ha t have no t been voted for Exy I t ems t ha t have recent ly been voted for by bo th user x ft use r l ,
and user y user2 S„ User 's recen t votes user, ji
Table 4 Running example : RS database.
ru,¡ h h h u Í5 Í6 h Í8 h ho in Í12 Í13 Í14
Ui 5 • 3 • 4 • • 4 • 2 4 • u?. 1 • 2 4 1 4 1 U3 5 2 4 • • 3 5 4 • • 4 • U4 4 • 3 • • • 5 4 • • • • ¡A 3 3 4 5 • • 5 •
2.3. Obtaining a user's K-neighbors
2.3.1. Formalization We define Ku as the set of K neighbors of the user u. The following must be true:
Ku<zUA#Ku = kAu$Ku, (10) VxeKu, V y e ( [ / - K u ) , sim(u,x) > sim(u,y). (11)
2.3.2. Running example

Table 6 shows the sets of neighbors using K = 2 and K = 3.

2.4. Prediction of the value of an item
2.4.1. Formalization

Based on the information provided by the K-neighbors of a user $u$, the CF process enables the value of an item to be predicted as follows:

Let $P_u = \{(i, p) \mid i \in I,\ p \in \mathbb{R}\}$ be the set of predictions for user $u$ ($\mathbb{R}$: real numbers). (12)

We assign the value of the prediction made to user $u$ on item $i$ as $p_{u,i} = p$. (13)
Table 6. Running example: the 2 and 3 neighbors of each user.

       U1          U2          U3          U4          U5
K=2    {U3,U4}     {U5,U4}     {U1,U4}     {U1,U3}     {U3,U2}
K=3    {U3,U4,U5}  {U5,U4,U1}  {U1,U4,U2}  {U1,U3,U5}  {U3,U2,U4}
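As an illustrative sketch (assuming the MSD values of Table 5, where a smaller mean squared difference means a more similar user), the neighborhoods of Table 6 can be recomputed as follows:

```python
# A minimal sketch of k-neighbor selection (Eqs. 10-11), using the MSD
# values of Table 5 (mean squared differences: smaller = more similar).
MSD = {
    "U1": {"U2": 6.5, "U3": 0.25, "U4": 0.33, "U5": 2.0},
    "U2": {"U1": 6.5, "U3": 6.66, "U4": 5.0, "U5": 1.0},
    "U3": {"U1": 0.25, "U2": 6.66, "U4": 0.5, "U5": 0.75},
    "U4": {"U1": 0.33, "U2": 5.0, "U3": 0.5, "U5": 1.0},
    "U5": {"U1": 2.0, "U2": 1.0, "U3": 0.75, "U4": 1.0},
}

def k_neighbors(u, k):
    """Return K_u: the k users most similar to u (u itself excluded), Eqs. (10)-(11)."""
    others = MSD[u]  # u is never a candidate (u not in K_u)
    return sorted(others, key=others.get)[:k]  # smallest MSD first

print(k_neighbors("U1", 2))  # ['U3', 'U4'], as in Table 6
print(k_neighbors("U1", 3))  # ['U3', 'U4', 'U5']
```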
Once the set of K users (neighbors) similar to the active user $u$ has been calculated ($K_u$), in order to obtain the prediction of item $i$ for user $u$ (12), one of the following aggregation approaches is often used: the average (15), the weighted sum (16) and the adjusted weighted aggregation (Deviation-From-Mean) (17).
$$\text{Let } G_{u,i} = \{ n \in K_u \mid \exists\, r_{n,i} \neq \bullet \} \quad (14)$$
$$p_{u,i} = \frac{1}{\#G_{u,i}} \sum_{n \in G_{u,i}} r_{n,i} \iff G_{u,i} \neq \emptyset \quad (15)$$
$$p_{u,i} = \mu_{u,i} \sum_{n \in G_{u,i}} \mathrm{sim}(u,n)\, r_{n,i} \iff G_{u,i} \neq \emptyset \quad (16)$$
$$p_{u,i} = \bar{r}_u + \mu_{u,i} \sum_{n \in G_{u,i}} \mathrm{sim}(u,n)\,(r_{n,i} - \bar{r}_n) \iff G_{u,i} \neq \emptyset \quad (17)$$

where $\mu_{u,i}$ serves as a normalizing factor, usually computed as:

$$\mu_{u,i} = 1 \Big/ \sum_{n \in G_{u,i}} \mathrm{sim}(u,n) \quad (18)$$
When it is not possible to make the prediction of an item because none of the K-neighbors has voted for this item, we can decide to use the average of the ratings given to that item by all the users of the RS who have voted for it; in this case, Eqs. (14)-(18) are complemented with Eqs. (19)-(23):
$$\text{where } B_{u,i} = \{ n \in U \mid n \neq u,\ r_{n,i} \neq \bullet \} \quad (19)$$
$$p_{u,i} = \frac{1}{\#B_{u,i}} \sum_{n \in B_{u,i}} r_{n,i} \iff G_{u,i} = \emptyset \wedge B_{u,i} \neq \emptyset \quad (20)$$
$$p_{u,i} = \mu_{u,i} \sum_{n \in B_{u,i}} \mathrm{sim}(u,n)\, r_{n,i} \iff G_{u,i} = \emptyset \wedge B_{u,i} \neq \emptyset \quad (21)$$
$$p_{u,i} = \bar{r}_u + \mu_{u,i} \sum_{n \in B_{u,i}} \mathrm{sim}(u,n)\,(r_{n,i} - \bar{r}_n) \iff G_{u,i} = \emptyset \wedge B_{u,i} \neq \emptyset \quad (22)$$
$$\mu_{u,i} = 1 \Big/ \sum_{n \in B_{u,i}} \mathrm{sim}(u,n) \iff G_{u,i} = \emptyset \wedge B_{u,i} \neq \emptyset \quad (23)$$
Finally, cases exist in RS in which it is impossible to make a prediction for an item because no user at all has voted for it:

$$p_{u,i} = \bullet \iff G_{u,i} = \emptyset \wedge B_{u,i} = \emptyset \quad (24)$$
2.4.2. Running example

By using the simplest prediction equation (15), we obtain the predictions that the users can receive using K = 3 neighbors. Table 7 shows these predictions.
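The following is a minimal sketch of the three aggregation schemes of Eqs. (15)-(17); the `ratings` and `sim` inputs are hypothetical placeholders, not the paper's data:

```python
# Sketch of the aggregation approaches (Eqs. 15-17), assuming
# ratings[n][i] holds user n's rating of item i (absent = not rated)
# and sim(u, n) returns a similarity score.
def predict(u, i, neighbors, ratings, sim, mode="average"):
    G = [n for n in neighbors if i in ratings[n]]            # Eq. (14)
    if not G:
        return None                                          # no prediction possible
    if mode == "average":                                    # Eq. (15)
        return sum(ratings[n][i] for n in G) / len(G)
    mu = 1.0 / sum(sim(u, n) for n in G)                     # Eq. (18)
    if mode == "weighted":                                   # Eq. (16)
        return mu * sum(sim(u, n) * ratings[n][i] for n in G)
    if mode == "dfm":                                        # Eq. (17), deviation-from-mean
        mean = lambda n: sum(ratings[n].values()) / len(ratings[n])
        return mean(u) + mu * sum(sim(u, n) * (ratings[n][i] - mean(n)) for n in G)
```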
Explicit ratings vs. implicit ratings (number of accesses or citations, time spent…)

Ratings to be predicted
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 67
Collaborative filtering: similarities and neighborhoods

68

Variant: item-to-item
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Which similarity functions? (a sketch of the basic ones follows the list)

- Pearson correlation

- Spearman correlation on ranks

- Cosine

- Euclidean distance

- More complex metrics:

- JMSD, integrating non-numerical information (a combination of Jaccard and MSD, mean squared differences)

- "Pareto optimum", to filter out the least representative individuals

- Integrating the scores of other individuals / other items

69
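As a minimal, illustrative sketch (not the slides' code), the basic functions above can be written as follows, computed on co-rated items only (None stands for a missing rating):

```python
import math

def corated(rx, ry):
    """Pairs of ratings on items rated by both users (None = not rated)."""
    return [(a, b) for a, b in zip(rx, ry) if a is not None and b is not None]

def pearson(rx, ry):
    p = corated(rx, ry)
    if not p:
        return 0.0
    mx = sum(a for a, _ in p) / len(p)
    my = sum(b for _, b in p) / len(p)
    num = sum((a - mx) * (b - my) for a, b in p)
    den = math.sqrt(sum((a - mx) ** 2 for a, _ in p) *
                    sum((b - my) ** 2 for _, b in p))
    return num / den if den else 0.0

def cosine(rx, ry):
    p = corated(rx, ry)
    num = sum(a * b for a, b in p)
    den = (math.sqrt(sum(a * a for a, _ in p)) *
           math.sqrt(sum(b * b for _, b in p)))
    return num / den if den else 0.0

def euclidean_distance(rx, ry):
    # A distance, not a similarity: smaller means more similar.
    return math.sqrt(sum((a - b) ** 2 for a, b in corated(rx, ry)))
```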
publications and reviews also exist which include the most commonly accepted metrics, aggregation approaches and evaluation measures: mean absolute error, coverage, precision, recall and derivatives of these: mean squared error, normalized mean absolute error, ROC and fallout. Goldberg et al. [13] focus on the aspects not related to the evaluation; Breese et al. [6] compare the predictive accuracy of various methods in a set of representative problem domains. Candillier et al. [7] and Schafer et al. [36] review the main collaborative filtering methods proposed in the literature.
The rest of the paper is structured as follows:

- In Section 2 we provide the basis for the principles on which the design of the new metric will be based, we present graphs which show the way in which the users vote, we carry out experiments which support the decisions made, we establish the best way of selecting numerical and non-numerical information from the votes and, finally, we establish the hypothesis on which the paper and its proposed metric are based.
- In Section 3 we establish the mathematical formulation of the metric.
- In Sections 4 and 5, respectively, we list the experiments that will be carried out and we present and discuss the results obtained.
- Section 6 presents the most relevant conclusions of the publication.
2. Approach and design of the new similarity metric
2.1. Introduction
Collaborative filtering methods work on a table of U users who can rate I items. The prediction of a non-rated item $i$ for a user $u$ is computed as an aggregate of the ratings of the K most similar users (the k-neighborhood) for the same item $i$, where $K_u$ denotes the set of k-neighbors of $u$ and $r_{n,i}$ denotes the value of user $n$'s rating on item $i$ (• if there is no rating value).

Once the set of K users (neighbors) similar to the active user $u$ has been calculated, in order to obtain the prediction of item $i$ for user $u$, one of the following aggregation approaches is often used: the average (2), the weighted sum (3) and the adjusted weighted aggregation (deviation-from-mean) (4). We use the auxiliary set $G_{u,i}$ in order to define Eqs. (2)-(5):
$$G_{u,i} = \{ n \in K_u \mid \exists\, r_{n,i} \neq \bullet \}, \quad (1)$$
$$p_{u,i} = \frac{1}{\#G_{u,i}} \sum_{n \in G_{u,i}} r_{n,i} \iff G_{u,i} \neq \emptyset, \quad (2)$$
$$p_{u,i} = \mu_{u,i} \sum_{n \in G_{u,i}} \mathrm{sim}(u,n)\, r_{n,i} \iff G_{u,i} \neq \emptyset, \quad (3)$$
$$p_{u,i} = \bar{r}_u + \mu_{u,i} \sum_{n \in G_{u,i}} \mathrm{sim}(u,n)\,(r_{n,i} - \bar{r}_n) \iff G_{u,i} \neq \emptyset, \quad (4)$$

where $\mu$ serves as a normalizing factor, usually computed as:

$$\mu_{u,i} = 1 \Big/ \sum_{n \in G_{u,i}} \mathrm{sim}(u,n) \iff G_{u,i} \neq \emptyset. \quad (5)$$

The most popular similarity metrics are Pearson correlation (6), cosine (7), constrained Pearson's correlation (8) and Spearman rank correlation (9):

$$\mathrm{sim}(x,y) = \frac{\sum_i (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_i (r_{x,i} - \bar{r}_x)^2 \sum_i (r_{y,i} - \bar{r}_y)^2}}, \quad (6)$$
$$\mathrm{sim}(x,y) = \frac{\sum_i r_{x,i}\, r_{y,i}}{\sqrt{\sum_i r_{x,i}^2}\ \sqrt{\sum_i r_{y,i}^2}}, \quad (7)$$
$$\mathrm{sim}(x,y) = \frac{\sum_i (r_{x,i} - r_{med})(r_{y,i} - r_{med})}{\sqrt{\sum_i (r_{x,i} - r_{med})^2}\ \sqrt{\sum_i (r_{y,i} - r_{med})^2}}, \quad r_{med}: \text{median value of the rating scale}, \quad (8)$$
$$\mathrm{sim}(x,y) = \frac{\sum_i (rank_{x,i} - \overline{rank}_x)(rank_{y,i} - \overline{rank}_y)}{\sqrt{\sum_i (rank_{x,i} - \overline{rank}_x)^2 \sum_i (rank_{y,i} - \overline{rank}_y)^2}}. \quad (9)$$
Although Pearson correlation is the most commonly used metric in the process of memory-based CF (user to user), this choice is not always backed by the nature and distribution of the data in the RS. Formally, in order to be able to apply this metric with guarantees, the following assumptions must be met:

- Linear relationship between x and y.
- Continuous random variables.
- Both variables must be normally distributed.

These conditions are not normally met in real RS, and Pearson correlation presents some significant cases of erroneous operation that should not be ignored in RS.

Despite its deficiencies, Pearson correlation presents the best prediction and recommendation results in CF-based RS [15,16,31,7,35]; furthermore, it is the most commonly used, and therefore any alternative metric proposed must improve on its results.

On accepting that Pearson correlation is the metric whose results must be improved, but not necessarily the most appropriate to be taken as a base, it is advisable to focus on the information that is obtained in the different research processes and which can sometimes be overlooked when pursuing objectives other than improving the accuracy of RS (cold-start problem, trust and novelty, sparsity, etc.).

The simplest information to give us an idea of the nature of the RS is to find out how users usually vote: do they always tend to vote for the same values? Do they always tend to vote for the same items? Is there much difference between the votes of some users and others?

Fig. 1 shows the distribution of the votes cast in the MovieLens 1M and NetFlix RS (where one can vote in the interval [1..5]). We can see how, on average, users focus their votes on the higher levels of the interval, while avoiding the extremes, particularly the lower ones. The distribution of the votes is not balanced, and particularly negative or particularly positive votes are avoided.

Fig. 2 shows the arithmetic average and the standard deviation of the votes cast in the MovieLens 1M and NetFlix databases. Graphs (A) and (B) of Fig. 2 show the number of items that display the arithmetic average specified on the x axis; we can see that there are hardly any items rated, on average, below 2 or above 4, whereby most of the cases are between the values 3 and 4. Graphs (C) and (D) of Fig. 2 show the number of items that display the standard deviation specified on the x axis; we can see that most of the items have been voted by the users, on average, with a maximum difference of 1.2 votes.

According to the figures analyzed, we find that traditional metrics must often achieve results by operating on a set of discrete ratings with very little variation (a majority of votes between 3 and 4) and with the obligation of improving simpler and quicker estimations, such as always predicting the arithmetic average of the votes of each item (for which we know there is seldom a standard deviation higher than 1.2).

2.2. Basic experimentation

After ascertaining the low diversity of the votes cast by the users, it seems reasonable to consider that the votes mainly tend
vote is "not voted", which we represent with the symbol •. All the lists have the same number of elements: I.

Example:

$r^a_b \in \{1..5\} \cup \{\bullet\}$;

$r_x = (4, 5, \bullet, 3, 2, \bullet, 1, 1)$, $r_y = (4, 3, 1, 2, \bullet, 3, 4, \bullet)$;

using standardized values [0..1]:

$r_x = (0.75, 1, \bullet, 0.5, 0.25, \bullet, 0, 0)$, $r_y = (0.75, 0.5, 0, 0.25, \bullet, 0.5, 0.75, \bullet)$.

We define the cardinality of a list, $\#l$, as the number of elements in the list $l$ different from •.

(1) We obtain the list $d_{x,y} = (d^1_{x,y}, d^2_{x,y}, d^3_{x,y}, \ldots, d^I_{x,y})$ with

$$d^i_{x,y} = (r^i_x - r^i_y)^2\ \forall i \mid r^i_x \neq \bullet \wedge r^i_y \neq \bullet; \qquad d^i_{x,y} = \bullet\ \forall i \mid r^i_x = \bullet \vee r^i_y = \bullet; \quad (10)$$

in our example: $d_{x,y} = (0, 0.25, \bullet, 0.0625, \bullet, \bullet, 0.5625, \bullet)$.

(2) We obtain the MSD(x,y) measure by computing the arithmetic average of the values in the list $d_{x,y}$:

$$MSD(x,y) = \bar{d}_{x,y} = \frac{\sum_{i=1..I,\ d^i_{x,y} \neq \bullet} d^i_{x,y}}{\#d_{x,y}}; \quad (11)$$

in our example: $\bar{d}_{x,y} = (0 + 0.25 + 0.0625 + 0.5625)/4 = 0.218$.

MSD(x,y) (11) tends towards zero as the ratings of users x and y become more similar and tends towards 1 as they become more different (we assume that the votes are normalized in the interval [0..1]).

(3) We obtain the Jaccard(x,y) measure by computing the proportion between the number of positions [1..I] in which there are elements different from • in both $r_x$ and $r_y$ and the number of positions [1..I] in which there are elements different from • in $r_x$ or in $r_y$:

$$Jaccard(x,y) = \frac{r_x \cap r_y}{r_x \cup r_y} = \frac{\#d_{x,y}}{\#r_x + \#r_y - \#d_{x,y}}; \quad (12)$$

in our example: $4/(6 + 6 - 4) = 0.5$.

(4) We combine the above elements in the final equation:

$$newmetric(x,y) = Jaccard(x,y) \times (1 - MSD(x,y)); \quad (13)$$

in the running example: $newmetric(x,y) = 0.5 \times (1 - 0.218) = 0.391$.

If the values of the votes are normalized in the interval [0..1], then $(1 - MSD(x,y))$, $Jaccard(x,y)$ and $newmetric(x,y) \in [0..1]$.
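The four steps translate directly into code; a sketch (with None standing for the • symbol) that reproduces the 0.391 of the running example:

```python
# Sketch of the Jaccard x (1 - MSD) metric of Eqs. (10)-(13).
# Ratings are normalized in [0..1]; None stands for "not voted" (the • symbol).
def new_metric(rx, ry):
    d = [(a - b) ** 2 for a, b in zip(rx, ry)
         if a is not None and b is not None]          # Eq. (10)
    if not d:
        return 0.0                                    # no co-rated items
    msd = sum(d) / len(d)                             # Eq. (11)
    nx = sum(1 for a in rx if a is not None)
    ny = sum(1 for b in ry if b is not None)
    jaccard = len(d) / (nx + ny - len(d))             # Eq. (12)
    return jaccard * (1.0 - msd)                      # Eq. (13)

rx = [0.75, 1.0, None, 0.5, 0.25, None, 0.0, 0.0]
ry = [0.75, 0.5, 0.0, 0.25, None, 0.5, 0.75, None]
print(round(new_metric(rx, ry), 3))  # 0.391, as in the running example
```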
4. Planning the experiments

The RS databases [2,30,32] that we use in our experiments present the general characteristics summarized in Table 1.

The experiments have been grouped in such a way that the following can be determined:

- Accuracy.
- Coverage.
- Number of perfect predictions.
- Precision/recall.

We consider a perfect prediction to be each situation in which the prediction of the rating recommended to one user for one film matches the value rated by that user for that film.

The experiments were carried out, depending on the size of the database, for each of the following k-neighborhood values: MovieLens [2..1500] step 50, FilmAffinity [2..2000] step 100, NetFlix [2..10000] step 100, due to the fact that, depending on the size of each particular RS database, it is necessary to use a different number of k-neighborhoods in order to display tendencies in the result graphs. The precision/recall recommendation quality results have been obtained using a range [2..20] of recommendations and relevance thresholds θ = 5 using MovieLens and NetFlix and θ = 9 using FilmAffinity.

When we use MovieLens and FilmAffinity we use 20% of test users taken at random from all the users of the database; with the remaining 80% we carry out the training. When we use NetFlix, given the huge number of users in the database, we only use 5% of its users as test users. In all cases we use 20% of test items.

Table 2 shows the numerical data exposed in this section.
5. Results

In this section we present the results obtained using the databases specified in Table 1. Fig. 6 shows the results obtained with MovieLens, Fig. 7 shows those obtained with NetFlix and Fig. 8 corresponds to FilmAffinity.

Graph 6A shows the MAE error obtained on MovieLens by applying Pearson correlation (dashed) and the proposed metric (continuous). The new metric achieves significantly fewer errors in practically all the experiments carried out (by varying the number of k-neighborhoods). The average improvement is around 0.2 stars for the most commonly used values of k (50, 100, 150, 200).

Graph 6B shows the coverage. Small values of k produce small percentages in the capacity for prediction, as it is more improbable that the few neighbors of a test user have voted for a film that this user has not voted for. As the number of neighbors increases, the probability that at least one of them has voted for the film also increases, as shown in the graph.

Table 1. Main parameters of the databases used in the experiments.

                     MovieLens   FilmAffinity   NetFlix
Number of users      4,382       26,447         480,189
Number of movies     3,952       21,128         17,770
Number of ratings    1,000,209   19,126,278     100,480,507
Min and max values   1-5         1-10           1-5

Table 2. Main parameters used in the experiments.

               K (MAE, coverage, perfect predictions)   Precision/recall   Test users (%)   Test items (%)
               Range        Step                        N        θ
MovieLens 1M   [2..1500]    50                          [2..20]  5          20               20
FilmAffinity   [2..2000]    100                         [2..20]  5          20               20
NetFlix        [2..10000]   100                         [2..20]  9          5                20
Ortega, F., Sánchez, J.L., Bobadilla, J., & Gutiérrez, A. (2013). Improving collaborative filtering-based recommender systems results using Pareto dominance. Information Sciences, 239, 50-61.
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 70
Coverage and recall curves

CORR: Pearson; COS: cosine; EUC: Euclidean; MSD: mean squared differences
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 71
commonly used due to its low capacity to produce new recommendations.

MSD offers both a great advantage and a great disadvantage at the same time. The advantage is that it generates very good general results: low average error, high percentage of correct predictions and low percentage of incorrect predictions. The disadvantage is that it has an intrinsic tendency to choose, as users similar to a given user, those users who have rated a very small number of items [35]. E.g., if we have 7 items that can be rated from 1 to 5 and three users u1, u2, u3 with the following ratings: u1: (•, •, 4, 5, •, •, •), u2: (3, 4, 5, 5, 1, 4, •), u3: (3, 5, 4, 5, •, 3, •) (• means a not-rated item), the MSD metric will indicate that (u1, u3) have a total similarity (0), (u1, u2) have a similarity of 0.5, and (u2, u3) have a lower similarity (0.6). This situation is not convincing: intuitively we realize that u2 and u3 are very similar, whilst u1 is only similar to u2 and u3 in 2 ratings, and therefore it is not logical to choose it as the most similar to them; what is worse, if it is chosen it will not provide us with possibilities to recommend new items.

The strategy followed to design the new metric is to considerably raise the capacity of MSD to generate predictions, without losing along the way its good behavior as regards accuracy and quality of the results.

The metric designed is based on two factors:

- The similarity between two users calculated as the mean of the squared differences (MSD): the smaller these differences, the greater the similarity between the two users. This part of the metric enables very good accuracy results to be obtained.
- The number of items that both users have rated, relative to the total number of items rated by either of the two users. E.g., given users u1: (3, 2, 4, •, •, •) and u2: (•, 4, 4, 3, •, 1), a common rating has been made on two items out of a joint total of five rated items. This factor enables us to greatly improve the metric's capacity to make predictions.

An important design aspect is the decision not to use a parameter whose value would have to be set arbitrarily, i.e. the result provided by the metric should be obtained by only taking the values of the ratings provided by the users of the RS.
By working on the two factors with standardized values [0..1], the metric obtained is as follows. Given the lists of ratings of two generic users x and y, $(r_x, r_y)$ with $r_x = (r^1_x, r^2_x, r^3_x, \ldots, r^I_x)$ and $r_y = (r^1_y, r^2_y, r^3_y, \ldots, r^I_y)$, where I is the number of items of our RS and one of the possible values of each
Fig. 5. MAE and coverage obtained with Pearson correlation and by combining Jaccard with Pearson correlation, cosine, constrained Pearson's correlation, Spearman rank correlation and mean squared differences. (A) MAE, (B) Coverage. MovieLens 1M, 20% of test users, 20% of test items, k ∈ [2..1500] step 25.

Fig. 4. Measurements related to the Jaccard metric on MovieLens. (A) Number of pairs of users that display the Jaccard values represented on the x axis. (B) Averaged MAE obtained for the pairs of users with the Jaccard values represented on the x axis. (C) Averaged coverages obtained for the pairs of users with the Jaccard values represented on the x axis.
Bobadilla, J., Serradilla, F., & Bernal, J. (2010). A new collaborative filtering metric that improves the behavior of recommender systems. Knowledge-Based Systems, 23(6), 520-528.
The comparative results in Graph 6B show improvements of up to 9% when applying the new metric compared with the correlation. This is a very good result, as higher accuracy values normally imply smaller recommendation capabilities.

Graph 6C shows the percentage of perfect estimations with respect to the total estimations made. Perfect estimations are those which match the value voted by the test user, taking as the estimation the rounded value of the aggregation of the k-neighborhoods. The values obtained in Graph 6C show a convincing improvement in the results of the new metric with respect to correlation, even by 15% in some cases.

Graph 6D shows the recommendation quality measure: precision versus recall. Although the prediction results (graphs A and C) of the new metric greatly improve on the Pearson correlation ones, that improvement is not transferred to the same extent to the recommendation quality results (approximate improvement of 0.02). In order to better understand this detachment between prediction quality and recommendation quality, we must remember that with

Fig. 6. Pearson correlation and new metric comparative results using MovieLens: (A) accuracy, (B) coverage, (C) percentage of perfect predictions, (D) precision/recall. 20% of test users, 20% of test items, k ∈ [2..1500] step 50, N ∈ [2..20], θ = 5.

Fig. 7. Correlation and new metric comparative results using NetFlix: (A) accuracy, (B) coverage, (C) percentage of perfect predictions, (D) precision/recall. 5% of test users, 20% of test items, k ∈ [2..10000] step 100, N ∈ [2..20], θ = 9.
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 72
Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John T. Riedl. 2011. Rethinking the Recommender Research Ecosystem: Reproducibility, Openness, and LensKit. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). ACM, New York, NY, USA, 133-140. DOI=10.1145/2043932.2043958.
Cross-validation evaluation
73
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
The cold-start problem

— New application

— Editorial recommendation

— Encourage users to give opinions

— New user

— Exploit as much other information about the user as possible:

— forms,

— friends on social networks (= ask for access),

— preferences expressed as tags…

— New item

— Exploit the metadata (for a film: year, director, actors…)

— Exploit the reviews that can be found elsewhere on the Web

74
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 75
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 76
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 77
Amazon: organizing objects (categories)
78
Product Advertising API https://aws.amazon.com/
cf. http://www.codediesel.com/libraries/amazon-advertising-api-browsenodes/
Similarities and latent spaces
79
Koren Y., Bell R., Volinsky C. Matrix Factorization Techniques for Recommender Systems. IEEE Computer. July 2009: 42-50.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Projecting the users / items matrix

— Each item I is represented by a vector q of dimension f

— Each user U is represented by a vector p of dimension f

— Each factor represents a latent property that characterizes the items and captures the users' interest in it

— The dot product between q and p is an estimate of U's interest in I

— Method:

— Singular value decomposition

— Approximation by gradient descent (on training data), as sketched below

80

actual rating / predicted rating / regularization factor

regularization constant (learned by cross-validation)
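A minimal sketch of the regularized stochastic gradient descent described above, following the update rules popularized by Koren et al.; the factor dimension, learning rate and regularization constant below are illustrative assumptions:

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, f=20, gamma=0.01, lam=0.05, epochs=50):
    """Minimize sum (r_ui - q_i . p_u)^2 + lam (||q_i||^2 + ||p_u||^2) by SGD.
    ratings: list of (u, i, r) triples; returns user/item factor matrices."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, f))   # user factors p_u
    Q = rng.normal(scale=0.1, size=(n_items, f))   # item factors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            e = r - Q[i] @ P[u]                      # prediction error
            P[u] += gamma * (e * Q[i] - lam * P[u])  # regularized updates
            Q[i] += gamma * (e * P[u] - lam * Q[i])
    return P, Q
```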
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Latent spaces (continued)

— Non-convex problem: risk of finding a solution far from the global optimum

— Alternating Least Squares (ALS) approach (see the sketch below)

. Fix q, solve for p; fix p, solve for q, etc.

. Useful when the training ratings are implicit (a non-sparse matrix)

— Taking biases into account = adjusting the predicted values

— Some users tend to always give good ratings

— Some items tend to always receive good ratings

— The final score should depend on the average of all scores (as a baseline)

— Integrating the users' a priori preferences (x: items preferred by u; y: attributes (age…))

— Taking dynamics into account

81
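Each ALS half-step is a closed-form ridge regression; a sketch under the same illustrative assumptions (a dense matrix R where 0 stands for "unknown"):

```python
import numpy as np

def als_step(R, Q, lam):
    """Solve for all user factors P given fixed item factors Q.
    R: dense ratings matrix (0 = unknown), Q: item factor matrix."""
    f = Q.shape[1]
    P = np.zeros((R.shape[0], f))
    for u in range(R.shape[0]):
        rated = R[u] > 0                           # items rated by user u
        A = Q[rated].T @ Q[rated] + lam * np.eye(f)
        b = Q[rated].T @ R[u, rated]
        P[u] = np.linalg.solve(A, b)               # ridge-regression solution
    return P

# Alternate until convergence:
#   P = als_step(R, Q, lam); Q = als_step(R.T, P, lam); repeat.
```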
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 82
https://www.slideshare.net/MrChrisJohnson/interactive-recommender-systems-with-netflix-and-spotify/48-Diversity_Scorenote
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 83
Koren Y., Bell R., Volinsky C. Matrix Factorization Techniques for Recommender Systems. IEEE Computer. July 2009: 42-50.
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 84
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 85
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 86
Koren Y., Bell R., Volinsky C. Matrix Factorization Techniques for Recommender Systems. IEEE Computer. July 2009: 42-50.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Collaborative filtering for "groups"
87
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Integrating context

— Very many definitions of context

— Several integration strategies

88

Adomavicius G., Mobasher B., Ricci F., Tuzhilin A. Context-Aware Recommender Systems. In: AAAI; 2011: 67-81.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Integrating context (continued)

— A Users x Items x Contexts cube replaces the Users x Items matrix

— Tensor factorization

89
Karatzoglou, A.; Amatriain, X.; Baltrunas, L.; and Oliver, N. 2010. Multiverse Recommendation: N-Dimensional Tensor Factorization for Context-Aware Collaborative Filtering. In Proceedings of the 2010 ACM Conference on Recommender Systems, 79–86.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Exploiting links (social networks)

— The social network as an additional input

90
Yang X., Guo Y., Liu Y., Steck H. A survey of collaborative filtering based social recommender systems. Computer Communications. 2014; 41(C): 1-10. doi:10.1016/j.comcom.2013.06.009.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Exploiting links (social networks) (2)

— Prediction according to the links between individuals (Bayesian inference)

91

the user looking for a rating

the users who rated the item

intermediate users who gather the ratings
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
CONTENT-BASED FILTERING
92
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Content-based recommendation

— Strong link with Information Retrieval

— The notion of "user profile" is close to the notion of "query"

93
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 94
https://www.slideshare.net/MrChrisJohnson/interactive-recommender-systems-with-netflix-and-spotify/81-81NLP_models_also_work_on
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
One word, one thing? Not that simple…
95
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Audio content
96
Wang, X., & Wang, Y. (2014, November). Improving content-based and hybrid music recommendation using deep learning. In Proceedings of the 22nd ACM international conference on Multimedia (pp. 627-636). ACM.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
RECOMMENDING READINGS (BOOKS)
97
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 98
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Recommending Books vs. Searching for Books?

Very diverse needs:

— Topicality

— With a precise context, e.g. arts in China during the 20th century

— With named entities: locations (the book is about a specific location OR the action takes place at this location), proper names…

— Style / Expertise / Language

— fiction, novel, essay, proceedings, position papers…

— for experts / for dummies / for children…

— in English, in French, in old French, in (very) local languages…

— Looking for citations / references

— in which book does a given citation appear

— which books refer to a given one

— Authority:

— What are the most important books about…? (what does "most important" mean?)

— What are the most popular books about…?

99
99
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 100
http://social-book-search.humanities.uva.nl/#/overview
approach to retrieval. We started by using Wikipedia as an external source of information, since many books have their dedicated Wikipedia article [3]. We associate a Wikipedia article to each topic and we select the most informative words from the articles in order to expand the query. For our recommendation runs, we used the reviews and the ratings attributed to books by Amazon users. We computed a "social relevance" probability for each book, considering the number of reviews and the ratings. This probability was then interpolated with scores obtained by Maximum Likelihood Estimates computed on whole Amazon pages, or only on reviews and titles, depending on the run.

The rest of the paper is organized as follows. The following section gives an insight into the document collection, whereas Section 3 describes our retrieval framework. Finally, we describe our runs in Section 4 and discuss some results in Sections 5 and 6.
2 The Amazon collection
The documents used for this year's Book Track are Amazon pages of existing books. These pages consist of editorial information such as ISBN number, title, number of pages, etc. However, in this collection the most important content resides in the social data. Indeed, Amazon is social-oriented, and users can comment on and rate products they purchased or own. Reviews are identified by the <review> fields and are unique to a single user: Amazon does not allow a forum-like discussion. Users can also assign tags of their own creation to a product. These tags are useful for refining the search of other users in that they are not fixed: they reflect the trends for a specific product. In the XML documents, they can be found in the <tag> fields. Apart from this user classification, Amazon provides its own category labels, which are contained in the <browseNode> fields.
Table 1. Some facts about the Amazon collection.

Number of pages (i.e. books)                       2,781,400
Number of reviews                                  15,785,133
Number of pages that contain at least one review   1,915,336
3 Retrieval model
3.1 Sequential Dependence Model

Like the previous year, we used a language modeling approach to retrieval [4]. We use Metzler and Croft's Markov Random Field (MRF) model [5] to integrate multiword phrases in the query. Specifically, we use the Sequential Dependence
Organizers: Marijn Koolen (University of Amsterdam), Toine Bogers (Aalborg University Copenhagen), Antal van den Bosch (Radboud University Nijmegen), Antoine Doucet (University of Caen), Maria Gäde (Humboldt University Berlin), Preben Hansen (Stockholm University), Mark Hall (Edge Hill University), Iris Hendrickx (Radboud University Nijmegen), Hugo Huurdeman (University of Amsterdam), Jaap Kamps (University of Amsterdam), Vivien Petras (Humboldt University Berlin), Michael Preminger (Oslo and Akershus University College of Applied Sciences), Mette Skov (Aalborg University Copenhagen), Suzan Verberne (Radboud University Nijmegen), David Walsh (Edge Hill University).
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 101
http://social-book-search.humanities.uva.nl
SBS Collection: real queries taken from the LibraryThing forum
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 102
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 103
The catalogue of the person asking the question
Social Tagging
104
They complement categories, but that makes a lot of tags!
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 105
"User" profiles (catalog, reviews, ratings)

Idea: use reviews and comments rather than the contents

106

Reviews contain:
- keywords
- topics
- sentiment
- abstracts
- other books
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 107
6

Book recommendation / IR

SBS 2016 dataset: Amazon collection of 2.8M records

Index Fields

Université Aix-Marseille Amal Htait
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 108
7

Book recommendation / IR

SBS 2016 dataset: LibraryThing collection of 113,490 user profiles

userid     workid   author         booktitle       publication-year  catalogue-date  rating  tags
u3266995   660947   Rosina Lippi   Homestead       1999              2006-06         10.0    fiction
u1885143   2729214  Ellen Hopkins  Glass           2009              2009-05         6.0     drugs
u1885143   133315   Tite Kubo      Bleach, Vol. 1  2004              2009-06         6.0     manga

Index Fields

Université Aix-Marseille Amal Htait
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 109
8

Book recommendation / IR

SBS 2016 topics query: processing the query using the information of the example books

Université Aix-Marseille Amal Htait
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 110
9

Book recommendation / IR

SBS 2016 retrieval model: the SDM method

Weighting query terms [Metzler2005]

● Unigram matches

● Bigram exact matches

● Bigram matches within an unordered window of 8 terms

Université Aix-Marseille Amal Htait
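For illustration, an SDM query can be expressed as an Indri-style #weight combination of the three match types; the 0.85/0.10/0.05 weights below are commonly used defaults, an assumption here, not necessarily those of the SBS runs:

```python
# Sketch: build an Indri-style SDM query from a bag of query terms
# (assumes at least two terms, so that bigrams exist).
def sdm_query(terms, w=(0.85, 0.10, 0.05)):
    unigrams = " ".join(terms)
    ordered = " ".join(f"#1({a} {b})" for a, b in zip(terms, terms[1:]))    # exact bigrams
    unordered = " ".join(f"#uw8({a} {b})" for a, b in zip(terms, terms[1:]))  # window of 8
    return (f"#weight({w[0]} #combine({unigrams}) "
            f"{w[1]} #combine({ordered}) "
            f"{w[2]} #combine({unordered}))")

print(sdm_query(["social", "book", "search"]))
```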
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 111
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 112
Koolen, M., Bogers, T., Gäde, M., Hall, M., Hendrickx, I., Huurdeman, H., ... & Walsh, D. (2016, September). Overview of the CLEF 2016 Social Book Search Lab. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 351-370). Springer International Publishing.
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 113
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 114
http://ceur-ws.org/Vol-1609/
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Building a Graph of Books

— Nodes = books + properties (metadata, #reviews and ranking, page ranks, ratings…)

— Edges = links between books

— Book A refers to Book B according to:
— bibliographic references and citations (in the book / in the reviews)
— Amazon recommendations (people who bought A bought B, people who liked A liked B…)

— A is similar to B:
— they share bibliographic references
— full-text similarity + similarity between the metadata

115

The graph allows us to estimate:
— "Book Ranks" (cf. Google's PageRank)
— neighborhoods
— shortest paths
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 116
Jeh, G., & Widom, J. (2002, July). SimRank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 538-543). ACM.
Recommending books: IR + graph mining

117

IR: Sequential Dependence Model (SDM) - Markov Random Field (Metzler & Croft, 2004) and/or Divergence From Randomness (InL2) model + query expansion with dependence analysis.
Ratings: the more reviews a book has and the better its ratings, the more relevant it is.
Graph: expanding the retrieved books with similar books, then reranking with PageRank.
13

● We tested many reranking methods, combining the retrieval model scores with other scores based on social information.

● For each document we compute:

– PageRank: an algorithm that exploits the link structure to score the importance of nodes in the graph (see the sketch below).

– Likeliness: computed from the information generated by users (reviews and ratings). The more reviews and the better the ratings a book has, the more interesting it is.
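A sketch of such a reranking (the book graph, the score normalization and the interpolation weight are illustrative assumptions, not the SBS runs' code):

```python
import networkx as nx

def rerank(retrieved, book_graph, alpha=0.7):
    """Interpolate IR scores with PageRank over the book graph.
    retrieved: dict book_id -> normalized IR score."""
    pr = nx.pagerank(book_graph, alpha=0.85)   # importance of each book node
    return sorted(retrieved,
                  key=lambda b: alpha * retrieved[b] + (1 - alpha) * pr.get(b, 0.0),
                  reverse=True)
```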
Graph Modeling – Reranking Schemes

12

[Diagram: documents retrieved from the collection form the starting nodes; they are expanded with their neighbors and shortest-path nodes into a graph; duplications are deleted; the final document list is obtained by reranking.]

Graph Modeling - Recommendation
PageRank + Similar Products
- Very good results in 2011 (judgements obtained by crowdsourcing) (IR and ratings): P@10 ≈ 0.58
- Good results in 2014 (IR, ratings, expansion): P@10 ≈ 0.23; MAP ≈ 0.44
- In 2015: rank 25/47 (IR + graph, but the graph improved IR): P@10 ≈ 0.2 (best: 0.39, which included the price of books)
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
A perspective: multilayer graph mining

— PhD thesis of Mohamed Ettaleb (co-supervised by Pr. C. Latiri, B. Douhar, P. Bellot)

118

Books "similar to"

"bought together" layer

authors layer

tags layer

Question: which frequent subgraphs? How should they be interpreted?
119
And in real life? (for us: OpenEdition)
PBAC²: Provenance-Based Access Control for the Cloud

Crowd4WS: Crowdsourcing for Web Service Discovery

VGOLAP: Volunteered Geographic OLAP
DIMAG
Data, Information and content Management Group

5 full professors, 8 associate professors, 2 associate members, 17 PhD students

http://www.lsis.org/dimag

- develop models and algorithms for information retrieval and mining over large data sets and natural-language contents
- propose architectures for information systems, process modeling (BPM) and approaches for the definition, integration and retrieval of Web services
[Figure 6. Example images from the test set and their automatic annotations under CorrLDA and LOC-LDA.]
Data mining and integration

Figure 1: Graphical model of the Dynamic Personalization Topic Model (for two time slices t1 and t2).

Latent probabilistic models for personalized recommendation and automatic annotation. Query routing in P2P networks via minimal transversals of a hypergraph.
Information extraction with CRF

Information retrieval and text mining
BILBO

ÉCHO

AUTOMATIC CLASSIFICATION AND METADATA

RECOMMENDATION

CONTENT GRAPH

questions de communication

vertigo

edc

echogeo

vertigo

quaderni

BILBO - LINKING BOOK REVIEWS TO THE BOOKS THEMSELVES

ÉCHO - SENTIMENT ANALYSIS
Langouet, G., (1986), « Innovations pédagogiques et technologies éducatives », Revue française de pédagogie, n° 76, pp. 25-30.
Langouet, G., (1986), « Innovations pédagogiques et technologies éducatives », Revue française de pédagogie, n° 76, pp. 25-30.DOI : 10.3406/rfp.1986.1499
18 Voir Permanent Mandates Commission, Minutes of the Fifteenth Session (Geneva: League of Nations, 1929), pp. 100-1. Pour plus de détails, voir Paul Ghali, Les nationalités détachées de l'Empire ottoman à la suite de la guerre (Paris: Les Éditions Domat-Montchrestien, 1934), pp. 221-6.
ils ont déjà édité trois recueils et auxquelles ils ont consacré de nombreux travaux critiques. Leur nouvel ouvrage, intitulé Le Roman véritable. Stratégies préfacielles au XVIIIe siècle et rédigé à six mains par Jan Herman, Mladen Kozul et Nathalie Kremer – chaque auteur se chargeant de certains chapitres au sein
DETECTION

ANNOTATION AND DOI LOOKUP

DISPLAY OF THE JOURNAL'S DOI

BILBO

LEVEL 1

LEVEL 2

LEVEL 3
<bibl><author><surname>Langouet</surname>, <forename>G.</forename>,</author> (<date>1986</date>), <title level="a">« Innovations pédagogiques et technologies éducatives »</title>, <title level="j">Revue française de pédagogie</title>, <abbr>n°</abbr> <biblScope type="issue">76</biblScope>, <abbr>pp.</abbr> <biblScope type=page>25-30</biblScope>. <idno type="DOI">DOI : 10.3406/rfp.1986.1499</idno></bibl>
Equipex « Digital Library for Open Humanities »
Social IR
parses, a.k.a. constituent parsing, and produces [8]. This directed graph is the result of an all-path parsing algorithm based on a dependency grammar [6] in which the syntactic structure is expressed in terms of dependencies. This relation defines a dependency tree, whose root is a

We have adopted the typed dependencies proposed in [8]. It is worth noticing that typed dependencies and phrase structures are different ways of representing the inner structure of sentences: a phrase-structure (constituent) parsing represents the nesting of multi-word constituents, whereas a dependency parsing represents dependencies between individual words. In addition, a typed dependency graph labels dependencies with grammatical relations.

Different variants of the Stanford typed dependency representation are available in the dependency parsing system, including a collapsed representation where dependencies involving prepositions, conjuncts, as well as information about the referent of relative clauses, are collapsed to get direct dependencies between content words. This collapsed representation can simplify the relation

Fig. 2. Graph-based model of the sentence: "Mary is reading a book on Semantic Web".

This graph-based model can be expressed by a set of binary
Information extraction with Inductive Logic Programming
[Diagram: a training phase (training corpus → indexing, conceptualization, vectors enriched via a proximity matrix → classification model) and a classification phase (test document → conceptualized, indexed, enriched document → predicted class).]
Semantic classification and information retrieval. Filtering Web documents with temporal language models and learned meta-features.
OpenEdition
The S3 approach (Semantic support, Semantic descriptors, Search patterns)

[Diagram: semantic descriptors of resources are built by semantically mapping heterogeneous manufacturing information resources onto a Manufacturing Process ontology and a Process Control dictionary, with expert validation (bottom-up strategy); search patterns, as artifacts of business needs, are defined, stored in a pattern base and matched against the descriptor base to align resources with business needs (top-down strategy).]

Decision-support and adaptive information systems

Fig. 1. Learning content multimodality contextualization process
Ontology-driven personalization. Logistics system modeling, applied to hospital management.

Information resource management. Flexible processes, control of industrial processes.

Conversational agents, dialogue systems, multimodality.

Multi-agent architectures, behavior simulation and serious games for crisis management.
Virtual Reality for Training Doctors to Break Bad News

Web services and eLearning

Objectives:

IS, Web, Big Data, Documents

Matchmaking and crowdsourcing for Web service discovery. Provenance management for a trusted cloud. The citizen as a human sensor: a contribution to spatial OLAP analysis.

Univ. Recife (Brazil)
Information extraction

Find book reviews and link them to the books

Sentiment analysis, book recommendation

SVM, Z-score, CRF; graph scoring

RATINGS, POLARITY

GRAPH

RECOMMENDATION

Citation analysis
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Identifying book reviews in blogs

• Supervised "genre" classification

• Features: unigrams, location of named entities, dates

• Feature selection: Z-score threshold + random forest

120
Figure 4: ”Location” named entity distribution
Figure 5: ”Date” named entity distribution
methods to build our classifiers, and evaluate the resulting models on new test cases. The focus of our work has been on comparing the effectiveness of different inductive learning algorithms (Naive Bayes, Support Vector Machines with RBF and linear kernels) in terms of classification accuracy. We also explored alternative document representations (bag-of-words, feature selection using the z-score, named-entity repartition in the text).
6.1 Naive Bayes (NB)

In order to evaluate different classification models, we adopted the naive Bayes approach as a baseline (Zubaryeva and Savoy, 2010). The classification system has to choose between two possible hypotheses, h0 = "it is a review" and h1 = "it is not a review", selecting the class that has the maximum value according to Equation (5), where |w| indicates the number of words included in the current document and $w_j$ is the j-th word appearing in the document:

$$\arg\max_{h_i}\ P(h_i) \cdot \prod_{j=1}^{|w|} P(w_j \mid h_i) \quad (5)$$

where $P(w_j \mid h_i) = \dfrac{tf_{j,h_i}}{n_{h_i}}$.

We estimate the probabilities in Equation (5) from the relation between the lexical frequency of the word $w_j$ over the whole collection $T_{h_i}$ (denoted $tf_{j,h_i}$) and the size of the corresponding corpus.
6.2 Support Vector Machines (SVM)

SVM designates a learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems (Vapnik, 1995). The SVM method is based on the Structural Risk Minimization principle (Vapnik, 1995) from computational learning theory. In their basic form, SVMs learn linear threshold functions. Nevertheless, by a simple plug-in of an appropriate kernel function, they can be used to learn linear classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets (Joachims, 1998). The key in such classifiers is to determine the optimal boundaries between the different classes and use them for the purposes of classification (Aggarwal and Zhai, 2012). Having the vectors from the different representations presented below, we used the Weka toolkit to learn the model. This model, with the use of the linear kernel and the Radial Basis Function (RBF) kernel, sometimes allows a good level of performance to be reached at the cost of fast growth of the processing time during the learning stage (Kummer, 2012).
6.3 Results
We have used different strategies to represent eachtextual unit. First, the unigram model (Bag-of-Words) where all words are considered as features.We also used feature selection based on the nor-malized z-score by keeping the first 1000 wordsaccording to this score (after removing all wordsthat appear less than 5 times). As the third ap-proach, we suggested that the common featuresbetween the Review collection can be located inthe Named Entity distribution in the text.
Table 4: Results showing the performances ofthe classification models using different indexingschemes on the test set. The best values for theReview class are noted in bold and those forReview class are are underlined
Review Review# Model R P F-M R P F-M1 NB 65.5% 81.5% 72.6% 81.6% 65.7% 72.8%
SVM (Linear) 99.6% 98.3% 98.9% 97.9% 99.5% 98.7%SVM (RBF) 89.8% 97.2% 93.4% 96.8% 88.5% 92.5%* C = 5.0* � = 0.00185
2 NB 90.6% 64.2% 75.1% 37.4% 76.3% 50.2%SVM (Linear) 87.2% 81.3% 84.2% 75.3% 82.7% 78.8%SVM (RBF) 87.2% 86.5% 86.8% 83.1% 84.0% 83.6%* C = 32.0* � = 0.00781
3 NB 80.0% 68.4% 73.7% 54.2% 68.7% 60.6%SVM (Linear) 77.0% 81.9% 79.4% 78.9% 73.5% 76.1%SVM (RBF) 81.2% 48.6% 79.9% 72.6% 75.8% 74.1%* C = 8.0* � = 0.03125
In order to give more importance to the difference in how many times a term appears in each of the two classes, we used the normalized z-score described in Equation (4), together with the measure \gamma introduced in Equation (3):

\gamma = \frac{tf_{C_0} - tf_{C_1}}{tf_{C_0} + tf_{C_1}} \qquad (3)

The normalization measure \gamma is taken into account to compute the normalized z-score as follows:

Z_\gamma(w_i \mid C_j) =
\begin{cases}
Z(w_i \mid C_j)\,\bigl(1 + |\gamma(w_i \mid C_j)|\bigr) & \text{if } Z > 0 \text{ and } \gamma > 0, \text{ or } Z \le 0 \text{ and } \gamma \le 0 \\
Z(w_i \mid C_j)\,\bigl(1 - |\gamma(w_i \mid C_j)|\bigr) & \text{if } Z > 0 \text{ and } \gamma \le 0, \text{ or } Z \le 0 \text{ and } \gamma > 0
\end{cases}
\qquad (4)

Table 3 shows the 30 highest normalized Z-scores for the Review and ¬Review classes after a unigram indexing scheme was applied to the corpus. Many of these features clearly relate to the class where they predominate.
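To make Equations (3) and (4) concrete, here is a minimal Python sketch; it assumes raw term counts per class and uses the standard binomial z-score as the base Z(w|C), since the extract does not repeat its exact definition here, so the function names and inputs are illustrative rather than the paper's own code.

```python
import math

def z_score(tf, n_class, p_corpus):
    # Binomial z-score of a term in one class: observed count tf versus its
    # expectation n_class * p_corpus under the corpus-wide term probability.
    sd = math.sqrt(n_class * p_corpus * (1.0 - p_corpus))
    return (tf - n_class * p_corpus) / sd if sd > 0 else 0.0

def gamma(tf_c0, tf_c1):
    # Eq. (3): contrast between the frequencies of a term in the two classes.
    return (tf_c0 - tf_c1) / (tf_c0 + tf_c1)

def normalized_z_score(z, g):
    # Eq. (4): amplify Z when its sign agrees with gamma, damp it otherwise.
    if (z > 0 and g > 0) or (z <= 0 and g <= 0):
        return z * (1.0 + abs(g))
    return z * (1.0 - abs(g))

# Toy usage: a term seen 42 times in one class, 3 times in the other.
z = z_score(tf=42, n_class=10000, p_corpus=0.001)
print(normalized_z_score(z, gamma(42, 3)))
```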
Table 3: Distribution of the 30 highest normalized Z-scores across the corpus.

#  Feature         Z-score   #  Feature          Z-score
1  abandonne       30.14     16 winter            9.23
2  seront          30.00     17 cleo              8.88
3  biographie      21.84     18 visible           8.75
4  entranent       21.20     19 fondamentale      8.67
5  prise           21.20     20 david             8.54
6  sacre           21.20     21 pratiques         8.52
7  toute           20.70     22 signification     8.47
8  quitte          19.55     23 01                8.38
9  dimension       15.65     24 institutionnels   8.38
10 les             14.43     25 1930              8.16
11 commandement    11.01     26 attaques          8.14
12 lie             10.61     27 courrier          8.08
13 construisent    10.16     28 moyennes          7.99
14 lieux           10.14     29 petite            7.85
15 garde            9.75     30 adapted           7.84
Our training corpus contains 106 911 words obtained with the Bag-of-Words approach. We selected all tokens (features) that appear more than 5 times in each class; the goal is to design a method capable of selecting terms that clearly belong to one genre of documents. This yields a vector space of 5 957 words (features). After computing the normalized z-score of all features, we kept the first 1 000 features according to this score.
5.3 Using Named Entity (NE) distribution as features

Most approaches involve removing irrelevant descriptors. In this section, we describe a new approach to better represent the documents in the context of this study. The purpose is to find elements that characterize the Review class.
After a linguistic and statistical corpus analysis, we identified some common characteristics (illustrated in Figures 3, 4 and 5). The bibliographical reference of the reviewed book, or some of its elements (title, author(s), date), often appears in the title of the review, as in the following example:

[...] <title level="a" type="main"> Dean R. Hoge, Jacqueline E. Wenger, <hi rend="italic"> Evolving Visions of the Priesthood. Changes from Vatican II to the Turn of the New Century </hi> </title> <title type="sub"> Collegeville (MIN), Liturgical Press, 2003, 226 p. </title> [...]

In the ¬Review class we found scientific articles. In those documents, a bibliography section, containing authors' names, locations, dates, etc., is generally present at the end of the text; in the Review class, however, this section is quite often absent. Based on this analysis, we tagged all documents of each class with the named-entity recognition tool TagEN (Poibeau, 2003). We explore the distribution of three named-entity types ("authors' names", "locations" and "dates") in the text after removing all XML/HTML tags: each text is divided into 10 parts (the size of each part = total number of words / 10), and the distribution ratio of each named entity in each part is used as a feature to build the new document representation, giving a set of 30 features.
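A minimal sketch of this 30-dimensional representation, assuming the per-token NE tags come from a NER tool such as TagEN; the function name and tag values are illustrative.

```python
def ne_distribution_features(ne_tags, n_parts=10):
    # ne_tags: one tag per token ("person", "location", "date" or "O"),
    # as produced by a NER tool after stripping XML/HTML tags.
    types = ("person", "location", "date")
    part_size = max(1, len(ne_tags) // n_parts)
    counts = {t: [0] * n_parts for t in types}
    for i, tag in enumerate(ne_tags):
        if tag in types:
            counts[tag][min(i // part_size, n_parts - 1)] += 1
    features = []
    for t in types:  # 3 entity types x 10 parts = 30 features
        total = sum(counts[t]) or 1
        features.extend(c / total for c in counts[t])  # distribution ratios
    return features

# Toy usage: a 20-token text with a person at the start and a date at the end.
tags = ["person"] + ["O"] * 18 + ["date"]
print(len(ne_distribution_features(tags)))  # 30
```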
[Figure 3: "Person" named entity distribution]
6 Experiments
In this section we describe the results of experiments on a collection of documents from Revues.org and the Web. We use supervised learning methods to build our classifiers and evaluate the resulting models on new test cases. The focus of our work is on comparing the effectiveness of different inductive learning algorithms (Naive Bayes, Support Vector Machines with RBF and linear kernels) in terms of classification accuracy. We also explore alternative document representations (bag-of-words, feature selection using the z-score, named-entity repartition in the text).

[Figure 4: "Location" named entity distribution]
[Figure 5: "Date" named entity distribution]
6.1 Naive Bayes (NB)

In order to evaluate different classification models, we adopted the naive Bayes approach as a baseline (Zubaryeva and Savoy, 2010). The classification system has to choose between two hypotheses, h_0 = "it is a review" and h_1 = "it is not a review", selecting the class with the maximum value according to Equation (5), where |w| is the number of words in the current document:

\arg\max_{h_i} \; P(h_i) \prod_{j=1}^{|w|} P(w_j \mid h_i) \qquad (5)

with P(w_j \mid h_i) = \frac{tf_{j,h_i}}{n_{h_i}}: the probability of word w_j under hypothesis h_i is estimated as the ratio between its lexical frequency tf_{j,h_i} in the corpus of class h_i and the size n_{h_i} of that corpus.
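A minimal sketch of the decision rule of Equation (5), computed in log space for numerical stability; the smoothing epsilon is an assumption, since the extract does not say how zero frequencies are handled.

```python
import math
from collections import Counter

class ReviewNB:
    def __init__(self, class_docs, priors, eps=1e-9):
        # class_docs: {hypothesis: list of tokenized documents};
        # priors: {hypothesis: P(h)}; eps avoids log(0) for unseen words.
        self.priors = priors
        self.tf = {h: Counter(w for doc in docs for w in doc)
                   for h, docs in class_docs.items()}
        self.n = {h: sum(c.values()) for h, c in self.tf.items()}
        self.eps = eps

    def classify(self, words):
        # Eq. (5): arg max over h of log P(h) + sum_j log P(w_j | h),
        # with P(w | h) estimated as tf(w, h) / n_h.
        def score(h):
            return math.log(self.priors[h]) + sum(
                math.log(self.tf[h][w] / self.n[h] + self.eps) for w in words)
        return max(self.priors, key=score)

nb = ReviewNB({"review": [["great", "book"]], "not-review": [["results", "show"]]},
              priors={"review": 0.5, "not-review": 0.5})
print(nb.classify(["great", "results", "book"]))
```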
6.2 Support Vector Machines (SVM)

SVM designates a learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems (Vapnik, 1995). The SVM method is based on the Structural Risk Minimization principle from computational learning theory. In their basic form, SVMs learn linear threshold functions; nevertheless, by simply plugging in an appropriate kernel function, they can learn non-linear classifiers such as radial basis function (RBF) networks and three-layer sigmoid neural nets (Joachims, 1998). The key in such classifiers is to determine the optimal boundaries between the different classes and to use them for classification (Aggarwal and Zhai, 2012). Given the vectors from the different representations presented above, we used the Weka toolkit to learn the models. With a linear kernel or a radial basis function (RBF) kernel, this model can reach a good level of performance, at the cost of rapidly growing processing time during the learning stage (Kummer, 2012).
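The extract uses Weka; purely as an illustration, here is a hedged scikit-learn equivalent with the RBF hyperparameters reported for indexing scheme 1 in Table 4 (C = 5.0, γ = 0.00185); the toy 2-D vectors stand in for the real document representations.

```python
from sklearn.svm import SVC

# Toy vectors standing in for the document representations of Section 6.3.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_train = ["review", "review", "not-review", "not-review"]

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=5.0, gamma=0.00185).fit(X_train, y_train)

print(linear_svm.predict([[0.7, 0.3]]))  # expected: ["review"]
```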
6.3 Results

We used different strategies to represent each textual unit. First, the unigram model (Bag-of-Words), where all words are considered as features. Second, feature selection based on the normalized z-score, keeping the first 1 000 words according to this score (after removing all words that appear fewer than 5 times). As the third approach, we hypothesized that the features shared across the Review collection can be located in the named-entity distribution in the text.
Table 4: Performance of the classification models using different indexing schemes on the test set. The best values for the Review class are noted in bold and those for the ¬Review class are underlined.

                      Review                  ¬Review
#  Model          R      P      F-M      R      P      F-M
1  NB             65.5%  81.5%  72.6%    81.6%  65.7%  72.8%
   SVM (Linear)   99.6%  98.3%  98.9%    97.9%  99.5%  98.7%
   SVM (RBF) *    89.8%  97.2%  93.4%    96.8%  88.5%  92.5%
2  NB             90.6%  64.2%  75.1%    37.4%  76.3%  50.2%
   SVM (Linear)   87.2%  81.3%  84.2%    75.3%  82.7%  78.8%
   SVM (RBF) *    87.2%  86.5%  86.8%    83.1%  84.0%  83.6%
3  NB             80.0%  68.4%  73.7%    54.2%  68.7%  60.6%
   SVM (Linear)   77.0%  81.9%  79.4%    78.9%  73.5%  76.1%
   SVM (RBF) *    81.2%  48.6%  79.9%    72.6%  75.8%  74.1%

* RBF hyperparameters: scheme 1: C = 5.0, γ = 0.00185; scheme 2: C = 32.0, γ = 0.00781; scheme 3: C = 8.0, γ = 0.03125.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Sentiment analysis of reviews
• Statistical metrics (PMI, Z-score, odds ratio…)
• Combined with linguistic resources
121
The Z_score is computed for each term t_i in a class C_j (t_ij) from its term relative frequency tfr_ij in that class, its mean (mean_i), which is the term probability over the whole corpus multiplied by n_j, the number of terms in class C_j, and the standard deviation (sd_i) of term t_i over the underlying corpus (see Equations (1) and (2)):

Z\_score(t_{ij}) = \frac{tfr_{ij} - mean_i}{sd_i} \qquad (1)

Z\_score(t_{ij}) = \frac{tfr_{ij} - n_j \, P(t_i)}{\sqrt{n_j \, P(t_i)\,(1 - P(t_i))}} \qquad (2)

A term with a salient frequency in one class, in comparison to the others, will have a salient Z_score. Z_score was exploited for sentiment analysis by (Zubaryeva and Savoy 2010): they chose a threshold (> 2) to select the terms whose Z_score exceeds it, then used a logistic regression to combine these scores. We use Z_scores as additional features for classification because tweets are too short: many tweets do not contain any word with a salient Z_score. Figures 1, 2 and 3 show the distribution of Z_score over each class; the majority of terms have a Z_score between -1.5 and 2.5 in each class, and the rest are either very frequent (> 2.5) or very rare (< -1.5). A negative value means that the term is less frequent in this class than in the other classes. Table 1 shows the ten terms with the highest Z_score in each class. We tested different values for the threshold; the best results were obtained with a threshold of 3.
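A minimal sketch of Equation (2), assuming tokenized tweets grouped by class; the function and variable names are illustrative, not the authors' code.

```python
import math
from collections import Counter

def class_z_scores(class_tokens, corpus_tokens):
    # Eq. (2): compare the count of each term in one class with its
    # expectation n_j * P(t_i) under the corpus-wide term distribution.
    corpus_counts = Counter(corpus_tokens)
    corpus_size = len(corpus_tokens)
    n_j = len(class_tokens)  # number of terms in class C_j
    scores = {}
    for term, tf in Counter(class_tokens).items():
        p = corpus_counts[term] / corpus_size  # P(t_i) over the whole corpus
        sd = math.sqrt(n_j * p * (1.0 - p))
        scores[term] = (tf - n_j * p) / sd if sd > 0 else 0.0
    return scores

positive = "love love happy great".split()
negative = "bad sad hate not".split()
print(class_z_scores(positive, positive + negative)["love"])
```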
Table 1: The ten terms with the highest Z_score in each class.

Positive  Z_score   Negative  Z_score   Neutral   Z_score
Love      14.31     Not       13.99     Httpbit   6.44
Good      14.01     Fuck      12.97     Httpfb    4.56
Happy     12.30     Don't     10.97     Httpbnd   3.78
Great     11.10     Shit       8.99     Intern    3.58
Excite    10.35     Bad        8.40     Nov       3.45
Best       9.24     Hate       8.29     Httpdlvr  3.40
Thank      9.21     Sad        8.28     Open      3.30
Hope       8.24     Sorry      8.11     Live      3.28
Cant       8.10     Cancel     7.53     Cloud     3.28
Wait       8.05     stupid     6.83     begin     3.17
- Sentiment Lexicon Features (POL): We used two sentiment lexicons, the MPQA Subjectivity Lexicon (Wilson, Wiebe et al. 2005) and Bing Liu's Opinion Lexicon, created by (Hu and Liu 2004) and augmented in many later works. We extract the number of positive, negative and neutral words in each tweet according to these lexicons. Bing Liu's lexicon only contains negative and positive annotations, while the Subjectivity lexicon also contains neutral ones.
- Part Of Speech (POS): We annotate each word in the tweet with its POS tag, then count the adjectives, verbs, nouns, adverbs and connectors in each tweet.
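A hedged sketch of the combined feature vector (Z features, POL lexicon counts, POS counts); the lexicon sets, the coarse tag names and the salience threshold are assumptions for illustration, not the paper's exact setup.

```python
def tweet_features(tokens, pos_tags, z_by_class, pos_lex, neg_lex, threshold=3.0):
    feats = {}
    # Z features: how many tweet terms are salient (Z_score > threshold)
    # in each class, using per-class scores as computed above.
    for cls, scores in z_by_class.items():
        feats["z_" + cls] = sum(1 for t in tokens if scores.get(t, 0.0) > threshold)
    # POL features: counts of lexicon-positive and lexicon-negative words.
    feats["pol_pos"] = sum(1 for t in tokens if t in pos_lex)
    feats["pol_neg"] = sum(1 for t in tokens if t in neg_lex)
    # POS features: counts of coarse part-of-speech tags (tag set assumed).
    for tag in ("ADJ", "VERB", "NOUN", "ADV", "CONJ"):
        feats["pos_" + tag] = sum(1 for p in pos_tags if p == tag)
    return feats

print(tweet_features(["love", "this", "book"], ["VERB", "DET", "NOUN"],
                     {"positive": {"love": 14.31}}, {"love"}, set()))
```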
4 Evaluation

4.1 Data collection
We used the data sets provided at SemEval 2013 and 2014 for subtask B of sentiment analysis in Twitter (Rosenthal, Ritter et al. 2014; Wilson, Kozareva et al. 2013). Participants were provided with training tweets annotated as positive, negative or neutral, which we downloaded using a given script. Of the 9 646 tweets, we could only download 8 498 because of protected profiles and deleted tweets. We then used the development set, containing 1 654 tweets, to evaluate our methods, and finally combined the development set with the training set to build a new model that predicted the labels of the 2013 and 2014 test sets.
4.2 Experiments

Official results. The system we submitted to the SemEval evaluation obtained 46.38% and 52.02% on the 2013 and 2014 test sets respectively. These results are not correct because of a software bug discovered after the submission deadline; the corrected results are therefore reported below as non-official. The submitted runs came from a classifier trained on all the features of Section 3, but because of an index-shifting error the test set was represented by all the features except the terms.

Non-official results. We ran various experiments using the features presented in Section 3 with a Multinomial Naive Bayes model. We first constructed a feature vector of tweet terms, which gave 49% and 46% on the 2013 and 2014 test sets respectively. We then augmented this original vector with the Z_score
features, which improved the performance by 6.5% and 10.9%, and then with the pre-polarity (POL) features, which also improved the f-measure, by 4% and 6%; extending the vector with POS tags, in contrast, decreased the f-measure. We also tested all combinations of these features; Table 2 reports the results of each combination. POS tags are not useful in any of the experiments, and the best result is obtained by combining Z_score and pre-polarity features. Z_score features improve the f-measure significantly and outperform the pre-polarity features.
[Figure 1: Z_score distribution in the positive class]
[Figure 2: Z_score distribution in the neutral class]
[Figure 3: Z_score distribution in the negative class]
Table 2: Average f-measures for the positive and negative classes on the SemEval 2013 and 2014 test sets.

Features          2013   2014
Terms             49.42  46.31
Terms+Z           55.90  57.28
Terms+POS         43.45  41.14
Terms+POL         53.53  52.73
Terms+Z+POS       52.59  54.43
Terms+Z+POL       58.34  59.38
Terms+POS+POL     48.42  50.03
Terms+Z+POS+POL   55.35  58.58

We repeated all the previous experiments after using a Twitter dictionary that expands each tweet with the expressions corresponding to its emoticons and abbreviations. The results in Table 3 show that this dictionary improves the f-measure across all experiments; the best results are again obtained by combining Z_score and pre-polarity features.

Table 3: Average f-measures for the positive and negative classes on the SemEval 2013 and 2014 test sets after using a Twitter dictionary.

Features          2013   2014
Terms             50.15  48.56
Terms+Z           57.17  58.37
Terms+POS         44.07  42.64
Terms+POL         54.72  54.53
Terms+Z+POS       53.20  56.47
Terms+Z+POL       59.66  61.07
Terms+POS+POL     48.97  51.90
Terms+Z+POS+POL   55.83  60.22
5 Conclusion

In this paper we tested the impact of a Twitter dictionary, sentiment lexicons, Z_score features and POS tags on the sentiment classification of tweets. We extended the feature vector of tweets with all these features; we proposed the new Z_score features and demonstrated that they can improve performance. We think that Z_score can be exploited in different ways to improve sentiment analysis; we plan to test it on other types of corpora and with other methods for combining these features.
[Hamdan, Béchet & Bellot, SemEval 2014]
http://sentiwordnet.isti.cnr.it
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 122
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 123
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 124
http://reviewofbooks.openeditionlab.org
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Linking Contents by Analyzing the References
In books: no common stylesheet (or many stylesheets, poorly respected…)
Our proposal:
1) Searching for references in the document / footnotes (Support Vector Machines)
2) Annotating the references (Conditional Random Fields)
BILBO : Our (open-source) software for Reference Analysis
125
Google Digital Humanities Research Awards (2012)
Annotation
DOI search (Crossref)
OpenEdition Journals: more than 1.5 million references analyzed
Test: http://bilbo.openeditionlab.org
Sources: http://github.com/OpenEdition/bilbo
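As an illustration of step 2 (field annotation with a CRF), here is a hedged sketch using the third-party sklearn-crfsuite package; BILBO's own features, tag set and training data differ (see the sources linked above), so the feature names, labels and toy reference below are all assumptions for the example.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    # Surface clues typical of bibliographic fields: capitalization,
    # digits/years, and relative position in the reference string.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "looks_like_year": tok.isdigit() and len(tok) == 4,
        "position": i / len(tokens),
    }

# One toy training reference with TEI-like field labels (illustrative tag set).
tokens = ["Poibeau", ",", "T.", "(", "2003", ")", ",", "Extraction", "automatique"]
labels = ["author", "punct", "author", "punct", "date", "punct", "punct", "title", "title"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```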
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 126
Test: http://bilbo.openeditionlab.org
Sources: http://github.com/OpenEdition/bilbo
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 127
Ollagnier, A., Fournier, S., & Bellot, P. (2016). A Supervised Approach for Detecting Allusive Bibliographical References in Scholarly Publications. In WIMS (p. 36).
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 128
PhD thesis of Anaïs Ollagnier (supervised by P. Bellot / S. Fournier)
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 129
http://sentiment-analyser.openeditionlab.org/aboutsemeval
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
HYBRID SYSTEMS
130
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 131
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 132
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
CONCLUSION
133
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Conclusion
— A very large number of (hybrid) approaches
— Collaborative filtering and exploitation of interaction history
— Content analysis
— Exploitation of behavioral data and explicit information
— Exploitation of social networks
— -> combine everything in a single learning model? Which objective function to optimize?
— Strong links with other fields
— Statistical methods, data and graph mining, machine learning…
— Information retrieval (isn't it recommendation too?), natural language processing, image/signal analysis, ergonomics and interaction…
— The approaches must be chosen, but so must the data
— Usages and contexts
— Privacy preservation
134
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 135
https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 136
http://lenskit.org
Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John T. Riedl. 2011. Rethinking the Recommender Research Ecosystem: Reproducibility, Openness, and LensKit. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). ACM, New York, NY, USA, 133-140. DOI=10.1145/2043932.2043958.
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 137
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 138
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 139
Çoba, L., & Zanker, M. rrecsys: an R-package for prototyping recommendation algorithms, RecSys 2016.
P.Bellot(AMU-CNRS,LSIS-OpenEdition)
Challenges
140
P.Bellot(AMU-CNRS,LSIS-OpenEdition) 141
http://lab.hypotheses.org
Thank you for your attention :-)