60
ALGORITHMES D’INFÉRENCE DE PROFILS UTILISATEURS SUR LES RÉSEAUX SOCIAUX, UTILISANT UN GRAPHE SOCIAL PARTIEL THÈSE PRÉSENTÉE À LA FACULTÉ DES ÉTUDES SUPÉRIEURES ET DE LA RECHERCHE EN VUE DE L’OBTENTION DE LA MAÎTRISE ÈS SCIENCES EN INFORMATIQUE RAÏSSA YAPAN DOUGNON DÉPARTEMENT D’INFORMATIQUE FACULTÉ DES SCIENCES UNIVERSITÉ DE MONCTON Septembre 2015

RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

ALGORITHMES D’INFÉRENCE DE PROFILS UTILISATEURS SUR LES RÉSEAUX

SOCIAUX, UTILISANT UN GRAPHE SOCIAL PARTIEL

THÈSE PRÉSENTÉE À LA FACULTÉ DES ÉTUDES SUPÉRIEURES ET DE LA

RECHERCHE EN VUE DE L’OBTENTION DE LA MAÎTRISE ÈS SCIENCES EN

INFORMATIQUE

RAÏSSA YAPAN DOUGNON

DÉPARTEMENT D’INFORMATIQUE

FACULTÉ DES SCIENCES

UNIVERSITÉ DE MONCTON

Septembre 2015

Page 2: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

COMPOSITION DU JURY

Président du jury :

Julien Chiasson

Professeur d’informatique, Université de MonctonExaminateur externe :

Usef Faghihi

Professeur d’informatique, Cameron University, États-UnisExaminateur interne :

Mustapha Kardouchi

Professeur d’informatique, Université de MonctonDirecteur de thèse :

Philippe Fournier-Viger

Professeur d’informatique, Université de Moncton

Page 3: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

REMERCIEMENTS

Mes remerciements les plus sincères vont à Philippe Fournier-Viger, mon directeur de thèse.

Tout au long de ce travail, Philippe à toujours été disponible et enthousiaste, et il m’a beaucoup

aidée. Il a su me guider, me soutenir et surtout m’a encouragée dans ma recherche. Merci Philippe,

pour ta générosité en matière de formation et d’encadrement en recherche.

À mes parents Wifried et Angèle DOUGNON pour leur amour et leur attachement indéfectible

à ma réussite sur le chemin de la vie.

À mes soeurs et mon frère, dont j’espère atteindre le seuil de vos espérances. Merci pour le

soutien moral et l’encouragement que vous m’avez accordés.

Un profond Merci à Harouna M Coulibaly, pour son amour constant, inconditionnel ainsi que

son support et encouragement tout au long de mes études.

À mon oncle et ami Hamadoun Baba Touré qui a su m’apporter son soutien aux moments

propices, mes reconnaissances et remerciements chaleureux à toi.

Finalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon

travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger.

ii

Page 4: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

TABLE DES MATIÈRES

RÉSUMÉ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

RÉSUMÉ (anglais) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

INTRODUCTION GÉNÉRALE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

CHAPITRE IPGPI : ALGORITHME D’INFÉRENCE DE PROFILS UTILISATEURS SUR LES RÉ-SEAUX SOCIAUX EN UTILISANT UN GRAPHE SOCIAL PARTIEL . . . . . . . . . . 5

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Travaux antérieurs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Définition du problème . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 L’algorithme proposé . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.1 Inférence de profils utilisateurs en utilisant les noeuds et les liens d’amitiés . . . 12

4.2 Inférence de profils utilisateurs en utilisant les groupes et les publications . . . . 13

4.3 Inférence de profils utilisateurs en utilisant les noeuds, les liens, les groupes etles publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5 Évaluation experimentale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

CHAPITRE IIPGPI+ : ALGORITHME AMÉLIORÉ D’INFÉRENCE DE PROFILS UTILISATEURSSUR LES RÉSEAUX SOCIAUX À PARTIR D’UN GRAPHE SOCIAL PARTIEL . . . . . 25

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2 Travaux antérieurs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 Définition du problème . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 L’algorithme proposé PGPI+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.1 L’algorithme PGPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 Optimisation de la couverture et de l’exactitude . . . . . . . . . . . . . . . . . . 33

4.3 Extension de PGPI+ pour évaluer les attributs numériques . . . . . . . . . . . . 34

4.4 Extension de PGPI+ pour évaluer la certitude des prédictions . . . . . . . . . . . 35

5 Évaluation expérimentale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

CONCLUSION GÉNÉRALE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

RÉFÉRENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Page 5: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

TABLE OF CONTENTS

ABSTRACT (French) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

GENERAL INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

CHAPTER IPGPI: INFERRING USER PROFILES IN SOCIAL NETWORKS USING A PAR-TIAL SOCIAL GRAPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 The Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.1 Inferring user profiles using nodes and links . . . . . . . . . . . . . . . . . . . . 12

4.2 Inferring user profiles using groups and publications . . . . . . . . . . . . . . . 13

4.3 Inferring user profiles using nodes, links, groups and publications . . . . . . . . 15

5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

CHAPTER IIPGPI+: ACCURATE SOCIAL NETWORK USER PROFILING UNDER THE CON-STRAINT OF A PARTIAL SOCIAL GRAPH . . . . . . . . . . . . . . . . . . . . . . . 25

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 The proposed PGPI+ algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.1 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 Optimizations to improve accuracy and coverage . . . . . . . . . . . . . . . . . 33

4.3 Extension to handle numerical attributes . . . . . . . . . . . . . . . . . . . . . . 34

4.4 Extension to evaluate the certainty of predictions . . . . . . . . . . . . . . . . . 35

5 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

GENERAL CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Page 6: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

LISTE DES TABLEAUX

I-1 Best accuracy/coverage results for each attribute . . . . . . . . . . . . . . . . . . . . 24

II-1 Best accuracy results for nominal attributes on Facebook . . . . . . . . . . . . . . . 41

II-2 Best accuracy results for nominal attributes on Pokec . . . . . . . . . . . . . . . . . 42

II-3 Average error and standard deviation for numeric attributes . . . . . . . . . . . . . . 43

II-4 Best accuracy for nominal attributes using the full social graph . . . . . . . . . . . . 44

v

Page 7: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

LISTE DES FIGURES

I-1 Product of accuracy and coverage w.r.t. number of accessed facts . . . . . . . . . . . 17

I-2 Accuracy and coverage w.r.t. number of accessed facts . . . . . . . . . . . . . . . . 18

I-3 Number of accessed facts w.r.t. maxFacts . . . . . . . . . . . . . . . . . . . . . . . 19

I-4 Product of accuracy and coverage w.r.t. ratioFacts . . . . . . . . . . . . . . . . . . 20

I-5 Product of accuracy and coverage (left) and number of accessed facts (right) w.r.t.maxDistance for PGPI-N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

II-1 Accuracy w.r.t. number of accessed facts for nominal attributes . . . . . . . . . . . . 39

II-2 Average error, standard deviation and coverage w.r.t. certainty for the "year" attribute 44

II-3 Average error, standard deviation and coverage w.r.t. certainty for the "age" attribute . 45

vi

Page 8: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

LISTE DES ALGORITHMES

1 The PGPI-N algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 The PGPI-G algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4 The PGPI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5 Certainty assessment of a nominal attribute value . . . . . . . . . . . . . . . . . . . 37

vii

Page 9: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

RÉSUMÉ

Les réseaux sociaux en ligne occupent une place importante dans la société d’aujourd’hui. Un

problème important pour le ciblage publicitaire dans les réseaux sociaux est que les utilisateurs y

divulguent souvent peu d’informations personnelles. Pour pallier ce problème, un certain nombre

d’algorithmes ont été proposés pour inférer automatiquement les profils des utilisateurs sur la base

d’informations publiques. Cependant, ces algorithmes présentent néanmoins tous au moins deux

des quatre limites qui sont : (1) ils supposent que le graphe social complet est disponible, (2) ils ne

pas exploitent le contenu riche disponible sur certains réseaux sociaux tel que l’appartenance à des

groupes et les "J’aime", (3) ils traitent les attributs numériques comme des attributs nominaux et (4)

ils n’évaluent pas la certitude des prédictions. Afin de pallier (1) et (2), nous présentons un algo-

rithme appelé Partial Graph Profile Inference (PGPI). Cet algorithme permet l’inférence des profils

utilisateurs en l’absence d’un graphe social complet. PGPI ne requiert aucun entraînement pour être

appliqué, et permet de contrôler finement le compromis entre la quantité d’informations à laquelle

ont peut accédée pour faire une inférence et l’exactitude de l’inférence. De plus, PGPI est conçu

pour utiliser des informations variées sur les utilisateurs (liens d’amitié, profils utilisateurs, apparte-

nance à des groupes d’intérêt, et les « vues » et les « J’aime ») présents sur certains réseaux sociaux

comme Facebook. Afin de pallier les limites (3) et (4), nous présentons une version améliorée de

PGPI appelé PGPI+ (Partial Graph Profile Inference+). Cette version ajoute la prise en compte des

attributs numériques à celle des attributs nominaux. Il permet également l’évaluation de la certitude

des prédictions. Une évaluation expérimentale avec plus de 30,000 profils utilisateurs des réseaux

sociaux Facebook et Pokec montrent que PGPI/PGPI+ infèrent les profils d’utilisateurs avec une

plus grande exactitude que plusieurs approches données dans la littérature. Nos deux approches

accédent également à moins d’information du graphe social. De plus, un résultat intéressant est que

certains attributs comme le statut (étudiant/professeur) et le genre peuvent être prédits avec plus de

95% d’exactitude.

Mots clés : réseaux sociaux, inférence, profils utilisateurs, graphe partiel

viii

Page 10: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

ABSTRACT

Online social networks have an important place in today’s society. An important problem for

ad targeting on social networks is that users often do not reveal much personal information. To

address this issue, various algorithms have been proposed to automatically infer user profiles using

the publicly disclosed information. Nevertheless, all these algorithms suffer from at least two of

the following limitations: (1) to assume that the full social graph is available for training, (2) to not

exploit the rich information available on some social networks such as group memberships, "views"

and "likes", (3) to treat numeric attributes as nominal attributes and (4) to not assess the certainty

of predictions.

To address limitations 1 and 2, we propose an algorithm named Partial Graph Profile Inference

(PGPI). The PGPI algorithm can accurately infer user profiles under the constraint of a partial social

graph. PGPI does not require training, and it lets the user select the trade-off between the amount

of information used for inference and the accuracy of predictions. Besides, PGPI is designed to use

rich information about users when available: user profiles, friendship links, group memberships,

and the "views" and "likes" from social networks such as Facebook.

Moreover, to also address limitations 3 and 4, we present an improved version of PGPI named

PGPI+ (Partial Graph Profile Inference +). PGPI+ consider numeric attributes in addition to nom-

inal attributes, and can evaluate the certainty of its predictions. An experimental evaluation with

more than 30,000 user profiles from the Facebook and Pokec social networks shows that PG-

PI/PGPI+ predicts user profiles with an accuracy higher than several start-of-the-art algorithms,

and by accessing less information from the social graph. Furthermore, an interesting result is that

some profile attributes such as the status (student/professor) and genre can be predicted with more

than 95 % accuracy using PGPI+.

Keywords: social networks, inference, user profiles, partial graph

ix

Page 11: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

INTRODUCTION GÉNÉRALE

Dans la société d’aujourd’hui, les réseaux sociaux en ligne sont extrêmement populaires. Une

étude récente indique qu’il y a 3.025 milliards d’internautes à travers le monde, que 2.060 milliards

sont actifs sur les réseaux sociaux, soit 68% des internautes et 28% de la population mondiale, et

qu’un utilisateur de réseaux sociaux y passe en moyenne deux heures par jour [15]. Il existe divers

types de réseaux sociaux comme les réseaux d’amitié (notamment, Facebook et Google plus), les

réseaux professionnels (par exemple, ResearchGate et Linkedin) et les réseaux liés à des intérêts

spécifiques (par exemple, Flickr).

Un problème important pour le ciblage publicitaire dans les réseaux sociaux est due au fait

que les utilisateurs partagent généralement peu d’informations publiquement [1, 12]. Afin de re-

médier à ce problème, plusieurs chercheurs du domaine de l’extraction de connaissances dans les

réseaux sociaux (social network mining) s’intéressent maintenant au développement d’algorithmes

automatiques pour inférer des profils utilisateurs détaillés en utilisant les informations divulguées

publiquement par les utilisateurs [3, 6, 8, 11, 13, 14]. L’inférence du profil d’un utilisateur consiste

à découvrir les valeurs des attributs de son profil tels que : son opinion politique, son lieu de rési-

dence, son âge, ses traits de personnalité, son orientation sexuelle, son origine ethnique, etc.

Afin de résoudre ce problème, les chercheurs ont proposé plusieurs algorithmes d’inférence

de profil utilisateur dont les classificateurs relationnels naïfs de Bayes [14], la propagation d’éti-

quettes [11, 13], le vote majoritaire [6], la régression linéaire [12], l’allocation latente de Diri-

chlet [3] et la détection de communautés [16]. Ces différents algorithmes permettent la prédiction

des attributs privés des profils utilisateurs. Cependant, ils souffrent tous d’au moins une des quatre

limites importantes suivantes.

La première limite vient du fait que la grande majorité des algorithmes supposent avoir à leur

disposition le graphe complet du réseau social ou une grande partie pour l’entrainement [11]. Ceci

est une hypothèse pratique en recherche lorsque des chercheurs souhaitent comparer la performance

d’algorithmes en utilisant un grand ensemble de données d’entrainement. Or, dans la réalité, le

1

Page 12: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

graphe social complet est généralement indisponible ou peut être très coûteux à obtenir ou à mettre à

jour [3,8]. Ainsi, plusieurs de ces algorithmes peuvent être inadaptés ou fournissent des prédictions

avec une exactitude faible. Quelques approches comme le vote majoritaire [6] ne supposent pas

un graphe social complet. Par contre, elles se limitent à l’exploration des voisins immédiats d’un

noeud et ne laissent pas la possibilité à l’utilisateur de décider du compromis entre le nombre de

noeuds auxquels on peut accéder dans le graphe social et l’exactitude des prédictions, ce qui peut

mener à des prédictions ayant une faible exactitude.

Concernant la seconde limite, nous remarquons que la plupart de ces algorithmes n’utilisent

ou ne considèrent pas l’information riche souvent disponible sur les réseaux sociaux. D’une part,

plusieurs de ces algorithmes considèrent les liens entre les utilisateurs et les différents attributs des

profils utilisateurs, mais ils ne considèrent pas d’autres informations sur l’utilisateur comme les

adhésions à des groupe, et les "J’aime" et les "vues" qui sont disponibles sur les réseaux sociaux

comme Facebook [1, 6, 9, 11, 13, 14, 16]. D’autre part, plusieurs de ces algorithmes considèrent la

similarité entre les profils utilisateurs et les informations comme les "J’aime", mais pas les liens

d’amitié entre les utilisateurs et d’autres types d’information disponibles sur les réseaux sociaux

[3, 12, 20].

La troisième limite est due au fait que plusieurs de ces algorithmes n’offre pas de procédure

d’inférence spécifique aux attributs numériques. Ils traitent donc les attributs numériques (par

exemple, l’âge) comme un attribut nominal [1, 6], ce qui peut entrainer une perte d’exactitude

pour l’inférence du profil d’un utilisateur. Certains de ces algorithmes sont spécialement conçus

pour traiter des attributs numériques, mais supposent tout de même l’entraînement avec un graphe

social complet ou une grande partie du graphe, ce qui n’est pas pratique [7, 11, 12, 16].

La quatrième limite importante est la non-évaluation de la certitude des prédictions. En effet,

peu d’approches évaluent la certitude des prédictions alors que cette information est essentielle

pour déterminer si une prédiction est fiable. Par exemple, dans le contexte du ciblage publicitaire,

si la certitude qu’une prédiction soit correcte est faible, il est préferable de ne pas utiliser la prédic-

tion pour le ciblage publicitaire [4], car une publicité ciblant une différente catégorie d’utilisateurs

pourrait être présentée.

2

Page 13: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Objectif. Le principal objectif de ce travail est de proposer un nouvel algorithme pour l’infé-

rence de profils utilisateurs dans les réseaux sociaux qui pallie les limites susmentionnées, c’est-à-

dire (1) qui laisse l’utilisateur décider du compromis entre la quantité d’information à utiliser pour

une prédiction et son exactitude (2) qui tient compte de l’information riche qu’on trouve dans les

réseaux sociaux comme les adhésions à des groupes, et les "J’aime" et les "vues" de Facebook, (3)

le traitement des attributs numériques et (4) l’évaluation de la certitude des prédictions. De plus, il

est souhaitable que l’algorithme offre un taux d’exactitude élevé comparativement aux algorithmes

de la littérature, tout en accédant à un nombre limité de noeuds du graphe social.

Hypothèse. Ce travail repose sur les hypothèses suivantes :

— Un algorithme qui prédit les valeurs d’attributs d’un profil utilisateur dans un réseau social

en utilisant un graphe social partiel peut offrir une exactitude comparable ou supérieure à

celle de plusieurs approches de la littérature qui utilisent un graphe social complet.

— Utiliser un algorithme paresseux (sans entraînement) permettra de laisser à l’utilisateur dé-

cider de la quantité d’information à utiliser pour l’inférence et l’exactitude des prédictions.

— Utiliser une information riche sur les utilisateurs (information sur l’adhésion aux groupes,

les "J’aime" et les "vues") permettra d’obtenir des prédictions avec une exactitude plus

élevée, lorsque cette information est disponible.

— Traiter les attributs numériques des utilisateurs (par exemple, l’âge, le poids et la taille) de

façon différente des attributs nominaux, permettra également d’obtenir des prédictions plus

exactes.

— Un algorithme peut être conçu pour l’inférence des profils utilisateurs en exploitant un

graphe social partiel, qui permet d’évaluer la certitude des prédictions.

Organisation de la thèse. Cette thèse est une thèse par publications. Les chapitres 2 et 3

contiennent respectivement deux articles publiés dans des actes de conférences internationales d’in-

telligence artificielle.

— Le premier article, « Inferring User Profiles in Online Social Networks using a Partial Social

Graph », a été publié dans les actes de AI 2015 : The 28th Canadian Conference on Artificial

Intelligence. L’article propose un algorithme nommé PGPI qui résout les deux premières

limites mentionnées précédemment. PGPI permet d’inférer des profils utilisateurs en ex-

3

Page 14: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

ploitant un graphe social partiel et des informations riches sur les utilisateurs comme leurs

adhésions à des groupes, les "J’aime" et les "vues". Les résultats expérimentaux montrent

que PGPI offre une exactitude supérieure à plusieurs algorithmes de la littérature. De plus,

PGPI permet de prédire certains attributs tels que le statut et le genre sont avec une exacti-

tude supérieure à 90 %.

— Le deuxième article, « PGPI+ : Accurate Social Network User Profiling under the Constraint

of a Partial Social Graph » a été accepté pour publication dans les actes de MICAI 2015 : The

14th edition of the Mexican International Conference on Artificial Intelligence Intelligence.

Cet article présente une version améliorée de PGPI nommée PGPI+. PGPI+ introduit de

nouvelles optimisations pour augmenter l’exactitude des prédictions. De plus, il offre le

traitement des attributs numériques de façon particulière et l’évaluation de la certitude des

prédictions.

Il est à noter que la deuxième publication a été légèrement altérée pour ajouter des détails qui

avaient été omis par manque d’espace dans la version publiée.

Finalement, le chapitre 4 de cette thèse présente la conclusion générale ainsi que des perspec-

tives de travaux futurs.

4

Page 15: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

CHAPITRE I

PGPI: INFERRING USER PROFILES IN SOCIAL NETWORKS USING A

PARTIAL SOCIAL GRAPH

PGPI: ALGORITHME D’INFÉRENCE DE PROFILS UTILISATEURS SUR LES

RÉSEAUX SOCIAUX EN UTILISANT UN GRAPHE SOCIAL PARTIEL

Page 16: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Abstract

Most algorithms for user profile inference in online social networks assume that the full social

graph is available for training. This assumption is convenient in a research setting. However, in

real-life, the full social graph is generally unavailable or may be very costly to obtain or update.

Thus, several of these algorithms may be inapplicable or provide poor accuracy. Moreover, current

approaches often do not exploit all the rich information that is available in social networks. In this

paper, we address these challenges by proposing an algorithm named PGPI (Partial Graph Profile

Inference) to accurately infer user profiles under the constraint of a partial social graph and without

training. It is to our knowledge, the first algorithm that let the user control the trade-off between the

amount of information accessed from the social graph and the accuracy of predictions. Moreover,

it is also designed to use rich information about users such as group memberships, views and likes.

An experimental evaluation with 11,247 Facebook user profiles shows that PGPI predicts user

profiles more accurately and by accessing a smaller part of the social graph than four state-of-the-art

algorithms. Moreover, an interesting result is that profile attributes such as status (student/professor)

and gender can be predicted with more than 90% accuracy using PGPI.

Keywords: social networks, inference, user profiles, partial graph

1 Introduction

In today’s society, online social networks have become extremely popular. Various types of

social networks are used such as friendship networks (e.g. Facebook and Twitter), professional

networks (eg. LinkedIn and ResearchGate) and networks dedicated to specific interests (e.g. Flickr

and IMDB). A concern that is often raised about social networks is how to protect users’ personal

information [10]. To address this concern, much work has been done to help users select how their

personal information is shared and with whom, and to propose automatic anonymization techniques

[10]. The result is that personal information is better protected but also that less information is now

available to companies to provide targeted advertisements and services to their users. To address

this issue, an important sub-field of social network mining is now interested in developing automatic

techniques to infer user profiles using the disclosed public information.

6

Page 17: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Various methods have been proposed to solve this problem such as relational Naïve Bayes

classifiers [14], label propagation [11, 13], majority voting [6] and approaches based on linear

regression [12], Latent-Dirichlet Allocation [3] and community detection [16]. It was shown that

these methods can accurately predict hidden attributes of user profiles in many cases. However,

these work suffers from two important limitations. First, the great majority of these approaches

assumes the unrealistic assumption that the full social graph is available for training (e.g. label

propagation). This is a convenient assumption used by researchers for testing algorithms using large

public datasets. However, in real-life, the full social graph is often unavailable (e.g. on Facebook,

LinkedIn and Google Plus) or may be very costly to obtain [3]. Moreover, even if the full social

graph is available, it may be unpractical to keep it up to date. The result is that several of these

methods are inapplicable or provide poor accuracy if the full social graph is not available (see

the experiment section of this paper). A few approaches does not assume a full social graph such

as majority-voting [6]. However, this latter is limited to only exploring immediate neighbors of a

node, and thus do not let the user control the trade-off between the number of nodes accessed and

prediction accuracy, which may lead to low accuracy. It is thus an important challenge to develop

algorithms that let the user choose the best trade-off between the number of nodes accessed and

prediction accuracy.

Second, another important limitation is that several algorithms do not consider the rich infor-

mation that is generally available on social networks. On one hand, several algorithms consider

links between users and user attributes but do not consider other information such as group mem-

berships, "likes" and "views" that are available on social networks such as Facebook and Twit-

ter [1, 6, 9, 11, 13, 14, 16]. On the other hand, several algorithms consider the similarity between

user profiles and information such as "likes" but not the links between users and other rich infor-

mation [3, 12, 20]. But exploiting more information may help to further increase accuracy.

In this paper, we address these challenges. Our contributions are as follows. First, we propose

a new lazy algorithm named PGPI (Partial Graph Profile Inference) that can accurately infer user

profiles under the constraint of a partial graph and without training. The algorithm is to our knowl-

edge, the first algorithm that let the user select how many nodes of the social graph can be accessed

to infer a user profile. This lets the user choose the best trade-off between accuracy versus number

7

Page 18: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

of nodes visited.

Second, the algorithm is also designed to be able to use not only information about friendship

links and profiles but also group memberships, likes and views, when the information is available.

Third, we report results from an extensive empirical evaluation against four state-of-the-art algo-

rithms when applied to 11,247 Facebook user profiles. Results show that the proposed algorithm

can provide a considerably higher accuracy while accessing a much smaller number of nodes from

the social graph. Moreover, an interesting result is that profile attributes such as status (student/pro-

fessor) and gender can be predicted with more than 90% accuracy using PGPI.

The rest of this paper is organized as follows. Section 2, 3, 4, 5 and 6 respectively presents the

related work, the problem definition, the proposed algorithm, the experimental evaluation and the

conclusion.

2 Related Work

Several work have been done to infer user profiles on online social networks. We briefly review

recent work on this topic. Davis Jr et al. [6] inferred location of Twitter users based on relationships

to other users with known locations. They used a dataset of about 50,000 users. Their approach

consists of performing a majority vote over the locations of directly connected users. An important

limitation is that this approach is designed to predict a single attribute based on a single attribute

from other user profiles. Having the same goal, Jurgens [11] designed a variation of the popular la-

bel propagation approach to predict user locations on Twitter/Foursquare social networks. A major

limitation is that it is an iterative algorithm that requires the full social graph since it propagates

known labels to unlabeled nodes through links between nodes. Li et al. [13] developed an iterative

algorithm to deduce LinkedIn user profiles based on relation types. The algorithm is similar to label

propagation and requires a large training set to discover relation types.

He et al. [9] proposed an approach consisting of building a Bayesian network based on the

full social graph to then predict user attribute values. The approach considers similarity between

user profiles and links between users to perform predictions, and was applied to data collected from

LiveJournal. Recently, Dong et al. [7] proposed a similar approach based on using graphical-models

8

Page 19: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

to predict the age and gender of users. Their study was performed with 1 billion phone and SMS

data and 7 millions user profiles. Chaudhari [4] also proposed a graphical model-based approach to

infer user profiles, designed to only perform predictions when the probability of making an accurate

prediction is high. The approach has shown high accuracy on a dataset of more than 1M Twitter

users and a datasets from the Pokac social network. A limitation of these approaches however,

it that they all assumes that the full social graph is available for training. Furthermore, they only

consider user attributes and links but do not consider additional information such as likes, views

and group membership.

A community detection based approach for user profile inference was proposed by Mislove

[16]. It was applied on more than 60K Facebook profiles with friendship links. It consists of ap-

plying a community detection algorithm and then to infer user profiles based on similarity to other

members of the same community. Lindamood et al. [14] also inferred user profiles from Facebook

data. They applied a modified Naïve Bayes classifier on 167K Facebook profiles with friendship

links, and concluded that if links or attributes are erased, accuracy of the approach can greatly

decrease. This raises the challenges of performing accurate predictions using few data. Recently,

Blenn et al. [1] utilized bird flocking, association rule mining and statistical analysis to infer user

profiles in a dataset of 3 millions Hyves.nl users. However, all these work assume that the full so-

cial graph is available for training and they only use profile information and social links to perform

predictions.

Chaabane et al. [3] proposed an approach based on Latent Dirichlet Allocation (LDA) to in-

fer Facebook user profiles. The approach extracts a probabilistic model from music interests and

additional information provided from Wikipedia. A drawback is that it requires a large training

dataset, which was very difficult and time-consuming to obtain according to the authors [3]. Kosin-

ski et al. [12] also utilized information about user preferences to infer Facebook user profiles. The

approach consists of applying Singular Value Decomposition to a large matrix of users/likes and

then applying regression to perform prediction. A limitation of this work is that it does not utilize

information about links between users and requires a very large training dataset.

Quercia et al. [20] developed an approach to predict the personality of Twitter users. The ap-

9

Page 20: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

proach consists of training a decision tree using a large training sets of users tagged with their

personality traits, and then use it to predict personality traits of other users. Although this approach

was successful, it uses a very limited set of information to perform predictions: number of follow-

ers, number of persons followed and list membership count.

3 Problem Definition

As outlined above, most approaches assume a large training set or full social graph. The most

common definition of the problem of inferring user profiles is the following [1, 4, 9, 11, 13, 14, 16].

Definition 1. Social graph. A social graph G is a quadruplet G = {N,L, V, A}. N is the set of

nodes in the social graph. L ⊆ N × N is a binary relation representing the link (edges) between

nodes. Let be m attributes to describe users of the social network such that V = {V1, V2, ...Vm}

contains for each attribute i, the set of possible attribute values Vi. Finally, A = {A1, A2, ...Am}

contains for each attribute i a relation assigning an attribute value to nodes, that is Ai ⊆ N × Vi.

Example. Let be a social graph with three nodes N = {John, Alice, Mary} and friend-

ship links L = {(John, Mary), (Mary, John), (Mary, Alice), (Alice, Mary)}. Consider

two attributes gender and status, respectively called attribute 1 and 2 to describe users. The set

of possible attribute values for gender and status are respectively V1 = {male, female} and V2 =

{professor, student}. The relations assigning attributes values to nodes areA1 = {(John,male),

(Alice, female), (Mary, female)} and A2 = {(John, student), (Alice, student),

(Mary, professor)}.

Definition 2. Problem of inferring user profiles in a social graph. The problem of inferring

the user profile of a node n ∈ N in a social graph G is to correctly guess the attribute values of n

using the other information provided in the social graph.

We also consider an extended problem definition where we consider additional information

available in social networks such as Facebook (views, likes and group memberships).

Definition 3. Extended social graph. An extended social graph E is a tuple

E = {N,L, V,A,G,NG, P, PG,LP, V P} where N,L, V,A are defined as previously. G is a set

10

Page 21: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

of groups that a user can be a member of. The relation NG ⊆ N ×G indicates the membership of

users to groups. P is a set of publications such as pictures, texts, videos that are posted in groups.

PG is a relation PG ⊆ P ×G, which associates a publication to the group(s) where it was posted.

LP is a relation LP ⊆ N × P , which indicates the publication(s) liked by each user (e.g. the

"likes" on Facebook). V P is a relation V P ⊆ N ×P , which indicates the publication(s) viewed by

each user (e.g. the "views" on Facebook). An observation is that LP ⊆ V P .

Example. Let be two groups G = {book_club, music_lovers} such that NG = {(John,

book_club), (Mary, book_club), (Alice, music_lovers)}. Let be two publications P = {picture1,

picture2} published in the groups PG = {(picture1, book_club), (picture2, music_lovers)}.

The publications viewed by users are V P = {(John, picture1), (Mary, picture1), (Alice,

picture2)}while the publications liked by users are LP = {(John, picture1), (Alice, picture2)}.

Definition 4. Problem of inferring user profiles in an extended social graph. The problem

of inferring the user profile of a node n ∈ N in an extended social graph E is to correctly guess the

attribute values of n using the information in the social graph.

But the above definitions assume that the full social graph may be used to perform predictions.

In this paper, we define the problem of inferring user profiles using a limited amount of information

as follows.

Definition 5. Problem of inferring user profiles using a partial (extended) social graph. Let

maxFacts ∈ N+ be a parameter set by the user. The problem of inferring the user profile of a

node n ∈ N using a partial (extended) social graph E is to accurately predict the attribute values

of n by accessing no more than maxFacts facts from the social graph. A fact is a node, group or

publication from N , G or P (excluding n).

4 The Proposed Algorithm

In this section, we present the proposed PGPI algorithm. We first describe a version PGPI-N

that infer user profiles using only nodes and links. Then, we present an alternative version PGPI-

G designed for predicting user profiles using only group and publication information (views and

11

Page 22: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

likes). Then, we explain how these two versions are combined in the full PGPI algorithm.

4.1 Inferring user profiles using nodes and links

Our proposed algorithm PGPI-N for inferring profiles using nodes and links is inspired by the

label propagation family of algorithms, which was shown to provide high accuracy [11,13]. These

algorithms suffer however from an important limitation. They are iterative algorithms that require

the full social graph for training to propagate attribute values [11].

PGPI-N adapts the idea of label propagation for the case where at most maxFacts facts from

the social graph can be accessed to make a prediction. This is possible because PGPI-N is a lazy

algorithm (it does not require training), unlike label propagation. To predict an attribute value

of a node n, PGPI-N only explores the neighborhood of n, which is restricted by a parameter

maxDistance.

The PGPI-N algorithm (Algorithm 1) takes as parameter a node ni, an attribute k to be pre-

dicted, the maxFacts and maxDistance parameters and a social graph G. It outputs a predicted

value v for attribute k of node ni. The algorithm first initializes a map M so that it contains a key-

value pair (v, 0) for each possible value v for attribute k. The algorithm then performs a breadth-first

search. It first initializes a queue Q to store nodes and a set seen to remember the already visited

nodes, and the node ni is pushed in the queue.

Then, while the queue is not empty and the number of accessed facts is less than maxFacts,

the algorithm pops the first node nj in the queue. Then, the formula Fi,j = Wi,j/dist(ni, nj) is

calculated. Wi,j and dist(ni, nj) are respectively used to weight the influence of node nj by its

similarity and distance to ni. Wi,j is defined as the number of attribute values common to ni and

nj divided by the number of attributes m. dist(x, y) is the number of edges in the shortest path

between ni and nj . Then, Fi,j is added to the entry in map M for the attribute value of nj for

attribute k. Then, if dist(x, y) ≤ maxDistance, each node nh linked to nj that was not already

visited is pushed in the queue and added to the set seen. Finally, when the while loop terminates,

the attribute value v associated to the largest value in M is returned as the predicted value for ni.

Note that in our implementation, if the maxFacts limit is reached, the algorithm do not perform a

12

Page 23: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

prediction to reduce the probability of making an inaccurate prediction.

Algorithm 1: The PGPI-N algorithminput : ni: a node, k: the attribute to be predicted, maxFacts and maxDistance: the

user-defined thresholds, G: a social graph

output: the predicted attribute value v

1 M = {(v, 0)|v ∈ Vk};

2 Initialize a queue Q;

3 Q.push(ni);

4 seen = {ni};

5 while Q is not empty and |accessedFacts| < maxFacts do

6 nj = Q.pop();

7 Fi,j ← Wi,j/dist(ni, nj);

8 Add Fi,j to the entry of value v in M such that (nj, v) ∈ Ak;

9 if dist(ni, nj) ≤ maxDistance then

10 foreach node nh 6= ni such that (nh, nj) ∈ L and nh 6∈ seen do

11 Q.push(nh);

12 seen← seen ∪ {nh};

13 end

14 end

15 end

16 return a value v such that (v, z) ∈M∧ 6 ∃(v′, z′) ∈M |z′ > z;

4.2 Inferring user profiles using groups and publications

PGPI-G is a lazy algorithm designed to predict user attribute values using only group and pub-

lication information (views and likes). The algorithm is inspired by majority voting algorithms

(e.g. [6]), which have been used for predicting user profiles based on links between users. PGPI-G

adapts this idea for groups and publications and also to handle the constraint that at mostmaxFacts

facts from the social graph can be accessed to make a prediction.

13

Page 24: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

The PGPI-G algorithm (Algorithm 2) takes as parameter a node ni, an attribute k to be pre-

dicted, the maxFacts and maxDistance parameters and an extended social graph E . It outputs

a predicted value v for attribute k of node ni. The algorithm first initializes a map M so that it

contains a key-value pair (v, 0) for each possible value v for attribute k. Then, the algorithm it-

erates over each member nj 6= ni of each group g where ni is a member, while the number of

accessed facts is less than maxFacts. For each such node nj , the formula Fgi,j is calculated to

estimate the similarity between ni and nj and the similarity between ni and g. In this formula,

commonLikes(ni, nj) and commonV iews(ni, nj) respectively denotes the number of publica-

tions liked by both ni and nj , while commonGroups(ni, nj) is the number of groups common to

ni and nj . Finally, commonPopularAttributes(ni, g) is the number of attribute values of ni that

are the same as the most popular attribute values for members of g. Then, Fgi,j is added to the entry

in map M for the attribute value of nj for attribute k. Finally, the attribute value v associated to the

largest value in M is returned as the predicted value for ni. Note that in our implementation, if the

maxFacts limit is reached, the algorithm do not perform a prediction to reduce the probability of

making an inaccurate prediction.

Algorithm 2: The PGPI-G algorithminput : ni: a node, k: the attribute to be predicted, maxFacts: a user-defined threshold, E :

an extended social graph

output: the predicted attribute value v

1 M = {(v, 0)|v ∈ Vk};

2 foreach group g|(ni, g) ∈ NG s.t. |accessedFacts| < maxFacts do

3 foreach person nj 6= ni ∈ g s.t. |accessedFacts| < maxFacts do

4 Fgi,j ← Wi,j × commonLikes(ni, nj)× commonV iews(ni, nj)×

(commonGroups(ni, nj)/|{(ni, x)|(ni, x) ∈ NG}|)×

5 commonPopularAttributes(ni, g);

6 Add Fgi,j to the entry of value v in M such that (nj, v) ∈ Ak;

7 end

8 end

9 return a value v such that (v, z) ∈M∧ 6 ∃(v′, z′) ∈M |z′ > z;

14

Page 25: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

4.3 Inferring user profiles using nodes, links, groups and publications

The PGPI-N and PGPI-G algorithms have similar design. Both of them are lazy algorithms

that update a map M containing key-value pairs and then return the attribute value v associated

to the highest value in M as the prediction. Because of this similarity, the algorithms PGPI-N

and PGPI-G can be easily combined. We name this combination PGPI. The pseudocode of PGPI

is shown in Algorithm 3, and is obtained by inserting lines 2 to 15 of PGPI-N before line 8 of

PGPI-G. Moreover, a new parameter named ratioFacts is added. It specifies how much facts of

the maxFacts facts can be respectively used by PGPI-N and by PGPI-G to make a prediction.

For example, ratioFacts = 0.3 means that PGPI-G may use up to 30% of the facts, and thus that

PGPI-N may use the other 70%. In our experiment, the best value for this parameter was 0.5.

5 Experimental Evaluation

We performed several experiments to assess the accuracy of the proposed PGPI-N, PGPI-G and

PGPI algorithms for predicting attribute values of nodes in a social network. Experiments were

performed on a computer with a fourth generation 64 bit Core i5 processor running Windows 8.1

and 8 GB of RAM.

We compared the performance of the proposed algorithms with four state-of-the-art algorithms.

The three first are Naïve Bayes classifiers [10]. Naïve Bayes (NB) infer user profiles strictly based

on correlation between attribute values. Relational Naïve Bayes (RNB) consider the probability

of having friends with specific attribute values. Collective Naïve Bayes (CNB) combines NB and

RNB. To be able to compare NB, RNB and CNB with the proposed algorithms, we have adapted

them to work with a partial graph. This is simply done by training them with maxFacts users

chosen randomly instead of the full social graph. The last algorithm is label propagation (LP) [11].

Because LP requires the full social graph and does not consider themaxFacts parameter, its results

are only used as a baseline. For algorithm specific parameters, the best values have been empirically

found to provide the best results.

Experiments were carried on a real-life dataset containing 11,247 user profiles collected from

15

Page 26: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

the Facebook social network in 2005 [23]. Each user is described according to seven attributes:

a student/faculty status flag, gender, major, second major/minor (if applicable), dorm/house, year,

and high school. These attributes respectively have 6, 2, 65, 66, 66, 15 and 1,420 possible values.

The dataset is a social graph rather than an extended social graph, i.e it does not contain information

about group memberships, views and likes. But this information is needed by PGPI-G and PGPI.

To address this issue, we have generated synthetic data about group memberships, views and likes.

Our data generator takes several parameters as input. To find realistic values for the parameters, we

have observed more than 100 real groups on the Facebook network. Parameters are the following.

We set the number of groups to 500 and each group to contain between 20 and 250 members, since

we observed that most groups are small. Groups are randomly generated. But similar users are more

likely to be member of a same group. Furthermore, we limit the number of groups per person to

25. We generated 10,000 publications that are used to represent content shared by users in groups

such as pictures, text and videos. A publication may appear in three to five groups. A user has a

50% probability of viewing a publication and a 30% probability of liking a publication that he has

viewed. Lastly, a user can like and view respectively no more than 50 and 100 publications, and a

user is more likely to like a publication that users with a similar profile also like.

To compare algorithms, the prediction of an attribute value is said to be a success if the predic-

tion is accurate, a failure if the prediction is inaccurate or otherwise a no match if the algorithm was

unable to perform a prediction. Three performance measures are used. The accuracy is the number

of successful predictions divided by the total number of prediction opportunities. The coverage is

the number of success or failures divided by the number of prediction opportunities. The product

of accuracy and coverage (PAC) is the product of the accuracy and coverage and is used to as-

sess the trade-off obtained for these two measures. In the following, the accuracy and coverage are

measured as the average for all seven attributes, except when indicated otherwise.

PAC w.r.t number of facts. We have run all algorithms while varying themaxFacts parameter

to assess the influence of the number of accessed facts on the product of prediction accuracy and

coverage. Fig. II-1 shows the overall results. In this figure, the PAC is measured with respect to

the number of accessed facts. It can be observed that PGPI provides the best results when 132 to

700 facts are accessed. For less than 132 facts, CNB provides the best results followed by PGPI.

16

Page 27: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

PGPI-N provides the second best results for 226 to 306 facts. Note that no results are provided for

PGPI-N for more than 306 facts. The reason is that PGPI-N relies solely on links between nodes

to perform predictions and the dataset does not contains enough links to let us further increase the

number of facts accessed by this algorithm.

0%

10%

20%

30%

40%

50%

60%

70%

0 100 200 300 400 500 600 700

Acc

ura

cy x

co

ver

age

Number of accessed facts

PGPI PGPI-G

PGPI-N NB

RNB CNB

Figure I-1 Product of accuracy and coverage w.r.t. number of accessed facts

The algorithm providing the worst results is LP (not shown in figures). LP provides a maximum

accuracy and coverage of respectively 43.2% and 100%. This is not good considering that LP uses

the full social graph of more than 10,000 nodes. For the family of Naïve Bayes algorithms, CNB

provides the best results until 90 facts are accessed. Then, NB is the best. RNB always provides the

worst results among Naïve Bayes algorithms.

Accuracy and coverage w.r.t number of facts. The left and right parts of Fig. I-2 separately

show the results for accuracy and coverage. In terms of accuracy, it can be observed that PGPI-N,

PGPI-G and PGPI always provides much higher accuracy than NB, RNB, CNB and LP (from 9.2%

to 36.8% higher accuracy). Thus, the reason why PGPI algorithms have a lower PAC than NB for

132 facts or less is because of coverage and not accuracy. In terms of coverage, the coverage is

only shown in Fig. I-2 for PGPI-N, PGPI-G and PGPI because the coverage of the other algorithms

remains always the same (100% for NB, RNB and CNB, and 94.4% for LP). PGPI has a much

higher coverage than PGPI-N and PGPI-G, even when it accesses much less facts. For example,

PGPI has a coverage of 87% when it accesses 132 facts, while PGPI-N and PGPI-G have a similar

17

Page 28: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

coverage when using respectively 267 and 654 facts. The coverage of PGPI exceeds 95% when

using about 400 facts.

35%

40%

45%

50%

55%

60%

65%

70%

75%

80%

0 100 200 300 400 500 600 700

Acc

ura

cy

Number of accessed facts

0%

20%

40%

60%

80%

100%

0 100 200 300 400 500 600 700C

ov

era

ge

Number of accessed facts

PGPI PGPI-G PGPI-N NB RNB CNB

Figure I-2 Accuracy and coverage w.r.t. number of accessed facts

Best results for each attributes. We also analyzed each attributes individually. The best results

in terms of accuracy and coverage obtained for each attribute and each algorithm are shown in Table

II-1. The last column of the table indicates the number of accessed facts to obtain these results.

In terms of accuracy, the best accuracy was achieved by PGPI algorithms for all attributes. For

status,gender, major, minor, residence, year and school, PGPI algorithms respectively obtained a

maximum accuracy of 90.9%, 94.6%, 75.1%, 60.5%, 70.5% and 12.9%. The reason for the low

accuracy for school is that there are 1,420 possible values and that users are rarely linked to other

users with the same value.

The best results in terms of PAC are written in bold. PGPI provides the best PAC for all but one

attribute. The second best algorithm is PGPI-N, which provides the second or third best PAC for

five of seven attributes. Moreover, it is interesting to see that PGPI-N achieves this result using less

than half the amount of facts used by PGPI. The third best algorithm is PGPI-G, which provides

the second or third best PAC for four attributes.

18

Page 29: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Influence of maxFacts and ratioFacts parameters. In the previous experiments, we have

analyzed the relationship between the number of accessed facts and the PAC, accuracy and cover-

age. However, we did not discuss how to choose the maxFacts parameter and how it influences

the number of accessed facts for each algorithm.

For Naïve Bayes algorithms and LP, the maxFacts parameter determines the number of nodes

to be used for training, and thus directly determines the number of accessed facts.

For PGPI algorithms, we show the influence of maxFacts on the number of accessed facts in

Fig. I-3. It can be observed that PGPI algorithms generally explore less nodes than maxFacts. For

PGPI-N, the reason is that some social network users have less than maxFacts friends whithin

maxDistance and thus, PGPI-N cannot use more than maxFacts facts to perform a prediction.

For PGPI-G, the reason is that some users are members of just a few groups. For PGPI, the number

of accessed facts is greater than for PGPI-N but less than for PGPI-G. The reason is that PGPI com-

bines both algorithms and uses the ratioFacts parameter, as previously explained, to decide how

much facts can be used by PGPI-N and PGPI-G. The value that we used in previous experiments

for this parameter is 0.5, which means to use 50% of the facts for PGPI-G, and thus to use the other

half for PGPI-N.

075

150225300375450525600675750825900

0 150 300 450 600 750 900 1050 1200 1350 1500

Nu

mb

er

of

acce

sse

d in

form

atio

ns

maxFacts

PGPI PGPI-G

PGPI-N maxFacts

Figure I-3 Number of accessed facts w.r.t. maxFacts

A question that can be raised is how to set the ratioFacts parameter for the PGPI algorithms.

19

Page 30: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

In Fig. I-4, we show the influence of the ratioFacts parameter on the product of accuracy and

coverage. We have measured the PAC w.r.t the number of accessed facts when ratiofacts is re-

spectively set to 0.3, 0.5 and 0.7. Note that setting ratioFacts to 0 and 1 would be respectively

equivalent to using PGPI-N and PGPI-G. In Fig. I-4, it can be observed that setting ratioFacts to

0.5 allows to obtain the highest PAC. Furthermore, in general a value of 0.5 also gives a high PAC

w.r.t the number of accessed facts. A value of 0.3 sometimes gives a higher PAC than 0.5. However,

the highest PAC value for 0.3 on overall is lower than that for 0.5. A value of 0.7 almost always

gives a lower PAC.

10%

20%

30%

40%

50%

60%

70%

0 100 200 300 400 500 600 700

Acc

ura

cy x

co

ver

age

Number of accessed informations

PGPI ratioFacts = 0.3

PGPI ratioFacts = 0.5

PGPI ratioFacts = 0.7

Figure I-4 Product of accuracy and coverage w.r.t. ratioFacts

Best results when using the full social graph. We also performed an experiment using the

full social graph (maxFacts = ∞). The best algorithm in this case is PGPI with an accuracy and

coverage of 64.7% and 100%, while the second best algorithm (PGPI-G) achieves 62% and 98.4%.

Influence of the maxDistance parameter. Recall that the PGPI-N and PGPI algorithms uti-

lize a parameter called maxDistance. This parameter indicates that two nodes are in the same

neighborhood if there are no more than maxDistance edges in the shortest path between them. In

the previous experiments, we have empirically set this parameter to obtain the best results. We now

discuss the influence of this parameter on the results of PGPI-N and PGPI.

We have investigated the influence of maxDistance on the product of accuracy and coverage,

20

Page 31: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

and also on the number of accessed facts. Results are shown in Fig. I-5 for PGPI-N. It can be

observed that as maxDistance is increased, the PAC also increases until it reaches its peak at

around maxDistance = 6. After that, it remains unchanged. The main reason why it remains

unchanged is that the influence of a user on another user is divided by their distance in the PGPI-

N algorithm. Thus, farther nodes have a very small influence on each other. Another interesting

observation is that as maxDistance increases, the number of accessed facts also increases. Thus,

if our goal is to use less facts, thenmaxDistance should be set to a small value, whereas if our goal

is to achieve the maximum PAC, maxDistance should be set to a large value. Lastly, we can also

observe that the number of accessed facts does not increase after maxDistance = 7. The reason is

that the Facebook dataset is quite sparse (i.e. users have few friendship links).

30%

35%

40%

45%

50%

55%

60%

1 2 3 4 5 6 7 8 9 10

Pro

du

ct o

f ac

cura

cy a

nd

co

vera

ge

maxDistance

0

50

100

150

200

250

300

350

1 2 3 4 5 6 7 8 9 10

Nu

mb

er

of

acce

sse

d f

acts

maxDistance

Figure I-5 Product of accuracy and coverage (left) and number of accessed facts (right) w.r.t.

maxDistance for PGPI-N

Discussion. Overall, PGPI algorithms provide the best accuracy. A key reason is that they are

lazy algorithms. They use local facts close to the node to be predicted rather than relying on a gen-

eral training phase performed beforehand. A second important reason is that more rich information

is used by PGPI-G and PGPI (groups, publications). A third reason is that PGPI-N calculates the

similarity between users while label propagation does not.

It is interesting to note that although PGPI-G relies on synthetic data in our experiment, PGPI-N

does not and performs very well.

21

Page 32: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

6 Conclusion

We have proposed a lazy algorithm named PGPI that can accurately infer user profiles under

the constraint of a partial social graph. It is to our knowledge, the first algorithm that let the user

control the amount of information accessed from the social graph and the accuracy of predictions.

Moreover, the algorithm is designed to use not only node links and profiles as basis to perform

predictions but also group memberships, likes and views, if the information is available. An exten-

sive experimental evaluation against four state-of-the-art algorithms shows that PGPI can provide

a considerably higher accuracy while accessing a much smaller number of nodes from the social

graph to perform predictions. Moreover, an interesting result is that profile attributes such as status

(student/professor) and gender can be predicted with more than 90% accuracy using PGPI.

For future work, we plan to further improve the PGPI algorithms by proposing new optimiza-

tions and to perform experiments on additional datasets.

22

Page 33: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Algorithm 3: The PGPI algorithminput : ni: a node, k: the attribute to be predicted, maxFacts: a user-defined threshold, E :

an extended social graph, ratioFacts: the ratio of facts to be used by PGPI-G

output: the predicted attribute value v

1 M = {(v, 0)|v ∈ Vk};

2 foreach group g|(ni, g) ∈ NG s.t. |accessedFacts| < maxFactsxratioFacts do

3 foreach person nj 6= ni ∈ g s.t. |accessedFacts| < maxFacts do

4 Fgi,j ← Wi,j × commonLikes(ni, nj)×

commThepseudocodeofPGPIisshowninFig.3.onV iews(ni, nj)×

(commonGroups(ni, nj)/|{(ni, x)|(ni, x) ∈ NG}|)×

5 commonPopularAttributes(ni, g);

6 Add Fgi,j to the entry of value v in M such that (nj, v) ∈ Ak;

7 end

8 end

9 Initialize a queue Q;

10 Q.push(ni);

11 seen = {ni};

12 while Q is not empty and |accessedFacts| < maxFacts do

13 nj = Q.pop();

14 Fi,j ← Wi,j/dist(ni, nj);

15 Add Fi,j to the entry of value v in M such that (nj, v) ∈ Ak;

16 if dist(ni, nj) ≤ maxDistance then

17 foreach node nh 6= ni such that (nh, nj) ∈ L and nh 6∈ seen do

18 Q.push(nh);

19 seen← seen ∪ {nh};

20 end

21 end

22 end

23 return a value v such that (v, z) ∈M∧ 6 ∃(v′, z′) ∈M |z′ > z;

23

Page 34: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

algorithm status gender major minor residence year school |facts|

PGPI acc. 90.8% 87.7% 31.5% 75.1% 59.59% 66.86% 10.3% 684

PGPI cov. 100% 98.9% 99% 99% 99% 99% 99% 684

PGPI-N acc. 89.4% 90.9% 24.5% 73.9% 60.5% 66.77% 12.9% 272

PGPI-N cov. 94.4% 94.0% 94.4% 94.4% 94.4% 94.4% 94.4% 272

PGPI-G acc. 90.9% 94.6% 35.2% 70.8% 53.0% 70.5% 8.8% 689

PGPI-G cov. 96.6% 72% 72% 72% 72% 72% 72% 689

NB acc. 88.0% 51.1% 16.6% 74.4% 55.6% 26.0% 10.6% 189

NB cov. 100% 100% 100% 100% 100% 100% 100% 189

RNB acc. 80.2% 57.7% 15.0% 74.2% 55.0% 26.2% 9.8% 580

RNB cov. 100% 100% 100% 100% 100% 100% 100% 580

CNB acc. 88.0% 47.3% 9.0% 74.0% 54.8.% 26.0% 10.6% 189

CNB cov. 100% 100% 100% 100% 100% 100% 100% 189

LP acc. 83.0% 50.4% 16.7% 56.3% 49.1% 35.3% 10.1% 10K

LP cov. 94.4% 94.4% 94.4% 94.4% 94.4% 94.4% 94.4% 10K

Table I-1 Best accuracy/coverage results for each attribute

24

Page 35: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

CHAPITRE II

PGPI+: ACCURATE SOCIAL NETWORK USER PROFILING UNDER THE

CONSTRAINT OF A PARTIAL SOCIAL GRAPH

PGPI+: ALGORITHME AMÉLIORÉ D’INFÉRENCE DE PROFILS

UTILISATEURS SUR LES RÉSEAUX SOCIAUX À PARTIR D’UN GRAPHE

SOCIAL PARTIEL

Page 36: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Abstract

Algorithms for social network user profiling suffer from one or more of the following limita-

tions: (1) assuming that the full social graph is available for training, (2) not exploiting the rich

information that is available in social networks such as group memberships and likes, (3) treating

numeric attributes as nominal attributes, and (4) not assessing the certainty of its predictions. In

this paper, we address these challenges by proposing an improved algorithm named PGPI+ (Partial

Graph Profile Inference+). PGPI+ accurately infers user profiles under the constraint of a partial

social graph using rich information about users (e.g. group memberships, views and likes), handles

nominal and numeric attributes, and assesses the certainty of predictions. An experimental evalua-

tion with more than 30,000 user profiles from the Facebook and Pokec social networks shows that

PGPI+ predicts user profiles with considerably more accuracy and by accessing a smaller part of

the social graph than five state-of-the-art algorithms.

Keywords: social networks, inference, user profiles, partial graph

1 Introduction

Online social networks have become extremely popular. Various types of social networks are

used such as friendship networks (e.g. Facebook), professional networks (e.g. ResearchGate) and

interest-based networks (e.g. Flickr). An important problem for ad targeting [25] and building rec-

ommender systems [5, 17, 19, 24, 27] that relies on social networks is that users often disclose few

information publicly [1, 12]. To address this issue, an important sub-field of social network mining

is now interested in developing algorithms to infer detailed user profiles using publicly disclosed in-

formation [2,21]. Various approaches have been used to solve this problem such as relational Naïve

Bayes classifiers [14], label propagation [11,13], majority voting [6], linear regression [12], Latent-

Dirichlet Allocation [3] and community detection [16]. It was shown that these approaches can

accurately predict hidden attributes of user profiles in many cases. However, all these approaches

suffer from at least two of the following four limitations.

1. Assuming a full social graph. Many approaches assume that the full social graph is avail-

26

Page 37: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

able for training (e.g. [11]). However, in real-life, it is generally unavailable or may be very

costly to obtain or update [3, 8]. A few approaches do not assume a full social graph such

as majority-voting [6]. However, they do not let the user control the trade-off between the

number of nodes accessed and prediction accuracy, which may lead to low accuracy.

2. Not using rich information. Several algorithms do not consider the rich information that is

available on social networks. For example, several algorithms consider links between users

and user attributes but do not consider other information such as group memberships, "likes"

and "views" that are available on some social networks [1, 6, 11, 13, 14, 16].

3. Not handling numeric attributes. Many approaches treat numeric attributes (e.g. age) as

nominal attributes [1, 6], which may decrease inference accuracy. Others are designed to

handle numeric attributes but requires the full social graph, which is often unpractical [7,

11, 12, 16].

4. Not assessing certainty. Few approaches assess the certainty of their predictions. But this

information is essential to determine if a prediction is reliable and actions should be taken

based on the prediction. For example, if there is a low certainty that a user profile attribute

is correctly inferred, it may be better to not use this attribute for ad targeting, rather than

showing an ad that is targeted to a different audience [4].

To address limitations 1 and 2, the PGPI algorithm (Partial Graph Profile Inference) was re-

cently proposed [8]. PGPI lets the user select how many nodes of the social graph can be accessed

to infer a user profile, and can use not only information about friendship links and profiles but also

about group memberships, likes and views, when available. In this paper, we present an extended

version of PGPI named PGPI+ to also address limitations 3 and 4. Contributions are threefold.

First, we design a new procedure for predicting values of numeric attributes. Second, we introduce

a mechanism to assess the certainty of predictions for both nominal and numeric attributes. Third,

we introduce four optimizations that considerably improve the overall prediction accuracy of PGPI.

We report results from an extensive empirical evaluation against PGPI and four other state-of-

the-art algorithms, for more than 30,000 user profiles from the Facebook and Pokec social networks.

Results show that the proposed PGPI+ algorithm can provide a considerably higher accuracy for

27

Page 38: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

both numeric and nominal attributes while accessing a much smaller number of nodes from the

social graph. Moreover, results show that the calculated certainty well assesses the reliability of

predictions. Moreover, an interesting result is that profile attributes such as status (student/profes-

sor) and gender can be predicted with more than 95% accuracy using PGPI+.

The rest of this paper is organized as follows. Section 2, 3, 4, 5 and 6 respectively presents the

related work, the problem definition, the proposed algorithm, the experimental evaluation and the

conclusion.

2 Related work

We review recent work on social network user profile inference. Davis Jr et al. [6] inferred

locations of Twitter users by performing a majority vote over the locations of directly connected

users. A major limitation of this approach is that a single attribute is considered. Jurgens [11]

predicted locations of Twitter/Foursquare users using a label propagation approach. However, it

is an iterative algorithm that requires the full social graph since it propagates known labels to

unlabeled nodes through links between nodes. Li et al. [13] proposed an iterative algorithm to

deduce LinkedIn user profiles based on relation types. This algorithm also requires a large training

set to discover relation types.

Mislove [16] applied community detection on more than 60k user profiles with friendship links,

and then inferred user profiles based on similarity of members from the same community. Lin-

damood et al. [14] applied a Naïve Bayes classifier on 167k Facebook profiles with friendship

links, and concluded that if links or attributes are erased, accuracy of the approach can greatly

decrease. This study highlights the challenges of performing accurate predictions using few data.

Recently, Blenn et al. [1] utilized bird flocking, association rule mining and statistical analysis to

infer user profiles in a dataset of 3 millions Hyves.nl users. However, all these work assume that a

large training set is available for training and they only use profile information and social links to

perform predictions.

Chaabane et al. [3] inferred Facebook user profiles using Latent Dirichlet Allocation (LDA) and

majority voting. The approach extracts a probabilistic model from music interests and additional

28

Page 39: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

information provided from Wikipedia. But it requires a large training set, which was difficult and

time-consuming to obtain [3]. Kosinski et al. [12] also utilized information about user preferences

to infer Facebook user profiles. Kosinski et al. applied Singular Value Decomposition to a huge

matrix of users/likes and then used regression to perform prediction. A limitation of this work is

that it does not utilize information about links between users and requires a very large training

dataset.

He et al. [9] proposed an approach consisting of building a Bayesian network based on the full

social graph to then predict user attribute values. The approach considers similarity between user

profiles and links between users to perform predictions, and was applied to data collected from

LiveJournal. Recently, Dong et al. [7] used graphical-models to predict the age and gender of users.

Their study was performed with 1 billion phone and SMS data and 7M user profiles. Chaudhari [4]

also used graphical models to infer user profiles. The approach has shown high accuracy on datasets

of more than 1M users from the Twitter and Pokec social networks. A limitation of these approaches

however, it that they assume a large training set for training. Furthermore, they only consider user

attributes and links but not additional information such as likes, views and group membership.

Some of the above approaches infer numeric attributes, either by treating them as nominal

attributes [1, 3, 6], or by using specific inference procedures. However, these latter require a large

training set [7, 11, 12, 16].

Besides, current approaches generally do not assess the certainty of predictions. But this in-

formation is essential to determine if a prediction is reliable, and actions should be taken based

on this prediction. To our knowledge, only Chaudhari [4] provides this information. However, this

approach is designed to use the full social graph.

3 Problem definition

The problem of user profiling is commonly defined as follows [1, 4, 11, 13, 14, 16].

Definition 1. Social graph. A social graph G is a quadruplet G = {N,L, V, A}. N is the set

of nodes in G. L ⊆ N × N is a binary relation representing the links (edges) between nodes. Let

29

Page 40: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

be m attributes to describe users of the social network such that V = {V1, V2, ...Vm} contains for

each attribute i, the set of possible attribute values Vi. Finally, A = {A1, A2, ...Am} contains for

each attribute i a relation assigning an attribute value to nodes, that is Ai ⊆ N × Vi.

Example. Let be a social graph with three nodes N = {Tom, Amy, Lea} and friendship links

L = {(Tom, Lea), (Lea, Tom), (Lea, Amy), (Amy, Lea)}. Consider two attributes gender and

status, respectively called attribute 1 and 2 to describe users. The set of possible attribute values for

gender and status are respectively V1 = {male, female} and V2 = {professor, student}. The re-

lations assigning attributes values to nodes areA1 = {(Tom,male), (Amy, female), (Lea, female)}

and A2 = {(Tom, student), (Amy, student), (Lea, professor)}.

Definition 2. Problem of inferring user profiles in a social graph. The problem of inferring

the user profile of a node n ∈ N in a social graph G is to guess the attribute values of n using the

other information provided in G.

The problem definition can be extended to consider additional information from social networks

such as Facebook (views, likes and group memberships).

Definition 3. Extended social graph. An extended social graph E is a tuple

E = {N,L, V,A,G,NG, P, PG,LP, V P} where N,L, V,A are defined as previously. G is a set

of groups that a user can be a member of. The relation NG ⊆ N × G indicates the membership

of users to groups. P is a set of publications such as pictures, texts, videos that are posted in

groups. PG is a relation PG ⊆ P × G, which associates a publication to the group(s) where it

was posted. LP is a relation LP ⊆ N ×P indicating publication(s) liked by each user (e.g. "likes"

on Facebook). V P is a relation V P ⊆ N × P indicating publication(s) viewed by each user (e.g.

"views" on Facebook), such that LP ⊆ V P .

Example. Let be two groups G = {book_club, music_lovers} such that NG = {(Tom,

book_club), (Lea, book_club), (Amy, music_lovers)}. Let be two publications P = {picture1,

picture2} published in the groups PG = {(picture1, book_club), (picture2, music_lovers)}.

The publications viewed by users are V P = {(Tom, picture1), (Lea, picture1), (Amy, picture2)}

while the publications liked by users are LP = {(Tom, picture1), (Amy, picture2)}.

30

Page 41: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Definition 4. Problem of inferring user profiles in an extended social graph. The problem

of inferring the user profile of a node n ∈ N in an extended social graph E is to guess the attribute

values of n using the information in E .

But the above definitions assume that the full social graph may be used to perform predictions.

The problem of inferring user profiles using a limited amount of information is defined as follows

[8].

Definition 5. Problem of inferring user profiles using a partial (extended) social graph. Let

maxFacts ∈ N+ be a parameter set by the user. The problem of inferring the user profile of a

node n ∈ N using a partial (extended) social graph E is to accurately predict the attribute values

of n by accessing no more than maxFacts facts from the social graph. A fact is a node, group or

publication from N , G or P (excluding n).

The above definition can be extended for numeric attributes. For those attributes, instead of

aiming at predicting an exact attribute value, the goal is to predict a value that is as close as possible

to the real value. Moreover, in this paper, we also extend the problem to consider the certainty of

predictions. In this setting, a prediction algorithm must assign a certainty value in the [0,1] interval

to each predicted value, such that a high certainty value indicates that a prediction is likely to be

correct.

4 The proposed PGPI+ algorithm

We next present the proposed PGPI+ algorithm. Subsection 4.1 briefly introduces PGPI. Then,

subsections 4.2, 4.3 and 4.4 respectively present optimizations to improve its prediction accuracy

and coverage, and how it is extended to handle numerical attributes and assess the certainty of

predictions.

4.1 The PGPI algorithm

The PGPI algorithm [8] is a lazy algorithm designed to perform predictions under the constraint

of a partial social graph, where at most maxFacts facts from the social graph can be accessed to

31

Page 42: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

make a prediction. PGPI (Algorithm 4) takes as parameter a node ni, an attribute k to be predicted,

the maxFacts parameter, a parameter named maxDistance, and an (extended) social graph E .

PGPI outputs a predicted value v for attribute k of node ni. To predict the value of an attribute k,

PGPI relies on a map M . This map stores pairs of the form (v, f), where v is a possible value v for

attribute k, and f is positive real number called the weight of v. PGPI automatically calculates the

weights by applying two procedures named PGPI-G and PGPI-N. These latter respectively update

weights by considering the (1) views, likes and group memberships of ni, and (2) its friendship

links. After applying these procedures, PGPI returns the value v associated to the highest weight in

M as the prediction. In PGPI, half of the maxFacts facts that can be used to make a prediction are

used by PGPI-G and the other half by PGPI-N. If globally the maxFacts limit is reached, PGPI

does not perform a prediction. PGPI-N or PGPI-G can be deactivated. If PGPI-N is deactivated,

only views, likes and group memberships are considered to make a prediction. If PGPI-G is deacti-

vated, only friendship links are considered. In the following, we respectively refer to these versions

of PGPI as PGPI-N and PGPI-G (and as PGPI-N+/PGPI-G+ for PGPI+).

PGPI-N works as follows. To predict an attribute value of a node ni, it explores the neighbor-

hood of ni restricted by the parameter maxDistance using a breadth-first search. It first initializes

a queue Q and pushes ni in the queue. Then, while Q is not empty and the number of accessed

facts is less than maxFacts, the first node nj in Q is popped. Then, Fi,j = Wi,j/dist(ni, nj) is

calculated. Wi,j = Ci,j/Ci, where Ci,j is the number of attribute values common to ni and nj , and

Ci is the number of known attribute values for node ni. dist(x, y) is the number of edges in the

shortest path between ni and nj . Then, Fi,j is added to the weight of the attribute value of nj for

attribute k, in map M . Then, if dist(x, y) ≤ maxDistance, each unvisited node nh linked to nj is

pushed in Q and marked as visited. PGPI-G is similar to PGPI-N. It is also a lazy algorithm. But

it uses a majority voting approach to update weights based on group and publication information

(views and likes). Due to space limitation, we do not describe it. The reader may refer to [8] for

more details.

32

Page 43: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Algorithm 4: The PGPI algorithminput : ni: a node, k: the attribute to be predicted, maxFacts: a user-defined threshold, E :

an extended social graph

output: the predicted attribute value v

1 M = {(v, 0)|v ∈ Vk};

2 // Apply PGPI-G

3 // ...

4 // Apply PGPI-N

5 Initialize a queue Q and add ni to Q;

6 while Q is not empty and |accessedFacts| < maxFacts do

7 nj = Q.pop();

8 Fi,j ← Wi,j/dist(ni, nj);

9 Update (v, f) as (v, f + Fi,j) in M , where (nj, v) ∈ Ak;

10 if dist(ni, nj) ≤ maxDistance then for each node nh 6= ni such that (nh, nj) ∈ L and

nh is unvisited, push nh in Q and mark nh as visited ;

11 end

12 return a value v such that (v, z) ∈M∧ 6 ∃(v′, z′) ∈M |z′ > z;

4.2 Optimizations to improve accuracy and coverage

In PGPI+, we redefine the formula Fi,j used by PGPI-N by adding three optimizations. The new

formula is Fi,j = Wi,j × (Ti,j + 1)/newdist(ni, nj)× R. The first optimization is to add the term

Ti,j + 1, where Ti,j is the number of common friends between ni and nj , divided by the number

of friends of ni. This term is added to consider that two persons having common friends (forming

a triad) are more likely to have similar attribute values. The constant 1 is used so that if ni and nj

have no common friends, Fi,j is not zero.

The second optimization is based on the observation that the term dist(ni, nj) makes Fi,j de-

crease too rapidly. Thus, nodes that are not immediate neighbors but were still close in the so-

cial graph had a negligible influence on their respective profile inference. To address this issue,

dist(ni, nj) is replaced by newdist(ni, nj) = 3 − (0.2 × dist(ni, nj)), where it is assumed that

33

Page 44: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

maxDistance < 15. It was empirically found that this formula provides higher accuracy.

The third optimization is based on the observation that PGPI-G had too much influence on

predictions compared to PGPI-N. To address this issue, we multiply the weights calculated using

the formula Fi,j by a new constantR. This thus increases the influence of PGPI-N+ on the choice of

predicted values. In our experiments, we have found that settingR to 10 provides the best accuracy.

Furthermore, a fourth optimization is integrated in the main procedure of PGPI+. It is based

on the observation that PGPI does not make a prediction for up to 50% of users when maxFacts

is set to a small value [8]. The reason is that PGPI does not make a prediction when it reaches

the maxFacts limit. However, it may have collected enough information to make an accurate

prediction. In PGPI+, a prediction is always performed. This optimization was shown to greatly

increase the number of predictions.

4.3 Extension to handle numerical attributes

The PGPI algorithm is designed to handle nominal attributes. In PGPI+, we performed the

following modifications to handle numeric attributes. First, we modified how the predicted value

is chosen. Recall that the value predicted by PGPI for nominal attributes is the one having the

highest weight inM (line 12). However, for numeric attribute, this approach provides poor accuracy

because few users have exactly the same attribute value. For example, for the attribute "weight",

few users may have the same weight, although they may have similar weights. To address this issue,

PGPI+ calculates the predicted values for numeric attributes as the weighed sum of all values in

M .

Second, we adapted the weighted sum so that it ignores outliers because if unusually large

values are in M , the weighted sum provides inaccurate predictions. For example, if a young user

has friendship links to a few 20 years old friends but also a link to his 90 years old grandmother,

the prediction may be inaccurate. Our solution to this problem is to ignore values in M that have

a weight more than one standard deviation away from the mean. In our experiment, it greatly

improves prediction accuracy for numeric attributes.

34

Page 45: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Third, we change how Wi,j is calculated. Recall that in PGPI, Wi,j = Ci,j/Ci, where Ci,j is the

number of attribute values common to ni and nj , and Ci is the number of known attribute values

for node ni. This definition does not work well for numeric attributes because numeric attributes

rarely have the same value. To consider that numeric values may not be equal but still be close, Ci,j

is redefined as follows in PGPI+. The value Ci,j is the number of values common to ni and nj for

nominal attributes, plus a value CNi,j,k for each numeric attribute k. The value CNi,j,k is calculated

as (vi − vj)/αk if (vi − vj) < αk, and is otherwise 0, where αk is a user-defined constant. Because

CNi,j,k is a value in [0,1], numeric attributes may not have more influence than nominal attributes

on Wi,j .

4.4 Extension to evaluate the certainty of predictions

PGPI+ also extends PGPI with the capability of calculating a certainty value CV (v) for each

predicted value v.

As previously mentioned, few inference algorithm assess the certainty of their predictions.

However, this feature is desirable to know how good a prediction is and whether or not actions

should be taken based on the prediction. For example, in the context of ad targeting, if there is a

low certainty that a user profile attribute is correctly inferred, it may be better to not use this at-

tribute for ad targeting, rather than showing an ad that is targeted to a different audience [4]. To

address this issue, we have added the capability of calculating a certainty value CV (v) for each

value v predicted by PGPI+. A certainty value is a value in the [0, 1] interval. A high certainty value

indicates that a prediction is likely to be correct. Certainty values are calculated differently for nu-

meric and nominal attributes because different inference procedures are used to predict attribute

values.

Certainty of numeric values. Recall that values of numeric attributes are predicted using a

weighted sum. Thus, calculating the certainty value of a predicted value v for a numeric attribute k

requires to find a way to assess how "accurate" the weighted sum for calculating v is. Intuitively,

we can expect the weighted sum to be accurate if (1) the amount of information taken into account

by the weighted sum is large, and (2) if values considered in the weighted sum are close to each

35

Page 46: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

other.

These ideas are captured in our approach by using the relative standard error. Let EM =

{v1, v2, ...vp} be the set of values stored in the map M that were used to calculate the weighted

sum. The amount of information used by the weighted sum is measured as the number of updates

made by PGPI-N+/PGPI-G+ to the map M , denoted as updates. The relative standard error is de-

fined as RSE(v) = stdev(EM)/(√updates× avg(EM)). The RSE is a value in the [0,1] interval

that assesses how close the average of a sample might be to the average of the population. Be-

cause we want a certainty value rather than an error value, we calculate the certainty value of v as:

CV (v) = 1−RSE(v) = 1− [stdev(EM)/(√updates× avg(EM))] .

The proposed approach provides an assessment of the certainty of a prediction. However, a

drawback of this RSE based approach is that it is sensible to outliers. To address this issue, we

ignore values that are more than one standard deviation away from the mean when calculating

CV (v).

Certainty of nominal values. The procedure for calculating the certainty of a predicted value

v for a nominal attribute is different than for numeric attributes since the inference procedure is not

the same. Our idea is to evaluate how likely the weight z of value v in M is to be as large at it is,

compared to other weights in M .

To estimate this, we use a simulation-based approach where "larger" is defined in terms of a

number of standard deviations away from the mean. The procedure for calculating the certainty of

a value for a numeric attribute is shown in Algorithm 5. It takes as input a value v for an attribute

k and the map of values/weights M . Let ZM = {z1, z2, ...zp} denotes the weights of values in

the map M , and t(M, v) be the weight of v in map M . We initialize a variable count to 0 and

perform 1,000 simulations. During the j-th simulation, the algorithm creates a new map B so that

it contains a key-value pair (v, 0) for each possible value v for attribute k in map M . These weights

are initialized to zero. Then, the algorithm performs updates random updates to B. A random

update is performed as follows. First, a pair (v′, z′) is randomly chosen from B. Then, a value x is

randomly selected from the [0, 1] interval. Finally, (v′, z′) is replaced by (v′, z′ + x) in B. At the

end of the j-th simulation, we increase the count variable by 1 if (f(B, v)−avg(ZB)) /stdev(ZB)

36

Page 47: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

≥ (f(M, v)− avg(ZM)) /stdev(ZM). After performing the 1,000 simulations, the certainty value

is calculated as CV (v) = count/1000, which gives a value in the [0,1] interval. This value CV (v)

can be interpreted as follows. It represents what is the chance that the weight of v in map M is

as far as it is in terms of standard deviation from the average weight of values in map M for the

attribute to be predicted k.

Thus, by the above mechanisms a certainty values can be provided for predicted values of both

numeric and nominal attributes.

Algorithm 5: Certainty assessment of a nominal attribute valueinput : M : the map of values/weights, v: an attribute value, k: the attribute, updates: the

number of updates to M

output: the certainty value CV (v)

1 count = 0;

2 for 1 to 1000 do

3 B = {(v, 0)|(v, z) ∈M};

4 for 1 to updates do

5 Randomly choose a pair (v′, z′) from B;

6 x = random([0, 1]);

7 Replace (v′, z′) by (v′, z′ + x) in B;

8 end

9 if (f(B, v)− avg(ZB)) /stdev(ZB) ≥ (f(M, v)− avg(ZM)) /stdev(ZM) then

10 count = count+ 1;

11 end

12 end

13 CV (v) = count/1000;

14 return CV(v);

37

Page 48: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

5 Experimental evaluation

We compared the accuracy of the proposed PGPI+, PGPI-N+ and PGPI-G+ algorithms with the

original PGPI, PGPI-N and PGPI-G algorithms and four additional state-of-the-art algorithms for

predicting attribute values of nodes in a social network. The three first are Naïve Bayes classifiers

[10]. Naïve Bayes (NB) infer user profiles strictly based on correlation between attribute values.

Relational Naïve Bayes (RNB) consider the probability of having friends with specific attribute

values. Collective Naïve Bayes (CNB) combines NB and RNB. To be able to compare NB, RNB

and CNB with the proposed algorithms, we have adapted them to work with a partial graph. This is

done by training them with maxFacts users chosen randomly instead of the full social graph. The

last algorithm is label propagation (LP) [11]. Because LP requires the full social graph, its results

are only used as a baseline. Each algorithm was tuned with optimal parameter values.

Datasets. Two datasets are used. The first one is 11,247 user profiles collected from Facebook

in 2005 [23]. Each user is described according to seven attributes: a student/faculty status flag,

gender, major, second major/minor (if applicable), dorm/house, year, and high school, where year

is a numerical attribute. The second dataset is 20,000 user profiles from the Pokec social network

obtained at https://snap.stanford.edu/data/. It contains 17 attributes, including three

numeric attributes: age, weight and height. Because both datasets do not contain information about

groups, and this information is needed by PGPI-G and PGPI, synthetic data about groups was

generated using the generator proposed in [8], using the same parameters. This latter generator is

designed to generate group having characteristics similar to real-life groups.

Accuracy for nominal attributes w.r.t number of facts. We first ran all algorithms while vary-

ing the maxFacts parameter to assess the influence of the number of accessed facts on accuracy

for nominal attributes. The accuracy for nominal attributes is defined as the number of correctly

predicted values, divided by the number of prediction opportunities. Fig. II-1 shows the overall re-

sults for the Facebook and Pokec datasets. Note that PGPI algorithms are not shown in these tables

due to lack of space. It can be observed that PGPI+/PGPI-N+/PGPI-G+ provides the best results.

For example, on Facebook, PGPI+ and PGPI-G+ provide the best results when 66 to 700 facts are

accessed, and for less than 66 facts, PGPI-N+ provides the best results followed by PGPI+. No

38

Page 49: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

results are provided for PGPI-N+ for more than 306/6 facts on Facebook/Pokec because PGPI-N+

relies solely on links between nodes to perform predictions and the datasets do not contains enough

links. It is also interesting to note that PGPI-N+ only uses real data (contrarily to PGPI+/PGPI-G+)

and still performs better than all other algorithms. The algorithm providing the worst results is LP

(not shown in the figure). LP provides an accuracy of 43.2%/47.31% on Facebook/Pokec. This is

not good considering that LP uses the full social graph of more than 10,000 nodes. For the family of

Naïve Bayes algorithms, NB has the best overall accuracy. It can be further observed that the accu-

racy of PGPI+ algorithm is up to 34% higher than PGPI, which shows that proposed optimizations

have a major influence on accuracy.

50%

55%

60%

65%

70%

75%

0 100 200 300

Acc

ura

cy

Number of accessed facts

30%

40%

50%

60%

70%

80%

90%

100%

0 100 200 300 400 500 600 700

Acc

ura

cy

Number of accessed facts

PGPI+ PGPI-G+ PGPI-N+ PGPI PGPI-G PGPI-N NB RNB CNB

Facebook

Pokec

Figure II-1 Accuracy w.r.t. number of accessed facts for nominal attributes

Best results for each nominal attribute. To better understand the previous results, a second

experiment was performed were we analyzed the accuracy obtained for each nominal attribute

separately. The goal of this analysis is to see how well each attribute can be predicted and also if

some approaches perform better than others for some specific attributes.

The best results in terms of accuracy for each attribute and algorithm for the Facebook and

39

Page 50: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Pokec are respectively shown in Table II-1 and II-2. The last row of each table indicates the number

of accessed facts to obtain these results. Note that in these tables, the result for PGPI algorithms are

not shown because they are always worse than those of the optimized PGPI+ algorithms.

On overall, it can be observed that the best accuracy is in general achieved by PGPI+ algorithms

for all attributes. In several cases, PGPI+ algorithms provide an accuracy that is more than 10%

higher than the compared state-of-the art algorithm, and the gap can be greater than 30% in some

cases.

A second important observation is that PGPI+ algorithms allow to predict some attributes with

a very high accuracy. For example, the gender of a Facebook user can be predicted with up to 95%

accuracy by PGPI+. Another example is that PGPI-N+ can guess with up to 94% accuracy if a

Pokec user speaks a language such as spanish or italian. The high accuracy that is obtained may be

very useful for real applications of user profiling such as ad targeting.

A third observation is that some attributes are more difficult to predict. For example, most algo-

rithms have a relatively low accuracy for the school attribute on the Facebook dataset. A reason is

that unlike binary attributes such as gender, the school attribute has a very high number of possible

values (1,420 values, each corresponding to a possible high-school attended by a user). Another

reason is that users are rarely linked to other users with the same value for the school attribute.

Since in 2005, users on Facebook were mostly university students, it seems reasonable that users

may often not come from the same high-school as their Facebook friends attending the same uni-

versity. A similar phenomena occurs with the region attribute for the Pokec dataset, which indicates

the region where a given user lives.

Best results for each numeric attribute. Previous subsections presented accuracy results for

nominal attributes. In this subsection, an experiment is performed to compare the accuracy of PGPI

algorithms and PGPI+ algorithms for numeric attributes. The goal is to evaluate the improvement

provided in PGPI+ for handling numeric attributes. Furthermore, we also compared accuracy ob-

tained with the NB, RNB, CNB and LP algorithms. Because these later algorithms and PGPI treat

numeric attributes as nominal attributes, they are used as baseline. Note that we attempted to also

include the algorithm of Kosinski [12] in our comparison because it is designed for inferring nu-

40

Page 51: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

attribute PGPI+ PGPI-N+ PGPI-G+ NB RNB CNB LP

status 92.0% 92.6 % 93.0% 88.0% 80.2% 88.0% 83.0%

gender 96.1% 84.7 % 95.8% 51.1% 57.7% 47.3% 50.4%

major 33.8% 30.4% 32.8% 16.6% 15.0% 9.0% 16.7%

minor 76.0% 76.4% 76.6% 74.4% 74.2% 74.0% 56.3%

residence 64.6% 62.4% 64.4% 55.6 % 55.0% 54.8% 49.1%

school 7.4% 16.8% 7.0% 10.6% 9.8% 10.6% 10.1%

|facts| 482 226 431 189 580 189 10K

Table II-1 Best accuracy results for nominal attributes on Facebook

meric attributes. However, it relies on linear regression, and the training has failed when using a

partial social graph.

In this experiment, the accuracy is measured differently than in the experiments on nominal

attributes since the goal for numeric attributes is to predict attribute values that are as close as

possible to the real values rather than predicting the exact value. Thus, the accuracy for numeric

attributes is measured in terms of average error and standard deviation of predicted values from the

real values.

Table II-3 show the best results obtained by PGPI and PGPI+ algorithms for each numeric

attribute for the Facebook and Pokec datasets, and the number of facts accessed to obtain these

results. It can be first observed that PGPI+ performs the best on overall. For example, the PGPI-N+

algorithm performs better than the PGPI-N algorithm both in terms of average error and standard

deviation for almost all attributes. For some attributes, the improvement provided by PGPI+ over

PGPI is greater than for other attributes. For example, for the height attribute, the average error

obtained by PGPI-N+ is 10.32 instead of 14 for PGPI-N and the standard deviation is 10.32 for

PGPI-N+ instead of 36.52 for PGPI-N, a considerable improvement. The reason why PGPI+ per-

forms better than PGPI in general is that it has specific inference procedures for numeric attributes,

while PGPI treat them as nominal attributes.

41

Page 52: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

attribute PGPI+ PGPI-N+ PGPI-G+ NB RNB CNB LP

Gender 95.60% 61.40% 95.77% 52.80% 53.80% 53.60% 49.20%

English 76.35% 63.79% 76.00% 69.74% 69.74% 69.74% 65.40%

French 87.46% 84.48% 87.42% 86.91% 85.60% 86.87% 67.15%

German 62.39% 54.31% 62.85% 47.83% 48.12% 47.83% 50.00%

Italian 94.87% 94.25% 94.85% 94.65% 95.38% 95.41% 85.75%

Spanish 95.15% 94.54% 95.14% 94.38% 95.08% 94.29% 80.52%

Smoker 65.21% 62.34% 65.42% 63.43% 63.43% 63.12% 60.19%

Drink 71.65% 63.36% 71.47% 70.41% 70.41% 70.41% 49.16%

Marital status 76.57% 70.86% 76.40% 76.11% 76.02% 76.07% 69.92%

Hip-hop 86.51% 82.20% 86.47% 86.01% 85.83% 85.93% 61.82%

Rap 69.33% 63.78% 69.52% 69.08% 69.35% 69.35% 45.78%

Rock 77.93% 73.80% 78.09% 76.33% 74.93% 74.93% 53.69%

Disco 58.40% 52.50% 58.56% 50.07% 53.18% 53.46% 47.28%

Metal 86.19% 83.52% 86.15% 84.75% 84.61% 84.61% 63.79%

Region 18.60% 10.20% 18.71% 6.20% 6.20% 6.20% 10.00%

|facts| 334 6 347 375 378 278 10k

Table II-2 Best accuracy results for nominal attributes on Pokec

Assessment of certainty values. Another experiment is also performed to assessed certainty

values calculated by PGPI+, PGPI-N+ and PGPI-G+. Fig. II-2 and Fig. II-3 respectively show

the best accuracy obtained for numeric attributes year and age for each of those algorithms. Each

line represents the accuracy of predictions having at least a given certainty value. It can be seen,

that a high certainty value generally means a low average error, standard deviation and coverage

(percentage of predictions made), as expected.

For nominal attributes, the accuracy of predictions made by PGPI-N+, PGPI, PGPI-G+ having

a certainty no less than 0 and no less than 0.7 are respectively 59%, 59% and 61% on Pokec, and

91%, 91% and 93% on Facebook, which also shows that the proposed certainty assessment is a

42

Page 53: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Table II-3 Average error and standard deviation for numeric attributes

Facebook Pokec

algorithm year |facts| age weight height |facts|

PGPI+ 0.95 (0.85) 241 2.94 (4.55) 9.83 (10.32) 7.70 (11.75) 189

PGPI-N+ 0.68 (0.72) 44 3.92 (4.56) 14.60 (12.56) 10.32 (12.55) 6

PGPI-G+ 0.99 (0.89) 252 2.89 (4.45) 9.83 (10.37) 7.71 (11.76) 290

PGPI 0.46 (0.93) 148 2.55 (4.80) 11.67 (11.53) 8.75 (12.43) 251

PGPI-N 0.46 (0.87) 103 4.35 (5.11) 17.28 (15.61) 14.0 (36.52) 6

PGPI-G 0.39 (0.93) 223 2.20 (4.78) 10.75 (10.86) 8.35 (12.45) 344

NB 1.30 (0.81) 1,743 5.78 (4.34) 14.13 (12.40) 13.71 (11.17) 8

RNB 1.30 (0.81) 1,746 3.47 (4.71) 13.74 (13.91) 15.65 (13.92) 83

CNB 1.30 (0.81) 1,746 3.33 (4.40) 13.38 (14.34) 15.30 (12.47) 84

LP 1.12 (1.30) 11,247 5.07 (5.38) 18.42 (59.35) 19.29 (15.80) 20,000

good indicator of accuracy.

Best results using the full social graph. We also compared the accuracy of the algorithms

using the full social graph. The best accuracy obtained for each algorithm on the Facebook and

Pokec datasets is shown in Table II-4. It can be observed that the proposed PGPI+ algorithms

provide an accuracy that is considerably higher than the accuracy of the compared algorithms, even

when using the full social graph.

6 Conclusion

We proposed an improved algorithm named PGPI+ for user profiling in online social networks

under the constraint of a partial social graph and using rich information. PGPI+ extends the PGPI

algorithm with new optimizations to improve its prediction accuracy and coverage, to handle nu-

merical attributes, and assess the certainty of predictions. An experimental evaluation with more

than 30,000 user profiles from the Facebook and Pokec social networks shows that PGPI+ predicts

43

Page 54: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

PGPI+

b)

0.3 0.5 0.7

2,5

2,7

2,9

3,1

3,3

3,5

3,7

3,9

0 100 200 300 400

ave

rage

err

or

Number of accessed facts

4

4,2

4,4

4,6

4,8

5

5,2

5,4

0 100 200 300 400

stan

dar

d d

evia

tio

n

Number of accessed facts

80828486889092949698

100

0 100 200 300 400

cove

rage

Number of accessed facts

0,6

0,7

0,8

0,9

1

1,1

1,2

1,3

0 200 400 600 800

ave

rage

err

or

Number of accessed facts

0,61,11,62,12,63,13,64,14,65,15,6

0 200 400 600 800

stan

dar

d d

evia

tio

n

Number of accessed facts

20

30

40

50

60

70

80

90

100

0 200 400 600 800

cove

rage

Number of accessed facts

PGPI-N+

0.70.750.8

0.850.9

0.951

1.05

0 100 200

aver

age

err

or

Number of accessed facts

0.61.11.62.12.63.13.64.14.6

0 100 200

stan

dar

d d

evi

atio

n

Number of accessed facts

2030405060708090

100

0 100 200

cove

rage

Number of accessed facts

3.83.853.9

3.954

4.054.1

4.154.2

3 4 5 6

aver

age

erro

r

Number of accessed facts

4.6

4.65

4.7

4.75

4.8

4.85

4.9

3 4 5 6

stan

dar

d d

evia

tio

n

Number of accessed facts

88

90

92

94

96

98

100

3 4 5 6co

vera

ge

Number of accessed facts

b)

0.3 0.5 0.7

PGPI-G+

b

0.3 0.5 0.7

0.5 0.7 0.9

1

1,1

1,2

1,3

1,4

1,5

1,6

1,7

0 200 400 600 800

aver

age

erro

r

Number of accessed facts

00,51

1,52

2,53

3,54

4,55

0 200 400 600 800

stan

dar

d d

evi

atio

n

Number of accessed facts

0102030405060708090100

0 200 400 600 800co

vera

ge

Number of accessed facts

2,52,72,93,13,33,53,73,94,14,34,5

0 100 200 300 400

aver

age

err

or

Number of accessed facts

3,54

4,55

5,56

6,57

7,58

8,5

0 100 200 300 400

stan

dar

d d

evia

tio

n

Number of accessed facts

50556065707580859095

100

0 100 200 300 400

cove

rage

Number of accessed facts

a)

0.3 0.5 0.7

2,5

2,7

2,9

3,1

3,3

3,5

3,7

3,9

0 100 200 300 400

ave

rage

err

or

Number of accessed facts

4

4,2

4,4

4,6

4,8

5

5,2

5,4

0 100 200 300 400

stan

dar

d d

evia

tio

n

Number of accessed facts

80828486889092949698

100

0 100 200 300 400

cove

rage

Number of accessed facts

0,6

0,7

0,8

0,9

1

1,1

1,2

1,3

0 200 400 600 800

aver

age

erro

r

Number of accessed facts

0,61,11,62,12,63,13,64,14,65,15,6

0 200 400 600 800

stan

dar

d d

evia

tio

n

Number of accessed facts

20

30

40

50

60

70

80

90

100

0 200 400 600 800

cove

rage

Number of accessed facts 0.5 0.7 0.9

Figure II-2 Average error, standard deviation and coverage w.r.t. certainty for the "year" attribute

algorithm Facebook Pokec algorithm Facebook Pokec

PGPI+ 96.6 73.8 PGPI-G 62.8 56.2

PGPI-N+ 96.4 73.9 NB 48.67 57.48

PGPI-G+ 84.9 65.9 RNB 50.11 56.37

PGPI 63.8 62.0 CNB 50.11 56.40

PGPI-N 63.8 62.1 LP 48.03 47.31

Table II-4 Best accuracy for nominal attributes using the full social graph

user profiles with considerably more accuracy and by accessing a smaller part of the social graph

than five state-of-the-art algorithms. Moreover, an interesting result is that profile attributes such as

status (student/professor) and gender can be predicted with more than 95% accuracy using PGPI+.

44

Page 55: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

PGPI+

a)

0.3 0.5 0.7

2,5

2,7

2,9

3,1

3,3

3,5

3,7

3,9

0 100 200 300 400

ave

rage

err

or

Number of accessed facts

4

4,2

4,4

4,6

4,8

5

5,2

5,4

0 100 200 300 400

stan

dar

d d

evi

atio

n

Number of accessed facts

80828486889092949698

100

0 100 200 300 400

cove

rage

Number of accessed facts

0,6

0,7

0,8

0,9

1

1,1

1,2

1,3

0 200 400 600 800

ave

rage

err

or

Number of accessed facts

0,61,11,62,12,63,13,64,14,65,15,6

0 200 400 600 800

stan

dar

d d

evi

atio

n

Number of accessed facts

20

30

40

50

60

70

80

90

100

0 200 400 600 800

cove

rage

Number of accessed facts 0.5 0.7 0.9

PGPI-N+

0.70.750.8

0.850.9

0.951

1.05

0 100 200

ave

rage

err

or

Number of accessed facts

0.61.11.62.12.63.13.64.14.6

0 100 200

stan

dar

d d

evi

atio

n

Number of accessed facts

2030405060708090

100

0 100 200

cove

rage

Number of accessed facts

3.83.853.9

3.954

4.054.1

4.154.2

3 4 5 6

aver

age

erro

r

Number of accessed facts

4.6

4.65

4.7

4.75

4.8

4.85

4.9

3 4 5 6

stan

dar

d d

evia

tio

n

Number of accessed facts

88

90

92

94

96

98

100

3 4 5 6

cove

rage

Number of accessed facts

a)

0.3 0.5 0.7

0.5 0.7 0.9

PGPI-G+

a)

0.3 0.5 0.7

0.5 0.7 0.9

1

1,1

1,2

1,3

1,4

1,5

1,6

1,7

0 200 400 600 800

aver

age

err

or

Number of accessed facts

00,51

1,52

2,53

3,54

4,55

0 200 400 600 800

stan

dar

d d

evia

tio

n

Number of accessed facts

0102030405060708090100

0 200 400 600 800

cove

rage

Number of accessed facts

2,52,72,93,13,33,53,73,94,14,34,5

0 100 200 300 400

aver

age

err

or

Number of accessed facts

3,54

4,55

5,56

6,57

7,58

8,5

0 100 200 300 400

stan

dar

d d

evi

atio

n

Number of accessed facts

50556065707580859095

100

0 100 200 300 400co

vera

ge

Number of accessed facts

a)

0.3 0.5 0.7

2,5

2,7

2,9

3,1

3,3

3,5

3,7

3,9

0 100 200 300 400

ave

rage

err

or

Number of accessed facts

4

4,2

4,4

4,6

4,8

5

5,2

5,4

0 100 200 300 400

stan

dar

d d

evia

tio

n

Number of accessed facts

80828486889092949698

100

0 100 200 300 400co

vera

ge

Number of accessed facts

0,6

0,7

0,8

0,9

1

1,1

1,2

1,3

0 200 400 600 800

aver

age

erro

r

Number of accessed facts

0,61,11,62,12,63,13,64,14,65,15,6

0 200 400 600 800

stan

dar

d d

evia

tio

n

Number of accessed facts

20

30

40

50

60

70

80

90

100

0 200 400 600 800

cove

rage

Number of accessed facts 0.5 0.7 0.9

Figure II-3 Average error, standard deviation and coverage w.r.t. certainty for the "age" attribute

There are several opportunities to improve the proposed algorithms. For future work, we plan to

propose new optimizations for PGPI+ and to extend the proposed algorithms with text analysis ca-

pability and to consider other types of information for making predictions. We also wish to perform

experiments with other social networks and compare accuracy with more algorithms. Lastly, we

also intend to develop algorithms for community detection [18,22,26] that are optimized for using

a small amount of facts from social networks.

45

Page 56: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

CONCLUSION GÉNÉRALE

Dans ce travail, nous avons proposé deux algorithmes novateurs pour inférer les profils utili-

sateurs dans les réseaux sociaux en utilisant un graphe social partiel. Ces algorithmes sont à notre

connaissance, les premiers à laisser l’utilisateur décider de la quantité d’information à accéder du

graphe social pour faire une prédiction.

Le premier algorithme PGPI, permet l’inférence de profils utilisateurs en l’absence d’un graphe

social complet et ne requiert aucun entraînement pour être appliqué. Il est composé de deux procé-

dures nommées PGPI-N (Partial Graph Profile Inference- Node) et PGPI-G (Partial Graph Profile

Inference- Groupe). Elles permettent respectivement l’inférence de profils utilisateurs sur la base

(1) des liens d’amitiés, et (2) des adhésions aux groupes, "J’aime" et "Vues" (lorsque l’information

est disponible). Une évaluation expérimentale a été présentée comparant la performance de PGPI

à quatre algorithmes d’inférence de profils utilisateurs de la littérature (NB, CNB, RNB et LP).

Un jeu de donées réel provenant du réseau social Facebook a été utilisé. Dans cette expérience,

trois mesures de performance ont été employées : l’exactitude, le nombre de noeuds accedés dans

le graphe social et la couverture. Les résultats ont montrés que PGPI obtient une exactitude allant

jusqu’à 34% supérieure à celle des méthodes proposés, tout en utilisant moins d’information pour

prédire. Il est aussi intéressant de noter que PGPI peut prédire certains attributs avec une exactitude

supérieure à 90%.

Bien que PGPI procure une très haute exactitude et accède à moins de noeud dans le graphe

social, PGPI traite les attributs numériques comme des attributs nominaux et ne tiens pas compte

de la certitude des prédictions.

Afin de remédier à ces limites, nous avons introduit une version améliorée de PGPI nommée

PGPI+. Elle introduit de nouvelles optimisations pour augmenter l’exactitude des prédictions et la

couverture de PGPI. De plus, PGPI+ traite les attributs numériques de façon adaptée et offre un

mécanisme pour estimer la certitude des prédictions. Grâce à ces nouveautés, PGPI+ offre un gain

considérable en termes d’exactitude et couverture des prédictions.

46

Page 57: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Nous avons réalisé une expérience pour comparer la performance de PGPI+ avec PGPI et celles

des quatre mêmes algorithmes de la littérature, en utilisant des profils utilisateurs des réseaux

sociaux Pokec et Facebook. Les résultats montrent que PGPI+ offre une plus haute exactitude

pour tous les jeux de données utilisés, tout en accèdant à moins de noeuds dans le graphe social.

Nous avons aussi comparé la meilleure exactitude de PGPI/PGPI+ pour les attributs numériques en

termes d’erreur moyenne et d’écart-type des valeurs prédites par rapport aux valeurs réelles. Les

résultats indiquent que PGPI+ donne en général une inférence plus exacte. Finalement, des résul-

tats expérimentaux montrent également que la certitude calculée pour les prédictions est un bon

indicateur de l’exactitude des prédictions.

Cette recherche ouvre la voie à plusieurs possibilités de travaux futurs. Nous envisageons entre

autres de comparer PGPI+ avec d’autres algorithmes d’inférence de profils utilisateurs spécialisés

pour le traitement d’attributs numériques ou l’estimation de la certitude des prédictions. De plus,

nous prévoyons améliorer l’algorithme PGPI+ en proposant de nouvelles optimisations pour le

traitement des attributs numériques. Finalement, des expériences avec les données d’autres réseaux

sociaux sont envisagées.

47

Page 58: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

RÉFÉRENCES

[1] Blenn, N., Doerr, C., Shadravan, N., Van Mieghem, P.: How much do your friends know

about you?: reconstructing private information from the friendship graph. In: Proc. of the

Fifth Workshop on Social Network Systems, pp. 1–6. ACM (2012)

[2] Bonchi, F., Castillo, C., Gionis, A., Jaimes, A.: Social network analysis and mining for busi-

ness applications. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–37

(2011)

[3] Chaabane, A., Acs, G., Kaafar, M.A.: You are what you like! information leakage through

users’ interests. In: Proc. of the 19th Annual Network and Distributed System Security Sym-

posium, The Internet Society (2012)

[4] Chaudhari, G., Avadhanula, V., Sarawagi, S.: A few good predictions: selective node labeling

in a social network. In: Proc. of the 7th ACM international conference on Web search and

data mining, pp. 353–362. ACM (2014)

bibitemchen Chen, P. T., Hsieh, H. P.: Personalized mobile advertising: Its key attributes,

trends, and social impact. Technological Forecasting and Social Change, 79(3), 543–557

(2012)

[5] Dahimene, R., Constantin, C., Du Mouza, C.:Recland: A recommender system for social

networks. In: Proc. 23rd ACM Intern. Conf. on Information and Knowledge Management,

pp. 2063–2065 (2014)

[6] Davis Jr, C. A. et al.: Inferring the location of twitter messages based on user relationships.

Transactions in GIS, 15(6), pp. 735–751 (2011)

[7] Dong, Y., Yang, Y., Tang, J., Yang, Y., Chawla, V. N.: Inferring user demographics and social

strategies in mobile social networks. In: Proc. of the 20th ACM international conference on

Knowledge discovery and data mining, pp. 15–24. ACM (2014)

[8] Dougnon, Y. R., Fournier-Viger, P., Nkambou, R.: Inferring User Profiles in Social Networks

using a Partial Social Graph. In: Proc. 28th Canadian Conference on Artificial Intelligence,

Springer, LNAI 9091, pp. 84–99 (2015)

48

Page 59: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

[9] He, J., Chu, W. W., Liu, Z. V.: Inferring privacy information from social networks. In: Proc.

of 2006 IEEE International Conference on Intelligence and Security Informatics. pp. 154–

165. Springer, Heidelberg (2006)

[10] Heatherly, R., Kantarcioglu, M., Thuraisingham, B.: Preventing private information infer-

ence attacks on social networks. IEEE Transactions on Knowledge and Data Engineering.

25(8), pp. 1849–1862 (2013)

[11] Jurgens, D.: That’s what friends are for: Inferring location in online social media platforms

based on social relationships. In: Proc. of the 7th International AAAI Conference on Weblogs

and Social Media, pp. 273–282, AAAI Press (2013)

[12] Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from

digital records of human behavior. National Academy of Sciences, 110(15), pp. 5802–5805

(2013)

[13] Li, R., Wang, C., Chang, K. C. C. User profiling in an ego network: co-profiling attributes

and relationships. In: Proc. of the 23rd international conference on World wide web, pp. 819–

830. ACM (2014)

[14] Lindamood, J., Heatherly, R., Kantarcioglu, M., Thuraisingham, B.: Inferring private infor-

mation using social network data. In: Proc. of the 18th international conference on World

wide web, pp. 1145–1146. ACM (2009)

[15] Compendium of global digital statistics. We are Social Ltd. (2015)

[16] Mislove, A., Viswanath, B., Gummadi, K. P., Druschel, P.: You are who you know: inferring

user profiles in online social networks. In: Proc. of the 3rd ACM international conference on

Web search and data mining, pp. 251–260. ACM (2010)

[17] Mazhari, S., Fakhrahmad, S. M., Sadeghbeygi, H.: A user-profile-based friendship rec-

ommendation solution in social networks. Journal of Information Science, 41(3), 284–

295(2015)

[18] Papadopoulos, S., Kompatsiaris, Y., Vakali, A., Spyridonos, P.: Community detection in so-

cial media. Data Mining and Knowledge Discovery, 24(3), pp. 515–554 (2012)

[19] Petrovic, I., Perkovic, P., Stajduhar, I.: A profile-and community-driven book recommender

system. In: 38th International Convention on Information and Communication Technology,

49

Page 60: RAÏSSA YAPAN DOUGNONFinalement, je tiens également à remercier les membres du jury, qui ont accepté d’évaluer mon travail : Usef Faghihi, Mustapha Kardouchi et Philippe Fournier-Viger

Electronics and Microelectronics, pp. 631–635 (2015)

[20] Quercia, D., Kosinski, M., Stillwell, D., Crowcroft, J.: Our Twitter profiles, our selves: Pre-

dicting personality with Twitter. In: Proc. of the 3rd IEEE international conference on social

computing, pp. 180–185, IEEE Press (2011)

[21] Scott, J., Carrington, P. J.: The SAGE handbook of social network analysis. SAGE publica-

tions (2011)

[22] Takaffoli, M., Sangi, F., Fagnan, J., Zäıane, O. R.: Community evolution mining in dynamic

social networks. Procedia-Social and Behavioral Sciences, 22, pp. 49–58 (2011)

[23] Traud, A. L., Mucha, P. J., Porter, M. A.: Social structure of Facebook networks. Physica A:

Statistical Mechanics and its Applications. 391(16), pp. 4165–4180 (2012)

[24] Tiroshi, A., Kuflik, T., Kay, J., Kummerfeld, B.: Recommender systems and the social web.

In: Proc. 19th international conference on Advances in User Modeling, pp. 60-70 (2012)

[25] Tucker, C. E.: Social networks, personalized advertising, and privacy controls. Journal of

Marketing Research, 51(5), 546–562 (2014)

[26] Wakita, K.,Tsurumi, T.: Finding community structure in mega-scale social networks. In:

Proc. 16th international conference on World Wide Web, pp. 1275–1276 (2007)

[27] Zhou, X., Xu, Y., Li, Y., Josang, A., Cox, C.: The state-of-the-art in personalized recom-

mender systems for social networking. Artificial Intelligence Review, 37(2), 119–132 (2012)

50