ARESOS Project: Reconstruction, Analysis, and Access to Data in Large Socio-Semantic Networks

Mission pour l'Interdisciplinarité du CNRS - Défi Masses de Données Scientifiques – MASTODONS

Patrick GALLINARI - UPMC Paris 6 - UMR 7606

Participants

• CAMS UMR 9557 - INSMI, EHESS, Paris
• CSI - UMR 7185 - INSHS, Ecole des Mines, Paris
• GIS Institut des Systèmes Complexes de Paris Île-de-France (federation of 18 institutes and universities), INSHS, Paris
• IRISA, UMR 6074 - INS2I, IRISA, U. de Rennes 1
• IRIT, UMR 5505 - INS2I, U. Toulouse 3
• IXXI, INS2I, ENS Lyon
• LATTICE, UMR 8094 - INSHS, ENS / U. Paris 3
• LIG, UMR 5217 - INS2I, U. Joseph Fourier, Grenoble
• LIP6, UMR 7606 - INS2I, U. Pierre et Marie Curie, Paris

23/01/2015 Défi MASTODONS - Projet ARESOS 2

Context

• Community
  • ICWSM: International AAAI Conference on Web and Social Media
  • “blends social science and computational approaches to answer important and challenging questions about human social behavior through social media while advancing computational tools for vast and unstructured data.”

• Analysis of large socio-semantic networks
  • Production and diffusion of content on media
  • Human at the centre of the process
• Characterization
  • Interactions
    • Individual + social links
    • Structure of social interactions
  • Content
    • Multi-scale: micro, meso, macro, temporal
  • Dynamics of conversations and concepts
    • Multi-scale
    • Multi-source


Context: diversity of information sources and time scales


Context: everything has become « social »


ARESOS themes

• Representation and access to social content
  • Who speaks about what, and how? Natural language processing, data processing, identification of roles, information flows, topic evolution, conversation tracking, sentiment analysis
• Dynamicity: socio-semantic structures and diffusion phenomena
  • Discovery of latent structures, morphogenesis, content diffusion, co-evolution of structure and semantics
• Social information retrieval
  • Analysis of microblogs, collaborative recommendation


Presentation

• Focus on
  • Emergence of socio-semantic structures

• Science evolution

• Representation learning

• Machine learning

• Natural Language Processing


Emergence of socio-semantic structures: Science evolution

ISC PIF – LIP6 – IRISA

D. Chavalarias, et al.


Example of challenges: analysis of socio-semantic networks


Quantitative epistemology

• Challenges
  • What is the structure of knowledge spaces?
  • How does science evolve?
  • What are the driving forces?


Space: science mapping


Time: evolution of p-values


Space and Time Science evolution – dynamics at the meso level


Topic dynamics – meso level


Phylomemies: reconstruction of science dynamics



Scaling (CAMS/ IRISA/ LIP6)

• Currently:
  • About 30M documents processed, but only about 100k with NLP
  • Maps built from about 5,000 expressions (i.e. domain-centered semantic maps with 5,000 expressions)
• Target:
  • NLP on 30M documents (extraction of relevant key phrases)
  • Maps built from about 1M expressions


• Feasibility study on the WoS database (30M items, 1990-2013)

• Internship 1: Spark/MapReduce implementation for key-phrase extraction and clustering (finding maximal cliques)

• Internship 2: Experimentation on a Hadoop cluster at Rennes and LIP6 using algebra on Resilient Distributed Datasets (selection, union, join, ...)
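
The clustering step mentioned above (finding maximal cliques in a term co-occurrence graph) can be illustrated on a single machine. The sketch below uses the classic Bron-Kerbosch algorithm with pivoting; this is an assumption for illustration, since the slides do not say which clique algorithm the Spark implementation used. The term graph is hypothetical.

```python
from typing import Dict, Iterator, List, Set, Tuple


def build_adjacency(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a symmetric adjacency map from undirected co-occurrence edges."""
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj: Dict[str, Set[str]]) -> Iterator[Set[str]]:
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> Iterator[Set[str]]:
        if not p and not x:
            yield set(r)  # r cannot be extended: it is maximal
            return
        # Pivot on the vertex with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Hypothetical co-occurrence edges between key phrases.
edges = [
    ("gene", "protein"), ("gene", "expression"), ("protein", "expression"),
    ("expression", "network"), ("network", "graph"),
]
adj = build_adjacency(edges)
cliques = sorted(sorted(c) for c in maximal_cliques(adj))
print(cliques)
```

Each maximal clique is a group of terms that all co-occur with one another, which is one simple way to read a cluster off such a graph.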


Learning Representations IRIT-LATTICE-LIP6

Learning representations

• Objective: learn robust and meaningful representations from data
• Handcrafted versus learned representations
  • It is often hard to define what good representations are
• General methods that can be used for
  • Different application domains
  • Multimodal data
  • Multi-task learning
• Learning the latent factors behind the data generation
  • Unsupervised feature learning
• Several families of techniques
  • Algebraic and statistical models
  • (Deep) neural networks


Learning representations - Success story

• Very active recent domain; technology adopted (sometimes already in production) by major players (Google, Facebook, Microsoft, ...)
• Success on many academic benchmarks for a wide range of problems
  • Image / scene labeling
  • Speech recognition
  • Natural language processing
  • Language translation
  • etc.


Learning language models (Mikolov et al. 2013)

• Simple neural network language model
  • Word2Vec software
• Analogical reasoning
  • Paris – France + Italy = Rome
  • Discovery of 3-way relations
  • Semantic, syntactic
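
The analogy "Paris – France + Italy = Rome" amounts to vector arithmetic followed by a nearest-neighbour search. The sketch below illustrates this with hand-built 3-D toy vectors (hypothetical, not Word2Vec output), chosen so that capital minus country is a shared offset; the function name `analogy` is also an assumption.

```python
import numpy as np

# Hypothetical toy embeddings: the third coordinate encodes "is a capital".
emb = {
    "France": np.array([1.0, 0.0, 0.0]),
    "Italy": np.array([0.0, 1.0, 0.0]),
    "Paris": np.array([1.0, 0.0, 1.0]),
    "Rome": np.array([0.0, 1.0, 1.0]),
    "pizza": np.array([0.2, 0.9, -0.5]),
}


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a: str, b: str, c: str, emb: dict) -> str:
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding a, b and c."""
    target = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


print(analogy("Paris", "France", "Italy", emb))  # prints "Rome"
```

With real embeddings the result is only approximately right, which is why the query words themselves are usually excluded from the candidate set.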


Neural image caption generator (Vinyals et al. 2015)

• Objective
  • Learn a textual description of an image
  • i.e. given an image as input, generate a sentence that describes the objects and their relations
• Model
  • Inspired by a translation approach, but the input is an image
  • Use a recurrent neural network to generate the textual description word by word, given a learned representation of the image produced by a deep Convolutional Neural Network


Work in ARESOS on representation learning

• Social networks
  • Relational classification
  • Information diffusion
• Natural language processing
  • Semantic compositionality


Relational Classification: heterogeneous graphs

• Label items with corresponding tags, classes, …
• Several methods exist for homogeneous graphs
  • Label propagation, graph metrics, etc.
  • They do not extend to heterogeneous graphs and multiple link types
• Claim
  • Labels of different items are correlated


Relational Classification: heterogeneous graphs

• Correlations among labels from different node types
  • Example from DBLP data (author domains × conferences)
    • Authors: 4 labels
    • Conferences: 20 labels
  • P(author domain | conference) reveals correlations
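
A quantity such as P(author domain | conference) can be estimated by relative frequency over author-conference links. The sketch below shows the computation on hypothetical data; the label values and the function name `conditional` are assumptions, not the DBLP labels used in the project.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical (author domain, conference label) pairs, one per author-conference link.
links: List[Tuple[str, str]] = [
    ("databases", "VLDB"), ("databases", "VLDB"), ("databases", "SIGMOD"),
    ("machine learning", "ICML"), ("machine learning", "ICML"),
    ("databases", "ICML"),
]


def conditional(links: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(author domain | conference) as count(domain, conf) / count(conf)."""
    joint = Counter(links)
    per_conf = Counter(conf for _, conf in links)
    table: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (domain, conf), n in joint.items():
        table[conf][domain] = n / per_conf[conf]
    return dict(table)


p = conditional(links)
for conf, dist in sorted(p.items()):
    print(conf, dist)
```

A sharply peaked row (here, every VLDB link goes to "databases") is the kind of label correlation across node types that the slide refers to.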



Relational classification: representation learning

• Instead of working in a discrete space, solve the problem in a continuous representation space
  • Project the whole heterogeneous graph into a common continuous space
• Solve simultaneously
  • Learn representations for all items, constrained by the graph relations
  • Learn classifiers for the different node types in this new space
• This exploits
  • Graph proximity of items of different types
  • Label correlation
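
One way to write the joint problem described above, given as a sketch only since the slides do not state the objective: z_i is the latent vector of node i, t_i its node type, f with parameters θ^{t} a classifier per node type, Δ a classification loss, 𝓛 the set of labeled nodes, E the edge set, w_ij an edge weight and λ a trade-off constant. All of these symbols are assumptions introduced for illustration.

```latex
\min_{z,\,\theta}\;
\sum_{i \in \mathcal{L}} \Delta\!\left(f_{\theta^{t_i}}(z_i),\, y_i\right)
\;+\; \lambda \sum_{(i,j) \in E} w_{ij}\, \lVert z_i - z_j \rVert^{2}
```

The first term trains the per-type classifiers in the latent space; the second pulls connected nodes together regardless of their type, which is how graph proximity and label correlation are exploited at the same time.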


Relational classification: representation learning

• Example
  • Visualisation of the latent space for DBLP

[Figure: two panels, "Class centroids" and "All nodes, 1 color = 1 class"]


Content information diffusion

• Objective
  • Predict information diffusion cascades
• State of the art
  • Models often inspired by earlier work in epidemiology and social science
  • Graph propagation models
    • e.g. Independent Cascade model, Linear Threshold model
  • General assumptions
    • Closed world, all nodes behave the same way, no features (e.g. content) associated with nodes
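
As a reminder of what a graph propagation model looks like, here is a minimal simulation of the Independent Cascade model: each newly activated node gets one chance to activate each of its inactive neighbours, with a probability attached to the edge. The graph and probabilities are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbour, activation probability)]


def independent_cascade(graph: Graph, seeds: Iterable[str], rng: random.Random) -> Set[str]:
    """Simulate one Independent Cascade run and return the set of activated nodes."""
    active: Set[str] = set(seeds)
    frontier: List[str] = list(active)
    while frontier:
        newly_active: List[str] = []
        for u in frontier:
            for v, prob in graph.get(u, []):
                # A single activation attempt per (u, v) pair, only when u has just become active.
                if v not in active and rng.random() < prob:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active


# Hypothetical directed graph with edge activation probabilities.
graph: Graph = {"a": [("b", 0.5), ("c", 0.2)], "b": [("d", 0.5)], "c": [("d", 0.9)]}
print(sorted(independent_cascade(graph, ["a"], random.Random(0))))
```

The model needs the graph and the edge probabilities up front, and it treats every node identically, which are the assumptions listed above.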


Learning representation for content information diffusion

• Learn propagation models directly from observed cascades, without any assumption about the network
  • All the factors influencing diffusion (external influences, node roles, etc.) are extracted directly from the data
• Representation learning
  • Propagation is modeled in a latent space, where the diffusion process follows a simple formalization
  • Map the discrete problem onto a continuous space
  • The latent space is learned from the cascade data
• Additional benefit: inference (here, diffusion prediction) is extremely fast
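
To see why inference can be fast once users live in a latent space, consider one simple formalization, assumed here purely for illustration: users closer to the source of a cascade are predicted to be contaminated earlier. Prediction then reduces to one distance computation and a sort. The latent positions below are hypothetical and are not learned in this sketch.

```python
from typing import List

import numpy as np


def predict_contamination_order(z: np.ndarray, source: int) -> List[int]:
    """Rank the other users by Euclidean distance to the source (closest first)."""
    distances = np.linalg.norm(z - z[source], axis=1)
    order = np.argsort(distances, kind="stable")
    return [int(u) for u in order if u != source]


# Hypothetical 2-D latent positions for 4 users.
z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [2.0, 2.0]])
print(predict_contamination_order(z, source=0))  # prints [1, 3, 2]
```

No graph traversal or simulation is needed at prediction time; the cost of fitting the latent positions to observed cascades is paid once, during training.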


Information diffusion

• Visualization
  • Digg dataset
    • Users post stories that receive diggs (diggs = likes)
    • Cascades = stories and their diggs
    • 1 digg = 1 contamination
    • 1 month of crawling
    • 5k users, 71k links
    • 150k training cascades
    • 66k test cascades
  • Latent space of size 2
    • User clusters
    • Colored points = 4 observed test cascades


Natural language processing: tensor model of semantic compositionality (Van de Cruys, Poibeau, Korhonen)

• Compositionality
  • The meaning of a complex expression is a function of the meaning of its parts and the way they are combined
• Distributional hypothesis of meaning
  • Words that appear in the same contexts tend to be semantically similar
• How to reconcile the principle of compositionality with distributional semantics?
• Objective
  • Model compositionality as a multi-way interaction between latent factors learned from the data
• Task
  • Learn three-way interactions (subject, verb, object), SVO, from knowledge bases


• Method
  • Use tensor decomposition to capture 3-way dependencies
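
A small numerical sketch may help picture what a 3-way decomposition provides. It uses a CP-style (sum of rank-one terms) form, which is an assumption: the slides do not say which decomposition the authors used. The factor matrices are random placeholders; a real model would fit them to observed subject-verb-object counts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_verb, n_obj, rank = 4, 3, 5, 2

# Hypothetical non-negative latent factors, one row per subject, verb or object.
S = rng.random((n_subj, rank))
V = rng.random((n_verb, rank))
O = rng.random((n_obj, rank))

# Full tensor implied by the factors: X[s, v, o] = sum_k S[s, k] * V[v, k] * O[o, k]
X = np.einsum("sk,vk,ok->svo", S, V, O)


def score(s: int, v: int, o: int) -> float:
    """Score one (subject, verb, object) triple from the factors, without building X."""
    return float(np.sum(S[s] * V[v] * O[o]))


print(X.shape, score(1, 2, 3))
```

The point of the factorization is that any triple, including one never observed, gets a score from three short vectors rather than from a cell of a huge sparse tensor.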


• What is it good for?
  • Learning knowledge bases in a continuous representation form
    • e.g. Freebase, Wikipedia, etc.
  • Inferring new knowledge
    • Analogical reasoning properties
  • First step towards complex NLP/IR tasks
    • e.g. Question Answering
• Evaluation here
  • Compute similarity scores between SVO phrases


Resources

• Corpora
  • Dynamic Twitter database
    • tina.iscpif.fr/bigdata
• Platforms
  • http://Gargantext.org
    • Reconstruction and mapping of socio-semantic networks
  • http://graphbrain.algopol.fr/
    • Parsing of socio-textual corpora to build socio-semantic graphs for sociologists
  • Markovian segmentation – labeling of natural text


Platforms


Corpora

• Dynamic Twitter database



• Thank you

• http://mastodons.lip6.fr/
