Word embeddings and their applications (Meetup TDS, 2016-06-30)

Toulouse Data Science, 30/06/2016

Camille Pradel

Outline

• Symbolic representations of word meaning

• Vector representations based on distributional similarity

  • Co-occurrences and dimensionality reduction

  • Neural networks

• Remarkable properties

• Applications

Symbolic representations of word meaning

• WordNet: 117,659 synsets (synonym sets). A synset is a group of interchangeable words denoting a particular sense or usage.

1. car, auto, automobile, machine, motorcar -- (4-wheeled motor vehicle; usually propelled by an internal combustion engine; he needs a car to get to work)

2. car, railcar, railway car, railroad car -- (a wheeled vehicle adapted to the rails of railroad; three cars had jumped the rails)

3. car, gondola -- (car suspended from an airship and carrying personnel and cargo and power plant)

4. car, elevator car -- (where passengers ride up and down; the car was on the top floor)

5. cable car, car -- (a conveyance for passengers or freight on a cable railway; they took a cable car to the top of the mountain)

• Semantic relations between synsets

  • Hypernymy/hyponymy relations

car, auto, automobile, machine, motorcar

-> motor vehicle, automotive vehicle

-> vehicle

-> conveyance, transport

-> instrumentality, instrumentation

-> artifact, artefact

-> object, physical object

-> entity, something
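The hypernym chain above can be walked programmatically. The following is a minimal sketch, assuming a hand-built dictionary that stores only the links shown on this slide; the names `HYPERNYM` and `hypernym_chain` are illustrative and are not part of the WordNet database or of any WordNet API.

```python
# Toy graph: each synset (named here by its first word) maps to its hypernym.
# Only the chain shown on the slide is included; this is not real WordNet data access.
HYPERNYM = {
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
    "vehicle": "conveyance",
    "conveyance": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "object",
    "object": "entity",
}


def hypernym_chain(synset):
    """Return the list of synsets from `synset` up to the most general one."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


print(" -> ".join(hypernym_chain("car")))
```

Counting the steps between two synsets in such a graph is one way to approximate similarity, but it depends entirely on how the hierarchy was hand-built.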


• Semantic relations between synsets

  • Meronymy/holonymy relations

car, auto, automobile, machine, motorcar

HAS PART: accelerator, accelerator pedal, gas pedal, gas, throttle, gun

HAS PART: air bag

HAS PART: auto accessory

HAS PART: automobile engine

HAS PART: automobile horn, car horn, motor horn, horn


• Limitations

  • Lack of nuance

    Adept = expert = good = practiced = proficient = skillful

  • Out of date: recent words and senses are missing

    wicked, badass, nifty, crack, ace, wizard, genius, ninja

  • Subjective

  • Expensive to build

  • Trade-off between coverage and exhaustiveness

  • Hard to derive a similarity measure between words


Representations based on distributional similarity

Distributional hypothesis: words that appear in similar contexts tend to be similar.

The cat lies down on the doormat.
The kitten lies down on the doormat.
The dog lies down on the doormat.

My cat scratched me.
My kitten scratched me.

garantie pour conclure un prêt avec une autre banque, dans un autre pays. "Le blanchisseur
centrale européenne perdra sa crédibilité. Cette banque centrale, enfin, sera contrainte de serrer
des emprunts en marks. Voici que cette banque fait maintenant ouvertement part de sa

Philosophical Investigations. Wittgenstein, L. 1953

On the thematic level:

On the semantic level:

Co-occurrence matrices

I like deep learning

I like NLP

I enjoy flying

• Limitations:

  • Grows in size with the vocabulary

  • Consumes storage space

  • Sparse vectors -> models that are not very robust
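To make the matrix concrete, here is a minimal sketch that counts co-occurrences for the three-sentence corpus above. The window size of 1 (immediate neighbours only) and the variable names are assumptions made for illustration; the slide does not state which window was used.

```python
from collections import defaultdict

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
WINDOW = 1  # assumption: only immediate left/right neighbours are counted

# counts[(a, b)] = number of times b appears within WINDOW positions of a
counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        start = max(0, i - WINDOW)
        stop = min(len(tokens), i + WINDOW + 1)
        for j in range(start, stop):
            if j != i:
                counts[(word, tokens[j])] += 1

# One row and one column per vocabulary word: size grows as |V| x |V|
vocab = sorted({w for s in corpus for w in s.split()})
matrix = [[counts[(a, b)] for b in vocab] for a in vocab]

for word, row in zip(vocab, matrix):
    print(f"{word:>8} {row}")
```

Even on this tiny corpus most cells are zero, which illustrates the sparsity problem listed above.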

Dimensionality reduction

• Keep the most important information in a reduced number of dimensions (typically between 25 and 1000)

  • Principal component analysis (PCA)

  • Singular value decomposition (SVD)

  • …
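As an illustration of the SVD route, here is a minimal sketch using NumPy's `numpy.linalg.svd` on the co-occurrence counts of the three-sentence corpus ("I like deep learning", "I like NLP", "I enjoy flying"). The window of 1, the choice of k = 2 dimensions and the scaling of the left singular vectors by the singular values are assumptions made for readability, not settings taken from the talk.

```python
import numpy as np

# Co-occurrence counts with a window of 1 (assumed); rows/columns follow `vocab`.
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying"]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0],  # I
    [2, 0, 0, 1, 0, 1, 0],  # like
    [1, 0, 0, 0, 0, 0, 1],  # enjoy
    [0, 1, 0, 0, 1, 0, 0],  # deep
    [0, 0, 0, 1, 0, 0, 0],  # learning
    [0, 1, 0, 0, 0, 0, 0],  # NLP
    [0, 0, 1, 0, 0, 0, 0],  # flying
], dtype=float)

# Full decomposition X = U @ diag(s) @ Vt, singular values in decreasing order
U, s, Vt = np.linalg.svd(X)

# Keep only the k strongest directions as dense word vectors
k = 2
embeddings = U[:, :k] * s[:k]

for word, vec in zip(vocab, embeddings):
    print(f"{word:>8} {np.round(vec, 2)}")
```

Each word is now described by k numbers instead of one count per vocabulary word.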

Neural networks

• SVD

  • Quadratic complexity -> hard to scale

  • Difficult to take new documents into account

Alternative: learn low-dimensional word vectors directly

• Learning representations by back-propagating errors (Rumelhart et al., 1986)

• A neural probabilistic language model (Bengio et al., 2003)

• NLP (almost) from Scratch (Collobert & Weston, 2008)

• word2vec (Mikolov et al., 2013)


• Use a neural network to learn a surrogate ("fake") task

  • Predict the context of a word

  • Predict a word from its context
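The two surrogate tasks correspond to the skip-gram and CBOW models of the word2vec paper. The sketch below only shows how training examples could be built for each task from a tokenized sentence; it does not train a network. The function names, the window size and the example sentence are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """(context, center) pairs: predict the center word from its surrounding words."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, stop) if j != i]
        pairs.append((context, center))
    return pairs


tokens = "the cat lies down on the doormat".split()
print(skipgram_pairs(tokens, window=1)[:4])
print(cbow_pairs(tokens, window=1)[:2])
```

The word vectors are a by-product: they are the weights the network learns while trying to solve this prediction task.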

Demo: https://ronxin.github.io/wevi/

Efficient estimation of word representations in vector space. Mikolov, T., Chen, K., Corrado, G., & Dean, J. 2013

Naive vector representation: each word is treated as an atomic symbol

One-hot encoding

Dog   [ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 ]
Boat  [ 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 ]

Hotel [ 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 … 0 0 0 ]
Motel [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 … 0 0 0 ]

Vector length: vocabulary size
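A minimal sketch of why one-hot vectors carry no notion of similarity: the dot product of any two distinct words is 0, so "hotel" is no closer to "motel" than to "boat". The four-word vocabulary and the helper names are assumptions made for illustration; a real vocabulary would hold many thousands of words.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical tiny vocabulary; each word gets its own dimension
vocab = ["dog", "boat", "hotel", "motel"]
index = {word: i for i, word in enumerate(vocab)}
vectors = {word: one_hot(index[word], len(vocab)) for word in vocab}

print(dot(vectors["hotel"], vectors["motel"]))  # distinct words: 0
print(dot(vectors["hotel"], vectors["boat"]))   # distinct words: 0
print(dot(vectors["hotel"], vectors["hotel"]))  # same word: 1
```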


An improved model of semantic similarity based on lexical co-occurrence. Rohde, D. L., Gonnerman, L. M., and Plaut, D. C. 2006

Emergence of syntactic patterns

Remarkable properties

An improved model of semantic similarity based on lexical co-occurrence. Rohde, D. L., Gonnerman, L. M., and Plaut, D. C. 2006

Emergence of semantic patterns



• Linear relations

  • Syntactic analogies

    apple − apples ≈ car − cars ≈ family − families

  • Semantic analogies

    shirt − clothing ≈ chair − furniture

    king − man ≈ queen − woman
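The relation king − man ≈ queen − woman can be rearranged as king − man + woman ≈ queen, which suggests a simple way to answer analogy questions. The sketch below uses tiny hand-made 3-dimensional vectors; these numbers are invented for illustration and do not come from any trained model, and the `analogy` helper is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Word closest to vec(b) - vec(a) + vec(c), excluding the three input words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Invented toy vectors: rough axes are (royalty, male, female)
toy = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.7, 0.1],
    "apple":  [0.0, 0.2, 0.2],
}

# "man is to king as woman is to ?"
print(analogy("man", "king", "woman", toy))
```

With real embeddings the same arithmetic is applied to vectors of a few hundred dimensions, and the nearest neighbour is searched over the whole vocabulary.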

Efficient estimation of word representations in vector space. Mikolov, T., Chen, K., Corrado, G., & Dean, J. 2013

GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014

Capturing gender





Superlatives





Company - CEO


Applications

Distances, analogies, fine. What else?

Applications

• Find the odd one out in a list (https://github.com/dhammack/Word2VecExample)

math shopping reading science

eight six seven five three owe nine

breakfast cereal dinner lunch

england spain france italy greece germany portugal australia
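One simple heuristic for this task is to pick the word whose vector is least similar to the average of the list. The sketch below applies it to "breakfast cereal dinner lunch" with invented 2-dimensional vectors; the numbers and the `odd_one_out` helper are assumptions for illustration, not the method or data of the linked repository.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word least similar (cosine) to the centroid of the list."""
    unit = {w: normalize(vectors[w]) for w in words}
    dim = len(unit[words[0]])
    centroid = normalize(
        [sum(unit[w][i] for w in words) / len(words) for i in range(dim)]
    )
    return min(words, key=lambda w: sum(a * b for a, b in zip(unit[w], centroid)))


# Invented toy vectors: rough axes are (meal of the day, food item)
toy = {
    "breakfast": [0.90, 0.20],
    "lunch":     [0.95, 0.10],
    "dinner":    [0.90, 0.15],
    "cereal":    [0.20, 0.90],
}

print(odd_one_out(["breakfast", "cereal", "dinner", "lunch"], toy))
```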

Applications

• Dice uses word vector models to match related keywords

  • Analytics -> Business Intelligence

Implementing Conceptual Search in Solr using LSA and Word2Vec - Simon Hughes

http://fr.slideshare.net/lucidworks/implementing-conceptual-search-in-solr-using-lsa-and-word2vec-presented-by-simon-hughes-dicecom

Applications

A Word is Worth a Thousand Vectors - Chris Moody

http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/

[Figure: item embeddings, combining the item ITEM_3469 with the word "pregnant" to retrieve related items]

Applications

• Embeddings at a larger scale

  • Statistical machine translation

  • Conversational bots

  • Image and video description

https://vimeo.com/146492001

"Educational" sources

• CS224d: Deep Learning for Natural Language Processing - Richard Socher

  • http://cs224d.stanford.edu/syllabus.html

  • https://www.youtube.com/watch?v=T8tQZChniMk (Lecture 2)

• word2vec parameter learning explained - Xin Rong

  • http://arxiv.org/abs/1411.2738

• Grounding distributional semantics in the visual world - Marco Baroni

  • http://clic.cimec.unitn.it/marco/publications/lectures/marco-grounding-ds-vl-2015.pdf

• Understanding Dimensionality Reduction - Principal Component Analysis And Singular Value Decomposition - Priya Rana

  • http://hpc-asia.com/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/

• A Beginner’s Guide to word2vec AKA What’s the Opposite of Canada? - Will Critchlow

  • https://www.distilled.net/resources/a-beginners-guide-to-word2vec-aka-whats-the-opposite-of-canada/

• Wikipedia page on WordNet

  • https://fr.wikipedia.org/wiki/WordNet
