Symbolic and statistical Analyses of meta-data using the “Semana” platform: a bundle of tools for KDD research

Georges Sauvet (CNRS, Toulouse), Centre de Recherche et d’Etude de l’Art Préhistorique, UMR 5608: Travaux et Recherches Archéologiques sur les Cultures, les Espaces et les Sociétés

CASK Sorbonne 2008, Paris, June 13th


Page 1

Symbolic and statistical Analyses of meta-data

using the “Semana” platform —

a bundle of tools for the KDD research

Georges Sauvet (CNRS, Toulouse)

Centre de Recherche et d’Etude de l’Art Préhistorique

UMR 5608: Travaux et Recherches Archéologiques sur les Cultures, les Espaces et les Sociétés


CASK Sorbonne 2008, Paris, June 13th

Page 2

SEMANA and Data Mining

sampling

Data coding

KDD techniques (Rough Set, FCA, statistical analysis, etc.)

interpretation

Datawarehouse

After B. Wüthrich, 1998

“SEMANA”, a bundle of tools aimed at making these tasks easier

Page 3

Architecture of the SEMANA platform

A software bundle written in Transcript®, the programming language of Revolution®

Standalone applications for Macintosh and Windows

• Dynamic DB Builder: Data sheets, Data coding, Data storage
• Tree Builder Assistant: Aid to code structuration
• Attribute Editor: Discretization, Logical scaling, …
• Tables (various formats): “Multi-valued tables”, “One-valued tables”
• Formal Concept Analysis (Wille, Ganter): Galois lattice, “central concepts”
• Statistical tools (Benzecri): Correlation Matrix, Correspondence Factor Analysis, Hierarchical Classifications
• Rough Set Theory (Pawlak): Upper approx., Lower approx., Reducts, Core, Discriminating power
• Decision Logic (Bolc, Cytowski and Stacewicz): Minimal rules, Attribute strength

Page 4

Working with the SEMANA platform

Three illustrations:

“Ten-ta-to”: the proximal deictic adjectives in Polish

The category of Aspect in Polish

Representations of women in Palaeolithic Art

SEMANA is twofold:

1) Tools for Intelligent Database Designing => Dynamic DB Builder

• providing statistical information about the use of attribute-value pairs (AV)
• suggesting iterative restructuration of AV

2) Tools for KDD research: integration of RST, FCA, Statistical Data Analyses

Page 5

Case 1:

the Proximal Deictic Adjectives in Polish

Page 6

The proximal deictic adjectives in Polish

In Polish School Grammar, the adjective declension consists in the amalgamation of three “morphological categories”:

Case = {Nominative, Accusative, Genitive, Dative, Instrumental, Locative}

Number = {singular, plural}

Gender = In Polish Linguistics (cf. SALONI, Z. 1976), up to 7 gender classes have been proposed:

Singular:
1. feminine
2. neuter
3. animal masculine (“animal” corresponds to the feature “animate” in descriptions of other European languages)
4. non-animal masculine

Plural:
5. personal masculine (“personal” corresponds to the feature “human”)
6. non-personal masculine
7. “pluralia tantum” (defective nouns with no singular form)

Page 7

The proximal deictic adjectives in Polish

The root of these adjectives is a single phoneme t-.

13 forms are used: ten, ta, to, tym, tymi, tych, te, te*, temu, tej, tego, ta*, ci

Examples (only Nominative case)

Gender      Polish singular   Polish plural   English translation
Masculine   ten dom           te domy         this/these house(s)
            ten pies          te psy          this/these dog(s)
            ten pan           ci panowie      this/these sir(s)
Feminine    ta deska          te deski        this/these board(s)
            ta gęś            te gęsi         this/these goose/geese
            ta pani           te panie        this/these lady/ladies
Neuter      to pióro          te pióra        this/these feather(s)
            to kurczę         te kurczęta     this/these chicken(s)
            to dziecko        te dzieci       this/these child/children
...         ...               ...

Page 8

The proximal deictic adjectives in Polish

In order to elucidate the problem of Gender in Polish noun morphology, H. and A. Wlodarczyk have built a database of usages of the proximal deictic adjectives.

As the 7 “sub-genders” of Polish School Grammars correspond neither to any known semantic or ontological category nor to any known grammatical sub-gender in other languages, they proposed to split the “sub-genders” of the Gender attribute into three attributes:

gender = {feminine, neuter, masculine}
animacy = {animate, inanimate}
humanity = {human, non_human}

Page 9

TENTATO: database first version

sample

morpheme

attribute, value (features chosen for each entry)

Page 10

TENTATO: database first version

An AV Table is automatically collected

Objects = 108
Distinct objects = 108
Duplicates = 0
Duplicate ratio = 0
Attributes = 5 (with resp. 6,2,3,2,2 values)
NB: in this calculation, non-used attributes (*) have been replaced by a null value ('nAtt')
==================================================
Theoretical Number of Combinations = 144
Apparent Saturation Index : 75%
==================================================
The following pairs of attributes could be merged:
[hum|ina] Confidence index = 99.9%
[hum|nhu] Confidence index = 99.9%
[ina|nhu] Confidence index = 99.9%
==================================================
STATISTICAL USE OF AV
Attr  Value   occur
Ani   anim    72
Ani   inanim  36
Case  A       18
Case  D       18
Case  G       18
Case  I       18
Case  L       18
Case  N       18
Gnd   fem     36
Gnd   masc    36
Gnd   neu     36
Hum   hum     36
Hum   nhum    72
Nb    plur    54
Nb    sing    54
==================================================
Non-Attested Pairs of Values = 1
ina,hum,2,4
--------------------------------------------------
Assuming that all non-attested pairs are impossible:
Maximum number of combinations = 108
Corrected Saturation Index : 100%
--------------------------------------------------


The program suggests that these attributes could be merged.

The program indicates that the pair {inanimate-human} does not exist (for obvious reasons).
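The figures in the report can be reproduced with a few lines of Python. This is only a sketch: it assumes that the "theoretical number of combinations" is the product of the numbers of values per attribute and that the saturation index is the ratio of distinct objects to that number, which is consistent with the values printed above but is not stated on the slide. The function names are hypothetical and are not part of Semana.

```python
from itertools import product
from math import prod

# Value sets of TENTATO version 1, as listed in the report above.
ATTRIBUTES = {
    "Case": ["N", "A", "G", "D", "I", "L"],
    "Nb": ["sing", "plur"],
    "Gnd": ["fem", "masc", "neu"],
    "Ani": ["anim", "inanim"],
    "Hum": ["hum", "nhum"],
}


def theoretical_combinations(attributes):
    """Product of the number of values of every attribute (assumed definition)."""
    return prod(len(values) for values in attributes.values())


def saturation_index(n_distinct, n_combinations):
    """Share (in %) of the possible combinations actually present in the DB."""
    return 100.0 * n_distinct / n_combinations


# Enumerate every combination, then drop those containing the
# non-attested pair {inanimate, human}.
names = list(ATTRIBUTES)
all_objects = [dict(zip(names, combo)) for combo in product(*ATTRIBUTES.values())]
possible = [o for o in all_objects
            if not (o["Ani"] == "inanim" and o["Hum"] == "hum")]

theoretical = theoretical_combinations(ATTRIBUTES)
apparent = saturation_index(108, theoretical)
corrected = saturation_index(108, len(possible))
print(theoretical, apparent, len(possible), corrected)
```

With the 108 distinct objects of the database, the apparent index uses all 144 combinations as denominator, whereas the corrected index only counts the combinations that remain once the non-attested pair is excluded.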

Page 11

TENTATO (Version 1): Formal Concept Analysis

[Figure: TENTATO Version 1, simplified lattice and complete lattice]

Inanimate depends on non-human

Human depends on animate

Test of dependence

Total Dependence
ina => nhu (36/36)
hum => an (36/36)
High probability (>90%):
none

Page 12

TENTATO: second version

Objects = 108
Distinct objects = 108
Duplicates = 0
Duplicate ratio = 0
Attributes = 4 (with resp. 3,6,3,2 values)
NB: in this calculation, non-used attributes (*) have been replaced by a null value ('nAtt')
==================================================
Theoretical Number of Combinations = 108
Apparent Saturation Index : 100%
==================================================
No attributes could be merged
==================================================
STATISTICAL USE OF AV
Attr  Value         occur
ANY   human         36
ANY   inanimate     36
ANY   nhuman        36
CAS   accusative    18
CAS   dative        18
CAS   genetive      18
CAS   instrumental  18
CAS   locative      18
CAS   nominative    18
GND   feminine      36
GND   masculine     36
GND   neuter        36
NBR   plural        54
NBR   singular      54
==================================================
Non-Attested Pairs of Values = 0
-------------------------------------------------
Assuming that all non-attested pairs are impossible:
Maximum number of combinations = 108
Corrected Saturation Index : 100%


In a second trial, the attributes ANIMACY ({ANI}=[animate|inanimate]) and HUMANITY ({HUM}=[human|nhuman]) are merged into a three-valued attribute:

{ANY}=[nhuman|inanimate|human]

No further attribute merging is possible; all pairs of values are attested.
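The merging step can be illustrated with a small sketch. It assumes that each attested (animacy, humanity) pair is mapped to one value of the new attribute ANY, so that the non-attested pair (inanimate, human) simply has no image. The attribute codes follow the two reports; the function name and the example object are hypothetical.

```python
from math import prod

# One ANY value for each attested (Ani, Hum) pair; (inanim, hum) is not attested.
MERGE_ANY = {
    ("anim", "hum"): "human",
    ("anim", "nhum"): "nhuman",
    ("inanim", "nhum"): "inanimate",
}


def merge_animacy_humanity(obj):
    """Replace the attributes Ani and Hum of an object by the single attribute ANY."""
    merged = {k: v for k, v in obj.items() if k not in ("Ani", "Hum")}
    merged["ANY"] = MERGE_ANY[(obj["Ani"], obj["Hum"])]
    return merged


example = {"Case": "N", "Nb": "plur", "Gnd": "masc", "Ani": "anim", "Hum": "hum"}
merged_example = merge_animacy_humanity(example)
print(merged_example)

# Numbers of values per attribute in version 2, as given in the report.
value_counts_v2 = {"ANY": 3, "CAS": 6, "GND": 3, "NBR": 2}
theoretical_v2 = prod(value_counts_v2.values())
print(theoretical_v2)
```

After the merge the product of the value counts equals the number of distinct objects, which is why the apparent saturation index reaches 100% without any correction.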

Page 13

TENTATO: Formal Concept Analysis

[Figure: simplified lattice and complete lattice for TENTATO Version 1 and for TENTATO Version 2]

TENTATO Version 1, test of dependence:
Total Dependence
ina => nhu (36/36)
hum => an (36/36)
High probability (>90%):
none
=> Inanimate depends on non-human; Human depends on animate

TENTATO Version 2, test of dependence:
Total Dependence
none
High probability (>90%):
none
=> All the attributes at the same level: no hierarchy

Page 14

TENTATO-2: Rough Set Theory and “Minimal Rules”

r1 (9) : CASdat,NBRplu --> tym
r2 (3) : CASins,GNDmas,NBRsin --> tym
r3 (3) : CASins,GNDneu,NBRsin --> tym
r4 (3) : CASloc,GNDmas,NBRsin --> tym
r5 (3) : CASloc,GNDneu,NBRsin --> tym
r6 (9) : CASins,NBRplu --> tymi
r7 (1) : CASacc,ANYhum,GNDmas,NBRplu --> tych
r8 (9) : CASgen,NBRplu --> tych
r9 (9) : CASloc,NBRplu --> tych
r10 (3) : CASacc,GNDneu,NBRsin --> to
r11 (3) : CASnom,GNDneu,NBRsin --> to
r12 (3) : CASacc,ANYina,NBRplu --> te
r13 (3) : CASacc,ANYnhu,NBRplu --> te
r14 (3) : CASacc,GNDfem,NBRplu --> te
r15 (3) : CASacc,GNDneu,NBRplu --> te
r16 (3) : CASnom,ANYina,NBRplu --> te
r17 (3) : CASnom,ANYnhu,NBRplu --> te
r18 (3) : CASnom,GNDfem,NBRplu --> te
r19 (3) : CASnom,GNDneu,NBRplu --> te
r20 (1) : CASacc,ANYina,GNDmas,NBRsin --> ten
r21 (3) : CASnom,GNDmas,NBRsin --> ten
r22 (3) : CASdat,GNDmas,NBRsin --> temu
r23 (3) : CASdat,GNDneu,NBRsin --> temu
r24 (3) : CASdat,GNDfem,NBRsin --> tej
r25 (3) : CASgen,GNDfem,NBRsin --> tej
r26 (3) : CASloc,GNDfem,NBRsin --> tej
r27 (1) : CASacc,ANYhum,GNDmas,NBRsin --> tego
r28 (1) : CASacc,ANYnhu,GNDmas,NBRsin --> tego
r29 (3) : CASgen,GNDmas,NBRsin --> tego
r30 (3) : CASgen,GNDneu,NBRsin --> tego
r31 (3) : CASacc,GNDfem,NBRsin --> te*
r32 (3) : CASnom,GNDfem,NBRsin --> ta
r33 (3) : CASins,GNDfem,NBRsin --> ta*
r34 (1) : CASnom,ANYhum,GNDmas,NBRplu --> ci

The 108 distinct objects of the DB can be described by only 34 morphological rules. Note that CAS and NBR are required in every rule, GND in 26/34 and ANY in only 9/34.

A procedure derived from Rough Set Theory allows us to calculate the “minimal rules” (i.e. the values of the attributes which condition the morpheme to be used)
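The attribute counts quoted for the 34 rules can be checked directly from the rule list. The sketch below does not compute the minimal rules themselves (the Rough Set procedure is not detailed on the slide); it only parses the rules as printed, with supports omitted, and counts in how many premises each attribute occurs. The function names are hypothetical.

```python
from collections import Counter

# The 34 minimal rules of TENTATO-2, copied from the slide (supports omitted).
RULES = """
CASdat,NBRplu --> tym
CASins,GNDmas,NBRsin --> tym
CASins,GNDneu,NBRsin --> tym
CASloc,GNDmas,NBRsin --> tym
CASloc,GNDneu,NBRsin --> tym
CASins,NBRplu --> tymi
CASacc,ANYhum,GNDmas,NBRplu --> tych
CASgen,NBRplu --> tych
CASloc,NBRplu --> tych
CASacc,GNDneu,NBRsin --> to
CASnom,GNDneu,NBRsin --> to
CASacc,ANYina,NBRplu --> te
CASacc,ANYnhu,NBRplu --> te
CASacc,GNDfem,NBRplu --> te
CASacc,GNDneu,NBRplu --> te
CASnom,ANYina,NBRplu --> te
CASnom,ANYnhu,NBRplu --> te
CASnom,GNDfem,NBRplu --> te
CASnom,GNDneu,NBRplu --> te
CASacc,ANYina,GNDmas,NBRsin --> ten
CASnom,GNDmas,NBRsin --> ten
CASdat,GNDmas,NBRsin --> temu
CASdat,GNDneu,NBRsin --> temu
CASdat,GNDfem,NBRsin --> tej
CASgen,GNDfem,NBRsin --> tej
CASloc,GNDfem,NBRsin --> tej
CASacc,ANYhum,GNDmas,NBRsin --> tego
CASacc,ANYnhu,GNDmas,NBRsin --> tego
CASgen,GNDmas,NBRsin --> tego
CASgen,GNDneu,NBRsin --> tego
CASacc,GNDfem,NBRsin --> te*
CASnom,GNDfem,NBRsin --> ta
CASins,GNDfem,NBRsin --> ta*
CASnom,ANYhum,GNDmas,NBRplu --> ci
""".strip().splitlines()


def parse_rule(line):
    """Split 'CASdat,NBRplu --> tym' into (['CASdat', 'NBRplu'], 'tym')."""
    conditions, morpheme = line.split(" --> ")
    return conditions.split(","), morpheme


def attribute_strength(lines):
    """Count, for each attribute, the number of rules whose premise uses it."""
    strength = Counter()
    for line in lines:
        conditions, _ = parse_rule(line)
        # Each condition is a 3-letter attribute code followed by its value.
        strength.update({cond[:3] for cond in conditions})
    return dict(strength)


strength = attribute_strength(RULES)
morphemes = {parse_rule(line)[1] for line in RULES}
print(strength, len(morphemes))
```

The counts give a rough measure of attribute strength: an attribute that is needed in every premise (CAS, NBR) conditions the choice of the morpheme far more than one that appears in only a few rules (ANY).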

Page 15

TENTATO-2: Statistical analysis

The Multi-valued Table is unfolded into a One-valued Table...

…and the One-valued Table is transformed into a Burt table…

A Burt table is a square symmetrical table giving the number of co-occurrences of the attribute values
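The two transformations can be sketched on a toy table. The four objects below are taken from the nominative examples given earlier (ten, ta, to, te); the column layout and the function names are assumptions made for illustration and do not describe Semana's internal format. Following the layout of the slides, the morpheme (here the hypothetical column FORM) is treated as one more attribute.

```python
def unfold(objects):
    """Turn a multi-valued table (list of dicts) into a one-valued 0/1 table."""
    columns = sorted({(attr, val) for obj in objects for attr, val in obj.items()})
    table = [[1 if obj.get(attr) == val else 0 for attr, val in columns]
             for obj in objects]
    return columns, table


def burt_table(table):
    """Cross-tabulate every column with every other one (Z-transpose times Z)."""
    n = len(table[0])
    return [[sum(row[i] * row[j] for row in table) for j in range(n)]
            for i in range(n)]


# Toy multi-valued table: nominative forms only.
objects = [
    {"CAS": "nom", "GND": "mas", "NBR": "sin", "FORM": "ten"},
    {"CAS": "nom", "GND": "fem", "NBR": "sin", "FORM": "ta"},
    {"CAS": "nom", "GND": "neu", "NBR": "sin", "FORM": "to"},
    {"CAS": "nom", "GND": "fem", "NBR": "plu", "FORM": "te"},
]

columns, one_valued = unfold(objects)
burt = burt_table(one_valued)
idx = {col: i for i, col in enumerate(columns)}

# Diagonal cells hold the frequency of a value, off-diagonal cells the
# number of objects sharing two values.
print(burt[idx[("CAS", "nom")]][idx[("CAS", "nom")]])
print(burt[idx[("GND", "fem")]][idx[("NBR", "sin")]])
```

Each row of the one-valued table contains exactly one 1 per attribute, so every row of the Burt table sums to the frequency of its value multiplied by the number of attributes.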

Page 16

TENTATO-2: Correspondence Factor Analysis (CFA)

BURT TABLE
acc dat gen ins loc nom hum ina nhu fem mas neu plu sin ci ta ta* te te* tego tej tem ten to tych tym tymi
acc 18 0 0 0 0 0 6 6 6 6 6 6 9 9 0 0 0 8 3 2 0 0 1 3 1 0 0
dat 0 18 0 0 0 0 6 6 6 6 6 6 9 9 0 0 0 0 0 0 3 6 0 0 0 9 0
gen 0 0 18 0 0 0 6 6 6 6 6 6 9 9 0 0 0 0 0 6 3 0 0 0 9 0 0
ins 0 0 0 18 0 0 6 6 6 6 6 6 9 9 0 0 3 0 0 0 0 0 0 0 0 6 9
loc 0 0 0 0 18 0 6 6 6 6 6 6 9 9 0 0 0 0 0 0 3 0 0 0 9 6 0
nom 0 0 0 0 0 18 6 6 6 6 6 6 9 9 1 3 0 8 0 0 0 0 3 3 0 0 0
hum 6 6 6 6 6 6 36 0 0 12 12 12 18 18 1 1 1 4 1 3 3 2 1 2 7 0 0
ina 6 6 6 6 6 6 0 36 0 12 12 12 18 18 0 1 1 6 1 2 3 2 2 2 6 7 3
nhu 6 6 6 6 6 6 0 0 36 12 12 12 18 18 0 1 1 6 1 3 3 2 1 2 6 7 3
fem 6 6 6 6 6 6 12 12 12 36 0 0 18 18 0 3 3 6 3 0 9 0 0 0 6 3 3
mas 6 6 6 6 6 6 12 12 12 0 36 0 18 18 1 0 0 4 0 5 0 3 4 0 7 9 3
neu 6 6 6 6 6 6 12 12 12 0 0 36 18 18 0 0 0 6 0 3 0 3 0 6 6 9 3
plu 9 9 9 9 9 9 18 18 18 18 18 18 54 0 1 0 0 16 0 0 0 0 0 0 19 9 9
sin 9 9 9 9 9 9 18 18 18 18 18 18 0 54 0 3 3 0 3 8 9 6 4 6 0 12 0
ci 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
ta 0 0 0 0 0 3 1 1 1 3 0 0 0 3 0 3 0 0 0 0 0 0 0 0 0 0 0
ta* 0 0 0 3 0 0 1 1 1 3 0 0 0 3 0 0 3 0 0 0 0 0 0 0 0 0 0
te 8 0 0 0 0 8 4 6 6 6 4 6 16 0 0 0 0 16 0 0 0 0 0 0 0 0 0
te* 3 0 0 0 0 0 1 1 1 3 0 0 0 3 0 0 0 0 3 0 0 0 0 0 0 0 0
tego 2 0 6 0 0 0 3 2 3 0 5 3 0 8 0 0 0 0 0 8 0 0 0 0 0 0 0
tej 0 3 3 0 3 0 3 3 3 9 0 0 0 9 0 0 0 0 0 0 9 0 0 0 0 0 0
tem 0 6 0 0 0 0 2 2 2 0 3 3 0 6 0 0 0 0 0 0 0 6 0 0 0 0 0
ten 1 0 0 0 0 3 1 2 1 0 4 0 0 4 0 0 0 0 0 0 0 0 4 0 0 0 0
to 3 0 0 0 0 3 2 2 2 0 0 6 0 6 0 0 0 0 0 0 0 0 0 6 0 0 0
tych 1 0 9 0 9 0 7 6 6 6 7 6 19 0 0 0 0 0 0 0 0 0 0 0 19 0 0
tym 0 9 0 6 6 0 7 7 7 3 9 9 9 12 0 0 0 0 0 0 0 0 0 0 0 21 0
tymi 0 0 0 9 0 0 3 3 3 3 3 3 9 0 0 0 0 0 0 0 0 0 0 0 0 0 9
FJ 90 90 90 90 90 90 180 180 180 180 180 180 270 270 5 15 15 80 15 40 45 30 20 30 95 105 45


Numbers in the Table are considered as coordinates of points in an N-dimensional space.

[Figure: cloud of points in a space with axes x, y, z, and its axes of inertia F1, F2, F3]

CFA calculates the axes of inertia of the cloud of points (F1, F2, F3 …) and displays projections in planes [F1,F2], [F1,F3], etc.

CFA is implemented in “Semana”

Page 17

TENTATO-2: Correspondence Factor Analysis (CFA)

CLOUD J FREQ QLT INR | F#1 COR CTR | F#2 COR CTR | F#3 COR CTR | F#4 COR CTR |
-----------------------------------------------------------------------------
acc 33 397 40 | -70 3 1 | -745 391 120 | 47 2 1 | -48 2 1 |
dat 33 588 42 | 643 271 88 | 483 153 50 | 97 6 2 | 489 157 66 |
gen 33 596 40 | -15 0 0 | 153 16 5 | -870 522 186 | -290 58 23 |
ins 33 869 48 | -362 77 28 | 682 271 100 | 924 499 210 | -195 22 11 |
loc 33 326 35 | -117 11 3 | 366 106 29 | -504 201 62 | -103 8 3 |
nom 33 633 44 | -78 4 1 | -938 556 190 | 306 59 23 | 147 14 6 |
hum 67 5 23 | 0 0 0 | 29 2 0 | -34 3 1 | 5 0 0 |
ina 67 5 23 | -1 0 0 | -26 2 0 | 34 3 1 | 6 0 0 |
nhu 67 0 22 | 1 0 0 | -3 0 0 | 0 0 0 | -11 0 0 |
fem 67 768 33 | 20 1 0 | -26 1 0 | 85 12 4 | -669 754 247 |
mas 67 245 28 | -9 0 0 | 56 6 1 | -97 19 5 | 332 220 61 |
neu 67 232 28 | -11 0 0 | -29 2 0 | 12 0 0 | 337 229 63 |
plu 100 873 30 | -546 823 189 | 43 5 1 | -68 13 3 | 108 32 10 |
sin 100 873 30 | 546 823 189 | -43 5 1 | 68 13 3 | -108 32 10 |
ci 2 76 36 | -644 18 5 | -841 30 8 | 128 1 0 | 804 28 10 |
ta 6 276 40 | 497 29 9 | -1046 127 39 | 545 35 12 | -856 85 34 |
ta* 6 445 40 | 207 5 2 | 635 47 15 | 1278 190 67 | -1321 203 80 |
te 30 651 44 | -630 225 75 | -839 399 135 | 148 12 5 | 156 14 6 |
te* 6 265 40 | 505 30 9 | -845 83 26 | 237 7 2 | -1121 146 58 |
tego 15 249 42 | 516 79 25 | -91 2 1 | -751 167 62 | -6 0 0 |
tej 17 588 42 | 749 187 60 | 274 25 8 | -324 35 13 | -1012 341 142 |
temu 11 559 44 | 1200 306 102 | 469 47 16 | 145 4 2 | 973 201 87 |
ten 7 208 39 | 469 34 10 | -917 132 40 | 262 11 4 | 440 30 12 |
to 11 308 41 | 468 50 15 | -948 204 65 | 305 21 8 | 378 32 13 |
tych 35 812 44 | -624 263 87 | 264 47 16 | -858 498 191 | -86 5 2 |
tym 39 481 35 | 214 43 11 | 527 256 70 | 174 28 9 | 408 154 54 |
tymi 17 726 47 | -924 251 91 | 752 166 61 | 1016 304 127 | -119 4 2 |


Output by “Stat-3”

Legend (shown for factor 1):
F#1: coordinate of object J on factor 1
COR: contribution of factor 1 to the description of object J
CTR: contribution of object J to the definition of factor 1

Note that NUMBER (singular/plural) has the highest contribution to axis 1.

Note that the quality of the description of the attribute “animacy” is very poor: these elements make no contribution to the first 4 factors.
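The columns of the output can be related to each other with the usual conventions of correspondence analysis. The formulas below are the standard definitions and are assumed, not confirmed by the slides, to be those used by Stat-3, with all values printed in thousandths. The symbols are introduced here for illustration: f_j is the mass of element j (column FREQ), F_α(j) its coordinate on axis α, λ_α the inertia of axis α, and d²(j) the squared chi-square distance of j to the centre of gravity.

```latex
\begin{align}
\mathrm{CTR}_\alpha(j) &= \frac{f_j \, F_\alpha(j)^2}{\lambda_\alpha},
  & \lambda_\alpha &= \sum_j f_j \, F_\alpha(j)^2, \\
\mathrm{COR}_\alpha(j) &= \frac{F_\alpha(j)^2}{d^2(j)},
  & \mathrm{QLT}(j) &= \sum_{\alpha=1}^{4} \mathrm{COR}_\alpha(j), \\
\mathrm{INR}(j) &= \frac{f_j \, d^2(j)}{\sum_\alpha \lambda_\alpha}.
\end{align}
```

As a check against the table, the four COR values of plu (823, 5, 13, 32) add up to its QLT of 873, while the three animacy values have QLT of 5, 5 and 0, which is what "very poor quality of description" means for the first four factors.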

Page 18

TENTATO-2: CFA representation in plane [1,2]

Output by “Stat-3”

[Figure: CFA projection on plane [1,2] (Axis 1, Axis 2)]

Axis 1 separates NUMBER => singular vs plural

Axis 2 separates syntactic relators (CASE) => {nom,acc} vs {gen,loc,dat,ins}

ANIMACY & GENDER are not differentiated on axes 1 and 2

Morphemes are widely spread over plane [1,2]

Page 19

TENTATO-2: Axis 1 separates quantifiers

Output by “Stat-3”

[Figure: CFA projection on plane [1,2] (Axis 1, Axis 2), with plural and singular opposed along Axis 1]

Morphemes strictly associated with singular:
=> ta, to, ten, te*, tego, tej, temu, ta*

Morphemes strictly associated with plural:
=> ci, te, tych, tymi

One exception: tym may be either singular or plural

Page 20

TENTATO-2: Axis 2 separates syntactic relators

Output by “Stat-3”

[Figure: CFA projection on plane [1,2] (Axis 1, Axis 2), showing the CASE values ins, nom, acc, loc, gen, dat]

Morphemes strictly associated with genitive, locative, dative and/or instrumental:
=> tej, tych, temu, tym, ta*, tymi

Morphemes strictly associated with nominative and/or accusative:
=> ta, to, ten, te*, ci, te

One exception: tego may be either accusative or genitive

Page 21

TENTATO-2: Axis 3 separates {gen, loc} vs {ins}

Output by “Stat-3”

[Figure: CFA projection on plane [1,3] (Axis 1, Axis 3), showing the CASE values ins, acc, loc, gen, dat, nom]

Morphemes tymi, ta* strictly associated with instrumental

Morphemes tych, tego, tej strictly associated with genitive or locative

One exception: tym may be either instrumental or locative

Page 22

TENTATO-2: Axis 4 separates gender {fem} vs {mas, neu}

Output by “Stat-3”

[Figure: CFA projection on plane [1,4] (Axis 1, Axis 4), showing the GENDER values fem, mas, neu]

Morphemes ta*, te*, tej, ta strictly associated with feminine

Morphemes tego, to, ten, temu, ci strictly associated with masculine or neuter

One exception: tym may be associated with any gender

Note that the attribute [ANIMACY] = {human, nhuman, inanimate} is still not differentiated on axis 4.

Page 23

TENTATO-2: Animacy appears only on axis 9!

Output by “Stat-3”

[Figure: CFA projection on plane [1,9] (Axis 1, Axis 9), showing the ANIMACY values hum, nhu, ina]

Morpheme ci strictly associated with human

Page 24

TENTATO-2: CFA and “Minimal Rules” (RST)

Axis (% inertia)   Attribute (weight in minimal rules)   Opposition
Axis 1 (13.05%)    NUMBER (34/34 rules)                  singular vs plural
Axis 2 (12.81%)    CASE (34/34 rules)                    {nom, acc} vs {gen, loc, dat, inst}
Axis 3 (11.27%)    CASE (34/34 rules)                    {gen, loc}, (dat) vs {inst}
Axis 4 (10.0%)     GENDER (26/34 rules)                  feminine vs masculine
...
Axis 9 (4.35%)     ANIMACY (9/34 rules)                  human vs {nhum, ina}

The relative strength of the attributes is revealed both by their contribution to the axes of inertia in Factor Analysis and by their weight in Minimal Rules.

Page 25

Case 2:

the category of Aspect in Polish

Page 26

A Database built with “Dynamic DB-Builder”

A classical data sheet to fill for each specimen…

Attributes and values are chosen in a list…

… and the resulting AVs appear in a field

the grammatical form of each specimen is used as index

Page 27

A test of consistency

Each specimen is characterized by a set of AV and by its grammatical form (used as index). It may be written as a rule:

if {given set of AV} then index

This allows index inconsistencies to be detected (a test of consistency is provided in Semana)
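Such a test can be sketched as follows, assuming that an inconsistency is a set of AV that leads to more than one index. The attribute codes VAL and MOD appear later in the slides, but the rows and the index labels used here are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def find_inconsistencies(specimens):
    """Return the AV sets (rule premises) that are mapped to more than one index."""
    by_premise = defaultdict(set)
    for av_pairs, index in specimens:
        # A frozenset makes the premise hashable and independent of attribute order.
        by_premise[frozenset(av_pairs.items())].add(index)
    return {premise: forms for premise, forms in by_premise.items()
            if len(forms) > 1}


# Hypothetical specimens: (set of AV, grammatical form used as index).
specimens = [
    ({"VAL": "perfective", "MOD": "stop"}, "form_1"),
    ({"VAL": "perfective", "MOD": "stop"}, "form_2"),
    ({"VAL": "imperfective", "MOD": "keep"}, "form_3"),
]

conflicts = find_inconsistencies(specimens)
for premise, forms in conflicts.items():
    print(sorted(premise), "->", sorted(forms))
```

A non-empty result does not say which specimen is wrong; it only signals that the current attributes cannot distinguish the forms involved.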

the grammatical form of each specimen is used as index

Page 28

A test of consistency

9 different forms applying to exactly the same situation?

This is a warning to the expert: the AV probably do not describe the different aspectual situations properly!

Page 29

Polish Aspect using Dynamic DB Builder

All specimens are automatically collected in a contingency table… and statistics are reported.

In this initial version, there were more than 2 million theoretical combinations and 9 pairs of attributes could be merged!

Page 30: Symbolic and statistical Analyses of meta-data using the Semana platform a bundle of tools for the KDD research Georges Sauvet (CNRS, Toulouse) Centre

Polish Aspect using Dynamic DB BuilderPolish Aspect using Dynamic DB Builder

DB version     Distinct objects   Number of attributes   Number of theoretical combinations   Number of “merging attributes”
HW-Aspect-V1   61                 12                     2,064,384                            9
HW-Aspect-V2   60                 11                     1,032,192                            9
HW-Aspect-V3   77                 11                     829,000                              6
HW-Aspect-V4   79                 9                      408,240                              1
HW-Aspect-V5   79                 8                      136,080                              1
HW-Aspect-V6   69                 8                      45,360                               1
HW-Aspect-V7   74                 8                      61,440                               0
HW-Aspect-V8   78                 7                      58,320                               0

Improvements by “trial and error”

Page 31:

From Dynamic DB Builder to STAT-3

The multi-valued table is transformed into a one-valued table for STAT analyses
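The transformation can be illustrated with a short Python sketch. It assumes that "one-valued" means a binary table with one column per attribute-value pair (complete disjunctive coding); the slides do not spell out the exact format expected by STAT-3. The function name and the two specimens are hypothetical.

```python
def to_one_valued(table):
    """Turn a multi-valued table into a binary (0/1) table.

    table: dict mapping each object to a dict {attribute: set of values}.
    Returns the list of column labels ("ATTR=value") and, for each object,
    a list of 0/1 flags in the same column order.
    """
    # Collect every attribute-value pair that occurs anywhere in the table.
    pairs = sorted(
        {(attr, val) for row in table.values() for attr, vals in row.items() for val in vals}
    )
    binary = {
        obj: [1 if val in row.get(attr, ()) else 0 for attr, val in pairs]
        for obj, row in table.items()
    }
    columns = [f"{attr}={val}" for attr, val in pairs]
    return columns, binary


# Hypothetical specimens; s1 carries two values for the attribute ITS.
table = {
    "s1": {"VAL": {"perfective"}, "ITS": {"strong", "incr"}},
    "s2": {"VAL": {"imperfective"}, "ITS": {"weak"}},
}

columns, binary = to_one_valued(table)
print(columns)
for obj, flags in binary.items():
    print(obj, flags)
```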

Page 32:

Polish Aspect: Correspondence Factor Analysis

Factor Analysis of the contingency table shows a clear Guttman effect (i.e. a sequential order of the attributes)

[Figure: factor plane, axis 1 vs. axis 2]
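The sequential order behind a Guttman effect can be illustrated with reciprocal averaging, a classical iterative way of obtaining the first axis of a correspondence analysis. This is a swapped-in technique chosen because it fits in a few lines of plain Python; how STAT-3 actually computes its factors is not described in the slides. The contingency table and the row labels are hypothetical.

```python
def reciprocal_averaging(table, n_iter=100):
    """Scores of rows and columns on the first correspondence-analysis axis.

    table: list of rows of non-negative counts (no empty row or column).
    Row scores are weighted averages of column scores and vice versa; the
    column scores are centred and rescaled at each step so that the trivial
    constant solution is removed.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(row[j] for row in table) for j in range(n_cols)]
    grand = sum(row_tot)
    col_scores = [float(j) for j in range(n_cols)]  # arbitrary non-constant start
    for _ in range(n_iter):
        row_scores = [
            sum(n * s for n, s in zip(row, col_scores)) / row_tot[i]
            for i, row in enumerate(table)
        ]
        col_scores = [
            sum(table[i][j] * row_scores[i] for i in range(n_rows)) / col_tot[j]
            for j in range(n_cols)
        ]
        mean = sum(w * s for w, s in zip(col_tot, col_scores)) / grand
        col_scores = [s - mean for s in col_scores]
        scale = (sum(w * s * s for w, s in zip(col_tot, col_scores)) / grand) ** 0.5
        col_scores = [s / scale for s in col_scores]
    return row_scores, col_scores


# Hypothetical band-structured table whose rows have been shuffled.
labels = ["r_c", "r_a", "r_d", "r_b"]
table = [
    [0, 3, 5, 3],
    [5, 3, 0, 0],
    [0, 0, 3, 5],
    [3, 5, 3, 0],
]

row_scores, col_scores = reciprocal_averaging(table)
ordered = [label for _, label in sorted(zip(row_scores, labels))]
print(ordered)
```

Sorting the rows by their score on the first axis recovers the underlying sequence, which is the kind of ordering a Guttman effect reveals.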

Page 33:

Polish Aspect: Correspondence Factor Analysis

Ascending Hierarchical Classification shows two well-defined classes
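As an illustration of ascending (agglomerative) hierarchical classification, the following Python sketch merges the two closest clusters repeatedly until the requested number of classes remains. Average linkage on Euclidean distance is an assumption made for brevity; the linkage criterion used in Semana is not stated in the slides. The points stand for hypothetical factor coordinates.

```python
from itertools import combinations
from math import dist


def ascending_classification(points, n_clusters):
    """Naive agglomerative clustering with average linkage.

    points: list of coordinate tuples; returns a list of clusters, each a
    list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def average_link(a, b):
        # Mean distance between all pairs of points taken from a and b.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


# Hypothetical coordinates on the first two factor axes.
points = [(-1.2, 0.1), (-1.0, -0.2), (-0.9, 0.3), (1.1, 0.0), (0.8, 0.2), (1.3, -0.1)]

classes = ascending_classification(points, n_clusters=2)
print(classes)
```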

Page 34:

Polish Aspect: Correspondence Factor Analysis

A clear partition into two classes according to the attribute [VAL] = {perfective | imperfective}

[Figure: the two classes, labelled perfective and imperfective]

Page 35:

Polish Aspect: Correspondence Factor Analysis

The Guttman effect shows that attributes are sequentially ordered

attribute MCMP (morph. comp.): pip > ip > pp > pi > ii

attribute MOD: parallel > sequential > trans > resume > stop > interrupt > keep > OffAndOn

Page 36:

Polish Aspect: Correspondence Factor Analysis

VAL    perfective to imperfective
MCMP   pip (0), ip (0), pp (0), pi (100), ii (100)
CRE    defnb (0), nRe (30), ndefnb (89)
MOD    par (0), seq (0), trans (0), resume (0), stop (35), inter (0), keep (60), OaO (100)
ANA    after (0), finish (0), enter (0), start (0), end (33), before (44), nan (69), begin (40), run (84)
ITS    decr (0), incr (0), strong (28), weak (54)
TYP    ordPr (29), event (17), state (75), refPr (67)

Distribution of features along the perfective-to-imperfective path (% association with imperfective, given in parentheses)

All features with 0% association with imperfective necessarily require the perfective
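A percentage of this kind can be computed from the coded specimens as sketched below in Python: for each value of an attribute, count how often it co-occurs with VAL = imperfective. The function name and the specimens are hypothetical and do not reproduce the figures of the Polish Aspect database.

```python
from collections import Counter


def percent_imperfective(specimens, attribute):
    """Percentage of specimens with VAL = imperfective, per value of `attribute`.

    specimens: list of dicts mapping attribute names to a single value.
    """
    total, imperfective = Counter(), Counter()
    for specimen in specimens:
        value = specimen[attribute]
        total[value] += 1
        if specimen["VAL"] == "imperfective":
            imperfective[value] += 1
    return {value: round(100 * imperfective[value] / total[value]) for value in total}


# Hypothetical specimens described by the attributes MOD and VAL.
specimens = [
    {"MOD": "par", "VAL": "perfective"},
    {"MOD": "par", "VAL": "perfective"},
    {"MOD": "stop", "VAL": "perfective"},
    {"MOD": "stop", "VAL": "perfective"},
    {"MOD": "stop", "VAL": "perfective"},
    {"MOD": "stop", "VAL": "imperfective"},
    {"MOD": "OaO", "VAL": "imperfective"},
    {"MOD": "OaO", "VAL": "imperfective"},
]

association = percent_imperfective(specimens, "MOD")
print(association)
```

A value with 0% never occurs with the imperfective in the data, and a value with 100% never occurs with the perfective.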

Page 37:

Case 3: Images of the Woman in Palaeolithic Art

Page 38:

Images of the Woman in Palaeolithic Art

Customized DB-builder: for each figure, AV are selected with ‘check box’ buttons

Raphaëlle Bourrillon, PhD, Univ. Toulouse-Le Mirail

Page 39:

Images of the Woman in Palaeolithic Art

CFA and HAC show three classes of representations:

- Realistic and corpulent
- Realistic and slim
- Schematic / abstract

Page 40:

Detailed study of the schematic representations of women

CFA and HAC split the schematic feminine figures into five sub-classes

Schematic / abstract

Page 41:

Detailed study of the schematic representations of women

Formal concept analysis
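For readers unfamiliar with formal concept analysis, the Python sketch below enumerates the formal concepts of a tiny binary context by brute force: a concept is a pair (set of objects, set of attributes) in which each set is exactly what the other one has in common. The enumeration strategy is chosen for readability and is not Semana's FCA procedure; the figures and their features are hypothetical.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts of a small binary context.

    context: dict mapping each object to the set of attributes it has.
    Returns a set of (extent, intent) pairs of frozensets.
    """
    objects = list(context)
    attributes = set().union(*context.values())

    def intent(objs):
        # Attributes shared by all objects in objs (all attributes if objs is empty).
        sets = [context[o] for o in objs]
        return set(attributes) if not sets else set.intersection(*sets)

    def extent(attrs):
        # Objects that possess every attribute in attrs.
        return {o for o in objects if attrs <= context[o]}

    concepts = set()
    for k in range(len(objects) + 1):
        for objs in combinations(objects, k):
            shared = intent(objs)
            concepts.add((frozenset(extent(shared)), frozenset(shared)))
    return concepts


# Hypothetical context: three figures described by the features they display.
context = {
    "fig1": {"head", "breasts"},
    "fig2": {"breasts", "buttocks"},
    "fig3": {"buttocks"},
}

concepts = formal_concepts(context)
for ext, itt in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(ext), sorted(itt))
```

The subset enumeration is exponential in the number of objects, so it only suits toy examples such as this one.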

Page 42:

SEMANA: a bundle of tools for KDD research, at hand in a single box

with applications in many domains (within and outside linguistics!)

FROM PREPROCESSING …

Building / editing DB

- Structuring of AV
- Statistics
- AV editing (merging, splitting, etc.)
- Editing / conversion of tables in various formats

… TO MINING

Complementary KDD procedures (RST, FCA ...)

… with special emphasis on the powerful tools of statistical data analysis (CFA, HAC)