Correspondence Analysis

Hervé Abdi¹ and Michel Béra²
¹School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, TX, USA
²Centre d'Étude et de Recherche en Informatique et Communications, Conservatoire National des Arts et Métiers, Paris, France

Synonyms

Dual scaling; Hayashi's quantification theory type III; Homogeneity analysis; Method of simultaneous linear regressions; Optimal scaling; Reciprocal averaging

Glossary

CA: Correspondence analysis
Component: A linear combination of the variables of a data table; also called dimension or factor
Dimension: See component
Factor: See component
GSVD: Generalized singular value decomposition
MCA: Multiple correspondence analysis
PCA: Principal component analysis
SVD: Singular value decomposition

Definition

Correspondence analysis (CA) is a generalization of principal component analysis tailored to handle nominal variables. CA is traditionally used to analyze contingency tables, but it is also often used with data matrices that comprise only nonnegative entries. CA decomposes the chi-square statistic associated with the data table into two sets of orthogonal components that describe, respectively, the pattern of associations between the elements of the rows and between the elements of the columns of the data table. When the data table is a set of observations described by a set of nominal variables, CA becomes multiple correspondence analysis (MCA).

Introduction

Correspondence analysis (CA; Cordier 1965; Benzécri 1973; Beh and Lombardo 2015; Lebart and Fénelon 1975; Lebart et al. 1984; Escofier and Pagès 1990; Greenacre 1984, 2007; Abdi and Valentin 2007; Hwang et al. 2010; Abdi 2003; Abdi and Williams 2010b) is an extension of principal component analysis (PCA; for details, see Abdi and Williams 2010a) tailored to handle nominal variables. Originally, CA was developed to analyze contingency tables in which a sample of observations is described by two nominal variables, but it was rapidly extended to the analysis of any data matrix with nonnegative entries.

© Springer Science+Business Media LLC 2017. R. Alhajj, J. Rokne (eds.), Encyclopedia of Social Network Analysis and Mining, https://doi.org/10.1007/978-1-4614-7163-9_140-2


The origin of CA can be traced to the early work of Pearson (1901) or Fisher, but the modern version of correspondence analysis and its geometric interpretation came from the 1960s in France and is associated with the French school of "data analysis" (analyse des données); it was developed under the leadership of Jean-Paul Benzécri. As a technique, it was often discovered (and rediscovered), and so variations of CA can be found under several different names such as "dual scaling," "optimal scaling," "homogeneity analysis," or "reciprocal averaging." The multiple identities of correspondence analysis are a consequence of its large number of properties: correspondence analysis can be defined as an optimal solution for many apparently different problems.

Key Points

CA transforms a data table into two sets of new variables called factor scores (obtained as linear combinations of, respectively, the rows and columns): one set for the rows and one set for the columns. These factor scores give the best representation of the similarity structure of, respectively, the rows and the columns of the table. In addition, the factor scores can be plotted as maps that optimally display the information in the original table. In these maps, rows and columns are represented as points whose coordinates are the factor scores and where the dimensions are also called factors, components (by analogy with PCA), or simply dimensions. Interestingly, the factor scores of the rows and the columns have the same variance and, therefore, the rows and columns can be conveniently represented in one single map.

In correspondence analysis, the total variance (often called inertia) of the factor scores is proportional to the independence chi-square statistic of this table and, therefore, the factor scores in CA decompose this χ² into orthogonal components.

Historical Background

Correspondence analysis can be traced back to the work of Karl Pearson (1901, 1906) and seems to have been independently discovered and rediscovered many times by different authors, in part because CA is the solution to different optimization problems (for more details and references, see the historical note of Lebart and Saporta 2014). For example (among others), Haley in 1935, Fisher in 1940, Maung in 1941, Guttman in 1941, Burt in 1950, Hayashi in 1950, and Kendall and Stuart in 1961 all proposed methods that were essentially equivalent to CA. However, the modern approach of CA was first exposed in the 1965 Ph.D. dissertation of Brigitte Cordier (later better known as Brigitte Escofier; part of her dissertation was also published as Escofier 1969), written under the direction of Jean-Paul Benzécri. Since then, CA has been developed into multiple different but related methods and has become a very popular method, particularly adapted to very large data sets such as those found in Big Data problems.

Correspondence Analysis: Theory and Practice

Notations

Matrices are denoted by uppercase bold letters, vectors are denoted by lowercase bold letters, and their elements are denoted by lowercase italic letters. Matrices, vectors, and elements from the same matrix all use the same letter (e.g., A, a, a). The transpose operation is denoted by ⊤; the inverse operation is denoted by ⁻¹. The identity matrix is denoted I, vectors or matrices of ones are denoted 1, and matrices or vectors of zeros are denoted 0. When provided with a square matrix, the diag operator gives a vector with the diagonal elements of this matrix. When provided with a vector, the diag operator gives a diagonal matrix with the elements of the vector as the diagonal elements of this matrix. When provided with a square matrix, the trace operator gives the sum of the diagonal elements of this matrix.


The data table to be analyzed by CA is a contingency table (or at least a data table with nonnegative entries) with I rows and J columns. It is represented by the I × J matrix X, whose generic element x_{i,j} gives the number of observations that belong to the ith level of the first nominal variable (i.e., the rows) and the jth level of the second nominal variable (i.e., the columns). The grand total of the table is denoted N.

Computations

The first step of the analysis is to transform the data matrix into a probability matrix (i.e., a matrix comprising nonnegative numbers whose sum is equal to 1) denoted Z and computed as Z = N⁻¹X. We denote by r the vector of the row totals of Z (i.e., r = Z1, with 1 being a conformable vector of ones), by c the vector of the column totals (i.e., c = Z⊤1), and D_c = diag{c}, D_r = diag{r}. The factor scores are obtained from the following generalized singular value decomposition (GSVD; for details on the singular value decomposition, see Abdi 1988, 2007a, 2007b; Good 1969; Takane 2002; Hotelling 1933; Eckart and Young 1936; and Stewart 1993):

(Z − rc⊤) = PDQ⊤ with P⊤D_r⁻¹P = Q⊤D_c⁻¹Q = I. (1)

Note that the subtraction of the matrix rc⊤ from Z is equivalent to a double centering of matrix Z (Abdi 2007c, 2007e; Abdi et al. 1997). The matrix P (respectively, Q) contains the left (respectively, right) generalized singular vectors of (Z − rc⊤), and the diagonal elements of the diagonal matrix D give its singular values. The squared singular values, which are called eigenvalues, are denoted λ_ℓ and stored in the diagonal matrix Λ. Eigenvalues express the variance extracted by the corresponding factor, and their sum is called the total inertia (denoted I) of the data matrix. With the so-called triplet notation (Escoufier 2007), which is sometimes used as a general framework to formalize multivariate techniques, CA is equivalent to the analysis of the triplet ((Z − rc⊤), D_c⁻¹, D_r⁻¹).

From the GSVD, the row and (respectively) column factor scores are obtained as

F = D_r⁻¹PD and G = D_c⁻¹QD. (2)

Note that the factor scores of a given set (i.e., the rows or the columns) are pairwise orthogonal when they describe different dimensions and that the variance of the factor scores for a given dimension is equal to the eigenvalue associated with this dimension. So, for example, the variance of the row factor scores is computed as

F⊤D_rF = DP⊤D_r⁻¹D_rD_r⁻¹PD = DP⊤D_r⁻¹PD = D² = Λ. (3)
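As an illustrative sketch (numpy and all variable names here are our own, not the implementation used for the example below), Equations 1 to 3 can be computed with a plain SVD: the GSVD with metrics D_r⁻¹ and D_c⁻¹ is obtained from the ordinary SVD of the rescaled matrix D_r^(−1/2)(Z − rc⊤)D_c^(−1/2). The fragment below applies this to the data of Table 1:

```python
import numpy as np

# Data from Table 1: 10 colors (rows) x 9 pieces of music (columns).
X = np.array([
    [4, 2, 4, 4, 1, 2, 2, 4, 1],    # red
    [3, 4, 2, 2, 1, 1, 0, 3, 2],    # orange
    [6, 4, 5, 2, 3, 1, 1, 3, 0],    # yellow
    [2, 0, 5, 1, 3, 3, 3, 1, 5],    # green
    [2, 5, 0, 1, 4, 1, 2, 1, 3],    # blue
    [3, 3, 1, 0, 0, 3, 0, 2, 1],    # purple
    [0, 0, 0, 0, 1, 4, 1, 5, 3],    # white
    [0, 2, 0, 11, 1, 3, 10, 1, 1],  # black
    [2, 1, 1, 0, 2, 4, 0, 2, 0],    # pink
    [0, 1, 4, 1, 6, 0, 3, 0, 6],    # brown
], dtype=float)

N = X.sum()                      # grand total (198)
Z = X / N                        # probability matrix
r = Z.sum(axis=1)                # row masses, r = Z1
c = Z.sum(axis=0)                # column masses, c = Z'1
S = Z - np.outer(r, c)           # double centering, Z - rc'

# GSVD via plain SVD of the rescaled matrix D_r^{-1/2} S D_c^{-1/2}
U, d, Vt = np.linalg.svd(S / np.sqrt(np.outer(r, c)), full_matrices=False)
P = np.sqrt(r)[:, None] * U      # generalized left singular vectors
Q = np.sqrt(c)[:, None] * Vt.T   # generalized right singular vectors

# Equation 2: F = D_r^{-1} P D and G = D_c^{-1} Q D
F = P / r[:, None] * d
G = Q / c[:, None] * d

# Equation 3: variance of the row factor scores = eigenvalues
print(np.allclose(F.T @ np.diag(r) @ F, np.diag(d ** 2)))  # True
print(np.round(d[:2] ** 2, 3))   # first two eigenvalues (cf. the example below)
```

The rescaling trick is standard: any SVD routine can thus serve to compute a GSVD with diagonal metric matrices.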

What Does Correspondence Analysis Optimize?

In CA, the criterion that is maximized is the variance of the factor scores (see Lebart et al. 1984; Greenacre 1984). For example, the first row factor f₁ is obtained as a linear combination of the columns of the matrix (Z − rc⊤), taking into account the constraints imposed by the matrices D_r⁻¹ and D_c⁻¹. Specifically, this means that we are searching for the vector q₁, containing the weights of the linear combination, such that f₁ is obtained as

f₁ = D_r⁻¹(Z − rc⊤)D_c⁻¹q₁, (4)

such that

f₁ = arg max_f f⊤D_rf, (5)

under the constraint that

q₁⊤D_c⁻¹q₁ = 1. (6)

The subsequent row factor scores maximize the residual variance under the orthogonality constraint imposed by the matrix D_r (i.e., f₂⊤D_rf₁ = 0).

How to Identify the Elements Important for a Factor

In CA, the rows and the columns of the table have a similar role (and variance), and therefore we can use the same statistics to identify the rows and the columns important for a given dimension.


Because the variance extracted by a factor (i.e., its eigenvalue) is obtained as the weighted sum of the squared factor scores for this factor of either the rows or the columns of the table, the importance of a row (respectively, a column) is reflected by the ratio of its squared factor score to the eigenvalue of this factor. This ratio is called the contribution of the row (respectively, column) to the factor. Specifically, the contributions of row i to component ℓ and of column j to component ℓ are obtained, respectively, as

ctr_{i,ℓ} = r_i f²_{i,ℓ} / λ_ℓ and ctr_{j,ℓ} = c_j g²_{j,ℓ} / λ_ℓ (7)

(with r_i being the ith element of r and c_j being the jth element of c). Contributions take values between 0 and 1, and their sums for a given factor are equal to 1 for either the rows or the columns. A convenient rule of thumb is to consider that contributions larger than the average (i.e., 1/I for the rows and 1/J for the columns) are important for a given factor.
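Equation 7 can be sketched numerically as follows (a hedged illustration: the small toy table and all names are hypothetical, not the example analyzed later):

```python
import numpy as np

# Hypothetical toy table (3 rows x 4 columns); any nonnegative matrix works.
X = np.array([[4., 2., 4., 1.],
              [2., 5., 0., 4.],
              [0., 2., 0., 11.]])
Z = X / X.sum()
r, c = Z.sum(axis=1), Z.sum(axis=0)
U, d, Vt = np.linalg.svd((Z - np.outer(r, c)) / np.sqrt(np.outer(r, c)),
                         full_matrices=False)
keep = d > 1e-12                              # keep only nonnull dimensions
d = d[keep]
F = U[:, keep] / np.sqrt(r)[:, None] * d      # row factor scores
G = Vt.T[:, keep] / np.sqrt(c)[:, None] * d   # column factor scores

# Equation 7: ctr_{i,l} = r_i f_{i,l}^2 / lambda_l (with lambda_l = d_l^2)
ctr_row = r[:, None] * F ** 2 / d ** 2
ctr_col = c[:, None] * G ** 2 / d ** 2

# Per factor, contributions sum to 1; values above 1/I (rows) or
# 1/J (columns) flag the important elements.
print(ctr_row.sum(axis=0), ctr_col.sum(axis=0))
```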

How to Identify the Important Factors for an Element

The factors important for a given row or column are identified by computing statistics called squared cosines. These statistics are obtained by decomposing the squared distance of an element along the factors of the analysis. Specifically, the vectors of the squared (χ²) distances from the rows and the columns to their respective barycenter (i.e., average or center of gravity) are obtained as

d_r = diag{FF⊤} and d_c = diag{GG⊤}. (8)

Recall that the total inertia (I) in CA is equal to the sum of the eigenvalues and that, in CA, this inertia can also be computed as the weighted sum of the squared distances of the rows or the columns to their respective barycenter. Formally, the inertia can be computed as

I = Σ_{ℓ}^{L} λ_ℓ = r⊤d_r = c⊤d_c. (9)

The squared cosines between row i and component ℓ and between column j and component ℓ are obtained, respectively, as

cos²_{i,ℓ} = f²_{i,ℓ} / d²_{r,i} and cos²_{j,ℓ} = g²_{j,ℓ} / d²_{c,j} (10)

(with d²_{r,i} and d²_{c,j} being, respectively, the ith element of d_r and the jth element of d_c). The sum of the squared cosines over the dimensions for a given element is equal to 1, and so the squared cosine can be seen as the proportion of the variance of an element that can be attributed to a given dimension.
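Equations 8 to 10 can be checked numerically; this sketch (hypothetical toy data and illustrative names) verifies that the inertia equals both weighted sums of squared distances and that the squared cosines sum to 1:

```python
import numpy as np

# Hypothetical toy table; recompute the factor scores as before.
X = np.array([[4., 2., 4., 1.],
              [2., 5., 0., 4.],
              [0., 2., 0., 11.]])
Z = X / X.sum()
r, c = Z.sum(axis=1), Z.sum(axis=0)
U, d, Vt = np.linalg.svd((Z - np.outer(r, c)) / np.sqrt(np.outer(r, c)),
                         full_matrices=False)
F = U / np.sqrt(r)[:, None] * d      # row factor scores
G = Vt.T / np.sqrt(c)[:, None] * d   # column factor scores

# Equation 8: squared chi-square distances to the barycenters
d2_row = (F ** 2).sum(axis=1)        # diag{FF'}
d2_col = (G ** 2).sum(axis=1)        # diag{GG'}

# Equation 9: inertia = sum of eigenvalues = r'd_r = c'd_c
print(np.isclose((d ** 2).sum(), r @ d2_row),
      np.isclose((d ** 2).sum(), c @ d2_col))

# Equation 10: squared cosines sum to 1 over the dimensions
cos2_row = F ** 2 / d2_row[:, None]
print(cos2_row.sum(axis=1))
```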

Correspondence Analysis and the Chi-Square Test

CA is intimately related to the independence χ² test. Recall that the (null) hypothesis stating that the rows and the columns of a contingency table X are independent can be tested by computing the following χ² criterion:

χ² = N Σ_{i}^{I} Σ_{j}^{J} (z_{i,j} − r_i c_j)² / (r_i c_j) = Nφ², (11)

where φ² is called the mean square contingency coefficient (for a 2 × 2 table, it is called the squared coefficient of correlation associated with the χ² test; in this special case it takes values between 0 and 1). For a contingency table, under the null hypothesis, the χ² criterion follows a χ² distribution with (I − 1)(J − 1) degrees of freedom and can, therefore, be used (under the usual assumptions) to test the independence hypothesis.

The total inertia of the matrix (Z − rc⊤) (under the constraints imposed on the SVD by the matrices D_r⁻¹ and D_c⁻¹) can be computed as the sum of the eigenvalues of the analysis or directly from the data matrix as


I = trace{D_c^(−1/2)(Z − rc⊤)⊤D_r⁻¹(Z − rc⊤)D_c^(−1/2)} = Σ_{i}^{I} Σ_{j}^{J} (z_{i,j} − r_i c_j)² / (r_i c_j) = φ². (12)

This shows that the total inertia is proportional to the independence χ² (specifically, I = N⁻¹χ²) and therefore that the factors of CA perform an orthogonal decomposition of the independence χ², where each factor "explains" a portion of the deviation from independence.
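The proportionality I = N⁻¹χ² is easy to verify directly. In this sketch (hypothetical toy counts, illustrative names), the independence χ² of Equation 11 is compared with the sum of the eigenvalues of Equation 12:

```python
import numpy as np

# Hypothetical toy contingency table; check I = N^{-1} chi^2.
X = np.array([[4., 2., 4., 1.],
              [2., 5., 0., 4.],
              [0., 2., 0., 11.]])
N = X.sum()
Z = X / N
r, c = Z.sum(axis=1), Z.sum(axis=0)
E = np.outer(r, c)                      # independence model rc'

# Independence chi-square (Equation 11): chi2 = N * phi2
phi2 = ((Z - E) ** 2 / E).sum()
chi2 = N * phi2

# Total inertia from the eigenvalues of the analysis (Equation 12)
d = np.linalg.svd((Z - E) / np.sqrt(E), compute_uv=False)
inertia = (d ** 2).sum()

print(np.isclose(inertia, chi2 / N))    # True
```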

The Transition Formula

In CA, rows and columns play a symmetric role and their factor scores have the same variance. As a consequence of this symmetry, the row factor scores (respectively, the column factor scores) can be derived from the column factor scores (respectively, the row factor scores). This can be seen by rewriting Equation 2, taking into account Equations 3 to 6. For example, the factor scores for the rows can be computed as

F = D_r⁻¹PD
  = D_r⁻¹(Z − rc⊤)D_c⁻¹Q
  = D_r⁻¹(Z − rc⊤)GD⁻¹
  = D_r⁻¹ZGD⁻¹ − D_r⁻¹rc⊤GD⁻¹. (13)

Because the matrix (Z − rc⊤) contains deviations to its row and column barycenters, its row and column sums are equal to 0; therefore the matrix D_r⁻¹rc⊤GD⁻¹ is equal to 0 and Equation 13 can be rewritten as

F = D_r⁻¹ZGD⁻¹. (14)

So, if we denote by R the matrix R = D_r⁻¹Z, in which each row (whose sum is 1) is called a row profile, and by C the matrix C = D_c⁻¹Z⊤, in which each row (whose sum is 1) is a column profile of Z, the transition formulas are

F = RGD⁻¹ and G = CFD⁻¹. (15)

This shows that the factor scores of an element of one set (e.g., a row) are computed as the barycenter of the expression of this element in the other set (e.g., the columns), followed by an expansion (as expressed by the D⁻¹ term) that is inversely proportional to the singular value of each factor.
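The transition formulas of Equation 15 can be verified numerically; in this hedged sketch (hypothetical toy table, illustrative names), each set of factor scores is recovered as the barycenter of the other set expanded by D⁻¹:

```python
import numpy as np

# Hypothetical toy table; compute both sets of factor scores.
X = np.array([[4., 2., 4., 1.],
              [2., 5., 0., 4.],
              [0., 2., 0., 11.]])
Z = X / X.sum()
r, c = Z.sum(axis=1), Z.sum(axis=0)
U, d, Vt = np.linalg.svd((Z - np.outer(r, c)) / np.sqrt(np.outer(r, c)),
                         full_matrices=False)
keep = d > 1e-12                  # D^{-1} only exists for nonnull dimensions
d = d[keep]
F = U[:, keep] / np.sqrt(r)[:, None] * d
G = Vt.T[:, keep] / np.sqrt(c)[:, None] * d

R = Z / r[:, None]                # row profiles, R = D_r^{-1} Z
C = Z.T / c[:, None]              # column profiles, C = D_c^{-1} Z'

# Equation 15: F = R G D^{-1} and G = C F D^{-1}
print(np.allclose(F, R @ G / d), np.allclose(G, C @ F / d))  # True True
```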

In CA, the Singular Values Are Never Larger than One

Note that, together, the two transition formulas from Equation 15 imply that the diagonal terms of D⁻¹ are equal to or larger than one (because otherwise the range of each set of factor scores would be smaller than the range of the other set which, in turn, would imply that all these factor scores are null) and, therefore, that the singular values in CA are always equal to or smaller than one.
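This bound is easy to probe empirically; the sketch below (an illustration with random hypothetical tables, not a proof) checks that no CA singular value exceeds 1:

```python
import numpy as np

# Draw random positive tables and verify that the largest singular value
# of the doubly centered, rescaled matrix never exceeds 1.
rng = np.random.default_rng(0)
for _ in range(100):
    X = rng.integers(1, 20, size=(6, 5)).astype(float)
    Z = X / X.sum()
    r, c = Z.sum(axis=1), Z.sum(axis=0)
    d = np.linalg.svd((Z - np.outer(r, c)) / np.sqrt(np.outer(r, c)),
                      compute_uv=False)
    assert d.max() <= 1 + 1e-10   # singular values never larger than one
print("ok")
```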

Distributional Equivalence

An important property of correspondence analysis is that it gives the same results when two rows (respectively, two columns) that are proportional are merged together. This property, called distributional equivalence (or also "distributional equivalency"), also applies approximately: the analysis changes only a little when two rows (or columns) that are almost proportional are merged together.

How to Interpret Point Proximity

In a CA map, when two row (respectively, column) points are close to each other, these points have similar profiles, and when two points have the same profile, they will be located exactly at the same place (this is a consequence of the distributional equivalence principle). The proximity between row and column points is more delicate to interpret because of the barycentric principle (see section "The Transition Formula"): the position of a row (respectively, column) point is determined from its barycenter on the columns (respectively, rows), and therefore, the proximity between a row point and one column point cannot be interpreted directly.


Asymmetric Plot: How to Interpret Row and Column Proximity

CA treats rows and columns symmetrically, and so their roles are equivalent. In some cases, however, rows and columns can play different roles, and this symmetry can be misleading. As an illustration, in the example used below (see section "Example: The Colors of Sounds"), the participants were asked to choose the color that would match a given piece of music. In this framework, the colors can be considered as a dependent variable and the pieces of music as an independent variable. In this case the roles are asymmetric, and the plots can reflect this asymmetry by normalizing one set such that the variance of its factor scores is equal to 1 for each factor. For example, the column factor scores normalized to 1, denoted G̃, would be computed as (compare with Equation 2)

G̃ = D_c⁻¹Q. (16)

In the asymmetric plot obtained with F and G̃, the distances between rows and columns can now be interpreted meaningfully: the distance from a row point to a column point reflects their association (and a row is positioned exactly at the barycenter of the columns).
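Equation 16 can be checked numerically: with G̃ = D_c⁻¹Q (equivalently GD⁻¹), the mass-weighted variance of the column scores is 1 on every factor. A hedged sketch with hypothetical toy data:

```python
import numpy as np

# Hypothetical toy table; compute the normalized column scores Gtilde.
X = np.array([[4., 2., 4., 1.],
              [2., 5., 0., 4.],
              [0., 2., 0., 11.]])
Z = X / X.sum()
r, c = Z.sum(axis=1), Z.sum(axis=0)
U, d, Vt = np.linalg.svd((Z - np.outer(r, c)) / np.sqrt(np.outer(r, c)),
                         full_matrices=False)
keep = d > 1e-12
Q = np.sqrt(c)[:, None] * Vt.T[:, keep]   # generalized right singular vectors
Gt = Q / c[:, None]                       # Equation 16: Gtilde = D_c^{-1} Q

# Each factor of Gtilde has mass-weighted variance 1
print(np.allclose(Gt.T @ np.diag(c) @ Gt, np.eye(int(keep.sum()))))  # True
```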

Centered Versus Noncentered Analysis

CA is obtained from the generalized singular value decomposition of the centered matrix (Z − rc⊤), but it could also be obtained from the same generalized singular value decomposition of the matrix Z. In this case, the first pair of singular vectors is equal to r and c, and their associated singular value is equal to 1. This property is easily verified from the following relations:

ZD_c⁻¹c = Z1 = r and Z⊤D_r⁻¹r = Z⊤1 = c. (17)

Because in CA the singular values are never larger than one, r and c, having a singular value of 1, are the first pair of singular vectors of Z. Therefore, the generalized singular value decomposition of Z can be developed as

Z = rc⊤ + (Z − rc⊤) = rc⊤ + PDQ⊤. (18)

This shows that the ℓth pair of singular vectors and singular value of (Z − rc⊤) are the (ℓ + 1)th pair of singular vectors and singular value of Z.

Supplementary Elements

Often in CA we want to know the position in the analysis of rows or columns that were not actually analyzed. These rows or columns are called illustrative, supplementary, or out-of-sample rows or columns (or supplementary observations or variables). In contrast with the supplementary elements (which are not used to compute the factors), the active elements are the elements used to compute the factors.

The projection formula uses the transition formula (see Equation 15) and is specific to correspondence analysis. Specifically, let i_sup⊤ be an illustrative row and j_sup be an illustrative column to be projected (note that in CA, prior to projection, an illustrative row or column is rescaled such that its sum is equal to 1). The coordinates of the illustrative rows (denoted f_sup) and columns (denoted g_sup) are obtained as

f_sup = (i_sup⊤1)⁻¹ i_sup⊤GD⁻¹ and g_sup = (j_sup⊤1)⁻¹ j_sup⊤FD⁻¹. (19)

(Note that the scalar terms (i_sup⊤1)⁻¹ and (j_sup⊤1)⁻¹ are used to ensure that the sum of the elements of i_sup or j_sup is equal to 1; if this is already the case, these terms are superfluous.)
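Equation 19 can be sketched as follows (hypothetical data and names; not the original implementation). A useful sanity check is that an active row re-projected as if it were supplementary lands exactly on its own factor scores:

```python
import numpy as np

# Hypothetical toy table; compute the active factor scores.
X = np.array([[4., 2., 4., 1.],
              [2., 5., 0., 4.],
              [0., 2., 0., 11.]])
Z = X / X.sum()
r, c = Z.sum(axis=1), Z.sum(axis=0)
U, d, Vt = np.linalg.svd((Z - np.outer(r, c)) / np.sqrt(np.outer(r, c)),
                         full_matrices=False)
keep = d > 1e-12
d = d[keep]
F = U[:, keep] / np.sqrt(r)[:, None] * d
G = Vt.T[:, keep] / np.sqrt(c)[:, None] * d

# Equation 19: rescale the supplementary row to a profile, then project
i_sup = np.array([3., 0., 2., 7.])        # a new (hypothetical) row of counts
f_sup = (i_sup / i_sup.sum()) @ G / d     # f_sup = (i'1)^{-1} i' G D^{-1}

# Sanity check: re-projecting active row 0 recovers its factor scores
print(np.allclose((X[0] / X[0].sum()) @ G / d, F[0]))  # True
```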

Programs and Packages

CA is implemented in most statistical packages (e.g., SAS, XLSTAT, SYSTAT), with R giving the most comprehensive implementation. Several packages in R are specifically dedicated to correspondence analysis and its variants. The most popular are the packages ca, FactoMineR (Husson et al. 2011), ade4, vegan, and ExPosition (Beaton et al. 2014; this last package was used to analyze the example presented below). MATLAB programs are also available for download from www.utdallas.edu/~herve.

Example: The Colors of Sounds

To illustrate CA, we have collected data (as part of a science project!) from 22 participants who were presented with 9 "pieces of music" and asked to associate 1 of 10 colors to each piece of music. The pieces of music were (1) the music of a video game (video), (2) a jazz song (jazz), (3) a country and Western song (country), (4) a rap song (rap), (5) a pop song (pop), (6) an extract of the opera Carmen (opera), (7) the low F note played on a piano, (8) the middle F note played on the same piano, and, finally, (9) the high F note still played on the same piano. The data are shown in Table 1, where the columns are the pieces of music, the rows are the colors, and the numbers at the intersection of the rows and the columns give the number of times the color in the row was associated with the piece of music in the column. A graphical representation of the data from Table 1 is given by the "heat map" displayed in Fig. 1.

A CA of Table 1 extracted eight components. We will, here, only consider the first two components, which together account for 65% of the total inertia (with eigenvalues of 0.287 and 0.192, respectively). The factor scores of the observations (rows) and variables (columns) are shown in Tables 2 and 3, respectively. The corresponding map is displayed in Fig. 2.

We can see from Figs. 2 (symmetric plot) and 3 (asymmetric plot) that the first component isolates the color black from all the other colors and that black is mostly associated with two pieces of music: rap and the low F. The association with the low note reflects a standard association between pitch and color (high notes are perceived as bright and low notes as dark); by contrast, the association of rap music with the black color is likely to reflect a semantic association. The squared cosines show that the first component accounts for most of the variance of the black color (93%, see Table 2). The second component separates the colors brown and (to a lesser extent) green from the other colors (in particular purple); brown and green are associated with pop and middle F. On the other side of the second dimension, we find the color purple and a quartet of pieces of music (video, high F, opera, and jazz).

Key Applications

CA, being a very versatile method, has been applied to many different domains such as sensory evaluation, brain imaging, genetics, sociology, psychology, ecology, graph theory, etc. (see Beh and Lombardo 2015 for a recent review of applications).

Future Directions

CA is still mostly a descriptive method; future directions of research are likely to integrate computationally based cross-validation methods to identify significant dimensions and significant contributions for rows and columns (see, e.g., Beaton et al. 2016). Another likely development is the design of adapted sparsification methods (such as the LASSO; see, e.g., Allen et al. 2014). Also, the development of multi-block and multi-table extensions of CA will make CA even more adapted to the analysis of large heterogeneous data sets.

Conclusion

CA is a very versatile and popular technique in multivariate analysis. In addition to the basics presented here, CA includes numerous variants such as multiple correspondence analysis (to analyze several nominal variables, see Abdi and Valentin 2007, Lebart et al. 1984, and Greenacre 2007), discriminant correspondence analysis (to assign observations to a priori defined groups, see, e.g., Nakache et al. 1977; Saporta and Niang 2006; Abdi 2007d), and multi-block correspondence analysis (when the variables are structured into blocks, see Abdi and Valentin 2007, Lebart et al. 1984, Greenacre 2007, Abdi et al. 2013, Beaton et al. 2016, and Williams et al. 2010).

Correspondence Analysis, Table 1 Twenty-two participants associated one of ten colors to nine pieces of music. The column labeled x_{i+} gives the total number of choices made for each color. N is the grand total of the data table. The vector of masses for the rows, r, gives the proportion of choices made for each color (r_i = x_{i+}/N). The row labeled x_{+j} gives the total number of times each piece of music was presented (i.e., it is equal to the number of participants). The centroid row, c⊤, gives these values as proportions (c_j = x_{+j}/N).

Color    Video  Jazz  Country  Rap  Pop  Opera  Low F  High F  Middle F  x_{i+}     r
Red        4     2      4       4    1     2      2      4       1         24    0.121
Orange     3     4      2       2    1     1      0      3       2         18    0.091
Yellow     6     4      5       2    3     1      1      3       0         25    0.126
Green      2     0      5       1    3     3      3      1       5         23    0.116
Blue       2     5      0       1    4     1      2      1       3         19    0.096
Purple     3     3      1       0    0     3      0      2       1         13    0.066
White      0     0      0       0    1     4      1      5       3         14    0.071
Black      0     2      0      11    1     3     10      1       1         29    0.146
Pink       2     1      1       0    2     4      0      2       0         12    0.061
Brown      0     1      4       1    6     0      3      0       6         21    0.106
x_{+j}    22    22     22      22   22    22     22     22      22    N = 198    1.000
c⊤       0.11  0.11   0.11    0.11 0.11  0.11   0.11   0.11    0.11

Correspondence Analysis, Fig. 1 The Colors of Music. A heat map of the data from Table 1 (cell shading gives the number of participants choosing a color, from 0 to 10; white is none).

Correspondence Analysis, Table 2 CA The Colors of Music. Factor scores, contributions, masses, mass × squared factor scores, inertia to barycenter, and squared cosines for the rows. For convenience, squared cosines and contributions have been multiplied by 1000 and rounded.

Color     F1      F2    ctr1  ctr2   r_i   r_i×F1²  r_i×F2²  r_i×d²_{r,i}  cos²1  cos²2
Red     −0.026   0.299     0    56  0.121   0.000    0.011      0.026          3    410
Orange  −0.314   0.232    31    25  0.091   0.009    0.005      0.030        295    161
Yellow  −0.348   0.202    53    27  0.126   0.015    0.005      0.057        267     89
Green   −0.044  −0.490     1   144  0.116   0.000    0.028      0.048          5    583
Blue    −0.082  −0.206     2    21  0.096   0.001    0.004      0.050         13     81
Purple  −0.619   0.475    87    77  0.066   0.025    0.015      0.050        505    298
White   −0.328   0.057    26     1  0.071   0.008    0.000      0.099         77      2
Black    1.195   0.315   726    75  0.146   0.208    0.014      0.224        929     65
Pink    −0.570   0.300    68    28  0.061   0.020    0.005      0.053        371    103
Brown    0.113  −0.997     5   545  0.106   0.001    0.105      0.108         12    973
Σ         –       –     1000  1000    –     0.287    0.192      0.746

(The sums of the mass × squared factor score columns are the eigenvalues λ1 = 0.287 (τ1 = 39%) and λ2 = 0.192 (τ2 = 26%); the last sum is the total inertia I = 0.746.)

Correspondence Analysis, Table 3 CA The Colors of Music. Factor scores, contributions, masses, mass × squared factor scores, inertia to barycenter, and squared cosines for the columns. For convenience, squared cosines and contributions have been multiplied by 1000 and rounded.

Piece      G1      G2     G̃1      G̃2    ctr1  ctr2   c_j   c_j×G1²  c_j×G2²  c_j×d²_{c,j}  cos²1  cos²2
Video    −0.541   0.386  −1.007   0.879   113    86  0.111   0.032    0.017      0.071        454    232
Jazz     −0.257   0.275  −0.478   0.626    25    44  0.111   0.007    0.008      0.069        105    121
Country  −0.291  −0.309  −0.541  −0.704    33    55  0.111   0.009    0.011      0.066        142    161
Rap       0.991   0.397   1.846   0.903   379    91  0.111   0.109    0.017      0.133        822    132
Pop      −0.122  −0.637  −0.227  −1.450     6   234  0.111   0.002    0.045      0.064         26    709
Opera    −0.236   0.326  −0.440   0.742    22    61  0.111   0.006    0.012      0.079         78    149
Low F     0.954  −0.089   1.777  −0.203   351     5  0.111   0.101    0.001      0.105        962      8
High F   −0.427   0.408  −0.795   0.929    70    96  0.111   0.020    0.018      0.074        271    249
Middle F −0.072  −0.757  −0.134  −1.723     2   330  0.111   0.001    0.064      0.084          7    759
Σ          –       –       –       –     1000  1000    –     0.287    0.192      0.746

(The sums of the mass × squared factor score columns are the eigenvalues λ1 = 0.287 (τ1 = 39%) and λ2 = 0.192 (τ2 = 26%); the last sum is the total inertia I = 0.746.)


Correspondence Analysis, Fig. 2 CA The Colors of Music. Symmetric plot: the projections of the rows and the columns are displayed in the same map. λ1 = 0.287, τ1 = 39%; λ2 = 0.192, τ2 = 26%. In this plot the proximity between rows and columns cannot be directly interpreted.

Correspondence Analysis, Fig. 3 CA The Colors of Music. Asymmetric plot: the projections of the rows and the columns are displayed in the same map. The inertia of the projections of the column factor scores is equal to 1 for each dimension, and the inertias of the projections of the row factor scores are λ1 = 0.287, τ1 = 39%; λ2 = 0.192, τ2 = 26%. In this plot the proximity between rows and columns can be directly interpreted.


Cross-References

▶ Clustering Algorithms
▶ Data Mining
▶ Distance and Similarity Measures
▶ Eigenvalues, Singular Value Decomposition
▶ Matrix Algebra, Basics of
▶ Matrix Decomposition
▶ Network Analysis in French Sociology and Anthropology
▶ Network Models
▶ Principal Component Analysis
▶ Probability Matrices
▶ Similarity Metrics on Social Networks

References

Abdi H (1988) A generalized approach for connectionistauto-associative memories: interpretation, implicationsand illustration for face processing. In: DemongeotJ (ed) Artificial intelligence and cognitive sciences.Manchester University Press, Manchester, pp 149–164

Abdi H (2003) Multivariate analysis. In: Lewis-Beck M,Bryman A, Futing T (eds) Encyclopedia for researchmethods for the social sciences. Sage, Thousand Oaks,pp 699–702

Abdi H (2007a) Singular value decomposition (SVD) andgeneralized singular value decomposition(GSVD). In:Salkind NJ (ed) Encyclopedia of measurement andstatistics. Sage, Thousand Oaks, pp 907–912

Abdi H (2007b) Eigen-decomposition: eigenvalues andeigenvectors. In: Salkind NJ (ed) Encyclopedia of mea-surement and statistics. Sage, Thousand Oaks,pp 304–308

Abdi H (2007c) Discriminant correspondence analysis. In:Salkind NJ (ed) Encyclopedia of measurement andstatistics. Sage, Thousand Oaks, pp 270–275

Abdi H (2007d) Metric multidimensional scaling. In:Salkind NJ (ed) Encyclopedia of measurement andstatistics. Sage, Thousand Oaks, pp 598–605

Abdi H (2007e) Z-scores. In: Salkind NJ (ed) Encyclopediaof measurement and statistics. Sage, Thousand Oaks,pp 1057–1058

Abdi H, Valentin D (2007) Multiple correspondence anal-ysis. In: Salkind NJ (ed) Encyclopedia of measurementand statistics. Sage, Thousand Oaks, pp 651–657

Abdi H, Williams LJ (2010a) Principal component analy-sis. Wiley Interdiscip Rev Comput Stat 2:433–459

Abdi H, Williams LJ (2010b) Correspondence analysis. In:Salkind NJ (ed) Encyclopedia of research design. Sage,Thousand Oaks

Abdi H, Valentin D, O’Toole J (1997) A generalized auto-associator model for face processing and sex categorization: from principal components to multivariate analysis. In: Levine D (ed) Optimality in biological and artificial networks. Lawrence Erlbaum, Mahwah, pp 317–337

Abdi H, Williams LJ, Valentin D (2013) Multiple factor analysis: principal component analysis for multi-table and multi-block data sets. Wiley Interdiscip Rev Comput Stat 5:149–179

Allen GI, Grosenick L, Taylor J (2014) A generalized least squares matrix decomposition. J Am Stat Assoc 109:145–159

Beaton D, Chin Fatt CR, Abdi H (2014) An ExPosition of multivariate analysis with the singular value decomposition in R. Comput Stat Data Anal 72:176–189

Beaton D, Dunlop J, ADNI, Abdi H (2016) Partial least squares-correspondence analysis: a framework to simultaneously analyze behavioral and genetic data. Psychol Methods 21:621–651

Beh EJ, Lombardo R (2015) Correspondence analysis: theory, practice and new strategies. Wiley, London

Benzécri J-P (1973) L’analyse des données, vol 1, 2. Dunod, Paris

Cordier B (1965) L’analyse des correspondances. PhD dissertation, University of Rennes

Eckart C, Young G (1936) The approximation of a matrix by another of a lower rank. Psychometrika 1:211–218

Escofier B (1969) L’analyse factorielle des correspondances. Cahiers du Bureau Universitaire de Recherche Opérationnelle 13:25–59

Escofier B, Pagès J (1990) Analyses factorielles simples et multiples: objectifs, méthodes, interprétation. Dunod, Paris

Escoufier Y (2007) Operators related to a data matrix: a survey. In: COMPSTAT: 17th symposium proceedings in computational statistics, Rome, Italy, 2006. Physica Verlag, New York, pp 285–297

Good I (1969) Some applications of the singular value decomposition of a matrix. Technometrics 11:823–831

Greenacre MJ (1984) Theory and applications of correspondence analysis. Academic, London

Greenacre MJ (2007) Correspondence analysis in practice, 2nd edn. Chapman & Hall/CRC, Boca Raton

Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417–441

Husson F, Lê S, Pagès J (2011) Exploratory multivariate analysis by example using R. Chapman & Hall/CRC, Boca Raton

Hwang H, Tomiuk MA, Takane Y (2010) Correspondence analysis, multiple correspondence analysis and recent developments. In: Millsap R, Maydeu-Olivares A (eds) Handbook of quantitative methods in psychology. Sage, London

Lebart L, Fénelon JP (1975) Statistique et informatique appliquées. Dunod, Paris

Lebart L, Saporta G (2014) Historical elements of correspondence analysis and multiple correspondence analysis. In: Blasius J, Greenacre M (eds) Visualization and verbalization of data. CRC Press, Boca Raton



Lebart L, Morineau A, Warwick KM (1984) Multivariate descriptive statistical analysis: correspondence analysis and related techniques for large matrices. Wiley, London

Nakache JP, Lorente P, Benzécri JP, Chastang JF (1977) Aspect pronostics et thérapeutiques de l’infarctus myocardique aigu. Les Cahiers de l’Analyse des Données 2:415–534

Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 6:559–572

Pearson K (1906) On certain points connected with scale order in the case of a correlation of two characters which for some arrangement give a linear regression line. Biometrika 13:25–45

Saporta G, Niang N (2006) Correspondence analysis and classification. In: Greenacre M, Blasius J (eds) Multiple correspondence analysis and related methods. Chapman & Hall, Boca Raton, pp 371–392

Stewart GW (1993) On the early history of the singular value decomposition. SIAM Rev 35:551–566

Takane Y (2002) Relationships among various kinds of eigenvalue and singular value decompositions. In: Yanai H, Okada A, Shigemasu K, Kano Y, Meulman J (eds) New developments in psychometrics. Springer, Tokyo, pp 45–56

Williams LJ, Abdi H, French R, Orange JB (2010) A tutorial on multi-block discriminant correspondence analysis (MUDICA): a new method for analyzing discourse data from clinical populations. J Speech Lang Hear Res 53:1372–1393
