

LOCAL COMPONENT ANALYSIS : A NEURAL MODEL FOR INFORMATION RETRIEVAL

Alain LELU, CNRS - Institut de l'Information Scientifique et Technique

SERID / Departement Recherche et Produits Nouveaux, 26 rue Boyer, 75971 Paris CEDEX 20, France

Tel. 33-1 43 58 35 59

Abstract: In this paper, we present a neural model implementing an associative mapping of highly multidimensional data - as encountered in information retrieval systems - upon a structured set of half-axes. Our unsupervised learning rule uses a local reconstitution of the data by the network, and an analytic expression of the gain coefficient. We will show first that our model can be configured as a stochastic approximation of principal component analysis and correspondence analysis. Then we will discuss its application - in its associative mapping configuration - to real-world documentary databases. Finally, an example of such an application will be presented.

Our starting point is an application problem: how to enhance information retrieval systems in order to allow for "convivial" browsing in documentary databases?

Querying documentary systems in natural language appears today to be the high road to success in this context. Of course, a high degree of tolerance of these systems to typographical and spelling mistakes, and to the use of synonyms, is extremely useful in the man-machine relationship, especially when our goal is to make these systems available to ever wider groups of users. But it appears that the search for a somewhat anthropomorphic dialogue with the computer is premature as long as a certain number of representational problems are not resolved: how to supply the human interlocutor with an overall view of the editorial position(s) adopted by the administrators and indexers of the documentary base, whether implicitly or not (1, 2)? This interlocutor should have an idea of what he might or might not find in the system. In short, he should know with whom and with what he is dealing! What tools could he be given to help him find his way during his search, so that he will progress at all times in the directions that are most relevant from his point of view?

We will present in this paper a neural model trying to cope with those representational problems - for us "representation" has the weak, "visual" and fuzzy connotation it has in the field of Data Analysis, and not the "hard" symbolic meaning it has in the field of Artificial Intelligence.

1 - Factors and neurons

Principal component analysis consists in reducing the space of representation of a set of observations: the goal is to shift from a description of each object by numerous descriptors to a description by a few factorial values only, i.e. to reduce the number of dimensions in the representation space (the axes of such a reduced space are known as the "factors").

Formally it is possible to represent simultaneously the rows and columns of a data table in this space; the factorial value $u_j$ of the j-th "pattern" of data (for example a row of numbers in a table containing the values of descriptive variables for a particular "individual") is given by the classical transition formula between the normalized factorial values $w_i$ of the columns and those of the rows:

$$u_j = \frac{1}{\sqrt{\lambda}} \sum_i x_{ij}\, w_i$$

where $\lambda$ is a singular value of the matrix $XX^{T}$.

Let us denote by $\mathbf{m}$ a "synaptic weight" vector of arbitrary norm $\|\mathbf{m}\|$:

$$\mathbf{m} / \|\mathbf{m}\| = (w_i)$$

Thus the transition formula is in fact a very simple transfer function, i.e. the inner product:

$$\eta = \langle \mathbf{m}, \mathbf{x} \rangle$$

where $\eta = u_j\, \|\mathbf{m}\|\, \sqrt{\lambda}$.

This leads us to consider the first p factors as the responses of p simply defined neurons (with linear response and without threshold), all subjected to the successive input rows $\{x_{ij}\}$, with "synaptic weights" $\mathbf{m}_a$, each producing $\eta_a$ as output, this output being the sum of the inputs weighted by the synaptic weights (see figure 1).

[figure 1: a layer of p linear neurons; input = incoming data, row by row]


2 - Learning process and normalization

If we want to enhance this model in order to perform a real factor analysis, fast dynamics must coexist with slow dynamics, i.e. the output derived from the inputs must coexist with the process of updating the synaptic weights, which converge towards the factors. Such a process is in fact achieved by the stochastic approximation method of factor analysis (3, 4), a method that inspired our model for one factor:

Starting from the "iterated power" algorithm applied to the data matrix $X = \{x_{ij}\}$, we obtain the following formula for updating the weight vector $\mathbf{m}_t$:

$$\mathbf{m}_t = \mathbf{m}_{t-1} + \frac{1}{T_t}\,\eta\,\mathbf{x} - \left(1 - \frac{T_{t-1}}{T_t}\right)\mathbf{m}_{t-1}$$

where $1/T_t$ is a gain coefficient. Under the constraint that the norm $\|\mathbf{m}\|$ remain constant at a desired value, this yields:

$$\delta\mathbf{m} = \frac{1}{T}\,\eta\left(\mathbf{x} - \eta\,\mathbf{m} / \|\mathbf{m}\|^2\right)$$

$$T := T + \eta^2 / \|\mathbf{m}\|^2$$

with $T(0) = \|\mathbf{x}(1)\|^2$.

This algorithm, like all those written in the form $\delta\mathbf{m} = \alpha\,\eta\,\mathbf{x} - \beta\,\eta^2\,\mathbf{m}$, allows $\mathbf{m}$ to converge stochastically (5) towards the first eigenvector of the matrix $XX^{T}$, whose norm is $\sqrt{\alpha/\beta}$, here $\|\mathbf{m}\|$. In our model, the gain $1/T$ is expressed analytically in order to optimize the convergence process. The scalar $T$ is an estimate of the sum of the squares of the projections of the data vectors upon the first factorial axis.
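For illustration, a minimal numpy sketch of this one-factor rule follows; the function name, the random initialisation and the data layout (one pattern per row of X) are our own assumptions, not part of the statement above.

```python
import numpy as np

def one_factor_learning(X, m_norm=1.0, seed=0):
    """Single-neuron stochastic rule of section 2 (a sketch, our notation).

    Implements delta m = (1/T) * eta * (x - eta * m / |m|^2) with the
    analytic gain T := T + eta^2 / |m|^2 and T(0) = ||x(1)||^2.
    """
    rng = np.random.default_rng(seed)
    m = rng.normal(size=X.shape[1])
    m *= m_norm / np.linalg.norm(m)        # start at the desired norm |m|
    T = float(np.dot(X[0], X[0]))          # T(0) = ||x(1)||^2
    for x in X:
        eta = float(np.dot(m, x))          # linear response eta = <m, x>
        m += (eta / T) * (x - eta * m / m_norm**2)
        T += eta**2 / m_norm**2            # analytic gain update
    return m                               # approximates the first factorial axis
```

On centred data and with enough patterns, m approaches the first principal axis (5), with a norm close to the prescribed m_norm.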

As presented here, the algorithm cannot supply uncorrelated factors: our aim will therefore be to introduce into the network structural couplings between several neurons. The result of this action may then be equivalent to orthogonalization, depending on the configuration of those couplings.

3 - Orthogonalization and lateral inhibition

We shall first present an enhancement of our model by a feedback mechanism:

3 - 1 - A model with a predictor network:

Using principal component analysis, given K factors, the original data can be reconstituted according to the formula:

$$\hat{\mathbf{x}} = \sum_k u_k\, \mathbf{w}_k\, \sqrt{\lambda_k}$$

With our notations, and when the convergence of the $\mathbf{m}_k$ is achieved, this yields:

$$\hat{\mathbf{x}} = \frac{1}{\|\mathbf{m}\|^2} \sum_k \mathbf{m}_k\, \eta_k$$

Thus $\hat{\mathbf{x}}$ can be interpreted as the output vector of a "dual" network whose input vector is $\mathbf{y} = \{\eta_k\}$; in fact the weights of this network are the same as those of the "direct" net.

We can improve our model with the following feedback "delta" learning rule:

$$\delta\mathbf{m}_k = \frac{1}{T_k}\,\eta_k\,(\mathbf{x} - \hat{\mathbf{x}})$$

$$T_k := T_k + \eta_k^2 / \|\mathbf{m}\|^2$$

which includes our original law as the special case of a single-neuron network.
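A sketch of this multi-neuron feedback rule, under the same assumptions as the previous snippet (names, initialisation and data layout are ours):

```python
import numpy as np

def feedback_delta_network(X, n_neurons=5, m_norm=1.0, seed=0):
    """Feedback "delta" rule of section 3-1 (our reading).

    Each neuron k responds with eta_k = <m_k, x>; the network rebuilds
    x_hat = (1/|m|^2) sum_k eta_k m_k, and every weight vector is moved
    towards the reconstruction error (x - x_hat).
    """
    rng = np.random.default_rng(seed)
    M = rng.normal(size=(n_neurons, X.shape[1]))
    M *= m_norm / np.linalg.norm(M, axis=1, keepdims=True)
    T = np.full(n_neurons, np.dot(X[0], X[0]))   # T_k(0) = ||x(1)||^2 (assumed)
    for x in X:
        eta = M @ x                               # responses of the p neurons
        x_hat = (eta @ M) / m_norm**2             # local reconstitution of x
        M += (eta / T)[:, None] * (x - x_hat)     # delta m_k = (1/T_k) eta_k (x - x_hat)
        T += eta**2 / m_norm**2
    return M
```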

3 - 2 - A "neural" interpretation of the Gram-Schmidt orthogonalization procedure :

It is easy to verify that the Gram-Schmidt orthogonalization procedure can be obtained with our model under the following conditions:

- each weight vector is introduced successively as a data vector - so $T_k = \eta_k$ - and updates only itself;

- the reconstitution of the data used when updating the factor k is limited to the contribution of this factor and that of its predecessors.

This exact procedure can be approximated stochastically in the process of updating the weight vectors by using a filtered vector $\mathbf{y}$ instead of the original one: if $\eta_k$ is the "dominant" component of this vector, the components $\eta_{k+1}$ to $\eta_K$ are set to 0.
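A possible reading of this filtering step, assuming that "dominant" means the component of largest magnitude:

```python
import numpy as np

def filter_outputs(eta):
    """Zero the components that follow the dominant one (our reading of 3-2)."""
    eta_f = np.array(eta, dtype=float)
    k = int(np.argmax(np.abs(eta_f)))   # index of the "dominant" neuron
    eta_f[k + 1:] = 0.0                 # eta_{k+1} .. eta_K are set to 0
    return eta_f

# example: filter_outputs([0.2, -1.5, 0.4, 0.1]) -> [0.2, -1.5, 0.0, 0.0]
```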

Another procedure is more accurate but has less generalization potential: it consists in updating the first factor at the first data "sweep", then the second one at the second sweep, and so on.

Our simulations with the first procedure on real data are encouraging: even if the orthogonalization is not perfect and the behaviour of the algorithm is sensitive to the nature of the data, an interpretable structure emerges as the data are introduced. Our aim is not to design a new principal component algorithm, but rather to derive a more local method adapted to the description of very large data tables.

4 - Metric and weights suited to correspondence tables

Good results for qualitative data are obtained from correspondence analysis of co-occurrence matrices, thanks to the underlying chi-square distance measure; this distance has indeed been shown (6) to be strongly related to Shannon's entropy. Within the scope of our model, we can simulate a correspondence analysis (7) by:

- transforming our raw data, where $f_{i\cdot}$, $f_{\cdot j}$ and $f_{\cdot\cdot}$ denote respectively the column sums, the row sums and the general sum of the data matrix X;

- and then weighting our factors:

$$f(i) = m(i)\,\sqrt{f_{i\cdot}}\,, \qquad g(j) = \langle \mathbf{m}, \mathbf{x} \rangle\, \sqrt{f_{\cdot j}}$$

This transformation of the data can be interpreted as some sort of "contrast enhancement" of the input patterns.
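Since the exact transformation formula is not written out above, the following sketch uses the classical correspondence-analysis scaling $x_{ij} / \sqrt{f_{i\cdot} f_{\cdot j}}$ as a plausible stand-in; this choice is an assumption on our part, not necessarily the formula of (7).

```python
import numpy as np

def correspondence_weighting(X):
    """Chi-square-style pre-weighting of a co-occurrence table.

    ASSUMPTION: classical correspondence-analysis scaling
    x_ij / sqrt(f_i. * f_.j), used here as a stand-in for the paper's
    transformation; it acts as a kind of "contrast enhancement" of the
    input patterns.  Margins are assumed strictly positive.
    """
    X = np.asarray(X, dtype=float)
    f_row = X.sum(axis=1, keepdims=True)   # row sums
    f_col = X.sum(axis=0, keepdims=True)   # column sums
    return X / np.sqrt(f_row * f_col)
```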

5 - Generalization: a neural model for "partial half-factors"

5 - 1 - Cooperative/competitive learning for non-linear mapping

It is possible to experiment with other interconnection structures between cells: if each cell "excites" its nearby neighbors and "inhibits" more distant ones ("mexican hat" interaction), a network of numerous cells geographically arranged in space creates a global "map" of the learned data; this map may be a highly non-linear projection of the multidimensional data structure onto a portion of the surface. This approach is inspired by Kohonen (8), who implemented


simple versions of such an associative mapping: a "positive" mexican hat, and an $\alpha(\mathbf{x} - \mathbf{m})$ learning rule.

We have designed a model in which $\hat{\mathbf{x}}$ is a kind of "local estimation" of the input vector $\mathbf{x}$, which tends to generalise the previous pseudo-Gram-Schmidt orthogonalization method beyond hierarchically derived axes:

$$\hat{\mathbf{x}} = \frac{1}{\|\mathbf{m}\|^2} \sum_{k \in D} \hat{\eta}_k\, \mathbf{m}_k$$

where D is the domain of influence of the "dominant" neuron, whose output is maximum when the input vector $\mathbf{x}$ is read, and $\hat{\eta}_k$ is a "filtered" version of the component $\eta_k$ of the output vector $\mathbf{y}$, transformed by the "mexican hat" coefficients (see figure 2).
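One possible reading of this local estimation step; the 1-D arrangement of the neurons, the crude mexican-hat profile and the radius of D are all our own choices, since the text only states their existence.

```python
import numpy as np

def local_estimation(M, x, m_norm=1.0, radius=2):
    """Local estimation of x via mexican-hat filtered outputs (our reading of 5-1)."""
    eta = M @ x                                        # outputs of all neurons
    k_star = int(np.argmax(eta))                       # "dominant" neuron for this input
    d = np.abs(np.arange(len(eta)) - k_star)           # map distance to the dominant cell
    hat = np.where(d <= radius, 1.0 - 0.6 * d, 0.0)    # excitation nearby, inhibition at the edge of D
    eta_filtered = eta * hat                           # filtered components of y
    x_hat = (eta_filtered @ M) / m_norm**2             # reconstruction restricted to D
    return x_hat, eta_filtered
```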

5 - 2 - Strict orthogonalization or not? Axes or half-axes?

Our practice of data analysis techniques suggests a few remarks:

- Strict orthogonalization of the factors is an artificial constraint imposed on data representation; principal component analysis is not the unique solution to the factor analysis problem: see the Oblimax, Varimax, ... methods; the relevant point is to "catch" local areas with a relatively higher density of data vectors x;

[figure 2]

[figure 3: half-axes seen as overlapping clusters of descriptors and described objects]

- Factor representation tends to privilege couples of oppositions. Observed structures of data are often different: for example, three poles may induce unstable and arbitrary rotations between two eigenvectors with neighbouring eigenvalues. In our opinion, the most adequate representation is a set of projections upon half-axes issued from the center of gravity of the data cloud. This type of representation can also be considered as a set of overlapping clusters of descriptors and described objects (see figure 3). It is in accordance with the fuzziness of most natural classes, proven by the well-known instability of clustering techniques with respect to small variations in their control parameters (number of classes, thresholds, metrics, ...).


This is why our model includes the extraction of half-axes by introducing a 0 threshold upon the transfer function of the network; these arguments also justify its weak orthogonalization constraints, due to our learning scheme.
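In code, this thresholded (half-axis) transfer function simply amounts to clipping the inner product at 0 (a trivial sketch, our notation):

```python
import numpy as np

def half_axis_response(m, x):
    """Thresholded transfer function: only the positive half-axis responds."""
    return max(0.0, float(np.dot(m, x)))   # eta = <m, x> clipped at the 0 threshold
```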

In this way, our delta learning law tracks the local peaks of the smooth landscape drawn by the inertia of the data projections upon half-axes (see figure 4).

This algorithm is suboptimal; simulations on a small data set show that if we freeze all the gains $1/T_k$ after the first sweep over the data, and then carry out a dozen extra sweeps, the system converges towards a set of stable "important" half-axes. These axes are:

- clearly interpretable,
- independent of the order of the patterns and of the initial random values of the synaptic weights,
- independent of the interaction structure.

This last point may seem a bit paradoxical: in fact the relative position of the axes on the global map is not independent of the inter-neuron structure. We could derive the same set of axes without any interaction between axes, but then we would no longer be able to obtain a significant global map.


[figure 4: Inertia along half-axes issued from G: example of 4 points in the 3-D space. Ca, Cb, Cc are the half-axes with maximal inertia.]

A difficult problem with our model is to know how many local peaks should be found in a given set of data, and how to be sure not to "miss" any of them. For the moment we only have an empirical answer. We have carried out simulations on a big documentary database and compared our results with those obtained with other techniques. When a "sufficient" number of neurons is used, say between one tenth and one third of the dimension of the patterns, very few local peaks are missed, as far as we know. This experience is reported in the next section.

6 - APPLICATION TO INFORMATION RETRIEVAL

Our NEURONAD neural network simulation program has been applied to a documentary base containing 1200 bibliographic notices indexed by the 500 most frequent terms in this corpus; this was a small subset of the PASCAL database relative to polymer research from 1972 to 1975, already analyzed with the LEXIMAPPE clustering technique (9). About 30 half-axes emerged, which were meaningfully aggregated along chains or small clusters; these axes have been interpreted by specialists of the domain, and most of them cross-check the content of the LEXIMAPPE clusters. Computation on a PC AT personal computer took a few seconds for "learning" each document.

Another example, which is easier to interpret but shows the properties of our system, concerns a picture database: the corpus is a sample of 1000 photos indexed by 144 key-words, each photo having on average 2.1 indexing terms. The results obtained are satisfactory: key-words and documents were classified meaningfully along some ten axes in a single "sweep". The axes are both global and local (i.e. relative to small subsets of the base), as can be seen in figure 5; the inter-neuron structure was a "tiling" of 19 triangularly arranged neurons.

As a conclusion, highly multidimensional documentary spaces are an entangled mesh of associations between overlapping objects, containing both global and local structures. Their representation can be achieved using a global map of spatially interconnected neurons, of which only a small number are "switched on" for each pattern of data. This map - which will as a rule not be linear - will yield an optimal representation, defining the appropriate projection for a global interpretation of the documentary space. Each half-axis emphasizes the positions of a fuzzy subset of terms and documents, and the half-axes need not be orthogonal.

The user may then orient himself and move within the global map as well as within local maps, assisted in this by the constantly increasing graphic and real-time capabilities of personal computers and workstations for the multiwindow display of data properties by words and images (10). At any moment, he can scan these elements along any axis of his choice, or browse around the environment of a term or a document, given any context embodied in one axis or another.

The process of documentary indexing will result in real-time "learning" of each new documentary unit by the simulated network at each step. Assimilation of each unit will result in a slight modification of the various "points of view" embodied in the neurons. This operation of "synaptic weight" updating is, of course, intrinsically adapted to massively parallel data processing architectures.

References:

(1) A. Lelu, D. Rosenblatt - "A neural network model for information retrieval" - nEuro '88, ESPCI, Paris, June 6-9, 1988

(2) A. Lelu, D. Rosenblatt - "Representation et parcours d'un espace documentaire. Analyse des donnees, reseaux neuronaux et banques d'images" - Les Cahiers de l'Analyse des Donnees, vol. XI, no 4, pp. 453-470 - Dunod - Paris - 1986

(3) J. P. Benzecri - "Approximation stochastique dans une algebre non normee commutative" - Bull. Soc. Math. France, no 97 - Paris - 1969

(4) L. Lebart - "Sur les calculs impliques par la description de certains grands tableaux" - Annales de l'INSEE, no 22-23 - Paris - 1976

(5) E. Oja - "A Simplified Neuron Model as a Principal Component Analyzer" - Journal of Mathematical Biology, 15 (1982)

(6) J. P. Benzecri et coll. - "L'analyse des donnees" (2 tomes) - Dunod - Paris - 1973

(7) M. Volle, D. Domenges - "Analyse factorielle spherique : une exploration" - Annales de l'INSEE, no 35 - Paris - 1979

(8) T. Kohonen - "Self-Organization and Associative Memory" - Springer-Verlag - New York - 1984

(9) J. P. Courtial - "Qualitative models, quantitative tools and network analysis" - forthcoming in Scientometrics, Special Issue: Amsterdam Workshop of Dec. 1987

(10) A. Lelu - "Browsing through Image Databases via Data Analysis and Neural Networks" - Communication at RIAO 88 (User-Oriented Content-Based Text and Image Handling) - MIT - March 21-24, 1988


[figure 5]