Lecture 7 Mathematics of Data: Introduction, PCA, MDS, and Graph Realization 2011.4.12

Lecture 7 Mathematics of Data:

Introduction, PCA, MDS, and Graph Realization

姚 远

2011.4.12  

Golden Era of Data

• Bradley Efron said that the next 30 years will be the golden era of statistics, as it is the golden era of data

• Unfortunately, traditional statisticians are not dealing with massive data, while modern massive data are encountered every day by biologists, computational scientists, financial engineers, and IT engineers

• So we are witnessing a new era of data science, where our understanding of nature is driven by data, a collection of empirical evidence

• Data is science, a way to 'learn' laws from empirical evidence

Types of Data

• Data as vectors in Euclidean spaces
– images, videos, speech waves, gene expression, financial data
– most statistical methods deal with this type of data

• Data as points in metric spaces
– conformational dynamics (RMSD distance)
– computational geometry deals with this type of data in 3D

• Data as graphs
– internet, biological/social networks
– data where we just know relations (similarity, …)
– data visualization
– computer science likes this type of data

• and there are more coming …

A Dictionary between Machine Learning and Statistics

Machine Learning           | Statistics
Supervised Learning        | Regression, Classification, Ranking, …
Unsupervised Learning      | Dimensionality Reduction, Clustering, Density Estimation, …
Semi-supervised Learning   | ✗

Contributions

• The machine learning and computer science communities have proposed many scalable algorithms to deal with massive data sets

• Those algorithms are often found to conform to the consistency theory of statistics

• Any role for mathematics?

Basic Tenet in Data Analysis

• The basic tenet in data analysis is to understand the variation of data, in terms of signal vs. noise

• Geometry and topology begin to enter this Odyssey: a dawn or a dusk?

• Explore all, then you know the answers …

Geometric and Topological Methods

• Algebraic geometry for graphical models
– Bernd Sturmfels (UC Berkeley)
– Mathias Drton (U Chicago)

• Differential geometry for graphical models
– Shun-ichi Amari (RIKEN, Japan)
– John Lafferty (CMU)

• Integral geometry for statistical signal analysis
– Jonathan Taylor (Stanford)

• Spectral kernel embedding
– LLE (Roweis, Saul et al.), ISOMAP (Tenenbaum et al.), Laplacian eigenmap (Niyogi, Belkin et al.), Hessian LLE (Donoho et al.), Diffusion map (Coifman, Singer et al.)

• Computational topology for data analysis
– Herbert Edelsbrunner (Institute of Science and Technology Austria), Afra Zomorodian (Dartmouth), Gunnar Carlsson (Stanford) et al.

Two aspects in those works

Geometry and topology may play a role in

• characterizing local or global constraints (symmetry) in model spaces or data
– algebraic and differential geometry for graphical models
– spectral analysis for discrete groups (Risi Kondor)

• characterizing nonlinear distribution (sparsity) of data
– spectral kernel embedding (nonlinear dimensionality reduction)
– integral geometry for signal analysis
– computational topology for data analysis

Let's focus on the second aspect.

Geometric and Topological Data Analysis

• The general area of geometric data analysis attempts to give insight into data by imposing a geometry (metric) on it
– manifold learning: global coordinates preserving local structure
– metric learning: find a metric accounting for similarity

• The topological method is to study invariants under metric distortion
– clustering as connected components
– loops, holes

• Between them lies Hodge theory, a bridge between geometry and topology
– manifold learning, topological data analysis
– statistical ranking

Manifold learning: Spectral Embeddings

– Principal Component Analysis (PCA)
– Multi-Dimensional Scaling (MDS)
– Locally Linear Embedding (LLE)
– Isometric map (ISOMAP)
– Laplacian Eigenmaps
– Diffusion map
– …

Dimensionality Reduction

• Data are concentrated around low-dimensional manifolds
– Linear: MDS, PCA
– Nonlinear: ISOMAP, LLE, …

Generative Models in Manifold Learning

Principal Component Analysis (PCA)

[Figure: One Dimensional Manifold]

Data matrix, one sample per column: $X_{p \times n} = [X_1 \; X_2 \; \cdots \; X_n]$
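
A minimal sketch of PCA on a p-by-n data matrix with one sample per column, as in the notation above. The function name `pca` and the synthetic data are illustrative assumptions, not part of the lecture; the computation uses the SVD of the centered data, which is one standard way to obtain principal components.

```python
import numpy as np

def pca(X, k):
    """PCA of a p-by-n matrix X whose columns are samples.

    Returns the top-k principal directions (p-by-k), the projected
    coordinates (k-by-n), and the variance explained by each direction.
    """
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                                  # center each coordinate
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    components = U[:, :k]                          # principal directions
    scores = components.T @ Xc                     # low-dimensional coordinates
    variances = s[:k] ** 2 / (X.shape[1] - 1)      # sample variances
    return components, scores, variances

# Synthetic data concentrated around a line (a one-dimensional manifold) in R^2.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
direction = np.array([3.0, 4.0]) / 5.0
X = np.outer(direction, t) + 0.01 * rng.normal(size=(2, 200))

components, scores, variances = pca(X, k=1)
alignment = abs(components[:, 0] @ direction)      # close to 1 if the line is recovered
```

The sign of a principal direction is arbitrary, which is why the alignment is measured with an absolute value.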

Distances & Mappings

• Given a Euclidean embedding, it's easy to calculate the distances between the points: $d_{ij} = \|x_i - x_j\|$

• MDS operates the other way round:

• Given the "distances" [data], find the embedding map [configuration] which generated them
– … and MDS can do so when all but ordinal information has been jettisoned (fruit of the "non-metric revolution")
– even when there are missing data and in the presence of considerable "noise"/error (MDS is robust).

• MDS thus provides at least
– [exploratory] a useful and easily assimilable graphic visualization of a complex data set (Tukey: "A picture is worth a thousand words")

PCA => MDS

• Here we are given pairwise distances instead of the actual data points.
– First convert the pairwise distance matrix into the dot product matrix
– After that, proceed as in PCA.

If we preserve the pairwise distances, do we preserve the structure?

How to get dot product matrix from pairwise distance matrix?

[Figure: triangle with vertices i, j, k, side lengths $d_{ij}$, $d_{kj}$, $d_{ki}$, and angle $\alpha$]
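
A short worked derivation of how a triangle with side lengths $d_{ij}$, $d_{ki}$, $d_{kj}$ turns distances into dot products. It assumes that $\alpha$ is the angle at vertex $k$ and that point $k$ is taken as the origin; the symbol $b_{ij}$ for the resulting dot product is introduced here for illustration.

```latex
\begin{align}
d_{ij}^2 &= d_{ki}^2 + d_{kj}^2 - 2\, d_{ki}\, d_{kj} \cos\alpha
  && \text{(law of cosines)} \\
b_{ij} &:= \langle x_i - x_k,\; x_j - x_k \rangle
  = d_{ki}\, d_{kj} \cos\alpha \\
 &= \tfrac{1}{2}\left( d_{ki}^2 + d_{kj}^2 - d_{ij}^2 \right)
\end{align}
```

So every entry of the dot product matrix is determined by three squared distances once an origin is fixed.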

Origin centered MDS

• MDS: origin as one of the points, and orientation arbitrary.

Centroid as origin

Multidimensional Scaling (MDS)

Classical MDS (squared distance $d_{ij}$): Young/Householder 1938, Schoenberg 1938
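
A minimal sketch of classical MDS with the centroid as origin: double-center the squared distance matrix to get the dot product matrix, then take the top eigenpairs as in PCA. The function names `classical_mds` and `squared_distances` and the random test points are assumptions made for illustration.

```python
import numpy as np

def squared_distances(Y):
    """n-by-n matrix of squared Euclidean distances between the rows of Y."""
    sq = np.sum(Y ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T

def classical_mds(D2, k):
    """Embed n points in R^k from an n-by-n squared distance matrix D2."""
    n = D2.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix (centroid as origin)
    B = -0.5 * H @ D2 @ H                        # dot product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals) # Y = U * Lambda^(1/2)

# 50 random points in the plane; only their squared distances are passed on.
rng = np.random.default_rng(1)
points = rng.normal(size=(50, 2))
D2 = squared_distances(points)

Y = classical_mds(D2, k=2)
max_error = np.max(np.abs(squared_distances(Y) - D2))
```

The recovered configuration `Y` matches `points` only up to a rigid motion (rotation, reflection, translation), so the check compares distances rather than coordinates.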

Example: 50-node 2-D Sensor Network Localization

Example II: City Map

Example III: Molecular Conformation

Matlab Commands

• STATS toolbox provides the following:
– cmdscale: classical MDS
– mdscale: nonmetric MDS

Theory of Classical MDS: a few concepts

• An n-by-n matrix $C$ is positive semi-definite (p.s.d.) if $v'Cv \ge 0$ for all $v \in \mathbb{R}^n$.

• An n-by-n matrix $C$ is conditionally negative definite (c.n.d.) if $v'Cv \le 0$ for all $v \in \mathbb{R}^n$ such that $\sum_i v_i = 0$.
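
The two definitions can be checked numerically for symmetric matrices. The sketch below is illustrative only: the helper names `is_psd` and `is_cnd` and the tolerance are assumptions. The c.n.d. test relies on the fact that the centering matrix projects onto the vectors whose entries sum to zero, so $C$ is c.n.d. exactly when $-HCH$ is p.s.d.

```python
import numpy as np

def is_psd(C, tol=1e-9):
    """True if the symmetric part of C has no eigenvalue below -tol."""
    C = 0.5 * (C + C.T)
    return bool(np.linalg.eigvalsh(C).min() >= -tol)

def is_cnd(C, tol=1e-9):
    """True if v'Cv <= 0 (up to tol) for every v whose entries sum to zero."""
    n = C.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # projector onto {v : sum(v) = 0}
    return is_psd(-H @ C @ H, tol)

# Squared distances between the points 0, 1, 3 on the real line.
D = np.array([[0.0, 1.0, 9.0],
              [1.0, 0.0, 4.0],
              [9.0, 4.0, 0.0]])

d_is_cnd = is_cnd(D)              # squared Euclidean distances are c.n.d.
d_is_psd = is_psd(D)              # zero trace and nonzero, so not p.s.d.
eye_is_psd = is_psd(np.eye(3))
eye_is_cnd = is_cnd(np.eye(3))    # v'v > 0 for nonzero v, so not c.n.d.
```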

Young-Householder-Schoenberg Lemma

• Let $x$ be a signed distribution, i.e. $x$ obeys $\sum_i x_i = 1$ while the $x_i$ can be negative

• Householder centering matrix: $H(x) = I - \mathbf{1}x'$

• Define $B(x) = -\frac{1}{2} H(x)\, C\, H'(x)$, for any $C$

• Theorem [Young/Householder, Schoenberg 1938b]. For any signed distribution $x$, $B(x)$ is p.s.d. iff $C$ is c.n.d.
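
A small numerical illustration of the lemma, not a proof. It builds $H(x) = I - \mathbf{1}x'$ for a random signed distribution $x$ and compares $B(x)$ for a c.n.d. matrix (squared Euclidean distances) and for a matrix that is not c.n.d. (the identity). The helper names and the random data are assumptions for illustration.

```python
import numpy as np

def householder_centering(x):
    """H(x) = I - 1 x' for a signed distribution x (entries sum to 1)."""
    n = x.shape[0]
    return np.eye(n) - np.outer(np.ones(n), x)

def b_matrix(x, C):
    """B(x) = -1/2 H(x) C H(x)'."""
    H = householder_centering(x)
    return -0.5 * H @ C @ H.T

rng = np.random.default_rng(2)
n = 6

# A signed distribution: entries sum to 1, some may be negative.
x = rng.normal(size=n)
x += (1.0 - x.sum()) / n

# A c.n.d. matrix: squared Euclidean distances between random points in R^3.
P = rng.normal(size=(n, 3))
diff = P[:, None, :] - P[None, :, :]
C_cnd = np.sum(diff ** 2, axis=-1)

# A matrix that is not c.n.d.: the identity.
C_bad = np.eye(n)

min_eig_cnd = np.linalg.eigvalsh(b_matrix(x, C_cnd)).min()   # about 0 or positive
min_eig_bad = np.linalg.eigvalsh(b_matrix(x, C_bad)).min()   # clearly negative
```

Changing the seed changes `x`, but the sign pattern of the two minimum eigenvalues should stay the same, which is what "for any signed distribution" asserts.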

Proof

• "⇐" First observe that if $B(x)$ is p.s.d., then $B(y)$ is also p.s.d. for any other signed distribution $y$, in view of the identity $B(y) = H(y)B(x)H'(y)$, itself a consequence of $H(y) = H(y)H(x)$. Also, for any $z$, $z'B(x)z = -y'Cy/2$, where the vector $y = H'(x)z$ obeys $\sum_i y_i = 0$ for any $z$, showing necessity.

• "⇒" Also, $y = H'(x)y$ whenever $\sum_i y_i = 0$, and hence $y'B(x)y = -y'Cy/2$, thus demonstrating sufficiency.

Classical MDS Theorem

• Let $D$ be an n-by-n symmetric matrix. Define a zero-diagonal matrix $C$ by $C_{ij} = D_{ij} - 0.5\,D_{ii} - 0.5\,D_{jj}$. Then we have
– $B(x) := -0.5\, H(x)\, D\, H'(x) = -0.5\, H(x)\, C\, H'(x)$
– $C_{ij} = B_{ii}(x) + B_{jj}(x) - 2B_{ij}(x)$
– $D$ is c.n.d. iff $C$ is c.n.d.
– If $C$ is c.n.d., then $C$ is isometrically embeddable, i.e. $C_{ij} = \sum_k (Y_{ik} - Y_{jk})^2$ where $Y = U \Lambda^{1/2}$, with eigendecomposition $B(x) = U \Lambda U'$
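
The identities of the theorem can be checked numerically on a small example. This is a sketch under stated assumptions: the matrix `D` is constructed here with an arbitrary nonzero diagonal so that its zero-diagonal version `C` is a squared distance matrix, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5

# Zero-diagonal c.n.d. matrix C: squared distances between random points in R^2.
P = rng.normal(size=(n, 2))
diff = P[:, None, :] - P[None, :, :]
C = np.sum(diff ** 2, axis=-1)

# A symmetric D with nonzero diagonal a, chosen so that
# C_ij = D_ij - 0.5 D_ii - 0.5 D_jj holds.
a = rng.normal(size=n)
D = C + 0.5 * (a[:, None] + a[None, :])

# Signed distribution x and Householder centering matrix H(x) = I - 1 x'.
x = rng.normal(size=n)
x += (1.0 - x.sum()) / n
H = np.eye(n) - np.outer(np.ones(n), x)

B_from_D = -0.5 * H @ D @ H.T
B_from_C = -0.5 * H @ C @ H.T
first_identity_gap = np.max(np.abs(B_from_D - B_from_C))

# Second identity: C_ij = B_ii + B_jj - 2 B_ij.
b = np.diag(B_from_C)
second_identity_gap = np.max(np.abs(b[:, None] + b[None, :] - 2.0 * B_from_C - C))

# Embedding Y = U Lambda^(1/2) from the eigendecomposition of B(x).
eigvals, U = np.linalg.eigh(0.5 * (B_from_C + B_from_C.T))
Y = U * np.sqrt(np.clip(eigvals, 0.0, None))
diffY = Y[:, None, :] - Y[None, :, :]
embedding_gap = np.max(np.abs(np.sum(diffY ** 2, axis=-1) - C))
```

All three gaps should be at the level of floating-point round-off, whatever signed distribution `x` is used.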

Proof

• the first identity follows from $H(x)\mathbf{1} = 0$;

• the second one from $B_{ii}(x) + B_{jj}(x) - 2B_{ij}(x) = C_{ij} - 0.5\,C_{ii} - 0.5\,C_{jj}$, itself a consequence of the definition $B_{ij}(x) = -0.5\,C_{ij} + \gamma_i + \gamma_j$ for some vector $\gamma$;

• the next assertion follows from $u'Du = u'Cu$ whenever $\sum_i u_i = 0$;

• the last one can be shown to amount to the second identity by direct substitution.

Remarks

• A p.s.d. $B(x)$ (or c.n.d. $C$) defines a unique squared Euclidean distance, which satisfies:

• $B_{ij}(x) = -0.5\,(D_{ij} - D_{ik} - D_{jk})$ when point $k$ is taken as the origin; the signed distribution $x$ gives the freedom to choose the center

• $B = YY'$, the scalar product matrix of the n-by-d Euclidean coordinate matrix $Y$

Gaussian Kernels

• Theorem. Let $D_{ij}$ be a squared Euclidean distance. Then for any $\lambda \ge 0$, $B_{ij}(\lambda) = \exp(-\lambda D_{ij})$ is p.s.d., and $C_{ij}(\lambda) = 1 - \exp(-\lambda D_{ij})$ is a squared Euclidean distance (c.n.d. with zero diagonal).

• So Gaussian kernels are p.s.d., and 1 minus a Gaussian kernel is a squared Euclidean distance.
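
A quick numerical check of the Gaussian kernel statement on random points. It is only an illustration for a few values of $\lambda$; the point set, the chosen values, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
P = rng.normal(size=(n, 3))
diff = P[:, None, :] - P[None, :, :]
D = np.sum(diff ** 2, axis=-1)              # squared Euclidean distances

H = np.eye(n) - np.ones((n, n)) / n         # centering matrix

kernel_min_eigs = []    # smallest eigenvalue of exp(-lam * D)
centered_min_eigs = []  # smallest eigenvalue of -1/2 H (1 - exp(-lam * D)) H
diag_max = []           # largest absolute diagonal entry of 1 - exp(-lam * D)

for lam in (0.0, 0.5, 2.0):
    K = np.exp(-lam * D)                    # Gaussian kernel matrix
    C = 1.0 - K                             # candidate squared distance matrix
    kernel_min_eigs.append(np.linalg.eigvalsh(K).min())
    centered_min_eigs.append(np.linalg.eigvalsh(-0.5 * H @ C @ H).min())
    diag_max.append(np.max(np.abs(np.diag(C))))
```

Nonnegative minimum eigenvalues (up to round-off) indicate that `K` is p.s.d. and that `C` is c.n.d. with zero diagonal for each tested value.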

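Schoenberg Transform

• A Schoenberg transformation is a function $\phi(d)$ from $\mathbb{R}^+$ to $\mathbb{R}^+$ of the form (Schoenberg 1938a)

$$\phi(d) = \int_0^\infty \frac{1 - \exp(-\lambda d)}{\lambda}\, g(\lambda)\, d\lambda$$

• where $g(\lambda)\,d\lambda$ is a non-negative measure on $[0, \infty)$ such that

$$\int_1^\infty \frac{g(\lambda)}{\lambda}\, d\lambda < \infty$$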

Schoenberg Theorem [1938a]

• Let $D$ be an n x n matrix of squared Euclidean distances. Define the components of the n x n matrix $C$ as $C_{ij} = \phi(D_{ij})$. Then $C$ is a squared Euclidean distance iff $\phi$ is a Schoenberg transformation,

$$\phi(d) = \int_0^\infty \frac{1 - \exp(-\lambda d)}{\lambda}\, g(\lambda)\, d\lambda$$
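
As an illustration, the power map $\phi(d) = d^q$ with $0 < q < 1$ is a Schoenberg transformation; it corresponds to the choice $g(\lambda) = q\,\lambda^{-q} / \Gamma(1-q)$, which is an example supplied here rather than taken from the slides. The sketch below checks numerically that $d^q$ maps squared Euclidean distances to a c.n.d. matrix, and that $\phi(d) = d^2$, which is not of the Schoenberg form, can fail to do so. Names and data are assumptions.

```python
import numpy as np

def min_eig_of_b(C):
    """Smallest eigenvalue of -1/2 H C H; nonnegative iff C is c.n.d."""
    n = C.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.eigvalsh(-0.5 * H @ C @ H).min()

rng = np.random.default_rng(5)
P = rng.normal(size=(10, 3))
diff = P[:, None, :] - P[None, :, :]
D = np.sum(diff ** 2, axis=-1)        # squared Euclidean distances, exact zero diagonal

# Power transforms with 0 < q < 1 keep the matrix c.n.d.
power_min_eigs = [min_eig_of_b(D ** q) for q in (0.25, 0.5, 0.75)]

# Counterexample for phi(d) = d^2: points 0, 1, 2 on a line.
D_line = np.array([[0.0, 1.0, 4.0],
                   [1.0, 0.0, 1.0],
                   [4.0, 1.0, 0.0]])
v = np.array([1.0, -2.0, 1.0])        # entries sum to zero
quad_form = v @ (D_line ** 2) @ v     # positive, so D_line ** 2 is not c.n.d.
```

In the counterexample the transformed "distances" are 1, 1 and 4, which violate the triangle inequality, so no Euclidean configuration can realize them.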

Nonmetric MDS

• Instead of pairwise distances we can use pairwise "dissimilarities".

• When the distances are Euclidean, MDS is equivalent to PCA.

• E.g. face recognition, wine tasting

• Can get the significant cognitive dimensions.

Generalized MDS: Quadratic equality and inequality system

Generalized MDS: anchor points

Existing Work

• Barvinok, Pataki, Alfakih/Wolkowicz and Laurent used SDP models to show that the problem is solvable in polynomial time if the dimension of the localization is not restricted.

• However, if we require the realization to be in $\mathbb{R}^d$ for some fixed $d$, then the problem becomes NP-complete (e.g., Saxe 1979, Aspnes, Goldenberg, and Yang 2004).

Key Problems

Nonlinear Least Squares

A difficult global optimization problem.
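
The formula on this slide did not survive extraction. The sketch below therefore assumes one common nonlinear least squares formulation of graph realization, namely minimizing $\sum_{(i,j) \in E} (\|x_i - x_j\|^2 - d_{ij}^2)^2$ over the point positions; the function names `stress` and `stress_gradient` are hypothetical. The objective is a quartic polynomial in the coordinates and is not convex, which is consistent with the remark that this is a difficult global optimization problem.

```python
import numpy as np

def stress(X, edges, d):
    """Assumed objective: sum over edges of (||x_i - x_j||^2 - d_ij^2)^2."""
    i, j = edges[:, 0], edges[:, 1]
    diff = X[i] - X[j]
    r = np.sum(diff ** 2, axis=1) - d ** 2
    return np.sum(r ** 2)

def stress_gradient(X, edges, d):
    """Gradient of the assumed objective with respect to every point."""
    i, j = edges[:, 0], edges[:, 1]
    diff = X[i] - X[j]
    r = np.sum(diff ** 2, axis=1) - d ** 2
    g = 4.0 * r[:, None] * diff           # derivative of r^2 with respect to x_i
    grad = np.zeros_like(X)
    np.add.at(grad, i, g)                 # accumulate contributions per endpoint
    np.add.at(grad, j, -g)
    return grad

# Six points in the plane, all pairwise distances known.
rng = np.random.default_rng(6)
truth = rng.normal(size=(6, 2))
edges = np.array([(a, b) for a in range(6) for b in range(a + 1, 6)])
d = np.linalg.norm(truth[edges[:, 0]] - truth[edges[:, 1]], axis=1)

value_at_truth = stress(truth, edges, d)  # zero at the true configuration

# Compare the gradient with a central finite difference in a random direction.
X0 = rng.normal(size=(6, 2))
V = rng.normal(size=(6, 2))
eps = 1e-5
fd = (stress(X0 + eps * V, edges, d) - stress(X0 - eps * V, edges, d)) / (2 * eps)
analytic = np.sum(stress_gradient(X0, edges, d) * V)
```

A local descent method started from a random configuration can use this gradient, but it may stop at a local minimum that is not a valid realization.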

Matrix Representation I

A difficult global optimization problem.

Matrix Representation II

SDP Relaxation [Biswas-Ye 2004]

SDP Standard Form [Biswas-Ye 2004]

SDP Dual Form [Biswas-Ye 2004]

Unique Localizability

UL is called universal rigidity.

ULP is localizable in polynomial time [So-Ye 2005]

UL Graphs

d-lateration Graphs

Open Problems

References

• Schoenberg, I. J. (1938a) Metric Spaces and Completely Monotone Functions. The Annals of Mathematics 39, 811–841

• Schoenberg, I. J. (1938b) Metric Spaces and Positive Definite Functions. Transactions of the American Mathematical Society 44, 522–536

• Francois Bavaud (2010) On the Schoenberg Transformations in Data Analysis: Theory and Illustrations, arXiv:1004.0089v2

• T. F. Cox and M. A. A. Cox, Multidimensional Scaling. New York: Chapman & Hall, 1994.

• Yinyu Ye, Semidefinite Programming and its Low-Rank Theorems, ICCM 2010