


(Figure: junction tree notation, showing a clique C above a separator set S with child separator sets S1, S2, ..., SK, the inside subtrees T(S), T(S1), ..., T(SK) with observable variable sets α(S), α(S1), ..., α(SK), and the outside tree T\T(S) with observable variables β(S).)

(Figure: the latent junction tree corresponding to a length-60 second-order HMM, with hidden-state cliques {H1, H2, H3}, {H2, H3, H4}, ..., {H58, H59, H60} connected through overlapping pair separators, and leaf cliques {X1, H1}, {X2, H2}, ..., {X60, H60} for the observed bases.)

Spectral Learning of General Latent-Variable Probabilistic Graphical Models: A Supervised Learning Approach

Borui Wang

    Computer Science Department, Stanford University, Stanford, CA

Introduction

In this project, I designed, proved, and investigated a new type of spectral learning algorithm for learning general latent-variable probabilistic graphical models by reducing spectral learning to a pipeline of supervised learning procedures called "two-stage regressions." I focused on a particular form of latent-variable probabilistic graphical model here, namely the latent junction tree. This modeling choice is justified by the fact that any probabilistic graphical model can be converted into a corresponding junction tree, and the topology as well as many of the nice theoretical properties of junction trees offer great convenience during the learning and inference stages of the spectral algorithm.

    (an example of turning a latent-variable probabilistic graphical model into a corresponding latent junction tree)


Experiment & Results

To test the performance of this new spectral learning algorithm on real data, I implemented and experimented with the algorithm in Matlab on the task of DNA splicing-site classification [1]. This is a computational biology task: classifying DNA splicing sites with a generative approach by learning second-order hidden Markov models. The inputs of this learning task are DNA sequences that are each 60 bases long, and the output is one of three possible classification labels for the input sequence: exon/intron (EI) boundary, intron/exon (IE) boundary, or neither (N). This dataset contains a total of 3190 labeled instances of DNA splicing sites, of which 767 are EI sites (25%), 768 are IE sites (25%), and 1655 are Neither (50%). For each individual DNA splicing-site sequence of length 60, we construct a length-60 second-order hidden Markov model to model its generative process, and we then turn this second-order HMM into a corresponding latent-variable junction tree.

The current experimental results are:

Category of DNA Splicing Site    Accuracy Rate
Exon/Intron (EI)                 79.1%
Intron/Exon (IE)                 74.5%
Neither (N)                      85.8%
Overall                          81.5%
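As a concrete illustration of the evidence vectors $e_x$ that the leaf cliques $\{X_i, H_i\}$ of this junction tree consume, here is a minimal sketch (not the Matlab code used for the experiments) that one-hot encodes a DNA sequence; the alphabet handling is an assumption of mine, since the raw UCI data also contains ambiguity codes.

```python
import numpy as np

BASES = "ACGT"  # assumed 4-letter observation alphabet; ambiguity codes in the
                # raw data are simply ignored in this sketch

def one_hot_sequence(seq):
    """Encode a DNA string as a (len(seq), 4) matrix of indicator vectors e_x."""
    enc = np.zeros((len(seq), len(BASES)))
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            enc[i, BASES.index(base)] = 1.0
    return enc

example = one_hot_sequence("ATACG")
print(example)   # each row is the evidence vector e_x for one observed base X_i
```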

    References

[1] G. Towell, M. Noordewier, and J. Shavlik. Molecular Biology Data Set. UCI Machine Learning Repository.
[2] Ahmed Hefny, Carlton Downey, and Geoffrey J. Gordon. Supervised Learning for Dynamical System Learning. NIPS 2015.
[3] Ankur P. Parikh, Le Song, Mariya Ishteva, Gabi Teodoru, and Eric P. Xing. A Spectral Algorithm for Latent Junction Trees. UAI 2012.


Main Algorithm

Input: the topology of the latent-variable probabilistic graphical model G, a training set of N i.i.d. samples $\{x_1^d, \ldots, x_{|O|}^d\}_{d=1}^{N}$, a set of observed evidence $\{X_i = x_i\}_{i \in E}$ (where E denotes the index set of the observable variables that are observed as evidence), and the index Q of the query node $X_Q$.

Output: Estimated conditional probability distribution $\hat{P}[X_Q \mid \{X_i = x_i\}_{i \in E}]$.
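To fix ideas about what the learning and inference procedures consume and return, here is a minimal interface sketch; the container and function names are mine and not part of the poster.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class SpectralJunctionTreeModel:
    """What the learning algorithm produces (names are illustrative only)."""
    design_tensors: Dict[str, Any] = field(default_factory=dict)   # Phi_S per separator S
    operators: Dict[str, Any] = field(default_factory=dict)        # W_S per non-leaf separator S
    root_covariance: Any = None                                    # T_{C_r} at the root clique

def learn(tree_topology: Any, samples: List[Tuple]) -> SpectralJunctionTreeModel:
    """Steps 1-6 of the learning algorithm (model construction through root covariance)."""
    raise NotImplementedError

def infer(model: SpectralJunctionTreeModel, evidence: Dict[int, int], query: int):
    """Steps 1-4 of the inference algorithm; returns the estimated P[X_Q | evidence]."""
    raise NotImplementedError
```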

    The Learning Algorithm:

Step 1 - Model Construction: Convert the latent-variable graphical model G into an appropriate corresponding latent-variable junction tree T.

Step 2 - Model Specification: For each separator set S in the junction tree T, pick features for the inside subtree, $\phi_S = f_S[\alpha(S)]$, and for the outside subtree, $\psi_S = h_S[\beta(S)]$.
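As an illustration of what the feature maps $f_S$ and $h_S$ might look like for discrete variables, here is a minimal sketch using simple indicator (one-hot) features over a joint assignment of the variables in $\alpha(S)$ or $\beta(S)$; the helper name is hypothetical.

```python
import numpy as np
from itertools import product

def indicator_feature_map(cardinalities):
    """Return a feature function that one-hot encodes a joint assignment
    of discrete variables with the given cardinalities."""
    # enumerate all joint assignments once so each gets a fixed index
    assignments = {a: i for i, a in enumerate(product(*[range(c) for c in cardinalities]))}

    def f(assignment):
        vec = np.zeros(len(assignments))
        vec[assignments[tuple(assignment)]] = 1.0
        return vec

    return f

# e.g. alpha(S) = two observable DNA bases, each with 4 possible values
f_S = indicator_feature_map([4, 4])
print(f_S((2, 3)))   # 16-dimensional indicator feature vector phi_S
```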

Step 3 - Stage 1A Regression (S1A): At each non-leaf separator set S in the latent junction tree T, learn a (possibly non-linear) regression model to estimate $\bar{\phi}_S = E[\phi_S \mid \psi_S]$.

Step 4 - Stage 1B Regression (S1B): At each non-leaf separator set S in the latent junction tree T, where S is connected with K "children" separator sets $\{S_1, S_2, \ldots, S_K\}$ below it, learn a (possibly non-linear) regression model to estimate $\overline{\bigotimes_{k=1}^{K} \phi_{S_k}} = E\big[\bigotimes_{k=1}^{K} \phi_{S_k} \,\big|\, \psi_S\big]$.

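The two Stage 1 regressions above (S1A and S1B) can use any supervised learner; here is a minimal sketch using ridge regression for both, where the S1B target is the flattened tensor product of the children's inside-subtree features. Array shapes, names, and the regularization constant are illustrative assumptions.

```python
import numpy as np
from functools import reduce

def ridge_fit(inputs, targets, reg=1e-3):
    """Multi-output ridge regression: returns M so that prediction = M @ input."""
    d = inputs.shape[1]
    return np.linalg.solve(inputs.T @ inputs + reg * np.eye(d), inputs.T @ targets).T

def tensor_product_rows(blocks):
    """Per-sample flattened outer product of the children's feature vectors."""
    return np.vstack([reduce(np.outer, [b[n] for b in blocks]).ravel()
                      for n in range(blocks[0].shape[0])])

# toy stand-in features for one non-leaf separator S with K = 2 children
rng = np.random.default_rng(0)
N = 500
psi_S = rng.random((N, 8))                        # outside-subtree features psi_S
phi_S = rng.random((N, 16))                       # inside-subtree features phi_S
phi_children = [rng.random((N, 6)), rng.random((N, 5))]

# Stage 1A: estimate E[phi_S | psi_S]
A = ridge_fit(psi_S, phi_S)                       # (16, 8)
phi_bar = psi_S @ A.T                             # per-sample estimates, (N, 16)

# Stage 1B: estimate E[phi_S1 (x) phi_S2 | psi_S]
prod_targets = tensor_product_rows(phi_children)  # (N, 30)
B = ridge_fit(psi_S, prod_targets)                # (30, 8)
prod_bar = psi_S @ B.T                            # per-sample estimates, (N, 30)
print(phi_bar.shape, prod_bar.shape)
```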

Step 5 - Stage 2 Regression (S2): At each non-leaf separator set S in the latent junction tree T, use the feature expectations estimated in S1A and S1B to train a linear regression model to predict $\overline{\bigotimes_{k=1}^{K} \phi_{S_k}} = W_S \times_S \bar{\phi}_S$, where $W_S$ is a linear operator associated with S.

Step 6 - Root Covariance Estimation: At the root clique $C_r$ of the latent junction tree T, calculate
$$\mathcal{T}_{C_r} = \hat{E}\Big[\bigotimes_{S \in \gamma(C_r)} f_S(\alpha(S))\Big] \approx \frac{1}{N} \sum_{d=1}^{N} \bigotimes_{S \in \gamma(C_r)} f_S\big(\alpha(S)^d\big).$$
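Continuing the sketch, Stage 2 can be a plain least-squares fit of $W_S$ from the per-sample Stage 1 estimates, and the root covariance is just an empirical average of tensor products; again all shapes and names are stand-ins, with K = 2 child separators at the root.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d_S, d_S1, d_S2 = 500, 16, 6, 5

# Stage 1 outputs: per-sample estimates of the conditional feature expectations
phi_bar = rng.random((N, d_S))              # estimates of E[phi_S | psi_S]
prod_bar = rng.random((N, d_S1 * d_S2))     # estimates of E[phi_S1 (x) phi_S2 | psi_S]

# Stage 2: solve prod_bar ~= phi_bar @ W_S.T in the least-squares sense,
# so W_S maps phi_bar to the predicted product of children features.
W_S, *_ = np.linalg.lstsq(phi_bar, prod_bar, rcond=None)
W_S = W_S.T                                 # shape (d_S1 * d_S2, d_S)

# Root covariance: empirical mean of the per-sample tensor products of the
# inside-subtree features of the separators around the root clique.
f_S1 = rng.random((N, d_S1))
f_S2 = rng.random((N, d_S2))
T_Cr = np.einsum('ni,nj->ij', f_S1, f_S2) / N   # approximates E[f_S1 (x) f_S2]
print(W_S.shape, T_Cr.shape)
```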

The Inference Algorithm:

Step 1 - Design Tensor Construction:
$$\Phi_S\big(:,\, I(a_1), \ldots, I(a_{N[\alpha(S)]})\big) = f_S\big(a_1, \ldots, a_{N[\alpha(S)]}\big).$$

Step 2 - Initial Leaf Message Generation:
$$\zeta(X) = \begin{cases} \vec{1}, & \text{if } X \text{ is not observed} \\ e_x, & \text{if } X \text{ is observed to have value } x \end{cases}$$
$$m_{C \to C_p} = \Phi_S^{\dagger} \times_{\{\delta(C)\}} \bigotimes_{X \in \delta(C)} \zeta(X)$$
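A minimal sketch of these two inference steps for a leaf clique of the DNA model, where the leaf separator has $\alpha(S) = \{X_i\}$ (a single 4-valued observed base) and the features are plain indicators, so the design tensor is just an identity-like lookup; helper names are mine, and the contraction direction shown is one possible convention.

```python
import numpy as np
from itertools import product

def build_design_tensor(f_S, cardinalities, feat_dim):
    """Phi_S[:, a_1, ..., a_M] = f_S(a_1, ..., a_M) over every joint assignment of alpha(S)."""
    Phi = np.zeros((feat_dim,) + tuple(cardinalities))
    for assignment in product(*[range(c) for c in cardinalities]):
        Phi[(slice(None),) + assignment] = f_S(*assignment)
    return Phi

def zeta(cardinality, observed_value=None):
    """Evidence vector: all-ones if X is unobserved, the indicator e_x otherwise."""
    if observed_value is None:
        return np.ones(cardinality)
    e = np.zeros(cardinality)
    e[observed_value] = 1.0
    return e

# indicator features over alpha(S) = {X_i}, a single DNA base with 4 values
f_S = lambda x: np.eye(4)[x]
Phi_S = build_design_tensor(f_S, [4], 4)          # here a 4 x 4 identity matrix
Phi_S_pinv = np.linalg.pinv(Phi_S)

# leaf message for an observed base (value 2, say 'G'): contract the pseudoinverse
# of the design tensor with the evidence vector along the delta(C) = {X_i} mode
m_up = Phi_S_pinv @ zeta(4, observed_value=2)
print(m_up)                                        # with indicator features, this is just e_x
```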

Step 3 - Message Passing:

Upward, from a clique C to its parent $C_p$:
$$m_{C \to C_p} = W_S \times_{S_1} m_{C_1 \to C} \times_{S_2} \cdots \times_{S_K} m_{C_K \to C} = W_S \underset{j \in \{1,2,\ldots,K\}}{\times_{S_j}} m_{C_j \to C}$$

Downward, from a clique C to its k-th child $C_k$:
$$m_{C \to C_k} = W_S \times_{S} m_{C_p \to C} \underset{j \in \{1,2,\ldots,K\} \setminus k}{\times_{S_j}} m_{C_j \to C}$$
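A minimal sketch of the upward message update for a clique with K = 2 children, treating $W_S$ as a 3-way array with one mode per child separator and one mode for S; the mode ordering is my own assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
d_S, d_S1, d_S2 = 16, 6, 5

W_S = rng.random((d_S, d_S1, d_S2))   # operator at separator S: one mode for S, one per child
m_child1 = rng.random(d_S1)           # incoming message from child clique C_1
m_child2 = rng.random(d_S2)           # incoming message from child clique C_2

# upward message: contract W_S with the child messages along their separator modes
m_up = np.einsum('abc,b,c->a', W_S, m_child1, m_child2)
print(m_up.shape)                      # (d_S,): message from C to its parent C_p
```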

Step 4 - Computing Query Result:
$$\hat{P}\big[\delta(C_Q) \mid \{X_i = x_i\}_{i \in E}\big] \;\propto\; \big(m_{C_p \to C_Q} \times_{S_Q} \Phi_{S_Q}^{\dagger}\big) \odot \big(\Phi_{S_Q} \times_{S_Q} m_{C_Q \to C_p}\big)$$
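And a minimal sketch of the final combination at the query clique, with identity features so that $\Phi_{S_Q}$ and its pseudoinverse are trivial; the two message vectors are stand-ins for the actual upward and downward messages.

```python
import numpy as np

Phi_SQ = np.eye(4)                               # identity features over the query separator
m_from_parent = np.array([0.9, 0.3, 0.2, 0.1])   # stand-in for m_{C_p -> C_Q}
m_from_query  = np.array([0.5, 0.5, 0.8, 0.2])   # stand-in for m_{C_Q -> C_p}

unnormalized = (np.linalg.pinv(Phi_SQ) @ m_from_parent) * (Phi_SQ @ m_from_query)
query_dist = unnormalized / unnormalized.sum()   # estimated P[delta(C_Q) | evidence]
print(query_dist)
```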

The Learning Algorithmic Pipeline: Model Construction → Model Feature Specification → Stage 1A Regression → Stage 1B Regression → Stage 2 Regression → Root Covariance Estimation

The Inference Algorithmic Pipeline: Design Tensor Construction → Initial Leaf Message Generation → Message Passing → Computing Query Results


Proof of the Main Algorithm

Stage 2 Regression: The Stage 2 regression regresses
$$E\big[(\Phi_{S_1} \times_{\{\alpha(S_1)\}} I_{\alpha(S_1)}) \otimes (\Phi_{S_2} \times_{\{\alpha(S_2)\}} I_{\alpha(S_2)}) \otimes \cdots \otimes (\Phi_{S_K} \times_{\{\alpha(S_K)\}} I_{\alpha(S_K)}) \,\big|\, \psi_S\big]$$
onto
$$E\big[\Phi_S \times_{\{\alpha(S)\}} I_{\alpha(S)} \,\big|\, \psi_S\big],$$
so the new linear operator that we learn from the Stage 2 regression would be:
$$W_S = P\Big[\bigcup_{i=1}^{K} \alpha(S_i) \,\Big|\, \alpha(S)\Big] \times_{\{\alpha(S_1)\}} \Phi_{S_1} \times_{\{\alpha(S_2)\}} \Phi_{S_2} \times_{\{\alpha(S_3)\}} \cdots \times_{\{\alpha(S_K)\}} \Phi_{S_K} \times_{\{\alpha(S)\}} \Phi_S^{\dagger}.$$

Root Covariance Estimation:
$$\mathcal{T}_{C_r} = \hat{P}\Big[\bigcup_{S \in \gamma(C_r)} \alpha(S)\Big] \underset{S \in \gamma(C_r)}{\times_{\{\alpha(S)\}}} \Phi_S.$$
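To make the step from these two regression quantities to the closed form of $W_S$ explicit, here is a short derivation sketch in my own notation; it assumes that the children's inside subtrees are conditionally independent of $\psi_S$ given $\alpha(S)$ (the junction tree separator property) and that $\Phi_S$ has full column rank, so that $\Phi_S^{\dagger} \Phi_S = I$:
$$\begin{aligned}
\overline{\bigotimes_{k=1}^{K}\phi_{S_k}}
  &= E\Big[\bigotimes_{k=1}^{K} f_{S_k}\big(\alpha(S_k)\big) \,\Big|\, \psi_S\Big]
   = \Big(\bigotimes_{k=1}^{K}\Phi_{S_k}\Big) \times P\Big[\bigcup_{i=1}^{K}\alpha(S_i) \,\Big|\, \psi_S\Big] \\
  &= \Big(\bigotimes_{k=1}^{K}\Phi_{S_k}\Big) \times P\Big[\bigcup_{i=1}^{K}\alpha(S_i) \,\Big|\, \alpha(S)\Big] \times_{\{\alpha(S)\}} P\big[\alpha(S) \mid \psi_S\big] \\
  &= \underbrace{\Big(P\Big[\bigcup_{i=1}^{K}\alpha(S_i) \,\Big|\, \alpha(S)\Big] \times_{\{\alpha(S_1)\}}\Phi_{S_1} \cdots \times_{\{\alpha(S_K)\}}\Phi_{S_K} \times_{\{\alpha(S)\}}\Phi_S^{\dagger}\Big)}_{W_S}
     \times_S \underbrace{\Big(\Phi_S \times_{\{\alpha(S)\}} P\big[\alpha(S) \mid \psi_S\big]\Big)}_{\bar{\phi}_S}.
\end{aligned}$$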

Message Passing Inference:

Lemma 1. Every upward message that a non-leaf, non-root clique node C sends to its parent node $C_p$ is of the form
$$m_{C \to C_p} = \hat{P}\big[\{X_i = x_i\}_{X_i \in \eta(C)} \,\big|\, \alpha(S)\big] \times_{\{\alpha(S)\}} \Phi_S^{\dagger}.$$

Lemma 2. Every downward message that a non-leaf clique node C sends to any of its child nodes $C_i$ is of the form
$$m_{C \to C_i} = \hat{P}\big[\{X_j = x_j\}_{X_j \in \mu(C_i)}, \, \alpha(S_i)\big] \times_{\{\alpha(S_i)\}} \Phi_{S_i}.$$