Error occurs when constructing an LDA model with a simple corpus (Migrated from community.research.microsoft.com)

  • Question

  • xgear posted on 02-21-2009 4:16 AM

    Setting of the model:

    Number of documents in corpus: 12

    Number of topics: 3

    Number of words (terms) in corpus: 12

    For simplicity, suppose that each document is composed of 2 words.

     


    The whole corpus is shown in the table below, with each line representing a document: the left column is the original corpus, and the right column is the same corpus after indexing.

    Original corpus          After indexing
    university test          0, 3
    teacher student          0, 1
    teacher university       0, 2
    university student       1, 2
    economy bank             4, 7
    economy money            4, 5
    stock economy            4, 6
    money stock              5, 6
    government policy        8, 11
    government president     8, 9
    government military      8, 10
    president policy         9, 10

     

    After running the program, the console window shows:

    Compile model.....compilation failed.

    Then a "transform chain" window shows the message "can only be indexed by loop variables, not index0". The error seems to occur near (in) the two nested "using" blocks of the source code.

     

    By the way, can a jagged array provide an array of arrays where the length of the inner array is not fixed, so that I can remove the restriction that each document is composed of 2 words?

     

     

    Your help is appreciated!


     

    ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

            static void Main(string[] args)

            {

     

                int  M = 12;//number of documents in corpus

                int  K = 3;//number of topics

                int V = 12; //number of words(terms) in corpus

                int Nm = 2;//suppose that each document is composed of 2 words

     

                Range CorpusSize = new Range(M);

                Range TopicsNum = new Range(K);

                Range WordsNum = new Range(V);

                Range DocSize = new Range(Nm);

     

                double[] alpha={ 0.5, 0.5, 0.5 };

                double[] beta = { 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 };

     

                VariableArray<Vector> theta = Variable.Array<Vector>(CorpusSize);

                VariableArray<Vector> phi = Variable.Array<Vector>(TopicsNum);

                theta[CorpusSize] = Variable.Dirichlet(alpha).ForEach(CorpusSize);

                phi[TopicsNum] = Variable.Dirichlet(beta).ForEach(TopicsNum);

                VariableArray2D<int> W = Variable.Array<int>(CorpusSize, DocSize);

                VariableArray2D<int> Z = Variable.Array<int>(CorpusSize, DocSize);

                using (Variable.ForEach(CorpusSize))

                {

                    using (Variable.ForEach(DocSize))

                    {

                        Z[CorpusSize, DocSize] = Variable.Discrete(theta[CorpusSize]);

                        W[CorpusSize, DocSize] = Variable.Discrete(phi[Z[CorpusSize, DocSize]]);

                    }

                }

                W = Variable.Observed(new int[,] { { 0, 3 }, { 0, 1 }, { 0, 2 }, { 1, 2 }, { 4, 7 }, { 4, 5 }, { 4, 6 }, { 5, 6 }, { 8, 11 }, { 8, 9 }, { 8, 10 }, { 9, 10 } }, CorpusSize, DocSize);

                InferenceEngine engine = new InferenceEngine();

                Console.WriteLine(engine.Infer(Z));

               

        }

     



All replies

  • laura replied on 02-22-2009 10:05 AM

    Hi,

     

    Since W depends on certain choices for Z, you have to add a gate (Variable.Switch).

    Furthermore, you have to set the ValueRange attribute on Z, so Infer.NET knows which values the gate ranges over.

    Use the following code and your model will compile.

                        Z[CorpusSize, DocSize] = Variable.Discrete(theta[CorpusSize]).Attrib(new ValueRange (TopicsNum));

                        using(Variable.Switch(Z[CorpusSize, DocSize]))
                        {
                            W[CorpusSize, DocSize] = Variable.Discrete(phi[Z[CorpusSize, DocSize]]);
                        }

     

     

     

    Laura

     

    Friday, June 3, 2011 5:23 PM
  • laura replied on 02-22-2009 10:24 AM

    I just came across a flaw in your code.

    In your example, you first create a data structure for W and wire it into the model. Then you redefine W using a new observed data structure, which is not linked to the model. Since the data is not linked, Infer() returns inference results based only on the prior.

    You have to define your observed variable W as such up front.

    Instead of

                VariableArray2D<int> W = Variable.Array<int>(CorpusSize, DocSize).Named("W");

     

    use the following line (and omit the later reassignment):
                VariableArray2D<int> W = Variable.Observed(new int[,] { { 0, 3 }, { 0, 1 }, { 0, 2 }, { 1, 2 }, { 4, 7 }, { 4, 5 }, { 4, 6 }, { 5, 6 }, { 8, 11 }, { 8, 9 }, { 8, 10 }, { 9, 10 } }, CorpusSize, DocSize);

     

     

    Another thing is that you have to break symmetry; otherwise all phis will be identical.

    To break symmetry slightly, create a dense Dirichlet (denseBeta), draw K times from it using dirich.Sample(), convert the draws to an Infer.NET distribution array, and call phi.InitialiseTo():

                double[] denseBeta = new double[V];
                for (int v = 0; v < V; v++) denseBeta[v] = 10.0;

                Dirichlet[] initPhi = new Dirichlet[K];
                Dirichlet dirich = (new Dirichlet(denseBeta));
                for (int k = 0; k < K; k++)
                {
                    initPhi[k] = new Dirichlet(dirich.Sample());
                }
                phi.InitialiseTo(Distribution<Vector>.Array(initPhi));

     

    Laura

     

  • laura replied on 02-22-2009 10:31 AM

    To answer your final question: yes, using jagged arrays, documents can have different lengths. If you need an example, in John Guiver's post in the Bernoulli thread (http://community.research.microsoft.com/forums/p/2779/4511.aspx#4511 ) "e" is a jagged random variable array. Note that "sRange" is a variable range depending on "uRange".
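    As a minimal sketch of the pattern (illustrative names, not taken from John's post, using the same Infer.NET API as elsewhere in this thread):

                // A jagged variable array whose inner length varies per row:
                int[] sizes = { 2, 3, 4 };                                   // per-document lengths (illustrative)
                Range doc = new Range(sizes.Length);
                VariableArray<int> sizeVar = Variable.Observed(sizes, doc);
                Range word = new Range(sizeVar[doc]);                        // inner range depends on the outer range
                var words = Variable.Array(Variable.Array<int>(word), doc);  // words[doc][word], rows of different lengths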

     

    Laura

  • John Guiver replied on 02-23-2009 6:41 AM

    Just to summarise everything Laura has noted (many thanks Laura), including the jagged array stuff, here is a modified version of your C# code that will compile and run:

    static void Main(string[] args)
    {
        int K = 3;   // number of topics
        int V = 12;  // number of words (terms) in corpus

        // Documents of variable length
        int[][] docs = {
            new int[] { 0, 3, 4 },
            new int[] { 0, 1 },
            new int[] { 0, 2, 4, 5 },
            new int[] { 1, 2 },
            new int[] { 4, 7 },
            new int[] { 4, 5 },
            new int[] { 4, 6 },
            new int[] { 5, 6 },
            new int[] { 8, 11 },
            new int[] { 8, 9 },
            new int[] { 8, 10 },
            new int[] { 9, 10 }};

        // Put the sizes into an array
        int M = docs.Length;
        int[] sizes = new int[M];
        for (int i = 0; i < M; i++)
            sizes[i] = docs[i].Length;

        // Set up the ranges
        Range CorpusSize = new Range(M);
        Range TopicsNum = new Range(K);
        Range WordsNum = new Range(V);
        VariableArray<int> docSizeVar = Variable.Observed(sizes, CorpusSize);
        Range DocSize = new Range(docSizeVar[CorpusSize]);

        double[] alpha = { 0.5, 0.5, 0.5 };
        double[] beta = { 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1 };
        VariableArray<Vector> theta = Variable.Array<Vector>(CorpusSize);
        VariableArray<Vector> phi = Variable.Array<Vector>(TopicsNum);
        theta[CorpusSize] = Variable.Dirichlet(alpha).ForEach(CorpusSize);
        phi[TopicsNum] = Variable.Dirichlet(beta).ForEach(TopicsNum);

        // Break symmetry by initialising phi marginals
        Vector denseBeta = new Vector(V, 10.0);
        Dirichlet[] initPhi = new Dirichlet[K];
        Dirichlet dirich = new Dirichlet(denseBeta);
        for (int k = 0; k < K; k++)
            initPhi[k] = new Dirichlet(dirich.Sample());
        phi.InitialiseTo(Distribution<Vector>.Array(initPhi));

        var Z = Variable.Array(Variable.Array<int>(DocSize), CorpusSize);
        var W = Variable.Array(Variable.Array<int>(DocSize), CorpusSize);
        W.ObservedValue = docs;

        using (Variable.ForEach(CorpusSize))
        {
            using (Variable.ForEach(DocSize))
            {
                Z[CorpusSize][DocSize] = Variable.Discrete(theta[CorpusSize]).Attrib(new ValueRange(TopicsNum));
                using (Variable.Switch(Z[CorpusSize][DocSize]))
                {
                    W[CorpusSize][DocSize] = Variable.Discrete(phi[Z[CorpusSize][DocSize]]);
                }
            }
        }
        InferenceEngine engine = new InferenceEngine();
        Console.WriteLine(engine.Infer(Z));
    }

  • xgear replied on 02-23-2009 8:42 AM

    thanks

  • Junming Huang replied on 03-05-2009 8:27 PM

    What should the return type of engine.Infer(Z) be in the last line? I want to store the posterior distribution of Z in a local variable for future use. I tried several types, but none seemed to work.

  • Junming Huang replied on 03-06-2009 3:10 AM

    Oh, it seems to work if I use a variable of type DistributionArray<DistributionRefArray<Discrete, int>>.

    Thanks all

  • John Guiver replied on 03-06-2009 5:09 AM

    Although what you have is correct in this case, DistributionArray, DistributionRefArray, and the other distribution array classes are not designed to be used via the API - Infer.NET may use any one of a number of classes to internally represent distribution arrays, choosing the most efficient representation for the model. However, they can all be referenced via the IDistribution<> interface.

    We encourage you to use either one of the following two approaches, depending on what you want to do with the posterior. 

     IDistribution<int[][]> ZPostAsDistribution = engine.Infer<IDistribution<int[][]>>(Z);

    Discrete[][] ZPostAsArray = Distribution.ToArray<Discrete[][]>(engine.Infer(Z));

    We are looking at possibly making the second case more succinct in a future release by just allowing Discrete[][] to be a type parameter for the Infer method.

    John G.

  • laura replied on 03-06-2009 5:54 AM


    Hi John,

     

    Hiding the Ref/Struct arrays is a cool thing I wasn't yet aware of.

     

    Unfortunately I cannot make it work in F#. I tried the following, but the compiler complains "The field, constructor or member 'ToArray' is not defined." This is particularly funny since I can select the method from the member list of the Distribution class.

     

     

            let infResult = inferenceEngine.Infer<IDistribution<Beta[]>>(epsilon)

            let infResultObj = inferenceEngine.Infer<obj>(epsilon)

            let epsilonPostAsArray = Distribution.ToArray<Beta[]>(infResultObj)       

     

    Is there anything special about this method?

     

    Laura

  • John Guiver replied on 03-06-2009 6:03 AM

    I think that in F# you currently need to use Distribution< >.ToArray rather than Distribution.ToArray. This is an F# bug that has been logged - it occurs when you have a generic and a non-generic version of the same class name, and the non-generic version (Distribution in our case) has a generic method (ToArray in our case).

    John

  • laura replied on 03-06-2009 6:12 AM

    I tried the following as well, but I still get the same error. I rebuilt everything, just in case. Still no success.

     

    let epsilonPostAsArray = Distribution<_>.ToArray<Beta[]>(infResultObj)       

    let epsilonPostAsArray = Distribution<Beta>.ToArray<Beta[]>(infResultObj)       

    // just in case I was referencing the wrong class

    let epsilonPostAsArray = MicrosoftResearch.Infer.Distributions.Distribution<_>.ToArray<Beta[]>(infResultObj)       

     

    I find it strange that the following expression does not give compile errors.

    let x = Distribution.Equals(infResult, infResultObj)

    That is why I wonder what might be so special about the ToArray method.

     

    Laura

  • John Guiver replied on 03-06-2009 6:18 AM

    You must have a space rather than an underscore in Distribution< >.

    John

  • laura replied on 03-06-2009 6:28 AM

    Thanks, John!

  • freddycct replied on 08-13-2009 9:00 PM

    Hi,

    May I know the mathematical reason for breaking symmetry? What's so bad about all phis being identical? If we supply the data, the model learns and adapts accordingly, so I am not sure why we have to break symmetry.

  • freddycct replied on 08-13-2009 9:58 PM

    I ran the proposed LDA code and commented out the symmetry-breaking code. The inference then returns uniform results for the inferred variables.

    I read the mixture of Gaussians tutorial and it states that breaking symmetry is a consequence of using approximate inference algorithms such as VMP. Can I confirm my understanding?

    1. We break symmetry because of the approximate inference algorithms.
    2. If we use exact algorithms, do we still break symmetry?
    3. Will exact inference algorithms such as Junction Trees be supported in the future?

  • jwinn replied on 08-14-2009 4:26 AM

    The reason we need to break symmetry is that the model is not identifiable. Suppose we generated some data from the model with known phis, e.g. corresponding to topics 1=education, 2=health and 3=economy. Now suppose we relabel the topics so that 1=health and 2=education, and swap the parameters accordingly, i.e. we swap phi1 and phi2 and we swap the first two elements of theta. If we generate from this new model, then we get data with exactly the same distribution as before the swap. In fact, this will be true of any permutation of how we label the topics, because the model is symmetric with respect to the topics.
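    In symbols (a sketch of the invariance just described): if $\sigma$ is any permutation of the topic labels $\{1, \dots, K\}$, the probability of a word $w$ is unchanged when theta and the phis are relabelled together:

    $$p(w \mid \theta, \phi) \;=\; \sum_{k=1}^{K} \theta_k \, \phi_{k,w} \;=\; \sum_{k=1}^{K} \theta_{\sigma(k)} \, \phi_{\sigma(k),w}.$$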

    Now suppose we have some data and don't know the phis but wish to infer their posterior distribution. The true posterior will be a multi-modal distribution with one mode for every possible permutation of the topics. However, both EP and VMP can only capture a single posterior mode. Since there is no reason for the inference procedure to favour one mode over another, the symmetry in the model means that the updates will be exactly the same for all phis and all elements of theta, and the inference will get stuck in an unstable equilibrium where all phis are the same and theta is uniform. To escape from this symmetrical fixed point, we need to perturb the system somehow, to arbitrarily break the symmetry between the different topic permutations. We can do this by making the initial messages slightly different for each topic - this means that the algorithm is started slightly closer to one posterior mode than the others and it can converge on that mode, corresponding to some particular permutation of the topics.

    In summary, there is nothing wrong with the phis being identically distributed in the model before we see data.  After seeing data, the non-identifiable model will have a set of posterior modes corresponding to all possible permutations of the topics.  Neither VMP nor EP can capture such highly multi-modal distributions so we need to nudge them slightly towards one of the modes.

  • jwinn replied on 08-14-2009 4:45 AM

    To expand further on your second question: if we were able to perform exact inference, we would recover the full multi-modal posterior and symmetry breaking would not be necessary. However, exact inference is not tractable in this model. In general, exact inference is only tractable for relatively small, discrete models (with some exceptions). Hence, most kinds of models that people are interested in using today (such as LDA!) are not tractable for exact inference. For this reason, supporting junction trees is relatively low on our priority list - note also that there are plenty of existing software packages for junction tree inference in discrete models.

  • laura replied on 08-14-2009 5:05 AM

    With LDA we want to learn topics that are hidden in text documents. Phi refers to the word distributions that are characteristic of each topic. If all those phis are identical, we have found that all topics are identical. The question is whether all topics in the data are indeed identical (which I doubt is true if you have realistic data sets) or whether the inference got stuck at "a saddle point in optimization space". (I use the word saddle point here in a somewhat figurative manner.)

    The problem with LDA is that any permutation of topic indices gives an equally good solution.

    Breaking the symmetry refers to initializing the messages to non-uniform values, i.e. instead of starting the inference loop at a position that is likely to be a saddle point, we start a bit next to it. This initialization should "wash out" during the iterations (just as a Gibbs sampling initialization will not have any effect on the final result).

    You can achieve a similar effect by perturbing each phi's prior a bit, but this will not wash out during inference.
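    A minimal sketch of that second option (illustrative only; it reuses V, K, TopicsNum and phi from John's code above, and assumes Variable.Dirichlet accepts a per-topic pseudo-count variable):

                // Give each topic a slightly different Dirichlet prior.
                // Unlike message initialisation, this perturbation is part of the model and does not wash out.
                Random rng = new Random(0);
                Vector[] perturbedBeta = new Vector[K];
                for (int k = 0; k < K; k++)
                {
                    double[] b = new double[V];
                    for (int v = 0; v < V; v++) b[v] = 0.1 + 0.01 * rng.NextDouble(); // small jitter
                    perturbedBeta[k] = new Vector(b); // same Vector API as John's code above
                }
                VariableArray<Vector> betaVar = Variable.Observed(perturbedBeta, TopicsNum);
                phi[TopicsNum] = Variable.Dirichlet(betaVar[TopicsNum]);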

    Laura

     

  • freddycct replied on 08-14-2009 9:39 PM

    Thank you all for the replies. I think things will be clearer when I learn about variational inference in my graphical model course.

  • msdy replied on 11-12-2009 9:32 PM

    Hi John,

    I ran this model with Gibbs sampling, but it failed during compilation. Any ideas?

     

    // Use Gibbs sampling

                GibbsSampling gs = new GibbsSampling();

                gs.BurnIn = 100;

                gs.Thin = 10;

                InferenceEngine ie = new InferenceEngine(gs);

                ie.NumberOfIterations = 2000;

                Console.WriteLine(ie.Infer(Z));

     

  • minka replied on 11-13-2009 8:07 AM

    The Gibbs sampler is still in the experimental stages and it does not yet support 'Variable.Switch' or 'Variable.If'.

  • msdy replied on 11-13-2009 9:47 PM

    Thanks for the reply. Now I have another question.

    Not sure if I understand it correctly, but please help.

    In this implementation, each document is denoted by indexed words, and each word is sampled from a topic’s word distribution. The example shows each word appearing only once in a document.

    A question arises here: there is no dimensionality reduction for documents, since word counts are not used in this model. If a document includes several repeated words, each occurrence is regarded as a different word, and the output of the code is an inference for each individual occurrence.

    For example, if I replace the docs in John’s code with

    // Documents of variable length

                int[] block1 = System.Linq.Enumerable.Repeat(0, 1000).ToArray();

                int[] block2 = System.Linq.Enumerable.Repeat(1, 2000).ToArray();

                int[] block3 = System.Linq.Enumerable.Repeat(8, 1000).ToArray();

                int[] block4 = System.Linq.Enumerable.Repeat(11, 1500).ToArray();

     

                int[] doc1 = block1.Concat(block2).ToArray();

                int[] doc2 = block3.Concat(block4).ToArray();

                int[] doc3 = block1.Concat(block4).ToArray();

                int[] doc4 = block2.Concat(block3).ToArray();  

     

                int[][] docs = {

                                   doc1,

                                   doc2,

                                   doc3,

                                   doc4

                               };

     Even though there are only 4 unique words (indexed 0, 1, 8, 11) in the corpus, the model treats every single word occurrence in the documents as distinct. This is not efficient.

    Did I understand it right? How do we handle this situation?

    Thank you.
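    One possible way to handle this (a hedged sketch, assuming the Repeat blocks added in later Infer.NET releases, which scale the messages from a sub-model by a count; all names are illustrative): let DocSize range over the unique words of each document, and observe a count for each unique word.

                var uniqueWords = Variable.Array(Variable.Array<int>(DocSize), CorpusSize);   // unique word ids per doc
                var wordCounts = Variable.Array(Variable.Array<double>(DocSize), CorpusSize); // count of each unique word
                using (Variable.ForEach(CorpusSize))
                {
                    using (Variable.ForEach(DocSize))
                    {
                        // Repeat scales the evidence from this word by its count
                        using (Variable.Repeat(wordCounts[CorpusSize][DocSize]))
                        {
                            var topic = Variable.Discrete(theta[CorpusSize]).Attrib(new ValueRange(TopicsNum));
                            using (Variable.Switch(topic))
                            {
                                uniqueWords[CorpusSize][DocSize] = Variable.Discrete(phi[topic]);
                            }
                        }
                    }
                }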
