LDA Topic Modelling: word topic assignment

  • Question

  • I am using the LDA sample in Infer.NET and trying to extend it to get per-word topic assignments (or distributions).

    Below is how I extended it (the main change is the Zdn array of per-word topic assignments). The problem is that inference assigns every word in a document to the same topic (the topic of the document).

    For example, if doc1 was assigned topic10 (with maximum probability), all the words in that document also get topic 10 as their assignment. Hardly any probability mass goes to the other topics. Am I doing something wrong?


    NumWordsInDoc = Variable.New<int>().Named("NumWordsInDoc");
    Range W = new Range(SizeVocab).Named("W");
    Range T = new Range(NumTopics).Named("T");
    Range WInD = new Range(NumWordsInDoc).Named("WInD");

    ThetaPrior = Variable.New<Dirichlet>().Named("ThetaPrior");
    Theta = Variable<Vector>.Random(ThetaPrior).Named("Theta");
    PhiPrior = Variable.Array<Dirichlet>(T).Named("PhiPrior");
    Phi = Variable.Array<Vector>(T).Named("Phi");
    Phi[T] = Variable.Random<Vector, Dirichlet>(PhiPrior[T]);

    Words = Variable.Array<int>(WInD).Named("Words");
    WordCounts = Variable.Array<double>(WInD).Named("WordCounts");

    //Topic assignment for words in a document
    Zdn = Variable.Array<int>(WInD).Named("WordTopics");
    using (Variable.ForEach(WInD))
        Zdn[WInD] = Variable.Discrete(Theta).Attrib(new ValueRange(T));
    using (Variable.Repeat(WordCounts[WInD]))
        //WordTopic[WInD].SetTo(Variable.Discrete(Theta).Attrib(new ValueRange(T)));
        using (Variable.Switch(Zdn[WInD]))
            Words[WInD] = Variable.Discrete(Phi[Zdn[WInD]]);

    Engine = new InferenceEngine(new VariationalMessagePassing());

    Tuesday, June 25, 2013 9:01 PM

All replies

  • You will not be able to do this within the training model because of the Repeat factor.

    I would use or adapt the prediction model (LDAPredictionModel.cs). For a single word you can use the model as is, but you need to observe the word (along with the thetas and phis) and infer the topic. For multiple words, you can adapt the model so that the word and topic variables are indexed by the range of words in the document.
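
    As a rough illustration of what inferring the topic of an observed word amounts to, here is a minimal plain C# sketch that does not use Infer.NET. It assumes theta and phi are fixed point estimates (for example, posterior means from training), which ignores the uncertainty Infer.NET would carry in their posteriors. Under that assumption, the posterior over the topic of a word w is proportional to theta[k] * phi[k][w]. The class name WordTopicPosterior, the method Infer, and all numbers are hypothetical.

    ```csharp
    using System;
    using System.Linq;

    static class WordTopicPosterior
    {
        // Returns p(z = k | w, theta, phi) for every topic k.
        // Assumption: theta and phi are fixed point estimates, so the
        // posterior is proportional to theta[k] * phi[k][word].
        public static double[] Infer(double[] theta, double[][] phi, int word)
        {
            double[] unnormalized = theta.Select((t, k) => t * phi[k][word]).ToArray();
            double sum = unnormalized.Sum();
            return unnormalized.Select(u => u / sum).ToArray();
        }

        public static void Main()
        {
            // Hypothetical values: 2 topics, 3-word vocabulary.
            double[] theta = { 0.9, 0.1 };          // document-topic proportions
            double[][] phi =
            {
                new[] { 0.7, 0.2, 0.1 },            // topic 0 word distribution
                new[] { 0.01, 0.09, 0.9 },          // topic 1 word distribution
            };
            int[] wordsInDoc = { 0, 2 };            // observed word ids

            foreach (int w in wordsInDoc)
            {
                double[] posterior = Infer(theta, phi, w);
                Console.WriteLine($"word {w}: " +
                    string.Join(", ", posterior.Select(p => p.ToString("F3"))));
            }
        }
    }
    ```

    With these made-up numbers, word 0 goes almost entirely to topic 0, while word 2 splits evenly between the two topics even though theta strongly favours topic 0. Per-word assignments can therefore differ from the document's dominant topic once each word is observed individually.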


    Thursday, June 27, 2013 9:20 AM