Problem with _TestLDA (Latent Dirichlet Allocation) evaluation on blei_corpus

  • Question

  • Hello,

    I have a problem evaluating _testLDA (Latent Dirichlet Allocation) on D. Blei's corpus.

    I took the corpus from:   www.cs.princeton.edu/~blei/lda-c/ap.tgz

    I suppose that the corpus I downloaded is in the wrong format...

    regards, 

    Michael





    • Edited by _Michael D Wednesday, July 9, 2014 10:31 AM
    Sunday, July 6, 2014 5:49 PM

Answers

  • Unfortunately we are not authorized to supply the Blei corpus.

    The LoadWordCounts method in Utilities.cs documents the expected format:

        "Each line is of the form cnt,wrd1_index:count,wrd2_index:count,..."

    The LoadVocabulary method loads the vocab which is just a list of words, one per line.
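
    As an illustration only, the documented line format could be read with a small parser like the one below. This is a Python sketch, not the code in Utilities.cs; the function name parse_word_counts_line is hypothetical, and it assumes that the leading "cnt" field is the number of index:count pairs on the line, which the quoted comment does not spell out.

```python
def parse_word_counts_line(line, sep=","):
    """Parse 'cnt,idx:count,idx:count,...' into a {word_index: count} dict.

    Assumption: 'cnt' is the number of index:count pairs that follow.
    """
    fields = [f for f in line.strip().split(sep) if f]
    if not fields:
        raise ValueError("empty line")

    declared = int(fields[0])  # leading 'cnt' field
    counts = {}
    for field in fields[1:]:
        index_text, count_text = field.split(":")
        counts[int(index_text)] = int(count_text)

    # Reject lines whose header disagrees with the pairs actually present.
    if declared != len(counts):
        raise ValueError(
            f"line declares {declared} entries but contains {len(counts)}"
        )
    return counts


if __name__ == "__main__":
    print(parse_word_counts_line("3,0:2,5:1,7:4"))
```

    The sep parameter is there so the same sketch can be tried on files that use a different delimiter between fields.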

    John

    Friday, July 11, 2014 10:58 AM
    Owner

All replies

  • Dear John,

    First of all, thank you for your reply!

    If I understand correctly, the format of the corpus should be as follows:

    1) one line per document

    2) instead of the words themselves, their indices in the vocabulary (wrd1_index, wrd2_index, etc.)

    3) the 'count' is the frequency of the word in the current document?

    thank you,

    Michael

    • Edited by _Michael D Monday, July 14, 2014 10:45 AM
    Monday, July 14, 2014 10:45 AM
  • Correct, where the index is zero-based.
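
    To make the three points above concrete, here is a minimal Python sketch that turns one tokenized document into a line of that shape, using zero-based vocabulary indices. The helper names build_word_index and format_document are hypothetical, and the sketch assumes the leading "cnt" field is the number of distinct words in the document and that words missing from the vocabulary are simply skipped.

```python
from collections import Counter


def build_word_index(vocabulary_lines):
    """Map each vocabulary word (one per line) to its zero-based index."""
    words = [line.strip() for line in vocabulary_lines if line.strip()]
    return {word: index for index, word in enumerate(words)}


def format_document(tokens, word_index):
    """Return 'cnt,idx:count,...' for one document.

    Assumptions: 'cnt' is the number of distinct in-vocabulary words,
    and out-of-vocabulary tokens are ignored.
    """
    counts = Counter(word_index[t] for t in tokens if t in word_index)
    fields = [str(len(counts))]
    fields += [f"{index}:{count}" for index, count in sorted(counts.items())]
    return ",".join(fields)


if __name__ == "__main__":
    word_index = build_word_index(["topic", "model", "word"])
    print(format_document(["model", "word", "model", "unknown"], word_index))
```

    For the sample document, "model" (index 1) occurs twice and "word" (index 2) once, so the printed line is 2,1:2,2:1.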

    John

    Monday, July 14, 2014 2:17 PM
    Owner