LDA samples performance (Migrated from community.research.microsoft.com)

  • Question

  • andym posted on 03-15-2011 11:19 PM

    I seem to be having performance issues when running the LDA sample models provided as part of the SDK (LDAShared) on medium-sized document collections. A third-party LDA implementation processes the same collections in several minutes, while the Infer.NET LDA example takes around 2.5 hours.

    Repro steps:

    1. Download the sample corpus that accompanied the original LDA paper, which contains ~2500 documents (http://www.cs.princeton.edu/~blei/lda-c/ap.tgz).
    2. Download MALLET from http://mallet.cs.umass.edu and run a 10-topic LDA model on the corpus from step 1 (the corpus needs to be converted to individual .txt files). Observe that inference converges in 50 seconds, after around 500 iterations. I also have an alternative LDA implementation based on a collapsed Gibbs sampler that converges in 2500 iterations, in around 10 minutes, on the same sample.
    3. Use the same corpus with LDAShared (this requires writing a simple tokenizer for the corpus, which I can provide if needed), or alternatively set numTopics = 10; sizeVocab = 10000; numTrainDocs = 2250; averageDocumentLength = 400; in the sample code to “simulate” the corpus from step 1.
    4. Run LDAShared “sparse”. It takes around 8000 seconds on the “simulated” corpus.
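
    To check that the "simulated" settings in step 3 roughly match the real corpus, the statistics can be read from a bag-of-words file. The following is a minimal Python sketch, assuming the archive contains the corpus in the LDA-C sparse count format (one line per document: `N id:count id:count ...`). The function names are hypothetical and are not part of the SDK or of LDA-C.

```python
from typing import Dict, Iterable


def parse_ldac_line(line: str) -> Dict[int, int]:
    """Parse one 'N id:count id:count ...' line into {word_id: count}."""
    fields = line.split()
    declared_terms = int(fields[0])
    pairs = fields[1:]
    if declared_terms != len(pairs):
        raise ValueError(
            f"expected {declared_terms} id:count pairs, found {len(pairs)}"
        )
    counts: Dict[int, int] = {}
    for pair in pairs:
        word_id, count = pair.split(":")
        counts[int(word_id)] = counts.get(int(word_id), 0) + int(count)
    return counts


def corpus_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Summarise a corpus: document count, distinct words, mean length."""
    docs = [parse_ldac_line(line) for line in lines if line.strip()]
    if not docs:
        raise ValueError("corpus contains no documents")
    total_tokens = sum(sum(doc.values()) for doc in docs)
    vocabulary = {word_id for doc in docs for word_id in doc}
    return {
        "num_docs": len(docs),
        "size_vocab": len(vocabulary),
        "average_document_length": total_tokens / len(docs),
    }
```

    Passing an open file handle for the corpus file to `corpus_stats` would give values to compare against numTrainDocs, sizeVocab and averageDocumentLength. Note that `size_vocab` counts only the words that actually occur in the file.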

    I wonder whether the LDA reference samples provided with the SDK are as fast as they can be with the framework, or whether improvements can be made.


    -- Andy

    Friday, June 3, 2011 6:35 PM


  • John Guiver replied on 03-16-2011 6:09 AM

    Here are some areas for speed-up.

    1. By default, the example code for the shared versions uses batchCount = numTrainDocs. This is the extreme low-memory case and therefore the slowest. You should be able to set batchCount = 200, for example, which should provide a reasonable speed-up whilst still being memory efficient. This showed 2724 seconds/1527MB on my system (test perplexity 19.433).
    2. The example implementation 'generates' each word separately, whereas I suspect the other implementations generate the word counts. Modifying the model to use a sparse Multinomial factor rather than a set of Discrete factors should make the biggest difference.
    3. The iteration schedule is fixed and fairly conservative; you could monitor convergence to determine an earlier stopping point.
    4. The model could be simplified, for example by (a) not calculating evidence and (b) removing the part of the model that allows for independent priors on hyperparameters.
    5. It is possible that the word distributions per topic are not sparse, in which case the sparse representations will be slower than the dense representations; you can set the tolerances on the sparsity to control for this.
    6. Make sure you are not running in the debugger.
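
    Points 1 to 3 can be illustrated outside the framework. The following is a rough Python sketch rather than Infer.NET code: all function names are hypothetical, and `step` stands in for whatever runs one inference iteration and returns a score to monitor (for example a log-likelihood or perplexity estimate).

```python
from collections import Counter
from typing import Callable, Dict, List


def to_word_counts(tokens: List[int]) -> Dict[int, int]:
    """Point 2: collapse a token sequence into {word_id: count}."""
    return dict(Counter(tokens))


def split_into_batches(num_docs: int, batch_count: int) -> List[range]:
    """Point 1: split document indices into near-equal contiguous batches."""
    base, extra = divmod(num_docs, batch_count)
    batches: List[range] = []
    start = 0
    for batch_index in range(batch_count):
        # The first `extra` batches take one additional document.
        size = base + (1 if batch_index < extra else 0)
        batches.append(range(start, start + size))
        start += size
    return batches


def run_until_converged(
    step: Callable[[], float], max_iters: int, rel_tol: float = 1e-4
) -> int:
    """Point 3: stop once the monitored score changes by less than rel_tol.

    Returns the number of iterations actually run.
    """
    previous = None
    for iteration in range(1, max_iters + 1):
        score = step()
        if previous is not None:
            change = abs(score - previous)
            if change <= rel_tol * max(abs(previous), 1e-12):
                return iteration
        previous = score
    return max_iters
```

    With numTrainDocs = 2250 and batchCount = 200, `split_into_batches` gives batches of 11 or 12 documents each, compared with one document per batch under the default setting.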
    Friday, June 3, 2011 6:36 PM