Getting Started: Understanding LDA and Infer.NET

  • Question

  • All:

    I am attempting to use Infer.NET to solve a topic modeling problem (or at least evaluate whether it's a good fit) and seem to be somewhat stuck. I've got a substantial corpus of documents (2.5M). I am trying to determine whether the text in newly introduced documents references the same topics as the prior documents. Ideally, I'd like to generate a 'confidence' level for each paragraph in the new document that it belongs to a certain topic, based on the prior corpus, so that, when finished, I could select text from the new document and store it in the RDBMS under the expected category. Is this possible with Infer.NET? Is this a good approach in general for what I'm trying to do? Should I be looking at other ML algorithms? Is my concept even correct?

    Although I have a CS/math-based undergrad degree, it was over 20 years ago, and writing mainly business apps for the last two decades has really dulled my math chops :). Whenever I read the mathematical proofs for the ML patterns I've seen, I am horribly lost. I *think* LDA is a good choice for this, as I understand it attempts to explain why text is similar to prior text, but I am open to suggestions. I seem to be having a hard time getting started with ML, but the example projects look like they're really well done and the community seems sharp for sure.

    If anyone could provide a starting point or some advice I'd appreciate it!  

    Monday, March 9, 2015 2:36 PM

All replies

  • Hi Stephen,

    Have you looked at this example?


    Monday, March 9, 2015 3:48 PM
  • Thanks for responding.  Yes, I did examine that example and that's what really prompted my above questions.  Although I think I understand the code just fine from a process standpoint, I'm having a problem conceptually mapping the ideas expressed to my own use case.  

    For example, the sample above generates a training corpus and stores it in a structure of Dictionary<int, int>[]. I think this is because the sample works entirely with integer keys for example purposes. Given my use case, I would think I need a structure of Dictionary<string, int>[], where the key would be the word and the int would be the count of that word in the document. The number of items in the array would represent the number of training paragraphs I have. I also assume I would do the same for the test words (meaning build a similar structure). Some things I don't know about ML/LDA in general are: 1) Should I strip out stop words? 2) Should sentences be decomposed into bigrams or unigrams? 3) How much training data should I load? Is more always going to be better (performance considerations notwithstanding)?
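
    One common way to bridge the two structures described above is to keep the integer-keyed dictionaries and add a separate vocabulary that maps each word to an integer id. The following is a minimal Python sketch of that idea, not Infer.NET code: the function names and the tiny stop-word list are hypothetical, and it assumes unigrams with stop words removed, which is only one of the options raised in the questions above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lower-case the text, split it into unigrams, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def build_corpus(paragraphs, vocabulary=None):
    """Return (docs, vocabulary).

    docs is a list with one {word_id: count} dict per paragraph, which is the
    Python analogue of Dictionary<int, int>[]. If a vocabulary is passed in
    (for example, the one built from the training paragraphs), it is not
    extended and unseen words are skipped.
    """
    grow = vocabulary is None
    if grow:
        vocabulary = {}
    docs = []
    for text in paragraphs:
        counts = Counter()
        for word in tokenize(text):
            if word not in vocabulary:
                if not grow:
                    continue  # word never seen in training: ignore it
                vocabulary[word] = len(vocabulary)  # assign the next free id
            counts[vocabulary[word]] += 1
        docs.append(dict(counts))
    return docs, vocabulary
```

    Building the training corpus first and then passing its vocabulary to the call for the test paragraphs keeps the word ids consistent between the two sets.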

    In the example, the author defines a symbol 'blei_corpus' that, if present, would build the corpus from a file. Looking through the files in the downloaded sample, I couldn't find this file. It would be helpful to get a look at it, and I wish it were included in the distribution, as it might make things a little clearer.
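
    The thread does not show the layout of that file. If it follows the sparse format distributed with Blei's lda-c code (an assumption, not something confirmed here), each line describes one document as the number of distinct terms followed by id:count pairs. A minimal Python sketch for reading and writing such lines, with hypothetical function names:

```python
def parse_ldac_line(line):
    """Parse 'N id:count id:count ...' into a {word_id: count} dict."""
    fields = line.split()
    if not fields:
        return {}
    declared = int(fields[0])  # number of distinct terms the line claims to hold
    counts = {}
    for field in fields[1:]:
        word_id, count = field.split(":")
        counts[int(word_id)] = int(count)
    if len(counts) != declared:
        raise ValueError(
            f"line declares {declared} terms but contains {len(counts)}"
        )
    return counts


def format_ldac_line(counts):
    """Write a {word_id: count} dict back out as one line, ids ascending."""
    pairs = " ".join(f"{w}:{c}" for w, c in sorted(counts.items()))
    return f"{len(counts)} {pairs}".rstrip()
```

    For instance, the line "3 0:2 5:1 9:4" would describe a document in which word 0 occurs twice, word 5 once, and word 9 four times.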

    My apologies if this all seems so basic to those of you familiar with the library and concepts - sometimes it feels overwhelming trying to understand this stuff but I do feel this is the preferred way to accomplish my goal of classifying paragraph text. 

    Monday, March 9, 2015 4:15 PM
  • Thanks for this John - it is helpful and would allow me to reconstruct a sample file.

    Wednesday, March 11, 2015 12:54 PM