Classifying strings with Infer.NET (Migrated from community.research.microsoft.com)

  • Question

  • wiseguyeh posted on 08-02-2010 9:38 AM

    Hi,

    I'm looking to use Infer.NET to classify strings, specifically SMTP error messages returned from SMTP servers. Many of the error codes returned are rather misleading, and I'm hoping to train Infer.NET to categorize error messages into two categories: "Unavailable" or "Temporarily Unavailable".


    The strings will be of varying length; will this be a problem?

    Does anyone have any advice that might be useful in using Infer.NET for such a purpose?


    Thank you for your time.

    Friday, June 3, 2011 5:58 PM

Answers

  • John Guiver replied on 08-06-2010 10:39 AM

    I think your analysis for your classifier is pretty reasonable, and I would suggest doing further analysis, building histograms of words relative to your two classes. Some individual words may have good discriminating power on their own, yet lose it when lumped into groups. The rest of your BPM bullets are fine. Make sure you have a bias term, as in the example.
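    To make the histogram idea concrete, here is a minimal sketch of that analysis (Python for brevity, with made-up example messages; this is a preprocessing step and does not involve Infer.NET):

    ```python
    from collections import Counter

    # Toy labelled SMTP error messages (made-up examples).
    messages = [
        ("mailbox unavailable", "Unavailable"),
        ("user unknown, mailbox unavailable", "Unavailable"),
        ("mailbox greylisted, try again later", "Temporarily Unavailable"),
        ("server busy, try again later", "Temporarily Unavailable"),
    ]

    # Build one word histogram per class.
    histograms = {}
    for text, label in messages:
        histograms.setdefault(label, Counter()).update(text.replace(",", "").split())

    # Words that appear in only one class are strong discriminators;
    # words shared by both classes (like "mailbox" here) are not.
    unavailable = set(histograms["Unavailable"])
    temporary = set(histograms["Temporarily Unavailable"])
    discriminating = (unavailable - temporary) | (temporary - unavailable)
    print(sorted(discriminating))
    ```

    On real data you would look at relative frequencies rather than strict presence/absence, but the same per-class counting applies.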

    An important point to note on the BPMs in the 2.3 Beta 4 release:

    Both BPM examples (http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/Bayes%20Point%20Machine%20tutorial.aspx and http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/Multi-class%20classification.aspx) really should add noise to the result of the inner product. So, for example, in the former case:

    double noise = 0.1;
    y[j] = Variable.GaussianFromMeanAndVariance(
        Variable.InnerProduct(w, x[j]).Named("innerProduct"), noise) > 0;

    The multi-class version (the second link above) has an additional issue, which can also be easily addressed and which I will discuss below. Both issues will be fixed in the next release.

    Anyway, try adding different amounts of noise, and see how that affects things.
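    To see why the noise matters: thresholding a noisy score turns the hard step function into a smooth probit likelihood, P(y = true) = Φ(score / √v) where v is the noise variance. A standalone sketch of that effect (plain Python, no Infer.NET):

    ```python
    import math

    def prob_true(score, noise_variance):
        """P(score + e > 0) where e ~ N(0, noise_variance): the probit link."""
        return 0.5 * (1.0 + math.erf(score / math.sqrt(2.0 * noise_variance)))

    # With little noise the classifier is nearly a hard threshold;
    # more noise makes it more tolerant of ambiguous or mislabelled messages.
    for noise in (0.01, 0.1, 1.0):
        print(noise, round(prob_true(0.5, noise), 3))
    ```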

    As regards variable-length strings, I am guessing you want something like this: consider your message as a bag of words. Each word in the total set of messages (i.e. your vocabulary) will have a unique identifier (a feature id). You would filter out any stop words from your vocabulary and messages. Each word in your vocabulary would have a weight. But when you train the model, you only want to show the few words within each message.
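    A minimal sketch of that preprocessing (Python for brevity; the stop-word list and messages are made-up illustrations):

    ```python
    # Build a vocabulary of feature ids over all messages, drop stop words,
    # and represent each message by the ids of the words it contains.
    stop_words = {"the", "is", "a", "to"}  # illustrative only

    messages = [
        "the mailbox is unavailable",
        "server busy try again later",
    ]

    vocab = {}  # word -> feature id
    features_per_message = []
    for text in messages:
        ids = []
        for word in text.split():
            if word in stop_words:
                continue
            if word not in vocab:
                vocab[word] = len(vocab)  # assign the next free feature id
            ids.append(vocab[word])
        features_per_message.append(sorted(set(ids)))

    print(vocab)
    print(features_per_message)
    ```

    At training time you then show the model only these few feature ids per message, rather than a dense vector over the whole vocabulary.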

    The answer is yes, you can do this in Infer.NET, but there are a number of considerations. For example, the standard BPM tries to learn the correlations between the variables by using a VectorGaussian distribution for the weights. This rapidly becomes impractical as your vocabulary size goes up, so it becomes better to consider the weights as independent. What you need now is a BPM in which each datum is a subset of features. This pattern is handled in our multi-class example documented at http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/Sparse%20Multi-class%20Bayes%20Point%20machine.aspx.

    Even though it is multi-class, you can just use two classes. However, now I need to address the issue I mentioned at the beginning. There is a shortcoming in the multi-class versions of the BPM: performance will improve if we use up a degree of freedom by fixing the prior weights for one of the classes to be a point-mass distribution. I suggest you use the BPM_Shared class, and you will need to make the following change in that class's Train method (replacing the original code for setting wInit):

    Gaussian[] wprior0 = new Gaussian[nFeatures];
    Gaussian[] wprior = new Gaussian[nFeatures];
    for (int f = 0; f < nFeatures; f++)
    {
        // Pin class 0's weights to a point mass to use up the spare degree of freedom
        wprior0[f] = Gaussian.PointMass(0.0);
        wprior[f] = Gaussian.FromMeanAndPrecision(0.0, 1.0);
    }
    for (int c = 0; c < nClass; c++)
        trainModel.wInit[c].ObservedValue = (c == 0) ? wprior0 : wprior;

    We also need to add noise (as discussed earlier in this post) in BPMUtils.ComputeClassScores for the sparse BPM (the second of the two ComputeClassScores methods in the BPMUtils class). You will need to add a scorePlusNoise variable array:

        Variable<double>[] scorePlusNoise = new Variable<double>[nClass];
        for (int c = 0; c < nClass; c++)
        {
            ... as before...

            scorePlusNoise[c] = Variable.GaussianFromMeanAndPrecision(score[c], noisePrec);
        }
        return scorePlusNoise;

    Test_BPM_Sparse in Program.cs shows how to call this model. Let me know how you get along.

    John


All replies

  • John Guiver replied on 08-05-2010 3:43 AM

    I am presuming that you want to build some sort of generalisation into this model so that when a new message comes along the model can take a reasonable stab at categorising it (i.e. I am assuming that you don't have the complete list of error messages - otherwise you would not need a classifier).

    There are probably many ways to do this; for example, here are a couple of approaches:

    A simple approach would be to build a set of features from the strings. These could be binary (such as 'is the first digit a 4?' or 'does the message include the string "insufficient"?') or continuous (such as the length of the string). You could then feed these into a standard classifier such as a Bayes point machine. This approach would require quite a bit of upfront analysis to pick sensible features.
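    For instance, the kind of feature extraction described above might look like this (a Python sketch, not Infer.NET code; the particular features are just the ones named in this post):

    ```python
    def extract_features(message):
        """Turn an SMTP error string into a fixed-length feature vector:
        two binary features and one (unscaled) continuous feature."""
        first_digit_is_4 = 1.0 if message[:1] == "4" else 0.0
        mentions_insufficient = 1.0 if "insufficient" in message.lower() else 0.0
        length = float(len(message))
        return [first_digit_is_4, mentions_insufficient, length]

    print(extract_features("452 insufficient system storage"))
    ```

    Each message then maps to the same fixed-length vector regardless of how long the raw string is, which is what a standard classifier expects.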

    Another approach would be to build some sort of supervised topic model (such as supervised LDA) which is directly based on the word content of the messages.

    The modeling for these approaches could be done in Infer.NET. But you would need to decide what your model will be before we can give advice on Infer.NET usage.

    John

  • wiseguyeh replied on 08-05-2010 4:57 AM

    Hi John, thank you for taking the time to read and reply to my post, and yes, you are correct: I wish to build a generalization model that can classify SMTP error messages (they vary greatly from SMTP server to SMTP server, but often contain similar words, just in varying sentence structures).

    I took a stab at using a Bayes point machine but was faced with disappointing results. As this is my first attempt at doing anything with string classification, I wasn't expecting much.

    If I explain my approach, you may be able to identify things I could improve upon:


    • I used a Bayes point classifier, as per the example at http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/Bayes%20Point%20Machine%20tutorial.aspx
    • I split up the various messages into tokens and found some of the most commonly occurring tokens (such as "spam", "email", "unavailable"), and then made groupings of tokens with similar meaning (e.g. the grouping "spam words" contained "spam", "content", "filter", "junk"). These groups of tokens then became my inputs (counting the number of times a token in a group appeared).
    • There were also some boolean inputs, such as "error message contains email address" and "error message contains url".
    • I scaled these inputs so that they ranged from 0.0 to 1.0.
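    That scaling step (per-feature min-max normalisation) can be sketched as follows (Python, with made-up raw counts):

    ```python
    def min_max_scale(values):
        """Rescale a list of raw feature values to the range [0.0, 1.0]."""
        lo, hi = min(values), max(values)
        if hi == lo:  # constant feature: map everything to 0
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]

    # e.g. counts of "spam words" per message across a training set
    print(min_max_scale([0, 1, 3, 4]))
    ```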


    Basically, I got some very disappointing results. I figure this may be due to:


    • The wrong type of Bayes point classifier (is there a type that just accepts boolean values?)
    • Too few training examples (I used about 50; how many should be used to get solid classification results?)
    • The wrong type of inputs (this is something I will play around with once I have some confirmation that I am not wasting my time!)

    Also, would it be possible to build a classifier that can accept varying-length strings? This way, I could ask some Bayes machine to classify a string, having given it a list of examples of good/bad strings, and I would not have to attempt to find the "important" information within an error message myself. The little I know about this field, however, tells me this might be impossible/impractical/beyond my ability.


  • wiseguyeh replied on 08-06-2010 11:15 AM

    Hi John,

    The addition of both bias and noise has affected my results massively! It appears to be classifying exactly as intended now; thank you so much for your guidance.

    As for the rest of the post, I will examine your suggestion for the sparse Bayes machine next week with a fresh head, but for now my noisier, biased Bayes point classifier works like a charm!
