[Contd] Implementing Online Bernoulli Mixture Model

  • Question

  • As a continued effort to implement a working and practical tool for binary data clustering, I tried to extend the existing model to an online learning setting. See the first post on this forum for the simple one-shot model & inference.

    The following extension performs the learning in batches of data. The posteriors of the parameters are passed to the next iteration, which consumes the next portion of data.

    The code appears to be running although I suspect there are some issues:

    • Every batch re-compiles the model (I expected the model to be compiled only once). I am not sure what makes the model change from batch to batch;
    • In every batch, inference is run twice. It looks like this happens once for the two posteriors and once for the "accumulator" variables that are attached to them.

    First, I set up the necessary constants:

    const int numData = 10000;          // total number of examples in a dataset
    const int numDims = 784;            // dimensionality of an example
    const int numClusters = 10;         // requested number of clusters
    const int batchSize = 100;          // number of examples to use in a batch

    The data is loaded from a CSV file with 10,000 lines. Each line has a label (ignored) and 784 binary variables.

    // Get data from: https://pjreddie.com/projects/mnist-in-csv/
    string filePath = @"C:\Users\vladi\Documents\Visual Studio 2017\Projects\BernoulliMixture\mnist_test.csv";
    StreamReader sr = new StreamReader(filePath);
    
    bool[][] data = new bool[numData][];
    
    int row = 0;
    while (!sr.EndOfStream)
    {
        string[] line = sr.ReadLine().Split(',');
        data[row] = new bool[numDims];
        for (int i = 1; i <= numDims; i++)   // line[0] is the label; pixels are line[1..numDims]
            data[row][i - 1] = line[i] != "0";
        row++;
    }
    sr.Close();


    Next come the declarations of the model variables:

    Variable<int> nItems = Variable.New<int>();
    
    Range n = new Range(nItems);
    Range k = new Range(numClusters);
    Range d = new Range(numDims);
    
    double[] piPrior = new double[numClusters];
    for (int i = 0; i < numClusters; i++)
        piPrior[i] = 1.0 / numClusters;
    
    // define latent variables
    var pi = Variable.Dirichlet(k, piPrior).Named("pi");
    var c = Variable.Array<int>(n).Named("c");
    var t = Variable.Array(Variable.Array<double>(d), k).Named("t");
    var x = Variable.Array(Variable.Array<bool>(d), n).Named("x");
    
    // cluster-specific parameters
    t[k][d] = Variable.Beta(1, 1).ForEach(k).ForEach(d);


    The new part (and where I don't feel confident) is the creation of accumulators that must be attached to the variables of interest. Those variables are "pi" and "t" -- these are the variables that must be inferred. Am I missing something or doing something wrong here?

    // attach accumulator for pi variable
    Variable<Dirichlet> piMessage = Variable.Observed<Dirichlet>(Dirichlet.Uniform(numClusters));
    Variable.ConstrainEqualRandom(pi, piMessage);
    pi.AddAttribute(QueryTypes.Marginal);
    pi.AddAttribute(QueryTypes.MarginalDividedByPrior);
    
    // attach accumulator for each variable in t array
    var tMessage = Variable.Array(Variable.Array<Beta>(d), k);
    using (Variable.ForEach(k))
    {
        using (Variable.ForEach(d))
        {
            tMessage[k][d] = Variable.Observed<Beta>(Beta.Uniform());
            Variable.ConstrainEqualRandom(t[k][d], tMessage[k][d]);
            t[k][d].AddAttribute(QueryTypes.Marginal);
            t[k][d].AddAttribute(QueryTypes.MarginalDividedByPrior);
        }
    }

    The generative story or the model itself remains unchanged:

    // data generation model
    using (Variable.ForEach(n))
    {
        c[n] = Variable.Discrete(pi);
        using (Variable.Switch(c[n]))
        {
            using (Variable.ForEach(d))
                x[n][d] = Variable.Bernoulli(t[c[n]][d]);
        }
    }
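
    In equation form, the generative story above is the standard Bernoulli mixture likelihood (writing K for numClusters and D for numDims):

    ```latex
    p(x_n \mid \pi, t) = \sum_{k=1}^{K} \pi_k \prod_{d=1}^{D} t_{kd}^{\,x_{nd}} \, (1 - t_{kd})^{1 - x_{nd}}
    ```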

    As we seek to assign a cluster label to every data point, we need to make sure that the model starts from a good point. Symmetry must be broken by assigning clusters to examples randomly.

    // symmetry breaking -- assign clusters to examples randomly
    Discrete[] cinit = new Discrete[batchSize];
    for (int i = 0; i < cinit.Length; i++)
        cinit[i] = Discrete.PointMass(Rand.Int(k.SizeAsInt), k.SizeAsInt);
    c.InitialiseTo(Distribution<int>.Array(cinit));

    Now the inference part. We create the engine instance and the two marginals to infer -- "pi" and "t". These are expected to be updated after each batch.

    Here, I am not sure if I am initializing and using these two variables correctly.

    In this code, I was simply checking the progress via "piMarginal", as it is simpler to print to the console. Are there any other ways to visualize the inference progress?

    InferenceEngine engine = new InferenceEngine();
    
    // marginals to infer (and update in each batch)
    Dirichlet piMarginal = Dirichlet.Uniform(numClusters);
    
    Beta[][] tMarginal = new Beta[numClusters][];
    for (int i = 0; i < numClusters; i++)
    {
        tMarginal[i] = new Beta[numDims];
        for (int j = 0; j < numDims; j++)
            tMarginal[i][j] = Beta.Uniform();
    }

    Finally, the online learning part & inference. The data is fed in "batchSize" chunks and the variables of interest are inferred by the engine. The inference process should update the beliefs from batch to batch, but I am not sure how to check that.

    // online learning in batches
    bool[][] batch = new bool[batchSize][];
    for (int b = 0; b < numData / batchSize; b++)
    {
        nItems.ObservedValue = batchSize;
    
        // fill the batch with data
        batch = data.Skip(b * batchSize).Take(batchSize).ToArray();
        x.ObservedValue = batch;
    
        piMarginal = engine.Infer<Dirichlet>(pi);
        tMarginal = engine.Infer<Beta[][]>(t);
    
        piMessage.ObservedValue = engine.Infer<Dirichlet>(pi, QueryTypes.MarginalDividedByPrior);
        tMessage.ObservedValue = engine.Infer<Beta[][]>(t, QueryTypes.MarginalDividedByPrior);
    
        Console.WriteLine("Batch {0}, pi Marginal: {1}", b, piMarginal);
    }
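
    On visualizing progress: one lightweight option (a sketch only -- it assumes the Infer.NET Dirichlet type exposes its pseudo-counts via PseudoCount and its mean via GetMean(), and that Beta exposes GetMean(), as in the API documentation) is to print per-batch summaries inside the loop, e.g. the average "on" probability per cluster to watch the clusters specialize:

    ```csharp
    // Sketch: per-batch diagnostics (assumes Dirichlet.PseudoCount / GetMean()
    // and Beta.GetMean() from the Infer.NET distributions API).
    Console.WriteLine("Batch {0}", b);
    Console.WriteLine("  pi pseudo-counts: {0}", piMarginal.PseudoCount);
    Console.WriteLine("  pi mean:          {0}", piMarginal.GetMean());
    for (int kk = 0; kk < numClusters; kk++)
    {
        // average Bernoulli parameter across the 784 dimensions of cluster kk
        double avg = tMarginal[kk].Average(bd => bd.GetMean());
        Console.WriteLine("  cluster {0}: mean t = {1:F3}", kk, avg);
    }
    ```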
    
    I am not certain that I am doing the online learning correctly. It still feels like something is missing, but I am not sure how to check. I don't like that the model is re-compiled in each batch (should it be?) and that there are two inference runs per batch.


    Saturday, May 20, 2017 12:03 AM

Answers

  • There are two problems here:

    1. When you assign to piMessage.ObservedValue, you change the constraints in the model.  Therefore it has to re-run inference on the next line.  Instead you want:

    var temp = engine.Infer<Dirichlet>(pi, QueryTypes.MarginalDividedByPrior);
    tMessage.ObservedValue = engine.Infer<Beta[][]>(t, QueryTypes.MarginalDividedByPrior);
    piMessage.ObservedValue = temp;


    2. Initially you define tMessage to be an array where every element is observed. Then you assign to tMessage.ObservedValue, providing a single observed value for the entire array. The array ends up with two different observations which are not compatible and will trigger re-compilation (and incorrect results). Instead you want to define tMessage to be an observed array from the beginning.
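
    A minimal sketch of that change (mirroring the question's variable names; whether the per-element attribute calls can stay exactly as written is an assumption -- the key difference is a single ObservedValue assigned up front, with no per-element Variable.Observed definitions):

    ```csharp
    // Sketch: define tMessage as an observed array from the beginning.
    var tMessage = Variable.Array(Variable.Array<Beta>(d), k);

    Beta[][] tMessageInit = new Beta[numClusters][];
    for (int i = 0; i < numClusters; i++)
    {
        tMessageInit[i] = new Beta[numDims];
        for (int j = 0; j < numDims; j++)
            tMessageInit[i][j] = Beta.Uniform();
    }
    tMessage.ObservedValue = tMessageInit;   // one observation for the whole array

    using (Variable.ForEach(k))
    {
        using (Variable.ForEach(d))
        {
            // only the constraint and attributes stay inside the loops
            Variable.ConstrainEqualRandom(t[k][d], tMessage[k][d]);
            t[k][d].AddAttribute(QueryTypes.Marginal);
            t[k][d].AddAttribute(QueryTypes.MarginalDividedByPrior);
        }
    }
    ```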



    Sunday, May 21, 2017 10:58 PM
    Owner

All replies

  • Thank you for your response!

    For (1), why do you assign the result of "engine.Infer<Dirichlet>(pi, ...)" to a temporary variable only to assign it to "piMessage.ObservedValue" afterwards? Isn't that doing the same thing as the original code? The original code assigns piMessage before tMessage. It is not clear to me how the temporary variable and the order of assignments make a difference.

    For (2), I think I see the issue with tMessage. However, it is not clear to me how to make the array observed, add ConstrainEqualRandom() and add the two attributes -- all at the same time and for each element of the array.

    Is it possible to clarify a bit more precisely what needs to be changed and where my understanding is insufficient?

    Thanks again for your time and patience!

    Monday, May 22, 2017 4:28 PM
  • 1. Assigning to ObservedValue has side effects (it is a property setter in C#).  It alerts the InferenceEngine that something has changed.

    2. Get rid of "tMessage[k][d] = Variable.Observed<Beta>(Beta.Uniform());" and use tMessage.ObservedValue = ... 

    Monday, May 22, 2017 4:36 PM
    Owner
    1. OK. Should I proceed similarly if I also decide to infer the cluster assignment variable array -- the "c" variable in my code?

    2. I removed the "tMessage[k][d] = Variable.Observed<Beta>(Beta.Uniform())" and got an exception saying that 'Variable 'vBeta[][]0' has no definition'...

    Monday, May 22, 2017 5:20 PM
    I have been struggling for a while with the proper definition & initialization of "tMessage". In my code, "tMessage" is a jagged array (an array of arrays of Beta distributions).

    I have the following line:

    var tMessage = Variable.Array(Variable.Array<Beta>(d), k);

    In the next line I am trying to make it observed:

    tMessage.Observed = Variable.Observed<Beta[][]>(???);

    I am still looking through examples where something similar is done, but I still can't get past this point. I don't know what to put on the right-hand side of that assignment...

    For other parts of the code, I removed "tMessage[k][d]=Variable.Observed<Beta>(Beta.Uniform())" and also implemented 1. (inference of piMessage and tMessage).

    Monday, May 22, 2017 8:40 PM
  • Try reading the user guide sections on Creating variables and observing arrays, and see if it answers your question.
    Monday, May 22, 2017 8:50 PM
    Owner
  • I think I have found an explanation for 1. in the Infer.NET 101 document (page 26, bottom):

    "If you change the model, the inference engine must recompile it before running a query, which can take significant amount of time. However, assigning a new value to an ObservedValue property doesn't change the model, even if the value is a reference type rather than a value type. For all queries after the first one, the engine just runs the existing compiled model with new observed value."

    Is this why you stored the result of the first query in a temporary variable?

    var temp = engine.Infer<Dirichlet>(pi, QueryTypes.MarginalDividedByPrior);

    Monday, May 22, 2017 8:52 PM
    Here is another way to say it. I'm thinking of adding this to the user guide page on Running inference.

    When you call Infer() the first time, the inference engine will collect all factors and variables related to the variable that you are inferring (i.e. the model), compile an inference algorithm, run the algorithm, and return the result. This is an expensive process, so the results are cached at multiple levels.

    If you call Infer() on the same variable again without changing anything, then it will immediately return the cached result from the last call. If you call Infer() on another variable in the same model, then it will return the cached result for that variable, if one is available. For instance, in the example above, when Infer<Gaussian>(mean) is called, the inference engine will compute both the posterior over the mean and the posterior over the precision (since this requires no additional computation). Then when Infer<Gamma>(precision) is called, this cached precision is returned and no additional computation is performed.

    These cached results get invalidated when you change things. In particular, if you change observed values or minor settings on the inference engine, such as the number of iterations, then the inference results are invalidated, but the compiled algorithm is kept. If you change the model itself, such as adding new variables or constraints, making a variable observed when it was previously un-observed, or you change major settings such as the choice of inference algorithm, then the compiled algorithm is also invalidated and the next call to Infer() will trigger a re-compile.
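
    As a compact illustration of these rules (a sketch with hypothetical model variables mean, precision and observed data x, not from the thread's code):

    ```csharp
    // Sketch of the caching behaviour described above.
    var meanPost = engine.Infer<Gaussian>(mean);   // first call: compile + run; results cached
    var precPost = engine.Infer<Gamma>(precision); // same model: returns the cached posterior

    x.ObservedValue = newData;                     // invalidates results, keeps compiled algorithm
    meanPost = engine.Infer<Gaussian>(mean);       // re-runs inference, no recompile

    Variable.ConstrainPositive(mean);              // changes the model itself
    meanPost = engine.Infer<Gaussian>(mean);       // triggers a full recompile
    ```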

    Tuesday, May 23, 2017 10:14 AM
    Owner
  • I think I have managed to implement the updates to "tMessage". Basically I create the jagged array (as before) but I make it observed by initializing with an array of arrays of uniform Beta distributions:

    var tMessage = Variable.Array(Variable.Array<Beta>(d), k);
    
    Beta[][] tInit = new Beta[numClusters][];
    for (int i = 0; i < numClusters; i++ ) {
       tInit[i] = new Beta[numDims];
       for (int j = 0; j < numDims; j++ )
          tInit[i][j] = Beta.Uniform();
    }
    
    tMessage.ObservedValue = tInit;

    After this change (and implementing 1.), I see that the model is no longer recompiled after each batch. Nevertheless, I see from piMarginal that, from batch to batch, only the same Dirichlet pseudo-count positions are incremented. The others stay at their initial values (0.1 in my case, i.e. 1 / numClusters). I find this suspicious, as the data points should scatter across the requested number of clusters (in my case I asked for 10).

    Am I initializing tMessage correctly? I would love to be reasonably sure about this part before moving forward with the simulation.

    Many thanks for following this issue and helping me out!


    • Edited by usptact Friday, May 26, 2017 12:04 AM
    Friday, May 26, 2017 12:02 AM