Clustering Using Infer.net (Migrated from community.research.microsoft.com)

  • Question

  • SallyHamouda posted on 03-10-2010 4:04 AM

    Hi,

    Thanks so much for this great work.

    I was using the Infer.net GMM for clustering. It works fine for small datasets with a small number of features. But when I tried it on a dataset of 2000 points, each with 30 features, for 5 clusters, I got an out-of-memory exception.

    To solve this problem I used shared variables: I shared the weights, means and precisions between data chunks.

    The problem is that those variables have Dirichlet, Gaussian and Wishart distributions, and shared variables in Infer.net only allow Gaussian and Gamma. When I made them all Gaussians I got the following exception:

    Unhandled Exception: System.InvalidCastException: Unable to cast object of type 'MicrosoftResearch.Infer.Distributions.DistributionRefArray`2[MicrosoftResearch.Infer.Distributions.VectorGaussian,MicrosoftResearch.Infer.Maths.Vector]' to type 'MicrosoftResearch.Infer.Distributions.VectorGaussian'.

    I also have a problem because the number of features is relatively large, and the shared-variable examples on the Infer.net site only cover a one-component GMM with one feature.

    So can you please tell me whether it is possible to do clustering in the way described above using Infer.net, and how I could use shared variables for a mixture of several Gaussians when the data points have a large number of features?

    Thanks for your support

    Friday, June 3, 2011 5:37 PM

Answers

  • John Guiver replied on 03-10-2010 1:55 PM

    You will also need to decide whether you want to infer a full multivariate distribution over the weights in each cluster or not. If you do, the model will need to store large VectorGaussian messages in both directions for each cluster and for each data point in the chunk. This is probably fine for the 30 features if the chunks of data are not too large, but you could also consider arrays of Gaussians rather than VectorGaussians if the feature space gets very large. Let us know how you get along, and if you need any more help.

    John

    Friday, June 3, 2011 5:37 PM
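
To get a rough feel for the storage trade-off described in the answer above, the following Python sketch counts the doubles needed for the per-point, per-cluster messages. It is only back-of-the-envelope arithmetic, not a measurement of Infer.net: it assumes each full multivariate message holds a d-vector plus a d x d matrix of 8-byte doubles, that a factorized message holds two numbers per feature, and it ignores object overhead and any other buffers the engine allocates. The function names and the chunk size of 200 are hypothetical.

```python
BYTES_PER_DOUBLE = 8


def vector_gaussian_doubles(d):
    """Doubles in one full multivariate message (assumed: d-vector + d x d matrix)."""
    return d + d * d


def gaussian_array_doubles(d):
    """Doubles in one factorized message (assumed: two parameters per feature)."""
    return 2 * d


def message_bytes(n_points, n_clusters, doubles_per_message, directions=2):
    """Bytes for one message per point, per cluster, per direction."""
    return n_points * n_clusters * directions * doubles_per_message * BYTES_PER_DOUBLE


# Figures from the question: 2000 points, 30 features, 5 clusters.
full_all = message_bytes(2000, 5, vector_gaussian_doubles(30))
# Hypothetical chunk of 200 points with full multivariate messages.
full_chunk = message_bytes(200, 5, vector_gaussian_doubles(30))
# All 2000 points with arrays of univariate Gaussians instead.
factorized_all = message_bytes(2000, 5, gaussian_array_doubles(30))

for label, size in [
    ("full, all points", full_all),
    ("full, 200-point chunk", full_chunk),
    ("factorized, all points", factorized_all),
]:
    print(f"{label}: {size / 1e6:.2f} MB")
```

Under these assumptions the full multivariate messages grow with the square of the number of features, while the factorized ones grow linearly, which is why the factorized form becomes attractive as the feature space grows. The real footprint will be larger than these counts.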

All replies

  • minka replied on 03-10-2010 10:11 AM

    Sounds like you want to use a SharedVariableArray.  See this thread: http://community.research.microsoft.com/forums/p/3933/7055.aspx#7055

    Friday, June 3, 2011 5:37 PM