Imputing missing discrete values on a grid
Question

I would like to impute missing data with Infer.NET. I have discrete data (0, 1, 2) for a number of variables (columns) observed on a number of individuals (rows). I would like to exploit the covariance between variables as well as between individuals when imputing the missing data. My first idea was to use a multivariate normal distribution and then discretize the imputed value, but there ought to be better ways. An example data set is shown below (NA marks the missing value). Any ideas are welcome.
1
0
0
0
1
0
2
1
1
1
1
0
0
0
0
2
0
2
2
2
0
0
0
0
0
0
2
0
2
2
2
0
0
2
0
0
0
0
0
2
0
0
2
2
2
0
0
0
0
0
2
0
0
2
2
1
0
0
0
1
NA
2
1
1
1
1
1
0
0
0
1
0
2
1
1
1
1
1
0
0
0
1
0
2
1
1
1
1
0
0
0
0
2
0
2
2
2
0
0
0
0
0
0
2
0
2
2
2
0
0
1
0
0
0
1
0
2
1
1
1
1
Tuesday, September 27, 2011 1:33 PM
All replies

Once you have chosen a model, Infer.NET can easily do the imputation. But it sounds like you haven't decided what sort of model is appropriate here. To answer this question, you need to think about where the data came from. Your idea of discretizing a Gaussian assumes that the discrete values (0,1,2) are ordered. Is this true? Are the individuals independent and identically distributed? What distinguishes the columns? Do you expect the rows and columns to be clustered? Thinking about these questions will help to choose a model. For some ideas, you might want to look at the reviewer model, the multiclass Bayes point machine, or multinomial regression, all of which could apply to this sort of data.
Tuesday, September 27, 2011 4:29 PM, Owner
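To make the ordering assumption concrete: discretizing a Gaussian amounts to cutting a latent normal variable at two thresholds, so category 1 always sits between 0 and 2. The following is a minimal Python sketch, not Infer.NET code; the function names, the unit variance, and the threshold values are illustrative assumptions.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ordinal_probs(mean, cut_low=-0.5, cut_high=0.5):
    """Probabilities of categories 0, 1, 2 for a latent N(mean, 1) variable.

    cut_low and cut_high are hypothetical thresholds chosen for illustration:
    the latent value falls in category 0 below cut_low, in category 2 above
    cut_high, and in category 1 in between.
    """
    p0 = normal_cdf(cut_low - mean)
    p1 = normal_cdf(cut_high - mean) - p0
    p2 = 1.0 - normal_cdf(cut_high - mean)
    return [p0, p1, p2]


# Show how the category probabilities shift as the latent mean increases.
for mean in (-1.0, 0.0, 1.0):
    print(mean, [round(p, 3) for p in ordinal_probs(mean)])
```

As the latent mean increases, probability mass moves from 0 through 1 to 2. A model for unordered categories, such as multinomial regression, would not impose this constraint.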

The underlying data is genetic, i.e. an individual has 0, 1 or 2 copies of a certain DNA marker. Each column therefore refers to a DNA marker, and the markers are ordered along a chromosome. Hence, closely situated markers are often more strongly correlated with each other. There is a certain distance between each pair of adjacent markers (but this may not be possible to account for). The individuals (rows) are often assumed to be i.i.d., but this is seldom the case, since some individuals could be (or are) more closely related (correlated) with each other. However, it is not possible to order the individuals along a distance metric. If we look at columns separately, we could fit a Gaussian process to the columns in order to model the spatial dependence between the DNA markers. How to deal with the nonspatial dependence between the individuals is unclear to me.
Wednesday, September 28, 2011 9:14 AM
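The spatial idea can be sketched as a covariance that decays with the distance between markers. The Python snippet below builds a squared-exponential (RBF) kernel matrix, one common Gaussian process choice. The marker positions, the length scale, and the function name are hypothetical, and the snippet only illustrates the covariance structure, not a full Gaussian process fit.

```python
import math


def rbf_kernel(positions, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between every pair of marker positions."""
    n = len(positions)
    return [
        [
            variance
            * math.exp(-0.5 * ((positions[i] - positions[j]) / length_scale) ** 2)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical marker positions along a chromosome (arbitrary units).
marker_positions = [0.0, 0.4, 1.0, 2.5, 2.7]

K = rbf_kernel(marker_positions, length_scale=1.0)
for row in K:
    print([round(value, 3) for value in row])
```

Markers that sit close together get a covariance near the prior variance, while distant markers are nearly independent. The length scale controls how quickly the dependence fades.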

Stochastic blockmodels are often used for this type of data. Check out the paper "Mixed Membership Stochastic Blockmodels" by Airoldi et al. Their model is straightforward to implement in Infer.NET (since it is similar to Latent Dirichlet Allocation). Beyond that, you can try the models mentioned above. I suspect that finding the right model for this data is a research problem in itself.
Wednesday, September 28, 2011 12:04 PM, Owner

Yes, the problem is not so trivial (at least not to me). Different mixture models have been used earlier over the individual dimension, but I would like to avoid that. If we just look at the individual dimension and assume that the markers are independent, would it be possible to formulate a model where individuals are treated as nodes and each marker is multinomial and get an 'average' probability (over markers) of individuals having the same genetic setup (leaving out the NA observation to start with)?
Wednesday, September 28, 2011 2:33 PM

Are you saying that you want to loop over all pairs of individuals and compute a score of how similar their marker distributions are? This is easy enough to do, though I think the stochastic blockmodel provides a cleaner solution.
Wednesday, September 28, 2011 4:20 PM, Owner
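A pairwise scoring loop of this kind might look like the Python sketch below. It is a simple baseline, not the stochastic blockmodel and not Infer.NET code. It assumes the example data from the question forms an 11 x 11 grid (individuals in rows, markers in columns), uses the fraction of matching markers as the similarity score, and fills the NA by a similarity-weighted vote; the function names and the choice of score are illustrative assumptions.

```python
from collections import defaultdict

NA = None  # marker for a missing observation

# Example data from the question, read as 11 individuals x 11 markers
# (the grid shape is an assumption about the original layout).
grid = [
    [1, 0, 0, 0, 1, 0, 2, 1, 1, 1, 1],
    [0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0],
    [0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0],
    [2, 0, 0, 0, 0, 0, 2, 0, 0, 2, 2],
    [2, 0, 0, 0, 0, 0, 2, 0, 0, 2, 2],
    [1, 0, 0, 0, 1, NA, 2, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 2, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 2, 1, 1, 1, 1],
    [0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0],
    [0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0],
    [1, 0, 0, 0, 1, 0, 2, 1, 1, 1, 1],
]


def similarity(a, b):
    """Fraction of markers on which two individuals agree, skipping NA."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not NA and y is not NA]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def impute(data, row, col):
    """Similarity-weighted vote over the other individuals' values at col."""
    votes = defaultdict(float)
    for i, other in enumerate(data):
        if i == row or other[col] is NA:
            continue
        votes[other[col]] += similarity(data[row], other)
    if not votes:
        return NA  # no other individual has an observation at this marker
    return max(votes, key=votes.get)


# Impute the single NA (row 5, column 5, zero-based).
print(impute(grid, 5, 5))
```

In this example every other individual carries 0 at the missing marker, so the vote returns 0 regardless of the weights. Unlike a probabilistic model, this baseline gives no uncertainty estimate and ignores the dependence between neighbouring markers.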

An Infer.NET implementation of a stochastic blockmodel is documented here for Infer.NET 2.4 beta 2.
Thursday, September 29, 2011 8:46 AM, Owner