Imputing missing discrete values on a grid

I would like to impute missing data with Infer.NET. I have discrete data (0, 1, 2) for a number of variables (columns) observed on a number of individuals (rows). I would like to exploit the covariance both between variables and between individuals when I impute the missing data. My first idea was to use a multivariate normal distribution and then discretize the imputed values, but there ought to be better ways. An example data set is below. Any ideas are welcome.
1 0 0 0 1 0  2 1 1 1 1
0 0 0 0 2 0  2 2 2 0 0
0 0 0 0 2 0  2 2 2 0 0
2 0 0 0 0 0  2 0 0 2 2
2 0 0 0 0 0  2 0 0 2 2
1 0 0 0 1 NA 2 1 1 1 1
1 0 0 0 1 0  2 1 1 1 1
1 0 0 0 1 0  2 1 1 1 1
0 0 0 0 2 0  2 2 2 0 0
0 0 0 0 2 0  2 2 2 0 0
1 0 0 0 1 0  2 1 1 1 1
All replies

Once you have chosen a model, Infer.NET can easily do the imputation. But it sounds like you haven't decided what sort of model is appropriate here. To answer this question, you need to think about where the data came from. Your idea of discretizing a Gaussian assumes that the discrete values (0,1,2) are ordered. Is this true? Are the individuals independent and identically distributed? What distinguishes the columns? Do you expect the rows and columns to be clustered? Thinking about these questions will help to choose a model. For some ideas, you might want to look at the reviewer model, the multiclass Bayes point machine, or multinomial regression, all of which could apply to this sort of data.
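The point about ordering can be made concrete. Discretizing a Gaussian amounts to an ordered-probit style likelihood: a latent real value is cut at two thresholds, and the three resulting intervals map to 0, 1 and 2. The sketch below is plain Python, not Infer.NET code, and the cut points (0.5 and 1.5), the unit noise and the function names are illustrative assumptions rather than anything specified in the thread.

```python
import math


def normal_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ordered_probit_probs(mean, cuts=(0.5, 1.5), sd=1.0):
    """Return [P(0), P(1), P(2)] for a latent N(mean, sd^2) value cut at `cuts`.

    The cut points and noise level are assumptions chosen for illustration.
    """
    lo, hi = cuts
    p0 = normal_cdf((lo - mean) / sd)        # latent value below first cut
    p1 = normal_cdf((hi - mean) / sd) - p0   # latent value between the cuts
    p2 = 1.0 - p0 - p1                       # latent value above second cut
    return [p0, p1, p2]


if __name__ == "__main__":
    for m in (0.0, 1.0, 2.0):
        print(m, [round(p, 3) for p in ordered_probit_probs(m)])
```

Because all three categories are tied to a single latent axis, moving the latent mean from low to high always passes through category 1 on the way from 0 to 2. If the values were unordered labels, a categorical (multinomial) likelihood would be the safer choice.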

The underlying data is genetic, i.e. each individual has 0, 1 or 2 copies of a given DNA marker. The columns therefore correspond to DNA markers ordered along a chromosome, so closely situated markers are often more strongly correlated with each other. There is a certain distance between adjacent markers (but it may not be possible to account for this). The individuals (rows) are often assumed to be i.i.d., but this seldom holds, since some individuals are more closely related (correlated) than others. However, it is not possible to order the individuals along a distance metric. Looking at the columns separately, we could fit a Gaussian process to the columns to model the spatial dependence between the DNA markers. How to deal with the non-spatial dependence between the individuals is unclear to me.
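One way to picture the Gaussian process idea is through its covariance function. The sketch below builds a squared-exponential covariance matrix over marker positions, so that nearby markers get a covariance close to the variance and distant markers get a covariance close to zero. The marker positions, the length scale and the function name are hypothetical, chosen only to show how uneven spacing between markers would enter.

```python
import math


def se_kernel(positions, length_scale=2.0, variance=1.0):
    """Squared-exponential covariance matrix over 1-D marker positions.

    k(x, x') = variance * exp(-0.5 * ((x - x') / length_scale)^2)
    """
    n = len(positions)
    return [
        [
            variance
            * math.exp(-0.5 * ((positions[i] - positions[j]) / length_scale) ** 2)
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical, unevenly spaced marker positions along a chromosome.
    positions = [0.0, 1.0, 2.5, 3.0, 7.0]
    for row in se_kernel(positions):
        print([round(v, 3) for v in row])
```

The length scale controls how quickly the correlation between markers decays with distance, and it would normally be learned from the data rather than fixed.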

Stochastic blockmodels are often used for this type of data. Check out the paper "Mixed Membership Stochastic Blockmodels" by Airoldi et al. Their model is straightforward to implement in Infer.NET (since it is similar to Latent Dirichlet Allocation). Beyond that, you can try the models mentioned above. I suspect that finding the right model for this data is a research problem in itself.
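For readers unfamiliar with the model, here is a minimal sketch of the generative process described in the Airoldi et al. paper, written in plain Python rather than Infer.NET. Each node draws a mixed-membership vector from a Dirichlet, each ordered pair of nodes draws a sender and a receiver group, and the binary relation is a Bernoulli draw from a block matrix. The function names, the hyperparameters and the block matrix values are illustrative assumptions. Note that the model describes pairwise binary relations, so how the genotype grid would be mapped onto such relations is left open in this thread.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mmsb(n_nodes, alpha, block, rng):
    """Sample memberships and a directed 0/1 relation matrix from an MMSB."""
    k = len(alpha)
    # One mixed-membership vector per node.
    pi = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    y = [[0] * n_nodes for _ in range(n_nodes)]
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue  # no self-relations
            z_send = rng.choices(range(k), weights=pi[p])[0]  # group p takes towards q
            z_recv = rng.choices(range(k), weights=pi[q])[0]  # group q takes towards p
            y[p][q] = 1 if rng.random() < block[z_send][z_recv] else 0
    return pi, y


if __name__ == "__main__":
    rng = random.Random(0)
    block = [[0.9, 0.05], [0.05, 0.9]]  # assumed: dense within groups, sparse between
    pi, y = sample_mmsb(6, alpha=[0.3, 0.3], block=block, rng=rng)
    for row in y:
        print(row)
```

Inference runs this process in reverse: given the observed relations, it recovers the membership vectors and the block matrix.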

Yes, the problem is far from trivial (at least to me). Various mixture models have previously been used over the individual dimension, but I would like to avoid that. If we look only at the individual dimension and assume that the markers are independent, would it be possible to formulate a model in which individuals are treated as nodes and each marker is multinomial, and to obtain an 'average' probability (over markers) of individuals having the same genetic setup (leaving out the NA observation to start with)?
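As a purely empirical stand-in for the 'average' probability described here, one can compute, for each pair of individuals, the fraction of markers on which both genotypes are observed and equal. This is a descriptive statistic, not a probabilistic model or an Infer.NET result. The sketch assumes the example values from the question form an 11 x 11 grid (rows are individuals, columns are markers) when read row by row, and the function names are hypothetical.

```python
# Example values from the question, assumed to form an 11 x 11 grid
# (rows = individuals, columns = markers) when read row by row.
GRID = """
1 0 0 0 1 0 2 1 1 1 1
0 0 0 0 2 0 2 2 2 0 0
0 0 0 0 2 0 2 2 2 0 0
2 0 0 0 0 0 2 0 0 2 2
2 0 0 0 0 0 2 0 0 2 2
1 0 0 0 1 NA 2 1 1 1 1
1 0 0 0 1 0 2 1 1 1 1
1 0 0 0 1 0 2 1 1 1 1
0 0 0 0 2 0 2 2 2 0 0
0 0 0 0 2 0 2 2 2 0 0
1 0 0 0 1 0 2 1 1 1 1
"""


def parse_grid(text):
    """Parse whitespace-separated genotypes; 'NA' becomes None."""
    return [
        [None if tok == "NA" else int(tok) for tok in line.split()]
        for line in text.strip().splitlines()
    ]


def agreement(a, b):
    """Fraction of markers observed in both individuals where genotypes match."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return float("nan")
    return sum(x == y for x, y in pairs) / len(pairs)


def agreement_matrix(rows):
    """Pairwise agreement between all individuals (nodes)."""
    return [[agreement(a, b) for b in rows] for a in rows]


if __name__ == "__main__":
    rows = parse_grid(GRID)
    sim = agreement_matrix(rows)
    # Agreement of the individual with the missing value (index 5) with everyone.
    for j, s in enumerate(sim[5]):
        print(j, round(s, 2))
```

On this grid, the individual with the missing value agrees with individuals 0, 6, 7 and 10 (zero-based) on every marker observed in both, and all four have a 0 at the missing marker, so this crude measure would suggest 0 for the NA entry. A proper model would replace the raw fraction with a posterior probability that accounts for uncertainty.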

