Penalized multiple regression (Migrated from community.research.microsoft.com)
Question

patwa posted on 08-30-2010 7:34 AM
A hot topic in bioinformatics is penalized multiple regression where the number of explanatory variables is much larger than the number of observations (p >> n). I wonder whether it would be feasible to construct a model within Infer.NET that could handle a data set where p is on the order of 1 million and n on the order of 10 000. Let's say that we use a simple multiple regression model, i.e. y_i = mu + sum_j(X_ij*b_j) + e_i. Here, y is a vector of Gaussian data, mu is the mean, which could be given an uninformative prior, X_ij contains indicators (-1, 0 and 1) for variable j and observation i, and e is Gaussian noise. What I want is a prior over b_j that allows for some kind of shrinkage. One approach often used is Gibbs sampling in combination with a mixture prior (Stochastic Search Variable Selection), but this is not computationally feasible for the size mentioned here. Another popular approach is the LASSO, which has both frequentist and Bayesian versions. Any recommendations would be welcome.
Friday, June 3, 2011 6:01 PM
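For readers unfamiliar with the LASSO mentioned in the question, here is a minimal sketch of it in plain Python, fitted by cyclic coordinate descent. It is not Infer.NET code and not something proposed in this thread. The function names, the penalty `lam`, and the toy data sizes are illustrative assumptions, and the -1/0/1 coding of X follows the question. The LASSO estimate is also the posterior mode under independent Laplace priors on the b_j, which is the connection to the Bayesian version.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything in [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimise 0.5 * sum_i (y_i - mu - sum_j X[i][j]*b[j])**2 + lam * sum_j |b[j]|."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    mu = sum(y) / n
    resid = [y[i] - mu for i in range(n)]  # current residual y - mu - X b
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        # The intercept is not penalised: absorb the mean residual into mu.
        shift = sum(resid) / n
        mu += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Inner product of column j with the residual that excludes b[j].
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return mu, b

# Toy data (hypothetical sizes, far smaller than in the question): p >> n.
random.seed(0)
n, p = 30, 200
X = [[random.choice([-1, 0, 1]) for _ in range(p)] for _ in range(n)]
b_true = [2.0, -2.0, 2.0] + [0.0] * (p - 3)
y = [0.5 + sum(X[i][j] * b_true[j] for j in range(3)) + random.gauss(0.0, 0.1)
     for i in range(n)]

mu_hat, b_hat = lasso_cd(X, y, lam=5.0)
print(sum(1 for v in b_hat if v != 0.0), "nonzero coefficients out of", p)
```

Each sweep costs on the order of n*p operations, so the approach scales linearly in p. This pure-Python loop is only meant to show the update, though, and would be far too slow at p = 1 million.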
All replies

patwa replied on 08-31-2010 11:25 AM
One example of a VB model can be found in the Methods section of: http://www.biomedcentral.com/14712105/11/58
Is this model feasible in Infer.NET?
Friday, June 3, 2011 6:01 PM 
DavidKnowles replied on 08-31-2010 12:05 PM
Hi
The model in the article could be implemented straightforwardly in Infer.NET, with the one exception of the truncated Dirichlet (for which they describe how to calculate sufficient statistics in the supplementary material). If you particularly wanted to include this factor (which doesn't seem essential to the model), you could consider developing the appropriate factor and message operators (see here: http://research.microsoft.com/enus/um/cambridge/projects/infernet/docs/How%20to%20add%20a%20new%20factor%20and%20message%20operators.aspx).
Good examples to look at would be this thread on Bayesian linear regression:
http://community.research.microsoft.com/forums/p/3275/5383.aspx#5383
and the Gaussian mixture model example (since your model has a mixture over the regression coefficients).
I hope that helps
David.
Friday, June 3, 2011 6:01 PM 
jwinn replied on 08-31-2010 12:21 PM
The regression part of the VB model (eq. 1) can be implemented in the current version of Infer.NET. The mixture prior on the weights (eq. 2) requires truncated Gaussian support for VB, which will be available in the forthcoming release of Infer.NET (around Sept/Oct). The ad hoc Dirichlet 'truncation' would require special treatment; it might be possible to implement it in a principled manner by adding a mixture component at zero with a fixed mixture weight. Alternatively, you could write a custom factor to perform the truncation, although this would not be recommended practice.
Hope that makes sense,
Best
John W.
Friday, June 3, 2011 6:01 PM
patwa replied on 09-01-2010 7:48 AM
OK, thanks. I guess we need to wait for the next release for the full model. A code example would then be helpful. Meanwhile, what changes need to be made to the code in the following thread to perform the multiple regression (eq. 1): http://community.research.microsoft.com/forums/t/3275.aspx
 Marked as answer by Microsoft Research Friday, June 3, 2011 6:01 PM
Friday, June 3, 2011 6:01 PM
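As a rough illustration of what a variational treatment of the regression part alone involves, here is a minimal sketch in plain Python. It is not Infer.NET code and not the model from the linked article. It assumes a zero-mean Gaussian prior on each b_j, a known noise variance, a flat prior on the mean, and a fully factorised approximation q(b) = prod_j q(b_j). The function name and all parameter names are hypothetical.

```python
def vb_linear_regression(X, y, noise_var, prior_var, n_sweeps=200):
    """Mean-field updates for y_i = mu + sum_j X[i][j]*b[j] + e_i.

    Assumes b[j] ~ N(0, prior_var) and e_i ~ N(0, noise_var).
    Returns the mean of q(mu) and the means and variances of each q(b_j).
    """
    n, p = len(X), len(X[0])
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    # The variance of q(b_j) does not depend on y, so it is computed once.
    s2 = [1.0 / (col_sq[j] / noise_var + 1.0 / prior_var) for j in range(p)]
    m = [0.0] * p
    mu = sum(y) / n
    resid = [y[i] - mu for i in range(n)]  # y - mu - X m under current means
    for _ in range(n_sweeps):
        # Update the mean of q(mu): the average of y - X m.
        shift = sum(resid) / n
        mu += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            # Inner product of column j with the residual that excludes b_j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * m[j]
            new_mj = s2[j] * rho / noise_var
            delta = new_mj - m[j]
            for i in range(n):
                resid[i] -= X[i][j] * delta
            m[j] = new_mj
    return mu, m, s2

# One predictor, two observations: small enough to check by hand.
mu_hat, m_hat, s2_hat = vb_linear_regression(
    [[1.0], [-1.0]], [1.0, -1.0], noise_var=1.0, prior_var=1.0)
# m_hat[0] is about 0.667, i.e. x'y / (x'x + noise_var / prior_var) = 2 / 3.
print(mu_hat, m_hat, s2_hat)
```

With a Gaussian prior the means converge to the ridge regression solution, so this gives shrinkage but no sparsity. The mixture prior discussed above (eq. 2) is what would add the variable-selection behaviour.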