# Penalized multiple regression (Migrated from community.research.microsoft.com)

### Question

• patwa posted on 08-30-2010 7:34 AM

A hot topic in bioinformatics is penalized multiple regression where the number of explanatory variables is much larger than the number of observations (p >> n). I wonder whether it would be feasible to construct a model in Infer.NET that could handle a data set where p is on the order of 1 million and n on the order of 10,000. Let's say that we use a simple multiple regression model, i.e. y_i = mu + sum_j(X_ij * b_j) + e_i. Here, y is a vector of Gaussian data, mu is the mean, which could be assumed to have some uninformative prior, X_ij is an indicator (-1, 0 or 1) for variable j and observation i, and e_i is Gaussian noise. What I want is a prior over b_j that allows for some kind of shrinkage. One approach often used is Gibbs sampling in combination with a mixture approach (Stochastic Search Variable Selection), but this is not computationally feasible for the size mentioned here. Another popular approach is the LASSO, for which there are both frequentist and Bayesian versions. Any recommendations would be welcome.
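To make the setup concrete, here is a minimal sketch of the frequentist LASSO for the regression model above, fitted by cyclic coordinate descent with soft-thresholding. It is plain Python and not Infer.NET code. The function names, the penalty value `lam`, the simulated data and the tiny problem size (n = 50, p = 200) are all assumptions made for illustration; a real data set with p around 1 million would need array-based or sparse storage rather than nested lists.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Approximately minimise
    (1/(2n)) * sum_i (y_i - mu - sum_j X_ij*b_j)^2 + lam * sum_j |b_j|
    by cyclic coordinate descent. The intercept mu is not penalised."""
    n, p = len(X), len(X[0])
    mu = sum(y) / n
    b = [0.0] * p
    r = [yi - mu for yi in y]  # residuals for the current (mu, b)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        # Intercept update: absorb the mean of the residuals into mu.
        shift = sum(r) / n
        mu += shift
        r = [ri - shift for ri in r]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # column of zeros carries no information
            # Covariance of column j with the residual that excludes b_j.
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                b[j] = new_bj
    return mu, b


# Simulated toy data (hypothetical): two non-zero effects among p = 200.
random.seed(0)
n, p = 50, 200
true_b = [0.0] * p
true_b[3], true_b[17] = 2.0, -1.5
X = [[random.choice((-1, 0, 1)) for _ in range(p)] for _ in range(n)]
y = [1.0 + sum(X[i][j] * true_b[j] for j in range(p)) + random.gauss(0.0, 0.5)
     for i in range(n)]

mu_hat, b_hat = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, bj in enumerate(b_hat) if bj != 0.0]
print("intercept:", round(mu_hat, 3), "non-zero coefficients:", selected)
```

The L1 penalty sets most coefficients exactly to zero, which is the kind of shrinkage asked about here; a Bayesian treatment would replace the penalty with a prior over b_j.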

Friday, June 3, 2011 6:01 PM

### All replies

• patwa replied on 08-31-2010 11:25 AM

One example of a VB (variational Bayes) model can be found in the Methods section of: http://www.biomedcentral.com/1471-2105/11/58

Is this model feasible in Infer.NET?

Friday, June 3, 2011 6:01 PM
• DavidKnowles replied on 08-31-2010 12:05 PM

Hi

The model in the article could be implemented straightforwardly in Infer.NET, with the one exception of the truncated Dirichlet (for which they describe how to calculate sufficient statistics in the supplementary material). If you particularly wanted to include this factor (which doesn't seem essential to the model), you could consider developing the appropriate factor and message operators (see here: http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/How%20to%20add%20a%20new%20factor%20and%20message%20operators.aspx).

Good examples to look at would be this thread on Bayesian linear regression:

http://community.research.microsoft.com/forums/p/3275/5383.aspx#5383

and the Gaussian mixture model example:

http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/Mixture%20of%20Gaussians%20tutorial.aspx

(since your model has a mixture over the regression coefficients).

I hope that helps

David.

Friday, June 3, 2011 6:01 PM
• jwinn replied on 08-31-2010 12:21 PM

The regression part of the VB-model (eq. 1) can be implemented in the current version of Infer.NET.  The mixture prior on the weights (eq. 2) requires truncated Gaussian support for VB, which will be available in the forthcoming release of Infer.NET (around Sept/Oct).  The ad-hoc Dirichlet 'truncation' would require special treatment - it might be possible to implement it in a principled manner by adding a mixture component at zero with fixed mixture weight.  Alternatively you could write a custom factor to perform the truncation, although this would not be recommended practice.
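As a rough illustration of what a mixture prior with a component at zero and a fixed mixture weight could look like, the sketch below draws coefficients from a three-part mixture: a point mass at zero, a zero-mean Gaussian truncated to positive values, and one truncated to negative values. This is plain Python, not Infer.NET code, and it is only one reading of the suggestion above. The function name, the weights and the standard deviations are hypothetical and are not taken from the article.

```python
import random


def sample_coefficient(rng, w_zero, w_pos, sd_pos, sd_neg):
    """Draw one b_j from an assumed three-part mixture:
    exactly zero with probability w_zero,
    positive truncated Gaussian with probability w_pos,
    negative truncated Gaussian with the remaining probability."""
    u = rng.random()
    if u < w_zero:
        return 0.0
    if u < w_zero + w_pos:
        # A zero-mean Gaussian truncated to (0, inf) is a half-normal.
        return abs(rng.gauss(0.0, sd_pos))
    return -abs(rng.gauss(0.0, sd_neg))


# Hypothetical settings: 90% of coefficients are exactly zero (fixed weight).
rng = random.Random(0)
draws = [sample_coefficient(rng, w_zero=0.9, w_pos=0.05, sd_pos=1.0, sd_neg=1.0)
         for _ in range(10000)]
frac_zero = sum(1 for d in draws if d == 0.0) / len(draws)
print("fraction of exact zeros:", frac_zero)
```

Fixing `w_zero` rather than learning it is what makes the zero component a fixed-weight one; in a full model the remaining weights could still be given a Dirichlet prior.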

Hope that makes sense,

Best
John W.

Friday, June 3, 2011 6:01 PM
• patwa replied on 09-01-2010 7:48 AM

OK, thanks. I guess we need to wait for the next release for the full model. A code example would then be helpful. Meanwhile, what changes need to be made to the code in the following thread to perform the multiple regression (eq. 1)? http://community.research.microsoft.com/forums/t/3275.aspx

Friday, June 3, 2011 6:01 PM