Smart way to use Side Information

  • Question

  • Hi everyone. 

    I'm working on a recommender system for biologists, that is, one intended for biological applications. My work is based on the recommender system example in the Infer.NET web guide. I add side information to the model as follows:

    First, I sample a temporary latent vector for each gene (genes play the role that films play in the recommender system example).

    Then I recombine these vectors according to the side information (here, the similarity between genes).

    The code I use is the following:

                int[] tmpRBP;
                int[] tmpGene;
                int[] tmpRating;
    
                LoadData(trainFilename, out tmpRBP, out tmpGene, out tmpRating);//, "train");
        
                //*********************** File Reading End ************************
    
                // Define sizes
                int numRBPs = RBPs;
                int numGenes = Genes;
                int numTraits = numTrait;
                Variable<int> numObservations = Variable.Observed(tmpGene.Length).Named("numObservations");
                int numLevels = 1;
    
                // Define ranges
                Range RBP = new Range(numRBPs).Named("RBP");
                Range gene = new Range(numGenes).Named("gene");
                Range trait = new Range(numTraits).Named("trait");
                Range observation = new Range(numObservations).Named("observation");
                Range level = new Range(numLevels).Named("level");
    
    
                Dictionary<int, Dictionary<int, Variable<double>>> genesSideInfo = new Dictionary<int, Dictionary<int, Variable<double>>>();
                List<int> noSim = new List<int>();
    
                Console.WriteLine("Start Loading GeneKernel ...");
                loadKernelDictionary(out genesSideInfo, out noSim, genesMap, "gene");
                Console.WriteLine("GeneKernel loaded...");
    
                // Define latent variables
                var RBPTraits = Variable.Array(Variable.Array<double>(trait), RBP).Named("RBPTraits");
                var geneTraits = Variable.Array(Variable.Array<double>(trait), gene).Named("geneTraits");
                var RBPBias = Variable.Array<double>(RBP).Named("RBPBias");
                var geneBias = Variable.Array<double>(gene).Named("geneBias");
                var RBPThresholds = Variable.Array<double>(RBP).Named("RBPThresholds");
    
                // Define priors
                var RBPTraitsPrior = Variable.Array(Variable.Array<Gaussian>(trait), RBP).Named("RBPTraitsPrior");
    
                //var geneTraitsPrior = Variable.Array(Variable.Array<Gaussian>(trait), gene).Named("geneTraitsPrior");
                var RBPBiasPrior = Variable.Array<Gaussian>(RBP).Named("RBPBiasPrior");
                var geneBiasPrior = Variable.Array<Gaussian>(gene).Named("geneBiasPrior");
                var RBPThresholdsPrior = Variable.Array<Gaussian>(RBP).Named("RBPThresholdsPrior");
    
                
                //var RBPKernelPrior = Variable.Array(Variable.Array<double>(RBP), trait).Named("RBPKernelPrior");
                //var RBPKernelPriorPrior = Variable.Array(Variable.Array<Gaussian>(trait), RBP).Named("RBPKernelPriorPrior");
    
                //var geneKernelPriorPrior = Variable.Array(Variable.Array<Gaussian>(gene), trait).Named("geneKernelPriorPrior");
    
                var geneKernelPriorMean = Variable.Array<Gaussian>(gene).Named("geneKernelPriorMean");
                var geneKernelMean = Variable.Array<double>(gene).Named("geneKernelMean");
                var geneKernelPriorPrec = Variable.Array<Gamma>(gene).Named("geneKernelPriorPrec");
                var geneKernelPrec = Variable.Array<double>(gene).Named("geneKernelPrec");
                var geneKernelPrior = Variable.Array(Variable.Array<double>(trait), gene).Named("geneKernelPrior");
    
                var geneThresholdsPrior = Variable.Array<Gaussian>(gene).Named("geneThresholdsPrior");
    
                /***/
                
                var RBPBiasPriorPrec = Variable.Array<Gamma>(RBP).Named("RBPBiasPriorPrec");
                var RBPBiasPrec = Variable.Array<double>(RBP).Named("RBPBiasPrec");
                var RBPBiasPriorMean = Variable.Array<Gaussian>(RBP).Named("RBPBiasPriorMean");
                var RBPBiasMean = Variable.Array<double>(RBP).Named("RBPBiasMean");
                RBPBiasPrec[RBP] = Variable<double>.Random(RBPBiasPriorPrec[RBP]);
                RBPBiasMean[RBP] = Variable<double>.Random(RBPBiasPriorMean[RBP]);
    
                var geneBiasPriorPrec = Variable.Array<Gamma>(gene).Named("geneBiasPriorPrec");
                var geneBiasPrec = Variable.Array<double>(gene).Named("geneBiasPrec");
                var geneBiasPriorMean = Variable.Array<Gaussian>(gene).Named("geneBiasPriorMean");
                var geneBiasMean = Variable.Array<double>(gene).Named("geneBiasMean");
                geneBiasPrec[gene] = Variable<double>.Random(geneBiasPriorPrec[gene]);
                geneBiasMean[gene] = Variable<double>.Random(geneBiasPriorMean[gene]);
                
                /***/
    
                geneKernelMean[gene] = Variable<double>.Random(geneKernelPriorMean[gene]);
                geneKernelPrec[gene] = Variable<double>.Random(geneKernelPriorPrec[gene]);
                geneKernelPrior[gene][trait] = Variable.GaussianFromMeanAndPrecision(geneKernelMean[gene], geneKernelPrec[gene]).ForEach(trait);
    
    

    In this part I define all the structures. geneKernelPrior represents the "temporary" traits matrix, while genesSideInfo is the dictionary that holds the side information. Finally, noSim is the list of genes that have no similarities.
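    As a reading aid, the prior on the temporary traits defined at the end of the code above can be written out as follows. This is only a restatement of those three assignments; the symbols m_g, \tau_g, k_{g,i} and T are shorthand introduced here (for geneKernelMean[gene], geneKernelPrec[gene], geneKernelPrior[gene][trait] and numTraits) and do not appear in the original code.

```latex
\begin{aligned}
m_g      &\sim \text{geneKernelPriorMean}[g] && \text{(a Gaussian prior per gene)} \\
\tau_g   &\sim \text{geneKernelPriorPrec}[g] && \text{(a Gamma prior per gene)} \\
k_{g,i} \mid m_g, \tau_g &\sim \mathcal{N}\!\left(m_g,\ \tau_g^{-1}\right), && i = 1, \dots, T
\end{aligned}
```

    So all T temporary traits of a gene share that gene's mean and precision.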

    I built these structures to reduce the memory requirement: a complete similarity matrix holding the similarity of every gene to every other gene needed a huge amount of memory.

    The following code recombines the temporary traits matrix according to the similarities:
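    To make the sparse layout concrete, here is a minimal sketch in plain C#. It is an illustration under assumptions, not the poster's loadKernelDictionary: similarities are stored as plain double values rather than Variable<double>, and the names SparseKernelSketch, Build and GenesWithoutSimilarity are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SparseKernelSketch
{
    // Builds a sparse similarity map: gene -> (neighbour gene -> similarity).
    // Only the listed pairs are stored, so memory grows with the number of
    // known similarities rather than with numGenes * numGenes.
    public static Dictionary<int, Dictionary<int, double>> Build(
        IEnumerable<(int Gene, int Neighbour, double Similarity)> pairs)
    {
        var kernel = new Dictionary<int, Dictionary<int, double>>();
        foreach (var (gene, neighbour, similarity) in pairs)
        {
            if (!kernel.TryGetValue(gene, out var row))
            {
                row = new Dictionary<int, double>();
                kernel[gene] = row;
            }
            row[neighbour] = similarity;
        }
        return kernel;
    }

    // Genes that have no row in the map (the role played by noSim in the post).
    public static List<int> GenesWithoutSimilarity(
        Dictionary<int, Dictionary<int, double>> kernel, int numGenes)
    {
        return Enumerable.Range(0, numGenes)
                         .Where(g => !kernel.ContainsKey(g))
                         .ToList();
    }

    public static void Main()
    {
        // Hypothetical toy data: 4 genes, 3 known similarities.
        var pairs = new[] { (0, 1, 0.8), (0, 2, 0.3), (1, 0, 0.8) };
        var kernel = Build(pairs);
        var noSim = GenesWithoutSimilarity(kernel, numGenes: 4);
        int stored = kernel.Values.Sum(row => row.Count);
        Console.WriteLine($"stored entries: {stored}, dense entries: {4 * 4}");
        Console.WriteLine($"genes without similarity: {string.Join(", ", noSim)}");
    }
}
```

    In this toy example genes 2 and 3 have no row of their own, so they would fall into the noSim list, and only 3 values are kept where a dense matrix would keep 16.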

                foreach(var g in genesSideInfo.Keys){
                    for (int i = 0; i < numTraits; i++)
                    {
                        Range tmpRange = new Range(genesSideInfo[g].Count);
                        VariableArray<double> tmpProduct = Variable.Array<double>(tmpRange);
                        int index = 0;
                        foreach (var y in genesSideInfo[g].Keys)
                        {
                            tmpProduct[index] = genesSideInfo[g][y] * geneKernelPrior[g][i];
                            index += 1;
                        }
                        geneTraits[g][i] = Variable.Sum(tmpProduct);
                    }
                }
                foreach(var g in noSim){
                    geneTraits[g] = geneKernelPrior[g];
                }

    The number of similarities in the dictionary differs from gene to gene: one gene may have 10 similarities while another has none. For this reason I create a temporary vector tmpProduct in which I store the influence of each similarity for the gene under consideration (g). I then sum all the values of this vector and use the result as one element of the latent vector (this is done for each element of the latent vector).
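    Written as a formula, the loops above compute the following. The notation is shorthand introduced here, not part of the original code: S(g) is the set of keys of genesSideInfo[g], s_{g,y} is genesSideInfo[g][y], and k_{g,i} is geneKernelPrior[g][i].

```latex
\begin{aligned}
\text{geneTraits}_{g,i} &= \sum_{y \in S(g)} s_{g,y}\, k_{g,i}
                         = k_{g,i} \sum_{y \in S(g)} s_{g,y}
                         && \text{for } g \text{ with at least one similarity} \\
\text{geneTraits}_{g,i} &= k_{g,i}
                         && \text{for } g \in \text{noSim}
\end{aligned}
```

    The second equality in the first line holds because, in the code as posted, the factor geneKernelPrior[g][i] does not depend on y. If the intent was instead to mix in the neighbours' temporary traits, the term would be k_{y,i} rather than k_{g,i}; the post does not say which is meant.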

    The theory seems sound, but the problem is that, built this way, the model does not even manage to compile (or at least I stopped it before it finished, after waiting a long time).

    I know that the model is more complex than the original version, but I suppose I'm doing something wrong that makes the system extremely slow. Because it takes so much time, I have not even been able to check the quality of this model.

    Do you have any ideas or advice for making the system faster while still including side information?

    Thank you very much for your time and availability.

    Best Regards

    Marco

    P.S. I can also provide the complete system or the model graph if that would help.

    Tuesday, July 30, 2013 2:55 PM

All replies