locked
VIBES or infer.NET for inferring genotype probabilities over the pedigree (Migrated from community.research.microsoft.com) RRS feed

  • Question

  • ggorjan posted on 04-03-2009 7:07 AM

    Friday, June 3, 2011 4:55 PM

Answers

  • John Guiver replied on 06-26-2009 11:10 AM

    Below is an example of usage, followed by a print out of the results, which seem to make sense. A warning that this model cannot be used for very deep trees because the generations are represented as dotNOT arrays (rather than ranges), and the Infer.NET compiler will unwind these in the generated code, generating unwieldy amounts of code. If you want to do very deep trees, then we will need to think more carefully.

    using System;
    using System.Collections.Generic;
    using System.Text;
    using MicrosoftResearch.Infer;

    namespace Genotype
    {
        class Program
       
    {
            static void Main(string[] args)
            {
                int[,][] observedGenotypes = new int[,][]
                {
                    // Generation 0: 3 individuals
                   
    {new int[] {0, -1, -1}, new int[] {2, -1, -1}},
                    // Generation 1: 3 individuals
                   
    {new int[] {-1, -1, 1}, new int[] {-1, -1, 2}},
                    // Generation 2: 1 individual
                   
    {new int[] {-1}, new int[] {-1}}
                };
                int[,][] parentMap = new int[,][]
                {
                    {new int[3], new int[3]},
                    {new int[] {0, 0, 2}, new int[] {1, 1, 1}},
                    {new int[] {2}, new int[] {0}}
                };
                var sgm = new SingleGeneModel(3, observedGenotypes, parentMap);
                sgm.Infer();
                int numGenerations = sgm.PosteriorGenotypes.GetLength(0);
                for (int g = 0; g < numGenerations; g++)
                {
                    Console.WriteLine("Generation " + g);
                     int numIndiv = sgm.PosteriorGenotypes[g, 0].Length;
                     for (int i=0; i < numIndiv; i++)
                        Console.WriteLine(sgm.PosteriorGenotypes[g,0][ i ] + ", " + sgm.PosteriorGenotypes[g,1][ i ]);
                }
            }
        }
    }

    The result from running this is:
    Generation 0
    Discrete.PointMass(0), Discrete.PointMass(2)
    Discrete(0.1667 0.1667 0.6667), Discrete(0.1667 0.1667 0.6667)
    Discrete(0.1667 0.6667 0.1667), Discrete(0.1667 0.6667 0.1667)
    Generation 1
    Discrete(0.5 0 0.5), Discrete(0.1667 0.1667 0.6667)
    Discrete(0.5 0 0.5), Discrete(0.1667 0.1667 0.6667)
    Discrete.PointMass(1), Discrete.PointMass(2)
    Generation 2
    Discrete(0 0.5 0.5), Discrete(0.3333 0.08333 0.5833)

    Friday, June 3, 2011 4:56 PM

All replies

  • minka replied on 04-15-2009 5:33 AM

    It appears that the Fernández paper is doing standard probabilistic inference in a discrete Bayesian network.  What they call "peeling" is variable elimination, and "iterative peeling" is loopy belief propagation.  If this is also what you want to do, then Infer.NET is the right tool.  Just enter your pedigree network as a model with integer variables connected by stochastic inheritance relationships.  The presence of loops is not a problem.  For efficiency, be sure to follow the guidelines (that John sent) for large irregular graphs.

    Friday, June 3, 2011 4:55 PM
  • ggorjan replied on 04-17-2009 8:20 PM

    I am really glad you stated the equalities between methods - it will be much easier for me to follow the literature in this area. I am now trying to represent the pedigree network, but I have a hard time, since I do not "speak" neither C# nor any other .NET languages. I installed Visual Studio and I am trying with the examples - I got the workflow, but I am stumbling upon the C# code. Could anyone please help me to write the code for the following example?

    We have a small family - father (code 1), mother (code 2),  and a son (code 3). Father has a genotype 1/1, while mother has 1/2. We would like to infer the genotype probabilities of a son. He will always receive allele 1 from a father, but can receive either allele 1 or allele 2 from a mother. Therefore, son has a genotype 1/1 or 1/2 (each with 0.5 probability). I was thinking about setting the network as alleles and not as individuals. These would be the working variables:

    s - a vector of length n to specify which allele will be passed from an individual to its offspring; this should be a sample from Bernoulli(0.5) + 1 (addition is done so that the subsetting of a matrix is OK)

    a - an n*k matrix of alleles, where n is the number of individuals and k is the number of distinct alleles

    a has fixed values if genotype data is observed; otherwise we assign values as a sample from parent alleles, i.e. a(i, a(f(i), s(f(i)), a(m(i), s(m(i))), where f(i) is a father of i and m(i) is a mother of i

    Thanks!

     

     

    Friday, June 3, 2011 4:55 PM
  • John Guiver replied on 06-11-2009 11:45 AM

    Hi Gorjanc

    Did anyone reply to you on this? I'm not sure if this was resolved by the InferSup alias rather than on the forum. We had a period when we weren't getting notifications of postings, so this posting may have slipped through our fingers; if so, apologies.

    As regards your question, I am a bit confused by your nxk matrix as it seems that a better representation is an nx2 genotype matrix. Does the following capture what you want?

    Each parent has a geneotype consisting of a pair of alleles each drawn from a Discrete distribution (whose dimension depends on the number of alleles for the particular gene). A boolean selector variable gates which allele is passed through to the child from each parent, thus giving the child genotype.

    The above model is easily written and we can certainly help with it. It can also be straightforwardly extended across multiple genes/alleles. Let us know if you still need help.

    John G. 

    Friday, June 3, 2011 4:55 PM
  • ggorjan replied on 06-11-2009 2:13 PM

    Hi John,

    No, I did not get any response on my query. I am fully aware that I am asking for "spoonfeeding" and to be honest I did not expect much. Therefore, I am really glad you replied.

    John Guiver:
    As regards your question, I am a bit confused by your nxk matrix as it seems that a better representation is an nx2 genotype matrix.

    You are right, n*2 is enough!

    John G.:
    Each parent has a geneotype consisting of a pair of alleles each drawn from a Discrete distribution (whose dimension depends on the number of alleles for the particular gene). A boolean selector variable gates which allele is passed through to the child from each parent, thus giving the child genotype.

    Yes, your extension with the prior (discrete distribution with probabilities p_1, p_2, ..., p_k) for founders (individuals without parents) is the next step I had in mind for my work.

    Your assumption about the boolean selector is also the right one. But I do not want to do just the "gene dropping", i.e., sampling alleles from the discrete distributuion and dropping them through the pedigree. This can be easily done anywhere. I want to infer the genotype probabilities for ungenotyped individuals given the genotypes for some individuals and the pedigree (= network), i.e., genotypes might be observed anywhere in the pedigree - sometimes parents will be genotyped, sometimes their children or grandchildren etc.

    The next step would be the inclusion of a penetrance function, where observed genotypes are treated as phenotypes. Implemention could be via another matrix, say p matrix of order n*k. This is necesary if there are genotyping errors, otherwise the network does not make sense.

    John G.:
    The above model is easily written and we can certainly help with it. It can also be straightforwardly extended across multiple genes/alleles. Let us know if you still need help.

    Yes, please!

    Gregor G.

     

     

     

    Friday, June 3, 2011 4:55 PM
  • John Guiver replied on 06-26-2009 10:54 AM

    Hi Gregor

    Here is some code that does something like what you want for a single gene model. First the model class. I'll then follow up with another post on usage.

    using System;
    using System.Collections.Generic;
    using System.Text;
    using MicrosoftResearch.Infer.Models;
    using MicrosoftResearch.Infer;
    using MicrosoftResearch.Infer.Distributions;

    namespace Genotype
    {
        class SingleGeneModel
       
    {
            public VariableArray<int>[,] Genotype;
            public VariableArray<int>[,] ObservedGenotype;
            public VariableArray<bool>[] GenotypeIsObserved;
            public VariableArray<int>[,] ParentMap;
            public Discrete[,][] PosteriorGenotypes;
            public InferenceEngine engine;

            ///
    <summary>
           
    /// Constructor
           
    /// </summary>
           
    /// <param name="numAlleles">Number of alleles</param>
           
    /// <param name="isObserved"></param>
           
    /// <param name="parentMap">Maps [generation,parent][individual] to parent index in previous generation</param>
           
    public SingleGeneModel(int numAlleles, int[,][] observedGenotype, int[,][] parentMap)
            {
                int numGenerations = observedGenotype.GetLength(0);
                bool[][] obs = new bool[numGenerations][];
                Range[] I = new Range[numGenerations];
                for (int g = 0; g < numGenerations; g++)
                {
                    int numInd = observedGenotype[g,0].Length;
                    obs[ g ] = new bool[numInd];
                    for (int i=0; i < numInd; i++)
                       obs[ g ][ i ] = (observedGenotype[g,0][ i ] < 0) ? false : true;
                    I[ g ] = new Range(numInd).Named("I_"+g);
                }
               Genotype = new VariableArray<int>[numGenerations, 2];
               ObservedGenotype = new VariableArray<int>[numGenerations, 2];
               GenotypeIsObserved = new VariableArray<bool>[numGenerations];
               ParentMap = new VariableArray<int>[numGenerations, 2];
               PosteriorGenotypes = new Discrete[numGenerations, 2][];
               for (int g = 0; g < numGenerations; g++)
               {
                   GenotypeIsObserved[ g ] = Variable.Array<bool>( I[ g ] ).Named("IsObserved_" + g);
                   GenotypeIsObserved[ g ].ObservedValue = obs[ g ];
                   for (int p = 0; p < 2; p++)
                   {
                        Genotype[g, p] = Variable.Array<int>( I[ g ] ).Named("Genotype_" + g + "_" + p);
                        ObservedGenotype[g, p] = Variable.Array<int>( I[ g ] ).Named("Observed_" + g + "_" + p);
                        ObservedGenotype[g, p].ObservedValue = observedGenotype[g, p];
                        using (Variable.ForEach( I[ g ] ))
                        {
                            Genotype[g, p][ I[ g ] ] = Variable.DiscreteUniform(numAlleles);
                             // If genotype is observed, constrain its value
                            
    using (Variable.If(GenotypeIsObserved[ g ][ I[ g ] ]))
                                 Variable.ConstrainEqual<int>(Genotype[g, p][ I[ g ] ], ObservedGenotype[g, p][ I[ g ] ]);
                        }
                    }
                }
                for (int g = 1; g < numGenerations; g++)
                {
                    for (int p = 0; p < 2; p++)
                    {
                        ParentMap[g, p] = Variable.Array<int>( I[ g ] ).Named("ParentMap_" + g + "_" + p).Attrib(new ValueRange( I[ g-1 ] ));
                        ParentMap[g, p].ObservedValue = parentMap[g,p];
                        using (Variable.ForEach( I[ g ] ))
                        {
                            using (Variable.Switch(ParentMap[g,p][ I[ g ] ]))
                            {
                                Variable<bool> selector = Variable.Bernoulli(0.5);
                                using (Variable.If(selector))
                                    Variable.ConstrainEqual<int>(Genotype[g, p][ I[ g ] ], Genotype[g-1, 0][ParentMap[g, p][ I[ g ] ]]);
                                using (Variable.IfNot(selector))
                                    Variable.ConstrainEqual<int>(Genotype[g, p][ I[ g ] ], Genotype[g-1, 1][ParentMap[g, p][ I[ g ] ]]);
                            }
                        }
                    }
                }
                engine = new InferenceEngine();
            }
            public void Infer()
            {
                for (int g=0; g < PosteriorGenotypes.GetLength(0); g++)
                    for (int p=0; p < PosteriorGenotypes.GetLength(1); p++)
                        PosteriorGenotypes[g, p] = Distribution.ToArray<Discrete[]>(engine.Infer(Genotype[g, p]));
            }
        }
    }

     

    Friday, June 3, 2011 4:56 PM
  • John Guiver replied on 06-26-2009 11:10 AM

    Below is an example of usage, followed by a print out of the results, which seem to make sense. A warning that this model cannot be used for very deep trees because the generations are represented as dotNOT arrays (rather than ranges), and the Infer.NET compiler will unwind these in the generated code, generating unwieldy amounts of code. If you want to do very deep trees, then we will need to think more carefully.

    using System;
    using System.Collections.Generic;
    using System.Text;
    using MicrosoftResearch.Infer;

    namespace Genotype
    {
        class Program
       
    {
            static void Main(string[] args)
            {
                int[,][] observedGenotypes = new int[,][]
                {
                    // Generation 0: 3 individuals
                   
    {new int[] {0, -1, -1}, new int[] {2, -1, -1}},
                    // Generation 1: 3 individuals
                   
    {new int[] {-1, -1, 1}, new int[] {-1, -1, 2}},
                    // Generation 2: 1 individual
                   
    {new int[] {-1}, new int[] {-1}}
                };
                int[,][] parentMap = new int[,][]
                {
                    {new int[3], new int[3]},
                    {new int[] {0, 0, 2}, new int[] {1, 1, 1}},
                    {new int[] {2}, new int[] {0}}
                };
                var sgm = new SingleGeneModel(3, observedGenotypes, parentMap);
                sgm.Infer();
                int numGenerations = sgm.PosteriorGenotypes.GetLength(0);
                for (int g = 0; g < numGenerations; g++)
                {
                    Console.WriteLine("Generation " + g);
                     int numIndiv = sgm.PosteriorGenotypes[g, 0].Length;
                     for (int i=0; i < numIndiv; i++)
                        Console.WriteLine(sgm.PosteriorGenotypes[g,0][ i ] + ", " + sgm.PosteriorGenotypes[g,1][ i ]);
                }
            }
        }
    }

    The result from running this is:
    Generation 0
    Discrete.PointMass(0), Discrete.PointMass(2)
    Discrete(0.1667 0.1667 0.6667), Discrete(0.1667 0.1667 0.6667)
    Discrete(0.1667 0.6667 0.1667), Discrete(0.1667 0.6667 0.1667)
    Generation 1
    Discrete(0.5 0 0.5), Discrete(0.1667 0.1667 0.6667)
    Discrete(0.5 0 0.5), Discrete(0.1667 0.1667 0.6667)
    Discrete.PointMass(1), Discrete.PointMass(2)
    Generation 2
    Discrete(0 0.5 0.5), Discrete(0.3333 0.08333 0.5833)

    Friday, June 3, 2011 4:56 PM