Asked by:
Model not working on bigger data set

Question
-
I am trying to work with the DARE model (How to Grade a Test Without Knowing the Answers – ICML 2012). I have made a small change in the DefineGenerativeProcess() function; other than that, everything is the same as the DARE model.
The model works seamlessly with a smaller dataset consisting of 1150 labels, but throws an exception - improper distribution - when the dataset is bigger: approximately 8000 labels.
What is the main reason for this problem? Any ideas on how I could solve it?
public class DARE
{
    // Ranges - size of the variables
    public static Range task;
    public static Range worker;
    public static Range choice;
    public static Range workerTask;

    // Main variables in the model
    public static VariableArray<double> workerAbility;
    public static VariableArray<double> taskDifficulty;
    public static VariableArray<double> discrimination;
    public static VariableArray<int> trueLabel;
    public static VariableArray<VariableArray<int>, int[][]> workerResponse;

    // Variables in model
    public static Variable<int> WorkerCount;
    public static VariableArray<int> WorkerTaskCount;
    public static VariableArray<VariableArray<int>, int[][]> WorkerTaskIndex;

    // Prior distributions
    public static Gaussian workerAbilityPrior;
    public static Gaussian taskDifficultyPrior;
    public static Gamma discriminationPrior;

    // Posterior distributions
    public static Gaussian[] workerAbilityPosterior;
    public static Gaussian[] taskDifficultyPosterior;
    public static Gamma[] discriminationPosterior;
    public static Discrete[] trueLabelPosterior;

    // Inference engine
    public static InferenceEngine Engine;

    /// <summary>
    /// The number of inference iterations.
    /// </summary>
    public static int NumberOfIterations { get; set; }

    /// <summary>
    /// Creates a DARE model instance.
    /// </summary>
    public DARE()
    {
        NumberOfIterations = 35;
    }

    /// <summary>
    /// Initializes the ranges, the generative process and the inference engine of the DARE model.
    /// </summary>
    /// <param name="taskCount">The number of tasks.</param>
    /// <param name="workerCount">The number of workers.</param>
    /// <param name="labelCount">The number of labels.</param>
    public static void CreateModel(int taskCount, int workerCount, int labelCount)
    {
        DefineVariablesAndRanges(taskCount, workerCount, labelCount);
        DefineGenerativeProcess();
        DefineInferenceEngine();
    }

    /// <summary>
    /// Initializes the ranges of the variables.
    /// </summary>
    /// <param name="taskCount">The number of tasks.</param>
    /// <param name="workerCount">The number of workers.</param>
    /// <param name="labelCount">The number of labels.</param>
    public static void DefineVariablesAndRanges(int taskCount, int workerCount, int labelCount)
    {
        worker = new Range(workerCount).Named("worker");
        task = new Range(taskCount).Named("task");
        choice = new Range(labelCount).Named("choice");

        // The tasks for each worker
        WorkerTaskCount = Variable.Array<int>(worker).Named("WorkerTaskCount");
        workerTask = new Range(WorkerTaskCount[worker]).Named("workerTask");
        WorkerTaskIndex = Variable.Array(Variable.Array<int>(workerTask), worker).Named("WorkerTaskIndex");
        WorkerTaskIndex.SetValueRange(task);

        // Worker ability for each worker
        workerAbilityPrior = new Gaussian(0, 1);
        workerAbility = Variable.Array<double>(worker).Named("workerAbility");
        workerAbility[worker] = Variable.Random(workerAbilityPrior).ForEach(worker);

        // Task difficulty for each task
        taskDifficultyPrior = new Gaussian(0, 1);
        taskDifficulty = Variable.Array<double>(task).Named("taskDifficulty");
        taskDifficulty[task] = Variable.Random(taskDifficultyPrior).ForEach(task);

        // Discrimination of each task
        discriminationPrior = Gamma.FromMeanAndVariance(1, 0.01);
        discrimination = Variable.Array<double>(task).Named("discrimination");
        discrimination[task] = Variable.Random(discriminationPrior).ForEach(task);

        // Unobserved true label for each task
        trueLabel = Variable.Array<int>(task).Named("trueLabel");
        trueLabel[task] = Variable.DiscreteUniform(choice).ForEach(task);

        // Worker label
        workerResponse = Variable.Array(Variable.Array<int>(workerTask), worker).Named("workerResponse");
    }

    /// <summary>
    /// Defines the DARE generative process.
    /// </summary>
    public static void DefineGenerativeProcess()
    {
        // The process that generates the worker's label
        using (Variable.ForEach(worker))
        {
            using (Variable.ForEach(workerTask))
            {
                var index = WorkerTaskIndex[worker][workerTask];
                var advantage = (workerAbility[worker] - taskDifficulty[index]).Named("advantage");
                var advantageNoisy = Variable.GaussianFromMeanAndPrecision(advantage, discrimination[index]).Named("advantageNoisy");
                var correct = (advantageNoisy > 0).Named("correct");
                using (Variable.If(correct))
                    workerResponse[worker][workerTask] = trueLabel[index];
                using (Variable.IfNot(correct))
                    workerResponse[worker][workerTask] = Variable.DiscreteUniform(choice);
            }
        }
    }

    /// <summary>
    /// Initializes the DARE inference engine.
    /// </summary>
    public static void DefineInferenceEngine()
    {
        Engine = new InferenceEngine(new ExpectationPropagation());
        Engine.Compiler.UseParallelForLoops = true;
        Engine.ShowProgress = false;
        Engine.Compiler.WriteSourceFiles = false;
    }

    /// <summary>
    /// Attaches the data to the workers' labels.
    /// </summary>
    /// <param name="taskIndices">The matrix of the task indices (columns) of each worker (rows).</param>
    /// <param name="workerLabels">The matrix of the labels (columns) of each worker (rows).</param>
    public static void AttachData(int[][] taskIndices, int[][] workerLabels)
    {
        WorkerTaskCount.ObservedValue = taskIndices.Select(tasks => tasks.Length).ToArray();
        WorkerTaskIndex.ObservedValue = taskIndices;
        workerResponse.ObservedValue = workerLabels;
    }

    /// <summary>
    /// Infers the posteriors of DARE using the attached data and priors.
    /// </summary>
    /// <param name="taskIndices">The matrix of the task indices (columns) of each worker (rows).</param>
    /// <param name="workerLabels">The matrix of the labels (columns) of each worker (rows).</param>
    public static void Infer(int[][] taskIndices, int[][] workerLabels)
    {
        AttachData(taskIndices, workerLabels);
        Engine.NumberOfIterations = NumberOfIterations;
        workerAbility.AddAttribute(new Sequential());   // needed to get stable convergence
        taskDifficulty.AddAttribute(new Sequential());  // needed to get stable convergence
        workerAbilityPosterior = Engine.Infer<Gaussian[]>(workerAbility);
        taskDifficultyPosterior = Engine.Infer<Gaussian[]>(taskDifficulty);
        discriminationPosterior = Engine.Infer<Gamma[]>(discrimination);
        trueLabelPosterior = Engine.Infer<Discrete[]>(trueLabel);
    }
}
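For reference, this is roughly how I drive the class from my calling code (a minimal sketch; taskCount, workerCount, labelCount and the taskIndices/workerLabels jagged arrays are built by my data-loading code and are just illustrative names here):

// Hypothetical driver code for the DARE class above
DARE.CreateModel(taskCount, workerCount, labelCount);
DARE.Infer(taskIndices, workerLabels);
Discrete[] labelPosteriors = DARE.trueLabelPosterior;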
Friday, March 27, 2015 10:30 AM
All replies
-
How is this model different from the one in your other thread?
Friday, March 27, 2015 10:57 AM
-
Hi cindyak
Can you
(a) Remind me where you got this code from
(b) Say what you changed
(c) Confirm that you are using a serial schedule (I believe this should be the default with Infer.NET 2.6, but it is better to set it explicitly; see the sketch after this list).
(d) Possibly make a data set available that triggers this exception.

John
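For (c), setting this explicitly would be something along these lines (just a sketch; I'm assuming the Infer.NET 2.6 compiler option name here, together with the Sequential attributes your posted code already adds):

// In DefineInferenceEngine(): ask the model compiler for a serial schedule explicitly
Engine = new InferenceEngine(new ExpectationPropagation());
Engine.Compiler.UseSerialSchedules = true;
// The Sequential attributes on workerAbility/taskDifficulty in your Infer() method work together with this.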
Monday, March 30, 2015 12:44 PM (Owner)
-
Hi John Guiver,
(a) I got the original code from the Infer.NET 2.6\Samples\C#\ExamplesBrowser\DifficultyAbility.cs file.
(b) I made a small change in the DefineGenerativeProcess() function. In the original source code, each worker performs all the tasks. In my code, I changed it so that a worker only performs a subset of the tasks from the task pool.
/// <summary>
/// Defines the DARE generative process.
/// </summary>
public static void DefineGenerativeProcess()
{
    // The process that generates the worker's label
    using (Variable.ForEach(worker))
    {
        using (Variable.ForEach(workerTask)) // instead of all the tasks, a worker performs a subset of tasks
        {
            var index = WorkerTaskIndex[worker][workerTask];
            var advantage = (workerAbility[worker] - taskDifficulty[index]).Named("advantage");
            var advantageNoisy = Variable.GaussianFromMeanAndPrecision(advantage, discrimination[index]).Named("advantageNoisy");
            var correct = (advantageNoisy > 0).Named("correct");
            using (Variable.If(correct))
                workerResponse[worker][workerTask] = trueLabel[index];
            using (Variable.IfNot(correct))
                workerResponse[worker][workerTask] = Variable.DiscreteUniform(choice);
        }
    }
}
(c) I have already changed that, following the Customizing the algorithm initialization tutorial.
(d) Dataset
- Edited by cindyak Tuesday, March 31, 2015 3:33 AM
Tuesday, March 31, 2015 3:32 AM
-
Hi Cindy
I am quite confused because the code in Infer.NET 2.6\Samples\C#\ExamplesBrowser\DifficultyAbility.cs doesn't look anything like your code. Your code looks more like the Crowdsourcing code from http://blogs.msdn.com/b/infernet_team_blog/archive/2014/06/25/community-based-bayesian-classifier-combination.aspx.
As I don't have your complete code, it is difficult to run this and help figure out the problem. But in general we use the Subarray factor to efficiently deal with the non-dense case. Something like:
using (Variable.ForEach(worker))
{
    var workerTaskDifficulty = Variable.Subarray(TaskDifficulty, WorkerTaskIndex[worker]);
    using (Variable.ForEach(workerTask))
    {
        var advantage = workerAbility[worker] - workerTaskDifficulty[workerTask];
        ...
    }
}
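The same Subarray pattern would apply to the other per-task arrays (discrimination and trueLabel) as well; roughly (again, just a sketch):

using (Variable.ForEach(worker))
{
    // Gather only the per-task quantities that this worker actually touches
    var workerTaskDifficulty = Variable.Subarray(taskDifficulty, WorkerTaskIndex[worker]);
    var workerTaskDiscrimination = Variable.Subarray(discrimination, WorkerTaskIndex[worker]);
    var workerTrueLabel = Variable.Subarray(trueLabel, WorkerTaskIndex[worker]);
    using (Variable.ForEach(workerTask))
    {
        var advantage = workerAbility[worker] - workerTaskDifficulty[workerTask];
        var advantageNoisy = Variable.GaussianFromMeanAndPrecision(advantage, workerTaskDiscrimination[workerTask]);
        var correct = advantageNoisy > 0;
        using (Variable.If(correct))
            workerResponse[worker][workerTask] = workerTrueLabel[workerTask];
        using (Variable.IfNot(correct))
            workerResponse[worker][workerTask] = Variable.DiscreteUniform(choice);
    }
}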
John
Tuesday, March 31, 2015 9:56 AM (Owner)
-
Hi John Guiver,
Thanks for your response.
Here is my entire runnable source code. You will find that the EditDARE class is similar to the original source code.
// Using directives (not included in the original post); these are believed to be the
// namespaces needed for Infer.NET 2.6:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using MicrosoftResearch.Infer;
using MicrosoftResearch.Infer.Distributions;
using MicrosoftResearch.Infer.Maths;
using MicrosoftResearch.Infer.Models;
using MicrosoftResearch.Infer.Utils;

/// <summary>
/// The class for the main program.
/// </summary>
class InferLabel
{
    /// <summary>
    /// The data mapping.
    /// </summary>
    public static DataMapping Mapping { get; private set; }

    static string Dataset = "CFWithTrueLabels";

    static void Main(string[] args)
    {
        var data = Datum.LoadData(@".\Data\" + Dataset + ".csv");
        Mapping = new DataMapping(data);
        var labelsPerWorkerIndex = Mapping.GetLabelsPerWorkerIndex(data);
        var TaskPerWorkerIndex = Mapping.GetTaskIndicesPerWorkerIndex(data);
        EditDARE.RunDARE(Mapping.TaskCount, Mapping.WorkerCount, Mapping.LabelCount, labelsPerWorkerIndex, TaskPerWorkerIndex);
    }
}

/// <summary>
/// Edited version of the DARE model.
/// Reference: ICML 2012, How to Grade a Test Without Knowing the Answers.
/// </summary>
public class EditDARE
{
    #region Fields

    // Constants
    public const double ABILITY_PRIOR_MEAN = 0;
    public const double ABILITY_PRIOR_VARIANCE = 1;
    public const double DIFFICULTY_PRIOR_MEAN = 0;
    public const double DIFFICULTY_PRIOR_VARIANCE = 1;
    public const double DISCRIM_PRIOR_SHAPE = 1;
    public const double DISCRIM_PRIOR_SCALE = 0.01;
    const int NUMBER_OF_ITERATIONS = 35;

    // Ranges - size of the variables
    public static Range task;
    public static Range worker;
    public static Range choice;
    public static Range workerTask;

    // Main variables in the model
    public static VariableArray<double> workerAbility;
    public static VariableArray<double> taskDifficulty;
    public static VariableArray<double> discrimination;
    public static VariableArray<int> trueLabel;
    public static VariableArray<VariableArray<int>, int[][]> workerResponse;

    // Variables in model
    public static Variable<int> WorkerCount;
    public static VariableArray<int> WorkerTaskCount;
    public static VariableArray<VariableArray<int>, int[][]> WorkerTaskIndex;

    // Prior distributions
    public static Gaussian workerAbilityPrior;
    public static Gaussian taskDifficultyPrior;
    public static Gamma discriminationPrior;

    // Posterior distributions
    public static Gaussian[] workerAbilityPosterior;
    public static Gaussian[] taskDifficultyPosterior;
    public static Gamma[] discriminationPosterior;
    public static Discrete[] trueLabelPosterior;

    // Inference engine
    public static InferenceEngine Engine;

    #endregion

    public static void RunDARE(int nQuestions, int nSubjects, int nChoices, int[][] workerLabels, int[][] taskIndices)
    {
        worker = new Range(nSubjects).Named("worker");
        task = new Range(nQuestions).Named("task");
        choice = new Range(nChoices).Named("choice");

        // The tasks for each worker
        WorkerTaskCount = Variable.Array<int>(worker).Named("WorkerTaskCount");
        workerTask = new Range(WorkerTaskCount[worker]).Named("workerTask");
        WorkerTaskIndex = Variable.Array(Variable.Array<int>(workerTask), worker).Named("WorkerTaskIndex");
        WorkerTaskIndex.SetValueRange(task);

        // Worker ability for each worker
        workerAbilityPrior = new Gaussian(ABILITY_PRIOR_MEAN, ABILITY_PRIOR_VARIANCE);
        workerAbility = Variable.Array<double>(worker).Named("workerAbility");
        workerAbility[worker] = Variable.Random(workerAbilityPrior).ForEach(worker);
        workerAbility[worker].InitialiseTo(workerAbilityPrior);

        // Task difficulty for each task
        taskDifficultyPrior = new Gaussian(DIFFICULTY_PRIOR_MEAN, DIFFICULTY_PRIOR_VARIANCE);
        taskDifficulty = Variable.Array<double>(task).Named("taskDifficulty");
        taskDifficulty[task] = Variable.Random(taskDifficultyPrior).ForEach(task);
        taskDifficulty[task].InitialiseTo(taskDifficultyPrior);

        // Discrimination of each task
        // Note: the constants are named shape/scale but are passed to FromMeanAndVariance,
        // so here they act as a mean and variance.
        discriminationPrior = Gamma.FromMeanAndVariance(DISCRIM_PRIOR_SHAPE, DISCRIM_PRIOR_SCALE);
        discrimination = Variable.Array<double>(task).Named("discrimination");
        discrimination[task] = Variable.Random(discriminationPrior).ForEach(task);
        discrimination[task].InitialiseTo(discriminationPrior);

        // Unobserved true label for each task
        trueLabel = Variable.Array<int>(task).Named("trueLabel");
        trueLabel[task] = Variable.DiscreteUniform(choice).ForEach(task);

        // Worker label
        workerResponse = Variable.Array(Variable.Array<int>(workerTask), worker).Named("workerResponse");

        // The process that generates the worker's label
        using (Variable.ForEach(worker))
        {
            var workerTaskDifficulty = Variable.Subarray(taskDifficulty, WorkerTaskIndex[worker]);
            var workerTaskDiscrimination = Variable.Subarray(discrimination, WorkerTaskIndex[worker]);
            var TrueLabel = Variable.Subarray(trueLabel, WorkerTaskIndex[worker]);
            using (Variable.ForEach(workerTask))
            {
                var advantage = (workerAbility[worker] - workerTaskDifficulty[workerTask]).Named("advantage");
                var advantageNoisy = Variable.GaussianFromMeanAndPrecision(advantage, workerTaskDiscrimination[workerTask]).Named("advantageNoisy");
                var correct = (advantageNoisy > 0).Named("correct");
                using (Variable.If(correct))
                    workerResponse[worker][workerTask] = TrueLabel[workerTask];
                using (Variable.IfNot(correct))
                    workerResponse[worker][workerTask] = Variable.DiscreteUniform(choice);
            }
        }

        Engine = new InferenceEngine(new ExpectationPropagation());
        Engine.Compiler.UseParallelForLoops = true;
        Engine.ShowProgress = false;
        Engine.Compiler.WriteSourceFiles = false;

        // Attach the data to the workers' labels
        WorkerTaskCount.ObservedValue = taskIndices.Select(tasks => tasks.Length).ToArray();
        WorkerTaskIndex.ObservedValue = taskIndices;
        workerResponse.ObservedValue = workerLabels;

        Engine.NumberOfIterations = NUMBER_OF_ITERATIONS;
        workerAbility.AddAttribute(new Sequential());   // needed to get stable convergence
        taskDifficulty.AddAttribute(new Sequential());  // needed to get stable convergence
        trueLabelPosterior = Engine.Infer<Discrete[]>(trueLabel);
    }
}

/// <summary>
/// This class represents a single datum, and has methods to read in data.
/// </summary>
public class Datum
{
    /// <summary>
    /// The worker id.
    /// </summary>
    public string WorkerId;

    /// <summary>
    /// The task id.
    /// </summary>
    public string TaskId;

    /// <summary>
    /// The worker's label.
    /// </summary>
    public int WorkerLabel;

    /// <summary>
    /// The task's gold label (optional).
    /// </summary>
    public int? GoldLabel;

    /// <summary>
    /// Loads the data file in the format (worker id, task id, worker label, ?gold label).
    /// </summary>
    /// <param name="filename">The data file.</param>
    /// <returns>The list of parsed data.</returns>
    public static IList<Datum> LoadData(string filename)
    {
        var result = new List<Datum>();
        using (var reader = new StreamReader(filename))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var strarr = line.Split(',');
                int length = strarr.Length;
                //if (length < 3 || length > 4) // Filter bad entries!!
                //    continue;
                int workerLabel = int.Parse(strarr[2]);
                //if (workerLabel < -4 || workerLabel > 4) // Filter bad entries!!
                //    continue;
                var datum = new Datum()
                {
                    WorkerId = strarr[0],
                    TaskId = strarr[1],
                    WorkerLabel = workerLabel,
                };
                if (length == 4)
                    datum.GoldLabel = int.Parse(strarr[3]);
                else
                    datum.GoldLabel = null;
                result.Add(datum);
            }
        }
        return result;
    }
}

/// <summary>
/// Data mapping class. This class manages the mapping between the data (which is
/// in the form of task ids, worker ids, and labels) and the model data (which is in terms of indices).
/// </summary>
public class DataMapping
{
    #region Fields

    /// <summary>
    /// The mapping from the worker index to the worker id.
    /// </summary>
    public string[] WorkerIndexToId;

    /// <summary>
    /// The mapping from the worker id to the worker index.
    /// </summary>
    public Dictionary<string, int> WorkerIdToIndex;

    /// <summary>
    /// The mapping from the community id to the community index.
    /// </summary>
    public Dictionary<string, int> CommunityIdToIndex;

    /// <summary>
    /// The mapping from the community index to the community id.
    /// </summary>
    public string[] CommunityIndexToId;

    /// <summary>
    /// The mapping from the task index to the task id.
    /// </summary>
    public string[] TaskIndexToId;

    /// <summary>
    /// The mapping from the task id to the task index.
    /// </summary>
    public Dictionary<string, int> TaskIdToIndex;

    /// <summary>
    /// The lower bound of the labels range.
    /// </summary>
    public int LabelMin;

    /// <summary>
    /// The upper bound of the labels range.
    /// </summary>
    public int LabelMax;

    #endregion

    #region Properties

    /// <summary>
    /// The enumerable list of data.
    /// </summary>
    public IEnumerable<Datum> Data { get; private set; }

    /// <summary>
    /// The number of label values.
    /// </summary>
    public int LabelCount { get { return LabelMax - LabelMin + 1; } }

    /// <summary>
    /// The number of workers.
    /// </summary>
    public int WorkerCount { get { return WorkerIndexToId.Length; } }

    /// <summary>
    /// The number of tasks.
    /// </summary>
    public int TaskCount { get { return TaskIndexToId.Length; } }

    #endregion

    #region Methods

    /// <summary>
    /// Creates a data mapping.
    /// </summary>
    /// <param name="data">The data.</param>
    /// <param name="numCommunities">The number of communities.</param>
    /// <param name="labelMin">The lower bound of the labels range.</param>
    /// <param name="labelMax">The upper bound of the labels range.</param>
    public DataMapping(IEnumerable<Datum> data, int numCommunities = -1, int labelMin = int.MaxValue, int labelMax = int.MinValue)
    {
        WorkerIndexToId = data.Select(d => d.WorkerId).Distinct().ToArray();
        WorkerIdToIndex = WorkerIndexToId.Select((id, idx) => new KeyValuePair<string, int>(id, idx)).ToDictionary(x => x.Key, y => y.Value);
        TaskIndexToId = data.Select(d => d.TaskId).Distinct().ToArray();
        TaskIdToIndex = TaskIndexToId.Select((id, idx) => new KeyValuePair<string, int>(id, idx)).ToDictionary(x => x.Key, y => y.Value);
        var labels = data.Select(d => d.WorkerLabel).Distinct().OrderBy(lab => lab).ToArray();

        if (labelMin <= labelMax)
        {
            LabelMin = labelMin;
            LabelMax = labelMax;
        }
        else
        {
            LabelMin = labels.Min();
            LabelMax = labels.Max();
        }

        Data = data;

        if (numCommunities > 0)
        {
            CommunityIndexToId = Util.ArrayInit(numCommunities, comm => "Community" + comm);
            CommunityIdToIndex = CommunityIndexToId.Select((id, idx) => new KeyValuePair<string, int>(id, idx)).ToDictionary(x => x.Key, y => y.Value);
        }
    }

    /// <summary>
    /// Returns the matrix of the task indices (columns) of each worker (rows).
    /// </summary>
    /// <param name="data">The data.</param>
    /// <returns>The matrix of the task indices (columns) of each worker (rows).</returns>
    public int[][] GetTaskIndicesPerWorkerIndex(IEnumerable<Datum> data)
    {
        int[][] result = new int[WorkerCount][];
        for (int i = 0; i < WorkerCount; i++)
        {
            var wid = WorkerIndexToId[i];
            result[i] = data.Where(d => d.WorkerId == wid).Select(d => TaskIdToIndex[d.TaskId]).ToArray();
        }
        return result;
    }

    /// <summary>
    /// Returns the matrix of the labels (columns) of each worker (rows).
    /// </summary>
    /// <param name="data">The data.</param>
    /// <returns>The matrix of the labels (columns) of each worker (rows).</returns>
    public int[][] GetLabelsPerWorkerIndex(IEnumerable<Datum> data)
    {
        int[][] result = new int[WorkerCount][];
        for (int i = 0; i < WorkerCount; i++)
        {
            var wid = WorkerIndexToId[i];
            result[i] = data.Where(d => d.WorkerId == wid).Select(d => d.WorkerLabel - LabelMin).ToArray();
        }
        return result;
    }

    /// <summary>
    /// Returns the gold labels of each task.
    /// </summary>
    /// <returns>The dictionary keyed by task id, where the value is the gold label.</returns>
    public Dictionary<string, int?> GetGoldLabelsPerTaskId()
    {
        // Gold labels that are not consistent are returned as null
        // Labels are returned as indexed by task index
        return Data.GroupBy(d => d.TaskId).
            Select(t => t.GroupBy(d => d.GoldLabel).Where(d => d.Key != null)).
            Where(gold_d => gold_d.Count() > 0).
            Select(gold_d =>
            {
                int count = gold_d.Distinct().Count();
                var datum = gold_d.First().First();
                if (count == 1)
                {
                    var gold = datum.GoldLabel;
                    if (gold != null)
                        gold = gold.Value - LabelMin;
                    return new Tuple<string, int?>(datum.TaskId, gold);
                }
                else
                {
                    return new Tuple<string, int?>(datum.TaskId, (int?)null);
                }
            }).ToDictionary(tup => tup.Item1, tup => tup.Item2);
    }

    /// <summary>
    /// For each task, gets the majority vote label if it is unique.
    /// </summary>
    /// <returns>The list of majority vote labels.</returns>
    public int?[] GetMajorityVotesPerTaskIndex()
    {
        return Data.GroupBy(d => TaskIdToIndex[d.TaskId]).
            OrderBy(g => g.Key).
            Select(t => t.GroupBy(d => d.WorkerLabel - LabelMin).
                Select(g => new { label = g.Key, count = g.Count() })).
            Select(arr =>
            {
                int max = arr.Max(a => a.count);
                int[] majorityLabs = arr.Where(a => a.count == max).Select(a => a.label).ToArray();
                if (majorityLabs.Length == 1)
                    return (int?)majorityLabs[0];
                else
                    return null;
            }).ToArray();
    }

    /// <summary>
    /// For each task, gets the empirical label distribution.
    /// </summary>
    /// <returns>The list of empirical label distributions.</returns>
    public Discrete[] GetVoteDistribPerTaskIndex()
    {
        return Data.GroupBy(d => TaskIdToIndex[d.TaskId]).
            OrderBy(g => g.Key).
            Select(t => t.GroupBy(d => d.WorkerLabel - LabelMin).
                Select(g => new { label = g.Key, count = g.Count() })).
            Select(arr =>
            {
                Vector v = Vector.Zero(LabelCount);
                foreach (var a in arr)
                    v[a.label] = (double)a.count;
                return new Discrete(v);
            }).ToArray();
    }

    #endregion
}
I hope you can tell me why I am getting the improper distribution exception.
Thanks
- Edited by cindyak Tuesday, March 31, 2015 1:47 PM
Tuesday, March 31, 2015 1:45 PM
-
Precision random variables are always tricky to learn. I was able to run with a much tighter prior:
discriminationPrior = Gamma.FromShapeAndScale(1, 0.002);
but I'm not sure it gives you much benefit. It is possible that you could do some damping, but I don't have time to look into this right now. You could also just dispense with discrimination for now and set it as a point mass.
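If you try the point-mass route, one way to sketch it (the fixed value of 1.0 here is just an arbitrary choice, not something I've tuned) is:

// Fix discrimination to a known value instead of learning it from a Gamma prior.
// This replaces the discriminationPrior / Variable.Random / InitialiseTo lines;
// the rest of the model, including Variable.Subarray(discrimination, ...), can stay as it is.
discrimination = Variable.Observed(Util.ArrayInit(nQuestions, t => 1.0), task).Named("discrimination");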
By the way, you don't need all your Initialise statements.
John
Tuesday, March 31, 2015 4:41 PM (Owner)
-
Hi John Guiver,
Thanks a lot, that was very helpful. It seems to be working after changing the discrimination prior scale to 0.002.
But I noticed that different datasets require different discrimination prior scales and numbers of iterations to run seamlessly without throwing an improper distribution exception.
For example,
- Dataset "D" produce best result when discrimination prior scale = 0.01 and number of iteration 7 to 17
- Dataset "CF" produce best result when discrimination prior scale = 0.002 and number of iteration = 35
Now Is there any methodical way to learn how to select these numbers?
Wednesday, April 1, 2015 2:44 AM