More complete example that involves reading data from xlsb or csv file

  • Question

  • Hi, I am a beginner in .NET programming but would desperately like to use Infer.NET to investigate certain problems. Part of my frustration is that, despite spending many, many hours poring over the documentation, I have not been able to make a single step towards applying any of it. Perhaps I am just being dumb here, but I am able to do plenty of things in JMP, Matlab, etc. I think part of my problem is that all the examples use dummy data that is either generated from random numbers or keyed in as part of the source code. I would be incredibly appreciative if someone could help me see how to transform the Gaussian process regression tools into something where I could specify an Excel file (preferably XLSB so I am not limited to 65k rows) or, even better, a CSV/tab-delimited text file with a header row, which is imported into several columns, each named after its header. I'd like to be able to just specify this file, specify which columns are the input variables and which are the output variables, and then "train" the system to make predictions of new outputs given inputs it hasn't seen before, supplied in other, identically formatted files. Once I saw that, I am confident I would be able to generalize it to use other parts of Infer.NET, or even switch to loading data from other kinds of sources, but with the "toy" examples (no offense intended) included it is proving impossible for me to make the leap to having this be useful. Thanks so much in advance for any assistance.
    Saturday, October 22, 2011 8:37 PM

Answers

  • There are several examples of loading data in the installed solutions (for example the Multiclass Bayes Point Machine and the LDA in the recent release). For your particular request, the following code will read arbitrary input columns, and one output column in any specified order from a comma or tab delimited file.

    static Tuple<Vector[], bool[]> LoadCSV(string fileName, string[] inputFields, string outputField)
    {
      char[] separators = new char[] { ',', '\t' };
      List<Vector> inputList = new List<Vector>();
      List<bool> outputList = new List<bool>();
      int numInputs = inputFields.Length;
      using (StreamReader sr = new StreamReader(fileName))
      {
        string str = sr.ReadLine();  // Header
        int[] inputIndices = new int[inputFields.Length];
        List<string> columnNames = new List<string>(str.Split(separators));
        for (int i = 0; i < numInputs; i++)
        {
          int indx = columnNames.FindIndex(s => s == inputFields[i]);
          if (indx < 0)
            throw new ApplicationException("Cannot find column name " + inputFields[i]);
          inputIndices[i] = indx;
        }
        int outputIndex = columnNames.FindIndex(s => s == outputField);
        if (outputIndex < 0)
          throw new ApplicationException("Cannot find column name " + outputField);

        while ((str = sr.ReadLine()) != null)
        {
          string[] arr = str.Split(separators);
          Vector v = Vector.Zero(numInputs);
          for (int i = 0; i < numInputs; i++)
            v[i] = double.Parse(arr[inputIndices[i]]);
          inputList.Add(v);
          bool output = bool.Parse(arr[outputIndex]);
          outputList.Add(output);
        }
      }
      return new Tuple<Vector[], bool[]>(inputList.ToArray(), outputList.ToArray());
    }

    Now let's assume you have a file testdata.csv sitting in the project folder which looks as follows:

    a,b,c,d,e
    9999,0,TRUE,junk,0
    9999,1,TRUE,junk,0
    9999,0,FALSE,junk,1
    9999,0.5,TRUE,junk,0
    9999,0,FALSE,junk,1.5
    9999,1,FALSE,junk,0.5

    You can then call this as follows in the standard Gaussian Process example, in place of the existing hard-coded data:

    var data = LoadCSV(@"..\..\testdata.csv", new string[] {"e", "b"}, "c");
    Vector[] inputs = data.Item1;
    bool[] outputs = data.Item2;

    You should be able to easily convert this code to allow other types of output.
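
One way to allow other output types is to pass the output parser in as a delegate. The following is a minimal sketch, not part of Infer.NET: the class name CsvLoader, the method Load and the parseOutput parameter are hypothetical names chosen for illustration. To keep it free of library dependencies, it reads from a TextReader and returns plain double[] rows rather than Vector objects.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class CsvLoader
{
    // Reads the named input columns as doubles and one output column as TOut.
    // parseOutput is a caller-supplied converter, e.g. bool.Parse or
    // s => double.Parse(s, CultureInfo.InvariantCulture).
    public static Tuple<double[][], TOut[]> Load<TOut>(
        TextReader reader, string[] inputFields, string outputField,
        Func<string, TOut> parseOutput)
    {
        char[] separators = { ',', '\t' };
        string header = reader.ReadLine();
        if (header == null)
            throw new InvalidDataException("File is empty");
        string[] columnNames = Array.ConvertAll(header.Split(separators), s => s.Trim());

        // Map each requested column name to its position in the header row.
        int[] inputIndices = new int[inputFields.Length];
        for (int i = 0; i < inputFields.Length; i++)
            inputIndices[i] = FindColumn(columnNames, inputFields[i]);
        int outputIndex = FindColumn(columnNames, outputField);

        var inputs = new List<double[]>();
        var outputs = new List<TOut>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length == 0) continue;  // skip blank lines
            string[] cells = line.Split(separators);
            double[] row = new double[inputIndices.Length];
            for (int i = 0; i < row.Length; i++)
                row[i] = double.Parse(cells[inputIndices[i]].Trim(), CultureInfo.InvariantCulture);
            inputs.Add(row);
            outputs.Add(parseOutput(cells[outputIndex].Trim()));
        }
        return Tuple.Create(inputs.ToArray(), outputs.ToArray());
    }

    private static int FindColumn(string[] columnNames, string name)
    {
        int index = Array.IndexOf(columnNames, name);
        if (index < 0)
            throw new ArgumentException("Cannot find column name " + name);
        return index;
    }
}
```

For a file on disk this could be called as CsvLoader.Load<double>(new StreamReader(fileName), inputFields, outputField, s => double.Parse(s, CultureInfo.InvariantCulture)), with each row then converted using Vector.FromArray(row) before being handed to the model.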

    • Marked as answer by Dicklestein Wednesday, October 26, 2011 11:22 AM
    Wednesday, October 26, 2011 10:09 AM
    Owner

All replies

  • Thank you so much John, this is amazing and exactly what I needed! I really appreciate it.
    Wednesday, October 26, 2011 11:22 AM
  • Hi John, I made a lot of progress with your help but I seem to be stuck again. I am importing from a csv file that has 11 input variables of type double and one output variable that is also a double. I tried to make the obvious modifications to the test solution, but I am getting this error when I compile:

    "Vectors have different size

    Parameter name: b"

    the error occurs in the Model_EP.cs class at line 302:

    this.f_rep0_B[j] = SparseGPOp.FuncAverageConditional(vdouble__1_use_B[j], f_rep0_F[j], this.X[j], this.f_rep0_B[j]);

     

    I did get the example to work when I was using just two input variables as in the sample provided in the demo solution. I suspect also that there is an easier way to construct the basis for a higher dimensional input space than to just type in lots of vectors. 

    Here is my code thus far:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Linq;
    using System.IO;
    using MicrosoftResearch.Infer;
    using MicrosoftResearch.Infer.Models;
    using MicrosoftResearch.Infer.Maths;
    using MicrosoftResearch.Infer.Distributions;
    using MicrosoftResearch.Infer.Distributions.Kernels;
    
    namespace GaussianProcessExample
    {
    	class Program
    	{
    
            static Tuple<Vector[], double[]> LoadCSV(string fileName, string[] inputFields, string outputField)
            {
                char[] separators = new char[] { ',', '\t' };
                List<Vector> inputList = new List<Vector>();
                List<double> outputList = new List<double>();
                int numInputs = inputFields.Length - 1;

                using (StreamReader sr = new StreamReader(fileName))
                {
                    string str = sr.ReadLine();  // Header
                    int[] inputIndices = new int[inputFields.Length];
                    List<string> columnNames = new List<string>(str.Split(separators));
                    for (int i = 0; i < numInputs; i++)
                    {
                        int indx = columnNames.FindIndex(s => s == inputFields[i]);
                        if (indx < 0)
                            throw new ApplicationException("Cannot find column name " + inputFields[i]);
                        inputIndices[i] = indx;
                    }

                    int outputIndex = columnNames.FindIndex(s => s == outputField);
                    if (outputIndex < 0)
                        throw new ApplicationException("Cannot find column name " + outputField);

                    while ((str = sr.ReadLine()) != null)
                    {
                        string[] arr = str.Split(separators);
                        Vector v = Vector.Zero(numInputs);
                        for (int i = 0; i < numInputs; i++)
                            v[i] = double.Parse(arr[inputIndices[i]]);
                        inputList.Add(v);
                        double output = double.Parse(arr[outputIndex]);
                        outputList.Add(output);
                    }
                }
                return new Tuple<Vector[], double[]>(inputList.ToArray(), outputList.ToArray());
            }
    
    
    		static void Main(string[] args)
    		{
                //redirect console output to text file
    
                FileStream ostrm;
                StreamWriter writer;
                TextWriter oldOut = Console.Out;
                try
                {
                    ostrm = new FileStream("./Redirect.txt", FileMode.OpenOrCreate, FileAccess.Write);
                    writer = new StreamWriter(ostrm);
                }
                catch (Exception e)
                {
                    Console.WriteLine("Cannot open Redirect.txt for writing");
                    Console.WriteLine(e.Message);
                    return;
                }
                Console.SetOut(writer);
    
                //
    
                var data = LoadCSV("C:\\Users\\Jeff\\Documents\\Infer.NET 2.4\\Example Solutions\\GaussianProcess\\sample.csv", new string[] {"inp1","inp2","inp3", "inp4", "inp5", "inp6","inp7", "inp7", "inp9", "inp10", "inp11" }, "output");
                Vector[] inputs = data.Item1;
                double[] output = data.Item2;
    
    			// Open an evidence block to allow model scoring
    			Variable<bool> evidence = Variable.Bernoulli(0.5).Named("evidence");
    			IfBlock block = Variable.If(evidence);
    
    			// Set up the GP prior, which will be filled in later
    			Variable<SparseGP> prior = Variable.New<SparseGP>().Named("prior");
    
    			// The sparse GP variable - a distribution over functions
    			Variable<IFunction> f = Variable<IFunction>.Random(prior).Named("f");
    
    			// The locations to evaluate the function
    			VariableArray<Vector> x = Variable.Observed(inputs).Named("x");
    			Range j = x.Range.Named("j");
    
    			// The observation model
            
                VariableArray<double> y = Variable.Observed(output,j).Named("y");
                Variable<double> score = Variable.FunctionEvaluate(f, x[j]);
    
                y[j] = Variable.GaussianFromMeanAndVariance(score, 0.1);
    
    
    			// Close the evidence block
    			block.CloseBlock();
    
    			InferenceEngine engine = new InferenceEngine(new ExpectationPropagation());
    
    			// The basis
    			Vector[] basis = new Vector[] {
    				Vector.FromArray(new double[11] {.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }),
            Vector.FromArray(new double[11] {0, .1, 0, 0, 0, 0, 0, 0, 0, 0, 0 }),
            Vector.FromArray(new double[11] {0, 0, .1, 0, 0, 0, 0, 0, 0, 0, 0 }),
            Vector.FromArray(new double[11] {0, 0, 0, .1, 0, 0, 0, 0, 0, 0, 0 }),
            Vector.FromArray(new double[11] {0, 0, 0, 0, .1, 0, 0, 0, 0, 0, 0 }),
            Vector.FromArray(new double[11] {0, 0, 0, 0, 0, .1, 0, 0, 0, 0, 0 }),
            Vector.FromArray(new double[11] {0, 0, 0, 0, 0, 0, .1, 0, 0, 0, 0 }),
            Vector.FromArray(new double[11] {0, 0, 0, 0, 0, 0, 0, .1, 0, 0, 0 }),
            Vector.FromArray(new double[11] {0, 0, 0, 0, 0, 0, 0, 0, .1, 0, 0 }),
            Vector.FromArray(new double[11] {0, 0, 0, 0, 0, 0, 0, 0, 0, .1, 0 }),
            Vector.FromArray(new double[11] {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, .1 })
          };
    
    			for (int trial = 0; trial < 3; trial++) {
    				// The kernel
    				IKernelFunction kf;
    				if (trial == 0) {
    					kf = new SquaredExponential(-0.0);
    				} else if (trial == 1) {
    					kf = new SquaredExponential(-0.5);
    				} else {
                        kf = new NNKernel(new double[] { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 }, -1.0);
    				}
    
    				// Fill in the sparse GP prior
    				GaussianProcess gp = new GaussianProcess(new ConstantFunction(0), kf);
    				prior.ObservedValue = new SparseGP(new SparseGPFixed(gp, basis));
    
    				// Model score
    				double NNscore = engine.Infer<Bernoulli>(evidence).LogOdds;
    				Console.WriteLine("{0} evidence = {1}", kf, NNscore.ToString("g4"));
    			}
    
    			// Infer the posterior Sparse GP
    			SparseGP sgp = engine.Infer<SparseGP>(f);
    
    			// Check that training set is classified correctly
    			Console.WriteLine();
    			Console.WriteLine("Predictions on training set:");
    			for (int i = 0; i < inputs.Length; i++) {
    				Gaussian post = sgp.Marginal(inputs[i]);
    				double postMean = post.GetMean();
                    string comment = "";//System.Math.Abs((output - (postMean)))<.02 ? "correct" : "incorrect";
    				Console.WriteLine("f({0}) = {1} ({2})", inputs[i], post, comment);
             
      
    			}
    
                Console.SetOut(oldOut);
                writer.Close();
                ostrm.Close();
    		}
    
           
    	}
    }
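
The hand-typed basis in the code above is the 11-dimensional identity matrix scaled by 0.1, so it can be generated with a loop for any number of inputs. This is only a sketch of that construction: BasisBuilder and ScaledIdentity are hypothetical names, and the code says nothing about whether such a basis is a good choice for a given data set.

```csharp
using System;

public static class BasisBuilder
{
    // Returns `dimension` basis points; point i has `scale` in coordinate i
    // and 0 in every other coordinate.
    public static double[][] ScaledIdentity(int dimension, double scale)
    {
        if (dimension <= 0)
            throw new ArgumentOutOfRangeException("dimension");
        double[][] basis = new double[dimension][];
        for (int i = 0; i < dimension; i++)
        {
            basis[i] = new double[dimension];  // zero-initialised by the runtime
            basis[i][i] = scale;
        }
        return basis;
    }
}
```

Each row can then be wrapped with Vector.FromArray(row), as the hand-written version does, to obtain the Vector[] passed to SparseGPFixed.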
    


     

     

    Saturday, October 29, 2011 3:04 AM
  • Your call to LoadCSV has inp7 repeated and inp8 missing.

    John

    Monday, October 31, 2011 3:35 PM
    Owner
  • Actually, there is also a bug that has been introduced to LoadCSV - the numInputs is calculated as 1 short of what it should be.

    John

    Monday, October 31, 2011 3:40 PM
    Owner
  • OK thanks, that -1 thing got introduced somehow and messed it up-- now it works. 

     

    A couple questions for you:

    1) What is an acceptable basis? Is the simple one I used (basically the n-dimensional identity matrix) a reasonable choice? Why is the one in the example more complicated? Does this somehow make it more robust or efficient?

     

    2) If I wanted to change the example Gaussian process code to do things slightly differently, like using a different function besides a SquaredExponential, what are some of the main things I could try? Or if I use another kind of function, is it no longer technically a "Gaussian" process? And what are the other functions, if any, that tend to work the best in common use cases?

     

    Any insight into these questions would be greatly appreciated. Thanks again for your help. I'm really excited to get this working!

    Friday, November 4, 2011 2:08 AM
  • Also, after reading the tutorial on the GaussianProcess program again, I have a couple questions:

    1) You mention that for simplicity, you built a basis by hand in the example, but that a better way is to use clustering on the inputs to make the basis. Can this clustering be done using only built-in parts of infer.net, or would it require a separate clustering algorithm? Is there any sample code that I could see to do this? I'm guessing that this step is really the key to making these things computationally tractable for large data sets, since you can focus the optimization on the important parts of the space.

     

    2) You then say "Note that we could have built a model for making predictions (as in the Bayes Point Machine tutorial) but here for simplicity we call the Marginal method on the SparseGP posterior to get the distribution of the score at a particular input." I have reviewed the Bayes Point Machine tutorial and it is still unclear to me (yes, I am slow at this, sorry!) how I would change the GaussianProcess example to instead generate a prediction given an input vector. And once I got it to generate a prediction, would this be in the form of a distribution like "Bernoulli(0.9555)" (as in the BPM example)? Or would it simply be a number, which is just the predicted value of the output?

     

    Thanks again. 

    Friday, November 4, 2011 2:38 AM
  • 1) Infer.NET does not provide a ready-made general purpose clustering algorithm.  However you can construct one from the provided components.  The mixture of Gaussians tutorial provides a start.

    2) In the Bayes Point Machine tutorial, the modelling code was put into a method BayesPointMachine() that took the input values, a variable for the weights, and variables for the outputs.  You can do the same thing for the Gaussian Process: make a method that takes the input values, a variable for the unknown function f, and variables for the outputs.  The method would contain only this code:

    // The locations to evaluate the function
    Range j = y.Range;
    VariableArray<Vector> x = Variable.Observed(inputs, j).Named("x");
    // The prediction model
    Variable<double> score = Variable.FunctionEvaluate(f, x[j]);
    y[j] = (Variable.GaussianFromMeanAndVariance(score, 0.1) > 0);

    You'd call it in the same way as done there, where f is Variable<IFunction>.Random(sgp) and sgp is the posterior distribution from training.
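
As a starting point for choosing basis points by clustering, here is a sketch of plain k-means (Lloyd's algorithm) over the input rows. Note that this is a simpler, swapped-in technique and not the mixture of Gaussians model from the tutorial; it uses no Infer.NET components, and the names BasisClustering and KMeans, along with the iterations and seed parameters, are hypothetical.

```csharp
using System;

public static class BasisClustering
{
    // Plain k-means (Lloyd's algorithm): returns k centroids of the input rows.
    public static double[][] KMeans(double[][] points, int k, int iterations, int seed)
    {
        if (k <= 0 || k > points.Length)
            throw new ArgumentException("k must be between 1 and the number of points");
        int dim = points[0].Length;
        var rng = new Random(seed);

        // Initialise centroids from k distinct randomly chosen points
        // (partial Fisher-Yates shuffle of the index array).
        int[] order = new int[points.Length];
        for (int i = 0; i < order.Length; i++) order[i] = i;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++)
        {
            int swap = c + rng.Next(order.Length - c);
            int tmp = order[c]; order[c] = order[swap]; order[swap] = tmp;
            centroids[c] = (double[])points[order[c]].Clone();
        }

        int[] assignment = new int[points.Length];
        for (int iter = 0; iter < iterations; iter++)
        {
            // Assignment step: attach each point to its nearest centroid.
            for (int p = 0; p < points.Length; p++)
            {
                double best = double.MaxValue;
                for (int c = 0; c < k; c++)
                {
                    double dist = 0;
                    for (int d = 0; d < dim; d++)
                    {
                        double diff = points[p][d] - centroids[c][d];
                        dist += diff * diff;
                    }
                    if (dist < best) { best = dist; assignment[p] = c; }
                }
            }

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][];
            int[] counts = new int[k];
            for (int c = 0; c < k; c++) sums[c] = new double[dim];
            for (int p = 0; p < points.Length; p++)
            {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < k; c++)
            {
                if (counts[c] == 0) continue;  // leave an empty cluster's centroid where it is
                for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return centroids;
    }
}
```

The returned centroids could be converted with Vector.FromArray and used as the basis array. How many centroids to use, and whether the inputs should be rescaled first, are choices this sketch leaves open.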

    Friday, November 4, 2011 10:33 AM
    Owner
  • Hi,

    I'm trying to do something similar, only passing the Vector[] inputs and bool[] outputs to the BayesPointMachine() method.

    This is what I'm currently passing to it:

    VariableArray<bool> y = Variable.Observed(output).Named("y");
    VariableArray<Vector> w = Variable.Observed(inputs).Named("w");
    // the function below takes in "inputs"
    BayesPointMachine(w, y);
    ...
    ...

    public static void BayesPointMachine(VariableArray<Vector> w, VariableArray<bool> y)
    {
    ....

    However, I'm running into a small issue with accessing the variablearray inputs once inside the BPM method.

    Thursday, February 2, 2012 2:44 PM
  • Please could you provide more information? What do you mean by 'accessing the variablearray inputs', and what is the small issue?

    John

    Thursday, February 2, 2012 2:54 PM
    Owner
  • I'm a beginner at using vectors, so I'm not sure how to access the passed-in w in the method below, since I'm no longer passing double[] arrays but a VariableArray<Vector> w (which contains the inputs from the CSV file).

    This is the BPM method:

    public static void BayesPointMachine(VariableArray<Vector> w, VariableArray<bool> y)
    {
        // Create x vector, augmented by 1
        Range j = y.Range.Named("test");
        Vector[] xdata = new Vector[];   <------not sure what to access here with the passed vector w
        for (int i = 0; i < xdata.Length; i++)
            xdata[i] = Vector.FromArray(w[i], 1);   <------and this is incorrect as well, here
        VariableArray<Vector> x = Variable.Observed(xdata, j).Named("x");
        double noise = 0.1;
        y[j] = Variable.GaussianFromMeanAndVariance(Variable.InnerProduct(w, x[j]).Named("innerProduct"), noise) > 0;
    }

    Thursday, February 2, 2012 3:23 PM
  • Vector is a class in the Infer.NET runtime library for representing vector types and can be used as a general Vector class quite apart from building graphical models. Within a graphical model we would use Vector as a type parameter for a random variable if we wanted to treat the corresponding random variable as a whole for inference purposes. In the BPM we use Vector for the type of the weights random variable w because we want to infer the joint distribution of w (we can also formulate a BPM using independent weights - see the various options provided in the BPM wrapper).

    VariableArray<> is an array type in the Infer.NET modeling API for building graphical models. In your example, you are passing down a random variable array of type Vector. You will typically only index this random variable array with Infer.NET ranges (see the section on array and ranges in the user guide).

    I am not sure why you are trying to construct your observed data x from the random variable weights. Firstly, the weights random variable will typically not be observed (are you observing it?) and so there is no data attached to it. If there were data attached to it, you could access that data via w.ObservedValue which would be of type Vector[] - but your model would not make any sense.

    What are you trying to do here? If you just want to use the Vector and Distribution classes directly and are not trying to build a graphical model, you should not be using the Variable API and the Variable types. If you do want to build a model, what is that model?

    John

    Thursday, February 2, 2012 4:15 PM
    Owner
  • Sorry, I shouldn't have used "w", since it refers to the weights in the BPM example.  All I'm trying to do is pass "inputs" into the BPM method.

    I'm trying to predict whether someone will shop at a specific store given various demographic and income-related questions.  I'm using EP with BPM for this.

    var testdata = LoadCSV("C:\\Users\\myproj\\Desktop\\training - test.csv", new string[] { "age", "gender", "location","income","race","familysize","parentincome","ses","var2","var3","var4","var6","var7","var8" }, "willGoToStore");
    Vector[] inputs = data.Item1;
    bool[] toutput = data.Item2;
    ...
    ...

    I'm just trying to pass "inputs" into the BPM method.  But I'm not sure how to get the BPM method to handle/recognize the "inputs" as a Vector[].

    public static void BayesPointMachine(inputs Variable<Vector> w, VariableArray<bool> y )
    {
        Range j = y.Range.Named("test");
        Vector[] xdata = new Vector[];   <------not sure what to access here with the passed vector inputs
        for (int i = 0; i < xdata.Length; i++)
            xdata[i] = Vector.FromArray(inputs[i].length, 1);   <------not sure how to handle the inputs value here
        VariableArray<Vector> x = Variable.Observed(xdata, j).Named("x");
        double noise = 0.1;
        y[j] = Variable.GaussianFromMeanAndVariance(Variable.InnerProduct(w, x[j]).Named("innerProduct"), noise) > 0;

    Thursday, February 2, 2012 5:12 PM
  • Given that you are creating inputs already as a Vector array, you can directly set x to observe that:

     VariableArray<Vector> x = Variable.Observed(inputs, j).Named("x");

    Note that your LoadCSV should append a 1 onto each vector (for the bias term) - i.e. make each Vector of length numInputs + 1 and make the last value 1.

    John

    Monday, February 6, 2012 9:38 AM
    Owner
  • Hi John,

    Ok.  I've modified my method to the following:

                

        public static void BayesPointMachine(Vector[] inputs, Variable<Vector> w, VariableArray<bool> y)
            {
                Range j = y.Range.Named("testcase");
                VariableArray<Vector> x = Variable.Observed(inputs, j).Named("x");


                double noise = 0.1;
                y[j] = Variable.GaussianFromMeanAndVariance(Variable.InnerProduct(w,x[j]).Named("innerProduct"), noise) > 0;
            }

    I don't understand what you mean when you say "LoadCSV should append a 1".

    Tuesday, February 7, 2012 3:25 PM
  • In a BPM you need a bias input which has a constant value of 1 for every data point. In the tutorial example this is explicitly created in the following bit of code (the '1'):

    for (int i = 0; i < xdata.Length; i++) 
        xdata[i] = Vector.FromArray(incomes[i], ages[i], 1);
    

    I was just observing that your Vector inputs need a similar bias term. I was assuming that you were using the LoadCSV at the beginning of this thread, in which case that would be the most convenient place to do this. Alternatively you could have a column of 1's in your data set.
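
To illustrate the bias term on a single row of data, the sketch below appends a constant 1 to an array of feature values. It is independent of Infer.NET, and the names BiasTerm and AppendBias are hypothetical.

```csharp
using System;

public static class BiasTerm
{
    // Returns a copy of `features` with a constant 1.0 appended as the last element.
    public static double[] AppendBias(double[] features)
    {
        double[] augmented = new double[features.Length + 1];
        Array.Copy(features, augmented, features.Length);
        augmented[features.Length] = 1.0;
        return augmented;
    }
}
```

Inside a loader, the same effect comes from allocating each vector with one extra element and setting that last element to 1 after the data columns have been filled in.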

    John

    Tuesday, February 7, 2012 6:05 PM
    Owner
  • Yes, sorry.  I'm indeed using LoadCSV from the above thread and I have  inputFields.Lengths + 1 set.  

    static Tuple<Vector[], bool[]> LoadCSV(string fileName, string[] inputFields, string outputField)
            {
                char[] separators = new char[] { ',', '\t' };
                List<Vector> inputList = new List<Vector>();
                List<bool> outputList = new List<bool>();
                //List<double> biasList = new List<double>();
                int numInputs = inputFields.Length + 1;    
                using (StreamReader sr = new StreamReader(fileName))
                {
                    string str = sr.ReadLine();  // Header
                    int[] inputIndices = new int[inputFields.Length];
                    List<string> columnNames = new List<string>(str.Split(separators));
                    for (int i = 0; i < numInputs; i++)
                    {
                        int indx = columnNames.FindIndex(s => s == inputFields[i]);
                        if (indx < 0)
                            throw new ApplicationException("Cannot find column name " + inputFields[i]);
                        inputIndices[i] = indx;
                    }
                    int outputIndex = columnNames.FindIndex(s => s == outputField);
                    if (outputIndex < 0)
                        throw new ApplicationException("Cannot find column name " + outputField);
                     
                    while ((str = sr.ReadLine()) != null)
                    {
                        string[] arr = str.Split(separators);
                        Vector v = Vector.Zero(numInputs);
                        for (int i = 0; i < numInputs; i++)
                            v[i] = double.Parse(arr[inputIndices[i]]);
                        inputList.Add(v);
                        bool output = bool.Parse(arr[outputIndex]);
                        outputList.Add(output);
                       /* double outputbias = int.Parse(arr[biasIndex]);
                         biasList.Add(outputbias);*/
                    }
                }
                return new Tuple<Vector[], bool[]>(inputList.ToArray(), outputList.ToArray());
            }
    



    Tuesday, February 7, 2012 9:29 PM
  • Shouldn't this be:

    static Tuple<Vector[], bool[]> LoadCSV(string fileName, string[] inputFields, string outputField)
    {
        char[] separators = new char[] { ',', '\t' };
        List<Vector> inputList = new List<Vector>();
        List<bool> outputList = new List<bool>();
        int numInputs = inputFields.Length;
        using (StreamReader sr = new StreamReader(fileName))
        {
            string str = sr.ReadLine();  // Header
            int[] inputIndices = new int[numInputs];
            List<string> columnNames = new List<string>(str.Split(separators));
            for (int i = 0; i < numInputs; i++)
            {
                int indx = columnNames.FindIndex(s => s == inputFields[i]);
                if (indx < 0)
                    throw new ApplicationException("Cannot find column name " + inputFields[i]);
                inputIndices[i] = indx;
            }
            int outputIndex = columnNames.FindIndex(s => s == outputField);
            if (outputIndex < 0)
                throw new ApplicationException("Cannot find column name " + outputField);
                     
            while ((str = sr.ReadLine()) != null)
            {
                string[] arr = str.Split(separators);
                Vector v = Vector.Zero(numInputs);
                for (int i = 0; i < numInputs; i++)
                    v[i] = double.Parse(arr[inputIndices[i]]);
                v[numInputs] = 1.0;
                inputList.Add(v);
                bool output = bool.Parse(arr[outputIndex]);
                outputList.Add(output);
                /* double outputbias = int.Parse(arr[biasIndex]);
                biasList.Add(outputbias);*/
            }
        }
        return new Tuple<Vector[], bool[]>(inputList.ToArray(), outputList.ToArray());
    }

    Wednesday, February 8, 2012 9:06 AM
    Owner
  • Ok.  So now in this vector:

    v[numInputs] = 1.0;

    It's creating a last column and setting it all to 1.0.

    When I debug this I still don't see the actual value of 1.0 being stored in the array.

    Wednesday, February 8, 2012 4:02 PM
  • Sorry - that should be Vector v = Vector.Zero(numInputs+1);

    Wednesday, February 8, 2012 4:18 PM
    Owner
  • Right.  That's how I'm incrementing already.  Hmmm..something else must be wrong, because the program just hangs.  It's not even executing BPM method.  
    Wednesday, February 8, 2012 4:28 PM
  • After debugging some more...this seems to be the hangup at the moment.  

    Infer.Compiler.dll!MicrosoftResearch.Infer.Models.Variable.Observed<bool>(bool[] observedValue) Line 955

    Here is the code for that area:

       //training data
                var trainingdata = LoadCSV("C:\\Users\\myfiles\\training.csv", new string[] { "age", "gender", "location","income","race","familysize","parentincome","ses","var2","var3","var4","var6","var7","var8" }, "willGoToStore");
                Vector[] inputs = trainingdata.Item1;
                bool[] output = trainingdata.Item2;
         
               
                //test data
                var testdata = LoadCSV("C:\\Users\\myfiles\\test.csv", new string[] { "age", "gender", "location","income","race","familysize","parentincome","ses","var2","var3","var4","var6","var7","var8" }, "willGoToStore");
                Vector[] testinputs = testdata.Item1;
                bool[] testoutput = testdata.Item2;
          

    VariableArray<bool> y = Variable.Observed(output).Named("y");

    Variable<Vector> w = Variable.Random(new VectorGaussian(Vector.Zero(trainingdata.Item1[0].Count()), PositiveDefiniteMatrix.Identity(trainingdata.Item1[0].Count()))).Named("w");

    BayesPointMachine(inputs, w, y);

    Wednesday, February 8, 2012 4:54 PM
  • Ok...I got this working.  
    Monday, February 13, 2012 6:14 PM