Multi-thread parallel training

  • Question

  • Hello. I'm starting to use Infer.NET and I'm surprised by the lack of material on how to build parallel inference (especially for training a classifier model). I'm trying to build a binary classifier using the BPM algorithm and I'm facing incredibly long training times. I'm already splitting the training set into batches because of high memory consumption, so now I want to train these batches in parallel. How, in detail, should I do this?
    Wednesday, June 24, 2015 12:43 PM

Answers

  • Hi Dmitry,

    Unfortunately, parallel inference isn't very straightforward with Infer.NET these days. Our long-term goal is to have the compiler automatically generate distributed algorithms, but this isn't coming in the very near future. For the time being, though, you still have several options:

    • Split the model into chunks and manually compute the messages passed between the various parts of the model. This will give you exactly the same results as the single, unsplit model. There is an example of this: the Matchbox Recommender Learner. You can look at the code in its train method and mimic that behaviour. With the Bayes Point Machine Learner we also split the model into chunks, but only enough to be able to process the data in batches; that is, we didn't go all the way to full parallelization there. You can certainly do so, but the coding is quite involved, and the closest thing we have to documentation is the answer with the pictures in this forum thread. It demonstrates the approach, although for a different (but similar) problem.
    • Run the model on different chunks of data in parallel, and then combine the posteriors. These marginals can be combined using a separate model that you write. This approach will only give an approximation of the true posterior, and I don't think we have any examples of it.
    • Another approximation of the posterior can be achieved by using online learning. Read some of the data, infer the posteriors, plug them in as the new priors, read some more data, and keep repeating these steps until you have processed the whole dataset. This will only have an effect on models that are not linear in the size of the data (the BPM not being one of them). It also doesn't parallelize :-\
    • You can set engine.Compiler.UseParallelForLoops = true. This simply converts the for loops in the generated algorithm into Parallel.For loops where possible, so it will give the true posterior. However, since the model is not split into chunks, this option only has a significant effect on certain types of models. It is certainly worth trying with the BPM, as I think it will parallelize over the features (and certainly not over the instances). A rough sketch of these last two options follows below.
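
    A sketch of what those last two options could look like, assuming the 2015-era MicrosoftResearch.Infer namespaces and a toy single-Gaussian-weight model standing in for the BPM (ReadBatches is a hypothetical placeholder for your data source), not the actual BPM Learner code:

    using MicrosoftResearch.Infer;
    using MicrosoftResearch.Infer.Models;
    using MicrosoftResearch.Infer.Distributions;

    // Toy model: a single Gaussian weight whose prior is an observed value,
    // so the posterior of one batch can be fed back in as the prior of the next.
    Variable<Gaussian> weightPrior = Variable.New<Gaussian>().Named("weightPrior");
    Variable<double> weight = Variable<double>.Random(weightPrior).Named("weight");
    Variable<int> size = Variable.New<int>().Named("size");
    Range n = new Range(size);
    VariableArray<double> x = Variable.Array<double>(n).Named("x");
    x[n] = Variable.GaussianFromMeanAndPrecision(weight, 1.0).ForEach(n);

    InferenceEngine engine = new InferenceEngine();
    engine.Compiler.UseParallelForLoops = true;    // option 4: Parallel.For loops in the generated algorithm

    Gaussian posterior = Gaussian.FromMeanAndVariance(0, 1);
    foreach (double[] batch in ReadBatches())      // option 3: stream the data in batches (ReadBatches is a placeholder)
    {
        weightPrior.ObservedValue = posterior;     // plug the last posterior back in as the prior
        size.ObservedValue = batch.Length;
        x.ObservedValue = batch;
        posterior = engine.Infer<Gaussian>(weight);
    }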

    -Y-

    Thursday, June 25, 2015 9:47 PM

All replies

  • Thank you very much, Yordan, for your answer. I'll investigate the first two options you mentioned. By the way, can shared variables help me in some way? I mean, for example, declaring the model parameters as a shared variable array to combine messages from different submodels/data chunks.

    As for the other two options you mentioned, I'm already using online learning because the data is generated continuously, and now I'm trying to speed up training on a single chunk (say, the data generated during one day). I've tried switching UseParallelForLoops on, with no luck: training became even slower. I think this is because it creates a huge number of relatively small tasks (it really does parallelize by the features, as you say, so that's approximately number of features * number of instances = millions of parallel tasks), and the advantages of parallelization are eaten up by the task/thread maintenance overhead.

    Monday, June 29, 2015 8:12 AM
  • You can certainly use shared variables to process the data in chunks, but I don't see a straightforward way to parallelize through shared variables :-\
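
    For reference, a rough sketch of the chunked (but still sequential) shared-variable setup, assuming the 2015-era MicrosoftResearch.Infer API, a toy Gaussian-mean model in place of the BPM, and a hypothetical LoadChunk data source:

    int numChunks = 10;
    SharedVariable<double> weight =
        SharedVariable<double>.Random(Gaussian.FromMeanAndVariance(0, 100));
    Model model = new Model(numChunks);            // one model definition, numChunks batches
    Variable<double> weightCopy = weight.GetCopyFor(model);
    Variable<int> size = Variable.New<int>();
    Range n = new Range(size);
    VariableArray<double> x = Variable.Array<double>(n);
    x[n] = Variable.GaussianFromMeanAndPrecision(weightCopy, 1.0).ForEach(n);

    InferenceEngine engine = new InferenceEngine();
    for (int pass = 0; pass < 5; pass++)           // a few passes over all the chunks
    {
        for (int b = 0; b < numChunks; b++)
        {
            double[] chunk = LoadChunk(b);         // placeholder for your data source
            size.ObservedValue = chunk.Length;
            x.ObservedValue = chunk;
            model.InferShared(engine, b);          // updates the message this batch sends to the shared weight
        }
    }
    Gaussian posterior = weight.Marginal<Gaussian>();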

    EDIT: The claim that you can't parallelize through shared variables is wrong. Read below for details.

    -Y-


    Monday, June 29, 2015 9:18 AM
  • Thank you again. By the way, on the shared variables page of the user guide, parallelization is mentioned ("You want to parallelise your inference code" appears in the list of reasons why you might want to split a model) with no details or example, and frankly speaking I spent some time investigating in this direction. I think it would be better either to remove this item from the list or to explain that shared variables won't help to parallelize inference.
    Monday, June 29, 2015 12:19 PM
  • Good point. But let me first discuss this with the author of that page. I might be missing something here.

    -Y-

    Monday, June 29, 2015 2:29 PM
  • I have investigated the implementation of SharedVariable<DomainType, DistributionType> and it looks like it is not thread-safe (even if we set up all the models and their shared variable copies before going parallel).

    It uses the instance property CurrentMarginal to store the marginal after each inference if DivideMessages is true. Otherwise it computes the output for each batch/model using the outputs from the other batches/models. In both cases the result will be incorrect, and it will depend on the order in which inference in the different models/batches happens to run.

    So I think you are right, Yordan: shared variables can't help to parallelize.

    If I'm wrong somewhere, please correct me, because my investigation was not as deep as your knowledge of the Infer.NET code.

    Friday, July 3, 2015 1:45 PM
  • Okay, so all of the credit for this post goes to Tom Minka, who edited the User Guide to contain the following at the bottom of this page. This should go online the next time we re-publish the website.

    ... The same approach allows running different parts of the model in parallel.  For example, suppose we want to run meanModel and precModel on parallel threads.  We create a separate inference engine for each thread and call Parallel.Invoke:

    InferenceEngine engine1b = new InferenceEngine()
        { NumberOfIterations = 10 };
    for (int pass = 0; pass < 10; pass++)
    {
        System.Threading.Tasks.Parallel.Invoke(
            () => meanModel.InferShared(engine1, 0),
            () => precModel.InferShared(engine1b, 0));
        dataModel.InferShared(engine2, 0);
    }

    Now, to clarify: you can split your data array into multiple data arrays and achieve true parallelism through shared variables (not just the mean and the variance as in the example above). All messages, however, will be stored in memory at all times, so you can't use this approach to distribute the model across multiple machines.
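
    As a rough illustration of splitting the data (rather than the model parameters) across threads, here is a sketch along the lines of the earlier sequential one, again assuming the 2015-era MicrosoftResearch.Infer API, a toy Gaussian-mean model in place of the BPM, and a hypothetical LoadChunk data source. The structural changes are that each chunk gets its own Model and its own InferenceEngine (as in Tom's example above), and the batches are run with Parallel.For. Note that the later posts in this thread discuss the thread-safety caveats of exactly this pattern.

    int numChunks = 4;
    SharedVariable<double> weight =
        SharedVariable<double>.Random(Gaussian.FromMeanAndVariance(0, 100));
    Model[] models = new Model[numChunks];
    InferenceEngine[] engines = new InferenceEngine[numChunks];
    for (int c = 0; c < numChunks; c++)
    {
        models[c] = new Model(1);                  // each chunk is its own single-batch model
        engines[c] = new InferenceEngine();        // one engine per thread
        Variable<double> weightCopy = weight.GetCopyFor(models[c]);
        double[] chunk = LoadChunk(c);             // placeholder for your data source
        Range n = new Range(chunk.Length);
        VariableArray<double> x = Variable.Array<double>(n);
        x[n] = Variable.GaussianFromMeanAndPrecision(weightCopy, 1.0).ForEach(n);
        x.ObservedValue = chunk;
    }

    for (int pass = 0; pass < 10; pass++)          // outer passes let the chunks exchange messages
    {
        System.Threading.Tasks.Parallel.For(0, numChunks,
            c => models[c].InferShared(engines[c], 0));
    }
    Gaussian posterior = weight.Marginal<Gaussian>();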

    I'm very sorry for leading you astray with my previous post, which claimed that you can't parallelize through shared variables :-\

    -Y-

    Friday, July 3, 2015 4:30 PM
  • Yordan, thank you for the good clarification. And thanks to Tom Minka for this piece of information.

    But, as you can see, Tom suggests parallelizing by splitting the model itself (not the data; as I understand it, this is the situation referred to in the first item of your first answer). There is more than one shared variable, and in fact there is a separate shared variable for each thread being invoked, so there is no multi-threaded access to a shared variable here.

    I'm more interested in parallelizing by splitting the data (not the model; the second item of your first answer), and I tried to investigate the shared variables implementation from this point of view. I wrote up the results of that investigation in my previous post. The main point is that one of the base shared variable classes (SharedVariable<DomainType, DistributionType>) isn't thread-safe, so as far as I can see, shared variables can't help us combine messages from two or more identical submodels running inference in parallel on different chunks of data. Could you tell me whether I'm right on this particular point?

    Monday, July 6, 2015 4:10 PM
  • If you define thread-safe code as giving a result consistent with some sequential processing of the data, then you are correct that splitting across data will not be thread-safe. However, this is not a property of Infer.NET, but rather of the training algorithm itself. Virtually all training algorithms for linear models cannot be parallelized in a way that is consistent with a sequential ordering. However, the algorithm can still reach a fixed point, so in practice people do parallelize across data despite the inconsistency.
    Monday, July 6, 2015 6:15 PM
    Owner
  • First of all, by thread-safe code I meant code that gives a consistent result regardless of the order in which its threads are launched and their execution pace (i.e. how the OS schedules them). Maybe somewhere in the back of my mind I expected this code to give results consistent with the sequential one; I now realize that for this task that is impossible. But this hardly matters, because I think there are problems with consistency between different executions of this parallel code.

    I assume that we are using a (single) shared variable to combine the outputs from several submodels and calculate a marginal, and that we run inference for these submodels in parallel threads within one batch (with a different instance of InferenceEngine per thread). The problem that I will describe is located in the SharedVariable<DomainType, DistributionType> class, as I mentioned before. For now I'll assume DivideMessages = true.

    When a thread executes the SetInput method of our shared variable, it calculates the prior for the current model based on the CurrentMarginal field (which is initially set to a copy of the prior passed to the shared variable constructor). The thread then executes InferOutput, which calculates the output message and also overwrites(!) the CurrentMarginal field, which is common to all threads. From this moment on, we have lost any messages from inferences that completed after this thread was started, because the results of those completed threads are stored in the Outputs field (where the output from one model has no effect on the other models when DivideMessages = true) and in the CurrentMarginal field, which has now been overwritten. After all threads have completed, the resulting marginal will simply be taken from this CurrentMarginal field.

    So if we start all our threads in a loop (i.e. almost simultaneously), we will get the marginal from only a single thread, the one that finishes last (i.e. only from one chunk of data). If we use Parallel.For/ForEach, then maybe we will get a marginal built from several chunks of data (if some threads manage to finish before others start), but hardly from all of the data (after all, we expect Parallel.For to be at least a little bit parallel, not completely sequential).

    In the case of DivideMessages = false, it looks like not many messages are lost, because the resulting marginal is calculated from all the items in the Outputs dictionary. But there is some inconsistency, because in SetInput the priors are also calculated from all the Outputs (except the current one), and thus the priors will depend on which threads have already finished and which have not. Maybe this is not so critical; after all, we already accepted approximate results when we decided to use parallelization. But the fact that the result isn't fully deterministic is uncomfortable.

    And no matter what value DivideMessages has, there is a race condition on the algorithm field. While it is null, there is not much calculation when setting the input: MessageToBatch just returns a copy of CurrentMarginal/Prior (depending on the value of DivideMessages). But when InferOutput is called, the algorithm field is initialized at the beginning of it, and from then on every MessageToBatch call (made from SetInput) starts to perform the additional calculations briefly described above. I can't say exactly whether this leads to something bad; anyway, it's a bit uncomfortable. It seems it would be better to initialize the field at the very end of InferOutput.

    Please let me know if I'm wrong somewhere. Or if I'm wrong everywhere :)
    Tuesday, July 7, 2015 4:30 PM
  • The type of thread-safety you are describing is known as "synchronous parallelism". Synchrony is not required for fixed-point numerical algorithms to converge; in fact it usually just slows them down. You can find plenty of discussion of this at http://www.mit.edu/~jnt/parallel.html. If you really want synchrony, then you can achieve it by adding locks to the shared variable code.
    Wednesday, July 8, 2015 3:41 PM
    Owner