05 July 2011 21:24
Our process discovers what the next set of tasks is dynamically as a job is running. Each block consists of 1-300 tasks, and the last task to finish queues the next block. If the block is large enough, its tasks are added to the same job using multiple threads and scheduler connections.
The bad behavior, which we see at random but at least once per week, is a job going 'crazy' and adding the same task over and over. Sometimes it is one task, and at other times a few tasks keep getting added. We have seen it reach into the thousands of the same repeated task before either the job is killed or we cancel it. The repeated tasks have the same name, which previously threw an exception on the client (programmatic) side. Recently I added extra logging to our code just prior to the .AddTask() call to verify that it wasn't something in our code getting into a loop.
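To illustrate the kind of check described above, here is a minimal Python sketch of a client-side guard that logs every task name just before submission and refuses a repeated name within the same job. It is not the HPC scheduler API: `SubmitGuard`, `submit_block`, and the `add_task` callable are hypothetical names standing in for whatever wraps the real .AddTask() call.

```python
import logging
import threading

log = logging.getLogger("task_submit")


class SubmitGuard:
    """Tracks task names per job and flags repeats before submission."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = {}  # job_id -> set of task names already submitted

    def try_register(self, job_id, task_name):
        """Return True the first time a name is seen for a job, else False."""
        with self._lock:
            names = self._seen.setdefault(job_id, set())
            if task_name in names:
                log.warning("duplicate task %r for job %r", task_name, job_id)
                return False
            names.add(task_name)
        log.info("submitting task %r to job %r", task_name, job_id)
        return True


def submit_block(guard, job_id, task_names, add_task):
    """Call add_task(name) once per new name; return the names submitted."""
    submitted = []
    for name in task_names:
        if guard.try_register(job_id, name):
            add_task(name)  # hypothetical stand-in for the real scheduler call
            submitted.append(name)
    return submitted
```

If the guard never reports a duplicate while the job still shows repeated tasks, that would suggest the repeats originate after the client call rather than in the submitting code.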
I have screenshots of the job. No error events were showing in the Event Viewer.
We experienced this on HPC 2008 SP1, and after upgrading to HPC 2008 R2 SP1 in the last month we still see the problem. We have not been able to come up with a simple reproduction case, but it does happen fairly frequently.
Anyone else having this problem?
21 July 2011 14:27
I want to make sure I understand the scenario. My understanding is that the last task in the block of 1-300 tasks starts the next block of tasks?
What is your method of determining when the last task is running or completed? Task dependencies? Task completed states? or something else?
26 July 2011 20:49
Question 1. My understanding is that the last task in the block of 1-300 tasks starts the next block of tasks?
Yes. Block A has many tasks in it, and each one checks whether it is the last one during a callback to our web service at the end of its processing. During that callback, the web service adds the Block B set of tasks to the job on behalf of that last task. This keeps the job from automatically moving to the finished state, since the last task in Block A has not yet exited.
Question 2. What is your method of determining when the last task is running or completed? Task dependencies? Task completed states? or something else?
Whether an individual task is the last one in a block is guarded by a table in our database. The sprocs guarantee that only one task in that set will be considered the 'last' task. Basically it is a guarded reference counter.
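As a rough illustration of the guarded reference counter idea, here is an in-memory Python sketch. It swaps the database table and sprocs for a lock-protected counter, so it only shows the invariant (exactly one completion is reported as 'last'), not our actual schema. `BlockCounter` and `run_block` are hypothetical names.

```python
import threading


class BlockCounter:
    """Counts outstanding tasks in a block; exactly one caller sees zero."""

    def __init__(self, task_count):
        if task_count < 1:
            raise ValueError("a block needs at least one task")
        self._remaining = task_count
        self._lock = threading.Lock()

    def task_done(self):
        """Decrement under the lock; return True only for the final task."""
        with self._lock:
            if self._remaining <= 0:
                raise RuntimeError("more completions than tasks in the block")
            self._remaining -= 1
            return self._remaining == 0


def run_block(task_count):
    """Finish task_count tasks on threads; return how many saw 'last'."""
    counter = BlockCounter(task_count)
    results = []
    results_lock = threading.Lock()

    def worker():
        is_last = counter.task_done()
        with results_lock:
            results.append(is_last)

    threads = [threading.Thread(target=worker) for _ in range(task_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Because the decrement and the zero check happen under the same lock, two tasks finishing at the same moment cannot both be told they are the last one. In a database the same effect would typically come from doing the decrement and the check inside one transaction.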
Note 1: We don't use task dependencies. During initial prototyping a couple of years ago I found a bug in HPC. We needed a graph, and the support at the time was only for a tree (this has since been fixed; see another one of my posts).
Note 2: We previously used task state and data encoded in the task name to determine the last step, but there were too many race conditions, which made the process error prone.