Automatic Task Retries

  • Question

  • How do you achieve this using the C# Scheduler API?

    I have tried a variety of approaches to get HPC to automatically retry tasks for me in the event of task failure.

    Setting the Task.IsRerunnable property to true seems to have no effect at all. It just seems to be an indicator flag telling HPC whether it lets users manually requeue the task.

    The only way I can see of doing this (which is NOT automatic) is to call the job.RequeueTask method yourself.

    The technical overview of HPC mentions automatic retry of jobs and tasks, but I can find no documentation on how this is actually achieved.

    Any ideas?

    Thanks

    Richard
    Tuesday, December 22, 2009 5:07 PM

Answers

  • Hi Richard,

    I discussed this with some other people on our product team and it looks like the high-level documentation you reference is correct but perhaps a little misleading here. That is, yes, the Job Scheduler does retry jobs or tasks, but only in very narrow circumstances. For example, if a task is running on a node and that node becomes unreachable, then the Job Scheduler will effectively cancel and requeue the task. The Job Scheduler may then try to re-run the task on that node up to 3 times (by default) before failing the task altogether. The number of retries is configurable via cluscfg.

    So, when the documentation refers to "automatic retrying of failed jobs or tasks", it's likely referring to those jobs or tasks running on nodes that experience a failure. There is no retrying of tasks that return a non-zero exit code, which I think is what you were hoping when reading the doc.

    Also, as with much technical documentation, the behavior is sometimes clearer when stated from the opposite direction. That is, the Job Scheduler guarantees that Finished jobs or tasks will *not* be automatically retried. Consider the situation where a job fails with one task that Finished successfully and one task that Failed. If the user or admin then requeues that failed job, the Failed task is the only task that actually gets requeued; the Finished task will not be requeued. So the failed task "automatically gets requeued" when the job itself is requeued.

    To answer your other question, we don't currently have any plans to offer a feature where the scheduler automatically retries jobs or tasks right after they fail.

    Regards,

    Patrick
    Wednesday, December 23, 2009 11:24 PM

All replies

  • Hi Richard,

    You're correct. The Scheduler does not automatically re-queue failed tasks. The IsRerunnable task property indicates whether a failed task *may* be re-run, but it is up to the admin or user to get the failed job or task requeued. In addition, there are cluster-wide parameters [JobRetryCount] and [TaskRetryCount] (configurable via cluscfg) that limit the number of times that a job or task, respectively, may be re-run.
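    For reference, those cluster-wide parameters can be inspected and changed from a command prompt on the head node with the cluscfg tool mentioned above (the values shown here are only examples):

    ```
    cluscfg listparams
    cluscfg setparams TaskRetryCount=3 JobRetryCount=3
    ```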

    A failed task can be re-queued by either (i) requeuing the job or (ii) requeuing the task. Once the task is scheduled to run, the IsRerunnable flag combined with the task retry count determines whether the task can run.
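    As a rough illustration of option (ii) using the C# Scheduler API (a sketch, not tested against a cluster; the head-node name and job id are placeholders):

    ```csharp
    using Microsoft.Hpc.Scheduler;
    using Microsoft.Hpc.Scheduler.Properties;

    class RequeueFailedTasks
    {
        static void Main()
        {
            IScheduler scheduler = new Scheduler();
            scheduler.Connect("myheadnode");           // placeholder head-node name

            ISchedulerJob job = scheduler.OpenJob(42); // placeholder job id

            // Build a filter that selects only the Failed tasks of the job.
            IFilterCollection filter = scheduler.CreateFilterCollection();
            filter.Add(FilterOperator.Equal, TaskPropertyIds.State, TaskState.Failed);

            foreach (ISchedulerTask task in job.GetTaskList(filter, null, true))
            {
                // IsRerunnable and TaskRetryCount still gate whether this succeeds.
                job.RequeueTask(task.TaskId);
            }
        }
    }
    ```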

    Also, if you point me to the documentation that indicates that there is an automatic retry of jobs and tasks, I can either get it corrected or correct my answer. :-)

    Regards,

    Patrick

    Tuesday, December 22, 2009 7:33 PM
  • Patrick,

    Thanks.

    http://resourcekit.windowshpc.net/AT%20A%20GLANCE/Papers1/Windows_HPC_Server_2008_Job_Scheduler.pdf

    which looks like the official documentation for the job scheduler:

    The Windows HPC Server 2008 Job Scheduler includes powerful features for job and task management, including automatic retrying of failed jobs or tasks, identification of unresponsive nodes, and automated cleanup of completed jobs. Each job runs in the security context of the user, so that jobs and their tasks have the access rights and permissions only of the initiating user. All features of the Job Scheduler are also available from the command line.

    Page 6 of the document.

    Hope this helps! :-)

    Cheers

    Richard


    Wednesday, December 23, 2009 11:38 AM
  • Also, is this a feature planned for the next version?

    Thanks

    Richard
    Wednesday, December 23, 2009 11:39 AM
  • I am working on a similar problem...

    I understand that if you requeue a job, it will only re-run the failed tasks. Is that true?

    I can't seem to find a method on an interface that will requeue the job, only one that will requeue tasks.

    There was a method in the CCS API, but I don't see one in HPC.

    Tuesday, April 13, 2010 4:34 AM
  • You should be able to requeue a job by doing the following:

    job.Configure();
    job.Submit();
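    In context, that would look something like this (a sketch, not tested; depending on the SDK version, the submit call may need to go through the scheduler object rather than the job, and the head-node name and job id are placeholders):

    ```csharp
    using Microsoft.Hpc.Scheduler;

    IScheduler scheduler = new Scheduler();
    scheduler.Connect("myheadnode");           // placeholder

    ISchedulerJob job = scheduler.OpenJob(42); // placeholder job id

    // Configure() moves the completed (Failed) job back to the Configuring
    // state; submitting it again re-runs only the tasks that did not Finish.
    job.Configure();
    scheduler.SubmitJob(job, null, null);
    ```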
    
    Wednesday, April 21, 2010 8:46 PM
  • Need C++ Help!!!

    I've been trying to write C++ code that automatically handles failed tasks while the job is still running.

    Which methods should I use, and in what order?

    I tried to use RequeueTask, and I also tried to create a new task, add it to the job, and submit it to the job.

    Both cases failed. When I looked at my job's status in the HPC Job Manager (graphical interface), I saw the failed task's status changed to "configure".

    Can you also give me some C++ code examples showing how to implement the above?

    Thanks,

    Shay

    Monday, December 20, 2010 2:08 PM
  • Depending on how you define your tasks, I can propose one solution. Say you have an image which consists of 1000 small images and a 50-core cluster, and one task is the processing of one small image. Organize the memory where you keep your tasks as records of the form taskId, taskData (x 1000 records). When you submit your job via the head node (job submit) to the service application on a compute node, you expect to get back processed task data with the corresponding taskId.

    In case a task fails (service application problem, connection problem, algorithm problem - for example, division by zero), you can add an error message in place of the returned data and add the task back to the pool of unprocessed tasks. Of course, you need to build a kind of controller for this purpose. If the task fails due to a service application crash or a compute node problem, you need to implement a timeout mechanism: if the timeout occurs and the task data has not been returned, you add the task back to your task pool.
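    A minimal sketch of the kind of client-side controller described above (all names hypothetical; Process stands in for the round-trip to the service application, where a real controller would enforce a timeout):

    ```csharp
    using System;
    using System.Collections.Generic;

    class RetryController
    {
        const int MaxAttempts = 3;

        static void Main()
        {
            // Task pool: here just task ids; a real pool would carry taskData.
            var pool = new Queue<int>(new[] { 1, 2, 3 });
            var attempts = new Dictionary<int, int>();
            var results = new Dictionary<int, string>();

            while (pool.Count > 0)
            {
                int taskId = pool.Dequeue();
                attempts.TryGetValue(taskId, out int tries);
                attempts[taskId] = tries + 1;

                try
                {
                    // Stand-in for sending taskData to the service and
                    // waiting (with a timeout) for the processed result.
                    results[taskId] = Process(taskId);
                }
                catch (Exception ex) when (attempts[taskId] < MaxAttempts)
                {
                    // Record the error and put the task back in the pool.
                    results[taskId] = "error: " + ex.Message;
                    pool.Enqueue(taskId);
                }
            }
        }

        static string Process(int taskId)
        {
            // Hypothetical processing step.
            return "ok";
        }
    }
    ```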


    Daniel Drypczewski
    Monday, December 5, 2011 10:52 AM
  • (In reply to Shay's question above:)

    RequeueTask should work, as long as the job is still running.

    If all of your other tasks have completed except for your failed tasks, then the job goes into a completed (Failed) state, and you have to re-submit the job. I've not had to add new tasks to a job.
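    Putting the two cases together (a sketch, not tested; assumes the C# Scheduler API, with placeholder names, and that the failed task's id has already been obtained):

    ```csharp
    using Microsoft.Hpc.Scheduler;
    using Microsoft.Hpc.Scheduler.Properties;

    IScheduler scheduler = new Scheduler();
    scheduler.Connect("myheadnode");           // placeholder
    ISchedulerJob job = scheduler.OpenJob(42); // placeholder job id
    job.Refresh();

    if (job.State == JobState.Running)
    {
        // Job still running: the failed task can be requeued in place.
        // job.RequeueTask(failedTaskId);
    }
    else
    {
        // Job already completed (Failed): move it back to Configuring and
        // submit it again; only the non-Finished tasks re-run.
        job.Configure();
        scheduler.SubmitJob(job, null, null);
    }
    ```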

    Monday, January 23, 2012 2:56 PM
  • Has anyone tried working directly with the underlying SQL database (HPCScheduler) and using a database trigger on failed tasks that fulfill certain criteria?

    That way it should be possible to do resubmission without having a separate service or program that checks for failed tasks and resubmits them?

    Wednesday, August 6, 2014 8:55 AM