Jobs exit after the last added task completes even if canceled tasks are requeued or other tasks are submitted

  • Question

  • Say I submit a job with 32 tasks. All tasks complete except the last one, which hangs for some reason. I then cancel that task, monitoring its status within my code. When I see that it has failed, I requeue it, expecting the job to continue until the requeued task has finished. Instead, the job finishes with a Failed status and exits from the Console. The same thing happens if one of my original tasks is canceled or fails and I submit a new task to the Scheduler: the job finishes once all the originally added tasks have returned (whether they finished, failed, or were canceled), leaving the new or requeued task in limbo within my application. Shouldn't the job stay alive, without returning its status, until all requeued or submitted tasks have completed? I'm using the HPC Server Job Scheduler API.

     


    Jay Ferguson
    Tuesday, November 16, 2010 5:48 PM

Answers

  • Hi Jay,

    From what I understand, the problem occurs when you cancel the last task in the job and add a 'replacement' task only after requesting the cancellation. Unfortunately for your scenario, the scheduler is expected to mark the job as failed at this point, because it has no way to detect tasks that have not been added yet. Once all the tasks already known to it are in a final state, it finishes or fails the job.

    I think possible solutions here are:

    1. To cancel a hanging task, use a custom tool or script that first adds the new task and only then cancels the hanging task, so the new task is already known to the scheduler before the cancellation occurs.

    2. When creating the job, mark it as 'Run Until Canceled' (you will then need to cancel the whole job manually once all tasks have finished as expected).
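
    The two options above could be sketched roughly as follows. This is an untested outline, not a definitive implementation: the class and method names (RetryHelpers, ReplaceHangingTask, KeepAliveUntilCanceled, EndJob) are hypothetical, and it assumes the job is already running and that the replacement's TaskId is usable right after AddTask, as in the snippet from the question.

```csharp
using Microsoft.Hpc.Scheduler;
using Microsoft.Hpc.Scheduler.Properties;

// Hypothetical helper class; not part of the HPC Job Scheduler API.
static class RetryHelpers
{
    // Option 1: make the replacement known to the scheduler BEFORE canceling
    // the hanging task, so the job never has all of its tasks in a final state.
    public static void ReplaceHangingTask(ISchedulerJob job, ITaskId hangingTaskId, string commandLine)
    {
        ISchedulerTask retryTask = job.CreateTask();
        retryTask.CommandLine = commandLine;

        // Same calls as in the question's callback, just issued earlier.
        job.AddTask(retryTask);
        job.SubmitTaskById(retryTask.TaskId);

        // Only now cancel the hanging task.
        job.CancelTask(hangingTaskId);
    }

    // Option 2: keep the job alive until it is canceled explicitly.
    // Intended to be called before the job is submitted.
    public static void KeepAliveUntilCanceled(ISchedulerJob job)
    {
        job.RunUntilCanceled = true;
    }

    // Option 2, continued: end the job once the application decides
    // that all tasks, including any retries, are done.
    public static void EndJob(IScheduler scheduler, ISchedulerJob job)
    {
        scheduler.CancelJob(job.Id, "All tasks, including retries, have completed.");
    }
}
```

    With option 1, the task callback will presumably still fire for the canceled task, so it should not submit a second replacement for it; keeping a record of the task IDs that have already been replaced is one way to handle that.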

    Regards,
    Łukasz

    Monday, November 29, 2010 3:11 PM

All replies

  • Hi Jay,

    I have a few questions to better understand your issue:

    1. Which version of Windows HPC Server are you using?

    2. What is the sequence of operations you perform to 'restore' the hanging task, and which Job Scheduler API methods are you calling?

    Thank you,
    Łukasz

    Wednesday, November 17, 2010 3:38 PM
  • Hi Łukasz, the server version is Windows Server 2008 Standard.

    If I submit a job with a number of tasks, all of them finish OK except one. This one hangs because of an application problem.

    In my task callback routine I look for canceled tasks. If I find one, I do the following:

    // Create a replacement for the canceled task.
    retrytask = job.CreateTask();
    int jobpart = RetryQueue[args.TaskId.JobTaskId] + NumberOfJobParts;
    retrytask.CommandLine = "perl " + RunScriptName + " " + scriptArguments + " " + jobpart + ";";

    // Add the replacement to the job and submit it.
    job.AddTask(retrytask);
    job.SubmitTaskById(retrytask.TaskId);

    This works great if I cancel a task while other tasks are still executing within the job. The job continues until all tasks are completed, including the newly submitted task.

    My problem is the last task standing. If I cancel the hanging task, my task callback routine is entered and I submit a new task. While this is going on, the job exits from the Console and finishes (or fails), leaving my newly submitted task in limbo.

    I had hoped the job would continue until my newly submitted task completed; it seems, however, that the job finishes before I am able to submit the new task. It all boils down to being able to resubmit the last running task in a job if it is canceled.

    I hope this is clear.


    Jay Ferguson
    Wednesday, November 24, 2010 7:57 PM
  • Have you resolved your issue?
    Wednesday, January 12, 2011 2:41 AM
    Moderator