Exit code for canceled tasks

  • Question

  • We are running Python scripts in HPC, and we have them set up to catch the Ctrl-Break via an exception.  In our response to the Ctrl-Break, we have some cleanup that we want to do, and then we want to indicate that we are done with our cleanup but that we still want to be requeued.  Returning a non-zero code indicates a failure and will not invoke a requeue.  If we do want a requeue, and don't want HPC to think our task is finished, should we just return 0, or is there another return code that we should use?

    We are having issues where our node prep task is canceled, and we are returning a 0 exit code.  It looks like HPC thinks that the node prep is complete, so it launches tasks on the node even though the node prep task was not run to completion, resulting in failed jobs.  How do we address canceled node prep tasks?


    Wednesday, August 8, 2012 3:58 PM

All replies

  • Unfortunately, I have still not been able to figure this out properly.  When we catch the Ctrl-Break and finish what we need to do, exiting with 1 makes HPC mark the task as failed without requeuing it, and exiting with 0 makes HPC think the task has finished, so it is not requeued either.  So it seems that HPC has to be the one to actually terminate the process in order for the task to be requeued.  Our non-ideal workaround is that, after we finish our cleanup actions, we go into an intentional infinite loop.  This causes HPC to cancel the task after the grace period and requeue it.  The downside is that any time a task is canceled, it now lasts the entire cancellation grace period (2 minutes).  This isn't a big deal, but it wastes grid resources for that time.
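
    For anyone who wants to try the same workaround, here is a minimal sketch of the idea, not our production code.  It assumes the Ctrl-Break arrives as SIGBREAK (Windows only; the sketch falls back to SIGINT elsewhere so it can be tried on other platforms).  The names CancelRequested, run_task and wait_to_be_killed, and the max_wait parameter, are made up for illustration; max_wait=None gives the real "loop until HPC kills us" behavior.

```python
import signal
import time


class CancelRequested(Exception):
    """Raised by the signal handler when the task is asked to stop."""


def _raise_cancel(signum, frame):
    # Turn the signal into an exception so the task code can catch it.
    raise CancelRequested(signum)


def install_cancel_handler():
    # Assumption: cancellation is delivered as Ctrl-Break (SIGBREAK).
    # SIGBREAK only exists on Windows, so fall back to SIGINT elsewhere.
    sig = getattr(signal, "SIGBREAK", signal.SIGINT)
    signal.signal(sig, _raise_cancel)
    return sig


def wait_to_be_killed(poll_seconds=5.0, max_wait=None):
    # Idle (sleeping, not spinning) until the scheduler terminates us.
    # max_wait is a hypothetical escape hatch for testing only.
    waited = 0.0
    while max_wait is None or waited < max_wait:
        time.sleep(poll_seconds)
        waited += poll_seconds
    return waited


def run_task(work, cleanup, poll_seconds=5.0, max_wait=None):
    install_cancel_handler()
    try:
        work()
    except CancelRequested:
        cleanup()
        # Do not exit here: exiting with 0 or 1 prevents the requeue.
        wait_to_be_killed(poll_seconds, max_wait)
        return "canceled"
    return "finished"
```

    In a real task you would call run_task(do_work, do_cleanup) with your own two functions and leave max_wait at None.  Sleeping in the loop at least keeps the CPU idle during the grace period, although the task slot is still occupied for that time.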

    I would appreciate any help on this issue and am surprised that I can't find anything about this.  The documentation talks about this cancellation grace period but doesn't mention anything about these complications. 

    • Edited by egcow Friday, August 10, 2012 11:39 AM
    Friday, August 10, 2012 11:38 AM
  • You might try having your task create a "child" task (a task that is dependent on it) and then exit cleanly.  This new task can check to see the exit state of its parent and decide how to proceed.

    As you have discovered, the Scheduler does not have an auto-requeue feature for tasks.  They only get requeued as a consequence of other events (job requeue, grace period exceeded, preemption, etc.).


    Monday, August 20, 2012 10:44 PM