Job cancellation and process cleanup

  • Question

  • I am using Ansoft HFSS v12 on an HPC Server 2008 cluster and have found that if I cancel a running job, the running process is killed and is thus unable to clean up after itself.  HFSS has an option to do a 'clean stop', which allows the application to complete running tasks, clean up lock files and other files related to the job, and flush any pending data to the result set UNC path before closing.  This is sometimes necessary if, during the run, a solve on a particular frequency or set of frequencies fails to converge on a solution, meaning the rest of the run is wasted because the model needs modification.  Without this flushed results data, however, it becomes very difficult to track down the source of the problem.

    Unfortunately, when HPC Server kills a job, it kills the process(es) involved in a way that prevents this sort of cleanup. I've been told this is just the way it is, but I'm wondering if anyone has seen this issue and done any work to find a way around it.

    Thanks in advance!

    Jamie

    Friday, October 30, 2009 10:08 PM

Answers

  • When a task is started, the scheduler creates a Windows Job Object (http://msdn.microsoft.com/en-us/library/ms684161(VS.85).aspx) and then starts your command line within that job object.  When a task is canceled, we simply call TerminateJobObject() (http://msdn.microsoft.com/en-us/library/ms686709(VS.85).aspx) on that job object.  According to MSDN "It is not possible for any of the processes associated with the job to postpone or handle the termination. It is as if TerminateProcess were called for each process associated with the job."

    In v3, we will introduce the new capability I mentioned in my earlier reply, where the processes in your job object will be signaled with CTRL+BREAK and then be given a configurable amount of time to exit before TerminateJobObject() gets called.

    On the bright side, Beta 1 of v3 is now available at http://connect.microsoft.com so you can give that a try and let us know if it addresses your problem.

    Thanks!
    Josh
    Monday, November 16, 2009 9:22 PM
    Moderator
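
The CTRL+BREAK behavior described for v3 only helps if the application itself reacts to the signal. As a rough sketch of what that could look like on the application side (not HFSS's or the scheduler's actual implementation), the Python below records the stop request in a handler and checks it between units of work, so the caller can flush results and remove lock files before exiting. The names `request_clean_stop`, `install_stop_handler`, and `run_tasks` are hypothetical, and `SIGTERM` is used as a stand-in on platforms where `SIGBREAK` does not exist.

```python
import signal
import threading

# CTRL+BREAK is delivered to Python as SIGBREAK on Windows only.
# Assumption: SIGTERM is used as a stand-in on other platforms.
STOP_SIGNAL = getattr(signal, "SIGBREAK", signal.SIGTERM)

# Set by the handler, polled by the work loop.
stop_requested = threading.Event()


def request_clean_stop(signum, frame):
    """Signal handler: only record the request; cleanup happens in the work loop."""
    stop_requested.set()


def install_stop_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(STOP_SIGNAL, request_clean_stop)


def run_tasks(tasks, results):
    """Run callables in order, checking for a stop request between tasks.

    Returns True if every task ran, False if a stop request cut the run short.
    Whatever finished before the stop stays in `results`, so the caller can
    flush it and clean up within the grace period.
    """
    for task in tasks:
        if stop_requested.is_set():
            return False
        results.append(task())
    return True
```

A caller would run `install_stop_handler()` once at startup, then write out `results` and delete its lock files whenever `run_tasks` returns `False`. The cleanup has to finish within whatever grace period the scheduler allows, since TerminateJobObject() still follows.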

All replies

  • Jamie,
    Unfortunately, that is the way things are for now.  Your best bet for working around this would probably be to wrap execution of your app in a script that could somehow handle the magic.

    The good news is that in v3, we plan to allow applications to catch a CTRL_BREAK signal when they are terminated, giving them a (configurable) amount of time to clean up before exiting.

    Thanks,
    Josh
    Friday, October 30, 2009 11:24 PM
    Moderator
  • Thanks for the info, Josh.  I have a follow-up question.  I'm trying to figure out how to trap the signal HPC Server sends to kill the processes associated with the cancelled job.  When I tell HPC Server to cancel a job, what does it do to kill the processes?

    Thanks!

    Jamie
    Monday, November 2, 2009 5:31 PM
  • This issue is really important to us as well, so the more info the better!

    Cheers,

    Brian
    Thursday, November 12, 2009 7:54 AM
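
Josh's first reply suggests wrapping the application in a script. Because TerminateJobObject() kills the wrapper along with everything else in the job, a wrapper cannot intercept a cancel. One possible workaround is to request the stop some other way and never cancel the job. The sketch below assumes a hypothetical convention that is not part of HPC Server or HFSS: the wrapper polls for a "stop file", and when the file appears it sends the child CTRL+BREAK (SIGTERM on non-Windows platforms), waits a grace period, and only then kills it. Whether a given application performs its clean stop on CTRL+BREAK is also an assumption that would need checking.

```python
import os
import signal
import subprocess
import time

IS_WINDOWS = os.name == "nt"


def run_with_stop_file(command, stop_file, grace_seconds=60.0, poll_seconds=1.0):
    """Run `command`; if `stop_file` appears, ask the child to stop cleanly.

    Returns the child's exit code.
    """
    # On Windows, CTRL_BREAK_EVENT can only be sent to a child that was
    # started in its own process group.
    flags = subprocess.CREATE_NEW_PROCESS_GROUP if IS_WINDOWS else 0
    child = subprocess.Popen(command, creationflags=flags)

    while child.poll() is None:
        if os.path.exists(stop_file):
            # Polite request first: CTRL+BREAK on Windows, SIGTERM elsewhere.
            child.send_signal(
                signal.CTRL_BREAK_EVENT if IS_WINDOWS else signal.SIGTERM
            )
            try:
                child.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                # Grace period expired; fall back to a hard kill.
                child.kill()
                child.wait()
            break
        time.sleep(poll_seconds)

    return child.returncode
```

The job's command line would then be the wrapper, not the solver itself. To request a clean stop, create the stop file on a share the compute node can see and leave the job running.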