HPC 2008 job suspending/resuming
- I am used to running Abaqus under SGE on a Linux cluster. There, before a higher-priority job can start, any existing jobs have to be suspended with a specific command so that Abaqus frees its FLEXlm tokens.
Is it possible to do this with HPC, or do jobs have to run to completion? Some of our jobs take 2 days to complete, and to use our Abaqus tokens efficiently we need to be able to suspend and resume them as smaller jobs enter the queue.
Thanks
Paul
- Moved by Alex Sutton (MSFT, Owner), Monday, November 09, 2009 6:46 PM (From: Windows HPC Server Deployment, Management, and Administration)
Answers
I'm not sure whether or not Abaqus supports this type of behavior on Windows. I do know that the HPC Server 2008 scheduler is sadly not able to take advantage of this capability.
Thanks,
Josh
-Josh
- Marked As Answer by Josh Barnard (MSFT, Owner), Thursday, November 19, 2009 1:22 AM
All Replies
- To be clear, you are not looking for the OS to suspend the job, but rather to run a command that tells the running application to shut down? This isn't really supported in v2, but you can read more about our plans in this thread: http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/52a7c38d-e53f-4028-a3c5-1ca42c3b6052/
Thanks,
Josh
-Josh
- Proposed As Answer by Josh Barnard (MSFT, Owner), Monday, November 16, 2009 9:09 PM
- No, I really am looking to suspend a job and have it release its current licenses. This may be because a higher-priority job is submitted, or because someone is debugging an input deck and submits it to a queue that only allows a 2 minute run before terminating the job if it hasn't already completed.
Abaqus supports this on Linux: abaqus suspend/resume releases all tokens and stops running on the CPU almost immediately. SGE issues suspend/resume commands that are hookable, so they can run a defined script. I haven't tried it across multiple machines, but on a single machine it works fine. We have 3 queues: background, normal and debug. Debug suspends anything in normal or background and has a max wallclock time of 2 minutes; normal suspends anything in background; background runs only when nothing else is.
Currently we have a series of jobs that will take approximately 2 months of solid runtime on 8 cores (48 hrs per job). Moving them to a higher number of cores would speed this up, but without being able to suspend the jobs we could never do this, as other people would have to wait 2 days to get any results for other projects.
Thanks
Paul
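For readers unfamiliar with the SGE setup Paul describes, the three-queue rule (debug over normal over background) can be written down as a small policy function. This is only a sketch of the policy as stated in the post, not SGE or HPC Server code; the `Job` class, `PRIORITY` table and `apply_policy` name are made up for illustration, and the 2 minute debug wallclock limit is left out.

```python
from dataclasses import dataclass

# Queue names from the post, highest priority first (lower number wins).
PRIORITY = {"debug": 0, "normal": 1, "background": 2}


@dataclass
class Job:
    name: str
    queue: str               # one of the keys in PRIORITY
    state: str = "running"   # "running" or "suspended"


def apply_policy(jobs):
    """Run only the jobs in the highest-priority non-empty queue.

    Everything in a lower queue is marked suspended; in the real setup this
    is the point where a hook script would call the application's own
    suspend/resume command so that licenses are released.
    """
    if not jobs:
        return jobs
    top = min(PRIORITY[j.queue] for j in jobs)
    for j in jobs:
        j.state = "running" if PRIORITY[j.queue] == top else "suspended"
    return jobs


if __name__ == "__main__":
    jobs = [Job("two_day_run", "background"), Job("project_b", "normal")]
    apply_policy(jobs)
    for j in jobs:
        print(j.name, j.queue, j.state)
```

Calling `apply_policy` again after a job is added or removed re-evaluates the whole set, so a background job resumes automatically once the higher queues are empty.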
- Does Abaqus support a checkpoint/restart of some sort? If so, it might be possible to checkpoint the Abaqus computation, then cancel the job and requeue or restart it later from the checkpoint.
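A rough sketch of what that cancel-and-restart sequence might look like, purely as a dry run that builds the command lines without executing anything. It assumes the HPC command-line `job cancel` and `job submit` commands and the Abaqus `oldjob=` restart option, and it assumes the original analysis was writing restart data and that the new input deck reads it. The job ID and job names are hypothetical, and the exact options should be checked against your own installation before use.

```python
def checkpoint_restart_plan(hpc_job_id, old_job, new_job):
    """Return the command lines for a cancel-then-restart cycle (dry run only).

    hpc_job_id: scheduler ID of the running job (hypothetical value below).
    old_job:    Abaqus job name whose restart files already exist on disk.
    new_job:    Abaqus job name of the restart deck (assumed to read the
                restart data written by old_job).
    """
    return [
        # 1. Stop the long-running job so its cores and license tokens are freed.
        ["job", "cancel", str(hpc_job_id)],
        # 2. Later, submit a new job that continues from the saved restart data.
        ["job", "submit", "abaqus", f"job={new_job}", f"oldjob={old_job}", "interactive"],
    ]


if __name__ == "__main__":
    for cmd in checkpoint_restart_plan(42, "bigrun", "bigrun_restart"):
        print(" ".join(cmd))
```

Note that any progress made after the last restart write is lost on cancel, so how often restart data is written decides how much work each preemption costs.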

