HPC 2008 job suspending/resuming

Question
-
I am used to running Abaqus under SGE on a Linux cluster. There, when a higher-priority job needs to be added, any existing jobs have to be suspended with a specific command so that Abaqus frees its FLEXlm tokens.
Is it possible to do this with HPC, or do jobs have to run to completion? Our problem is that some jobs take 2 days to finish, and to use our Abaqus tokens efficiently we need to be able to suspend and resume them as smaller jobs enter the queue.
Thanks
Paul
- Moved by Alex Sutton, Monday, November 9, 2009 6:46 PM (From: Windows HPC Server Deployment, Management, and Administration)
Saturday, November 7, 2009 9:23 AM
Answers
-
I'm not sure whether or not Abaqus supports this type of behavior on Windows. I do know that the HPC Server 2008 scheduler is sadly not able to take advantage of this capability.
Thanks,
Josh
-Josh
- Marked as answer by Josh Barnard (Moderator), Thursday, November 19, 2009 1:22 AM
Thursday, November 19, 2009 1:22 AM (Moderator)
All replies
-
To be clear, you are not looking for the OS to suspend the job, but rather to run a command that tells the running application to shut down? This isn't really supported in v2, but you can read more about our plans in this thread: http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/52a7c38d-e53f-4028-a3c5-1ca42c3b6052/
Thanks,
Josh
-Josh
- Proposed as answer by Josh Barnard (Moderator), Monday, November 16, 2009 9:09 PM
Monday, November 16, 2009 9:09 PM (Moderator)
-
No, I really am looking to suspend a job and have it release its current licenses. This may be because a higher-priority job is submitted, or because, say, someone is debugging an input deck and submits it to a queue that only allows a 2-minute run before terminating the job if it hasn't already completed.
This is supported by Abaqus on Linux (abaqus suspend/resume, which releases all tokens and stops running on the CPU almost immediately), and SGE provides suspend/resume commands that are hookable, so they can run a defined script. I haven't tried it across multiple machines, but on a single machine it works fine. We have 3 queues: background, normal and debug. Debug suspends anything in normal or background and has a max wallclock time of 2 minutes; normal suspends only what is in background; and background runs when nothing else is.
Currently we have a series of jobs that will take approximately 2 months of solid runtime on 8 cores (48 hrs per job). Shifting them to a higher number of cores would speed this up, but without being able to suspend the jobs we could never do so, as other people would end up waiting 2 days to get any results for other projects.
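As a rough illustration of the queue policy just described, here is a minimal Python sketch. The queue names and the 2-minute debug limit come from the description above; the function names and the idea of building the `abaqus suspend job=...` / `abaqus resume job=...` command line inside a hook script are assumptions for illustration, not the actual scripts in use.

```python
# Hypothetical model of the three-queue policy; names are illustrative only.
PRIORITY = {"background": 0, "normal": 1, "debug": 2}

# Only the debug queue has a wallclock limit (2 minutes).
MAX_WALLCLOCK_SECONDS = {"debug": 120}


def queues_to_suspend(incoming_queue):
    """Return the queues whose running jobs must be suspended when a job
    starts in `incoming_queue`, highest priority first."""
    level = PRIORITY[incoming_queue]
    lower = [q for q, p in PRIORITY.items() if p < level]
    return sorted(lower, key=PRIORITY.get, reverse=True)


def abaqus_command(action, job_name):
    """Build the argument list a suspend/resume hook would run.
    Assumes the hook is told the Abaqus job name by the scheduler."""
    if action not in ("suspend", "resume"):
        raise ValueError("action must be 'suspend' or 'resume'")
    return ["abaqus", action, "job=" + job_name]


if __name__ == "__main__":
    print(queues_to_suspend("debug"))
    print(abaqus_command("suspend", "beam_model"))
```

A hook script would pass the returned list to something like `subprocess.run`; that step is left out here so the sketch runs without Abaqus installed.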
Thanks
Paul
Wednesday, November 18, 2009 10:29 PM
-
Does Abaqus support a checkpoint restart of some sort? If so, it might be possible to checkpoint the Abaqus computation, then cancel the job and requeue or restart it later from the checkpoint position.
Thursday, November 19, 2009 7:28 PM
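The cancel-then-restart idea could be sketched roughly as follows, in Python. This is an assumption-heavy outline rather than a tested procedure: it assumes the original analysis was writing Abaqus restart data and that a restart input file exists, that the restart run is started with `abaqus job=<new> oldjob=<old>`, and that the HPC `job cancel` and `job submit /numcores:` commands are available on the head node. The helper names and the example job ID are hypothetical.

```python
# Sketch only: builds the command lines, does not execute them.

def cancel_command(hpc_job_id):
    """Command to cancel the running scheduler job (assumed HPC CLI)."""
    return ["job", "cancel", str(hpc_job_id)]


def restart_command(new_job, old_job, cores):
    """Command to submit a restart run that continues from old_job's
    restart files. Assumes new_job's input file requests a restart read."""
    return [
        "job", "submit", "/numcores:%d" % cores,
        "abaqus", "job=" + new_job, "oldjob=" + old_job,
        "cpus=%d" % cores,
    ]


def preempt_and_requeue(hpc_job_id, old_job, cores):
    """Return the two commands in the order they would be run:
    cancel now (freeing the tokens), resubmit as a restart later."""
    new_job = old_job + "_restart"  # hypothetical naming convention
    return [cancel_command(hpc_job_id), restart_command(new_job, old_job, cores)]


if __name__ == "__main__":
    for cmd in preempt_and_requeue(42, "beam_model", 8):
        print(" ".join(cmd))
```

Unlike a true suspend, any work done since the last restart write would be lost, so how often restart data is written decides how much is recomputed.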