Thursday, March 24, 2011 12:26 PM
Hi all,
Lately we have been seeing a strange phenomenon:
Sometimes, when we submit a job (from our client application via the .NET API) and all of the job's tasks finish successfully, the job stays stuck in the Running state and keeps its resources.
If we try to cancel that job, its state switches to Canceling, but the cancellation never completes and the job still keeps its resources.
The only way we can remove such jobs from the scheduler is to restart the head node.
Do you know what might cause this?
(We are using Windows HPC R2 SP1 and .NET 3.5 SP1; all our jobs contain parametric sweep tasks.)
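To pin the symptom down, the condition can be written as a simple check: a job counts as "stuck" if it is still marked Running although every one of its tasks has finished. The following is only a minimal Python sketch over plain dictionaries. The field names `id`, `state` and `task_states` and the sample records are hypothetical, they are not part of the HPC scheduler API, and the job data would have to be exported from the scheduler by some other means.

```python
def find_stuck_jobs(jobs):
    """Return the IDs of jobs still marked Running although all tasks finished.

    `jobs` is a list of dicts with the hypothetical keys
    "id", "state" and "task_states" (not the real HPC API).
    """
    stuck = []
    for job in jobs:
        tasks = job["task_states"]
        # A job with no tasks is not treated as stuck.
        if job["state"] == "Running" and tasks and all(t == "Finished" for t in tasks):
            stuck.append(job["id"])
    return stuck


# Made-up sample records, for illustration only.
jobs = [
    {"id": 101, "state": "Running", "task_states": ["Finished", "Finished"]},
    {"id": 102, "state": "Running", "task_states": ["Finished", "Running"]},
    {"id": 103, "state": "Finished", "task_states": ["Finished"]},
]

print(find_stuck_jobs(jobs))  # prints [101]
```

Only job 101 matches, because job 102 still has a running task and job 103 has already left the Running state.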
Friday, March 25, 2011 4:15 PM
Are you seeing any errors in the scheduler's event log?
Event Viewer -> Applications and Services Logs -> Microsoft -> HPC -> Scheduler -> Operational
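If that log turns out to be long, it may help to save it from Event Viewer as a CSV file and count how often each event ID occurs. Below is a rough Python sketch of that idea. It assumes the export has a header row with an "Event ID" column (adjust `id_column` if your export is laid out differently), and the sample rows, source name and IDs in it are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_event_ids(csv_file, id_column="Event ID"):
    """Count occurrences of each event ID in an exported event-log CSV.

    `csv_file` is any open text file or file-like object. The column
    name is an assumption about the export format.
    """
    reader = csv.DictReader(csv_file)
    # Skip rows where the ID column is missing or empty.
    return Counter(row[id_column] for row in reader if row.get(id_column))


# Made-up sample data standing in for an exported log file.
sample = io.StringIO(
    "Level,Date and Time,Source,Event ID,Task Category\n"
    "Error,sample-time-1,sample-source,1001,None\n"
    "Error,sample-time-2,sample-source,1002,None\n"
    "Error,sample-time-3,sample-source,1001,None\n"
)

counts = count_event_ids(sample)
for event_id, n in counts.most_common():
    print(event_id, n)
```

With a real export, you would pass `open("exported_log.csv", newline="")` instead of the in-memory sample (the file name here is just a placeholder).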
Wednesday, June 08, 2011 2:51 PM
I had a similar situation this morning when a user submitted a job that never really started running. Only one task seemed to start, on one core of one node. I was unable to cancel the job itself, but I was able to cancel the individual tasks inside it. Even then, I could not cancel the job, and the one core still appeared to be in use on that node, even though nothing was running.
I did look in the event viewer as suggested above, and I found many errors with event IDs 8, 24, and 25. These appear to be SQL exceptions being thrown. Restarting the SQL services did not resolve the issue, and I'm in the process of rebooting the head node.
Has anyone else seen this type of behavior, or is this a clue to something else?