I had a similar situation this morning when a user submitted a job that never really started running. Only one task seemed to start, on one core on one node. I was unable to cancel the job itself, but I was able to cancel the individual tasks inside
the job. Even then, I could not cancel the job, and the one core still appeared to be in use on that node, even though nothing was running.
I looked in the Event Viewer as suggested above and found many errors for events 8, 24, and 25. These appear to be SQL exceptions. Restarting the SQL services did not resolve the issue, and I'm now in the process of rebooting
the head node.
Has anyone else seen this type of behavior, or is this a clue to something else?
Thanks,
Adam