Job is in a "Failed" state but all tasks have a "finished" state

Question
-
We run on HPC 2008 and have a job that indicates that it has failed, and the error message names the failed tasks. However, all tasks for the job have a state of "Finished" (i.e. none of them failed). The tasks that the job manager says failed show no sign of failure: they are all Finished, they have no error messages, and they all processed what they were supposed to successfully.
There's no indication of problems in any event logs and there's no one particular node that they were running on. (All other tasks on the nodes were successful.)
The job is embarrassingly parallel and was running on 3 nodes with 28 cores. There are 271 tasks.
In one instance the job was running for 22 hours. The 5 tasks that failed all started within a 20 minute period. None of the other tasks started during that time. All cores were fully utilised throughout the job so they weren't the only 5 things running at the time.
However, we've also had the problem with the same job (with a lot less data) running with the same number of nodes, cores and tasks and running for only 5 minutes. In that case one of the tasks was reported as failed.
We can't reproduce this behaviour at will.
We'd appreciate any thoughts on what's happened and how we can resolve what appears to be incorrect reporting of the state.
- Edited by wendy d Thursday, July 9, 2009 6:54 AM
- Moved by parmita mehta (Moderator) Thursday, July 9, 2009 6:31 PM (From: Windows HPC Server Deployment, Management, and Administration)
Thursday, July 9, 2009 6:00 AM
Answers
-
This sounds very mysterious. I'd recommend calling support if you see it again.
-Josh
- Proposed as answer by Josh Barnard (Moderator) Wednesday, July 22, 2009 10:05 PM
- Marked as answer by Josh Barnard (Moderator) Thursday, July 30, 2009 6:39 PM
Wednesday, July 22, 2009 10:05 PM (Moderator)
All replies
-
Wendy,
What error code is given for the failed tasks? When a task fails, do you see any errors in the HPCManagement.log on the compute node that coincide with the failure?
Log location:
C:\Program Files\Microsoft HPC Pack\Data\LogFiles
Thanks,
Ben
Tuesday, July 14, 2009 4:05 PM
-
Hi Ben,
Thanks for the response. The problem is that all the tasks are successful and they have no errors - only the job has errors. There are also no errors in the log files at the time of the problems.
Wendy
Thursday, July 16, 2009 1:37 AM
-
Wendy,
Is the problem intermittent? Or is it reproducible with the same job?
Thanks,
Ben
Friday, July 17, 2009 6:41 PM
-
We can't reproduce the problem at all, which is what makes it hard to track down.
Monday, July 20, 2009 7:32 AM
-
It is very mysterious so we'll do as you suggest.
Regards,
Wendy
Friday, July 31, 2009 12:53 AM