Problems after installing Microsoft HPC 2008 R2 SP4 + KB2802106

Question
Hello everyone,
we have been running a Microsoft HPC 2008 SP3 cluster very successfully for one year.
Our cluster currently has 1 head node and 73 workstation nodes.
For a few months we have had some trouble with our network. More precisely, the connection between some execution nodes and our data server is sometimes, or even often, lost for several reasons:
- Domain Controller not found for authentication
- Server not found, due to some protocol issues
- ....
This network problem led to the following HPC behaviour: all running and queued jobs failed, because the head node tried to execute the node preparation task on the "corrupt" workstation node; this task failed, and so the complete job was marked as failed.
We were very happy when we found out that exactly this problem was fixed by KB2802106, "HPC Pack 2008 R2 SP4 Fix for Node Preparation Tasks Failing Jobs".
So we decided to install this fix, installing SP4 first because it is a prerequisite.
Install procedure:
- Took all nodes offline
- Installed SP4 on the head node
- Rebooted the head node
- Installed KB2802106 on the head node
- Rebooted the head node
- Installed SP4 on all workstation nodes
- Rebooted the workstation nodes and brought them back online
Now the real problem:
After installing the fix, most of the nodes are idle: they are not used to process tasks from a job, even when they are not listed in the job's "ExcludeNode" list.
Currently fewer than 10% of our 73 nodes (between 4 and 7) are really productive, and the rest stay idle.
We currently find no information in the HPC Cluster Manager explaining this behaviour; everything seems to be OK.
In addition, we have now detected several smaller issues on the cluster that seem to be new:
- The head node needs much more CPU power, at times between 60 and 90%.
- The Run Command in the Cluster Manager needs minutes to run a command on an execution node.
- When the Run Command is executed on multiple (all) execution nodes, it never starts.
- In the task list on the head node, sqlservr.exe consumes a lot of CPU time and uses 1,614,916 KB of RAM.
Any help on how to get the idle nodes back to work would be great. Tips for solving the other issues are also welcome.
Please also share your experience if you have installed SP4 with KB2802106.
Thank you very much in advance for your help,
best regards,
Bobby
- Edited by Bobby013 Friday, March 22, 2013 8:18 PM
Friday, March 22, 2013 8:18 PM
Answers
Hello everyone,
The issue described above is now solved.
After a very long investigation with several Microsoft support engineers and HPC specialists, Microsoft was able to reproduce the issue on their side.
The root cause of this behaviour was the task dependencies of this big job. They forced the HPC head node to do a lot of calculation to determine the next task to start in the queue, which led to long task dispatching times.
To fix the issue, we had to change how we use task dependencies, and Microsoft has also provided a new fix.
The name of the fix is "HPC Pack 2012 R2 Fix for Task Failure Related to Multiple Task Dependencies".
Thank you all very much for the help in getting this issue solved.
best regards,
Bobby
- Marked as answer by Bobby013 Thursday, August 28, 2014 2:16 PM
Thursday, August 28, 2014 2:13 PM
All replies
Hi Bobby, everyone,
Could you please explain what you mean by "Change the usage of the Task Dependencies"?
I have jobs that contain up to ~26 sets of ~400+1 tasks plus 1 final task, submitted to a cluster of 8 nodes x 64 cores (512 cores).
Each set is ~400 standalone tasks, with a post-completion task dependent on the preceding ~400.
Then there is a final task dependent on the ~26 post-completion tasks.
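To put rough numbers on that job layout, here is a small arithmetic sketch. It uses the approximate figures from the post (26 sets of 400 tasks); the variable names are made up for illustration and have nothing to do with the HPC Pack API.

```python
# Approximate figures taken from the post; names are illustrative only.
sets = 26
tasks_per_set = 400

post_tasks = sets  # one post-completion task per set
total_tasks = sets * tasks_per_set + post_tasks + 1  # +1 for the final task

# Each post-completion task depends on the tasks of its set,
# and the final task depends on every post-completion task.
dependency_links = sets * tasks_per_set + post_tasks

print(total_tasks)        # 10427
print(dependency_links)   # 10426
```

So a job of this shape carries roughly ten thousand individual dependency links that the scheduler has to track.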
I'm finding that the cluster only runs below capacity (around 450 concurrent tasks) and seems very slow at starting new tasks, occasionally even getting down to 1 running task (511 cores free) before it starts more.
We moved the SQL databases to the head node, and that seemed to improve things as well.
The configuration time for the job is also 5-10 minutes.
We have 2008 R2 and not 2012, so I can't use job dependencies unless I code them myself, but I may have to try that.
thanks
Steve
Friday, August 29, 2014 4:12 AM -
Hi Steve,
my typical problematic job contained several thousand tasks and one final task.
This final task was set up as a dependent task with a dependency on all other tasks.
So if the job contained 10,000 normal tasks, the final task had 10,000 dependencies.
This was exactly the issue that overloaded the head node: it was busy just finding out which task to start next.
The solution given by Microsoft was the following:
- Do not set the "DependOn" property to an array of 10,000 task names.
- Create a group that contains the 10,000 tasks.
- Make the final task depend on this whole group.
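To illustrate why the two setups can differ so much in cost, here is a toy model in Python. It is only a sketch under my own assumptions: it is not how the HPC Pack scheduler is actually implemented, and the function names are hypothetical. It assumes the scheduler re-checks every listed dependency of the final task after each task completion, versus keeping a single "remaining" counter for a group.

```python
# Toy model only: NOT the real HPC Pack scheduler, just an illustration
# of per-name dependencies versus a single group dependency.

def flat_lookups(n_tasks):
    """Final task lists every task by name; all names are re-checked
    after each completion (assumed behaviour for this sketch)."""
    depends_on = [f"task{i}" for i in range(n_tasks)]
    finished = set()
    lookups = 0
    ready = False
    for name in depends_on:  # tasks finish one by one
        finished.add(name)
        results = [dep in finished for dep in depends_on]
        lookups += len(results)
        ready = all(results)
    return lookups, ready


def group_updates(n_tasks):
    """Final task depends on one group; the group keeps a counter of
    unfinished tasks that is decremented once per completion."""
    remaining = n_tasks
    updates = 0
    for _ in range(n_tasks):
        remaining -= 1
        updates += 1
    return updates, remaining == 0


print(flat_lookups(100))   # (10000, True)
print(group_updates(100))  # (100, True)
```

In this model the per-name variant grows quadratically with the number of tasks, while the group variant grows linearly. For 10,000 tasks that would be 100,000,000 lookups against 10,000 counter updates, which at least matches the kind of head node overload described above.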
Unfortunately, a fix was needed to use it this way. The fix was available only for HPC 2012, and at that time we were using 2008 R2.
So I was not able to test this feature. But we will do so now, because ~2 months ago we also moved our cluster to HPC 2012.
The problem is that there is no standard API call to do this.
To create the group and the whole setup, you have to create a *.xml file yourself with all the tasks inside,
and use the method "RestoreFromXml" to set up the job with the group dependency.
That is all I know about it; I hope it helps you understand how to solve your own issue.
Let's see if we get it working... :-)
best regards,
Bobby
Tuesday, September 2, 2014 7:24 AM