Compute node failover in HPC

  • Question

  • Hi,

    I have an HPC 2008 SP1 cluster running on Windows 2008 Service Pack 2.  

    I have recently seen a problem where a compute node stops functioning and causes its running tasks to fail. On one occasion, subsequent jobs tried to use the faulty node and also failed. I have tried to recreate the issue and found that rebooting a compute node or stopping the HPC services causes the running job to fail, whereas if I disconnect the network cable from the compute node, the built-in failover works and the work is moved to another host. I have raised a ticket with Microsoft to ask why the grid doesn't seem to be very resilient.

    I understand from speaking to Microsoft that the head node only checks the status of a compute node using a UDP ping; it appears that no verification is made that the services and their processes are operational.

    Is anyone else having this problem at all?

    Tuesday, January 4, 2011 11:42 AM

All replies

  • Hi Bob,

      We don't have a node failover feature. If a task fails to run on a compute node (for example, because its process is killed on the node), or the scheduler detects that the node is unreachable (missing heartbeats caused by a reboot, a network outage, or a stopped service), the scheduler requeues the task. So in your case of stopping the HPC service (the Node Manager Service on the compute node), the task will be requeued.

      By default, a task can be requeued only three times (you can change this; see TaskRetryCount in the output of "cluscfg listparams"). If the task's rerunnable property is set to false, the task fails immediately instead of being requeued.

      Starting with HPC 2008 R2, you can set excluded nodes for a job while it is running so that the job won't suffer from a bad node (see http://technet.microsoft.com/en-us/library/ff919671(WS.10).aspx). If you're using an SOA job, this is done automatically.


    Qiufang Shi
    Thursday, January 6, 2011 4:03 AM
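
The requeue behaviour described in the reply above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the scheduler's actual algorithm: the `Task` class, `run_with_requeue`, and the `node_is_healthy` callback are hypothetical names invented for illustration. The default of three requeues mirrors the TaskRetryCount default mentioned in the reply, and excluding a node after it fails loosely mimics the excluded-nodes feature mentioned for HPC 2008 R2.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for a scheduler task (not an HPC Pack type)."""
    name: str
    rerunnable: bool = True
    attempts: int = 0                      # number of times the task was dispatched
    excluded_nodes: set = field(default_factory=set)
    state: str = "Queued"


def run_with_requeue(task, nodes, node_is_healthy, retry_limit=3):
    """Dispatch a task, requeueing it after a node failure.

    retry_limit plays the role of TaskRetryCount (assumed default: 3).
    Returns the node the task finished on, or None if the task failed.
    """
    requeues = 0
    while True:
        # Skip nodes that already failed this task.
        candidates = [n for n in nodes if n not in task.excluded_nodes]
        if not candidates:
            task.state = "Failed"
            return None

        node = candidates[0]
        task.attempts += 1
        if node_is_healthy(node):
            task.state = "Finished"
            return node

        # The node failed the task: remember it so it is not reused.
        task.excluded_nodes.add(node)
        if not task.rerunnable or requeues >= retry_limit:
            # Non-rerunnable tasks fail directly; others fail once
            # the requeue budget is exhausted.
            task.state = "Failed"
            return None

        requeues += 1
        task.state = "Queued"
```

For example, with nodes `["N1", "N2", "N3"]` where only `N3` is healthy, a rerunnable task is dispatched three times and finishes on `N3`, while a task with `rerunnable=False` fails after its first attempt on `N1`.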
  • Thanks for the reply. Regarding work being requeued after a server power failure or a disconnected network cable, we are seeing the work successfully reallocated to another compute node. The problem we are seeing is that sometimes a compute node stops functioning but is still up and on the network, and it seems to hold on to its tasks without them being reassigned. On other occasions I have seen a single server have a problem and cause its current run to fail, and subsequent runs have also failed because they tried to use the faulty compute node. If I set the job template to exclusive, then I guess this will stop subsequent jobs from failing due to trying to reuse a faulty host.

    I may have described it as failover, but I am actually referring to fault tolerance. I have been advised by Microsoft that the head node verifies that a compute node is operational by means of a UDP ping. This may not be sufficient in situations where a server has encountered a software issue but has not completely dropped off the network.

    Thanks again for the response.

    Thursday, January 6, 2011 11:19 AM
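
One way to look beyond a ping, as the post above suggests, is to ask the node whether the service itself is running. The following is a rough sketch, not something HPC Pack does on its own: it shells out to the standard Windows `sc query` command and reads the `STATE` line of its output. The function names are hypothetical, and the service name you pass in (for instance `HpcNodeManager` for the Node Manager Service) is an assumption that should be checked against the services actually installed on your compute nodes.

```python
import re
import subprocess


def parse_service_state(sc_output):
    """Return the state name (e.g. 'RUNNING') from `sc query` output, or None."""
    # `sc query` prints a line such as: "STATE              : 4  RUNNING"
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def service_is_running(service_name, server=None):
    """Run `sc query` (optionally against a remote server) and check the state.

    Must be run from a Windows machine with rights to query the target node.
    """
    cmd = ["sc"]
    if server:
        cmd.append(rf"\\{server}")   # sc expects the UNC form \\ServerName
    cmd += ["query", service_name]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_service_state(result.stdout) == "RUNNING"
```

A call such as `service_is_running("HpcNodeManager", server="NODE01")` (both names are placeholders) would return `False` for a node that still answers pings but whose service has stopped. Note that a service can report `RUNNING` while hung, so this narrows the gap described above without closing it.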
  • Hi,

    I have also encountered this problem, where the compute node "stops functioning but is still up and on the network, it seems to hold on to the tasks without them being reassigned." Have you received a solution from Microsoft, or how did you solve this problem?

    Thanks.

    Monday, February 27, 2012 6:25 PM
  • Sorry for not replying sooner. The issue I had does appear to be resolved in Windows HPC Server 2008 R2. I have carried out a few tests to simulate node failures, and all appear to have been successful so far.
    Thursday, September 20, 2012 1:49 PM