Question regarding MS HPC 2008 head node failure

  • Question

  • I am researching MS HPC 2008 as a grid computing solution, and have a question regarding head node operations that I have not yet been able to find an answer to.

    In the event of a head node failure (assuming a single head node configuration), what happens to any jobs that are currently in progress?  Do the jobs fail?  If not, do the jobs continue to process until the head node is brought back online?  Or do jobs that are currently processing go into a "suspended" state and resume once the head node is brought back online?  Is this resumption automatic, or a manual process?

    Basically, I'm trying to determine whether I need to cluster the head node right away, or whether leaving the head node as a single point of failure within the HPC grid is a low-risk decision.

    Thanks,
    Matt

    Friday, October 16, 2009 1:27 PM

Answers

  • The tasks and jobs that are running keep on running. When the head node is back up, it will re-establish communication with the compute nodes.

    Whether or not you need a failover head node depends on your type of jobs (long or short running tasks), type of hardware, tolerance for risk, etc.

    Monday, October 19, 2009 1:55 AM
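
To make the "long or short running tasks" point concrete, here is a rough back-of-envelope sketch in Python. It is not part of HPC Pack or any Microsoft tooling. It assumes, as stated above, that running tasks keep running during the outage, and additionally assumes that no new task can start on a core until the head node is back. The function name and the numbers are hypothetical.

```python
def idle_core_hours(remaining_hours, outage_hours):
    """Estimate core-hours left idle during a head node outage.

    remaining_hours: for each busy core, the hours of work left in the
        task it is running when the head node goes down.
    outage_hours: how long the head node stays down.

    Assumption: a core finishes its current task and then sits idle
    until the head node returns, because nothing new gets scheduled.
    """
    return sum(max(0.0, outage_hours - r) for r in remaining_hours)


# Hypothetical 8-core cluster and a 2-hour head node outage.
short_tasks = [0.25] * 8  # each core has 15 minutes of work left
long_tasks = [6.0] * 8    # each core has 6 hours of work left

print(idle_core_hours(short_tasks, 2.0))  # 14.0
print(idle_core_hours(long_tasks, 2.0))   # 0.0
```

Under these assumptions, a workload of many short tasks loses most of the outage window to idle cores, while a workload of long tasks may not notice a brief outage at all. That is one way to weigh whether a failover head node is worth setting up right away.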

All replies

  • Good day,

    After powering up my 8-node cluster, I restarted the head node while the compute nodes were still running. I then ran some diagnostic tests from the HPC Manager console, and all of them failed with a message like "...mpi fatal error...".

    I shut down the entire cluster, restarted the head node followed by all the compute nodes, and re-ran the diagnostic tests, but nothing works.

    What does this mean? Could it have anything to do with OpenSM, which I installed on the head node and set to start automatically on reboot? Besides the HPC components, my head node also runs DHCP, DNS, DC, and FTP.

    The same thing happened in another scenario, in which I used CD-Adapco's Star-CCM+ for testing. Star-CCM+ failed to run in parallel once I had restarted the head node while the compute nodes were running. I have to restart the entire cluster (head node first, then compute nodes), and only then is Star-CCM+ able to run parallel processing through Windows HPC Manager.


    Thanks

    Thursday, April 14, 2011 4:06 AM