Question regarding MS HPC 2008 head node failure

  • Friday, October 16, 2009 1:27 PM, Matt Steffes
     

    I am researching MS HPC 2008 as a grid computing solution, and have a question regarding head node operations that I have not yet been able to find an answer to.

    In the event of a head node failure (assuming a single head node configuration), what happens to any jobs that are currently in process?  Do the jobs fail?  If not, do the jobs continue to process until the head node is brought back online?  Or do jobs that are currently processing go into a "suspended" state and resume once the head node is brought back online?  Is this resumption automatic, or is it a manual process?

    Basically, I'm trying to determine whether I need to cluster the head node right away, or whether leaving the head node as a single point of failure within the HPC grid is a low-risk decision.

    Thanks,
    Matt

Answers

  • Monday, October 19, 2009 1:55 AM, Alex Sutton (MSFT, Owner)
     Answer

    The tasks and jobs that are running keep on running. When the head node is back up, it will re-establish communication with the compute nodes.

    Whether or not you need a failover head node depends on your type of jobs (long or short running tasks), type of hardware, tolerance for risk, etc.
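
    To make the "long or short running tasks" point concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of HPC Pack; the function name and the sample numbers are hypothetical. It relies on the behavior described above (running tasks keep running) plus one assumption that is not confirmed in this thread: no new tasks are dispatched while the head node is down, so a core whose task finishes during the outage sits idle until the head node returns.

```python
def idle_core_hours(remaining_hours, outage_hours):
    """Estimate core-hours left idle during a head node outage.

    remaining_hours: remaining runtime, in hours, of the task on each
    busy core at the moment the head node goes down.
    outage_hours: how long the head node stays down.
    """
    # Assumption: a core whose task finishes during the outage gets no
    # new work until the head node is back, so it idles for the rest
    # of the outage. Tasks that outlast the outage lose nothing.
    return sum(max(0.0, outage_hours - r) for r in remaining_hours)


# Hypothetical workloads on 4 cores during a 2-hour outage.
short_tasks = [0.25, 0.5, 0.5, 1.0]
long_tasks = [6.0, 8.0, 10.0, 12.0]

print(idle_core_hours(short_tasks, 2.0))  # 5.75
print(idle_core_hours(long_tasks, 2.0))   # 0.0
```

    Under these assumptions, a cluster running many short tasks loses most of its capacity during an outage, while a cluster running long tasks may barely notice one. That difference is one way to weigh whether a failover head node is worth setting up right away.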
