I am researching MS HPC 2008 as a grid computing solution, and have a question regarding head node operations that I have not yet been able to find an answer to.
In the event of a head node failure (assuming a single head node configuration), what happens to any jobs that are currently in progress? Do the jobs fail? If not, do they continue to process until the head node is brought back on-line? Or do jobs that are currently processing go into a "suspended" state and resume once the head node is brought back on-line? Is this resumption automatic, or is it a manual process?
Basically, I'm trying to determine whether I need to cluster the head node right away, or whether leaving the head node as a single point of failure within the HPC grid is a low-risk decision.
Thanks,
Matt