HPC Services - failure recovery

  • Question

  • I periodically get nodes which seem to hang. When I run the HPC diagnostic "all services running", I get failures. When I check the nodes themselves, the three HPC services appear to be running normally. Their "recovery" properties are set to the default: restart the service on the first failure, and take no action on the second or subsequent failures. I'm wondering whether this is less than ideal. I'd tend to want to set all 3 failure modes to "restart the service", but would net.wisdom on HPC consider this a bad thing?

    Since these nodes do nothing else, and my grid won't work without these services running, it seems to me that it'd be worth a try?


    Monday, June 28, 2010 5:02 PM


  • Hi Tim

    This is an opinion piece really. In the past I have set recovery properties as you describe, and that worked out well in the environment in question. There are a couple of provisos though. First, automatic restarts may gloss over recurring service errors that would really be better off being fixed (i.e. why are the services failing?). Second, you should ensure that your applications are not affected by the restarts.
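    For anyone who wants to script the change rather than click through the service properties on every node, here is a minimal sketch. It only builds and (optionally) runs the standard Windows `sc.exe failure` command, which sets the recovery actions for a service. The service names `HpcManagement` and `HpcNodeManager`, the node name `NODE01`, the 60 second restart delay and the one day reset period are all assumptions for illustration; check the real service names on your nodes with `sc query` before using anything like this.

```python
import subprocess


def build_recovery_command(service, node=None, delay_ms=60000, reset_s=86400):
    """Build the sc.exe argument list that sets the first, second and
    subsequent failure actions of `service` to "restart the service".

    delay_ms: wait before each restart, in milliseconds (assumed value).
    reset_s:  error-free period, in seconds, after which the failure
              count is reset to zero (assumed value).
    """
    # Three restart/<delay> pairs: first, second, subsequent failures.
    actions = "/".join(["restart", str(delay_ms)] * 3)
    cmd = ["sc.exe"]
    if node:
        # sc.exe takes an optional \\ServerName to target a remote machine.
        cmd.append("\\\\" + node)
    cmd += ["failure", service, "reset=", str(reset_s), "actions=", actions]
    return cmd


def apply_recovery(services, node=None, dry_run=True):
    """Print the commands (dry run) or execute them on a Windows host."""
    for service in services:
        cmd = build_recovery_command(service, node)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical service and node names; verify them before running.
    apply_recovery(["HpcManagement", "HpcNodeManager"], node="NODE01")
```

    The dry run is the default on purpose, so you can review the commands before touching a node. Keeping a non-zero reset period and a restart delay means a service that keeps failing still leaves entries in the event log at a readable rate, which helps with the first proviso above.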

    You do say that when you check the state of the services on the nodes in question, they are all running even though your diagnostics return failures. This suggests that something else is amiss here, and that changing the service restart settings will not resolve the diagnostic failures. Are you seeing any other issues?



    • Marked as answer by Don Pattee Friday, February 4, 2011 9:32 PM
    Wednesday, June 30, 2010 10:38 AM