How to handle infrastructure failures for workers/brokers/head nodes?

  • Question

  • Hi,

    We ran into an unexpected behavior this weekend.  We had a long-running HPC job going when our company's patching kicked in and caused the head node to reboot.  It appears that when the head node came back up, the job restarted (including the tasks that had previously been in progress).  This is not the behavior we want, as the individual tasks are stateful and cannot simply be started from the beginning.  I think the ideal behavior would be: if the head node goes down, the tasks on the worker nodes stop running, and when the head node comes back up the job does NOT restart.

    Related to this, what happens if a worker node goes down while it is running a task?  Will the task be re-distributed to a different machine, or will it just show up as failed?  I did see the job template property "Task Execution Failure Retry Limit" but wasn't sure whether that was relevant in this case.

    Thanks!

    -Jason

    Tuesday, June 28, 2016 12:42 PM

Answers

  • Hi Jason,

    For a SOA job with a durable session, when the head node (which, per your error message, also has the broker role) restarts, the broker service will try to recover the durable session and process the remaining requests previously persisted in MSMQ.  For failed requests, the broker re-dispatches them to the service hosts for processing.  The scenario is similar when a compute node restarts: if the service hosts on that compute node fail to process their requests, the SOA broker service re-dispatches the failed requests to other available, healthy service hosts.  If you would like to limit the maximum request re-dispatch count, e.g. to 0 so that a request can only be processed once, you can use the messageResendLimit setting in the loadBalancing section of the service registration file.  The default is 3; setting it to 0 means SOA requests will not be re-dispatched after an infrastructure failure.
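    As a rough sketch, the loadBalancing section described above might look like this in the service registration file.  The file path and surrounding elements are assumptions for illustration; verify the section layout against the schema of your HPC Pack version:

    ```xml
    <!-- Hypothetical fragment of a SOA service registration file,
         e.g. <install path>\ServiceRegistration\MyService.config.
         Only the setting discussed above is shown. -->
    <microsoft.Hpc.Broker>
      <!-- messageResendLimit: maximum number of times the broker re-dispatches
           a failed request.  The default is 3; 0 means each request is
           dispatched at most once and never retried after a failure. -->
      <loadBalancing messageResendLimit="0" />
    </microsoft.Hpc.Broker>
    ```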

    Regards,

    Yutong Sun

    Thursday, June 30, 2016 8:05 AM
    Moderator

All replies

  • Hi Jason,

    Could you give me more details on your job?  Is it a batch job, a SOA job, or an MPI job?  Will the tasks finish while the head node is restarting?  And how long does it take for the head node to patch and restart?

    In our system design, when the head node restarts and thus the scheduler restarts, a batch job will resume as long as its tasks are still running on the compute nodes, unless some task finished during that window and the compute node failed, after several retries, to report the result back to the scheduler.

    When that happens, the task will be retried according to the "TaskRetryCount" setting in the system (check the output of "cluscfg listparams").  Then check the task's auto-requeue count: run "task view <jobId>.<taskId> /detailed" and look at the "AutoRequeueCount" property, which is the requeue count applied when a task fails for a system reason (for example, the node becomes unreachable or the task crashes).
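    The two checks above can be run from a command prompt on the head node.  The commands are as given in the thread; exact output varies by HPC Pack version, and the IDs are placeholders:

    ```
    :: Show cluster-wide scheduler parameters, including TaskRetryCount
    cluscfg listparams

    :: Show a task's details, including its AutoRequeueCount property
    :: (replace <jobId> and <taskId> with real IDs from your cluster)
    task view <jobId>.<taskId> /detailed
    ```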

    We usually recommend customers set up the cluster in HA (high availability) mode if reliability is critical, and schedule patching one head node at a time.


    Qiufang Shi

    Wednesday, June 29, 2016 3:30 AM
    It is a SOA job.  No tasks finished while the head node was restarting (they are very long-running tasks, somewhere between 24 and 48 hours).

    How would a restart of the head node be handled in the case of a SOA job?  I may have misread what the job manager was telling me, so I'll double-check and get back to you.  I know our calling app got the following error message:

    ERROR: Microsoft.Hpc.Scheduler.Session.SessionException: Broker node is unavailable due to loss of heartbeat. Make sure you can connect to the broker node and the HpcBroker service is running on the broker node.

    but the tasks continued on the worker nodes.  When I can get on the head node I will check the job manager and report back the exact status of the jobs during this time.

    Wednesday, June 29, 2016 12:47 PM
  • Very good to know this.  I had the same question about assigning laptops as workstation nodes in my scenario: the users keep them on and docked most of the time, but when they undock a laptop and close the lid (it goes to sleep), I was wondering what would happen to a task running on it...

    I could avoid this entirely by *not* adding laptops to the cluster, but I have several 'small' tasks running most (> 98%) of the time that take very little time to complete, and I wanted to harness the laptops' computing power as well.

    thanks for the answer, @Yutong!

    Saturday, July 16, 2016 5:39 PM