What to do in case of unresponsive compute nodes

  • Question

  • Hi,

    I am using HPC 2012 R3 version 4.5.5161.0
    We added a couple of virtual machines as compute nodes to our cluster.
    For reasons beyond our control, sometimes those nodes are completely unresponsive because they are migrated to another hypervisor. The machines will come back online a few minutes after the migration starts.

    I was wondering if anyone has experience with this type of issue.
     - Do compute nodes emit some kind of heartbeat? Will the HPC head node notice that the machines are unresponsive?
     - What happens if a machine becomes unresponsive after the HPC head node has scheduled tasks on it? Can we configure the head node to reschedule those tasks elsewhere?
     - Imagine the unresponsive machine comes back online and resumes the tasks that were assigned to it, but the head node has already rescheduled those tasks on another node. The result is that the same task is completed twice. How does the HPC head node handle these edge cases?

    Monday, June 18, 2018 9:08 AM

All replies

  • Hi, 

      How long does it take for the hypervisor to migrate your compute node to another host? If this happens often and the OS state is preserved across the migration (live migration), you could try tuning the heartbeat options in the HPC Cluster Manager GUI --> Options --> Job Scheduler Configuration --> Error Handling page. The defaults are:

        "heartbeat Interval: 30 seconds"

        "Missed heartbeats: 3"

    This means that if the scheduler does not receive a heartbeat from a compute node for 3 consecutive intervals (about 90 seconds with the defaults), the node is marked as "unreachable" and all tasks on that node are requeued. (By default, those tasks are retried 3 times; the setting is on the same page.) When the node comes back, the scheduler cancels all tasks still running on it, but sometimes a task may already have finished. If you never want a task to be retried, you can set the task retry count to "0", or set the "Rerunnable" property to false when creating the task.

    For your scenario, you could increase "Missed heartbeats" to 10 if the migration can be completed within 5 minutes.
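
    To make the arithmetic behind that suggestion explicit, here is a rough Python sketch. It assumes the scheduler marks a node unreachable after roughly "heartbeat interval x missed heartbeats" seconds; the function names are made up for illustration and are not part of HPC Pack, and the exact boundary behavior (whether an outage of exactly that length is tolerated) is not something this sketch can confirm.

```python
import math


def unreachable_after_seconds(heartbeat_interval_s: int, missed_heartbeats: int) -> int:
    """Approximate time without heartbeats before a node is marked unreachable.

    Assumption: the window is simply interval * missed-heartbeat count.
    """
    return heartbeat_interval_s * missed_heartbeats


def missed_heartbeats_needed(migration_s: float, heartbeat_interval_s: int) -> int:
    """Smallest missed-heartbeat count whose window covers the migration time."""
    return math.ceil(migration_s / heartbeat_interval_s)


# Defaults quoted above: 30-second interval, 3 missed heartbeats.
default_window = unreachable_after_seconds(30, 3)

# Suggested tuning: keep the 30-second interval, allow 10 missed heartbeats.
tuned_window = unreachable_after_seconds(30, 10)

# Count needed to cover a 5-minute migration at a 30-second interval.
needed = missed_heartbeats_needed(5 * 60, 30)

print(default_window, tuned_window, needed)
```

    With the defaults the window is about 90 seconds, which is shorter than a migration lasting a few minutes; with 10 missed heartbeats it grows to about 300 seconds. If your migrations can run right up to the 5-minute mark, it may be safer to add a little margin on top of the computed count.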

    Qiufang Shi

    Tuesday, June 19, 2018 3:14 AM