Hi,
How long does it take for the hypervisor to migrate your compute node to another host? If this happens often and the OS state is preserved across the migration (live migration), you could try tuning the "heartbeat options" in the HPC Cluster Manager
GUI --> Options --> Job Scheduler Configuration --> Error Handling page. By default:
"Heartbeat interval: 30 seconds"
"Missed heartbeats: 3"
This means that if the scheduler misses 3 consecutive heartbeats from a compute node (about 90 seconds with the defaults), the node is marked "unreachable" and all tasks on that node are requeued. (By default, those tasks are retried 3 times; the setting is on the same page.)
When the node comes back, the scheduler cancels any tasks still running on it, although by then a task may already have finished. If you never want a task to be retried, you can set the task retry count to "0", or set the "Rerunnable" property to
false when creating the task.
For your scenario, you could increase "Missed heartbeats" to 10, provided the migration completes within 5 minutes (10 x 30 seconds = 300 seconds).
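
If it helps, here is a rough Python sketch of the arithmetic behind that suggestion. It assumes the node is marked unreachable after roughly (heartbeat interval x missed heartbeats) seconds, which is a simplification of what the scheduler really does. The function names are made up for illustration and are not part of HPC Pack.

```python
import math


def unreachable_after_seconds(heartbeat_interval_s, missed_heartbeats):
    """Approximate time without heartbeats before a node is marked unreachable.

    Assumption: the window is simply interval * missed count.
    """
    return heartbeat_interval_s * missed_heartbeats


def missed_heartbeats_needed(migration_s, heartbeat_interval_s=30):
    """Smallest missed-heartbeat count whose window covers the migration time."""
    return math.ceil(migration_s / heartbeat_interval_s)


# Default settings: 30-second interval, 3 missed heartbeats.
print(unreachable_after_seconds(30, 3))

# A migration that takes up to 5 minutes (300 seconds).
print(missed_heartbeats_needed(300))
```

If your migrations can run right up to the limit, you may want to add one or two heartbeats of margin on top of the computed value.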
Qiufang Shi