Reason for Error Message: Job failed to start on some nodes or some nodes became unreachable.

  • Question

  • Hello all,

    We are using Microsoft HPC Pack 2012 R2.

    On several jobs we get the error message: Job failed to start on some nodes or some nodes became unreachable.

    It always happens when HPC tries to run the NodeRelease task. The only additional information we get is the date and time when this happened, but we don't get any information about which node it happened on.

    So maybe we have a faulty node, but we have no idea how to find out exactly where this happens, or why.

    Could anybody give us a hint on how to fix this kind of problem?

    Thank you very much in advance,

    best regards,


    Tuesday, February 17, 2015 10:24 PM

All replies

  • Hi Bobby,

    You can choose 'View All Tasks' from the job view dialog and filter for the failed tasks; the nodes allocated to the failed tasks should then indicate which nodes are the bad ones.
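
    Once the failed tasks and their allocated nodes are known, counting failures per node makes a repeatedly failing node stand out. The following is only a minimal sketch in Python: the record layout (the keys "state" and "allocated_nodes") and the sample data are hypothetical and are not an HPC Pack export format or API.

```python
from collections import Counter


def failures_per_node(tasks):
    """Count failed tasks per allocated node.

    `tasks` is a list of dicts with the hypothetical keys "state" and
    "allocated_nodes"; this layout is assumed for illustration only.
    """
    counts = Counter()
    for task in tasks:
        if task["state"] != "Failed":
            continue
        # A failed task without any recorded node goes under a placeholder.
        nodes = task.get("allocated_nodes") or ["(none recorded)"]
        counts.update(nodes)
    return counts


# Hypothetical sample data, for illustration only.
sample_tasks = [
    {"id": 1, "state": "Finished", "allocated_nodes": ["NODE01"]},
    {"id": 2, "state": "Failed", "allocated_nodes": ["NODE02"]},
    {"id": 3, "state": "Failed", "allocated_nodes": ["NODE02", "NODE03"]},
    {"id": 4, "state": "Failed", "allocated_nodes": []},
]

# Print nodes ordered by number of failed tasks, most failures first.
for node, count in failures_per_node(sample_tasks).most_common():
    print(node, count)
```

    Failed tasks with no node recorded are grouped under a placeholder so that they are not silently dropped from the count.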



    Thursday, February 26, 2015 8:36 AM
  • Hi Yutong,

    Thank you very much for your answer. When I do this as you describe and select the "Allocated Nodes" tab below the task list for the selected task, the list is just empty. So I have no information about which node (by name) this kind of error happened on.

    This behaviour occurs only for errors that have exactly this error message: Job failed to start on some nodes or some nodes became unreachable.

    So how can I find out which node was affected by this error?

    Thanks in advance,

    best regards,


    Thursday, February 26, 2015 9:11 AM
  • Hello all,

    This issue still exists in our HPC cluster. Some months ago we updated everything to Microsoft HPC Pack 2012 Update 1, and we still have this problem.

    We figured out that this error mostly happens with the NodeRelease task. In the end all other tasks run through without an error, and then the NodeRelease task fails with this error message. Somehow strange...

    Does anybody know whether this issue is fixed in Update 2, which has been available for a few days?

    Unfortunately, I didn't find a complete list of the changes and improvements in this Update 2 for Microsoft HPC 2012.

    Any help would be welcome,

    Thank you very much in advance,


    • Edited by Bobby013 Tuesday, July 14, 2015 6:52 AM Added more details
    Tuesday, July 14, 2015 6:48 AM
  • Hi Bobby,

    Can you share the details of a troublesome job?

    For example, how many node prep tasks, normal tasks and node release tasks there are, and their status.

    The job's error message and allocated nodes.

    The failed tasks' error messages, allocated nodes and outputs.

    And any other information you think is abnormal or helpful.

    If the root cause is that some of your nodes went into an error state and you didn't catch them in the Error state, there could be a network issue that occasionally disrupts the connection between the head node (HN) and the compute nodes (CNs).
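
    One way to look for such an intermittent network issue is to probe the nodes at regular intervals and record a timestamp for every failed attempt, so the failures can be compared with the date and time in the job error. The following is only a minimal sketch in Python: the node names and the port (445) are hypothetical placeholders, and a plain TCP connection test is an assumption about what "reachable" means here, not the check the HPC scheduler itself performs.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def log_unreachable(nodes, port, rounds, interval_s, check=probe):
    """Probe every node `rounds` times; return a list of (timestamp, node) failures."""
    failures = []
    for i in range(rounds):
        for node in nodes:
            if not check(node, port):
                stamp = datetime.now().isoformat(timespec="seconds")
                failures.append((stamp, node))
        if i < rounds - 1:
            time.sleep(interval_s)
    return failures


# Demo with a stubbed check so the sketch runs without a real cluster:
# the hypothetical node NODE02 is pretended to be unreachable.
demo = log_unreachable(
    ["NODE01", "NODE02"], port=445, rounds=2, interval_s=0,
    check=lambda host, port: host != "NODE02",
)
for stamp, node in demo:
    print(stamp, node, "unreachable")
```

    For a real run, leave out the `check` argument so that the default `probe` is used, and pass your own node names, a port you know is open on the compute nodes, and a non-zero interval.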

    I'll try to repro this with your detailed information.

    • Edited by SnOoPy1214 Wednesday, July 15, 2015 4:38 AM
    Wednesday, July 15, 2015 4:38 AM
  • Hi Bobby, I have a similar issue with HPC compute nodes in our environment. Were you able to resolve the issue?
    Tuesday, July 7, 2020 12:37 PM
  • Hi HPCMan,

    When did you see the nodes becoming unreachable, and how many nodes in the cluster were affected? Are your nodes on premises or in the Azure cloud?

    If you have network connectivity issues in the environment, you may try increasing the heartbeat interval and lost count in the scheduler error handling configuration and see whether this mitigates the problem.
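
    As a rough guide to what these two settings trade off, the arithmetic can be sketched in Python. This assumes a node is flagged as unreachable once it has missed "lost count" consecutive heartbeats; the function name and the example values are illustrative only.

```python
def unreachable_window_seconds(heartbeat_interval_s, lost_count):
    """Approximate time without heartbeats before a node is flagged.

    Assumes the node is flagged after `lost_count` consecutive missed
    heartbeats, sent every `heartbeat_interval_s` seconds.
    """
    if heartbeat_interval_s <= 0 or lost_count <= 0:
        raise ValueError("both settings must be positive")
    return heartbeat_interval_s * lost_count


# Illustrative value pairs (interval in seconds, lost count).
for interval, count in [(30, 3), (90, 30)]:
    window = unreachable_window_seconds(interval, count)
    print(f"{interval} s x {count} missed = {window} s ({window / 60:.1f} min)")
```

    Larger values tolerate longer network interruptions, but they also delay the detection of a node that is really down.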


    Yutong Sun

    Wednesday, July 8, 2020 9:07 AM
  • Hi Yutong, all our servers are on-prem. I have increased the heartbeat interval to 90 seconds and the missed heartbeats value to 30, but no luck. Where can I find the "lost count" setting?
    Monday, July 13, 2020 3:55 PM