Node became unreachable issue

    Question

  • Dear all,

    For the last few days, our environment has consistently been reporting a "node became unreachable" error message. From the IT side, everything looks good, and no network bottleneck was observed during that time. We have the stack trace below.

    03/14/2018         17:36:46.638      e             HpcScheduler     1688      4712               [RemotingCommunicator] Job 328, Resource 0, Node ***************, .Exception detail: Microsoft.Hpc.Scheduler.Properties.SchedulerException: Node *************** became unreachable. Ensure that all nodes are available and submit the job again.
       at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeController.GetNodeManager(String nodeName)
       at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeController.EndJob(String nodeName, EndJobArg arg, NodeCommunicatorCallBack`1 callback)
    Current stack:
       at Microsoft.Hpc.Scheduler.SchedulerTracingUtil.GenMessageFormat(String message, Object[] args, String e, String& newMessage, Object[]& newArgs)
       at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Int32 jobId, Int32 taskId, Int32[] resourceId, String nodeName, Exception e, TraceEventType level, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.ResourceController.NodeCommunicatorTracer.TraceException(Int32 JobId, Int32 TaskId, Int32 ResourceId, String NodeName, Exception exception, String message, Object[] args)
       at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeController.EndJob(String nodeName, EndJobArg arg, NodeCommunicatorCallBack`1 callback)
       at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.CloseResources(IEnumerable`1 nodes, Boolean closeAll, Boolean forceAll)
       at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.ProcessTransitionalResources(IEnumerable`1 nodes)
       at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.OnTaskFinished(TaskMonitor task)
       at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.ProcessEvent(SchedulerEvent e)
       at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.Run()
       at Microsoft.Hpc.Scheduler.ResourceController.MonitorThread.RunMonitor

    We are blocked and don't know where to look. Is there any other log we can check that provides more detail and would help us pinpoint the exact cause?

    Thanks,

    Puneet


    Puneet Sharma

    Friday, 16 March 2018 17:46

All replies

  • Hi Sharma,

      Could you share the whole log file and describe the symptoms? You should also check the nodemanager log on that node to see what is happening there.


    Qiufang Shi

    Monday, 19 March 2018 08:28
  • It looks like our certificate got corrupted, which caused this issue. I will let you know if we face this issue again. Thanks, Qiufang.

    Puneet Sharma

    Tuesday, 20 March 2018 15:32
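
For anyone triaging the same error, here is a minimal sketch of how the scheduler log could be scanned for "became unreachable" events, to see which nodes and jobs are affected before opening the nodemanager log on those nodes. It assumes the log has been exported as plain text with one entry per line, laid out like the single line quoted in the question; the regular expression and the function names `find_unreachable` and `count_by_node` are illustrative and not part of HPC Pack.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this thread (assumption):
# date, time, ..., "Job <id>, Resource <n>, Node <name>, ... became unreachable".
# Other HPC Pack versions or export tools may lay the fields out differently.
UNREACHABLE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r".*?Job (?P<job>\d+), Resource \d+, Node (?P<node>[^,]+),"
    r".*became unreachable"
)


def find_unreachable(lines):
    """Yield (date, time, job_id, node) for each matching log line."""
    for line in lines:
        m = UNREACHABLE.search(line)
        if m:
            yield m["date"], m["time"], int(m["job"]), m["node"].strip()


def count_by_node(lines):
    """Count 'became unreachable' events per node name."""
    return Counter(node for _, _, _, node in find_unreachable(lines))
```

Passing an open text file (or any iterable of lines) to `count_by_node` gives a per-node tally. A node that dominates the tally is a reasonable first place to check the nodemanager log and, given how this thread was resolved, the certificate.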