HeadNode - No connection could be made because the target machine actively refused it.

    Question

  • This is regarding HPC Pack 2016 (not Update 1), single HeadNode, running on Server 2012 R2, with about 7 ComputeNodes and 100 WorkstationNodes.  A couple of things happened over the past week.

    First, we migrated the Head Node from one HyperVisor platform to another. During the migration the IP address changed, and the HOSTS file management really had us confused for a while: at first we didn't understand why ComputeNodes couldn't connect to the HeadNode, and then, once we discovered that HPC manages the HOSTS file, we were confused why HPC wasn't updating it with the new HeadNode IP.

    Second, we had a power failure last night, which took down the HyperVisor environment.  Now the HeadNode seems to be in a bad state.  The OS seems healthy and all HPC and Service Fabric services start, but we are getting a number of errors:

    EventLog/Microsoft/HPC/Runtime/Operational:
    [SchedulerHelper] Failed to load broker recover info: System.ServiceModel.EndpointNotFoundException: Could not connect to net.tcp://<headnode>:9092/SchedulerDelegation/Internal. The connection attempt lasted for a time span of 00:00:02.0312873. TCP error code 10061: No connection could be made because the target machine actively refused it <correct_headnode_ip>:9092.  --->

    EventLog/Microsoft/HPC/SOADiag/Operation:
    [DiagCleanerBase] TimerCallback: Error happened when GetAllSessionId, System.Net.Http.HttpRequestException: An error occurred while sending the request. ---> System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it <correct_headnode_ip>:8939
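    Both errors are TCP error 10061, meaning the head node is reachable but nothing is listening on ports 9092 and 8939. A minimal sketch for checking that from any node (the host name is a placeholder, not from this cluster):

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt a TCP connection and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, but no service bound to the port
        # (Windows error 10061: "actively refused").
        return "refused"
    except (socket.timeout, OSError):
        return "unreachable"

# Example with a placeholder host name:
# for port in (9092, 8939):
#     print(port, probe("headnode.example.com", port))
```

    "refused" here points at the HPC services not having started, rather than at DNS or the HOSTS file.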

    When running the HPC Console, it will load and seemingly display jobs and nodes, but any actions inside it result in:
    [Error] The operation could not be completed because a head node could not be reached.  Ensure that the HPC Management Service is running on at least one head node in the cluster.  Also, check that the IP address of each head node is correctly listed in DNS server by using the ping or nslookup command-line tools.  If an IP address is wrong, check the Windows\System32\drivers\etc\hosts file on each head node and remove the incorrect entries.  If you remove any entries, flush the DNS resolver cache on each head node by running the ipconfig /flushdns command, and then test again with ping or nslookup.

    HPC was not updating the HOSTS file, so we manually set it on the HeadNode and all ComputeNodes, which didn't help.  We currently have the ManageFile line set to false and all HPC entries commented out; no change.
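    When entries are being set and reverted on many nodes, it helps to diff each node's HOSTS file against the IPs you expect. A small sketch (the sample entries are illustrative, not from this cluster):

```python
def parse_hosts(text: str) -> dict:
    """Map hostname -> IP from hosts-file text, ignoring comments."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        parts = line.split()
        ip, names = parts[0], parts[1:]
        for name in names:
            entries[name.lower()] = ip
    return entries

def stale_entries(text: str, expected: dict) -> list:
    """Return (name, found_ip, expected_ip) for entries that disagree."""
    found = parse_hosts(text)
    return [(n, found[n], ip) for n, ip in expected.items()
            if n in found and found[n] != ip]

sample = """\
# managed by HPC (illustrative)
10.0.0.5   headnode        # old IP left over from the migration
"""
print(stale_entries(sample, {"headnode": "10.0.0.42"}))
# -> [('headnode', '10.0.0.5', '10.0.0.42')]
```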

    Not sure if the HOSTS file issue and this latest "connection actively refused" issue are related, or two separate issues.

    Any guesses on what happened and how to fix it!?!

    Thanks!

    Tuesday, March 27, 2018 5:54 PM

All replies

  • Hi Matt,

    On which node did you see No connection could be made because the target machine actively refused it <correct_headnode_ip>:8939 ? Can you visit http://<headnode>:8939/api/fabric/nodes on the same node?
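    That endpoint check can also be scripted, for running from several nodes at once (the host name is a placeholder):

```python
import json
import urllib.request

def nodes_url(headnode: str, port: int = 8939) -> str:
    """Build the fabric-nodes URL mentioned above."""
    return f"http://{headnode}:{port}/api/fabric/nodes"

def fetch_fabric_nodes(headnode: str, port: int = 8939):
    """Fetch and parse the nodes JSON; raises on connection failure."""
    with urllib.request.urlopen(nodes_url(headnode, port), timeout=5) as resp:
        return json.load(resp)

# Example (placeholder host name):
# print(fetch_fabric_nodes("headnode.example.com"))
```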

    Thanks,
    Zihao

    Wednesday, March 28, 2018 6:46 AM
  • Hi, Matt,

      For the first issue, ComputeNodes should update their HOSTS files with the new IP, but WorkstationNodes won't, as we don't manage the HOSTS file by default for workstation nodes and unmanaged server nodes. If you did observe that ComputeNodes didn't update their HOSTS files, please share the management logs from the HeadNode covering the time you changed the IP.

      For the second issue, it looks like there is a DNS resolution issue in your cluster, so the management service didn't start successfully. Could you collect and check the management service logs in %CCP_LOGROOT_SYS%Management from each head node? You can examine the logs with this tool: https://hpconlineservice.blob.core.windows.net/logviewer/LogViewer.UI.application

    Also, try restarting the management service in the Service Fabric cluster.


    Qiufang Shi

    Wednesday, March 28, 2018 9:28 AM
  • Zihao, this error is ON the headnode.  Yes, I can browse to that URL you provided, and get a "nodes.json" file returned, with only the HeadNode listed in the JSON.
    Wednesday, March 28, 2018 4:59 PM
  • Qiufang, thanks for the clarification on the HOSTS file management. I have found where in the HpcManagementHN logs we completed the HyperVisor conversion and started up the VM in the new environment.  Is there a way I can get these log files to you, or specific entries in the log files I should search for?  The cluster was actually able to run some jobs for about two weeks after this conversion and IP address change. We did notice that some of the Compute Nodes weren't active in the cluster because their HOSTS files were still pointing to the old HeadNode IP address. After manually updating the HOSTS files, some of them were available for jobs again, but some quickly reverted their HOSTS files back to the incorrect/old IP address.

    It wasn't until the power outage the other night that the environment came completely crumbling down.

    Some interesting log entries, we have this sequence of events repeating over and over (with lots of other non-errors between):
    LogName            Content
    HpcManagement  [Store   ] WCF Connection need to be rebuilt.

    HpcManagement  [InstSpace] The store may be corrupt, instance 28632b28-fe43-4fbe-b6be-4aa349eca463,46e5d359-52cd-413c-b9a6-ec62d3f4e075,243 and instance 28632b28-fe43-4fbe-b6be-4aa349eca463,30b15833-4c20-4df3-a67f-5d57077989e9,241 were both returned in the current instance set. If this error occurs continually then you will need to call support. If it is only temporary then it can be ignored.

    HpcManagement  [HpcManagement] Exception:
    Microsoft.SystemDefinitionModel.InstanceCacheLoadException: The instance collection of ids cannot be resolved in the current instance view.
       at Microsoft.SystemDefinitionModel.InstanceSpace.CommittedInstancesView.ResolveToFullId(IList`1 instanceIds)
       at Microsoft.SystemDefinitionModel.ModelQuery.ResolveToFullInstanceId(List`1 instanceIds)
       at Microsoft.SystemDefinitionModel.ModelQuery.PrefetchMemberInstances(IEnumerable`1 references, String memberName)
       at Microsoft.ComputeCluster.Management.HpcClusterManager.PopulateComputeNodeList()
       at Microsoft.ComputeCluster.Management.HpcClusterManager.<Initialize>d__26.MoveNext()

    HpcManagementStateless.exe  [ServiceFabric]Microsoft.SystemDefinitionModel.InstanceCacheLoadException: The instance collection of ids cannot be resolved in the current instance view.
       at Microsoft.SystemDefinitionModel.InstanceSpace.CommittedInstancesView.ResolveToFullId(IList`1 instanceIds)
       at Microsoft.SystemDefinitionModel.ModelQuery.ResolveToFullInstanceId(List`1 instanceIds)
       at Microsoft.SystemDefinitionModel.ModelQuery.PrefetchMemberInstances(IEnumerable`1 references, String memberName)
       at Microsoft.ComputeCluster.Management.HpcClusterManager.PopulateComputeNodeList()
       at Microsoft.ComputeCluster.Management.HpcClusterManager.<Initialize>d__26.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ManagementHeadNodeService.<StartService>d__4.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ComputeCluster.Management.ManagementServiceBase.<Start>d__4.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.Hpc.Management.HpcManagementStatelessService.<RunAsync>d__2.MoveNext()

    We have restarted the server numerous times, and it continually comes back to that loop above.

    Thanks!
    -Matt

    Wednesday, March 28, 2018 6:08 PM
  • That second error listed above, regarding possible corruption and duplicate instance IDs, led me to dig through the HPCManagement database.  Using the following query I was able to find the duplicate items:

    select instanceId,instanceName,Min(instanceVersion) as OlderVersion
    from Instances
    where instanceState = 2
    group by instanceId,instanceName
    having Count(*) > 1
    


    And then using this query I was able to set the instances with an older instanceVersion number to an instanceState of 3, which I assume means old/deprecated:

    update Instances
    set instanceState = 3
    from Instances
        inner join (
            select instanceId, instanceName, Min(instanceVersion) as OlderVersion
            from Instances
            where instanceState = 2
            group by instanceId, instanceName
            having Count(*) > 1
        ) OlderInstances
            on Instances.instanceId = OlderInstances.instanceId
            and Instances.instanceVersion = OlderInstances.OlderVersion
    

    The database now seems healthy and the Service Fabric is correctly starting up, with none of the previous errors recurring.
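    The rule the UPDATE applies can be stated as: within each (instanceId, instanceName) group that has more than one row in state 2 (active), mark the row with the lowest instanceVersion as state 3 (assumed deprecated). An in-memory sketch of that same rule, with made-up data, just to make the logic explicit:

```python
from collections import defaultdict

ACTIVE, DEPRECATED = 2, 3

def demote_older_duplicates(rows):
    """rows: dicts with instanceId, instanceName, instanceVersion,
    instanceState. Mirrors the SQL fix: in each (id, name) group with
    more than one ACTIVE row, mark the lowest-version row DEPRECATED."""
    groups = defaultdict(list)
    for r in rows:
        if r["instanceState"] == ACTIVE:
            groups[(r["instanceId"], r["instanceName"])].append(r)
    for dup in (g for g in groups.values() if len(g) > 1):
        min(dup, key=lambda r: r["instanceVersion"])["instanceState"] = DEPRECATED
    return rows
```

    Note that, like the SQL, this only demotes the single oldest row per group, which is sufficient when the corruption produced exactly one duplicate per instance.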

    I am now back to the issue we had prior to the database corruption, which is HOSTS file management.  During all this troubleshooting, we had just removed all the HPC entries in the HOSTS files on all Head and Compute Nodes.  The Compute Nodes have all repopulated their HOSTS files, but with the old/incorrect HeadNode IP, and the HeadNode has not repopulated its HOSTS file yet.

    Where are these Compute Nodes getting this incorrect IP?

    How can I force the Head Node to repopulate its HOSTS file, with its correct IP?
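    As a stopgap while HPC's own management isn't writing the file, the head node's entry can be corrected in place. A sketch of that edit (hypothetical names and IPs; this only rewrites text, it does not touch the real hosts file):

```python
def rewrite_headnode_entry(text: str, headnode: str, new_ip: str) -> str:
    """Replace any hosts line whose hostnames include `headnode` with
    one pointing at new_ip; append the entry if it was missing."""
    out, seen = [], False
    for line in text.splitlines():
        bare = line.split("#", 1)[0].split()
        if len(bare) >= 2 and headnode.lower() in (n.lower() for n in bare[1:]):
            out.append(f"{new_ip}\t{headnode}")
            seen = True
        else:
            out.append(line)
    if not seen:
        out.append(f"{new_ip}\t{headnode}")
    return "\n".join(out) + "\n"
```

    Keep in mind that if HPC management is re-enabled with stale data, it may overwrite the entry again, so the underlying network configuration still needs fixing.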

    Thanks!
    -Matt

    Thursday, March 29, 2018 12:35 PM
  • A little more info: this HeadNode was migrated from VMware to Hyper-V.  In HPC Cluster Manager -> Configuration -> Network, the network adapter listed still shows the Device Name, IP Address, and MAC Address from when the Head Node was on VMware. Obviously, Hyper-V uses different NIC drivers and MAC addresses. How can we update this Network section to reflect the new Hyper-V NIC Device Name, IP Address, and MAC Address?  And if we can do that and re-run "To-Do -> Configure Your Network", will that fix the HOSTS file issue automatically?

    Thanks!
    -Matt

    Thursday, March 29, 2018 1:59 PM
  • You should re-configure the network from HPC Pack Cluster Manager; that should help solve the issue.

    Qiufang Shi

    Friday, March 30, 2018 6:45 AM