Non-domain-joined nodes in error state

    Question

  • The head node is domain joined, and I am trying to use ephemeral compute nodes that are not domain joined but are attached to the same network.

    The same version of HPC Pack is installed on all nodes. The compute nodes show up in the master but are in an error state with the error "HPC Node Manager Service Unreachable".

    Name resolution between hosts is working without issue.  I can ping between hosts and the firewalls are configured to be open.

    Looking at the SOA logs, I saw the following:

    06/24/2018 17:05:20.878 i HpcSoaDiagMon.exe 5796 5280 [GetCertificateValidationCallback] Bypass certificate CN validation.  
    06/24/2018 17:05:20.878 i HpcSoaDiagMon.exe 5796 5280 [GetCertificateValidationCallback] Bypass certificate CN validation.  
    06/24/2018 17:05:21.096 i HpcSoaDiagMon.exe 5796 5280 [HpcFabricRestContext] Calling resolve/singleton on PWAWOSCM552400C:443 with parameter SessionLauncherStatefulService.  
    06/24/2018 17:05:23.096 i HpcSoaDiagMon.exe 5796 5372 cert issuer CN=HPC Pack 2016 Communication, cert subject CN=HPC Pack 2016 Communication, sslPolicyErrors RemoteCertificateNameMismatch  
    06/24/2018 17:05:23.096 i HpcSoaDiagMon.exe 5796 5372 cert issuer CN=HPC Pack 2016 Communication, cert subject CN=HPC Pack 2016 Communication, sslPolicyErrors RemoteCertificateNameMismatch  

    So I generated a new cert and redeployed it to the compute node.

    Now I am seeing things like this in the SOA logs:

    06/24/2018 18:20:49.853 e HpcSoa 5848 2148 [DiagCleanerBase] TimerCallback: Error happened when GetAllSessionId, System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '23:59:59.9060000'. ---> System.IO.IOException: The write operation failed, see inner exception. ---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '23:59:59.9060000'. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host..   at System.ServiceModel.Channels.SocketConnection.HandleSendAsyncCompleted()..   at System.ServiceModel.Channels.SocketConnection.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout, WaitCallback callback, Object state)..   --- End of inner exception stack trace ---..   at System.ServiceModel.Channels.SocketConnection.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout, WaitCallback callback, Object state)..   at System.ServiceModel.Channels.BufferedConnection.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout, WaitCallback callback, Object state)..   at System.ServiceModel.Channels.ConnectionStream.BeginWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncCallback callback, Object state)..   at System.Net.Security._SslStream.StartWriting(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   at System.Net.Security._SslStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   --- End of inner exception stack trace ---..   
at System.Net.Security._SslStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   at System.Net.Security._SslStream.BeginWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncCallback asyncCallback, Object asyncState)..   at System.ServiceModel.Channels.StreamConnection.BeginWrite(Byte[] buffer, Int32 offset, Int32 size, Boolean immediate, TimeSpan timeout, WaitCallback callback, Object state)..   --- End of inner exception stack trace ---..   at Microsoft.Hpc.Scheduler.Session.Internal.RetryHelper`1.<InvokeOperationAsync>d__7.MoveNext()..--- End of stack trace from previous location where exception was thrown ---..   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()..   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)..   at Microsoft.Hpc.Scheduler.Session.Internal.Diagnostics.RetryableSchedulerAdapterClient.<GetAllSessionId>d__6.MoveNext()..--- End of stack trace from previous location where exception was thrown ---..   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()..   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)..   at Microsoft.Hpc.Scheduler.Session.Internal.Diagnostics.DiagCleanerBase.<TimerCallback>d__13.MoveNext() 

    Not sure where else I should be looking at this point.

    Sunday, 24 June 2018 6:35 PM

All replies

  • Looking at the scheduler log on the worker node I can see the following:

    06/24/2018 18:37:11.496 i HpcNodeManager.exe 2196 6120 [HpcFabricRestContext] Calling resolve/singleton on <HEAD NODE NAME REDACTED>:443 with parameter SchedulerStatefulService.  
    06/24/2018 18:37:12.512 i HpcNodeManager.exe 2196 4984 [HpcFabricRestContext] resolve/singleton on <HEAD NODE NAME REDACTED>:443 with parameter SchedulerStatefulService returned <HEAD NODE NAME REDACTED>.  
    06/24/2018 18:37:12.606 i HpcNodeManager.exe 2196 6120 [WcfProxy] Begin connect to endpointStr net.tcp://10.193.64.13:5970/CCPSchedulerListener.remote, dnsIdentityName HPC Pack 2016 Communication  
    06/24/2018 18:37:12.606 i HpcNodeManager.exe 2196 6120 [WcfProxy] End to create internal wcf proxy  
    06/24/2018 18:37:12.621 w HpcNodeManager.exe 2196 6120 [WcfProxy] Channel to net.tcp://10.193.64.13:5970/CCPSchedulerListener.remote faulted.  
    06/24/2018 18:37:12.621 e HpcScheduler 2196 6120 [RemotingNMCommImpl] Reset.Exception detail: System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:10:00'. ---> System.IO.IOException: The read operation failed, see inner exception. ---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:10:00'. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host..   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)..   at System.ServiceModel.Channels.SocketConnection.ReadCore(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, Boolean closing)..   --- End of inner exception stack trace ---..   at System.ServiceModel.Channels.SocketConnection.ReadCore(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, Boolean closing)..   at System.ServiceModel.Channels.SocketConnection.Read(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout)..   at System.ServiceModel.Channels.DelegatingConnection.Read(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout)..   at System.ServiceModel.Channels.ConnectionStream.Read(Byte[] buffer, Int32 offset, Int32 count)..   at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count)..   at System.Net.Security._SslStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   at System.Net.Security._SslStream.StartReading(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   
at System.Net.Security._SslStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   --- End of inner exception stack trace ---..   at System.Net.Security._SslStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   at System.Net.Security.SslStream.Read(Byte[] buffer, Int32 offset, Int32 count)..   at System.ServiceModel.Channels.StreamConnection.Read(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout)..   --- End of inner exception stack trace ---....Server stack trace: ..   at System.ServiceModel.Channels.StreamConnection.Read(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout)..   at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.SendPreamble(IConnection connection, ArraySegment`1 preamble, TimeoutHelper& timeoutHelper)..   at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.DuplexConnectionPoolHelper.AcceptPooledConnection(IConnection connection, TimeoutHelper& timeoutHelper)..   at System.ServiceModel.Channels.ConnectionPoolHelper.EstablishConnection(TimeSpan timeout)..   at System.ServiceModel.Channels.ClientFramingDuplexSessionChannel.OnOpen(TimeSpan timeout)..   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)..   at System.ServiceModel.Channels.ServiceChannel.OnOpen(TimeSpan timeout)..   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)..   at System.ServiceModel.Channels.ServiceChannel.CallOpenOnce.System.ServiceModel.Channels.ServiceChannel.ICallOnce.Call(ServiceChannel channel, TimeSpan timeout)..   at System.ServiceModel.Channels.ServiceChannel.CallOnceManager.CallOnce(TimeSpan timeout, CallOnceManager cascade)..   at System.ServiceModel.Channels.ServiceChannel.EnsureOpened(TimeSpan timeout)..   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)..   
at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)..   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)....Exc.Current stack: (null) 

    Based on this, it looks like the compute node is properly resolving the head node's IP, so I don't think it's an issue with DNS, but something does appear to be interrupting the connection. The firewall on the master node is set to allow all the HPC ports.
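
    A quick way to confirm that a port is reachable at the TCP level is a plain connection attempt. The following is a minimal sketch, not part of HPC Pack; the function name `can_connect` is made up for illustration, and the address and port in the usage comment are simply the ones that appear in the node manager log above. Note that a successful TCP connect does not prove the TLS handshake or the WCF channel will succeed; it only rules out a closed or filtered port.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened.

    This checks only TCP reachability (routing and firewall); it says
    nothing about certificates or what the service does after accepting.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage (endpoint copied from the log above; adjust for your cluster):
#   can_connect("10.193.64.13", 5970)
```

    Running the same check in both directions (compute node to head node, and head node to compute node) helps separate firewall problems from problems that occur after the connection is accepted.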

    Sunday, 24 June 2018 6:42 PM
  • Hi,

    "So I generated a new cert and redeployed it to the compute node."

    You don't need to do this. In fact, if the newly generated cert is not in the head node's trust chain, the two nodes won't be able to talk to each other.

    "cert issuer CN=HPC Pack 2016 Communication, cert subject CN=HPC Pack 2016 Communication, sslPolicyErrors RemoteCertificateNameMismatch"

    This message is expected (you can see that the message level is information). HPC Pack ignores a subject name mismatch by default. The message is only logged to help trace which policy error occurred.
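
    To focus on the entries that matter, the text logs quoted in this thread can be filtered by level. The sketch below assumes the field layout seen in the excerpts above (date, time, one-letter level, process name, process ID, thread ID, message). Only "i" is confirmed in this thread to mean information; treating "e" as error and "w" as warning is an assumption based on how those lines read. The names `LogEntry`, `parse_line`, and `errors_and_warnings` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Layout assumed from the log excerpts in this thread:
#   MM/DD/YYYY HH:MM:SS.mmm <level> <process> <pid> <tid> <message>
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[a-z])\s+"
    r"(?P<process>\S+)\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<tid>\d+)\s+"
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    date: str
    time: str
    level: str
    process: str
    pid: int
    tid: int
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return LogEntry(
        m["date"], m["time"], m["level"], m["process"],
        int(m["pid"]), int(m["tid"]), m["message"].rstrip(),
    )


def errors_and_warnings(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only entries whose level is 'e' or 'w' (assumed error/warning)."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.level in ("e", "w")]
```

    With this filter, the RemoteCertificateNameMismatch lines (level "i") drop out, leaving the channel fault and socket exceptions.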

    Thanks,
    Zihao

    Monday, 25 June 2018 2:42 AM
  • Where else should I be looking to see why the non-domain-joined compute node won't go green for health?

    The services are running, nodes can communicate and name resolution is working as expected.

    Not sure what else I can check at this point to figure out what is happening.

    Monday, 25 June 2018 4:24 AM
  • Hi,

    You can send us the scheduler log from the head node and the node manager logs from both the head node and the compute node so we can investigate.

    Scheduler log at %CCP_LOGROOT_SYS%Scheduler\HpcScheduler_*.bin

    NodeManager log at %CCP_LOGROOT_SYS%Scheduler\HpcNodeManager_*.bin

    You can send them to hpcpack@microsoft.com. Please send more than 3 logs of each kind.

    Thanks,

    Zihao

    Monday, 25 June 2018 4:31 AM
  • Thanks Zihao,

    I sent over what I had, which was only about 2 of each since the cluster is brand new.

    thanks!

    -Zach

    Monday, 25 June 2018 5:45 AM
  • Hi Zach,

    Thanks for the logs. I see this in the scheduler log:

    06/25/2018    04:16:01.325    e    HpcScheduler    2552    6044    [RemotingCommunicator] Resource 0, Node EC2AMAZ-093T1TT, .Exception detail: Microsoft.Hpc.Scheduler.Properties.SchedulerException: Node EC2AMAZ-093T1TT is unreachable because no IPV4 address could be found for it...   [...]

    Can you ping the compute node (CN) from the head node (HN) using its host name, EC2AMAZ-093T1TT?
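
    Since the scheduler error says that no IPv4 address could be found for the node, it can also help to check name resolution directly rather than relying on ping alone. The following is a minimal sketch that asks the operating system's resolver for IPv4 addresses only; `resolve_ipv4` is a hypothetical helper name, and whether HPC Pack resolves names in exactly this way is an assumption.

```python
import socket
from typing import List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses for hostname.

    An empty list means the system resolver could not find any IPv4
    address for the name.
    """
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised when the
        # name cannot be resolved.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


# Example usage, run on the head node:
#   resolve_ipv4("EC2AMAZ-093T1TT")
```

    An empty result on the head node for the compute node's host name would be consistent with the "no IPV4 address could be found" error in the scheduler log.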

    Thanks,
    Zihao

    Monday, 25 June 2018 6:09 AM
  • Zihao,

    It looks like I can't resolve the computer name from the master node. This is a problem, since the whole point of using a non-domain-joined node was to avoid having to add it to AD, because we are going to be using ephemeral nodes that will be spun up on demand. We aren't in Azure, and I am working within the constraints of an existing enterprise network whose AD controls I don't have access to. The thought was that we could use ephemeral non-domain-joined nodes and everything would be fine.

    What is particularly frustrating is the fact that the IP address of the node in question shows up just fine in the HPC console but for some reason HPC Pack insists on doing a name lookup in order to establish communication back to the compute node.

    Is there any way around this? Entries in the Hosts file are not an option in this scenario.

    Thanks!

    Tuesday, 26 June 2018 5:31 PM
  • Hi Zach,

    The head node needs to resolve the compute node by its host name in order to start communication.

    In this scenario, you need to either fix the DNS problem you are facing or add the compute node to the head node's Hosts file.
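
    For reference, a Hosts file entry is just an IP address followed by whitespace and the host name. The sketch below formats and validates such a line; `hosts_entry` is a hypothetical helper, and the IP address in the usage comment is a placeholder, not the real address of the node. On Windows the file is located at %SystemRoot%\System32\drivers\etc\hosts, and editing it requires administrator rights.

```python
import ipaddress


def hosts_entry(ip: str, hostname: str) -> str:
    """Format one Hosts file line mapping hostname to an IPv4 address.

    Raises ValueError if ip is not a valid IPv4 address or if hostname
    is empty or contains whitespace.
    """
    addr = ipaddress.IPv4Address(ip)  # AddressValueError is a ValueError
    if not hostname or any(ch.isspace() for ch in hostname):
        raise ValueError("hostname must be non-empty and contain no whitespace")
    return f"{addr}\t{hostname}"


# Example usage (placeholder IP; substitute the compute node's real address):
#   hosts_entry("10.0.0.5", "EC2AMAZ-093T1TT")
```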

    Thanks,
    Zihao

    Wednesday, 27 June 2018 2:56 AM