Workstations showing null heatmap

    Question

  • I just added several workstations to our HPC cluster and they are showing a null heatmap.

    We had an issue where the reverse DNS entry for the head node did not exist. I created it and rebooted the client, but it is still showing these errors.

    Does anyone have any ideas?

    When I check the logs...

    This is what it says in the event viewer

    Failed to initialize collector. Retrying in 60 seconds. System.IO.IOException: The read operation failed, see inner exception. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
       at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
       at System.Runtime.Remoting.Channels.SocketStream.Read(Byte[] buffer, Int32 offset, Int32 size)
       at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count)
       at System.Net.Security.NegotiateStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
       at System.Net.Security.NegotiateStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
       --- End of inner exception stack trace ---

    Server stack trace: 
       at System.Net.Security.NegotiateStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
       at System.Net.Security.NegotiateStream.Read(Byte[] buffer, Int32 offset, Int32 count)
       at System.Runtime.Remoting.Channels.SocketHandler.ReadFromSocket(Byte[] buffer, Int32 offset, Int32 count)
       at System.Runtime.Remoting.Channels.SocketHandler.Read(Byte[] buffer, Int32 offset, Int32 count)
       at System.Runtime.Remoting.Channels.Tcp.TcpFixedLengthReadingStream.Read(Byte[] buffer, Int32 offset, Int32 count)
       at System.IO.BinaryReader.ReadBytes(Int32 count)
       at System.Runtime.Serialization.Formatters.Binary.SerializationHeaderRecord.Read(__BinaryParser input)
       at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadSerializationHeaderRecord()
       at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run()
       at System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)
       at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)
       at System.Runtime.Remoting.Channels.CoreChannel.DeserializeBinaryResponseMessage(Stream inputStream, IMethodCallMessage reqMsg, Boolean bStrictBinding)
       at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)

    Exception rethrown at [0]: 
       at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
       at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
       at Microsoft.Hpc.Monitoring.IHpcMonitoringStore.GetMetrics(Nullable`1 target)
       at Microsoft.Hpc.Monitoring.MetricCollector.Initialize()

    Here is what it says in the hpcmonitoring client log on the client

    12/05/2016 20:23:13.394 e HpcMonitoringClient 2552 3252 Failed to initialize collector. Retrying in 60 seconds. System.Net.Sockets.SocketException (0x80004005): No such host is known....Server stack trace: ..   at System.Net.Dns.GetAddrInfo(String name)..   at System.Net.Dns.InternalGetHostByName(String hostName, Boolean includeIPv6)..   at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)..   at System.Runtime.Remoting.Channels.RemoteConnection.CreateNewSocket()..   at System.Runtime.Remoting.Channels.SocketCache.GetSocket(String machinePortAndSid, Boolean openNew)..   at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.SendRequestWithRetry(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream)..   at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.ProcessMessage(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream, ITransportHeaders& responseHeaders, Stream& responseStream)..   at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)....Exception rethrown at [0]: ..   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)..   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)..   at Microsoft.Hpc.Monitoring.IHpcMonitoringStore.GetMetrics(Nullable`1 target)..   at Microsoft.Hpc.Monitoring.MetricCollector.Initialize()  
    12/05/2016 20:23:14.175 i HpcMonitoringClient 2552 3084 NodeId timer invoked  
    12/05/2016 15:24:12.454 i HpcMonitoringClient 2552 3252 Querying metrics for target WorkstationNode  
    12/05/2016 15:24:31.469 e HpcMonitoringClient 2552 3252 Failed to initialize collector. Retrying in 60 seconds. System.IO.IOException: The read operation failed, see inner exception. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host..   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)..   at System.Runtime.Remoting.Channels.SocketStream.Read(Byte[] buffer, Int32 offset, Int32 size)..   at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count)..   at System.Net.Security.NegotiateStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   at System.Net.Security.NegotiateStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   --- End of inner exception stack trace ---....Server stack trace: ..   at System.Net.Security.NegotiateStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   at System.Net.Security.NegotiateStream.Read(Byte[] buffer, Int32 offset, Int32 count)..   at System.Runtime.Remoting.Channels.SocketHandler.ReadFromSocket(Byte[] buffer, Int32 offset, Int32 count)..   at System.Runtime.Remoting.Channels.SocketHandler.Read(Byte[] buffer, Int32 offset, Int32 count)..   at System.Runtime.Remoting.Channels.Tcp.TcpFixedLengthReadingStream.Read(Byte[] buffer, Int32 offset, Int32 count)..   at System.IO.BinaryReader.ReadBytes(Int32 count)..   at System.Runtime.Serialization.Formatters.Binary.SerializationHeaderRecord.Read(__BinaryParser input)..   at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadSerializationHeaderRecord()..   at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run()..   at System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)..   at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)..   at System.Runtime.Remoting.Channels.CoreChannel.DeserializeBinaryResponseMessage(Stream inputStream, IMethodCallMessage reqMsg, Boolean bStrictBinding)..   at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)....Exception rethrown at [0]: ..   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)..   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)..   at Microsoft.Hpc.Monitoring.IHpcMonitoringStore.GetMetrics(Nullable`1 target)..   at Microsoft.Hpc.Monitoring.MetricCollector.Initialize()


    • Edited by nicka345 Monday, December 5, 2016 3:40 PM
    Monday, December 5, 2016 3:39 PM

All replies

  • OK, I created an entry in /etc/hosts for the head node on the client (even though nslookup returns the correct value for both the forward and reverse entries),

    and now this is the error:

    12/05/2016 22:32:39.457 i HpcMonitoringClient 2640 3200 Querying metrics for target WorkstationNode  
    12/05/2016 22:32:39.488 e HpcMonitoringClient 2640 3200 Failed to initialize collector. Retrying in 60 seconds. System.Net.Sockets.SocketException (0x80004005): The requested name is valid, but no data of the requested type was found....Server stack trace: ..   at System.Net.Dns.GetAddrInfo(String name)..   at System.Net.Dns.InternalGetHostByName(String hostName, Boolean includeIPv6)..   at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)..   at System.Runtime.Remoting.Channels.RemoteConnection.CreateNewSocket()..   at System.Runtime.Remoting.Channels.SocketCache.GetSocket(String machinePortAndSid, Boolean openNew)..   at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.SendRequestWithRetry(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream)..   at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.ProcessMessage(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream, ITransportHeaders& responseHeaders, Stream& responseStream)..   at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)....Exception rethrown at [0]: ..   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)..   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)..   at Microsoft.Hpc.Monitoring.IHpcMonitoringStore.GetMetrics(Nullable`1 target)..   at Microsoft.Hpc.Monitoring.MetricCollector.Initialize()  
    12/05/2016 22:32:40.301 i HpcMonitoringClient 2640 3064 NodeId timer invoked  
    12/05/2016 17:33:39.584 i HpcMonitoringClient 2640 3200 Querying metrics for target WorkstationNode  
    12/05/2016 17:33:58.600 e HpcMonitoringClient 2640 3200 Failed to initialize collector. Retrying in 60 seconds. System.IO.IOException: The read operation failed, see inner exception. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host..   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)..   at System.Runtime.Remoting.Channels.SocketStream.Read(Byte[] buffer, Int32 offset, Int32 size)..   at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count)..   at System.Net.Security.NegotiateStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   at System.Net.Security.NegotiateStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   --- End of inner exception stack trace ---....Server stack trace: ..   at System.Net.Security.NegotiateStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)..   at System.Net.Security.NegotiateStream.Read(Byte[] buffer, Int32 offset, Int32 count)..   at System.Runtime.Remoting.Channels.SocketHandler.ReadFromSocket(Byte[] buffer, Int32 offset, Int32 count)..   at System.Runtime.Remoting.Channels.SocketHandler.Read(Byte[] buffer, Int32 offset, Int32 count)..   at System.Runtime.Remoting.Channels.Tcp.TcpFixedLengthReadingStream.Read(Byte[] buffer, Int32 offset, Int32 count)..   at System.IO.BinaryReader.ReadBytes(Int32 count)..   at System.Runtime.Serialization.Formatters.Binary.SerializationHeaderRecord.Read(__BinaryParser input)..   at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadSerializationHeaderRecord()..   at System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run()..   at System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)..   at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage)..   at System.Runtime.Remoting.Channels.CoreChannel.DeserializeBinaryResponseMessage(Stream inputStream, IMethodCallMessage reqMsg, Boolean bStrictBinding)..   at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)....Exception rethrown at [0]: ..   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)..   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)..   at Microsoft.Hpc.Monitoring.IHpcMonitoringStore.GetMetrics(Nullable`1 target)..   at Microsoft.Hpc.Monitoring.MetricCollector.Initialize()
    12/05/2016 17:34:58.631 i HpcMonitoringClient 2640 3200 Querying metrics for target WorkstationNode  

    Monday, December 5, 2016 6:07 PM
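
A note on the nslookup check in the post above: nslookup sends its queries straight to the DNS server and does not go through the operating system resolver (hosts file, local cache), whereas the failing call in the log is System.Net.Dns.GetAddrInfo, which does. A lookup through the OS resolver may therefore be a closer match to what the monitoring client sees. Below is a minimal sketch in Python, whose socket.getaddrinfo also uses the OS resolver. The helper name resolve_report is illustrative and is not part of HPC Pack.

```python
import socket


def resolve_report(hostname):
    """Resolve hostname through the OS resolver, then reverse-resolve each address."""
    report = {"hostname": hostname, "addresses": [], "reverse": {}, "error": None}
    try:
        # Forward lookup through the OS resolver (hosts file, cache, DNS).
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        report["error"] = f"forward lookup failed: {exc}"
        return report

    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    report["addresses"] = sorted({info[4][0] for info in infos})

    for addr in report["addresses"]:
        try:
            # Reverse lookup: address back to a host name.
            report["reverse"][addr] = socket.gethostbyaddr(addr)[0]
        except OSError as exc:  # covers socket.herror and socket.gaierror
            report["reverse"][addr] = f"reverse lookup failed: {exc}"
    return report
```

Running resolve_report with the head node's name on the workstation, and with the workstation's name on the head node, shows which address families come back and whether each address maps back to the expected name. A name that returns only IPv6 addresses, or nothing at all, would be worth a closer look.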
  • Hi, Nicka,

    What HPC version are the head node and the workstation node running?

    On the workstation node, can you ping the head node by name and confirm that it can connect to the head node?

    I assume the head node and the workstation node are in the same domain, right?

    Tuesday, December 6, 2016 1:24 AM
  • Yes, the head node and the client can ping each other, and yes, they are in the same domain. The HPC version is 4.5.5079.0 on Windows 2012 R2, and the clients are Windows 10.
    Tuesday, December 6, 2016 2:00 PM
  • I wanted to clarify: the head node can ping the workstation node by hostname and by IP address, and the workstation node can ping the head node by hostname and by IP address as well. Both can resolve an nslookup of the forward and reverse entries.

    What I don't understand is that after this error

    "The requested name is valid, but no data of the requested type was found"

    I get these errors every minute on the workstation node

    "The read operation failed, see inner exception. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host"

    Does that mean that the head node is not letting the client connect?

    I also noticed that if I try to run Get-HpcJob or Set-HpcJob on the workstation node, it fails with

    Set-HpcJob : The read operation failed, see inner exception.

    But normal HPC jobs run fine on those nodes (ones that don't call back to the head node, for example with Get-HpcJob or something similar), and I can run remote commands on the workstation nodes through the cluster manager as well.

    I've rebooted the workstation node and even reimaged it, and it is still the same.

    The only things that stay the same when we reimage are the IP address and the hostname. I deleted the object from AD, so it should have gotten a new SID.

    I restarted the SDM service on the head node and ran ipconfig /flushdns. I have not yet rebooted the head node; I really want to avoid that if at all possible.

    It is almost as if the head node has locked out all communication from the workstation node to the head node.

    When I first imaged these machines I did a lot of robocopies using the remote command in the cluster manager, and during that time the reverse DNS entry for the head node was missing. I'm wondering whether the head node locked out the workstation node during that time. If I'm on the right track, is there some cache on the head node that has this stored and that I can clear out?


    • Edited by nicka345 Tuesday, December 6, 2016 9:15 PM
    Tuesday, December 6, 2016 2:40 PM
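
On whether the head node is refusing the client: in the trace, the failure is raised from Socket.Receive underneath NegotiateStream, which suggests the TCP connection was opened and then reset while the authentication frame was being read, rather than never connecting at all. A ping does not exercise TCP, so a quick connect probe may help separate "cannot connect" from "connects, then gets reset". This is a rough sketch in Python. The helper name probe_tcp is illustrative, and the port is an assumption you have to supply: this thread does not say which port the monitoring service listens on.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try a plain TCP connect and describe the outcome as a short string."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.gaierror:
        return "unresolved"   # the name did not resolve through the OS resolver
    except ConnectionRefusedError:
        return "refused"      # host reachable, nothing listening on that port
    except socket.timeout:
        return "timeout"      # no answer, often a firewall dropping packets
    except OSError as exc:
        return f"error: {exc}"
```

If the probe reports "connected" from the workstation while the monitoring client still logs the forcibly-closed error, the reset is happening after the connection is established, which points at the layer above plain TCP reachability.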
  • Hi, Nicka,

    This seems to be a DNS-related issue. You can look up the messages in the Windows Sockets Error Codes reference: https://msdn.microsoft.com/en-us/library/ms740668%28VS.85%29.aspx?f=255&MSPPError=-2147217396

    Can you try resetting the Windows TCP/IP stack with netsh int ip reset, then try again?

    Also, besides this workstation node, do any other compute nodes or workstation nodes in your HPC cluster have the same issue?
    • Marked as answer by nicka345 Thursday, December 8, 2016 2:46 PM
    Wednesday, December 7, 2016 1:32 AM
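
For reference, the three socket messages that appear in the logs in this thread correspond to the following entries in the Windows Sockets error code list. The small Python sketch below tags a log line with the matching code. The table and the classify helper are illustrative, and the "kind" column is my own grouping rather than anything from the reference.

```python
# Message text as it appears in the logs -> (Winsock code, symbolic name, kind).
WINSOCK_MESSAGES = {
    "No such host is known":
        (11001, "WSAHOST_NOT_FOUND", "name resolution"),
    "The requested name is valid, but no data of the requested type was found":
        (11004, "WSANO_DATA", "name resolution"),
    "An existing connection was forcibly closed by the remote host":
        (10054, "WSAECONNRESET", "connection"),
}


def classify(log_line):
    """Return the (code, name, kind) tuples whose message text occurs in log_line."""
    return [entry for message, entry in WINSOCK_MESSAGES.items()
            if message in log_line]
```

Two of the three messages are name-resolution failures, which fits the DNS reading above. The connection reset is a different class of error and is the one that kept recurring every minute.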
  • That fixed it!!! 

    I had to run it on the headnode and reboot the headnode. 

    Thanks so much!

    Nicki

    Thursday, December 8, 2016 2:46 PM