Transient TCP timeout errors - need help troubleshooting

    Question

  • Hello - 

    I had a bunch of failures in my application this morning that I'd like help troubleshooting. The error messages from my application are below. I'm not really sure where to start in diagnosing what caused this error to start occurring. The failures were transient: I retried my application multiple times and ran into the same errors on seemingly random nodes. I then rebooted the HPC head node and all worker nodes in the fleet, and the issue has not recurred since. Since this is a high-priority application, I'd like to understand what the problem is so that I can mitigate it going forward (or, best case, ensure it doesn't happen again).


    FYI, I redacted the IP address from the initial error message and replaced it with [HOSTNAME_OF_HEAD_NODE]:[PORT].


    Most instances failed with this error:
    --> Could not connect to net.tcp://[HOSTNAME_OF_HEAD_NODE]:[PORT]/31568/NetTcp. 
    The connection attempt lasted for a time span of 00:00:21.0049179. 
    TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [IP_ADDRESS_OF_HEAD_NODE]:[PORT]
    A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [IP_ADDRESS_OF_HEAD_NODE]:[PORT]


    One instance failed with this error:
    --> Could not connect to net.tcp://[HOSTNAME_OF_HEAD_NODE]:[PORT]/31644/NetTcp/Controller. 
    The connection attempt lasted for a time span of 00:00:21.0058951. 
    TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [IP_ADDRESS_OF_HEAD_NODE]:[PORT]
    A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [IP_ADDRESS_OF_HEAD_NODE]:[PORT]

    Another instance failed with this error:
    --> Could not connect to net.tcp://[HOSTNAME_OF_HEAD_NODE]:[PORT]/BrokerLauncher. 
    The connection attempt lasted for a time span of 00:00:21.0029630. 
    TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [IP_ADDRESS_OF_HEAD_NODE]:[PORT]
    A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond [IP_ADDRESS_OF_HEAD_NODE]:[PORT]
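
If the failures recur, a small probe run from an affected worker node can show whether the head node's port accepts TCP connections at all. The sketch below is a minimal, hypothetical helper: the name `probe` and the command-line arguments are assumptions for illustration and are not part of HPC Pack. It makes a single connection attempt and reports the outcome and the elapsed time. Error 10060 is the Winsock timeout code (WSAETIMEDOUT), so a probe that also times out would suggest looking at the network path or the listening service before the application itself.

```python
import socket
import sys
import time


def probe(host, port, timeout=25.0):
    """Attempt one TCP connection; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False, time.monotonic() - start, str(exc) or type(exc).__name__


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python probe.py <head-node-hostname> <port>
    ok, elapsed, err = probe(sys.argv[1], int(sys.argv[2]))
    print(f"ok={ok} elapsed={elapsed:.1f}s error={err}")
```

Running it in a loop from several worker nodes while the problem is occurring would show whether the timeouts follow particular nodes or affect the head node as a whole.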

      
    Monday, September 24, 2018 17:22

All replies

  • Could you tell us the version of your HPC cluster?

    And is your application a "batch job", an "MPI job", or an "SOA job"?


    Qiufang Shi

    Tuesday, September 25, 2018 08:38