Broker is unavailable due to loss of heartbeat - HPC 2008 R2

  • Question

  • We've been getting the following error fairly regularly, and it is causing our jobs to fail: "Broker is unavailable due to loss of heartbeat. Make sure you can connect to the broker node, the HPC Broker service is running on the broker node and the session is still running". I've searched for information on this error and found very little. Our setup:

    • Broker node running on a Windows Server 2008 R2 SP1 machine with two cores and 12 GB of RAM. The WCF service that creates jobs and submits them to the broker also runs on this server.
    • Two compute nodes running Windows Server 2008 R2 SP1, each with 30 cores and 120 GB of RAM.

    It seems like we get the error more often when the cluster is running a lot of jobs, but that isn't always the case.

    Is this error safe to ignore? If so, do I need to reconnect to the session to get the remaining results as they finish? Are there any configuration changes I should make? I changed the heartbeat interval to 60 seconds and the missed heartbeats setting to five, but that didn't seem to help.

    Monday, March 18, 2013 4:03 PM
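
To make the two settings mentioned in the question concrete, here is a minimal sketch of how a heartbeat interval and a missed-heartbeat limit typically combine. It is not the HPC Pack implementation: the `HeartbeatMonitor` class and its methods are hypothetical, and the assumptions that the miss counter resets on any successful heartbeat and that the broker is declared unavailable on reaching the limit are illustrative only. Under those assumptions, an interval of 60 seconds and a limit of five would mean roughly 300 seconds of continuous silence before the error is raised.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical client-side monitor; NOT part of the HPC Pack API."""

    interval_s: float = 60.0  # heartbeat interval from the question
    max_missed: int = 5       # missed-heartbeat limit from the question
    missed: int = 0           # consecutive misses seen so far

    def detection_window_s(self) -> float:
        # Approximate time of continuous silence before the broker
        # would be declared unavailable (assumed: interval * limit).
        return self.interval_s * self.max_missed

    def record(self, heartbeat_ok: bool) -> bool:
        """Record one heartbeat attempt.

        Returns True while the broker is still considered available.
        Assumption: a single successful heartbeat resets the counter.
        """
        if heartbeat_ok:
            self.missed = 0
        else:
            self.missed += 1
        return self.missed < self.max_missed


monitor = HeartbeatMonitor()
print(monitor.detection_window_s())
```

If the real mechanism behaves this way, lengthening the interval or raising the limit only delays the error when heartbeats stop arriving altogether, which would be consistent with the settings change not helping.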

All replies

  • We've been working with Microsoft Support on this. It's still a little early to say for sure, but the cause may have been Symantec Endpoint Protection, which was installed on the head/broker node and the two compute nodes. Disabling it made no difference, but our network team finally uninstalled it on Monday, and we haven't had the error since (two days and counting).
    Wednesday, April 10, 2013 4:53 PM
  • It appears this was a bug on Microsoft's end. Installing this hotfix on all servers in the HPC cluster, and on the machine that hosts the client application (which has the HPC client components installed), seems to have eliminated the error.

    https://www.microsoft.com/en-us/download/details.aspx?id=38420

    The relevant fix listed there:

    BrokerResponseEnumerator.MoveNext() method and BrokerResponse.Result property return error message “Heartbeat lost for broker node” when clients using SOA session API attempt to retrieve more than 632 responses.

    Monday, May 20, 2013 2:49 PM