error when a large number of jobs are being submitted?

    Question

  • Could not connect to net.tcp://servername:9087/BrokerLauncher. The connection attempt lasted for a time span of 00:00:20.9942045. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.13.70.166:9087.
     ---> System.ServiceModel.EndpointNotFoundException: Could not connect to net.tcp://servername:9087/BrokerLauncher. The connection attempt lasted for a time span of 00:00:20.9942045. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.13.70.166:9087.
     ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.13.70.166:9087
       at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult)
       at System.ServiceModel.Channels.SocketConnectionInitiator.ConnectAsyncResult.OnConnect(IAsyncResult result)
       --- End of inner exception stack trace ---
    Server stack trace:
       at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)
       at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.End(SendAsyncResult result)
       at System.ServiceModel.Channels.ServiceChannel.EndCall(String action, Object[] outs, IAsyncResult result)
       at System.ServiceModel.Channels.ServiceChannelProxy.InvokeEndService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
       at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)


    We get this error message on job submission when a large number of processes are submitting jobs (or when many jobs are already in the queue). Is there a limit to the number of jobs that can be queued or submitted at one time? This is HPC Pack 2012 R2. The head node is a Windows Server 2012 machine with 32 GB of memory (though memory does not seem to be maxed out). The errors seem to start once we have roughly 150 jobs queued. When the errors start, we observe a few different behaviors:

    1) jobs get cancelled immediately after submission (within 1 minute)

    2) jobs show completed at 100% but never actually finish/return (1 task per job)

    3) jobs fail
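
    One client-side way to check whether the failures are purely load-related is to space out submissions and retry on connection timeouts. The following is only a minimal Python sketch of that idea, not an HPC Pack API: `submit` is a hypothetical placeholder for whatever call actually submits the job, the retry counts and delays are assumed values, and a .NET client would catch `EndpointNotFoundException` rather than `OSError`. It does not address any server-side limit.

```python
import random
import time


def submit_with_backoff(submit, attempts=5, base_delay=1.0, max_delay=30.0,
                        sleep=time.sleep, jitter=random.random):
    """Call submit() until it succeeds or the attempts run out.

    `submit` is a hypothetical zero-argument callable standing in for the
    real job submission; it is an assumption, not part of HPC Pack.
    Only connection-level failures (OSError and its subclasses, such as
    TimeoutError and ConnectionError) are retried.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return submit()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Exponential backoff, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Scale by a factor in [0.5, 1.0) so that many submitting
            # processes do not all retry at the same instant.
            sleep(delay * (0.5 + jitter() / 2))
```

    If the errors disappear once submissions are staggered this way, that would suggest the broker launcher or head node is being overwhelmed by simultaneous connections rather than hitting a hard queue limit.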

    Tuesday, October 16, 2018 19:56