error when a large number of jobs are being submitted?

    Question

  • Could not connect to net.tcp://servername:9087/BrokerLauncher. The connection attempt lasted for a time span of 00:00:20.9942045. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.13.70.166:9087.
     ---> System.ServiceModel.EndpointNotFoundException: Could not connect to net.tcp://servername:9087/BrokerLauncher. The connection attempt lasted for a time span of 00:00:20.9942045. TCP error code 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.13.70.166:9087.
     ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.13.70.166:9087
       at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult)
       at System.ServiceModel.Channels.SocketConnectionInitiator.ConnectAsyncResult.OnConnect(IAsyncResult result)
       --- End of inner exception stack trace ---
    Server stack trace:
       at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)
       at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.End(SendAsyncResult result)
       at System.ServiceModel.Channels.ServiceChannel.EndCall(String action, Object[] outs, IAsyncResult result)
       at System.ServiceModel.Channels.ServiceChannelProxy.InvokeEndService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
       at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)


    We are getting the error message above on job submission when a large number of processes are submitting jobs (or when many jobs are already in the queue). Is there a limit to the number of jobs that can be queued or submitted at one time? This is HPC Pack 2012 R2. The head node is a Windows Server 2012 machine with 32 GB of memory (though the memory does not seem to be maxed out). The errors seem to start when we have ~150 or so jobs queued up. When the errors start, we observe a few different behaviors:

    1) jobs get cancelled immediately after submission (within 1 minute)

    2) jobs show completed at 100% but never actually finish/return (1 task per job)

    3) jobs fail

    Tuesday, October 16, 2018 19:56

All replies

  • It sounds like you're talking to one of my co-workers about this issue. Thanks.

    Wednesday, November 28, 2018 17:15
  • Hi Jason,

    There is no limit on the job queue length in the scheduler, and 32 GB of RAM should be enough for a cluster with 100 compute nodes. From the symptoms you describe, it is more likely that your SQL Server instance is under heavy stress, which causes SQL transaction timeouts. You could check the scheduler logs under %CCP_DATA%LogFiles\Scheduler\*.bin (the LogParser.exe tool can convert the .bin files to text files).
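
    To find which scheduler log files cover the time of the failures, something like the following could help. It is only a sketch in Python: the function name `scheduler_log_files` is made up for illustration, and it assumes the `CCP_DATA` environment variable is set on the head node and that the logs sit in `LogFiles\Scheduler` beneath it, as in the path above. It only lists the .bin files, newest first; converting them to text still needs the log parser tool.

```python
import os
from pathlib import Path
from typing import List, Optional


def scheduler_log_files(ccp_data: Optional[str] = None) -> List[Path]:
    """Return the scheduler *.bin log files, most recently modified first.

    `ccp_data` defaults to the CCP_DATA environment variable (assumed to be
    set on the head node). Returns an empty list if the folder has no logs.
    """
    root = ccp_data if ccp_data is not None else os.environ.get("CCP_DATA")
    if not root:
        raise RuntimeError("CCP_DATA is not set; pass the data directory explicitly")
    log_dir = Path(root) / "LogFiles" / "Scheduler"
    # Sort by modification time so the logs closest to the failure come first.
    return sorted(log_dir.glob("*.bin"), key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    if os.environ.get("CCP_DATA"):
        for path in scheduler_log_files()[:5]:
            print(path)
    else:
        print("CCP_DATA is not set on this machine")
```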

    Also, if you have a large number of single-task jobs, I would suggest:

    1. Use one job with multiple batch tasks (you get high throughput for tasks).

    2. If there are many tasks (say, hundreds) within a job, try a parametric sweep task. You can submit a job with millions of parametric sweep tasks within one second, and this greatly eases the pressure on SQL.

    3. If you have millions of parametric sweep tasks, try an HPC SOA job, which can give you a throughput of more than 10K tasks per second.
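
    As a rough illustration of suggestion 2, the Python sketch below contrasts the two submission patterns by building the command lines only (nothing is executed). It assumes the `job submit` command with its `/parametric:<start>-<end>[:<increment>]` option, where `*` in the task command line stands for the sweep index; please confirm the exact syntax with `job submit /?` on your cluster. `myapp.exe` and both function names are placeholders, not something from this thread.

```python
from typing import List


def per_job_commands(start: int, end: int, command: List[str]) -> List[List[str]]:
    """One `job submit` per index: the one-task-per-job pattern from the question."""
    return [
        ["job", "submit", *[arg.replace("*", str(i)) for arg in command]]
        for i in range(start, end + 1)
    ]


def sweep_command(start: int, end: int, command: List[str], increment: int = 1) -> List[str]:
    """A single `job submit` covering the whole index range as a parametric sweep.

    Assumption: the scheduler replaces `*` in the task command with the index.
    """
    if start > end or increment < 1:
        raise ValueError("need start <= end and increment >= 1")
    return ["job", "submit", f"/parametric:{start}-{end}:{increment}", *command]


if __name__ == "__main__":
    task = ["myapp.exe", "*"]  # hypothetical application taking the index as argument
    print(len(per_job_commands(1, 150, task)), "separate submissions")
    print(" ".join(sweep_command(1, 150, task)))
```

    On a machine with the HPC Pack client utilities installed, the list returned by `sweep_command` could be passed to `subprocess.run`, so the scheduler and its database see one submission instead of 150.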

    Please also tell us your SQL configuration.


    Qiufang Shi

    Thursday, November 29, 2018 03:54
  • I didn't get that?

    Qiufang Shi

    Thursday, November 29, 2018 03:54