Submitting more than 1 task per job using the SOA architecture?

  • Question

  • We are trying to submit jobs using the SOA architecture and are unable to figure out how to submit more than 1 task within a job at a time. We use the BrokerClient.SendRequest method with a job template marked as exclusive, which creates 1 job with 1 task. We would like to create 1 job with multiple tasks where each task is exclusive. Is this possible at all? We can make it work with 1 task per job, but as we scale up the number of tasks we submit, we start getting errors from the server, presumably because of too many connections at a time.

    This is with 2012R2.

    Thanks!

    -Jason

    Thursday, August 27, 2015 10:23 PM

Answers

  • If all the requests/calls are from the same client to the same service, it is much better to create one SOA session carrying all the requests rather than multiple SOA sessions, each with only one request. As you may have noticed, creating a SOA session is far more costly in time and resources than creating a SOA request/call. Each SOA session occupies one hpcbrokerworker.exe process on the broker node to dispatch the requests to the compute nodes and receive the responses from them.

    If there is a real need to create many SOA sessions at a time, it would be wise to add multiple broker nodes to the cluster to balance the load. The number of broker nodes needed depends on the number of concurrent SOA sessions and their requests, the number of compute nodes and cores in the cluster, and the hardware spec of the broker machines.
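    As a rough sketch of this single-session pattern (the head node name "MyHeadNode", the service name "MyService", and the IMyService/MyRequest/MyResponse types stand in for your own registered service and its generated client proxy types, as in the SDK's EchoService sample):

    ```csharp
    using System;
    using Microsoft.Hpc.Scheduler.Session;

    class Program
    {
        static void Main()
        {
            // One durable session = one SOA job and one broker worker process,
            // no matter how many requests it carries.
            SessionStartInfo info = new SessionStartInfo("MyHeadNode", "MyService");

            using (DurableSession session = DurableSession.CreateSession(info))
            using (BrokerClient<IMyService> client =
                new BrokerClient<IMyService>(session))
            {
                // Send all requests through the single client; each request
                // becomes one SOA call dispatched to the compute nodes.
                for (int i = 0; i < 500; i++)
                {
                    client.SendRequest<MyRequest>(new MyRequest(i), i);
                }
                client.EndRequests();

                // Collect responses as they complete; the user data tag
                // passed to SendRequest identifies each one.
                foreach (BrokerResponse<MyResponse> response
                    in client.GetResponses<MyResponse>())
                {
                    int tag = response.GetUserData<int>();
                    Console.WriteLine("{0}: {1}", tag, response.Result);
                }

                session.Close(true); // purge the session when done
            }
        }
    }
    ```

    With this shape, scaling from 50 to 500 pieces of work only adds requests, not sessions, so the broker node's process count stays flat.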

    BR,

    Yutong Sun

    • Marked as answer by Jason Lee 1234 Wednesday, September 2, 2015 4:54 PM
    Wednesday, September 2, 2015 3:53 AM
    Moderator

All replies

  • What kind of tasks are you trying to run through BrokerClient.SendRequest?

    The advantage of HPC SOA is the low latency and high throughput of your tasks (or requests). In SOA, you should think in terms of requests rather than tasks. When your SOA job asks for 20 cores across 5 nodes, the job dynamically expands to 20 tasks on the allocated nodes, which then wait for requests coming from the client (through the broker).

    Could you share more details on the error you met? Who reported the error? What is the exact error message? I would also recommend downloading the SDK, which includes SOA samples you can start from.


    Qiufang Shi

    Friday, August 28, 2015 2:08 AM
  • Thanks for the quick response. This is with starting 100 concurrent jobs. Within Job Manager, the jobs that threw these errors show up as canceled with the message "Canceled by user. Message: The client tries to terminate the session".

    In terms of tasks, we're trying to call a function within our SOA app which takes minimal parameters (something like 2 ints and 2 strings), and the task itself runs for about 5-10 minutes. Ideally we want to submit 500 tasks and have them process as resources become available until all are finished. In my testing, we run into issues once we go above submitting 50-60 jobs at once, with the errors below. Logging onto the head node, the machine seems to get into a hung state as if it's overwhelmed. The server specs are 4 cores/16 GB/Windows Server 2012. We could scale up the hardware, but given it's having issues with > 50 tasks and we're looking to submit over 500, I'm guessing we're doing something incorrect here in how we're submitting these.

    On the server, I see a large number of HPCBrokerWorker.exe processes running (guessing 1 for each job), and while the jobs were submitting, the CPU was stuck at 100% utilization. We have both the broker and head node running on the same machine (but it is not set up as a compute node).

    ==

    I see a few different errors; I'll list them all here:

    1)


    Unknown error occured: Exception of type 'System.OutOfMemoryException' was thrown..

       at Microsoft.Hpc.Scheduler.Session.BrokerClient`1.Flush(Int32 timeoutMilliseconds, Boolean endOfMessage)
       at Microsoft.Hpc.Scheduler.Session.BrokerClient`1.EndRequests(Int32 timeoutMilliseconds)
       at Microsoft.Hpc.Scheduler.Session.BrokerClient`1.EndRequests()

    .....(our stack trace)

    2) Additional information: Cannot get broker worker process within 1 minute. Broker node is busy. Retry later.

    Microsoft.Hpc.Scheduler.Session

    (doesn't seem to have a stack trace for me here)

    3) 

    An exception of type 'Microsoft.Hpc.Scheduler.Session.SessionException' occurred in Microsoft.Hpc.Scheduler.Session.dll but was not handled in user code

    Additional information: Unknown error occured: The underlying connection was closed: A connection that was expected to be kept alive was closed by the server..

       at Microsoft.Hpc.Scheduler.Session.Internal.V3BrokerFactory.CreateBroker(SessionStartInfo startInfo, Int32 sessionId, DateTime targetTimeout, String[] eprs)
       at Microsoft.Hpc.Scheduler.Session.Internal.OnPremiseSessionFactory.CreateSession(SessionStartInfo startInfo, Boolean durable, Int32 timeoutMilliseconds)
       at Microsoft.Hpc.Scheduler.Session.DurableSession.CreateSession(SessionStartInfoBase startInfo)
    ... our stack trace

    4) 

    An exception of type 'Microsoft.Hpc.Scheduler.Session.SessionException' occurred in Microsoft.Hpc.Scheduler.Session.dll but was not handled in user code

    Additional information: Unknown error occured: The server did not provide a meaningful reply; this might be caused by a contract mismatch, a premature session shutdown or an internal server error..

    Microsoft.Hpc.Scheduler.Session

    (no trace)


    Friday, August 28, 2015 12:08 PM
  • Within the activity log for 1 of the canceled jobs I see:

    9:07:56 AM Created by xxx\yyy

    9:07:56 AM Submitted

    9:08:13 AM Cancel Submitted by xxx\yyy (same user as the one submitting it)

    9:08:13 AM Job Canceled

    I compared what we're doing to the EchoService example, and the big difference I'm seeing is that we are creating multiple DurableSessions (one per job we're trying to submit), each with a BrokerClient, and then submitting 1 request through each, whereas the correct way appears to be to create 1 DurableSession/BrokerClient and use that to submit all of the requests. Does it seem right that the system starts bogging down/throwing errors if we end up creating a large number of DurableSessions/BrokerClients? Unfortunately it's not straightforward for us to switch our code, given some auto code generation we use to create the SOA service at runtime, so I'm wondering if there are other options, or whether we need to rethink how we're doing this if we're going to submit a large number of tasks at a time.


    Friday, August 28, 2015 1:14 PM
  • Could you share your client code with us?

    For question: Does it seem correct that the system starts bogging down/throwing errors if we end up creating a large number of DurableSession/BrokerClients?

    50-60 jobs is not a large number at all for the scheduler system.

    For question: we have to create the SOA service at runtime, so are there other options, or do we need to re-think how we're doing this if we're going to submit a large number of tasks at a time?

    We need to understand your scenario to give a more appropriate suggestion. For example, you could have one generic SOA service instead of creating a different one at runtime. You could define the SOA service as Output ServiceCall(Input), where Input and Output are serializable objects defined at runtime, and ServiceCall translates to the right call based on the Input it receives.
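    A minimal sketch of such a generic contract (WCF-style; the Operation/Payload members are illustrative, not prescribed):

    ```csharp
    using System.Runtime.Serialization;
    using System.ServiceModel;

    [DataContract]
    public class Input
    {
        // Which underlying call to dispatch to.
        [DataMember] public string Operation { get; set; }
        // Serialized arguments, whose shape can vary at runtime.
        [DataMember] public string Payload { get; set; }
    }

    [DataContract]
    public class Output
    {
        [DataMember] public string Result { get; set; }
    }

    [ServiceContract]
    public interface IGenericService
    {
        // One fixed operation; the service implementation routes to the
        // right internal call based on the Input it receives.
        [OperationContract]
        Output ServiceCall(Input input);
    }
    ```

    Because the contract never changes, the client-side proxy can be generated once instead of regenerated at runtime.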

    It would also be good to understand your design goal: how many requests/tasks do you need to submit to the cluster? How long does each request/task take when running on a compute node? What throughput of requests/tasks do you need? Submitting a Batch or Parametric Sweep job is much easier for the client.


    Qiufang Shi

    Monday, August 31, 2015 1:11 AM
  • I'll have to check on whether we'll be able to send the auto-generated code that actually makes the connections to HPC, but the structure is that for every task we create a new DurableSession/BrokerClient and submit the job separately, so we end up with as many DurableSessions/BrokerClients as tasks (~1000 in this case, though the errors start appearing somewhere between 50 and 100 of them).


    RE: design goals, at this time the maximum number of tasks is what we're talking about here (up to 1000). Each job takes 5-10 minutes, and currently our cluster is made up of 50 compute nodes (maybe scaling up to 100 machines). The actual function being called is the same across all of these calls, with the input parameter list changing slightly to denote running the process for a different set of data.

    From the sounds of it, though, you're saying 50-60 concurrent job submissions shouldn't be an issue at all. Could it be related to the time frame in which we are submitting the jobs? We are trying to submit everything at once, with no delays between job submissions.

    Regardless, we are going to see if we can modify our code to submit all the requests under 1 session, as that seems to be the correct way of doing this.


    Monday, August 31, 2015 11:59 AM
  • Job submission rate should be around 10 jobs/second, depending on your SQL configuration and job details (jobs with complex dependencies take more time). If you can wrap your submission in one job (all tasks in one job), it will be much faster. At this scale, you are okay using a batch/parametric sweep job, and our task dispatch rate can be more than 200 tasks per second (while using SOA, you can easily reach more than 2,000 requests per second, depending on your configuration).
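    For comparison, a parametric sweep version of this workload is one job with one sweep task that the scheduler expands into 500 instances (sketch using the HPC Pack scheduler API; "MyHeadNode" and "MyApp.exe" are placeholder names for your cluster and worker executable):

    ```csharp
    using Microsoft.Hpc.Scheduler;
    using Microsoft.Hpc.Scheduler.Properties;

    class SweepSubmit
    {
        static void Main()
        {
            IScheduler scheduler = new Scheduler();
            scheduler.Connect("MyHeadNode");

            ISchedulerJob job = scheduler.CreateJob();

            // One parametric sweep task expands to 500 task instances,
            // all inside a single job.
            ISchedulerTask task = job.CreateTask();
            task.Type = TaskType.ParametricSweep;
            task.StartValue = 1;
            task.EndValue = 500;
            task.IncrementValue = 1;
            // The scheduler substitutes the sweep index for the asterisk,
            // so each instance processes a different data set.
            task.CommandLine = @"MyApp.exe *";
            job.AddTask(task);

            // Null credentials use the cached/current user's credentials.
            scheduler.SubmitJob(job, null, null);
        }
    }
    ```

    This trades SOA's request/response plumbing for plain command-line tasks, which may fit if each call only needs a few scalar parameters.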

    Qiufang Shi

    Tuesday, September 1, 2015 1:37 AM