Azure Nodes not using all their Cores but On-Premise is (SOA)

  • Question

  • Hi,

    I have a cluster that is partly on-premise and partly in the cloud with Azure workers, but I am seeing a problem with how the Azure workers are performing.

    Let's say for simplicity we have 1 on-premise node with 8 cores and 1 Azure node with 8 cores, and we submit a SOA job with 16 requests. The behaviour we are seeing is that all 16 requests are sent and start calculating. On the on-premise node, 8 processes are started, which leads to 100% CPU utilization.

    However, on the Azure node only 2 processes are started and the rest are somehow queued, even though Job Manager reports them as being calculated. Once those are done, two more processes are started, and so on until all 8 are completed on the Azure node. This leads to a slower total execution time, as the on-premise node is finished and idling before the Azure node has completed its load.

    Has anyone experienced anything like this?

    Thursday, September 10, 2015 8:12 AM

Answers

  • Using ConcurrencyMode=ConcurrencyMode.Multiple in the service code is a correct solution for the scenario where Azure nodes and SessionUnitType.Node are used in the SOA session.

    Using SessionUnitType.Node means there is only one service host (HpcServiceHost.exe) running on the node. For on-premise nodes, the broker creates multiple client sessions to the service host according to the value of maxConcurrentCalls in the service registration file. If maxConcurrentCalls is set to 0 (the default value), the number of client sessions can be the number of cores on the compute node. In this situation, even though the service code uses the default ConcurrencyMode of Single, there are multiple service instances (one per client session) to process the requests in parallel.

    For Azure nodes, the broker node actually communicates with the proxy nodes (normally there are two proxy nodes per deployment), and the proxy nodes route the requests to the worker nodes. Each proxy node has one service client connecting to the service host on the worker node, so we need to configure ConcurrencyMode as Multiple to allow more than 2 requests to be processed at the same time.

    Note that on the service host side, maxConcurrentCalls (if 0, then the number of cores) is also applied in the ServiceThrottlingBehavior.
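
    In code, the fix is just the standard WCF ServiceBehavior attribute on the service class. A minimal sketch, using the ICalculator contract from the sample service in this thread (the method body is illustrative only):

    ```csharp
    using System.ServiceModel;

    // ConcurrencyMode.Multiple lets a single client channel dispatch
    // multiple calls in parallel. This matters on Azure nodes, where each
    // of the two proxy nodes holds only one channel to the service host,
    // so the default ConcurrencyMode.Single caps concurrency at 2.
    [ServiceBehavior(ConcurrencyMode = ConcurrencyMode.Multiple)]
    public class CalculatorService : ICalculator
    {
        public double Add(double a, double b)
        {
            // With Multiple, calls may run concurrently on this instance,
            // so the method body must avoid shared mutable state.
            return a + b;
        }
    }
    ```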

    BR,

    Yutong Sun

    Monday, September 21, 2015 2:59 PM

All replies

  • Hi Erik,

    Please check the allocationAdjustInterval setting in the SOA service registration file and see if its value is set to a small value, e.g. 3,000 (ms). A small value means the broker node shrinks the idle cores/service hosts aggressively, sometimes even before the service hosts have received any requests to process, especially for the Azure nodes due to network latency. You may want to set this value to a large value, e.g. 30,000 (ms), for the hybrid cluster. Besides, if you want to dispatch 16 requests to 16 cores/service hosts, one request each, make sure another setting, serviceRequestPrefetchCount, in the SOA service registration file is set to 0, and that the processing time for the requests is long enough, so that no 2 requests are dispatched to one service host.

    For details about the settings in SOA service registration file, please check this TechNet article.
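
    For reference, both settings are attributes in the broker section of the service registration (.config) file. A sketch of the relevant fragment (the exact element placement is my reading of the registration schema, so verify against your own file):

    ```xml
    <microsoft.Hpc.Broker>
      <loadBalancing allocationAdjustInterval="30000"
                     serviceRequestPrefetchCount="0" />
    </microsoft.Hpc.Broker>
    ```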

    BR,

    Yutong Sun

    Friday, September 11, 2015 10:10 AM
  • Hi Yutong,

    Thank you for your reply!

    I checked my configuration values and allocationAdjustInterval is set to 30,000 already and serviceRequestPrefetchCount is set to 0. 

    To verify that my SOA service wasn't to blame, I tried the sample SOA service code from http://blogs.technet.com/b/windowshpc/archive/2013/03/14/soa-based-application-tutorial-i-write-your-first-soa-service-and-client-hpc-pack-.aspx and modified it to use SessionUnitType=Node, as well as having the worker service on the nodes do a dummy calculation for 30 seconds before returning. The reason for setting SessionUnitType=Node is that we are using shared file input, which doesn't work with SessionUnitType=Core.

    The behavior I saw was that all 8 requests to my Azure node were processed 2 at a time, yet all were sent immediately (all 8 set to calculating), but on the on-premise worker all 8 were calculating at the same time.

    Best regards,

    Erik


    Friday, September 11, 2015 10:19 AM
  • Another question I have: will my Azure nodes use the service configuration from ServiceRegistration, or do they use the configuration from the service package uploaded to my Azure Storage account?
    Friday, September 11, 2015 10:30 AM
  • I could not repro this behavior with the built-in CcpEchoSvc; all 8 requests were dispatched in order to the 8 service hosts, though there is some latency due to the net.tcp connection from the on-premise client to the Azure proxy nodes. Could you enable and check the message level traces for the SOA service, or print the service side log, to see when the SOA requests are received at the service side on the Azure node? There should be no broker side configuration limiting Azure nodes to 2 requests at a time.

    C:\tests\soaecho>EchoSvcClient.exe HPC-4617537 8 30 1
    Creating a session for EchoService...done session id = 122580
    Sending 8 requests...

    [5:47:08 AM] : 0
    [5:47:08 AM] : 1
    [5:47:08 AM] : 2
    [5:47:08 AM] : 3
    [5:47:08 AM] : 4
    [5:47:08 AM] : 5
    [5:47:08 AM] : 6
    [5:47:08 AM] : 7
    Sent 8 requests in 0.3069725 seconds.
    Retrieving responses...
    [5:47:45 AM] : response received start: 5:47:09 AM - end: 5:47:39 AM
    [5:47:45 AM] : response received start: 5:47:11 AM - end: 5:47:41 AM
    [5:47:45 AM] : response received start: 5:47:11 AM - end: 5:47:41 AM
    [5:47:45 AM] : response received start: 5:47:11 AM - end: 5:47:41 AM
    [5:47:45 AM] : response received start: 5:47:12 AM - end: 5:47:42 AM
    [5:47:45 AM] : response received start: 5:47:13 AM - end: 5:47:43 AM
    [5:47:45 AM] : response received start: 5:47:13 AM - end: 5:47:43 AM
    [5:47:45 AM] : response received start: 5:47:15 AM - end: 5:47:45 AM
    Received 8 responses in 37.3903631 seconds.

    BR,

    Yutong Sun

    Monday, September 14, 2015 1:07 PM
  • The service configuration uploaded in the service package is used for the service hosts; however, some of the broker side configurations, e.g. allocationAdjustInterval and serviceRequestPrefetchCount, are retrieved only from the central service registration folder by the broker.

    BR,

    Yutong Sun

    Monday, September 14, 2015 1:11 PM
  • I'm not sure what exactly you mean, but I did find this repeatedly in my hpcworker.bin log files:

    09/15/2015 08:45:01.455 v AzureSoaDiagMon 3484 3656 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventRecyclableFileUploader].FindNextFile: Find the next file, index=0  
    09/15/2015 08:45:01.455 v AzureSoaDiagMon 3484 3656 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventRecyclableFileUploader].UploadCurrentFile: Try to open the file C:\SoaTraceRepository\session8558\DumpOnly.0.dat  
    09/15/2015 08:45:01.455 v AzureSoaDiagMon 3484 3656 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventRecyclableFileUploader].UploadCurrentFile: seek position=0  
    09/15/2015 08:45:01.455 v AzureSoaDiagMon 3484 3656 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventRecyclableFileUploader].UploadCurrentFile: Upload block to the Azure blob, size=3722  
    09/15/2015 08:45:01.455 i AzureSoaDiagMon 3484 3656 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventBlobTransferer].GetBlockList: Enter GetBlockList method.  
    09/15/2015 08:45:01.470 v AzureSoaDiagMon 3484 1896 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventRecyclableFileUploader].Upload: Try to find next file for uploading.  
    09/15/2015 08:45:01.470 e AzureSoaDiagMon 3484 3656 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventBlobTransferer].WriteTraces: Error occurs, System.NullReferenceException: Object reference not set to an instance of an object...   at Microsoft.Hpc.Scheduler.Session.Internal.Diagnostics.TraceEventBlobTransferer.GetBlockList(CloudBlockBlob blob)..   at Microsoft.Hpc.Scheduler.Session.Internal.Diagnostics.TraceEventBlobTransferer.WriteTraces(MemoryStream buffer)  
    09/15/2015 08:45:01.486 e AzureSoaDiagMon 3484 3656 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventRecyclableFileUploader].Upload: Exception occurs, reader index=0, writer index=0, retry=0, System.NullReferenceException: Object reference not set to an instance of an object...   at Microsoft.Hpc.Scheduler.Session.Internal.Diagnostics.TraceEventBlobTransferer.GetBlockList(CloudBlockBlob blob)..   at Microsoft.Hpc.Scheduler.Session.Internal.Diagnostics.TraceEventBlobTransferer.WriteTraces(MemoryStream buffer)..   at Microsoft.Hpc.Scheduler.Session.Internal.Diagnostics.TraceEventRecyclableFileUploader.UploadCurrentFile()..   at Microsoft.Hpc.Scheduler.Session.Internal.Diagnostics.TraceEventRecyclableFileUploader.Upload()  
    09/15/2015 08:45:01.486 v AzureSoaDiagMon 3484 1896 [AzureSoaDiagMon] NodeName=AZURECN-0087 PhysicalIP=10.146.124.46 ..[TraceEventRecyclableFileUploader].FindNextFile: Find the next file, index=0  

    After just a few minutes those log files are huge.

    Tuesday, September 15, 2015 9:03 AM
  • When enabling message level tracing for SOA running on Azure nodes, you may need to configure the Azure storage connection string before deploying the Azure nodes. Please see the note below, from here:

    If you will be running the SOA service on Windows Azure nodes and you want to collect trace logs from those nodes, you must also specify Windows Azure storage for the trace logs. To do this, in Configuration, in the Deployment To-do List, under Optional Deployment Tasks, click Set Windows Azure Storage Connection String. For configuration details, see Configuring Connection Strings.

    You may check the 'Microsoft.Hpc.Azure.StorageConnectionString' setting in the service deployment configurations. If it is not set, the AzureSoaDiagMon on Azure nodes would fail to upload trace files because it doesn't have the storage account info.

    BR,

    Yutong Sun

    Wednesday, September 16, 2015 5:51 AM
  • Hi,

    I have set a connection string in the way you describe:

    PS C:\Program Files\Microsoft HPC Pack 2012\Bin> Get-HpcClusterProperty -Parameter -Name AzureStorageConnectionString

    Name                                     Value
    ----                                     -----
    AzureStorageConnectionString             DefaultEndpointsProtocol=https;Accoun...


    Wednesday, September 16, 2015 8:12 AM
  • Could you double check that the 'Microsoft.Hpc.Azure.StorageConnectionString' setting is correctly set in the service deployment configurations via the Azure management portal? If it is not, you may need to stop all the Azure nodes and start them again for a new deployment.

    Besides the message level tracing, you may also add traces in the service code, e.g. write console output, to see when the requests are received at the service hosts.

    BR,

    Yutong Sun

    Friday, September 18, 2015 12:54 PM
  • I added some console logging to my sample SOA service:

    public class CalculatorService : ICalculator
    {
        public double Add(double a, double b)
        {
            var requestId = Guid.NewGuid();
            Console.WriteLine("Received request ({0}) {1:s}", requestId, DateTime.UtcNow);
            // Busy-wait for 10 seconds to simulate CPU-bound work
            var start = DateTime.UtcNow;
            while (DateTime.UtcNow - start < TimeSpan.FromSeconds(10))
            {
                var c = 1 + 1;
            }
            Console.WriteLine("Finished request ({0}) {1:s}", requestId, DateTime.UtcNow);
            return a + b;
        }
    }

    The result when run on an A10 instance with 8 cores:
    Received request (ffa808b8-8569-4047-aa55-494affda96f2) 2015-09-18T14:23:26
    Received request (21b2b7fe-2d78-4094-a532-d76835157d78) 2015-09-18T14:23:26
    Finished request (ffa808b8-8569-4047-aa55-494affda96f2) 2015-09-18T14:23:36
    Finished request (21b2b7fe-2d78-4094-a532-d76835157d78) 2015-09-18T14:23:36
    Received request (8bb6e783-2ad0-4b4c-b555-2b78208c498e) 2015-09-18T14:23:36
    Received request (8e692fcc-02dd-4cd5-a59d-a743947dc87d) 2015-09-18T14:23:36
    Finished request (8e692fcc-02dd-4cd5-a59d-a743947dc87d) 2015-09-18T14:23:46
    Finished request (8bb6e783-2ad0-4b4c-b555-2b78208c498e) 2015-09-18T14:23:46
    Received request (50898aa4-22b8-43b7-9c9d-0ddf9eb3f45a) 2015-09-18T14:23:46
    Received request (2c833d70-1a44-42f9-8f2e-39f0f58a6c30) 2015-09-18T14:23:46
    Finished request (2c833d70-1a44-42f9-8f2e-39f0f58a6c30) 2015-09-18T14:23:56
    Finished request (50898aa4-22b8-43b7-9c9d-0ddf9eb3f45a) 2015-09-18T14:23:56
    Received request (fb3bc06c-9ecb-4431-b933-770c679878ba) 2015-09-18T14:23:56
    Received request (cec89725-a4ca-47a1-8b0e-38dbec76be8f) 2015-09-18T14:23:56
    Finished request (fb3bc06c-9ecb-4431-b933-770c679878ba) 2015-09-18T14:24:06
    Finished request (cec89725-a4ca-47a1-8b0e-38dbec76be8f) 2015-09-18T14:24:06

    When I run it on my 8-core on-premise machine I get this result:

    Received request (3bc63fa2-9934-4253-80c8-13d646902c2a) 2015-09-18T14:26:49
    Received request (1d4edc1d-8cf7-4700-ad64-c48dfe73d15f) 2015-09-18T14:26:49
    Received request (5a50db43-12b3-45ca-bcaf-557bba4ea481) 2015-09-18T14:26:49
    Received request (14c74994-4bf0-46d1-942a-237e16b73d67) 2015-09-18T14:26:49
    Received request (61271062-28ac-44db-8c18-f23e4df67b7e) 2015-09-18T14:26:49
    Received request (498fcda4-f07c-4af5-a35f-53b4ff56a32b) 2015-09-18T14:26:49
    Received request (51c114aa-da43-4260-8f45-8617480ca815) 2015-09-18T14:26:49
    Received request (34ba40dd-e4b1-4fa3-a779-fd673d17638c) 2015-09-18T14:26:49
    Finished request (61271062-28ac-44db-8c18-f23e4df67b7e) 2015-09-18T14:26:59
    Finished request (1d4edc1d-8cf7-4700-ad64-c48dfe73d15f) 2015-09-18T14:26:59
    Finished request (5a50db43-12b3-45ca-bcaf-557bba4ea481) 2015-09-18T14:26:59
    Finished request (14c74994-4bf0-46d1-942a-237e16b73d67) 2015-09-18T14:26:59
    Finished request (3bc63fa2-9934-4253-80c8-13d646902c2a) 2015-09-18T14:26:59
    Finished request (34ba40dd-e4b1-4fa3-a779-fd673d17638c) 2015-09-18T14:26:59
    Finished request (51c114aa-da43-4260-8f45-8617480ca815) 2015-09-18T14:26:59
    Finished request (498fcda4-f07c-4af5-a35f-53b4ff56a32b) 2015-09-18T14:26:59

    Notice how on the Azure node it starts two requests at a time, but on the on-premise node it starts them all. Looking at the job details in HPC Cluster Manager, both jobs say that they are calculating all 8 requests immediately.

    My SOA client looks like this:

    class SoaClient
    {
        static void Main(string[] args)
        {
            // Change the head node name here
            var info = new SessionStartInfo("<nodename>", "CalculatorService");
            info.NodeGroupList.Add("<nodegroup>");
            info.SessionResourceUnitType = SessionUnitType.Node;

            // Create an interactive session
            using (var session = Session.CreateSession(info))
            {
                Console.WriteLine("Session {0} has been created", session.Id);

                // Create a broker client
                using (var client = new BrokerClient<ICalculator>(session))
                {
                    // Send requests
                    AddRequest request = new AddRequest(1, 2);
                    for (var i = 0; i < 8; i++)
                        client.SendRequest<AddRequest>(request);
                    client.EndRequests();

                    // Get responses
                    foreach (var response in client.GetResponses<AddResponse>())
                    {
                        double result = response.Result.AddResult;
                        Console.WriteLine("Add 1 and 2, and we get {0}", result);
                    }

                    // This can be omitted if a BrokerClient object
                    // is created in a "using" clause.
                    client.Close();
                }

                // This should be explicitly invoked
                session.Close();
            }

            Console.WriteLine("Done invoking SOA service");

            Console.WriteLine("Press any key to exit");
            Console.ReadKey();
        }
    }
    My service registration for the demo service looks like this:
    <?xml version="1.0" encoding="utf-8" ?>
    <configuration>
      <configSections>
        <sectionGroup name="microsoft.Hpc.Session.ServiceRegistration"
                      type="Microsoft.Hpc.Scheduler.Session.Configuration.ServiceRegistration, Microsoft.Hpc.Scheduler.Session, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35">
          <section name="service"
                   type="Microsoft.Hpc.Scheduler.Session.Configuration.ServiceConfiguration, Microsoft.Hpc.Scheduler.Session, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                   allowDefinition="Everywhere"
                   allowExeDefinition="MachineToApplication" />
        </sectionGroup>
      </configSections>
      <microsoft.Hpc.Session.ServiceRegistration>
        <!--Change assembly path below-->
        <service assembly="\\<headnode>\CalculatorService\CalculatorService.dll">
        </service>
      </microsoft.Hpc.Session.ServiceRegistration>
    </configuration>


    As for my connection string, looking in the Azure Management portal I see it looks like this:
    Microsoft.Hpc.Azure.StorageConnectionString: DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>

    With values that match what i get from the storage account keys management.

    Update: I tested setting

    [ServiceBehavior(ConcurrencyMode=ConcurrencyMode.Multiple)]
    for my SOA service, and that seems to have solved the problem. However, I am still unsure what exactly the difference is between the Azure nodes and on-premise nodes, since this was never needed for the on-premise nodes.

    Best regards



    Friday, September 18, 2015 2:34 PM