SOA jobs failing

  • Question

  • We are attempting to deploy a new application in our HPC environment. We are running HPC R2 SP1 with clustered head nodes, and the cluster is deployed in topology 1. When running a certain type of SOA job we experience a failure. The job seems to fail once each available core has executed between 7 and 9 tasks. For example, if there are 200 tasks in the job and we run them on 16 cores, the job fails at about 16 x 8 = 128 tasks. It's never exact; in this case it could fail anywhere between 115 and 135 tasks. The job then restarts from where it failed and runs to completion because there are only about 72 tasks left. If the total were 300 tasks, the job would fail at about 128, again at about 256, and the third attempt would take it to completion. If the cores were increased to 32 with 200 tasks, the same job completes on the first attempt because 32 x 8 = 256 tasks. The job runs fine in the vendor's environment and on the desktop version of the software. The vendor's environment runs on VMs and has only a single head node. Once we discovered this problem we limited the tests to 200 tasks and one compute node (16 hyper-threaded cores). We have tried several things, including redeploying the broker node software and re-adding the broker node to the cluster. The SOA diagnostic jobs complete without problems.
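
To make the pattern concrete, here is a minimal sketch of the arithmetic described above. It assumes a fixed budget of about 8 tasks per core before a failure, which is only the midpoint of the observed 7 to 9 range; the function name and the `tasks_per_core` parameter are illustrative and not part of any HPC API.

```python
import math


def estimate_attempts(total_tasks, cores, tasks_per_core=8):
    """Estimate how many tasks run per attempt and how many attempts are needed.

    tasks_per_core=8 is an assumption taken from the observed 7-9 range;
    the real failure point varies (e.g. 115-135 tasks on 16 cores).
    """
    tasks_per_attempt = cores * tasks_per_core
    attempts = math.ceil(total_tasks / tasks_per_attempt)
    return tasks_per_attempt, attempts


# The three scenarios from the description above.
for total, cores in [(200, 16), (300, 16), (200, 32)]:
    per_attempt, attempts = estimate_attempts(total, cores)
    print(f"{total} tasks on {cores} cores: ~{per_attempt} tasks per attempt, {attempts} attempt(s)")
```

Under that assumption the sketch reproduces what we see: two attempts for 200 tasks on 16 cores, three for 300 tasks on 16 cores, and a single attempt for 200 tasks on 32 cores.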

    We enabled SOA tracing on the broker node, and just before the failure we see the following errors and warnings. The trace contains several warnings and errors, but I cannot definitively associate them with the failures. Below are some of the errors that occur close to the failure. If you need further information, please let me know. Any help or guidance with this issue is greatly appreciated.

    ----------------------------------------------------------------------------------------------------------------------------------------------------

    System

      - Provider

       [ Name]  Microsoft-HPC-Runtime
       [ Guid]  {8979efb0-97da-4729-8296-f118f3562a53}
     
       EventID 13
     
       Version 0
     
       Level 2
     
       Task 1
     
       Opcode 0
     
       Keywords 0x4000000000000001
     
      - TimeCreated

       [ SystemTime]  2012-10-03T20:57:35.425284800Z
     
       EventRecordID 5281
     
       Correlation
     
      - Execution

       [ ProcessID]  52812
       [ ThreadID]  63628
       [ ProcessorID]  0
       [ KernelTime]  0
       [ UserTime]  3
     
       Channel
     
       Computer: <Primary Head node> 
     
       Security

    - EventData

      SessionId 445
      String [Dispatcher] .ResponseReceived: Exception happens in client Service Client (Faulted) a657138b-3931-41f0-b5c3-f6b37c6edd51, net.tcp://private.c<computenode>:9105/445/14656/_defaultEndpoint, message id = urn:uuid:1f08b86e-b712-4df6-bf32-c131ece35358

    ------------------------------------------------------------------------------------------------------------------------------------------------------

     System

      - Provider

       [ Name]  Microsoft-HPC-Runtime
       [ Guid]  {8979efb0-97da-4729-8296-f118f3562a53}
     
       EventID 13
     
       Version 0
     
       Level 2
     
       Task 1
     
       Opcode 0
     
       Keywords 0x4000000000000001
     
      - TimeCreated

       [ SystemTime]  2012-10-03T20:57:35.069561200Z
     
       EventRecordID 5263
     
       Correlation
     
      - Execution

       [ ProcessID]  52812
       [ ThreadID]  62732
       [ ProcessorID]  8
       [ KernelTime]  0
       [ UserTime]  0
     
       Channel
     
       Computer <primary head node> 
     
       Security 
     

    - EventData

      SessionId 445
      String [DispatcherManager] Service instance failed! Task id = 14662, node name = <Compute node>

    -----------------------------------------------------------------------------------------------------------------------------------------------------

    - System

      - Provider

       [ Name]  Microsoft-HPC-Runtime
       [ Guid]  {8979efb0-97da-4729-8296-f118f3562a53}
     
       EventID 3019
     
       Version 0
     
       Level 3
     
       Task 7
     
       Opcode 240
     
       Keywords 0x2000000000000000
     
      - TimeCreated

       [ SystemTime]  2012-10-03T20:57:35.069449000Z
     
       EventRecordID 5255
     
       Correlation
     
      - Execution

       [ ProcessID]  52812
       [ ThreadID]  62736
       [ ProcessorID]  8
       [ KernelTime]  1
       [ UserTime]  4
     
       Channel
     
       Computer <Primary head node>

     
       Security 
     

    - EventData

      SessionId 445
      TaskId 14662
      MessageId {ADFC85C8-9127-41DC-A47D-DC5E1EA99BAA}
      Exception System.ServiceModel.EndpointNotFoundException: Could not connect to net.tcp://private.<ip of compute node>:9111/445/14662/_defaultEndpoint. The connection attempt lasted for a time span of 00:00:01. TCP error code 10061: No connection could be made because the target machine actively refused it <ip of compute node>:9111. ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 192.168.232.57:9111
         at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult)
         at System.ServiceModel.Channels.SocketConnectionInitiator.ConnectAsyncResult.OnConnect(IAsyncResult result)
         --- End of inner exception stack trace ---
      Server stack trace:
         at System.ServiceModel.AsyncResult.End[TAsyncResult](IAsyncResult result)
         at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.End(SendAsyncResult result)
         at System.ServiceModel.Channels.ServiceChannel.EndCall(String action, Object[] outs, IAsyncResult result)
         at System.ServiceModel.Channels.ServiceChannelProxy.InvokeEndService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
         at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)
      Exception rethrown at [0]:
         at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
         at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
         at Microsoft.Hpc.ServiceBroker.BackEnd.IService.EndProcessMessage(IAsyncResult ar)
         at Microsoft.Hpc.ServiceBroker.BackEnd.ServiceClient.EndProcessMessage(IAsyncResult ar)
         at Microsoft.Hpc.ServiceBroker.BackEnd.Dispatcher.ResponseReceived(IAsyncResult ar)
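
To help line up the events above by task, here is a small sketch that splits a service endpoint URI into host, port, session ID and task ID. The path layout `/<session id>/<task id>/_defaultEndpoint` is an assumption inferred from the SessionId and TaskId fields in these events, not a documented format, and the host name `private.node01` is a hypothetical stand-in for the redacted compute node.

```python
from urllib.parse import urlsplit


def parse_endpoint(uri):
    """Split a net.tcp service endpoint URI into its apparent components.

    Assumes the path looks like /<session id>/<task id>/_defaultEndpoint,
    as the events in this trace suggest.
    """
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    return {
        "host": parts.hostname,
        "port": parts.port,
        "session_id": int(segments[0]),
        "task_id": int(segments[1]),
    }


# Hypothetical host name; the real one is redacted in the trace.
info = parse_endpoint("net.tcp://private.node01:9111/445/14662/_defaultEndpoint")
print(info)
```

Read this way, the first EventID 13 entry points at task 14656 on port 9105, while the "Service instance failed" entry and the EndpointNotFoundException both point at task 14662 on port 9111, so they may not all describe the same service instance.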

    Friday, October 5, 2012 5:12 AM