Is it possible to find out on the client if a SOA call failed and was re-submitted?

  • Question

  • I would like to find out, on the client or the service, if one of the calls during a SOA job fails - whether because the compute node unexpectedly went offline, an exception was thrown, or for any other reason.

    • What settings determine if and how many times a failed SOA call will be retried?
    • Can I get some sort of feedback if a SOA call fails, even if it gets resent automatically by the broker to a different compute node?

    Thanks.

    Monday, August 8, 2011 1:51 PM

Answers

  • To answer your two questions:

    1. For your use case, if there are no resources for the job to proceed, the job should go back to the Queued state. So I think you could implement an event handler that is triggered when the job state changes and do the work you want there. An example can be found at: http://msdn.microsoft.com/en-us/library/cc853482(v=VS.85).aspx
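
    A minimal sketch of that approach (assuming the Microsoft.Hpc.Scheduler API from the linked example; the head node name and the use of session.Id are placeholders for your own setup):

    ```csharp
    using System;
    using Microsoft.Hpc.Scheduler;
    using Microsoft.Hpc.Scheduler.Properties;

    // Connect to the scheduler and watch the SOA service job; react when it
    // falls back to Queued because no resources are available.
    IScheduler scheduler = new Scheduler();
    scheduler.Connect("HeadNodeName");            // placeholder head node name

    ISchedulerJob serviceJob = scheduler.OpenJob(session.Id);
    serviceJob.OnJobState += delegate(object sender, JobStateEventArg e)
    {
        if (e.NewState == JobState.Queued)
        {
            // No nodes left to run the service - do cleanup/alerting here
            // instead of waiting for the client's ReceiveTimeout.
            Console.WriteLine("Service job {0} is queued; no resources.", e.JobId);
        }
    };
    ```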

    2. SOA uses service tasks to host the HpcServiceHost process. Service tasks are designed to be reliable, so if one service task is killed or crashes, another instance of the service task is created to satisfy the job's TargetResourceCount. So if you want killed tasks not to be replaced by new ones, you should first lower the TargetResourceCount and then kill the tasks. For instance, to kill one task without having a new one replace it, you could first:

    session.ServiceJob.TargetResourceCount = session.ServiceJob.TargetResourceCount - 1;
    session.ServiceJob.Commit();

    Another possible way is to blacklist the downed node that the failed task was running on, so that no task will be scheduled onto that node again. To blacklist a node for the job, you can refer to http://msdn.microsoft.com/en-us/library/microsoft.hpc.scheduler.ischedulerjob.addexcludednodes(v=VS.85).aspx . After you finish the cleanup work, you can free that node by removing it from the job's blacklist.
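
    A sketch of that blacklist approach (assuming the IScheduler/ISchedulerJob API from the linked page; scheduler and nodeName are placeholders for your scheduler connection and the downed node's name):

    ```csharp
    using Microsoft.Hpc.Scheduler;

    // Exclude the downed node from the service job so no replacement task is
    // scheduled there, then re-admit it once cleanup is done.
    IStringCollection excluded = scheduler.CreateStringCollection();
    excluded.Add(nodeName);

    session.ServiceJob.AddExcludedNodes(excluded);    // node is now blacklisted
    // ... perform the DB cleanup for the task that failed on this node ...
    session.ServiceJob.RemoveExcludedNodes(excluded); // node is eligible again
    ```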

    Hope it helps.


    • Marked as answer by krolley Thursday, September 8, 2011 8:48 AM
    Wednesday, August 24, 2011 8:13 AM

All replies

  • Take a look at http://msdn.microsoft.com/en-us/library/microsoft.hpc.scheduler.session.configuration.loadbalancingconfiguration.messageresendlimit(VS.85).aspx for a description of messageResendLimit. The HPC SOA whitepaper also has some description of it.

    There is no programming interface to get that information. You can use tracing to analyze it manually. If you don't want requests to be retried, one way is to set the resend limit to 0, which will send all errors back to the BrokerClient so you can handle the exceptions on the client side.
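
    For reference, messageResendLimit is set in the loadBalancing section of the service's registration (config) file. A sketch - the surrounding element name follows the usual HPC SOA service config, and the other attribute shown is illustrative:

    ```xml
    <microsoft.Hpc.Broker>
      <!-- messageResendLimit="0" disables broker-side retry: every failed
           request is returned to the BrokerClient as an exception. -->
      <loadBalancing messageResendLimit="0"
                     serviceOperationTimeout="86400000" />
    </microsoft.Hpc.Broker>
    ```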

    Tuesday, August 9, 2011 3:29 AM
  • Just to be clear: there is no way to be notified when a call fails and is re-scheduled by the broker? For example, if a compute node fails while we are halfway through running a SOA service call on that node, there is no way to get a callback that the call was rescheduled? The only solution would be to set messageResendLimit to 0 and handle an exception on the client side?
    Wednesday, August 10, 2011 9:37 AM
  • You are right. You either handle the exception (and re-send it), or let the broker handle it for you. What exactly is your scenario?
    Wednesday, August 10, 2011 10:14 AM
  • We are trying to adapt an existing distributed framework to use HPC. In the existing framework, each compute node writes data to a central DB server. If any node throws an exception, it is caught on the client side and we do some cleanup of the DB, such as removing all the half-written data from the partitioned table.

    The biggest problem is that I need to know the task instance id (CCP_TASKINSTANCEID) of the call that failed, because I use that ID when writing data to the partitioned table (e.g. the table name is TableName_{TaskInstanceId}), and therefore need that ID from the failed task to do the cleanup.

    I will try to take down a compute node in the middle of a running job to see what exceptions come back to the client, when I set messageResendLimit to 0. Perhaps when I get the exception back I can connect to the scheduler and query the tasks in the job to find the cancelled tasks, if no better solution presents itself. If so, is it possible to get the task instance ID of a task using the scheduler API?

    Wednesday, August 10, 2011 3:31 PM
  • If you are using SP2 and a WCF client proxy to send requests, you can try using the LastFailedServiceId field to identify the task that failed:

    For instance:

    client.BeginEcho("a", delegate(IAsyncResult r)
    {
        try
        {
            ComputerInfo info = client.EndEcho(r);
        }
        // This is the exception you get when a request's retry count exceeds MessageResendLimit
        catch (FaultException<RetryOperationError> e)
        {
            Console.WriteLine("Last failed service id: {0}", e.Detail.LastFailedServiceId);
        }
    }, null);

    Wednesday, August 17, 2011 4:17 AM
  • Thanks for the reply, that was exactly what I was after.

    To simulate the error, I deploy the service and create a new SOA session. When HpcServiceHost.exe loads on the compute node, I kill that process. When I catch the RetryOperationError, I do some DB cleanup and "re-schedule" the call by making another call to the SOA service. At that time the session/job has no nodes that can do the calculation, and the call is listed as Calculating or Incoming on the job until the ReceiveTimeout is hit. Is there any way the broker could throw an exception when resources are no longer available for the calculation? Or could I manually check whether all tasks on the job are cancelled before "re-scheduling" the SOA call?

    I have set allocationAdjustInterval to -1 in the service config, so hopefully no new nodes should be brought into the job; but when I kill the HpcServiceHost process on the compute node, a new task is created in the job for the downed node and listed as Running.

    Tuesday, August 23, 2011 4:24 PM
  • Hi krolley,

    I'm a bit confused about your reply in paragraphs 2 & 3. Do you want to (1) cancel all tasks and not allocate new tasks when the error happens and before your "reschedule", (2) stop the task on the downed node only, or (3) only be notified when resources are no longer available for the calculation?

    Wednesday, August 24, 2011 6:14 AM
  • Hi Mingqing,

    I want (3). If a node fails or goes down, I do some DB cleanup and "re-schedule" the failed call by making another call to the SOA service. This works fine when I have two compute nodes and one node fails, because when I make another call to the service there is still a compute node that can do the calculation. But if I have two compute nodes and both nodes fail, I get a RetryOperationError and re-schedule the call to the SOA service. At that point there are no nodes that can do any calculation, so it would be great to get an exception or notification of some kind when I make a call to the SOA service; instead, the client app receives nothing until the ReceiveTimeout is hit, when I get a TimeoutException.

    Also, to simulate a compute node going down, I kill the HpcServiceHost process that is started on my two compute nodes. I notice that when I do this, the corresponding tasks on the job are set to Cancelled, but two more tasks are created in their place and set to Running. I was wondering: if these two Cancelled tasks were somehow _not_ replaced with new tasks, would the job itself be set to Cancelled? Perhaps I am missing a setting that could stop new tasks from being created to replace cancelled ones?

    Thanks so much for your help.

    Wednesday, August 24, 2011 7:38 AM