How to turn off task failover for WCF tasks

  • Question

  • Hello,

    WCF tasks in HPC have a useful failover capability: some exceptions (not all) cause failed tasks to be resubmitted on other nodes. In my case this is not required and, moreover, may lead to wrong job results because of side effects. Is there a way to turn off task failover for a session?

    P.S. I have seen a 'fail job if one task fails' property on the job, but I'm not sure it is what I need, because it will also fail the other tasks, which may not be desirable.
    In theory, it would be useful to be able to turn off task failover in two ways:
    • For a session - no task fails over to a different node
    • For a single task - at the beginning of the task, call some method that prevents this particular task from failing over

    P.P.S. Sorry if this is an RTM question, but I haven't found any documentation on WCF task failover.


    Thanks,
    Max
    • Edited by McSIMM Tuesday, May 19, 2009 4:13 PM
    Tuesday, May 19, 2009 4:08 PM

All replies

  • Hi,

    First of all, these are new tasks created by the HPC WCF broker, not simple reruns. We designed the HPC SOA system so that all service hosts (WCF tasks) are stateless and can be run any number of times.

    However, there is a trick to accomplish this. Edit your HpcWcfBroker.exe.config, find the allocationGrowLoadRatioThreshold and allocationShrinkLoadRatioThreshold settings, and set them to very large numbers (say, 2147483647, i.e. Int32.MaxValue). Note that allocationShrinkLoadRatioThreshold should be smaller than allocationGrowLoadRatioThreshold.

    With those two values set, the WCF broker will stop growing the service job, and in most cases no new tasks should be added.
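
    The edit described above can also be scripted. Below is a minimal Python sketch, not an official tool. It assumes the two settings appear as XML attributes somewhere in the config file; the sample document, its `brokerSettings` element name, and the starting values 125 and 75 are made up for illustration, so check your real HpcWcfBroker.exe.config before relying on it. The function names are hypothetical as well.

```python
import xml.etree.ElementTree as ET

INT32_MAX = 2147483647  # the "really large number" suggested in the post

GROW = "allocationGrowLoadRatioThreshold"
SHRINK = "allocationShrinkLoadRatioThreshold"


def set_attribute_everywhere(root, name, value):
    """Set `name` on every element that already carries it; return the count."""
    count = 0
    for elem in root.iter():
        if name in elem.attrib:
            elem.set(name, str(value))
            count += 1
    return count


def pin_allocation_thresholds(xml_text, grow=INT32_MAX, shrink=INT32_MAX - 1):
    """Return a copy of the config text with both thresholds raised.

    The shrink threshold must stay below the grow threshold, as the post notes.
    """
    if shrink >= grow:
        raise ValueError("shrink threshold must be smaller than grow threshold")
    root = ET.fromstring(xml_text)
    changed = set_attribute_everywhere(root, GROW, grow)
    changed += set_attribute_everywhere(root, SHRINK, shrink)
    if changed == 0:
        raise KeyError("neither threshold attribute was found in the config")
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: the element name and original values are assumptions.
SAMPLE = """<configuration>
  <brokerSettings allocationGrowLoadRatioThreshold="125"
                  allocationShrinkLoadRatioThreshold="75" />
</configuration>"""

if __name__ == "__main__":
    print(pin_allocation_thresholds(SAMPLE))
```

    Searching by attribute name rather than by a fixed element path keeps the sketch independent of the exact config layout. Note that ElementTree discards comments when it parses the file, so keep a backup of the original.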
    Tuesday, May 19, 2009 5:29 PM
  • Hi,

    Thanks for the help, I'll try this solution.

    By "WCF tasks" I meant something slightly different: the service calls from client to server, which are executed by the service hosts. So I was looking at the problem from a different angle.

    If I understood you correctly, there is no failover for service calls, only for service hosts. Is that right? It would also be great if you could suggest some reading on task failover in HPC Server. It would really help me understand how it works.

    Thanks,
    Max
    Wednesday, May 20, 2009 1:01 PM
  • If you are talking about WCF calls (the request-reply messages), HPC SOA provides service call failover. Basically, a call will be retried up to 3 times if the service host somehow fails the request. To disable this behavior, edit HpcWcfBroker.exe.config, find messageResendLimit="3", and change the value to 1.
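
    As a rough illustration of that edit, here is a small Python sketch. It assumes messageResendLimit is an XML attribute in the config file, as the quoted text suggests; the surrounding `loadBalancing` element in the sample and the function name are assumptions for the example, not confirmed here.

```python
import xml.etree.ElementTree as ET


def set_message_resend_limit(xml_text, limit):
    """Return the config text with every messageResendLimit attribute set to `limit`."""
    if limit < 1:
        raise ValueError("this sketch only accepts limits of 1 or more")
    root = ET.fromstring(xml_text)
    # Collect every element that already has the attribute, wherever it lives.
    targets = [e for e in root.iter() if "messageResendLimit" in e.attrib]
    if not targets:
        raise KeyError("messageResendLimit not found in the config")
    for elem in targets:
        elem.set("messageResendLimit", str(limit))
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample config; only the attribute name comes from the thread.
SAMPLE = """<configuration>
  <loadBalancing messageResendLimit="3" />
</configuration>"""

if __name__ == "__main__":
    print(set_message_resend_limit(SAMPLE, 1))
```

    The value 1 is the one recommended in this post for disabling call failover.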

    If you are talking about HPC tasks, HPC scheduler provides a rerun-on-fail feature. This can be achieved by setting the Rerunnable property on the task.

    I'm not sure which one you are talking about when you said "tasks".
    Wednesday, May 20, 2009 10:14 PM
  • Initially, I was referring to WCF calls. Thank you for the information, it is really helpful. Could you please also answer a few additional questions so I can better understand the logic behind HPC's failover:
    1. How is HPC task failover related to WCF call failover? As I understand it, since WCF calls are implemented on top of regular HPC tasks, WCF failover may rely on HPC task failover somehow. For example, if after creating a session I enumerate all tasks in the service job and set Rerunnable to 'false', will the WCF broker still try to create new tasks on failure?
    2. How does the first option (setting allocationGrowLoadRatioThreshold) relate to the second option (setting messageResendLimit)? Are they interchangeable, or should they be used in different situations?
    3. Can the WCF broker resend failed service calls to an existing HPC task rather than a newly created one? I suppose this is what distinguishes the two options above. If yes, then the first option will not work, but the second one will. As I understand it, the broker can do so, and allocationGrowLoadRatioThreshold exists only to maintain the required number of running service hosts.
    4. Does the WCF broker resend failed service calls only on service host failure? Can it resend a failed call if the service host doesn't die and the call just returns some predefined exception?
    Thanks,
    Max
    Thursday, May 21, 2009 3:02 PM
  • HPC task failover has no impact on WCF call failover.

    WCF call failover is a feature of the HPC SOA model and is handled by the HPC WCF broker. If a WCF call fails on the cluster (either because of an HPC task failure or some intermittent network/hardware failure), the call will be retried on another compute node (CN). This behavior is controlled by the HPC WCF broker and can be configured by changing messageResendLimit.

    HPC task failover is a feature of the HPC job/task model and is handled by the job scheduler. In an HPC SOA job, the tasks are not marked as Rerunnable, so if an HPC task in the HPC WCF service job fails, it will not be rerun. However, the HPC WCF broker will create new HPC tasks in the job so that there are always enough HPC tasks running. You control this behavior by changing allocationGrowLoadRatioThreshold and allocationShrinkLoadRatioThreshold.

    For question 3, the WCF broker will resend the calls to some other service host, either a newly created or an existing one. The behavior is not deterministic in this case.

    For question 4, the WCF broker doesn't resend upon user code failure. That is, a user code exception will be presented to the client as a fault.
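
    The difference between the two failure kinds can be pictured with a toy model in Python. This is only a sketch of the behavior described above, not the broker's actual code: the class and function names are invented, and it assumes messageResendLimit counts total delivery attempts, which is one reading of the advice that a value of 1 disables failover.

```python
class HostFailure(Exception):
    """Infrastructure failure: the service host or its node is gone."""


class UserCodeFault(Exception):
    """Exception raised by the service's own code."""


def dispatch(request, hosts, message_resend_limit):
    """Try the request on successive hosts; return (result, attempts used).

    A HostFailure moves the request to the next host until the limit is
    reached. A UserCodeFault is never retried and propagates to the caller,
    mirroring "presented as a fault to the client".
    """
    attempts = 0
    for host in hosts:
        if attempts >= message_resend_limit:
            break
        attempts += 1
        try:
            return host(request), attempts
        except HostFailure:
            continue  # resend to some other service host
    raise HostFailure(f"gave up after {attempts} attempt(s)")


# Three hypothetical service hosts for the demo.
def dead(request):
    raise HostFailure("node down")


def good(request):
    return "done:" + request


def buggy(request):
    raise UserCodeFault("bad input")


if __name__ == "__main__":
    print(dispatch("job1", [dead, good], message_resend_limit=3))
```

    With a limit of 1 in this model, the same call fails after the first host dies instead of moving to the second host, which is the effect the original poster wants when calls have side effects.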

    • Marked as answer by McSIMM Saturday, May 23, 2009 11:04 AM
    Friday, May 22, 2009 6:20 PM