Parent job stuck in finishing state and child job can't be closed

  • Question

  • Hi

    Please help with the two questions below:
    (1) A parent job is stuck in the Finishing state (or has finished) while its child job (Broker for service job) keeps running for a long time in the Active jobs window. Each SOA/HPC session has two job IDs associated with it, one parent job and one child job, as in the example below. Generally, when a session is closed, both jobs should close together. However, we have observed that sometimes, when the parent job is finishing or finished, the child job keeps running in the queue for hours and blocks the resources (no other WCF service process can run on them). We have to cancel the child job manually (for example, 21234 WCF service – Broker for service job 21233). Do you know what may cause this? Why would a job be stuck in the Finishing state for hours? Is there any way to prevent this behavior through SOA or HPC configuration, or at the code level? Currently we use the default job template.

    21234 WCF service – Broker for service job 21233             (child job)  
    21233 WCF service          (parent job)

     (2)  What is the retry logic in HPC? Is it the messageResendLimit setting in the loadbalancing section of HPCbroker.exe.config? Are retries sent to the same compute node or to different compute nodes?

     Thanks!

    Wednesday, December 9, 2009 1:48 PM

Answers

  • Hi Xmhuang25,

    The first question looks like an HPC job scheduler issue.

    It seems you are using our V2 product. Can you tell me the version number?

    If you haven't done so already, can you upgrade to our latest V2 SP1? It may fix the scheduler issue.

    Thanks,

    Liwei
    Friday, January 8, 2010 5:36 PM

All replies

  • Hi,

    For your second question: yes, messageResendLimit sets the error retry limit. The counter is not used by the compute node, though; it is used by the broker node. Every time the broker gets a timeout or other connection error for a request, it increases the retry count on that request and dispatches it to another compute node. Once the retry count exceeds messageResendLimit, the request is marked as failed and the client sees an exception.

    Hope this explanation helps.
    Yiding
    Thursday, December 17, 2009 12:23 AM
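
    The retry behavior described in the reply above can be pictured with a small simulation. This is only an illustrative sketch in Python, not the broker's actual code: the names Request, NodeError, dispatch and send are hypothetical, the limit of 3 is an assumed value, and picking the "other" compute node by simple rotation is an assumption as well.

```python
from dataclasses import dataclass, field

MESSAGE_RESEND_LIMIT = 3  # assumed value standing in for messageResendLimit


class NodeError(Exception):
    """Hypothetical stand-in for a timeout or connection error from a node."""


@dataclass
class Request:
    payload: str
    retry_count: int = 0                      # kept by the broker, not the node
    tried_nodes: list = field(default_factory=list)


def dispatch(request, nodes, send, resend_limit=MESSAGE_RESEND_LIMIT):
    """Send a request, moving to another node after each failed attempt."""
    index = 0
    while True:
        node = nodes[index % len(nodes)]      # assumption: simple rotation
        request.tried_nodes.append(node)
        try:
            return send(node, request.payload)
        except NodeError:
            request.retry_count += 1
            if request.retry_count > resend_limit:
                # The request is marked as failed; the client sees an exception.
                raise RuntimeError(
                    f"request failed after {resend_limit} resends"
                )
            index += 1                        # redispatch to a different node


if __name__ == "__main__":
    def flaky_send(node, payload):
        # Pretend node1 always times out and node2 always answers.
        if node == "node1":
            raise NodeError(node)
        return f"{node} handled {payload}"

    req = Request("calc")
    print(dispatch(req, ["node1", "node2"], flaky_send))
    print("retries used:", req.retry_count)
```

    In this sketch a request is attempted once and then resent up to the limit, so with a limit of 3 the client would only see the failure after the fourth unsuccessful attempt.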
  • Thanks for the explanation! Could you look at the first question? We observed that a small percentage of our WCF/SOA jobs (less than 5%) get stuck in the Configuring or Finishing state. As a result, resources cannot be allocated or deallocated, respectively.
    Friday, January 8, 2010 1:56 PM
  • We use Windows HPC Server 2008, version 2.1.1703.0. Which version should we upgrade to?
    Thanks!

    Saturday, January 9, 2010 3:21 AM