Poll for Job Status or use Job State EventHandler

  • Question

  • Your thoughts would be highly appreciated.

    We are using the Scheduler API to run a batch of jobs on HPC. The C# application that submits the jobs runs on a separate App Server (not the HPC Head Node). I am trying to figure out which of the following two designs would be more robust for monitoring the status of the HPC jobs.

    1. Submit the Job and then periodically poll for the Job Status.

    2. Use EventHandler for Job Status changes

    I am in favor of the 1st approach, polling, because it gives us more control in case of intermittent connection failures between our App Server (where we run our code for job submission) and the HPC Head Node. I can add retry logic and recover from transient failures.
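    The retry logic mentioned for approach 1 could be sketched roughly as below. This is a generic helper, not part of the HPC Scheduler API: the names Retry and WithBackoff, the attempt count, and the delays are all assumptions made for illustration. The action passed in would be whatever call queries the head node for the job state.

```csharp
using System;
using System.Threading;

// Hypothetical helper: retries an operation with exponential backoff.
public static class Retry
{
    public static T WithBackoff<T>(Func<T> action, int maxAttempts = 5, int initialDelayMs = 500)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        int delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return action();
            }
            // Assumption: every exception is treated as transient here.
            // Real code should filter on connection-related exception types.
            // On the last attempt the filter is false, so the exception propagates.
            catch (Exception) when (attempt < maxAttempts)
            {
                Thread.Sleep(delayMs);
                delayMs *= 2; // double the wait before the next attempt
            }
        }
    }
}
```

    A poll would then wrap the status query in Retry.WithBackoff(...), so that a short network interruption delays the result instead of failing the whole monitoring loop.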

    I am not sure how the EventHandler is implemented underneath. What happens if there is a transient connection failure between the HPC Head Node and the App Server? Will the HPC infrastructure retry delivering the event? We are running in a cloud environment (not Azure) where retries to recover from transient failures are recommended.

    Please keep in mind that we do some extra processing on the results after the job on the grid has completed successfully, so it's important to know when the job completes so that we can run the post-processing code.



    Monday, May 4, 2015 5:24 PM

All replies

  • I've done this exact same setup, and I used a combination of both events and polling.

    The Job/Task StatusChangedEvents are not 100% reliable, and I was not always receiving all status events for very quick jobs. (This was reproducible: when submitting 1000 short jobs in a loop, I would normally miss status events for about 2 or 3 of them.) HPC does not try to resend any missed events.

    What I did was maintain a Dictionary in memory of job IDs and what I think their status is.

    If the events work correctly you can remove this job ID from the Dictionary when you get the "Completed" status change event.

    I then also polled every X seconds and requested the job state from HPC for the IDs I thought hadn't completed. If there was any difference between the job state in memory and the job state from HPC, I would raise the missing events myself.

    This would also support your requirement of loss of connectivity between HPC head and your app server.
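    The dictionary-plus-polling approach described above might look something like the sketch below. It deliberately leaves out the HPC Scheduler API itself: TrackedState, JobTracker, and the queryScheduler delegate are hypothetical names, and the state enum is a simplified stand-in. The scheduler's job state event handler would call OnSchedulerEvent, and a timer would call Reconcile every X seconds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the scheduler's job states (assumption: the real enum has more values).
public enum TrackedState { Submitted, Running, Finished, Failed, Canceled }

public sealed class JobTracker
{
    private readonly object _gate = new object();
    // Job ID -> the state we believe the job is in. Finished jobs are removed.
    private readonly Dictionary<int, TrackedState> _known = new Dictionary<int, TrackedState>();
    // Hypothetical delegate that asks the head node for a job's current state.
    private readonly Func<int, TrackedState> _queryScheduler;

    // Raised once per observed state change, whether it came from an event or a poll.
    public event Action<int, TrackedState> JobStateChanged;

    public JobTracker(Func<int, TrackedState> queryScheduler)
    {
        _queryScheduler = queryScheduler ?? throw new ArgumentNullException(nameof(queryScheduler));
    }

    public int PendingCount { get { lock (_gate) return _known.Count; } }

    public void Track(int jobId)
    {
        lock (_gate) _known[jobId] = TrackedState.Submitted;
    }

    // Call this from the scheduler's state-change event handler.
    public void OnSchedulerEvent(int jobId, TrackedState newState) => Apply(jobId, newState);

    // Call this from a timer; it raises any events that were missed.
    public void Reconcile()
    {
        int[] pending;
        lock (_gate) pending = _known.Keys.ToArray();

        foreach (int id in pending)
        {
            TrackedState actual;
            try { actual = _queryScheduler(id); }
            catch (Exception) { continue; } // treated as transient: try again on the next poll
            Apply(id, actual);
        }
    }

    private void Apply(int jobId, TrackedState newState)
    {
        lock (_gate)
        {
            // Ignore untracked (already completed) jobs and states we already know about.
            if (!_known.TryGetValue(jobId, out var old) || old == newState) return;
            if (IsTerminal(newState)) _known.Remove(jobId);
            else _known[jobId] = newState;
        }
        JobStateChanged?.Invoke(jobId, newState);
    }

    private static bool IsTerminal(TrackedState s) =>
        s == TrackedState.Finished || s == TrackedState.Failed || s == TrackedState.Canceled;
}
```

    Because both paths go through the same Apply method, a job whose completion arrives by event and is then seen again by the poll (or the other way around) triggers post-processing only once.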

    Thursday, May 7, 2015 2:12 PM
  • Jim's answer is correct. Event delivery is not guaranteed when the scheduler is busy. On the other hand, polling the scheduler very often while there is a long active job queue puts the scheduler under stress. A combination of polling (at a long interval) and the event handler is therefore the better choice.

    Qiufang Shi

    Thursday, May 14, 2015 3:40 AM