Your thoughts would be highly appreciated.
We are using the Scheduler API to run a batch of Jobs on HPC. The C# application that submits the jobs runs on a separate App Server, not on the HPC Head Node. I am trying to figure out which of the following two designs would be more robust for monitoring the status of the HPC Jobs.
1. Submit the Job and then periodically poll for the Job Status.
2. Use EventHandler for Job Status changes
I am in favor of the 1st approach (polling) because it gives us more control in case of intermittent connection failures between our App Server (where we run our code for job submission) and the HPC Head Node. I can add retry logic around each status check and recover from transient failures.
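To make option 1 concrete, here is a minimal sketch of the polling loop I have in mind. It is written in Python only for brevity and is not tied to the HPC Scheduler API: `fetch_state` is a hypothetical callable standing in for whatever call reads the job status from the Head Node, `TransientError` is a hypothetical exception for a dropped connection, and the state names, intervals and retry counts are all assumptions.

```python
import time

# Assumed names for the states that end a job; adjust to the real API.
TERMINAL_STATES = {"Finished", "Failed", "Canceled"}


class TransientError(Exception):
    """Hypothetical error raised by fetch_state when the connection drops."""


def poll_until_done(fetch_state, poll_interval=30.0, max_retries=5,
                    base_delay=1.0, sleep=time.sleep):
    """Poll fetch_state() until it returns a terminal state.

    Consecutive transient failures are retried with exponential backoff;
    after more than max_retries in a row the last error is re-raised.
    """
    failures = 0
    while True:
        try:
            state = fetch_state()
        except TransientError:
            failures += 1
            if failures > max_retries:
                raise
            # Back off: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (failures - 1))
            continue
        failures = 0  # a successful call resets the failure counter
        if state in TERMINAL_STATES:
            return state
        sleep(poll_interval)
```

The caller would run the post-processing step only when the returned state is the successful one. The `sleep` parameter is there just so the loop can be exercised without real waiting.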
I am not sure how the EventHandler is implemented underneath. What happens if there is a transient connection failure between the HPC Head Node and the App Server? Will the HPC infrastructure retry delivering the event? We are running in a cloud environment (not Azure) where retries to recover from transient failures are recommended.
Please keep in mind that we do some extra processing on the results after the Job on the grid has completed successfully, so it's important to know when the job completes so that we can run the post-processing code.
Thanks,
Inder