Thursday, 13 May 2010 8:32 AM
Sunday, 16 May 2010 8:51 PM
The client can indeed stop receiving events from the server under certain conditions. The most likely causes are the following:
- The scheduler service on the head node has been stopped and restarted (or the head node itself has been rebooted).
- There is an intermittent connection failure between the client and the head node.
- The head node is under heavy load and cannot keep up with the large number of job/task events it needs to send to a particular connected client, in which case it explicitly closes the connection.
If the connection becomes temporarily unavailable, the client will do its best to re-connect to the head node. However, HPC Server 2008 has a known issue: after re-connecting, the client may not be able to re-subscribe to events from the server. Furthermore, in the second and third scenarios above (an intermittent connection failure, or a head node under heavy load), no log entries are written on either the client side or the server side that would point to a connection problem.
To work around this problem, I would suggest adding periodic polling logic to your progress tool. Once every few minutes, it would load the task counters associated with the job and check that their values match what you currently expect based on the job/task events received so far. If there is a mismatch, you can close the current scheduler connection (by disposing of the current IScheduler object) and create a new one. You would then need to re-open all the ISchedulerJob objects you are currently using and re-subscribe to all job and task events. Alternatively, if you do not need high-resolution progress information, you can switch your application to a periodic polling model entirely, without relying on scheduler events.
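The reconciliation step described above could be sketched roughly as follows. This is a language-neutral illustration in Python, not code against the actual HPC scheduler API: `ProgressWatchdog`, `load_counters`, and `reconnect` are hypothetical names, and the two callbacks stand in for whatever your tool does to read the job's task counters and to dispose of the IScheduler object, create a new one, re-open the jobs, and re-subscribe to events.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ProgressWatchdog:
    """Compares event-derived task counts with counters polled from the scheduler."""

    # Hypothetical callback: returns the job's current task counters,
    # for example {"finished": 3, "failed": 0}.
    load_counters: Callable[[], Dict[str, int]]
    # Hypothetical callback: dispose of the old connection, create a new one,
    # re-open the jobs, and re-subscribe to job and task events.
    reconnect: Callable[[], None]
    # Counts the tool believes are correct, built up from received events.
    expected: Dict[str, int] = field(default_factory=dict)

    def on_task_event(self, new_state: str) -> None:
        """Call from the task-event handler to record one task entering new_state."""
        self.expected[new_state] = self.expected.get(new_state, 0) + 1

    def check(self) -> bool:
        """Poll once. Returns True if counts agree, False if a reconnect was triggered."""
        actual = self.load_counters()
        states = set(self.expected) | set(actual)
        in_sync = all(self.expected.get(s, 0) == actual.get(s, 0) for s in states)
        if not in_sync:
            # Events were probably lost: rebuild the connection and
            # resynchronize the expected counts from the polled values.
            self.reconnect()
            self.expected = dict(actual)
        return in_sync
```

A timer or background loop in the progress tool would call `check()` once every few minutes. Because events may still be in flight when a poll happens, a real tool might prefer to reconnect only after two consecutive mismatches; the sketch reconnects on the first one for brevity.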
Finally, please note that in HPC Server 2008 R2, we have made the re-connection logic substantially more robust, and significantly improved the reliability of the scheduler event model. We have also added a built-in mechanism for keeping track of a job's progress, which works in a manner very similar to your application. You can find out more about the current HPC Server 2008 R2 Beta at https://connect.microsoft.com/HPC/content/content.aspx?ContentID=6923&wa=wsignin1.0
- Marked As Answer by Rae Wang (Microsoft Employee, Moderator), Wednesday, 19 May 2010 11:21 PM