Client stops getting events from server

  • Question

  • Hi all, 

    Last week, before the weekend, I tested my cluster with my usual job (it usually takes 100 hours). After the weekend I found that the client tool's progress bar was stuck at 25% even though the job had finished on the HPC cluster. Our client tool's progress bar depends on task and job state events, which makes me believe that the connection between the client tool and the head node was somehow lost. I can't find the reason, though: there is nothing in the client tool log, no exceptions, and nothing on the server.

    What could cause clients to stop getting events from the head node?

    How can my clients know the head node's state (online / offline / not reachable)? Is there a server state event?

    Where can I find a log of the head node's states?

    Regards,

    Shay.
    Thursday, May 13, 2010 8:32 AM

Answers

  • Hi Shay,

    The client can indeed stop receiving events from the server under certain conditions.  The most likely causes are the following:

    1. The scheduler service on the head node has been stopped and restarted (or the head node itself has been rebooted).
    2. There is an intermittent connection failure between the client and the head node.
    3. The head node is under heavy load and is unable to keep up with the large number of job/task events it needs to send out to a particular connected client, in which case it will explicitly close the connection.

    If the connection becomes temporarily unavailable, the client will do its best to re-connect with the head node.  However, in HPC Server 2008 there is a known issue that, upon re-connection, the client may not be able to re-subscribe to events from the server.  Furthermore, under scenarios 2 and 3 above, there will be no log entries either on the client side or server side that would point to any connection problems.

    To work around this problem, I would suggest adding periodic polling logic to your progress tool that, once every few minutes, loads the task counters associated with the job and checks that their values match what you currently expect based on job/task events.  If there is a mismatch, you can close the current scheduler connection (by disposing of the current IScheduler object) and create a new one.  You would then need to re-open all your currently used ISchedulerJob objects and re-subscribe to all job and task events.  Alternatively, if you do not need high-resolution progress information, you can switch to a periodic polling model entirely for your application, without relying on scheduler events.
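
    The polling check described above can be sketched roughly as follows. This is only an illustration of the reconcile-and-reconnect idea, written in Python for brevity, and it does not use the HPC Pack API: JobConnection, ProgressWatchdog, FakeServer and FakeConnection are all hypothetical names, with JobConnection standing in for an IScheduler connection plus one opened ISchedulerJob.

```python
from typing import Callable, Optional, Protocol


class JobConnection(Protocol):
    """Hypothetical stand-in for a scheduler connection with one opened job."""

    def finished_task_count(self) -> int: ...
    def subscribe(self, on_task_finished: Callable[[], None]) -> None: ...
    def close(self) -> None: ...


class ProgressWatchdog:
    """Counts task-finished events and checks them against the server's counter."""

    def __init__(self, connect: Callable[[], JobConnection]) -> None:
        self._connect = connect  # factory that opens a fresh connection
        self._conn: Optional[JobConnection] = None
        self.finished_from_events = 0
        self.reconnects = 0
        self._open()

    def _open(self) -> None:
        # Create a fresh connection and re-subscribe to events on it.
        self._conn = self._connect()
        self._conn.subscribe(self._on_task_finished)

    def _on_task_finished(self) -> None:
        self.finished_from_events += 1

    def poll_once(self) -> bool:
        """Return True if a mismatch was found and the connection was rebuilt."""
        assert self._conn is not None
        actual = self._conn.finished_task_count()
        if actual == self.finished_from_events:
            return False
        # Events were missed: drop the old connection, resync, and reconnect.
        self._conn.close()
        self.finished_from_events = actual
        self.reconnects += 1
        self._open()
        return True


class FakeServer:
    """In-memory test double (assumption); not part of any real scheduler API."""

    def __init__(self) -> None:
        self.finished = 0
        self.listeners: list = []

    def connect(self) -> "FakeConnection":
        return FakeConnection(self)

    def finish_task(self) -> None:
        self.finished += 1
        for callback in list(self.listeners):
            callback()


class FakeConnection:
    """Connection to FakeServer that satisfies the JobConnection protocol."""

    def __init__(self, server: FakeServer) -> None:
        self._server = server
        self._callback = None

    def finished_task_count(self) -> int:
        return self._server.finished

    def subscribe(self, on_task_finished: Callable[[], None]) -> None:
        self._callback = on_task_finished
        self._server.listeners.append(on_task_finished)

    def close(self) -> None:
        if self._callback in self._server.listeners:
            self._server.listeners.remove(self._callback)
```

    In a real tool, poll_once would run on a timer every few minutes. Because events can still be in flight when the counters are read, it may be safer to rebuild the connection only after the mismatch persists across two consecutive polls.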

    Finally, please note that in HPC Server 2008 R2, we have made the re-connection logic substantially more robust, and significantly improved the reliability of the scheduler event model.  We have also added a built-in mechanism for keeping track of a job's progress, which works in a manner very similar to your application.  You can find out more about the current HPC Server 2008 R2 Beta at https://connect.microsoft.com/HPC/content/content.aspx?ContentID=6923&wa=wsignin1.0

    Regards,
    Leonid.

    Sunday, May 16, 2010 8:51 PM