IScheduler.OnSchedulerReconnect Exception when submitting jobs immediately after headnode starts running.

  • Question

  • If I start with the HPC headnode Windows services stopped, then start them and submit jobs through the .NET SDK as soon as the headnode is online, I sometimes receive an IScheduler.OnSchedulerReconnect event with the following details:

    Code: Exception

    Exception: 

    System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
       at System.Collections.Generic.Dictionary`2.Enumerator.MoveNext()
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc.ReRegisterForEvents()
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc._MonitorThread()

    I then no longer receive TaskStateChanged events.

    HPC Pack 2012 R2 Server and Client version 4.5.5079.0

    Is this caused by submitting jobs too early? If so, what is an appropriate event to wait for before submitting? Currently I wait until I can connect to the headnode and see that it has more than one compute node online.

    • Edited by TimJRoberts1 Thursday, March 2, 2017 11:08 AM I can reproduce without using AWS - just stopping/starting the HPC services on the headnode
    Thursday, March 2, 2017 10:06 AM

All replies

  • Hi Tim,

    This is a bug in HPC Pack.

    Could you provide the following information so we can make a better fix:

    Does your client register or unregister the OnTaskState event on the SchedulerJob object frequently?

    Does your client create a RemoteCommand instance?

    Does your client connect to the headnode over HTTP?

    When this happened, did you try unregistering the TaskStateChanged event and re-registering it?

    Thanks,
    Evan

    Friday, March 3, 2017 8:07 AM
  • Does your client register or unregister the OnTaskState event on the SchedulerJob object frequently?

    Only once per job.

    Does your client create a RemoteCommand instance?

    No.

    Does your client connect to the headnode over HTTP?

    Yes. Well, HTTPS.

    When this happened, did you try unregistering the TaskStateChanged event and re-registering it?

    No, but I do register for the IScheduler.OnSchedulerReconnect event.

    When I receive this event with the code ConnectionEventCode.EventReconnect, I query IScheduler for the job and task state of any jobs that are still running (i.e. any jobs/tasks for which I have not yet received a Finished/Failed/Cancelled state).

    I use this logic to detect any state changes that I missed because they were not raised through OnTaskState or OnJobState.

    Wednesday, March 8, 2017 9:42 AM
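
The reconnect-time check described in the reply above (re-query the state of every job and task that has not yet reported Finished, Failed, or Cancelled) can be sketched roughly as follows. This is a language-neutral illustration written in Python, not HPC Pack SDK code: `reconcile`, `in_flight`, and `query_state` are hypothetical names, and `query_state` stands in for whatever scheduler query the client actually uses.

```python
from typing import Callable, Dict, List, Tuple

# State names taken from the thread; the real SDK enum values may be spelled differently.
TERMINAL_STATES = frozenset({"Finished", "Failed", "Cancelled"})

TaskKey = Tuple[int, int]  # (job_id, task_id)


def reconcile(
    in_flight: Dict[TaskKey, str],
    query_state: Callable[[int, int], str],
) -> List[Tuple[int, int, str]]:
    """Compare locally tracked task states with the scheduler's current view.

    in_flight maps (job_id, task_id) to the last state seen through events.
    query_state is a hypothetical stand-in for a scheduler query.
    Returns the state changes that were missed and stops tracking tasks
    that have reached a terminal state.
    """
    missed = []
    # Iterate over a copy of the keys because the dict is mutated in the loop.
    for key in list(in_flight):
        job_id, task_id = key
        current = query_state(job_id, task_id)
        if current != in_flight[key]:
            missed.append((job_id, task_id, current))
        if current in TERMINAL_STATES:
            del in_flight[key]
        else:
            in_flight[key] = current
    return missed
```

A client would call something like this from its reconnect handler and then process each returned tuple as if the corresponding state-change event had arrived.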
  • Hi Tim,

    The workaround for this bug should be to unregister the TaskStateChanged event and re-register it, or to connect over a non-HTTPS protocol.

    Thanks,
    Evan

    Wednesday, March 8, 2017 12:57 PM
  • Unfortunately, I need to use HTTPS because the headnode and client are in different domains with no trust relationship between them.

    I managed to reproduce a similar issue where no "OnSchedulerReconnect" event was raised.

    I wrote a test that submits some jobs and then restarts the HpcScheduler and HpcManager Windows services on the headnode a few times.

    As a workaround I now poll the headnode every minute or so, and confirm the state of jobs/tasks that I think are in progress.

    Wednesday, March 15, 2017 1:26 PM
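
The polling workaround in the last reply (check the headnode every minute or so and confirm the state of work believed to be in progress) might look something like the sketch below. It is again an illustration in Python under stated assumptions, not SDK code: `poll_until_settled`, `query_state`, and `on_terminal` are hypothetical names, and the `sleep` parameter exists only so the wait can be swapped out in tests.

```python
import time
from typing import Callable, Optional, Set

# State names taken from the thread; the real SDK enum values may be spelled differently.
TERMINAL_STATES = frozenset({"Finished", "Failed", "Cancelled"})


def poll_until_settled(
    pending: Set[int],
    query_state: Callable[[int], str],
    on_terminal: Callable[[int, str], None],
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> int:
    """Poll until every id in pending reaches a terminal state.

    query_state is a hypothetical stand-in for a scheduler query.
    on_terminal is called once for each id that settles.
    Returns the number of polling rounds performed.
    """
    polls = 0
    while pending and (max_polls is None or polls < max_polls):
        polls += 1
        # sorted() returns a new list, so pending can be modified safely here.
        for task_id in sorted(pending):
            state = query_state(task_id)
            if state in TERMINAL_STATES:
                pending.discard(task_id)
                on_terminal(task_id, state)
        # Only wait if there is still work outstanding and another round is allowed.
        if pending and (max_polls is None or polls < max_polls):
            sleep(interval_seconds)
    return polls
```

Polling of this kind does not depend on events being delivered, which is why it still works in the case described above where no OnSchedulerReconnect event is raised; the cost is up to one interval of latency before a state change is noticed.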