No endpoint listening at net.tcp://<headnode>:5802/SchedulerStoreService

    Question

  • Hi,

    I'm getting the following exception:

    System.ServiceModel.EndpointNotFoundException: There was no endpoint listening at net.tcp://<headnode>:5802/SchedulerStoreService that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details.

    The code throwing this exception (see snippet below) often executes without throwing an exception.

    MyScheduler.Connect(Cluster);
    var job = MyScheduler.OpenJob(jobId);
    job.Progress = percentageComplete;
    job.Commit();

    Has anyone seen this before? This method is called frequently, because our cluster often has many concurrently running jobs whose progress properties are being updated. Is it possible that the SchedulerStoreService cannot cope with several concurrent calls?


    Tuesday, July 10, 2018 10:09 AM

Answers

  • Hi Matt, we ran the test and found that there is an issue here, though the error we see is a little different: System.ServiceModel.FaultException`1[Microsoft.Hpc.ExceptionWrapper]: The communication object, System.ServiceModel.Dispatcher.ChannelDispatcher, cannot be used for communication because it has been Aborted. (Fault Detail is equal to Microsoft.Hpc.ExceptionWrapper).

    We will try to fix this on our side. Meanwhile, you could switch to the .NET Remoting connection method when connecting to the scheduler, which should handle the concurrent calls well when you report progress. Here is my test sample code for your reference:

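    // Fragment from a test program: jobId, secondsToRun, elapsedSeconds, startTime,
    // intervalToReport, remoting (bool) and r (a Random) are assumed to be declared earlier.
    Console.WriteLine($"Job {jobId} will run {secondsToRun} seconds, now reporting progress!");
    while (elapsedSeconds < secondsToRun)
    {
        double progress = (100 * elapsedSeconds) / secondsToRun;
        Console.Write($"Progress {progress}%");
        // Open a fresh connection for each update and dispose it afterwards.
        using (var scheduler = new Scheduler())
        {
            if (remoting)
            {
                // Workaround: connect through .NET Remoting.
                scheduler.Connect(Environment.GetEnvironmentVariable("CCP_SCHEDULER"), Microsoft.Hpc.Scheduler.Properties.ConnectMethod.Remoting);
            }
            else
            {
                // Default connection method (the one that showed the failure).
                scheduler.Connect(Environment.GetEnvironmentVariable("CCP_SCHEDULER"));
            }
            var job = scheduler.OpenJob(jobId);
            job.Progress = (int)progress;
            job.Commit();
        }
        // Wait a random number of seconds (below intervalToReport) before the next report.
        System.Threading.Thread.Sleep(1000 * r.Next(intervalToReport));
        elapsedSeconds = (int)((DateTime.Now - startTime).TotalSeconds);
    }
    return 0;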


    Qiufang Shi

    We will update this thread when the issue is fixed.


    Friday, July 20, 2018 4:28 AM
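
    Separately from the choice of connection method, an intermittent failure like the one in the question can sometimes be absorbed by retrying the call after a short pause. The following is only a minimal sketch under stated assumptions: TransientRetry, Run, and their parameters are hypothetical names and not part of the HPC Pack SDK, and whether retrying is appropriate for a loaded scheduler is an assumption to verify on your own cluster.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the HPC Pack SDK.
public static class TransientRetry
{
    private static readonly Random Jitter = new Random();

    // Runs `action`. If it throws an exception that `isTransient` accepts,
    // waits (exponential backoff plus random jitter) and tries again,
    // up to `maxAttempts` attempts in total. Other exceptions, and the
    // exception from the last attempt, propagate to the caller.
    public static void Run(
        Action action,
        Func<Exception, bool> isTransient,
        int maxAttempts = 5,
        int baseDelayMs = 200)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (isTransient == null) throw new ArgumentNullException(nameof(isTransient));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        if (baseDelayMs < 0) throw new ArgumentOutOfRangeException(nameof(baseDelayMs));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return; // success
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Delay doubles with each failed attempt: base, 2*base, 4*base, ...
                int delayMs = baseDelayMs * (1 << (attempt - 1));
                int jitterMs;
                lock (Jitter) // Random is not thread-safe
                {
                    jitterMs = Jitter.Next(0, baseDelayMs + 1);
                }
                Thread.Sleep(delayMs + jitterMs);
            }
        }
    }
}
```

    For the call in the question, `action` would be the connect, open, set progress, and commit sequence, and `isTransient` could be `ex => ex is System.ServiceModel.EndpointNotFoundException`. The jitter is there so that many jobs failing at the same moment do not all retry at the same moment.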

All replies

  • Hi Matt,

      SchedulerStoreService can cope with concurrent calls. From your description, this call sometimes works and sometimes throws an exception, right?


    Qiufang Shi

    Friday, July 13, 2018 4:02 AM
  • Yes - I'd say it works about 99% of the time, but it's receiving a lot of calls so even a 1% failure rate is quite a lot of failures!

    Friday, July 13, 2018 7:55 AM
  • Hi Matt,

      Could you share the version of HPC Pack you're using, and the load you have? The system currently has no throttling in place, so under heavy load calls may fail because of underlying SQL query or transaction failures.


    Qiufang Shi

    Monday, July 16, 2018 4:35 AM
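
    Since the reply above notes that there is no throttling on the service side, one possible mitigation is for each job to limit how often it sends progress updates. The following is a minimal sketch, assuming a hypothetical ProgressThrottle class (not part of the HPC Pack SDK) that skips an update when the percentage has not changed or when the previous update was sent too recently. The interval value is illustrative only.

```csharp
using System;

// Hypothetical helper, not part of the HPC Pack SDK: decides whether a
// progress value is worth committing, so each job makes fewer scheduler calls.
public sealed class ProgressThrottle
{
    private readonly TimeSpan minInterval;
    private int lastSentPercent = -1;                  // -1 means nothing sent yet
    private DateTime lastSentUtc = DateTime.MinValue;

    public ProgressThrottle(TimeSpan minInterval)
    {
        this.minInterval = minInterval;
    }

    // Returns true when `percent` should be committed at time `nowUtc`,
    // and records it as sent. Returns false when the update can be skipped.
    public bool ShouldSend(int percent, DateTime nowUtc)
    {
        percent = Math.Max(0, Math.Min(100, percent)); // clamp to 0..100

        if (percent == lastSentPercent)
        {
            return false; // nothing new to report
        }

        bool isFinal = percent == 100;
        if (!isFinal && nowUtc - lastSentUtc < minInterval)
        {
            return false; // too soon since the last update
        }

        lastSentPercent = percent;
        lastSentUtc = nowUtc;
        return true;
    }
}
```

    A job would create one ProgressThrottle, for example with TimeSpan.FromSeconds(10), and call ShouldSend(percentageComplete, DateTime.UtcNow) before connecting to the scheduler. A change to 100% is always let through so that completion is not delayed by the interval.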
  • Hi,

    My HPC Pack is HPC Pack 2016 v5.1.6086.0.

    An example of the load on the SchedulerStoreService is about 30 concurrently running HPC Jobs, each making regular progress update method calls as shown in my original post.

    Cheers, Matt.

    Monday, July 16, 2018 8:36 AM
  • Got it. We will try to reproduce this locally and report back to this thread.

    Qiufang Shi

    Tuesday, July 17, 2018 2:40 AM
  • That works - thank you. I will mark this as the answer.
    Thursday, August 2, 2018 9:55 AM