No endpoint listening at net.tcp://<headnode>:5802/SchedulerStoreService

    Question

  • Hi,

    I'm getting the following exception:

    System.ServiceModel.EndpointNotFoundException: There was no endpoint listening at net.tcp://<headnode>:5802/SchedulerStoreService that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details.

    The code throwing this exception (see snippet below) often executes without throwing an exception.

    MyScheduler.Connect(Cluster);
    var job = MyScheduler.OpenJob(jobId);
    job.Progress = percentageComplete;
    job.Commit();
    Has anyone seen this before? This method is being called a lot as our cluster often has many concurrently running Jobs that are having their progress properties updated. Is it possible the SchedulerStoreService cannot cope with several concurrent calls?


    Tuesday, July 10, 2018 10:09

Answers

  • Hi Matt, we ran the test and found that there is an issue here, though the error we see is a little different: System.ServiceModel.FaultException`1[Microsoft.Hpc.ExceptionWrapper]: The communication object, System.ServiceModel.Dispatcher.ChannelDispatcher, cannot be used for communication because it has been Aborted. (Fault Detail is equal to Microsoft.Hpc.ExceptionWrapper).

    We will try to fix this on our side. In the meantime, you could switch to the .NET Remoting method when connecting to the scheduler, which should handle the concurrent calls well when you report progress. Here is my test sample code for your reference:

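                    // Excerpt from a test program. jobId, secondsToRun, intervalToReport, remoting,
                    // startTime, elapsedSeconds and r (a System.Random) are assumed to be defined
                    // earlier in that program; they are not shown in this post.
                    Console.WriteLine($"Job {jobId} will run {secondsToRun} seconds, now reporting progress!");
                    while (elapsedSeconds < secondsToRun)
                    {
                        // 100.0 forces floating-point division so the percentage is not truncated early.
                        double progress = (100.0 * elapsedSeconds) / secondsToRun;
                        Console.WriteLine($"Progress {progress}%");
                        // Open a fresh scheduler connection for each progress report.
                        using (var scheduler = new Scheduler())
                        {
                            if (remoting)
                            {
                                // Workaround: connect over .NET Remoting instead of the default method.
                                scheduler.Connect(Environment.GetEnvironmentVariable("CCP_SCHEDULER"), Microsoft.Hpc.Scheduler.Properties.ConnectMethod.Remoting);
                            }
                            else
                            {
                                scheduler.Connect(Environment.GetEnvironmentVariable("CCP_SCHEDULER"));
                            }
                            var job = scheduler.OpenJob(jobId);
                            job.Progress = (int)progress;
                            job.Commit();
                        }
                        // Wait a random number of seconds (below intervalToReport) before the next report.
                        System.Threading.Thread.Sleep(1000 * r.Next(intervalToReport));
                        elapsedSeconds = (int)((DateTime.Now - startTime).TotalSeconds);
                    }
                    return 0;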


    Qiufang Shi

    We will update this thread when the issue is fixed.


    Friday, July 20, 2018 04:28

All replies

  • Hi Matt,

      SchedulerStoreService can cope with concurrent calls. From your description, this call sometimes works and sometimes throws an exception, right?


    Qiufang Shi

    Friday, July 13, 2018 04:02
  • Yes - I'd say it works about 99% of the time, but it's receiving a lot of calls so even a 1% failure rate is quite a lot of failures!

    Friday, July 13, 2018 07:55
  • Hi Matt,

      Could you share the version of HPC Pack you're using, and the load you have? The system currently has no throttling in place, so under heavy load calls may fail because of underlying SQL query or transaction failures.


    Qiufang Shi

    Monday, July 16, 2018 04:35
  • Hi,

    My HPC Pack is HPC Pack 2016 v5.1.6086.0.

    An example of the load on the SchedulerStoreService is about 30 concurrently running HPC Jobs, each making regular progress update method calls as shown in my original post.

    Cheers, Matt.

    Monday, July 16, 2018 08:36
  • Got it. We will try to reproduce this locally and report back to this thread.

    Qiufang Shi

    Tuesday, July 17, 2018 02:40
  • That works - thank you. I will mark this as the answer.
    Thursday, August 2, 2018 09:55
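
A note on the failure mode discussed in this thread: the progress update succeeds about 99% of the time and the failures seem to appear only under load, so one possible client-side mitigation (not something proposed by anyone in the thread) is to retry the update after a short, randomized delay. The sketch below is a minimal, generic helper under that assumption. `Retry.WithBackoff`, `isTransient`, `maxAttempts` and `baseDelayMs` are hypothetical names chosen for illustration, and the action in `Main` only simulates a flaky call; it does not talk to a real scheduler.

```csharp
using System;
using System.Threading;

public static class Retry
{
    // Runs `action`. If it throws an exception that `isTransient` accepts,
    // waits with exponential backoff plus random jitter and tries again,
    // up to `maxAttempts` attempts in total. The last failure is rethrown.
    public static void WithBackoff(Action action, Func<Exception, bool> isTransient,
                                   int maxAttempts = 5, int baseDelayMs = 200)
    {
        var rng = new Random();
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Delay doubles on each attempt; jitter spreads out callers
                // that failed at the same moment.
                int delayMs = baseDelayMs * (1 << (attempt - 1)) + rng.Next(baseDelayMs);
                Thread.Sleep(delayMs);
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        int calls = 0;
        Retry.WithBackoff(
            () =>
            {
                calls++;
                // Simulated transient failure on the first two attempts.
                if (calls < 3) throw new TimeoutException("simulated transient failure");
                Console.WriteLine($"Succeeded on attempt {calls}");
            },
            ex => ex is TimeoutException,
            maxAttempts: 5,
            baseDelayMs: 10);
    }
}
```

In the scenario from the original post, the action would be the Connect / OpenJob / Commit sequence and the predicate would presumably check for System.ServiceModel.EndpointNotFoundException. Whether a retry is safe for other exception types is an assumption that would need to be verified against your own workload.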