none
Getting "The exception is The server did not provide a meaningful reply; this might be caused by a contract mismatch" on long running task within a job?

    Frage

  • Hi,

    I was hoping to get some information on what this error means.  We have a long running job which has 10-20 jobs which can take multiple days.  Randomly, I am seeing the following being returned on our client machine that kicked off the job:

    Request failed 1 time(s) and is marked as failed by the broker. The exception is The server did not provide a meaningful reply; this might be caused by a contract mismatch, a premature session shutdown or an internal server error.. and stack trace: at Microsoft.Hpc.Scheduler.Session.BrokerResponse`1.get_Result

    The other tasks are running fine on their individual machines.   I took a look at our application log, and the application seems to have crashed/exited non-gracefully.  Within the app code, there is a parent try/catch around the entire operation where it appears to be running, but it does not get picked up.   It appears the app is crashing in some way. 

    Can you provide more information on when the above message would get returned?  I have noticed this happening a few times across the jobs I've kicked off, and in the 2 times i looked at it appears to happen in the same spot, so seems plausible its related to the code, though how it doesnt trigger my try/catch I dont understand.  The code runs fine thousands of other things without error so if it is a code issue, its transient in nature.

    Thanks!

    Montag, 22. Januar 2018 22:45

Alle Antworten

  • i went on the worker node that had the issue and saw the following in eventviewer.  Any ideas?

    1)

    Windows cannot access the file  for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program HpcServiceHost because of this error.

    Program: HpcServiceHost

    File:

    The error value is listed in the Additional Data section.

    User Action

    1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.

    2. If the file still cannot be accessed and

                    - It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.

                    - It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.

    3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.

    4. If the problem persists, restore the file from a backup copy.

    5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.

    Additional Data

    Error value: C00000C4

    Disk type: 0

    2)

    Faulting application name: HpcServiceHost.exe, version: 4.5.5079.0, time stamp: 0x5641e146

    Faulting module name: ntdll.dll, version: 6.3.9600.18821, time stamp: 0x59ba86db

    Exception code: 0xc0000006

    Fault offset: 0x0000000000055ce8

    Faulting process id: 0x2234

    Faulting application start time: 0x01d3944d0fd3aab0

    Faulting application path: C:\Program Files\Microsoft HPC Pack 2012\Bin\HpcServiceHost.exe

    Faulting module path: C:\Windows\SYSTEM32\ntdll.dll

    Report Id: 11077de0-007c-11e8-81b3-ca4d2b2c0ab7

    Faulting package full name:

    Faulting package-relative application ID:

    Dienstag, 23. Januar 2018 21:40
  • Anyone have thoughts on this?  Thanks.
    Mittwoch, 14. März 2018 16:42
  • Hi Jason,

    This two errors are kind of general. We need more information to identify what happened in your cluster.

    First let me confirm, from the error message you post, you are using HPC Pack 2012 R2 Update 3(4.5.5079), right?

    Do you still have a live repro now? If yes, can you help us to collect logs with include the information of reproduced jobs?
    From broker node you can collect brokerworker log at %CCP_LOGROOT_SYS%SOA\HpcBrokerWorker_*.bin, and from compute node in question you can collect service host log at %CCP_LOGROOT_USR%SOA\HpcServiceHost\%CCP_JOBID%\%CCP_TASKINSTANCEID%\Host_*.bin files.

    You can send related logs to hpcpack@microsoft.com. Please also tell us the job/task id in question.

    Thanks,
    Zihao

    Freitag, 16. März 2018 06:19
  • thanks, I sent an email. I mentioned it in the email, but i was unable to find the directory on the compute node.

    And to confirm, yes we are on HPC pack 2012 R2 Update 3.  Thanks!

    -jason

    Montag, 19. März 2018 15:03
  • Hi Jason,

    Thank you for your reply.

    I checked the log files you sent, there is no job 421 in them. Please note that HPC logs with largest index are empty, so send the latest a few logs usually is not helpful, and we will need more log files to do diagnoses. If you want to make sure log files you are going to send contain the information we need, you can install logviewer, and check the content.

    For %CCP_LOGROOT_USR%, it should be set to %LOCALAPPDATA%\Microsoft\Hpc\LogFiles\. Please find the service host log files in the job owner account's %LOCALAPPDATA% folder, who starts the soa session.

    Thanks,
    Zihao

    Dienstag, 20. März 2018 01:35
  • thanks, I sent an email which I believe contains the right HpcBrokerWorker file after looking at the full set in logviewer.

    I am still unable to find the service host log file (I am looking on the worker machine).  I sent the screenshot where I don't see the Hpc/ directory at all.  Is there some logging flag which needs to be enabled first?

    Thanks!

    Dienstag, 20. März 2018 13:26
  • Hi Jason,

    Yes, you need to enable service host log in your service registration file. You can take %CCP_HOME%ServiceRegistration\Ccp_EchoSvc.config as a sample and check its <system.diagnostics> section.

    From the HpcBrokerWorker log you sent, we found some similar log lines at 01:08:55 and 01:47:57

    Id Time Tid Tag Text Pid File Level
    37845 2018-03-18T01:08:55 11548
    [Session:421] [Dispatcher] .ResponseReceived: TaskId=11534, ClientIndex=0, MessageId=edbe16e4-d5bc-4cf6-bcfa-a2b68e70e1d4, Enter method. Client=Service Client (Opened) bb6bc738-bf74-4f5a-97c8-e354a21b4279, net.tcp://a-hpwork203v0t:9100/421/11534/_defaultEndpoint 26176 HpcBrokerWorker_AF_000011.bin Detail
    37846 2018-03-18T01:08:55 11548
    [Session:421] [Dispatcher] .HandleException: TaskId=11534, ClientIndex=0, MessageId=edbe16e4-d5bc-4cf6-bcfa-a2b68e70e1d4, Exception happens. Client=Service Client (Opened) bb6bc738-bf74-4f5a-97c8-e354a21b4279, net.tcp://a-hpwork203v0t:9100/421/11534/_defaultEndpoint Exception=System.ServiceModel.CommunicationException: The server did not provide a meaningful reply; this might be caused by a contract mismatch, a premature session shutdown or an internal server error. Server stack trace: at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result) at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.End(SendAsyncResult result) at System.ServiceModel.Channels.ServiceChannel.EndCall(String action, Object[] outs, IAsyncResult result) at System.ServiceModel.Channels.ServiceChannelProxy.InvokeEndService(IMethodCallMessage methodCall, ProxyOperationRuntime operation) at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message) Exception rethrown at [0]: at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg) at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type) at Microsoft.Hpc.ServiceBroker.BackEnd.IService.EndProcessMessage(IAsyncResult ar) at Microsoft.Hpc.ServiceBroker.BackEnd.ResponseReceiver.ReceiveResponse(DispatchData data) 26176 HpcBrokerWorker_AF_000011.bin Error
    37847 2018-03-18T01:08:55 11548
    [Session:421] [Dispatcher] .HandleMessageLevelPreemption: TaskId=11534, ClientIndex=0, MessageId=edbe16e4-d5bc-4cf6-bcfa-a2b68e70e1d4, (Preemption) Attempt to get the SchedulerAdapterClient to retrieve the task error code. Client=Service Client (Opened) bb6bc738-bf74-4f5a-97c8-e354a21b4279, net.tcp://a-hpwork203v0t:9100/421/11534/_defaultEndpoint 26176 HpcBrokerWorker_AF_000011.bin Detail


    Is one of the machines (a-hpwork203v0t, a-hpwork209v0, a-hpwork211v0t, a-hpwork207v0t, a-hpwork202v0t, a-hpwork204v0t, a-hpwork205v0t, a-hpwork210v0t, a-hpwork208v0t, a-hpwork213v0t, a-hpwork206v0t) the compute node in question? If yes, service host on it was shut down because of preemption. There was another job requesting more resource, and hpc scheduler decided to take those resources from job 421, thus the service host is exited.

    If you want to let HPC takes resources away only when tasks complete, you can select graceful preemption in Job Scheduler Configuration.

    Thanks,
    Zihao





    Mittwoch, 21. März 2018 01:45
  • Thanks, I will turn on the settings and provide the client logs whenever the error re-occurs.

    RE: the pre-emption, this doesn't particularly make sense for this specific case for a few reasons:

    1) we only had 2 jobs running, with 11 tasks each and out fleet has 50 machines.  No additional jobs were started or running at this time.

    2) for the 2 jobs that were running at this time, all the tasks are running at the same time since we have enough machines, so there should actually be no remaining tasks that needed to be assigned/run (the tasks themselves are very long running, from a few hours to multiple days).

    3) we dont set the priority on the jobs / I'd expect they all have the default/same priority so there shouldnt be any jobs with more priority.

    Our client only logged 1 error, but its possible there were more.  I was looking at our error handling (below).  Will it detect/report any errors on any of the tasks?  or because its a foreach, it will only start reporting once the first task/batch has returend an error (i.e. if batch #2 fails, it will only report after there is a response from batch 0 and 1, whether success or failure).  That said, I took a look at our cilent logs, and all tasks were progressing without issue until 3/17/2018 9:09:14PM when we received the error response, and the rest of the tasks I see progressing until I killed the job at 9:48pm EST to restart it.

    ==

                            foreach (var response in client.GetResponses<RunPCRunOverRunResponse>())
                            {
                                try
                                {
                                    if (response.Result.RunPCRunOverRunResult != 0) //this means failure.
                                    {
                                        failures.Add(signalSetGroupings.ElementAt(response.GetUserData<int>()));
                                    }
                                }
                                catch (Exception ex) //also catch any exceptions and write to the console.
                                {

                                    var job = _scheduler.OpenJob(session.Id);
                                    if (job.State == JobState.Canceled)
                                    {
                                        statusThread.Abort();
                                        Console.WriteLine("Canceling sim run {0}.  The job has been detected as canceled.",
                                            _simRunId);
                                        throw new Exception(
                                            String.Format("Canceling sim run {0}.  The job has been detected as canceled.",
                                                _simRunId), ex);
                                    }
                                    var failedBatchId = response.GetUserData<int>();
                                    Console.WriteLine(
                                        "Error with batch id : {0} with error: {1} and stack trace: {2}", failedBatchId,
                                        ex.Message, ex.StackTrace);

                                    var remainingBatchSignalSets = SimHelper.GetRemainingBatchSignalSets(_simRunId, _runArgs);
                                    var signalSet = remainingBatchSignalSets.Single(c => c.Item1 == failedBatchId);
                                }
                            }



    Mittwoch, 21. März 2018 11:34
  • Hi Jason,

    I suppose we don't ensure the order returning responses, so message in batch #2 will not necessarily get returned after batch #1. But yes, as it is a foreach, you will need to receive some messages before you reach the failing request.

    If the premature exiting of service host was not caused by preemption, we'll need more logs to find out the reason. As there is no service host log for job 421, can you help to collect scheduler log (which is at %CCP_LOGROOT_SYS%Scheduler\HpcScheduler_*.bin on headnode), and session launcher log (which is at %CCP_LOGROOT_SYS%SOA\HpcSession_*.bin on headnode) instead?

    Thanks,
    Zihao

    Donnerstag, 22. März 2018 04:17
  • I added the diagnostics config to our service registration file and will send those logs over the next time we run into this error.  It happens somewhat infrequently, but often enough I'm confident it will happen at some point. 

    Thanks for the help

    Donnerstag, 22. März 2018 13:48
  • Hi Zihao,

    The issue occurred last night (actually twice on 2 different jobs about 5 hours apart on different machines).  I've sent over the relevant logs (including the worker log), but I couldn't find anything useful in there unfortunately.  The log just terminated with no error message.

    -Jason

    Mittwoch, 28. März 2018 13:01
  • Hi Jason,

    Thanks for reaching us. The Service Host logs are helpful.

    The way service host log terminating is interesting. If a service host exited gracefully, it will write down following two lines, just like in log 423\26\Host_00000.bin:

    Id Time Tid Tag Text ActId Pid File Level
    33 2018-03-28T04:24:00 8608 Current Application Domain ProcessExit event invoked 00000000-0000-0000-0000-000000000000 4412 Host_000000.bin Verbose
    34 2018-03-28T04:24:00 8608 Cosmos Logger is being closed 00000000-0000-0000-0000-000000000000 4412 Host_000000.bin Verbose

    From BrokerWoker log at 4:23GMT we can see the failing task is task 11827, which corresponding to the log 423\7\Host_000000.bin. And it terminated without printing above lines. This indicate the service host is terminated unexpectedly (thus it is likely not caused by preemption).

    To further investigate this, we'll have to collect more information:

    1. Is there any error in Windows Event log about process 5320 on node a-hpwork208v0t around 12:23 EST?
    2. Does your SOA service write down logs? Is there anything special at that time?
    3. From HPC side, scheduler log (which is at %CCP_LOGROOT_SYS%Scheduler\HpcScheduler_*.bin on headnode), and session launcher log (which is at %CCP_LOGROOT_SYS%SOA\HpcSession_*.bin on headnode) may also contain information about the failing task (Job:423 Task:11827 NiceId:7). Could you collect them and send to us by email?

    Thanks,
    Zihao




    Donnerstag, 29. März 2018 02:29
  • Hi,

    1) there were 2 event logs at exactly the same time at 12:23 EST as follows:

    Windows cannot access the file  for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program HpcServiceHost because of this error.

    Program: HpcServiceHost

    File:

    The error value is listed in the Additional Data section.

    User Action

    1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.

    2. If the file still cannot be accessed and

                    - It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.

                    - It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.

    3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.

    4. If the problem persists, restore the file from a backup copy.

    5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.

    Additional Data

    Error value: C00000C4

    Disk type: 0

    Faulting application name: HpcServiceHost.exe, version: 4.5.5079.0, time stamp: 0x5641e146

    Faulting module name: ntdll.dll, version: 6.3.9600.18895, time stamp: 0x5a4b1b67

    Exception code: 0xc0000006

    Fault offset: 0x0000000000055d68

    Faulting process id: 0x14c8

    Faulting application start time: 0x01d3c6276f8dff80

    Faulting application path: C:\Program Files\Microsoft HPC Pack 2012\Bin\HpcServiceHost.exe

    Faulting module path: C:\Windows\SYSTEM32\ntdll.dll

    Report Id: cec9b1fc-323f-11e8-81b7-d0a6425b4bad

    Faulting package full name:

    Faulting package-relative application ID:

    2) Yes, it write logs but nothing interesting.  It just stops running right in the middle (not at a particular spot).   Our SOA service just calls an EXE, and waits for it to return before proceeding.  If just our EXE had crashed, I would have expected our error handling to pick it up (and log and then retry).  That doesnt happen.  When this occurs, everything seems to just crash/no error handling anywhere.

    3) sent the logs over.  

    Thanks!

    -Jason

    Donnerstag, 29. März 2018 15:51
  • Hi Jason,

    From the log you sent, we found no error. Seems HPC Pack is working properly.

    Are you trying to access something shared file through SMB share in your SOA service? If yes, could it be related to opportunistic locking of SMB share? As stated in this article.

    Thanks,
    Zihao

    Freitag, 30. März 2018 01:59
  • Hi Zihao,

    I think I found the same or similar article you provided below.   Yes, there is quite a bit of access to a SMB share in our app, but the way the app behaves isn't what I'd expect.

    The way its all setup is as follows:

    client >> SOA service >> executable.

    All the SOA service is doing is calling the executable and then waiting for an exit code from the exe.  The service itself does not access any files at all.   If the executable ran into file access issues (the executable itself lives on a share if that makes a difference), I'd expect it handle it gracefully, or crash, but in either case just return a failed exit code to the service and then that be handled gracefully in our error handling code.   The SOA service itself's DLL/etc does live on a file share as well, so that is a possibility.  Could that possibly be it?  Or would it have gotten loaded locally when the service is invoked where losing a connection to the SOA service DLL would not cause a problem/this type of problem?  Is there any reason to think that moving to the latest HPC 2016 could help this issue at all?  We plan on moving at some point, but its a pretty high cost for us to do so.

    Thanks again for the help!

    Montag, 2. April 2018 11:41
  • Hi Jason,

    > the executable itself lives on a share if that makes a difference
    How does your service code call this executable? Through an interface? Or just run it?

    > The SOA service itself's DLL/etc does live on a file share as well, so that is a possibility.  Could that possibly be it?
    I would suggest you to try putting both you SOA service DLL and the executable to a local folder to see if this problem will repro.

    >Or would it have gotten loaded locally when the service is invoked where losing a connection to the SOA service DLL would not cause a problem/this type of problem?
    As far as I know, the DLL will not get loaded locally.

    > Is there any reason to think that moving to the latest HPC 2016 could help this issue at all?
    I don't think moving to HPC 2016 will help.

    Thanks,
    Zihao






    Dienstag, 3. April 2018 02:01
  • > How does your service code call this executable? Through an interface? Or just run it?

    Just using the Process.Start function in C#.

                    var process = Process.Start(_driverLocation, args);
                    process.WaitForExit();

    I was wrong in stating we got an exit code.  We have a separate check after the WaitForExit() is reached which checks our application logs to confirm if the instance finished as expected, and if not, handles it.

    I would suggest you to try putting both you SOA service DLL and the executable to a local folder to see if this problem will repro.

    So in the service registration file I can just specify something like "C:\HPC\Whatever.dll" as the service location, and when it attempts to run, it will attempt to find that path on the HPC worker machine?   

    Thanks for the help.  I will try some of these suggestions and let you know either way if they were worked or not.

    Dienstag, 3. April 2018 12:56
  • Hi Jason,

    > So in the service registration file I can just specify something like "C:\HPC\Whatever.dll" as the service location, and when it attempts to run, it will attempt to find that path on the HPC worker machine?

    Yes, it'll try to load the DLL from local disk.


    Thanks,
    Zihao

    Mittwoch, 4. April 2018 02:25