Long running tasks not notifying the head node that they have completed?

  • Question

  • Hi,

    I am having an issue that I was hoping to get some assistance on. As of a couple of weeks ago, the following issue started occurring, which I don't really understand. We have a job with 20 tasks, each of them very long running (anywhere from a few hours to a few days). We have 20 machines, so each task gets assigned to its own machine. The issue is that when the tasks finish, the head node does not get notified and the job continues running indefinitely (until the timeout). I have confirmed that the tasks themselves have finished based on our app logs; it is just the communication back from the worker to the head node that does not seem to be working.

    Any ideas? This worked fine until a couple of weeks ago, so I'm thinking it may be some Windows patching/firewall/antivirus updates on our HPC fleet. Do you have any suggestions? If you have any thoughts on what the issue could be given the behavior, I can pass that along to our IT/infra team and ask them to look into the possible changes that could have caused this.

    EDIT: forgot to mention we connect to HPC via a SOA durable session.

    Thanks.


    Thursday, September 22, 2016 3:28 PM

All replies

  • Hi Jason,

    Since it is a SOA job, could you double check: 1) whether the job progress is 100% with Total Requests = Succeeded, via the Admin/Job GUI console -> Job Management -> SOA Jobs -> Job view? 2) whether your client explicitly called SessionBase.Close() and succeeded in closing the session after attaching the durable session to retrieve the responses? Note that if the session is not closed explicitly by the client, it relies on the sessionIdleTimeout specified in the service registration file, or the SessionStartInfo.BrokerSettings.SessionIdleTimeout set when creating the session, to automatically close itself after all requests are processed and the idle period times out.
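    For reference, a minimal sketch of the two ways a durable session can end (the head node and service names are hypothetical, and the assumption that SessionIdleTimeout is in milliseconds should be verified):

        // Assumes: using Microsoft.Hpc.Scheduler.Session;
        SessionStartInfo info = new SessionStartInfo("MyHeadNode", "MyService"); // hypothetical names

        // Option 2: let the broker close an unclosed session automatically
        // after all requests are processed and the idle period elapses
        // (assumption: value in milliseconds; here 10 minutes).
        info.BrokerSettings.SessionIdleTimeout = 10 * 60 * 1000;

        using (DurableSession session = DurableSession.CreateSession(info))
        {
            // ... send requests and retrieve all responses via a BrokerClient ...

            // Option 1: close the session explicitly so the service job
            // finishes immediately instead of waiting for the idle timeout.
            session.Close();
        }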

    Regards,

    Yutong Sun

    Friday, September 23, 2016 7:59 AM
  • It does not show them all as completed successfully.

    The numbers are something like 5 completed, 14 running, even though the 14 which show as running I know are completed based on the app logs.

    I looked through our code and do not see any calls to SessionBase.Close(). Where exactly should this be happening? Here is what I have (mainly taken from the SOA sample provided in the docs). Before a couple of weeks ago, the job would finish immediately once the tasks had finished (also based on our app log); it wouldn't run until the timeout before finishing.

        SessionStartInfo info = GetSessionInfo(signalSetGroupings.Count());

        Console.WriteLine("Creating a session for {0}", info.ServiceJobName);
        using (DurableSession session = DurableSession.CreateSession(info))
        {
            Console.WriteLine("done session id = {0}", session.Id);
            NetTcpBinding binding = new NetTcpBinding(SecurityMode.Transport);

            using (BrokerClient<IXXXYYYZZZService> client =
                new BrokerClient<IXXXYYYZZZService>(session, binding))
            {
                // Send one request per signal set grouping, tagged with its index.
                for (int i = 0; i < signalSetGroupings.Count(); i++)
                {
                    var signalSetGrouping = signalSetGroupings.ElementAt(i);
                    var request = new XXXYYYZZZRequest(param1, param2, param2);
                    client.SendRequest(request, i);
                }
                client.EndRequests();

                // Retrieve responses as the broker returns them.
                foreach (var response in client.GetResponses<XXXYYYZZZResponse>())
                {
                    try
                    {
                        if (response.Result.XXXYYYZZZResult != 0) // non-zero means failure
                        {
                            failures.Add(signalSetGroupings.ElementAt(response.GetUserData<int>()));
                        }
                    }
                    catch (Exception ex) // also catch any exceptions and write to the console
                    {
                        var job = _scheduler.OpenJob(session.Id);
                        if (job.State == JobState.Canceled)
                        {
                            statusThread.Abort();
                            Console.WriteLine("Canceling sim run {0}. The job has been detected as canceled.",
                                _simRunId);
                            throw new Exception(
                                String.Format("Canceling sim run {0}. The job has been detected as canceled.",
                                    _simRunId), ex);
                        }
                        var signalSetId = response.GetUserData<int>();
                        Console.WriteLine(
                            "Error with signal set id : {0} with error: {1} and stack trace: {2}",
                            signalSetId, ex.Message, ex.StackTrace);
                        failures.Add(signalSetGroupings.ElementAt(response.GetUserData<int>()));
                    }
                }

                statusThread.Abort();
            }

            session.Close();

            // Retry if any groupings failed.
            if (failures.Count() != 0)
            {
                Console.WriteLine("Errors found, failed for {0} signal batch groupings", failures.Count());
                return false;
            }
        }



    Friday, September 23, 2016 1:04 PM
  • Hi Jason,

    session.Close(); <-- this is SessionBase.Close()

    Since there were still calculating requests, I suppose the client.GetResponses enumerator would not end, because not all responses were retrieved; thus session.Close() was not called, and so the session job was not finished. Could you add logs in this client code to confirm this?

    As for why the 14 requests were still calculating, we can enable message level tracing (see the how-to) and add logs in the service code to see whether the requests completed processing without halting on the service side.

    Regards,

    Yutong Sun

    Monday, September 26, 2016 2:14 AM
  • Thanks for the quick response.

    I set it to verbose logging. Is there anything else I need to do? What is the logging in the service code that you are recommending? Just some standard logging at the end of our service code recording that it has completed and should now be reporting back to the head node that it has finished?

      Re: your suggestion on adding logging to the client code, are you just saying to log within the foreach

        foreach (var response in client.GetResponses<XXXYYYZZZResponse>())
        {
            try

    ...

    the responses that we get back, or something else? How exactly does that work? Does it run immediately as a response is received, or only after all of the responses it expects have been returned back from the head node?

    Monday, September 26, 2016 1:37 PM
  • I went through the tasks within the different jobs that failed and haven't been able to find any patterns. The head node is the same across the different jobs (I'm working on getting a different cluster of machines to see if the issue replicates there as well), and some of the tasks ran for 8-10 hours and reported successful, whereas others did not (I didn't dig deep enough to tie the tasks within HPC to what each was running on my end to see how long those should have run before finishing). I had originally thought it was the fastest-finishing tasks that were reporting as finished, but I don't believe that to actually be the case anymore.

    The worker node that reported successful in one job was also one of the tasks that didn't complete in a different job, so it doesn't seem to be specific to a particular worker node. We do have 2 broker nodes, though I didn't see a way to tell which broker node each of the workers connected to, to see if it was somehow related to that.

    Monday, September 26, 2016 2:02 PM
  • Hi,

    Can you explain how the broker nodes work and how load is allocated if we have multiple broker nodes? I was unable to replicate the issue in our QA environment, with the biggest difference being that we have 2 dedicated broker nodes in our production environment, whereas the head node in the QA environment also works as the broker.

    I have 2 jobs which were very similar to each other; both of them finished all the tasks in ~30-40 minutes. One job picked up responses properly (shows all completed, 0 processing) and my code picked up that the job had failures. On the second job, the tasks had all finished within 30-40 minutes as well, but it was never picked up as completed and stayed that way until I canceled the job 10 hours later (each task showed a start time, followed by an end time 10 hours later when the job was cancelled). Unfortunately I had not turned on the SOA logging just yet. It is re-enabled now, so the next time this occurs I will hopefully have usable logs.

    Tuesday, September 27, 2016 4:36 PM
  • I was able to turn on SOA logs and observed the behavior where one of the tasks finished successfully, and one finished but continued to show as processing.

    Here is what I got from the SOA logging:

    The finished one:

    2016-09-29T13:57:21.8527889Z,364,7896,5,"[Session:1761] MessageId = 4b89940a-6390-483a-9138-40b7975e1187, DispatchId = 20c2fd20-811f-4920-afd4-0f9eee35a5f0: [HpcServiceHost]: Request is received."

    2016-09-29T14:49:11.7466403Z,364,7896,5,"[Session:1761] MessageId = 4b89940a-6390-483a-9138-40b7975e1187, DispatchId = 20c2fd20-811f-4920-afd4-0f9eee35a5f0: [HpcServiceHost]: Response is sent back. IsFault = False"

    2016-09-29T14:49:21.0968494Z,364,5944,5,"[Session:1761] [HpcServiceHost]: received request from XXX\BROKERNODE$"

    2016-09-29T14:49:21.1029070Z,364,5368,4,"Host got canceled event."

    The one which showed as running:

    2016-09-29T13:57:18.7755567Z,4400,6672,5,"[Session:1761] MessageId = b497618c-94b3-4e4c-a060-9028c314e3be, DispatchId = abf22b2c-263a-4f1f-9be3-3e6042274ad3: [HpcServiceHost]: Request is received."

    2016-09-29T15:05:44.5367700Z,4400,6672,5,"[Session:1761] MessageId = b497618c-94b3-4e4c-a060-9028c314e3be, DispatchId = abf22b2c-263a-4f1f-9be3-3e6042274ad3: [HpcServiceHost]: Response is sent back. IsFault = False"

    The "Response is sent back" entry at 2016-09-29T15:05:44 matches the time the job finished according to my log, but this one never got the other entries that appear in the finished one's log.

    Within the SOA details page in Cluster Manager, I see the following:

    Request Received: 9/29/2016 9:56:11 AM

    Response Retrieved: N/A

    Dispatch Time: 9/29/2016 9:57:11 AM

    Response Time: N/A

    Response Type: Incomplete

    Thursday, September 29, 2016 3:56 PM
  • Hi Jason,

    Enabling Verbose logging for the service is good. This can help you view the SOA message details, which show when/where the requests are processed.

    You may also add custom logging in your service code, e.g. writing to the console or to local files, which may help confirm that the service has received the requests, done the work and returned the responses.

    For the SOA client logs, you may write to the console or to local files when any response is received and after session.Close() is called. This would help to see whether the SOA client has received all the responses and does close the session. According to the job progress, if it is not 100% and there are still calculating requests, the client should be waiting for the rest of the responses without calling session.Close().
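    For instance, a minimal sketch of such logging, adapted from the client code posted above (simplified in that session.Close() is called right after the loop here, rather than outside the BrokerClient using block):

        int received = 0;
        foreach (var response in client.GetResponses<XXXYYYZZZResponse>())
        {
            // GetResponses yields each response as the broker returns it,
            // so this line is logged per response, not after the whole batch.
            received++;
            Console.WriteLine("{0:u} response #{1} received, user data = {2}",
                DateTime.UtcNow, received, response.GetUserData<int>());
            // ... existing result/failure handling ...
        }
        Console.WriteLine("{0:u} all responses received, calling session.Close()", DateTime.UtcNow);
        session.Close();
        Console.WriteLine("{0:u} session.Close() returned", DateTime.UtcNow);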

    One SOA session is allocated to one broker node. You may find the broker node name from the HPC_BrokerNode property by running the 'job view [job id] /detailed' command. The broker node is responsible for receiving the requests from the SOA client and dispatching them to the service hosts on the compute nodes. The broker nodes are selected in round robin when creating SOA sessions.
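    For example, using the session id from the logs above (a SOA session id is the id of its service job):

        job view 1761 /detailed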

    From the message details, it looks like the broker didn't receive the response from the service host for the request with MessageId = b497618c-94b3-4e4c-a060-9028c314e3be. You mentioned that at 2016-09-29T15:05:44 the job was finished; was it finished by the SOA client? If the SOA session was finished and the broker was unloaded, the broker would not be able to receive the responses from the service hosts.

    Last but not least, for long running requests, you may also check the serviceOperationTimeout setting in the service registration file and make sure it is set to a value large enough for the request to be processed.
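    As an illustration only (the element placement is an assumption to verify against your own service registration file, which may differ by HPC Pack version), the setting would look roughly like this, with the value in milliseconds:

        <microsoft.Hpc.Broker>
          <!-- assumption: serviceOperationTimeout is in milliseconds;
               259200000 ms = 3 days, generous for multi-hour requests -->
          <loadBalancing serviceOperationTimeout="259200000" />
        </microsoft.Hpc.Broker>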

    Regards,

    Yutong Sun

    Friday, September 30, 2016 7:40 AM
  • Thanks, our service code already has logging, so I am able to confirm the individual tasks themselves had finished (this most recent example only had 2 tasks; they were the 2 I listed earlier). Is there anything else in particular to log in the service code?

    RE: "You mentioned at the time 2016-09-29T15:05:44 the job was finished, was it finished by the SOA client?" - Sorry I wasnt clear.   That is when the task finished processing based on our app log for each individual task, not the the entire job.  The job still shows as processing (with that 1 task which had finished as active/processing as well).  The SOA client never picked up that the job was now finished, it was still waiting for all the responses to come in I assume.   The issue seems to be that the task finishes, but somewhere in the chain it does not make it back to the right place where it is marked as completed.    If I manually cancel the task in the job manager, the client picks up the job has been canceled and reports as such.

    The serviceOperationTimeout should be set properly. We have set it very high (3 days), though in the specific example above the tasks in the jobs were much shorter (< 1 hour before they completed).

    Thanks for the help. This issue is very puzzling, so I appreciate any help/suggestions you might have on what to do next.

    Friday, September 30, 2016 12:21 PM
  • We can collect and check the broker logs to see if any error happens when the broker is retrieving the response from the service host on the compute node.

    I posted the method to collect broker logs before; copying it below:

    To collect SOA broker logs (with HPC Pack 2012 R2 Update 2/3) on the broker node, add an attribute PerSessionLogging="1" to the shared listener "SoaListener" in HpcBrokerWorker.exe.config under %CCP_HOME%Bin on the broker nodes, and then restart the HpcBroker service. After that, when a SOA session with id [SessionId] finishes, there will be a file named HpcBrokerWorker_[LogIdentifier]_[SessionId] under the %CCP_DATA%LogFiles\SOA folder on the broker node. All the broker log files for this SOA session are named like HpcBrokerWorker_[LogIdentifier]_*.bin; they are by default 1 MB files.
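    For illustration, the change would look roughly like this in HpcBrokerWorker.exe.config (a sketch only; the existing type and initializeData values of the listener are elided here and should be kept as they are, with only the PerSessionLogging attribute added):

        <system.diagnostics>
          <sharedListeners>
            <!-- existing SoaListener entry; only PerSessionLogging="1" is added -->
            <add name="SoaListener" type="..." initializeData="..."
                 PerSessionLogging="1" />
          </sharedListeners>
        </system.diagnostics>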

    The *.bin log files can be parsed by the built-in command line tool hpctrace.exe. Just run "hpctrace parselog <Bin file>".

    Note that the *.bin file with the largest ending number is by design empty. You may want to search the log by the message Id, as there can be many log entries.


    Regards,

    Yutong Sun

    Sunday, October 9, 2016 9:00 AM
  • Here are the versions I see in the job manager in case this isn't the right version:

    server version 4.4.4864.0
    client version 4.4.4864.0

    I added the line to the broker config and ran a job which had the issue.

    There was the HpcBrokerWorker_[LogIdentifier]_[SessionId].bin file, which was 0 KB.

    There were 2 HpcBrokerWorker_[LogIdentifier]_*.bin files associated with that LogIdentifier. I ran hpctrace parselog and got the following:

    Rendering C:\Program Files\Microsoft HPC Pack 2012\Data\LogFiles\SOA\HpcBrokerWorker_DFCCIJFFG_000000.bin
    *** Skipped 1037488 bytes
      Rendered 139 entries, 0.0MB (0x), 16ms

    But when I opened the corresponding file that was output, it only had the following in it:

    Date(UTC) Time(UTC) Level Source PID TID Message
    10/11/2016 12:39:34.419 i HpcSoa 21560 15492 [Broker Worker 21560] [Main] Try open broker management service at http://localhost:9093/BrokerManagementService/21560.  
    10/11/2016 12:39:34.534 i HpcSoa 21560 15492 [Broker Worker 21560] [Main] Open broker management service succeeded.  

    The timestamp is when the job started. The job had 6 tasks in it, with 4 that showed up as successful in the GUI and 2 that were still processing. I ended up having to cancel the job because it was never finishing, even though the 2 remaining tasks appeared to have finished based on the app log. Should I be expecting more log entries than this?

    Thanks!



    Tuesday, October 11, 2016 3:05 PM
  • The version 4.4.4864.0 is HPC Pack 2012 R2 Update 2, in which the broker per-session logging should work.

    There should have been more entries in the parsed .log file. Could you send the HpcBrokerWorker_DFCCIJFFG_000000.bin file to me via email (yutongs@microsoft.com) so that I can take a look?

    Regards,

    Yutong Sun

    Wednesday, October 12, 2016 4:21 AM
  • Changed the WCF service to just run as the Local System account instead of specifying an account in the service panel.
    Sunday, November 10, 2019 6:19 PM