Jobs canceled without any message

  • General discussion

  • We recently upgraded our HPC cluster from 2008 R2 to 2012 R2, and since then the Excel (SOA) jobs sometimes fail. According to the activity log, a failed job was canceled and started again several times, but some of the jobs that completed successfully were also canceled and started again several times. The log doesn't show who canceled the jobs.

    The failed jobs usually had requests sent to the cluster that never seemed to be processed. There were incoming requests, but they never progressed.

    In the job details, there were also warnings saying “This sub-task was canceled because it could not be requeued along with the rest of the job.  Another sub-task will be created to replace it.” I believe this is why the job was canceled and started again and again, as I also saw these warnings in the details of the jobs that completed successfully.

    But the error message "task canceled during execution" is what eventually terminated the job, correct?
    However, no one canceled the job. Does anyone have any idea?

    Each time the Excel job failed, the client side printed this error:

    "Session is failed or canceled. Please refer to the reason of the session's job for more information.
       at Microsoft.Hpc.Scheduler.Session.BrokerResponse`1.get_Result()"

    HpcBroker.exe consumed around 930 MB of memory and HpcSession.exe consumed around 980 MB. Even after a job failed or finished, these two processes did not release any memory. Is this expected? Previously, when a job failed, the log showed everything running fine, then suddenly no more calculation responses came back, and a few minutes later the job failed.
    The recent failed jobs are similar, except that they did not receive even a single calculation response. So most probably there is some issue with the HPC session, right?

    I also noticed several errors from Microsoft\HPC\Scheduler saying "Client 8 was previously registered with version 2.0, now has version 4.2". What does this mean?

    Sometimes, after several cancel/start cycles, the job could start processing data and complete successfully. But sometimes it couldn't. When the server hosting the broker node was busy, the job sometimes failed with the error below:

    Broker is unavailable due to loss of heartbeat. Make sure you can connect to the broker node, the HpcBroker service is running on the broker node and the session is still running.
       at Microsoft.Hpc.Scheduler.Session.BrokerResponse`1.GetUserData[T]()

    • Edited by MChen19th Sunday, November 13, 2016 9:52 AM
    Friday, November 11, 2016 5:53 AM

All replies

  • Hi MChen19th,

    It looks like the broker worker (HpcBrokerWorker.exe) on the broker node failed to dispatch requests to the service hosts on the compute nodes. We need to look at the broker worker logs to investigate further. Please follow the steps below to collect the broker worker logs:

    1. Stop the broker service on the broker nodes by running 'net stop hpcbroker'. If you have multiple broker nodes, it's better to bring all but one offline and then repro with a single online broker node.

    2. Delete all old HpcBrokerWorker_*.bin files under %CCP_HOME%Data\LogFiles\SOA on the broker nodes. Then start the broker service by running 'net start hpcbroker'.

    3. Repro the issue, then collect the failed SOA job details with 'job view <jobId> /detailed' and all the HpcBrokerWorker_*.bin files on the broker nodes. Send the zip file to me at yutongs@microsoft.com for investigation.
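
    The file handling in steps 2 and 3 can be scripted. Below is a minimal Python sketch, not an official HPC Pack tool: the function names are hypothetical, only the HpcBrokerWorker_*.bin pattern comes from the steps above, and stopping or starting the service and running 'job view' are still done by hand at the command prompt.

```python
import fnmatch
import zipfile
from pathlib import Path

# File name pattern taken from the steps above; all function names are illustrative.
LOG_PATTERN = "HpcBrokerWorker_*.bin"


def find_broker_logs(log_dir):
    """Return the broker worker log files found directly in log_dir, sorted by name."""
    return sorted(
        p for p in Path(log_dir).iterdir()
        if p.is_file() and fnmatch.fnmatch(p.name, LOG_PATTERN)
    )


def delete_old_logs(log_dir):
    """Step 2: remove stale logs. Run this only while the broker service is stopped."""
    removed = []
    for path in find_broker_logs(log_dir):
        path.unlink()
        removed.append(path.name)
    return removed


def zip_logs(log_dir, zip_path, job_details_file=None):
    """Step 3: bundle the fresh logs, plus an optional saved job details file, into one zip."""
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in find_broker_logs(log_dir):
            archive.write(path, arcname=path.name)
            names.append(path.name)
        if job_details_file is not None:
            details = Path(job_details_file)
            archive.write(details, arcname=details.name)
            names.append(details.name)
    return names
```

    On a broker node the log directory could be resolved with os.path.expandvars(r"%CCP_HOME%Data\LogFiles\SOA"), assuming CCP_HOME is set in the environment. The job details file is assumed to be the output of 'job view <jobId> /detailed' saved to a text file beforehand.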

    Btw, could you confirm the 2012 R2 version in the HPC Cluster Manager GUI? Is it the latest Update 3 + KB3161422 (build 4.5.5111.0)? What's the Excel version on the compute nodes, 2010 or 2013? Is the workload Excel VBA or UDF?

    Regards,

    Yutong Sun

    Friday, November 18, 2016 3:56 AM
  • Thank you very much for looking into this. Here is the info that I could get:

    It's build 4.2.4400.0

    Excel 2010

    Workbook offload mode

    I'll pass your instructions along to the Ops team so they can gather the logs from the broker node.

    Thursday, November 24, 2016 2:15 AM
  • BTW, the client is creating the session and broker client with HPC Pack 2008 R2. Could this be an issue?
    Thursday, November 24, 2016 2:25 AM
  • Changed the WCF service to just run as local system account instead of specifying it in the service panel.

    Sunday, November 10, 2019 6:16 PM