Asker
Jobs canceled without any message

General Discussion
-
We recently upgraded the HPC cluster from 2008 R2 to 2012 R2, and since then the Excel (SOA) jobs sometimes fail. According to the activity log, a failed job was canceled and started several times, but some jobs that completed successfully were also canceled and started several times. The log doesn't show who canceled the jobs.
The failed jobs usually had requests sent to the cluster that never seemed to be processed. There were incoming requests, but they never progressed.
The job details also contained warnings saying "This sub-task was canceled because it could not be requeued along with the rest of the job. Another sub-task will be created to replace it." I believe this is why the job was canceled and started again and again, as I also saw these warnings in the details of jobs that completed successfully.
But the error message "task canceled during execution" is what eventually terminated the job, correct?
However, no one canceled the job. Does anyone have any idea?
Each time the Excel job failed, the client side printed this error:
"Session is failed or canceled. Please refer to the reason of the session's job for more information.
at Microsoft.Hpc.Scheduler.Session.BrokerResponse`1.get_Result()"
HpcBroker.exe consumed around 930 MB of memory and HpcSession.exe around 980 MB. Even after a job failed or finished, these two processes did not release any memory. Is this correct?
Previously, when a job failed, the log showed everything running normally until calculate responses suddenly stopped coming back, and a few minutes later the job failed.
The recent failed jobs are similar, except that they didn't receive even a single calculate response. So most probably there is some issue with the HPC session, right?
I also noticed several errors from Microsoft\HPC\Scheduler saying "Client 8 was previously registered with version 2.0, now has version 4.2". What does this mean?
Sometimes, after several cancel/start cycles, the job starts processing data and completes successfully. But sometimes it doesn't: when the server hosting the broker node is busy, the client sometimes gets the error below and the job fails.
Broker is unavailable due to loss of heartbeat. Make sure you can connect to the broker node, the HpcBroker service is running on the broker node and the session is still running.
at Microsoft.Hpc.Scheduler.Session.BrokerResponse`1.GetUserData[T]()
- Edited by MChen19th Sunday, 13 November 2016 09.52
Friday, 11 November 2016 05.53
All Replies
-
Hi MChen19th,
It looks like the broker worker (HpcBrokerWorker.exe) on the broker node failed to dispatch requests to the service hosts on the compute nodes. We need to look at the broker worker logs to investigate further. Please follow the steps below to collect the broker worker logs:
1. Stop the broker service on the broker nodes by running 'net stop hpcbroker'. If you have multiple broker nodes, it's better to bring all but one offline and then repro with a single online broker node.
2. Delete all old HpcBrokerWorker_*.bin files under the %CCP_HOME%Data\LogFiles\SOA on the broker nodes. Then start the broker service by 'net start hpcbroker'.
3. Repro this issue, then collect the failed SOA job details by 'job view <jobId> /detailed' and all the HpcBrokerWorker_*.bin files on the broker nodes, and send the zip file to me via yutongs@microsoft.com for investigation.
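The file-gathering part of step 3 can be sketched in Python. This is only a minimal illustration, not an official HPC Pack tool: the function name `collect_broker_logs` and the output archive name are hypothetical, and it assumes the logs sit in the directory named in step 2. Stopping and starting the broker service and running 'job view' are not covered.

```python
import glob
import os
import zipfile


def collect_broker_logs(log_dir, zip_path, pattern="HpcBrokerWorker_*.bin"):
    """Zip every file in log_dir whose name matches pattern.

    Hypothetical helper for illustration only. Returns the sorted list of
    archived file names so the caller can check that something was found.
    """
    matches = sorted(glob.glob(os.path.join(log_dir, pattern)))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in matches:
            # Store only the base name so the archive has a flat layout.
            archive.write(path, arcname=os.path.basename(path))
    return [os.path.basename(path) for path in matches]
```

On a broker node it could be called as `collect_broker_logs(os.path.join(os.environ["CCP_HOME"], "Data", "LogFiles", "SOA"), "broker_worker_logs.zip")`, assuming the CCP_HOME environment variable is set there. An empty return value means no matching log files were found.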
Btw, could you confirm the 2012 R2 version in the HPC Cluster Manager GUI? Is it the latest Update 3 + KB3161422 (build 4.5.5111.0)? What's the Excel version on the compute nodes, 2010 or 2013? Is the workload Excel VBA or UDF?
Regards,
Yutong Sun
Friday, 18 November 2016 03.56 -
Thank you very much for looking into this. Here is the info that I can get:
It's build 4.2.4400.0
Excel 2010
Workbook offload mode
I'll pass your instructions on to the Ops team so they can gather the logs from the broker node.
Thursday, 24 November 2016 02.15 -
BTW, the client creates the session and broker client with HPC Pack 2008 R2. Could this be an issue?
Thursday, 24 November 2016 02.25
-
We changed the WCF service to just run as the local system account instead of specifying an account in the service panel.
Sunday, 10 November 2019 18.16