Service job for an interactive session failed due to its corresponding broker worker process unexpectedly terminating.

  • Question

  • Hello,

    I have 1 head node, 2 broker nodes, and some compute nodes on HPC Pack 2016 (version 5.1.6086.0).

    I submit SOA jobs and 10% of them fail with this SOA trace error: Service job for an interactive session failed due to its corresponding broker worker process unexpectedly terminating.

    Thanks

    Tuesday, February 20, 2018 4:25 PM

All replies

  • Hi Priorum,

    Can you help us collect broker worker logs from the 2 broker nodes? They are located at %CCP_LOGROOT_SYS%SOA and named like HpcBrokerWorker_*.bin.

    You can send your logs to hpcpack@microsoft.com.

    Thanks,
    Zihao

    Thursday, February 22, 2018 5:47 AM
  • Hello,

    Thanks for the reply.

    I have found something strange: when a compute node's connectivity becomes "node manager connectivity unreachable", all the broker nodes reject client connections.

    Maybe it is caused by a parameter in a .config file.

    Best regards

    Thursday, February 22, 2018 10:03 AM
  • I can see this on the broker:

    [Session:98461] [BrokerInfo] Failed to kill broker worker process: System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
       at Microsoft.Hpc.Scheduler.Session.Internal.BrokerLauncher.BrokerProcess.KillBrokerProcess()
       at Microsoft.Hpc.Scheduler.Session.Internal.BrokerLauncher.BrokerInfo.OnCloseBroker(IAsyncResult result)

    Thursday, February 22, 2018 12:59 PM
  • Hi Priorum,

    The error you see happens after the broker worker has already been determined to be closed, so I think it is unrelated to the failure.

    It is strange that a compute node losing its connectivity would cause the broker worker to be killed. We still need the broker worker logs I mentioned before to check what was going on at that time.

    Thanks,
    Zihao

    Friday, February 23, 2018 12:39 AM
  • Hi Priorum,

    I checked the log files you sent. Some of them are empty, and the last session they record is session 48926, which is far behind session 98461.

    The reason is that the log files with the largest serial numbers are usually pre-created and empty. Please collect more log files and make sure they contain the sessions in question.

    Thanks,
    Zihao

    Friday, February 23, 2018 1:05 AM
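
For the log collection requested above, a small script can help separate broker worker log files that hold data from the pre-created empty ones before sending them. The Python sketch below is only an illustration and is not part of HPC Pack. It assumes that the `CCP_LOGROOT_SYS` environment variable is set on the broker node, that the logs sit in its `SOA` subfolder, and that "empty" means a file size of zero bytes. The function name `find_broker_worker_logs` is hypothetical.

```python
import os
from pathlib import Path


def find_broker_worker_logs(log_dir):
    """Split HpcBrokerWorker_*.bin files in log_dir into non-empty and empty lists."""
    non_empty, empty = [], []
    # Sort by name so the files come out in a stable order.
    for path in sorted(Path(log_dir).glob("HpcBrokerWorker_*.bin")):
        if path.stat().st_size > 0:
            non_empty.append(path)
        else:
            empty.append(path)
    return non_empty, empty


if __name__ == "__main__":
    # Assumption: CCP_LOGROOT_SYS is defined and the logs are in its SOA subfolder.
    root = os.environ.get("CCP_LOGROOT_SYS")
    if root is None:
        print("CCP_LOGROOT_SYS is not set; pass the log folder explicitly instead.")
    else:
        logs, skipped = find_broker_worker_logs(os.path.join(root, "SOA"))
        for log in logs:
            print(f"{log.name}\t{log.stat().st_size} bytes")
        print(f"{len(skipped)} empty file(s) skipped")
```

The script only checks file size. The `.bin` files are not parsed here, so confirming that they cover the sessions in question (for example session 98461) still has to be done separately.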