All tasks in a job in faulted/failed state

  • Question

  • I changed my AD password yesterday, and today none of my jobs work as expected. I then set the username and password explicitly in the SessionStartInfo object when creating the session, but the job still failed.

    When I export the SOA trace for the job, which is in a faulted state with all tasks failed, I see the following error:

    2016-07-13T03:52:49.6568672Z,2792,6184,2,"[Session:214] [SchedulerHelper] Exception throwed while fetch job owner's SID: System.ServiceModel.CommunicationObjectFaultedException: The communication object, System.ServiceModel.Channels.ClientFramingDuplexSessionChannel, cannot be used for communication because it is in the Faulted state.
    
    Server stack trace: 
       at System.ServiceModel.Channels.CommunicationObject.ThrowIfDisposedOrNotOpen()
       at System.ServiceModel.Channels.OutputChannel.BeginSend(Message message, TimeSpan timeout, AsyncCallback callback, Object state)
       at System.ServiceModel.Dispatcher.DuplexChannelBinder.BeginRequest(Message message, TimeSpan timeout, AsyncCallback callback, Object state)
       at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartSend(Boolean completedSynchronously)
       at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.FinishEnsureOpen(IAsyncResult result, Boolean completedSynchronously)
       at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartEnsureOpen(Boolean completedSynchronously)
       at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.FinishEnsureInteractiveInit(IAsyncResult result, Boolean completedSynchronously)
       at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.StartEnsureInteractiveInit()
       at System.ServiceModel.Channels.ServiceChannel.BeginCall(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, TimeSpan timeout, AsyncCallback callback, Object asyncState)
       at System.ServiceModel.Channels.ServiceChannelProxy.InvokeBeginService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
       at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)
    
    Exception rethrown at [0]: 
       at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
       at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
       at Microsoft.Hpc.Scheduler.Session.Internal.Common.ISchedulerAdapterInternalAsync.BeginGetJobOwnerSID(Int32 jobid, AsyncCallback callback, Object asyncState)
       at Microsoft.Hpc.Scheduler.Session.Internal.BrokerLauncher.SchedulerAdapterInternalClient.GetJobOwnerSID(Int32 jobid)
       at Microsoft.Hpc.Scheduler.Session.Internal.BrokerLauncher.SchedulerHelper.<>c__DisplayClass1e.<GetJobOwnerSID>b__1a()
       at Microsoft.Hpc.Scheduler.Session.Internal.BrokerLauncher.RetryHelper`1.InvokeOperation(OperationDelegate operation, ExceptionThrownDelegate onException, RetryPolicy policy)
    RetryCount = 3"
    2016-07-13T03:52:49.6623698Z,1904,8976,4,"[Session:214] [SchedulerHelper] GetUserSID..."


    The job owner is already a member of the HPC Users group on the cluster. What can I do to fix this error?

    Thanks

    Wednesday, July 13, 2016 4:15 AM

All replies

  • Hi SRIRAM R,

    Could you try submitting a normal batch job with the same user credentials and see whether it works? When you specified the username/password in SessionStartInfo, did the client succeed in creating the session, and was the submitted session job running before it failed?

    Regards,

    Yutong Sun

    Wednesday, July 13, 2016 11:18 AM
    Moderator
  • A simple job with one task ("set" as the command line) works when submitted through the IScheduler.SubmitJob() method with explicit user credentials.

    When I specify the username/password in SessionStartInfo, a job is created and quickly moves from configuring to running to finished (failed/faulted state). All requests are faulted (response type) / failed (status), with no information in 'exception details'.

    The exception details in my post above came from 'Export SOA Trace' on the job and from the <headnode>.log file.

    (I am taking the login/password explicitly from the user instead of relying on SetInterfaceMode, because that spawns an additional job, which I thought could be avoided.)

    I also set disablecredentials to true using cluscfg, thinking that credentials might be cached somewhere, but that did not help either.

    preparenodecommand and releasenodecommand execute fine, which leads me to think that whichever process is trying to communicate with my SOA service is looking for the SID information and failing, so the SOA service is not getting loaded and executed.

    Any insight is appreciated.






    • Edited by SRIRAM R Wednesday, July 13, 2016 3:18 PM
    Wednesday, July 13, 2016 1:50 PM
  • Hi SRIRAM R,

    Based on the issue description, we need to look at the broker and session service logs to find out why all the requests (tasks) failed and the session job finished after the password was changed.

    1. Broker service logs are HpcBroker_*.bin files and HpcBrokerWorker_*.bin files under %CCP_DATA%LogFiles\SOA folder on all the broker nodes.

    2. Session service logs are HpcSession_*.bin files under %CCP_DATA%LogFiles\SOA folder on the head node.

    The .bin files can be parsed with the built-in tool using a command line like this: HpcTrace parselog [filename]. The output is a .log file that can be opened as plain text.

    You may also send the .bin logs to me via email (yutongs@microsoft.com) if there are not many of them, so I can help investigate. If there are a lot of .bin files, you may first delete the old ones, repro the issue, and then collect the newly generated ones. If you have multiple broker nodes, you may keep only one online and take the rest offline before the repro, so you don't have to work out which broker node holds the broker logs.

    By the way, the CommunicationObjectFaultedException in the GetUserSID call that you posted can be retried and may not be related. The broker logs are key to understanding why all requests faulted.
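
To convert a whole folder of .bin traces in one pass, the parselog command above could be wrapped in a short script. This is only a minimal sketch: it assumes HpcTrace is on the PATH and is invoked exactly as `HpcTrace parselog <filename>`, as described in this reply. The function names and the folder argument are illustrative and not part of HPC Pack.

```python
import glob
import os
import subprocess

# File name patterns taken from the reply above (broker and session service logs).
DEFAULT_PATTERNS = ("HpcBroker_*.bin", "HpcBrokerWorker_*.bin", "HpcSession_*.bin")


def build_parse_commands(log_dir, patterns=DEFAULT_PATTERNS):
    """Return one ['HpcTrace', 'parselog', <file>] argv list per matching .bin file."""
    commands = []
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
            commands.append(["HpcTrace", "parselog", path])
    return commands


def parse_all(log_dir):
    """Run every command; assumes HpcTrace is on PATH (i.e. an HPC Pack node)."""
    for argv in build_parse_commands(log_dir):
        subprocess.run(argv, check=True)
```

Calling `parse_all` with the SOA log folder of a broker node or the head node would then leave the parsed output next to the .bin files, assuming the tool behaves as described above.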

    Regards,

    Yutong Sun


    Thursday, July 14, 2016 1:38 AM
    Moderator
  • Hi Yutong

    I have uploaded the relevant log files to https://1drv.ms/f/s!AnOkuFPvV8aAivIp4GB22ZDeFbr5Zw

    For testing, you can look at session/job ID 230.

    In one of the broker logs, I noticed Fault = True, as follows:

    07/14/2016 15:56:25.143 v HpcSoa 8548 4180 [Session:230] TaskId = 2220, MessageId = c3e8588b-25db-4b5b-9831-854bb3c6eb82, Fault = True, Received response from service host.  
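
Once the broker logs are parsed to text, every faulted response can be listed rather than spotted by eye. The sketch below assumes the parsed lines follow the same layout as the line quoted above (`[Session:N] TaskId = ..., MessageId = ..., Fault = True`). The regular expression and function name are illustrative, not part of any HPC Pack tooling.

```python
import re

# Pattern modelled on the single broker log line quoted in this post.
FAULT_LINE = re.compile(
    r"\[Session:(?P<session>\d+)\]\s+TaskId = (?P<task>\d+), "
    r"MessageId = (?P<message>[0-9a-fA-F-]+), Fault = (?P<fault>True|False)"
)


def find_faulted_responses(lines):
    """Yield (session_id, task_id, message_id) for each line logged with Fault = True."""
    for line in lines:
        match = FAULT_LINE.search(line)
        if match and match.group("fault") == "True":
            yield (
                int(match.group("session")),
                int(match.group("task")),
                match.group("message"),
            )
```

Passing an open parsed log file to `find_faulted_responses` gives the task and message IDs to correlate with the service host logs on the compute nodes.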

    I have put a trace statement in the constructor of my SOA class and another on the first line of my SOA method. The one in the constructor is hit, but the one in the method is not, and I don't see any errors logged anywhere.

    I have decorated the SOA DLL with [ServiceBehavior(IncludeExceptionDetailInFaults = true)].


    Thursday, July 14, 2016 4:48 PM
  • Hi SRIRAM R,

    I checked the logs. Just as you observed, the broker successfully dispatched the request to the service host and received a fault message from the service host. The broker does not log the content of the fault; however, when the client gets the response from the broker, the fault message content can be retrieved from the response result, provided includeExceptionDetailInFaults="true" is set in the service registration file.
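
A quick way to confirm whether a service registration file actually carries that setting is to scan it for a `serviceDebug` element. The sketch below assumes the file uses the standard WCF configuration layout (a `serviceDebug` element under a service behavior, with no XML namespace on that element). The function name is hypothetical, and the sample in the usage note is not taken from any real registration file.

```python
import xml.etree.ElementTree as ET


def exception_detail_enabled(config_xml):
    """Return True if any <serviceDebug> sets includeExceptionDetailInFaults="true"."""
    root = ET.fromstring(config_xml)
    # iter() walks the whole tree, so the element is found at any depth.
    for element in root.iter("serviceDebug"):
        value = element.get("includeExceptionDetailInFaults", "false")
        if value.strip().lower() == "true":
            return True
    return False
```

The function takes the file contents as a string, for example the text read from the service's .config file, and returns False when the element or attribute is missing.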

    Besides inspecting the fault response message on the client to see why the service host returned a fault, you may also follow the steps below to enable the service host logs and check them on the compute nodes:

    1. Set the service Event Logging Level to Verbose in the Cluster Manager console.

    2. Add the following section to the service registration file to add the SoaListener for the HpcSoa source. You may use the built-in CcpEchoSvc.config as an example:

    <system.diagnostics>
      <sources>
        <source name="HpcSoa" switchValue="All">
          <listeners>
            <remove name="Default" />
            <add name="SoaListener" />
          </listeners>
        </source>
      </sources>
      <sharedListeners>
        <add type="System.Diagnostics.ConsoleTraceListener, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
          name="Console" traceOutputOptions="DateTime">
          <filter type="" />
        </add>
        <add type="Microsoft.Hpc.Trace.HpcTraceListener, Microsoft.Hpc.Trace"
          name="SoaListener"
          initializeData="%CCP_LOGROOT_USR%SOA\HpcServiceHost\%CCP_JOBID%\%CCP_TASKINSTANCEID%\Host"
          FileSizeMB="1"
          MaxAllowedDiskUsageInMB="1000" />
      </sharedListeners>
      <trace autoflush="true" useGlobalLock="false">
        <listeners>
          <remove name="Default" />
          <add name="SoaListener" />
        </listeners>
      </trace>
    </system.diagnostics>

    3. Repro the issue, then go to the compute nodes on which the service hosts ran and cd to %CCP_LOGROOT_USR%SOA\HpcServiceHost\<JOBID>\<CCP_TASKINSTANCEID>. There will be log files named HOST_*.bin, which can be parsed locally by running 'hpctrace parselog <*.bin>'.

    Last but not least, you may also run the built-in Echo service with EchoClient.exe to quickly isolate the problem. EchoClient.exe is under the %CCP_HOME%Bin folder; run, for example, 'EchoClient.exe -h <headnode>' in a command-line window.

    Regards,

    Yutong Sun

    Friday, July 15, 2016 7:14 AM
    Moderator
  • As I said before, I decorated the DLL with [ServiceBehavior(IncludeExceptionDetailInFaults = true)].

    Perhaps that alone does not suffice. I will try setting it in the SOA service config file as well.

    Friday, July 15, 2016 3:53 PM