Task submitted to a machine but processing never starts?

  • Question

  • Hi,

    We've been using the HPC setup for a while and it has worked great. For the last couple of Mondays, jobs have been submitted Monday morning, one of them gets assigned to a particular machine, and based on our app logs, the process never gets started. Eventually it hits the 30-minute timeout we have set, and we get an error saying the timeout has been reached. Rebooting the machine seems to put it back into a good state and lets it be used again without issue. We used it successfully all of last week, but the machine or HPC service seems to have gotten into a state where it shows up to the head node as an active worker node but is unable to actually process any jobs.

    Let me know if you need any logs, etc. We are running what I believe is the latest version, 2012 R2.

    Thanks!

    Monday, March 14, 2016 12:11 PM

All replies

  • Here is what I see listed under the results/output for the task. There is only 1 job in this task (and we have it set to exclusive, so there should only ever be 1 job/task running on any 1 machine).

    ===

    HpcSoa Information: 10011 : HpcServiceHost entry point is called.
    HpcSoa Information: 11 : [Session:37483] OnAzure = False
    HpcSoa Information: 11 : [Session:37483] ServiceConfigFullPath = \\MACHINE-NAME-HERE\HpcServiceRegistration\ConfigName.config
    HpcSoa Information: 11 : [Session:37483] Sleep...

    Monday, March 14, 2016 12:17 PM
  • Hi Jason,

    If you saw this task output from the SOA job, the HpcServiceHost on the compute node started successfully and was waiting to process SOA requests. Please check the client side for the symptom behind "the process never gets started": did it fail to send the SOA requests, or to get the SOA responses? You can also double-check the "Number of Requests", "Succeeded Requests", "Calculating Requests" and "Failed Requests" properties of the SOA job from the GUI console or the job command line. If SOA message-level tracing is enabled (how-to), you can also check the detailed message logs for each request.

    If nothing interesting is found above, the SOA logs are needed to investigate.

    1) To collect Session API logs, please add the following under the <configuration> section in your [Application].exe.config:
    <system.diagnostics>
      <trace autoflush="true"/>
      <sharedListeners>
        <add name="xml" type="System.Diagnostics.XmlWriterTraceListener" initializeData="c:\TEMP\session.svclog"/>
      </sharedListeners>
      <sources>
        <source name="SOA Session API" switchValue="All">
          <listeners>
            <remove name="Default"/>
            <add name="xml"/>
          </listeners>
        </source>
      </sources>
    </system.diagnostics>

    2) To collect SOA broker logs (with HPC Pack 2012 R2 Update 2/3), add the attribute PerSessionLogging="1" to the shared listener “SoaListener” in HpcBrokerWorker.exe.config under %CCP_HOME%Bin on the broker nodes, and then restart the HpcBroker service. After that, when a SOA session with id [SessionId] finishes, there will be a file named HpcBrokerWorker_[LogIdentifier]_[SessionId] under the %CCP_DATA%LogFiles\SOA folder on the broker node. All the broker log files for this SOA session are named like HpcBrokerWorker_[LogIdentifier]_*.bin; each is 1 MB by default.

    3) To collect the SOA service host logs on the compute node, first add the following under the <configuration> section in the SOA service registration file, and then set the SOA service's Event logging level to Verbose,ActivityTracing. After reproducing the issue, go to the compute node and cd to %CCP_LOGROOT_USR%SOA\HpcServiceHost\<JOBID>\<CCP_TASKINSTANCEID>; the service host log files HOST_*.bin should be there.

    <system.diagnostics>
      <sources>
        <source name="HpcSoa" switchValue="All">
          <listeners>
            <remove name="Default" />
            <add name="SoaListener" />
          </listeners>
        </source>
      </sources>
      <sharedListeners>
        <add type="System.Diagnostics.ConsoleTraceListener, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
          name="Console" traceOutputOptions="DateTime">
          <filter type="" />
        </add>
        <add type="Microsoft.Hpc.Trace.HpcTraceListener, Microsoft.Hpc.Trace"
          name="SoaListener"
          initializeData="%CCP_LOGROOT_USR%SOA\HpcServiceHost\%CCP_JOBID%\%CCP_TASKINSTANCEID%\Host"
          FileSizeMB="4"
          MaxAllowedDiskUsageInMB="1000" />
      </sharedListeners>
      <trace autoflush="true" useGlobalLock="false">
        <listeners>
          <remove name="Default" />
          <add name="SoaListener" />
        </listeners>
      </trace>
    </system.diagnostics>

    The *.bin log files can be parsed by the built-in command line tool hpctrace.exe. Just run "hpctrace parselog <Bin file>". You may also zip the logs and send them to me via yutongs@microsoft.com for analysis.
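
If there are many *.bin files, the parse step can be scripted. The following is only a minimal sketch: it assumes hpctrace is on the PATH and that "hpctrace parselog" writes the decoded log to standard output (the thread does not confirm where the output goes). The function names and the ".txt" output suffix are made up for illustration.

```python
import glob
import os
import subprocess


def build_parse_commands(log_dir, pattern="*.bin"):
    """Return one ['hpctrace', 'parselog', <file>] command per matching log file."""
    files = sorted(glob.glob(os.path.join(log_dir, pattern)))
    return [["hpctrace", "parselog", path] for path in files]


def parse_all(log_dir, pattern="*.bin"):
    """Run each command and save whatever it prints next to the .bin file.

    Assumption: hpctrace prints the parsed log to stdout. If it does not,
    the .txt files will simply be empty.
    """
    for cmd in build_parse_commands(log_dir, pattern):
        out_path = cmd[-1] + ".txt"  # hypothetical naming choice
        with open(out_path, "w", encoding="utf-8") as out:
            subprocess.run(cmd, stdout=out, check=False)
        print("parsed", cmd[-1], "->", out_path)
```

For example, calling parse_all on the folder that holds the HOST_*.bin files from step 3 would run the command once per file; pass pattern="HpcBrokerWorker_*.bin" for the broker logs from step 2.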

    BR,

    Yutong Sun



    Tuesday, March 15, 2016 8:24 AM
    Moderator
  • Thanks, I'll look at getting this information. We have seen this happen on 2 different HPC clusters starting 2 weeks ago, so we will try to get the trace information. It started on both clusters the Monday after we patched the majority of these machines, so I'm guessing that is somehow related.
    Tuesday, March 15, 2016 12:12 PM