Tasks not processing even though there are open nodes?

  • Question

  • Hi,

    I'm having some issues with HPC Pack 2012 R2 and was hoping I could get some assistance. I have a job consisting of 37 tasks, each very long-running (12 to 24 hours), and 25 nodes configured in our fleet. I would expect 25 tasks to be allocated (one per node, which is how we have it configured) and, once any of those 25 tasks finishes, another task to be sent to that node and started.

    At times, this isn't the behavior we're seeing. After the job had run for about 12 hours, tasks started finishing and new ones were started. When I looked again a bit later, there seemed to be open nodes as more tasks finished, but the additional tasks were not started as expected.

    Within the HPC Job Manager, if I open the job, I see:

    Progress: 54%

    Total Requests: 37

    Succeeded: 20

    Calculating: 17

    This seems correct, except that not all 17 tasks actually seem to be running. If I switch to "View Tasks", I see my 25 allocated nodes, but only 11 of them show up in the Running state (the other 14 show as Finished). So there seem to be 6 tasks that one screen reports as calculating but that, in the task view, do not seem to have started at all. I've confirmed on our app end that the 6 tasks have not started (or at least didn't get far enough to write any logs).

    Any thoughts?   




    Tuesday, May 24, 2016 4:06 PM

Answers

  • Hi Jason,

      It looks like you're running an SOA job instead of a batch job, right? If so, please check the value of "serviceRequestPrefetchCount" in your service registration file. In your case you need to set it to 0 instead of the default value of 1. (See https://technet.microsoft.com/en-us/library/ff943786(v=ws.10).aspx )
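
    For illustration, the setting lives on the loadBalancing element of the service registration file. The fragment below is a minimal sketch, not a complete file: the enclosing element name microsoft.Hpc.Session.ServiceRegistration is an assumption recalled from the sample registration files and should be checked against the sample in the link above. With a prefetch count of 1, each service host may hold one extra request queued behind the one it is processing, which would be consistent with requests that are reported as calculating but have not started.

    ```xml
    <!-- Sketch only: the enclosing element name is assumed from the sample file. -->
    <microsoft.Hpc.Session.ServiceRegistration>
      <!-- 0 = hand a service host a new request only when it is idle -->
      <loadBalancing serviceRequestPrefetchCount="0" />
    </microsoft.Hpc.Session.ServiceRegistration>
    ```

    Any other attributes already present on loadBalancing in your file can stay as they are; only the prefetch count changes.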

     


    Qiufang Shi

    Thursday, May 26, 2016 2:35 AM

All replies

  • Yes, this is an SOA job. Thanks, I will give that a shot and see if it resolves our issue.

    Thursday, May 26, 2016 2:57 PM
  • This didn't seem to work as expected. I set the values based on the sample in the link, except that I set the prefetch count to 0 and raised the timeout value from 24 hours to 72 (whatever the millisecond equivalent is). The job seems to submit fine, but no tasks actually get started. It just sits there, and after about 15 minutes it throws a session exception saying the job has been detected as canceled. Unfortunately, I haven't been able to log in to the job manager to see if there is more information. I rebooted the head node in case that would help, but I got the same error again. I'll pull the job manager information and post it here once I get it. Thanks -Jason
    Friday, May 27, 2016 12:24 AM
  • Setting serviceRequestPrefetchCount to zero should work. What's the timeout parameter you set?

    BR,

    Yutong Sun

    Friday, May 27, 2016 6:41 AM
    Moderator
  • Thanks, I figured it out once I was able to get into the job manager. The service registration file had an issue and was throwing a parsing error.

    I did not include the 

          <section name="loadBalancing"
                   type="Microsoft.Hpc.Scheduler.Session.Configuration.LoadBalancingConfiguration, Microsoft.Hpc.Scheduler.Session, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                   allowDefinition="Everywhere"
                   allowExeDefinition="MachineToApplication"
                   />

    within the sectionGroup element. Once I added that, tasks started processing fine.
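
    For anyone hitting the same parsing error, here is a rough sketch of where the declaration sits, assuming the layout of the sample service registration file. The sectionGroup name and its type string, and the type string of the existing service section, are recalled from that sample rather than taken from this thread, so verify them against your own file. The serviceOperationTimeout attribute is assumed to be the timeout mentioned earlier; 72 hours is 72 x 60 x 60 x 1000 = 259200000 ms.

    ```xml
    <configuration>
      <configSections>
        <!-- Assumed group name and type strings; compare with your sample file. -->
        <sectionGroup name="microsoft.Hpc.Session.ServiceRegistration"
                      type="Microsoft.Hpc.Scheduler.Session.Configuration.ServiceRegistration, Microsoft.Hpc.Scheduler.Session, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35">
          <section name="service"
                   type="Microsoft.Hpc.Scheduler.Session.Configuration.ServiceConfiguration, Microsoft.Hpc.Scheduler.Session, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                   allowDefinition="Everywhere"
                   allowExeDefinition="MachineToApplication"
                   />
          <!-- The declaration that was missing; without it the loadBalancing element below cannot be parsed. -->
          <section name="loadBalancing"
                   type="Microsoft.Hpc.Scheduler.Session.Configuration.LoadBalancingConfiguration, Microsoft.Hpc.Scheduler.Session, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"
                   allowDefinition="Everywhere"
                   allowExeDefinition="MachineToApplication"
                   />
        </sectionGroup>
      </configSections>
      <microsoft.Hpc.Session.ServiceRegistration>
        <!-- The existing service element stays here, unchanged. -->
        <!-- Prefetch disabled; timeout of 72 hours expressed in milliseconds (assumed attribute). -->
        <loadBalancing serviceRequestPrefetchCount="0"
                       serviceOperationTimeout="259200000" />
      </microsoft.Hpc.Session.ServiceRegistration>
    </configuration>
    ```

    This follows the general .NET configuration rule that any custom section used in a file must first be declared under configSections.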

    Thanks for the help, appreciate it!

    Friday, May 27, 2016 10:46 AM