I have a lot of pending jobs -- why won't they start?

  • Question

  • My Compute Cluster Administrator looks like this:

    Compute Nodes:
      Ready Nodes: 113
      Paused Nodes: 2
      Unreachable Nodes: 1
      Pending for Approval Nodes: 0

    Compute Jobs:
      Running jobs: 20
      Pending jobs: 1613
      Total jobs in queue: 189993
      Failed jobs: 58
      Cancelled jobs: 2
      Finished jobs: 188300

    Flipping over to the "Node Management" screen, I see that I have several hundred cores free.

    If this were Torque/Moab, I would check that the queue is active, that there are no reservations covering the open nodes, and that the user's account has sufficient funds to cover the operation...  But I don't have any of these buttons in the Compute Cluster Administrator or the Compute Cluster Job Manager.

    When I look in the Compute Cluster Job Manager, there aren't notes in the "Pending Reason" field for most of the jobs.  The ones that do have a note say something to the effect of "a job of equal priority was queued earlier".

    I've done a cursory check of the system's Application Log and also "C:\Program Files\Microsoft Compute Cluster Pack\LogFiles", but nothing jumps out at me.  I'm unfamiliar with the error messages that are normally generated, though, so I don't really know what to grep for.
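
    For what it's worth, the best I've come up with so far is a blind keyword sweep along these lines (the keywords are just guesses on my part):

    findstr /s /i "error exception fail" "C:\Program Files\Microsoft Compute Cluster Pack\LogFiles\*"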

    The "node list" shows that 113 nodes are "READY", with nearly all of their CPUs "IDLE".

    What can cause 1613 jobs to all be held and never launched?  Is there anything in particular that I should be looking for in the logs, or anywhere else?  Any ideas would be greatly appreciated!
    Friday, May 22, 2009 12:31 AM

Answers

  • Yes, the scheduler service was restarted several times, and the machine was even rebooted.  For some reason, it just started running jobs again this morning, even though the scheduler crashed several more times.  We're back in production, but I would like to understand what happened so that I can prevent it from happening in the future.

    When it crashes, eventvwr's Application log shows a .Net "unhandled exception" error.  We're back in production now, and that error has since rotated off the end of the event log.
    • Marked as answer by Luke Scharf Wednesday, May 27, 2009 8:46 PM
    Friday, May 22, 2009 9:37 PM

All replies

  • Is the scheduler service running? If not, try starting it and see if your jobs begin to execute. If it is already running, try restarting it and see whether your jobs start.
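
    If you're not sure of the exact service name (it varies between versions), something along these lines from an elevated command prompt on the head node should find it and restart it (the name in angle brackets is just a placeholder, not a real service name):

    sc query state= all | findstr /i "scheduler"
    net stop "<scheduler service name from the output above>"
    net start "<scheduler service name from the output above>"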

    Steve Chilcoat
    Friday, May 22, 2009 12:41 AM
  • The other thing to check is to make sure that the UI is reporting the node information correctly, and that you don't have nodes that have gone unreachable. We have seen cases where the UI and the underlying system were out of sync here.

    The quickest way to check is to go to a command window and type:
    node list   (or node list | findstr /is unreach to filter out only unreachable nodes)
    Friday, May 22, 2009 4:19 PM
  • The "node list" output had enough idle nodes (around 100)  to run /something/.  Coming from the Unix world, I rarely take GUIs at face value!  :-)
    Friday, May 22, 2009 9:38 PM
  • Hi Luke

    Probably the next thing to do is take a look at the database tables to see if there is a job stuck somewhere that is stalling things.

    To do this, can you run the following queries and see if they report anything unexpected:

    select jobid , count(*) from resources
    where jobid <> 0
    group by jobid
    order by jobid


    select state, count(*) from resources
    group by state
    order by state


    If you have full SQL Server, you can use SQL Server Management Studio (sqlwb.exe) to run the queries. For SQL Express the easiest way is to use the built-in sqlcmd executable from an admin command prompt (this example assumes you are running on the head node):

    sqlcmd -S .\computecluster -E -d ccpclusterservice -Q "select jobid, count(*) from resources where jobid <> 0 group by jobid order by jobid"
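
    The same wrapper works for the second (state) query:

    sqlcmd -S .\computecluster -E -d ccpclusterservice -Q "select state, count(*) from resources group by state order by state"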

    For the state query, these are the current set of states we recognise for resources:

     Offline                 = 0x0,
     Idle                    = 0x1,
     ScheduledReserve        = 0x2,
     JobScheduled            = 0x4,
     ReadyForTask            = 0x8,
     TaskScheduled           = 0x10,
     JobTaskScheduled        = 0x20,
     TaskDispatched          = 0x40,
     JobTaskDispatched       = 0x80,
     TaskRunning             = 0x100,
     CloseTask               = 0x200,
     CloseTaskDispatched     = 0x400,
     TaskClosed              = 0x800,
     CloseJob                = 0x1000,

    If you do see either nodes in an unexpected state or jobs running that shouldn't be, we can dig in some more as to the root causes.

    cheers
    jeff
    Thursday, May 28, 2009 1:49 AM
  • Apologies for resurrecting a very old thread, but I've been experiencing this exact behaviour, and might have some light to shed on it - I'm in MS HPC 2008 R2.

    We have nine 8-core machines in a nodegroup called "8core", and another 15 or so 16-core machines in a group called "16core". Someone has submitted a very large number of jobs (~50,000) queued for the 8core nodegroup, and then another 50,000 for the 16core group. The first 72 jobs are all running correctly on the 8-core nodes, but the 16core jobs won't start up; they've been queuing for a day.

    I suspect there may be a problem because there are a large number of jobs queuing; I noticed in Luke's first mail here that he had nearly 190,000 jobs in his queue. Are there any conditions under which the scheduler aborts processing the queue part way through, without reaching, say, the 100,000th job, which may be a candidate to run? For instance, if another job gets submitted while the scheduler is processing the queue, does it abort and start again later?

    Just looking for theories... the nodes themselves are all fine, confirmed both through command-line tools and the GUI.

    Thanks,

    Wes

    Sunday, December 23, 2012 8:11 PM
  • From the information you posted on May 22nd I can see there are 189993 queued jobs in your scheduler. I think this overwhelms the scheduler process. From my experience, a few thousand queued jobs can crash the scheduler process repeatedly; it happens to me all the time. It may be related to SQL Server, though: Express Edition works with only 1 core, whereas Enterprise Edition may use all available cores.
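
    If you are not sure which edition of SQL Server your cluster database is running on, something like this run on the head node will tell you (assuming the same .\computecluster instance name used earlier in this thread):

    sqlcmd -S .\computecluster -E -Q "select serverproperty('Edition'), serverproperty('ProductVersion')"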

    Daniel Drypczewski

    Thursday, December 27, 2012 9:19 AM