I have a lot of pending jobs -- why won't they start?

  • Friday, May 22, 2009 12:31 AM, Luke Scharf
     
    My Compute Cluster Administrator looks like this:

    Compute Nodes:
      Ready Nodes: 113
      Paused nodes: 2
      Unreachable Nodes: 1
      Pending for Approval nodes: 0

    Compute Jobs:
      Running jobs: 20
      Pending jobs: 1613
      Total jobs in queue: 189993
      Failed jobs: 58
      Cancelled jobs: 2
      finished jobs: 188300

    Flipping over to the "Node Management" screen, I see that I have several hundred cores free.

    If this were Torque/Moab, I would check to make sure that the queue is active, make sure that there are no reservations covering open nodes, make sure that the user's hat has sufficient funds to cover the operation...  But I don't have any of these buttons in the Compute Cluster Administrator, or the Compute Cluster Job Manager.

    When I look in the Compute Cluster Job Manager, there aren't notes in the "Pending Reason" field for most of the jobs.  The ones that do have a note say something to the effect of "a job of equal priority was queued earlier".

    I've done a cursory check of the system's Application Log and also "C:\Program Files\Microsoft Compute Cluster Pack\LogFiles", but nothing jumps out at me.  I'm unfamiliar with the error messages that are normally generated, though, so I don't really know what to grep for.

    The "node list" shows that 113 nodes are "READY", with nearly all of their CPUs "IDLE".

    What can cause all 1613 jobs to be held and not launched?  Is there anything that I should be looking for in the logs?  Any ideas would be greatly appreciated!

All Replies

  • Friday, May 22, 2009 12:41 AM, Steve Chilcoat
     
    Is the scheduler service running? If not, try to start it and see if your jobs start to execute. If it is running, try restarting it and see if your jobs start to execute.

    Steve Chilcoat
  • Friday, May 22, 2009 4:19 PM, Jeff Baxter
     
    The other thing to check is that the UI is reporting the node information correctly and that you don't have nodes that have gone unreachable. We have seen cases where the UI and the underlying system were out of sync here.

    The quickest way to check is to go to a command window and type:
    node list  (or node list | findstr /is unreach to show only the unreachable nodes)
  • Friday, May 22, 2009 9:37 PM, Luke Scharf
     Answer
    Yes, the scheduler service was restarted several times, and the machine was even rebooted.  For some reason, it just started running jobs again this morning, even though the scheduler crashed several more times.  We're back in production, but I would like to understand what happened so that I can prevent it from happening in the future.

    When it crashes, eventvwr's application log shows a .NET "unhandled exception" error.  That error has since been rotated off the back of the event log.
    • Marked As Answer by Luke Scharf, Wednesday, May 27, 2009 8:46 PM
  • Friday, May 22, 2009 9:38 PM, Luke Scharf
     
    The "node list" output had enough idle nodes (around 100)  to run /something/.  Coming from the Unix world, I rarely take GUIs at face value!  :-)
  • Thursday, May 28, 2009 1:49 AM, Jeff Baxter
     
    Hi Luke

    Probably the next thing to do is take a look at the database tables to see if there is a job stuck somewhere that is stalling things.

    To do this, can you run the following queries and see if they report anything unexpected:

    select jobid , count(*) from resources
    where jobid <> 0
    group by jobid
    order by jobid


    select state, count(*) from resources
    group by state
    order by state


    If you have full SQL Server, you can use SQL Server Management Studio (sqlwb.exe) to run the queries. For SQL Express, the easiest way is to use the built-in sqlcmd executable from an admin command prompt (this example assumes you are running on the head node):

    sqlcmd -S .\computecluster -E -d ccpclusterservice -Q "select jobid, count(*) from resources where jobid <> 0 group by jobid order by jobid"
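
    If it is handier to run both queries in one go, the same sqlcmd invocation can be wrapped in a small script. This is only a sketch: the server, database, and flags are copied from the command above, while the helper names (build_sqlcmd, run_all) are made up for illustration, and it assumes Python 3.7+ is available on the head node.

```python
import subprocess

# Values copied from the sqlcmd example above; adjust if your instance differs.
SERVER = r".\computecluster"
DATABASE = "ccpclusterservice"

# The two diagnostic queries from this post, keyed by a descriptive label.
QUERIES = {
    "resources_per_job": (
        "select jobid, count(*) from resources "
        "where jobid <> 0 group by jobid order by jobid"
    ),
    "resources_per_state": (
        "select state, count(*) from resources "
        "group by state order by state"
    ),
}


def build_sqlcmd(query):
    """Return the sqlcmd argument list for one query (hypothetical helper)."""
    return ["sqlcmd", "-S", SERVER, "-E", "-d", DATABASE, "-Q", query]


def run_all():
    """Run each query and print whatever sqlcmd writes back, unparsed."""
    for name, query in QUERIES.items():
        print("== " + name + " ==")
        result = subprocess.run(
            build_sqlcmd(query), capture_output=True, text=True, check=False
        )
        print(result.stdout or result.stderr)
```

    Calling run_all() from an admin prompt on the head node prints the raw output of each query; nothing is parsed, so the results read the same as they would from sqlcmd directly.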

    For the state query, this is the current set of states we recognise for resources:

     Offline                 = 0x0,
     Idle                    = 0x1,
     ScheduledReserve        = 0x2,
     JobScheduled            = 0x4,
     ReadyForTask            = 0x8,
     TaskScheduled           = 0x10,
     JobTaskScheduled        = 0x20,
     TaskDispatched          = 0x40,
     JobTaskDispatched       = 0x80,
     TaskRunning             = 0x100,
     CloseTask               = 0x200,
     CloseTaskDispatched     = 0x400,
     TaskClosed              = 0x800,
     CloseJob                = 0x1000,
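
    The state column will most likely come back from sqlcmd as a plain decimal number, so a small lookup can save some mental hex conversion. The sketch below is a rough aid rather than anything from the product: the names and values are copied from the list above, decode_state is a hypothetical helper, and treating a value as a combination of flags is an assumption based only on the values being powers of two.

```python
# Resource states as listed above (name -> value).
RESOURCE_STATES = {
    "Offline": 0x0,
    "Idle": 0x1,
    "ScheduledReserve": 0x2,
    "JobScheduled": 0x4,
    "ReadyForTask": 0x8,
    "TaskScheduled": 0x10,
    "JobTaskScheduled": 0x20,
    "TaskDispatched": 0x40,
    "JobTaskDispatched": 0x80,
    "TaskRunning": 0x100,
    "CloseTask": 0x200,
    "CloseTaskDispatched": 0x400,
    "TaskClosed": 0x800,
    "CloseJob": 0x1000,
}


def decode_state(value):
    """Return the state name(s) for a numeric value from the state query.

    Assumption: a value may combine several flags; any bits that are not
    in the list above are reported as Unknown(...) instead of being dropped.
    """
    if value == 0:
        return ["Offline"]
    names = [name for name, flag in RESOURCE_STATES.items() if flag and value & flag]
    known = sum(RESOURCE_STATES[name] for name in names)
    if known != value:
        names.append("Unknown(0x%X)" % (value & ~known))
    return names
```

    For example, decode_state(256) returns ['TaskRunning'] and decode_state(1) returns ['Idle'], so a row of "1" with a large count in the state query would mean mostly idle resources.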

    If you do see either nodes in an unexpected state or jobs running that shouldn't be, we can dig in some more as to the root causes.

    cheers
    jeff