Answered: I have a lot of pending jobs -- why won't they start?

  • May 22, 2009 0:31, Luke Scharf
     
    My Compute Cluster Administrator looks like this:

    Compute Nodes:
      Ready Nodes: 113
      Paused nodes: 2
      Unreachable Nodes: 1
      Pending for Approval nodes: 0

    Compute Jobs:
      Running jobs: 20
      Pending jobs: 1613
      Total jobs in queue: 189993
      Failed jobs: 58
      Cancelled jobs: 2
      finished jobs: 188300

    Flipping over to the "Node Management" screen, I see that I have several hundred cores free.

    If this were Torque/Moab, I would check to make sure that the queue is active, make sure that there are no reservations covering open nodes, make sure that the user's hat has sufficient funds to cover the operation...  But I don't have any of these buttons in the Compute Cluster Administrator, or the Compute Cluster Job Manager.

    When I look in the Compute Cluster Job Manager, there aren't notes in the "Pending Reason" field for most of the jobs.  The ones that do have a note say something to the effect of "a job of equal priority was queued earlier".

    I've done a cursory check of the system's Application Log and also "C:\Program Files\Microsoft Compute Cluster Pack\LogFiles", but nothing jumps out at me.  I'm unfamiliar with the error messages that are normally generated, though, so I don't really know what to grep for.

    The "node list" shows that 113 nodes are "READY", with nearly all of their CPUs "IDLE".

    What can cause 1613 jobs to all be held and not launched?  Is there anything that I should be looking for in the logs?  Any ideas would be greatly appreciated!

All replies

  • May 22, 2009 0:41, Steve Chilcoat
     
    Is the scheduler service running? If not, try to start it and see if your jobs start to execute. If it is running, try restarting it and see if your jobs start to execute.

    Steve Chilcoat
  • May 22, 2009 16:19, Jeff Baxter
     
    The other thing to check is that the UI is reporting the node information correctly, and that you don't have nodes that have gone unreachable. We have seen cases where the UI and the underlying system were out of sync here.

    The quickest way to check is to go to a command window and type:
    node list  (or node list | findstr /is unreach to show only the unreachable nodes)
  • May 22, 2009 21:37, Luke Scharf
     Marked as answer
    Yes, the scheduler service was restarted several times, and the machine was even rebooted.  For some reason, it just started running jobs again this morning, even though the scheduler crashed several more times.  We're back in production, but I would like to understand what happened so that I can prevent it from happening in the future.

    When it crashes, eventvwr's application log shows a .Net "unhandled exception" error.  We're back in production now, and that error has been rotated off of the back of the event log.
  • May 22, 2009 21:38, Luke Scharf
     
    The "node list" output had enough idle nodes (around 100)  to run /something/.  Coming from the Unix world, I rarely take GUIs at face value!  :-)
  • May 28, 2009 1:49, Jeff Baxter
     
    Hi Luke

    Probably the next thing to do is take a look at the database tables to see if there is a job stuck somewhere that is stalling things.

    To do this, can you run the following queries and see if they report anything unexpected:

    select jobid , count(*) from resources
    where jobid <> 0
    group by jobid
    order by jobid


    select state, count(*) from resources
    group by state
    order by state


    If you have full SQL Server, you can use SQL Server Management Studio (sqlwb.exe) to run the queries. For SQL Express, the easiest way is to use the built-in sqlcmd executable from an admin command prompt (this example assumes you are running on the head node):

    sqlcmd -S .\computecluster -E -d ccpclusterservice -Q "select jobid, count(*) from resources where jobid <> 0 group by jobid order by jobid"
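
The second query can be run the same way. As a rough sketch only, here is one way to wrap both queries in Python, assuming sqlcmd is on the PATH and the script runs on the head node. The helper names build_sqlcmd and run_query are hypothetical; the instance and database names are copied from the command above.

```python
import subprocess

# Instance and database names are taken from the sqlcmd example above.
SERVER = r".\computecluster"
DATABASE = "ccpclusterservice"

JOBS_QUERY = ("select jobid, count(*) from resources "
              "where jobid <> 0 group by jobid order by jobid")
STATE_QUERY = ("select state, count(*) from resources "
               "group by state order by state")


def build_sqlcmd(query):
    """Return the sqlcmd argument list for one query (hypothetical helper)."""
    # -S server\instance, -E Windows authentication,
    # -d database, -Q run the query and exit
    return ["sqlcmd", "-S", SERVER, "-E", "-d", DATABASE, "-Q", query]


def run_query(query):
    """Run one query through sqlcmd and return its text output."""
    result = subprocess.run(build_sqlcmd(query), capture_output=True,
                            text=True, check=True)
    return result.stdout
```

Calling print(run_query(STATE_QUERY)) from an admin prompt should then print the per-state resource counts, provided the account has access to the database.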

    For the state query, this is the current set of states we recognise for resources:

     Offline                 = 0x0,
     Idle                    = 0x1,
     ScheduledReserve        = 0x2,
     JobScheduled            = 0x4,
     ReadyForTask            = 0x8,
     TaskScheduled           = 0x10,
     JobTaskScheduled        = 0x20,
     TaskDispatched          = 0x40,
     JobTaskDispatched       = 0x80,
     TaskRunning             = 0x100,
     CloseTask               = 0x200,
     CloseTaskDispatched     = 0x400,
     TaskClosed              = 0x800,
     CloseJob                = 0x1000,
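
sqlcmd prints the state column as a decimal number, so a small lookup can make the output easier to read. The following is a minimal sketch, not part of the product: state_name is a hypothetical helper, and treating an unlisted value as a combination of the listed bits is only an assumption based on the values looking like bit flags.

```python
# Resource states exactly as listed above (name -> value).
RESOURCE_STATES = {
    "Offline": 0x0,
    "Idle": 0x1,
    "ScheduledReserve": 0x2,
    "JobScheduled": 0x4,
    "ReadyForTask": 0x8,
    "TaskScheduled": 0x10,
    "JobTaskScheduled": 0x20,
    "TaskDispatched": 0x40,
    "JobTaskDispatched": 0x80,
    "TaskRunning": 0x100,
    "CloseTask": 0x200,
    "CloseTaskDispatched": 0x400,
    "TaskClosed": 0x800,
    "CloseJob": 0x1000,
}
_BY_VALUE = {value: name for name, value in RESOURCE_STATES.items()}


def state_name(value):
    """Translate a numeric resource state into a readable name."""
    if value in _BY_VALUE:
        return _BY_VALUE[value]
    # Assumption: an unlisted value may be several listed bits OR-ed together.
    parts = [name for name, bit in RESOURCE_STATES.items()
             if bit and value & bit]
    if parts and sum(RESOURCE_STATES[p] for p in parts) == value:
        return "|".join(parts)
    return "Unknown(0x%X)" % value
```

For example, a row reporting state 256 corresponds to TaskRunning, and a value that matches none of the listed bits comes back as Unknown so it stands out.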

    If you do see either nodes in an unexpected state or jobs running that shouldn't be, we can dig further into the root causes.

    cheers
    jeff