I have a lot of pending jobs -- why won't they start?
http://social.microsoft.com/Forums/en-US/windowshpcsched/thread/cc54cbe3-9c79-46c2-b400-a33c964c785e
© 2009 Microsoft Corporation. All rights reserved.
Thread updated: Thu, 28 May 2009 01:49:24 Z

Luke Scharf:

My Compute Cluster Administrator looks like this:

Compute Nodes:
  Ready Nodes: 113
  Paused nodes: 2
  Unreachable Nodes: 1
  Pending for Approval nodes: 0

Compute Jobs:
  Running jobs: 20
  Pending jobs: 1613
  Total jobs in queue: 189993
  Failed jobs: 58
  Cancelled jobs: 2
  Finished jobs: 188300

Flipping over to the "Node Management" screen, I see that I have several hundred cores free.

If this were Torque/Moab, I would check to make sure that the queue is active, make sure that there are no reservations covering open nodes, make sure that the user's hat has sufficient funds to cover the operation...  But I don't have any of these buttons in the Compute Cluster Administrator, or the Compute Cluster Job Manager.

When I look in the Compute Cluster Job Manager, there aren't notes in the "Pending Reason" field for most of the jobs.  The ones that do have a note say something to the effect of "a job of equal priority was queued earlier".

I've done a cursory check of the system's Application Log and also "C:\Program Files\Microsoft Compute Cluster Pack\LogFiles", but nothing jumps out at me.  I'm unfamiliar with the error messages that are normally generated, though, so I don't really know what to grep for.

The "node list" output shows that 113 nodes are "READY", with nearly all of their CPUs "IDLE".

What can cause 1613 jobs to all be held, and not launched?  Is there anything that I should be looking for in the logs?  Any ideas would be greatly appreciated!

Posted: Fri, 22 May 2009 00:31:53 Z

Steve Chilcoat:

Is the scheduler service running? If not, try to start it and see if your jobs start to execute. If it is running, try restarting it and see if your jobs start to execute.

Steve Chilcoat

Posted: Fri, 22 May 2009 00:41:16 Z

Jeff Baxter:

The other thing to check is that the UI is reporting the node information correctly, and that you don't have nodes that have gone unreachable.
We have seen cases where the UI and the underlying system were out of sync here.

The quickest way to check is to go to a command window and type:

    node list

or, to show only the unreachable nodes:

    node list | findstr /is unreach

Posted: Fri, 22 May 2009 16:19:45 Z

Luke Scharf:

Yes, the scheduler service was restarted several times, and the machine was even rebooted.  For some reason, it just started running jobs again this morning, even though the scheduler crashed several more times.  We're back in production, but I would like to understand what happened so that I can prevent it from happening in the future.

When it crashes, eventvwr's application log shows a .NET "unhandled exception" error.  Since then, that error has been rotated off of the back of the event log.

Posted: Fri, 22 May 2009 21:37:22 Z

Luke Scharf:

The "node list" output had enough idle nodes (around 100) to run /something/.  Coming from the Unix world, I rarely take GUIs at face value!
:-)

Posted: Fri, 22 May 2009 21:38:44 Z

Jeff Baxter:

Hi Luke

Probably the next thing to do is take a look at the database tables to see if there is a job stuck somewhere that is stalling things.

To do this, can you run the following queries and see if they report anything unexpected:

    select jobid, count(*) from resources
    where jobid <> 0
    group by jobid
    order by jobid

    select state, count(*) from resources
    group by state
    order by state

If you have full SQL, you can use SQL Enterprise Manager (sqlwb.exe) to run the queries.
For SQL Express, the easiest way is to use the built-in sqlcmd executable from an admin command prompt (this example assumes you are running on the head node):

    sqlcmd -S .\computecluster -E -d ccpclusterservice -Q "select jobid, count(*) from resources where jobid <> 0 group by jobid order by jobid"

For the state query, this is the current set of states we recognise for resources:

    Offline                 = 0x0,
    Idle                    = 0x1,
    ScheduledReserve        = 0x2,
    JobScheduled            = 0x4,
    ReadyForTask            = 0x8,
    TaskScheduled           = 0x10,
    JobTaskScheduled        = 0x20,
    TaskDispatched          = 0x40,
    JobTaskDispatched       = 0x80,
    TaskRunning             = 0x100,
    CloseTask               = 0x200,
    CloseTaskDispatched     = 0x400,
    TaskClosed              = 0x800,
    CloseJob                = 0x1000,

If you see either nodes in an unexpected state or jobs running that shouldn't be, we can dig in some more as to the root causes.

cheers
jeff

Posted: Thu, 28 May 2009 01:49:24 Z
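
Since the state query returns raw numbers, it may help to translate them back into the names from the list in this thread. The following is only a rough Python sketch, not anything shipped with the Compute Cluster Pack. The helper names (state_name, parse_counts, summarize) are made up for illustration. It assumes that sqlcmd prints one result row per line as two whitespace-separated integers, and that the state column might hold a combination of the listed bits, which the thread does not confirm.

```python
# Resource state values copied from the list in the thread.
RESOURCE_STATES = {
    0x0: "Offline",
    0x1: "Idle",
    0x2: "ScheduledReserve",
    0x4: "JobScheduled",
    0x8: "ReadyForTask",
    0x10: "TaskScheduled",
    0x20: "JobTaskScheduled",
    0x40: "TaskDispatched",
    0x80: "JobTaskDispatched",
    0x100: "TaskRunning",
    0x200: "CloseTask",
    0x400: "CloseTaskDispatched",
    0x800: "TaskClosed",
    0x1000: "CloseJob",
}


def state_name(value):
    """Return the name for a state value.

    Values that are not in the table are decoded bit by bit (an
    assumption: the thread does not say whether states can combine).
    Bits with no known name are reported in hex.
    """
    if value in RESOURCE_STATES:
        return RESOURCE_STATES[value]
    parts = [name for bit, name in RESOURCE_STATES.items() if bit and value & bit]
    known = sum(bit for bit in RESOURCE_STATES if value & bit)
    leftover = value & ~known
    if leftover:
        parts.append(hex(leftover))
    return "|".join(parts)


def parse_counts(sqlcmd_output):
    """Extract (state, count) pairs from the text output of the state query.

    Assumes each result row is one line with two integer columns; header,
    separator, and "(N rows affected)" lines fail that test and are skipped.
    """
    rows = []
    for line in sqlcmd_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            rows.append((int(fields[0]), int(fields[1])))
        except ValueError:
            continue
    return rows


def summarize(sqlcmd_output):
    """Return (state name, resource count) pairs for the state query output."""
    return [(state_name(state), count) for state, count in parse_counts(sqlcmd_output)]
```

One way to use it would be to save the output of the "select state, count(*) ..." sqlcmd run to a file and pass its contents to summarize(). Any state other than Idle or TaskRunning that persists for many resources would then stand out by name.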