HPC - Unreachable node<br/><br/>Marc Coste:<br/>Hello,<br/><br/>Our HPC cluster has been in production for three months: one head node, two compute nodes.<br/>Suddenly, we started to have problems with jobs. Two cancelled jobs remained running and tasks stayed queued. Some jobs continued to run, but it was difficult to schedule new jobs.<br/>We restarted the HPC services on the head node. After that, one compute node became unreachable. There is no problem with the second one.<br/>Ping between the head node and the compute nodes works well in both directions and uses the private network. Name resolution works.<br/>We rebooted the head node and the failing compute node without any success. We were able to change the role of the head node to add the compute node role.<br/><br/>On the head node, when I try to force the failing node offline, it doesn't work. I see in the HPCManagement log that there's a problem with a job ID:<br/>   Failed to update the configuration of the scheduler. The specified Job ID is not valid.  Check your Job ID and try again.<br/><br/>Result of the NODE LIST command:<br/>  Node Name           State       Max Run Idle<br/>  ------------------- ----------- --- --- ---- <br/>  ARCHPC30            Online      8   2   6<br/>  ARCHPC31            Online      8   4   4<br/>  ARCHPC32            Unreachable 8   4   4<br/><br/>It seems that the cluster thinks there's a job running on ARCHPC32.<br/><br/>Result of the NODE VIEW ARCHPC32 command:<br/>  System Id                       : 4<br/>  System GUID                     : 98c285f8-c1bc-44af-ba57-d0698ec893ef<br/>  Job Types                       : Batch, Admin, Service<br/>  State                           : Unreachable<br/>  Number Of Cores                 : 8<br/>  Number Of Sockets               : 2<br/>  Offline Time                    : 3/10/2009 10:28:16 AM<br/>  Online Time                     : 5/15/2009 9:05:16 PM<br/>  Security Descriptor             : S-1-5-21-2581773388-2140707169-780721725-5144<br/>  Memory Size                     
: 16382<br/>  CPU Speed                       : 3000<br/>  Node Groups                     : ComputeNodes<br/><br/>If I use the command job list /all /state:running or job list /all /status:running, I see all jobs, whether they are finished, failed, running, or canceled. Only two are shown as running in the list, and that corresponds to what's running now.<br/>How can I find the wrong job?<br/>How can I force the system to think that no core is in use on ARCHPC32?<br/><br/>Thanks for your help.<br/><br/>Best regards.<br/><br/>Marc<br/>Sat, 16 May 2009 16:39:31 Z<br/><br/>
rmag:<br/>I have 30 of the 49 nodes in my cluster showing as online but unreachable.  All diagnostic tests pass except the ones that fail immediately because the node is in an unreachable state.  I have tried almost everything I can think of.  I even tried reinstalling the HPC Pack on one of the nodes to see if that would at least work.  The cluster has been in production since Dec 08 and I haven't experienced anything like this.  Any help is appreciated.  It just started last night.<br/>Sun, 17 May 2009 21:14:06 Z<br/><br/>
rmag:<br/>It appears that somehow the nodes are associated with a job that does not exist.  If I run set-hpcnodestate -force -name xxxxx -state offline, I receive the error &quot;failed to update the configuration of the scheduler, the specified job ID is invalid.  Check your Job ID and try again&quot;.  There are currently no jobs running, and I tried stopping all job IDs that were run in the last 2 days with stop-hpcjob.  
I'm stuck.<br/>Sun, 17 May 2009 23:14:39 Z<br/><br/>
Johannes_de:<br/>Hi rmag,<br/> <br/> I had a similar problem. Although the error message about a job ID that was not found didn't appear, somehow a cancelled job left several nodes stuck while draining.<br/> <br/> The following patch and a reboot solved the issue for me:<br/> <br/> http://www.microsoft.com/downloads/details.aspx?displaylang=en&amp;FamilyID=1ea55293-38a6-417c-b0e3-5942a0bfa008<br/> <br/> For more details, read here: http://social.microsoft.com/Forums/en-US/windowshpcitpros/thread/a9075106-b474-4ca4-9dd8-02bcf529211c<br/> <br/> If you don't want to reboot, you can restart the HPC services instead.<hr class="sig">JH<br/>Mon, 18 May 2009 07:02:21 Z<br/><br/>
Marc Coste:<br/>Hi Johannes,<br/><br/>Thanks for your answer.<br/>Currently some jobs are running on the head node and a compute node.<br/>Are these jobs going to be cancelled if I restart the HPC services?<br/>Should I restart all HPC services or only one of them?<br/><br/>Best regards.<br/><br/>Marc<br/>Mon, 18 May 2009 08:11:59 
Z<br/><br/>
Marc Coste:<br/>Hello,<br/><br/>I have applied the patch on the three servers and rebooted all of them.<br/>The problem isn't solved.<br/>The command set-hpcnodestate -force -name archpc32 -state offline displays the following error message:<br/>  Set-HpcNodeState : Failed to update the configuration of the scheduler. The specified Job ID is not valid.  Check your Job ID and try again.<br/><br/>Any other ideas?<br/><br/>Thanks.<br/><br/>Marc<br/>Mon, 18 May 2009 08:56:41 Z<br/><br/>
Johannes_de:<br/><blockquote>Hi Johannes,<br/> <br/> Thanks for your answer.<br/> Currently some jobs are running on the head node and a compute node.<br/> Are these jobs going to be cancelled if I restart the HPC services?<br/> Should I restart all HPC services or only one of them?<br/> <br/> Best regards.<br/> <br/> Marc</blockquote> That depends on various job criteria, but most probably yes!<br/> <br/> Restart all of them, and most importantly the HPC Storeh<br/><hr class="sig">JH<br/>Mon, 18 May 2009 09:02:44 
Z<br/><br/>
Johannes_de:<br/>I recently stumbled over a description from MWirth in which he altered the job details in the SQL database.<br/> However, I don't think this is recommended.<br/> <strong>edit:</strong> <br/> http://social.microsoft.com/Forums/en-US/windowshpcitpros/thread/a9075106-b474-4ca4-9dd8-02bcf529211c<br/> <br/> Perhaps there is another issue involved. Try disabling the firewalls on all nodes, if that is safe in your environment. Check with your AD admin whether replication is functional between your DCs and the nodes.<br/> <br/> <br/> My problem really was solved by the patch.<br/> Before that I tried different things, like deleting the computer accounts from AD, forcing the node state, and so on.<br/> Nothing, however, improved the situation.<br/> <hr class=sig> JH<br/>Mon, 18 May 2009 09:08:20 Z (edited 2009-05-18T09:10:30Z)<br/><br/>
Marc Coste:<br/><br/>I had a look in the database, especially in the tables dbo.Ressources and dbo.NodeResourceCounts. <br/><br/>The first table contains one record per node and core. <br/>In four records (those for the unreachable node), I found a job ID and a task ID.<br/>This job didn't exist. 
I checked using the server manager.<br/>I stopped the HPC services on the head node and modified the database (after taking a backup).<br/>In the table dbo.Ressources, I changed the four records:<br/>  JobId=0<br/>  TaskId=0<br/>  State=1<br/>  FirstTaskAllocation=False<br/><br/>In the table dbo.NodeResourcesCount, I changed the record corresponding to the unreachable node:<br/>  Idle=total number of cores (same as Total)<br/>  TaskRunning=0<br/><br/>I restarted the HPC services on the head node and the node came online.<br/>Now I can take it offline or online.<br/><br/>And jobs can run on all nodes.<br/><br/>I hope a fix will be available soon to do this automatically when the HPC services are started on the head node.<br/><br/>Marc<br/>Mon, 18 May 2009 09:30:17 Z<br/><br/>
Johannes_de:<br/><blockquote><br/> I had a look in the database, especially in the tables dbo.Ressources and dbo.NodeResourceCounts. <br/> <br/> The first table contains one record per node and core. <br/> In four records (those for the unreachable node), I found a job ID and a task ID.<br/> This job didn't exist. 
I checked using the server manager.<br/> I stopped the HPC services on the head node and modified the database (after taking a backup).<br/> In the table dbo.Ressources, I changed the four records:<br/>   JobId=0<br/>   TaskId=0<br/>   State=1<br/>   FirstTaskAllocation=False<br/> <br/> In the table dbo.NodeResourcesCount, I changed the record corresponding to the unreachable node:<br/>   Idle=total number of cores (same as Total)<br/>   TaskRunning=0<br/> <br/> I restarted the HPC services on the head node and the node came online.<br/> Now I can take it offline or online.<br/> <br/> And jobs can run on all nodes.<br/> <br/> I hope a fix will be available soon to do this automatically when the HPC services are started on the head node.<br/> <br/> Marc</blockquote> Great that you could solve the problem.<br/> I'm not familiar with databases. Could you post a small tutorial / walkthrough on how to get to the job details? (Tools used, paths, and so on.)<br/> Would be great.<br/> <br/> Regards,<br/> <br/> Johannes<br/><hr class="sig">JH<br/>Mon, 18 May 2009 09:41:03 Z<br/><br/>
rmag:<br/><p>We did the same on our cluster and this also fixed the problem.  Again, hopefully there will be a fix to force nodes offline or flush any open ties, regardless of whether the job ID they are associated with exists, so we don't have to manually edit the database.</p>Mon, 18 May 2009 12:50:21 Z
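<br/><br/>
The manual repair described in this thread can be sketched in code. This is only an illustration under stated assumptions, not the actual procedure used by the posters. It runs against an in-memory sqlite3 mock rather than the real scheduler database on SQL Server. The table names Ressources and NodeResourceCounts and the columns JobId, TaskId, State, FirstTaskAllocation, Total, Idle and TaskRunning are taken from Marc's post as written. The Job table, the NodeName and CoreId columns, the busy state value 2, and all function names are hypothetical. The sketch first finds core records that point at a job that does not exist, then resets them for one node.

```python
import sqlite3


def build_mock_db():
    """Create an in-memory stand-in for the scheduler database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Job (Id INTEGER PRIMARY KEY);
        CREATE TABLE Ressources (
            NodeName TEXT, CoreId INTEGER, JobId INTEGER, TaskId INTEGER,
            State INTEGER, FirstTaskAllocation INTEGER
        );
        CREATE TABLE NodeResourceCounts (
            NodeName TEXT PRIMARY KEY, Total INTEGER, Idle INTEGER, TaskRunning INTEGER
        );
        """
    )
    conn.execute("INSERT INTO Job (Id) VALUES (101)")  # the only job that really exists
    rows = []
    for core in range(8):
        busy = core < 4
        task, state = (1, 2) if busy else (0, 1)  # state 2 = busy is an assumption
        # ARCHPC31: four cores legitimately used by job 101.
        rows.append(("ARCHPC31", core, 101 if busy else 0, task, state, 0))
        # ARCHPC32: four cores still tied to job 999, which is not in the Job table.
        rows.append(("ARCHPC32", core, 999 if busy else 0, task, state, 0))
    conn.executemany("INSERT INTO Ressources VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.executemany(
        "INSERT INTO NodeResourceCounts VALUES (?, ?, ?, ?)",
        [("ARCHPC31", 8, 4, 4), ("ARCHPC32", 8, 4, 4)],
    )
    conn.commit()
    return conn


def find_orphaned_resources(conn):
    """Return core records whose JobId refers to a job missing from the Job table."""
    return conn.execute(
        """
        SELECT r.NodeName, r.CoreId, r.JobId, r.TaskId
        FROM Ressources AS r
        LEFT JOIN Job AS j ON j.Id = r.JobId
        WHERE r.JobId <> 0 AND j.Id IS NULL
        ORDER BY r.NodeName, r.CoreId
        """
    ).fetchall()


def release_node(conn, node_name):
    """Reset orphaned core records for one node; return how many records changed.

    Setting Idle = Total assumes every busy core on the node was orphaned,
    as was the case for the unreachable node in the thread.
    """
    with conn:  # commits on success, rolls back if an error is raised
        cur = conn.execute(
            """
            UPDATE Ressources
            SET JobId = 0, TaskId = 0, State = 1, FirstTaskAllocation = 0
            WHERE NodeName = ?
              AND JobId <> 0
              AND NOT EXISTS (SELECT 1 FROM Job WHERE Job.Id = Ressources.JobId)
            """,
            (node_name,),
        )
        conn.execute(
            "UPDATE NodeResourceCounts SET Idle = Total, TaskRunning = 0 "
            "WHERE NodeName = ?",
            (node_name,),
        )
    return cur.rowcount
```

On a real cluster, the posters stopped the HPC services on the head node and backed up the database before changing anything, and Johannes notes that editing the database directly is probably not recommended.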