HPC - Unreachable node

    Question

  • Hello,

    Our HPC cluster has been in production for three months: one head node and two compute nodes.
    Suddenly we started having problems with jobs. Two cancelled jobs remained in the running state with their tasks still queued. Some jobs continued to run, but it became difficult to schedule new ones.
    We restarted the HPC services on the head node. After that, one compute node became unreachable. There is no problem with the second one.
    Ping works in both directions between the head node and the compute nodes, over the private network. Name resolution works as well.
    We rebooted the head node and the failing compute node without any success. We were still able to change the role of the head node to add the compute node role.

    On the head node, when I try to force the failing node offline, it doesn't work. I see in the HPCManagement log that there's a problem with a job ID:
       Failed to update the configuration of the scheduler. The specified Job ID is not valid.  Check your Job ID and try again.

    Result of the NODE LIST command:
      Node Name           State       Max Run Idle
      ------------------- ----------- --- --- ----
      ARCHPC30            Online      8   2   6
      ARCHPC31            Online      8   4   4
      ARCHPC32            Unreachable 8   4   4

    It seems that the cluster thinks there's a job running on ARCHPC32.

    Result of the NODE VIEW ARCHPC32:
      System Id                       : 4
      System GUID                     : 98c285f8-c1bc-44af-ba57-d0698ec893ef
      Job Types                       : Batch, Admin, Service
      State                           : Unreachable
      Number Of Cores                 : 8
      Number Of Sockets               : 2
      Offline Time                    : 3/10/2009 10:28:16 AM
      Online Time                     : 5/15/2009 9:05:16 PM
      Security Descriptor             : S-1-5-21-2581773388-2140707169-780721725-5144
      Memory Size                     : 16382
      CPU Speed                       : 3000
      Node Groups                     : ComputeNodes

    If I use the command job list /all /state:running or job list /all /status:running, I see all jobs, whether they are finished, failed, running, or canceled. Only two in the list are in the running state, and that matches what is actually running now.
    How can I find the bogus job?
    How can I force the system to see that no cores are in use on ARCHPC32?

    Thanks for your help.

    Best regards.

    Marc
    Saturday, May 16, 2009 4:39 PM

Answers


  • I had a look in the database, especially in the tables dbo.Ressources and dbo.NodeResourceCounts.

    The first table contains one record per node and core.
    In four records (those belonging to the unreachable node), I found a JobId and a TaskId.
    That job didn't exist; I checked using the server manager.
    I stopped the HPC services on the head node and modified the database (after taking a backup first).
    In the table dbo.Ressources, I changed the four records:
      JobId=0
      TaskId=0
      State=1
      FirstTaskAllocation=False

    In the table dbo.NodeResourceCounts, I changed the record corresponding to the unreachable node:
      Idle=Total (the node's total number of cores)
      TaskRunning=0

    I restarted the HPC services on the head node and the node came back online.
    Now I can take it offline or bring it back online.

    And jobs can run on all nodes.

    I hope a fix will be available soon to do this automatically when the HPC services are started on the head node.
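
    For anyone who needs to repeat this, the edits described above can be sketched as T-SQL. This is only a sketch: the table and column names are taken verbatim from the description above, while the WHERE filter (a NodeId column matching the System Id from NODE VIEW) is an assumption about the schema. Stop the HPC services and back up the scheduler database first, as described.

    ```sql
    -- Sketch only: run against the HPC scheduler database with the HPC
    -- services stopped and a backup taken. The NodeId column and the
    -- value 4 (the System Id shown by NODE VIEW ARCHPC32) are
    -- assumptions -- verify the actual schema and find the affected
    -- rows with a SELECT before updating anything.

    -- Clear the stale job/task allocation on the node's core records.
    UPDATE dbo.Ressources
    SET JobId = 0,
        TaskId = 0,
        State = 1,
        FirstTaskAllocation = 0   -- False
    WHERE NodeId = 4;

    -- Report all of the node's cores as idle again.
    UPDATE dbo.NodeResourceCounts
    SET Idle = Total,             -- same as the total number of cores
        TaskRunning = 0
    WHERE NodeId = 4;
    ```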

    Marc
    • Marked as answer by Marc Coste Monday, May 18, 2009 9:32 AM
    Monday, May 18, 2009 9:30 AM

All replies

  • I have 30 of the 49 nodes in my cluster showing as online but unreachable.  All diagnostic tests pass except the ones that immediately fail because the node is in an unreachable state.  I have tried almost everything I can think of.  I even tried reinstalling the HPC Pack on one of the nodes to see whether that would at least work.  The cluster has been in production since December 2008 and I haven't experienced anything like this before.  It just started last night.  Any help is appreciated.
    Sunday, May 17, 2009 9:14 PM
  • It appears that somehow the nodes are associated with a job that does not exist.  If I run set-hpcnodestate -force -name xxxxx -state offline, I receive the error "Failed to update the configuration of the scheduler. The specified Job ID is not valid.  Check your Job ID and try again."  There are currently no jobs running, and I have tried stopping every job ID that was run in the last two days with stop-hpcjob.  I'm stuck.
    • Proposed as answer by Johannes_de Monday, May 18, 2009 10:25 AM
    Sunday, May 17, 2009 11:14 PM
  • Hi rmag,

    I had a similar problem. Although the error message about a job ID that could not be found didn't appear, a cancelled job somehow got several nodes stuck while draining.

    The following patch and a reboot solved the issue for me:

    http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=1ea55293-38a6-417c-b0e3-5942a0bfa008

    For more details read here: http://social.microsoft.com/Forums/en-US/windowshpcitpros/thread/a9075106-b474-4ca4-9dd8-02bcf529211c

    If you don't want to reboot, you can restart the HPC services instead.
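
    For example, from an elevated command prompt on the head node. The service names below are an assumption based on a default HPC Pack 2008 installation; check the actual names in services.msc on your system before running anything:

    ```
    rem Assumed service names -- verify in services.msc first.
    net stop HpcScheduler
    net stop HpcManagement
    net stop HpcSdm
    net start HpcSdm
    net start HpcManagement
    net start HpcScheduler
    ```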
    JH
    Monday, May 18, 2009 7:02 AM
  • Hi Johannes,

    Thanks for your answer.
    Currently some jobs are running on the head node and a compute node.
    Are these jobs going to be cancelled if I restart the HPC services?
    Should I restart all HPC services or only one of them?

    Best regards.

    Marc
    Monday, May 18, 2009 8:11 AM
  • Hello,

    I have applied the patch on the three servers and rebooted all of them.
    The problem isn't solved.
    The command set-hpcnodestate -force -name archpc32 -state offline displays the following error message.
      Set-HpcNodeState : Failed to update the configuration of the scheduler. The spe
      cified Job ID is not valid.  Check your Job ID and try again.

    Any other idea?

    Thanks.

    Marc
    Monday, May 18, 2009 8:56 AM
  • Hi Johannes,

    Thanks for your answer.
    Currently some jobs are running on the head node and a compute node.
    Are these jobs going to be cancelled if I restart the HPC services?
    Should I restart all HPC services or only one of them?

    Best regards.

    Marc
    That depends on various job criteria, but most probably yes!

    Restart all of them, and most importantly the HPC Store.

    JH
    Monday, May 18, 2009 9:02 AM
  • There is a description from MWirth that I recently stumbled over, in which he altered the job details in the SQL database.
    However, I don't think this is recommended.
    edit:
    http://social.microsoft.com/Forums/en-US/windowshpcitpros/thread/a9075106-b474-4ca4-9dd8-02bcf529211c

    Perhaps there is another issue involved. Try disabling the firewalls on all nodes, if that is safe in your environment. Check with your AD admin whether replication is working between your DCs and the nodes.


    My problem really was solved by the patch.
    Before that I tried different things, like deleting the computer accounts from AD, forcing the node state, and so on.
    None of it improved the situation.

    JH
    • Edited by Johannes_de Monday, May 18, 2009 9:10 AM Added link
    Monday, May 18, 2009 9:08 AM

  • Great that you could solve the problem.
    I'm not familiar with databases. Could you post a small tutorial/walkthrough on how to get to the job details? (Tools used, paths, and so on.)
    That would be great.

    Regards,

    Johannes

    JH
    Monday, May 18, 2009 9:41 AM
  • We did the same on our cluster and it also fixed the problem.  Again, hopefully a fix will make it possible to force nodes offline, or flush any stale allocations, regardless of whether the job ID they are associated with still exists, so that we don't have to edit the database manually.

    Monday, May 18, 2009 12:50 PM