Linux Node does not appear in Cluster Manager

    Question

  • I am running HPC Pack 2016 Update 1 and have Linux nodes running Red Hat 7.5.

    I got the agent installed, but I do not see the node in Cluster Manager.

    Thursday, 6 September 2018 1:52 PM

Answers

  • Thanks for providing the very helpful logs.
    The reason is likely that the hostname of your Linux node is an FQDN (i.e. BIOHPC3.DS.UAH.EDU), which is not currently supported.

    The workaround is to change the hostname of the Linux node to just "BIOHPC3".
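    For reference, deriving the short name from the FQDN is plain string handling; the agent restart command below is an assumption based on the htclinuxagent.log output earlier in this thread, not from official docs:

    ```shell
    # Derive the short name from the FQDN (safe to run anywhere)
    fqdn="BIOHPC3.DS.UAH.EDU"
    short="${fqdn%%.*}"
    echo "$short"   # BIOHPC3
    # After renaming the node, restart the agent so it re-registers:
    #   sudo service hpcagent restart
    ```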


    • Marked as answer by DCSpooner Monday, 10 September 2018 10:31 AM
    Monday, 10 September 2018 9:48 AM

All replies

  • Hi,

    Could you check the HPC Linux node agent log in /opt/hpcnodemanager/log/nodemanager.txt and share the error message?

    Friday, 7 September 2018 1:27 PM
  • I am not getting any errors in that file.

    Here are the last few lines of the file:

    [09/07 09:21:07.926] 181171 info: Hosts file manager: response received with no update
    [09/07 09:21:20.813] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:21:26.267] 181168 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:21:50.873] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:22:20.937] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:22:50.999] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:23:21.067] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:23:51.128] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:24:21.191] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:24:51.253] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:25:21.323] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:25:51.386] 181169 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:26:07.929] 181171 info: ResolveServiceLocation> Resolved serviceLocation SchedulerStatefulService for THOR1
    [09/07 09:26:07.938] 181171 info: Hosts file manager: response received with no update
    [09/07 09:26:21.452] 181169 info: ResolveServiceLocation> Reso

    And from the htclinuxagent.log file:

    2018/09/06 11:23:35 Stop HPC node manager daemon: 177920
    2018/09/06 11:23:35 HPC node manager daemon is disabled
    2018/09/06 11:23:43 The command line is: /opt/hpcnodemanager/hpcagent enable
    2018/09/06 11:23:43 The command line is: /opt/hpcnodemanager/hpcagent daemon
    2018/09/06 11:23:43 The connection string is thor1
    2018/09/06 11:23:43 Configure iptables to allow incoming tcp connection to 40000 and 40002.
    2018/09/06 11:23:44 HPC node manager process started
    2018/09/06 11:23:46 Daemon pid: 180241
    2018/09/06 11:23:46 HPC Linux node manager daemon is enabled
    2018/09/06 11:26:03 The command line is: /opt/hpcnodemanager/hpcagent disable
    2018/09/06 11:26:03 The cmd for process 180241 is python /opt/hpcnodemanager/hpcagent daemon
    2018/09/06 11:26:03
    2018/09/06 11:26:03 Stop HPC node manager daemon: 180241
    2018/09/06 11:26:03 HPC node manager daemon is disabled
    2018/09/06 11:26:05 The command line is: /opt/hpcnodemanager/hpcagent enable
    2018/09/06 11:26:05 The command line is: /opt/hpcnodemanager/hpcagent daemon
    2018/09/06 11:26:05 The connection string is thor1
    2018/09/06 11:26:05 Configure iptables to allow incoming tcp connection to 40000 and 40002.
    2018/09/06 11:26:06 HPC node manager process started
    2018/09/06 11:26:08 Daemon pid: 181152
    2018/09/06 11:26:08 HPC Linux node manager daemon is enabled

    • Edited by DCSpooner Friday, 7 September 2018 2:39 PM
    Friday, 7 September 2018 2:35 PM
  • I am also seeing this in the HpcScheduler log on the head node:

    09/07/2018 15:12:58.494 i HpcScheduler 2000 5892 [JV] Validating job changes 
    09/07/2018 15:12:58.495 i HpcScheduler 2000 5892 [JV] Validating task changes with expandParametric = False 
    09/07/2018 15:12:58.495 i HpcScheduler 2000 5892 [JV] Set all the new tasks to failed if the job is failed. 
    09/07/2018 15:12:58.496 i HpcScheduler 2000 5892 [JV] Validating task changes with expandParametric = True 
    09/07/2018 15:12:58.496 i HpcScheduler 2000 5892 [JV] Set all the new tasks to failed if the job is failed. 
    09/07/2018 15:12:58.884 i HpcScheduler 2000 4312 [LinuxCommunicator] Linux ComputeNodeReported. NodeName BIOHPC3.DS.UAH.EDU, JobCount 0 
    09/07/2018 15:12:58.885 v HpcScheduler 2000 4312 [LinuxCommunicator] ComputeNodeReported: Not found node BIOHPC3.DS.UAH.EDU 
    09/07/2018 15:12:58.885 v HpcScheduler 2000 4312 [UnmanagedResourceManager] ComputeNodeReported: Not found node BIOHPC3.DS.UAH.EDU 
    09/07/2018 15:12:59.330 v HpcScheduler 2000 5164 [Policy]  Scheduling policy  Queued 
    09/07/2018 15:12:59.331 i HpcScheduler 2000 5164 [Policy] Resource Pool state

    BIOHPC3.ds.uah.edu is the Linux node I am trying to see in Cluster Manager.
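    The "Not found node" lines are the key detail: the head node knows the machine by one name while the agent reports the FQDN, so the lookup fails. A toy illustration of that mismatch (assumed behavior, not actual HPC Pack code):

    ```shell
    registered="BIOHPC3"              # name the head node has for this machine (assumed)
    reported="BIOHPC3.DS.UAH.EDU"     # name the Linux agent reports
    if [ "$reported" != "$registered" ]; then
      echo "Not found node $reported"  # mirrors the scheduler log line
    fi
    short="${reported%%.*}"
    [ "$short" = "$registered" ] && echo "short name matches"
    ```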

    Friday, 7 September 2018 3:30 PM
  • Yes, that was it; I would not have guessed it in today's world of FQDNs.

    So I changed the hostname in the /etc/hosts file,

    did a sudo hostname BIOHPC3,

    and then did a sudo service hpcagent restart,

    and the node showed up.

    Now I just need to test a job and see what the outcome is.
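    One caveat on the steps above: sudo hostname BIOHPC3 changes the name only until the next reboot. On RHEL 7 something like the following would make it persist (hostnamectl usage is an assumption about the node's setup, not taken from this thread):

    ```shell
    # Reject a name that still carries a domain part before applying it
    new_name="BIOHPC3"
    case "$new_name" in
      *.*) echo "still an FQDN, refusing" ;;
      *)   echo "short name ok" ;;
    esac
    # Then, on the node (needs root):
    #   sudo hostnamectl set-hostname "$new_name"   # writes /etc/hostname
    #   sudo service hpcagent restart
    ```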

    Thank you so much.

    Monday, 10 September 2018 10:34 AM