Linux Node does not appear in Cluster Manager.

    Question

  • Hello everyone,

    we are trying to set up a new Microsoft HPC Pack 2016 cluster with on-premises Linux nodes.
    Following the manual at https://technet.microsoft.com/en-us/library/mt792019(v=ws.11).aspx
    we were able to install the hpcagent on the Linux node.

    Unfortunately, the next steps are not explained there.

    My expectation was that the Linux node would appear as a new node in the HPC Cluster Manager
    (just as when we install workstation nodes), but this does not happen.

    The Linux system we are using is Ubuntu 16.04 (LTS).

    Can you please suggest how we can get the Linux node included in the cluster?
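
    In case it helps the diagnosis, the agent's presence on the node can be verified with plain Linux commands (the /opt/hpcnodemanager path is where the installer placed its files on our node):

        # is the node manager process running?
        ps -ef | grep -i nodemanager
        # the agent's configuration and logs live here
        ls /opt/hpcnodemanager
        ls /opt/hpcnodemanager/logs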

    Thank you very much in advance for your support.

    Best regards,

    Bobby
    Friday, July 7, 2017 9:44 PM

All replies

  • Hello,

    Could you share the following configuration files and logs with us at suzhu@microsoft.com?

    1. On the Linux node side:

    /opt/hpcnodemanager/nodemanager.json

    /opt/hpcnodemanager/logs

    2. On the head node side, share the files with the second-latest index (for example, if 000009 is the latest, share 000008); see the listing example after the file names:

    C:\Program Files\Microsoft HPC Pack 2016\Data\LogFiles\Scheduler\HpcScheduler_AA_*.bin

    C:\Program Files\Microsoft HPC Pack 2016\Data\LogFiles\Scheduler\HpcScheduler_AB_*.bin

    C:\Program Files\Microsoft HPC Pack 2016\Data\LogFiles\Scheduler\HpcScheduler_AC_*.bin
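
    For example, the available indexes can be listed in a command prompt on the head node (a standard dir command over the path above):

        dir "C:\Program Files\Microsoft HPC Pack 2016\Data\LogFiles\Scheduler\HpcScheduler_AA_*.bin"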

    Thanks,

    Sunbin



    Monday, July 10, 2017 8:27 AM
  • Hello Sunbin Zhu,

    Thank you very much for your help. Here is the information you requested from these machines.

    I hope this helps you to analyze what is going on.

    To me, the log file on the Linux machine seems to point to the problem.

    1.) On the Linux node side:

    nodemanager.json

    {"NamingServiceUri": ["http://lu00113vma:8939/api/fabric/resolve/singleton/"], "CertificateChainFile": "/opt/hpcnodemanager/certs/nodemanager.crt", "RegisterUri": "https://{0}:40003/api/lud354vu/registerrequested", "TrustedCAFile": "/opt/hpcnodemanager/certs/nodemanager.pem", "MetricInstanceIdsUri": "https://{0}:40003/api/lud354vu/getinstanceids", "DefaultServiceName": "SchedulerStatefulService", "MetricUri": "", "UdpMetricServiceName": "MonitoringStatefulService", "PrivateKeyFile": "/opt/hpcnodemanager/certs/nodemanager.key", "HeartbeatUri": "https://{0}:40003/api/lud354vu/computenodereported", "ListeningUri": "https://0.0.0.0:40002"}

    In the log I see the following message repeated constantly (nodemanager.txt):

    [07/10 11:30:57.521] 19953 warning: ResolveServiceLocation> HttpException occurred when fetching from http://lu00113vma:8939/api/fabric/resolve/singleton/SchedulerStatefulService, ex Error resolving address
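
    The failing lookup can be reproduced outside the agent with standard tools (a sketch; the host name and URI are taken from the config above):

        # getent uses the same glibc lookup path (getaddrinfo) that most daemons use
        getent hosts lu00113vma
        # query DNS directly, bypassing /etc/hosts
        nslookup lu00113vma
        # fetch the naming-service endpoint the agent fails on
        curl -v http://lu00113vma:8939/api/fabric/resolve/singleton/SchedulerStatefulService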

    2.) On the head node side:

    I converted the *.bin to *.txt with hpctrace.
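
    The conversion was along these lines (a sketch from memory; hpctrace ships with HPC Pack on the head node, and its built-in help shows the exact syntax):

        hpctrace convertlog HpcScheduler_AA_000008.bin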

    HpcScheduler_AA_*.log (very big: more than 50 MB as text, about 4 MB as .bin):

    I found nothing inside that points to an error. How can I attach a 4 MB file to this forum?

    Here are the last entries from this file:

    07/10/2017 09:24:25.388 i HpcScheduler 9036 11544 [Policy] Resource Pool state Node LUVS003M [ <5 : Idle > ]..  
    07/10/2017 09:24:26.935 i HpcScheduler 9036 11540 [JV] Canceling jobs  
    07/10/2017 09:24:26.935 i HpcScheduler 9036 11540 [JV] Validating job changes  
    07/10/2017 09:24:26.935 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = False  
    07/10/2017 09:24:26.935 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:26.935 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = True  
    07/10/2017 09:24:26.935 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:26.950 v HpcScheduler 9036 11504 [RemotingCommunicator] NodeListener._HouseKeeper: Scheduler tests connection to itself, 172.20.195.139:5970  
    07/10/2017 09:24:27.903 v HpcScheduler 9036 11544 [Policy]  Scheduling policy  Queued  
    07/10/2017 09:24:27.903 i HpcScheduler 9036 11544 [Policy] Resource Pool state Node LUVS003M [ <5 : Idle > ]..  
    07/10/2017 09:24:29.435 i HpcScheduler 9036 11540 [JV] Canceling jobs  
    07/10/2017 09:24:29.435 i HpcScheduler 9036 11540 [JV] Validating job changes  
    07/10/2017 09:24:29.435 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = False  
    07/10/2017 09:24:29.435 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:29.435 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = True  
    07/10/2017 09:24:29.435 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:30.403 v HpcScheduler 9036 11544 [Policy]  Scheduling policy  Queued  
    07/10/2017 09:24:30.403 i HpcScheduler 9036 11544 [Policy] Resource Pool state Node LUVS003M [ <5 : Idle > ]..  
    07/10/2017 09:24:30.638 i HpcScheduler 9036 11508 [AzureCommunicator] SchedulerAzureCommunicator.CleanUpInternal: Enter, starting an Azure driver cleanup.  
    07/10/2017 09:24:30.638 i HpcScheduler 9036 11508 [AzureCommunicator] SchedulerAzureCommunicator.CleanUpInternal: Exit, Azure driver cleanup done.  
    07/10/2017 09:24:30.919 v HpcScheduler 9036 13284 [Store] RemoteEvent_TriggerTouch(UserName=n/a, ConnectionID=31)  
    07/10/2017 09:24:31.950 i HpcScheduler 9036 11540 [JV] Canceling jobs  
    07/10/2017 09:24:31.950 i HpcScheduler 9036 11540 [JV] Validating job changes  
    07/10/2017 09:24:31.950 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = False  
    07/10/2017 09:24:31.950 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:31.950 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = True  
    07/10/2017 09:24:31.950 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:32.903 v HpcScheduler 9036 11544 [Policy]  Scheduling policy  Queued  
    07/10/2017 09:24:32.903 i HpcScheduler 9036 11544 [Policy] Resource Pool state Node LUVS003M [ <5 : Idle > ]..  
    07/10/2017 09:24:34.450 i HpcScheduler 9036 11540 [JV] Canceling jobs  
    07/10/2017 09:24:34.450 i HpcScheduler 9036 11540 [JV] Validating job changes  
    07/10/2017 09:24:34.450 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = False  
    07/10/2017 09:24:34.450 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:34.450 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = True  
    07/10/2017 09:24:34.450 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:35.403 v HpcScheduler 9036 11544 [Policy]  Scheduling policy  Queued  
    07/10/2017 09:24:35.403 i HpcScheduler 9036 11544 [Policy] Resource Pool state Node LUVS003M [ <5 : Idle > ]..  
    07/10/2017 09:24:36.950 i HpcScheduler 9036 11540 [JV] Canceling jobs  
    07/10/2017 09:24:36.950 i HpcScheduler 9036 11540 [JV] Validating job changes  
    07/10/2017 09:24:36.950 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = False  
    07/10/2017 09:24:36.950 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  
    07/10/2017 09:24:36.950 i HpcScheduler 9036 11540 [JV] Validating task changes with expandParametric = True  
    07/10/2017 09:24:36.950 i HpcScheduler 9036 11540 [JV] Set all the new tasks to failed if the job is failed.  


    HpcScheduler_AB_*.log:

    Date(UTC) Time(UTC) Level Source PID TID Message
    07/07/2017 10:13:16.957 i HpcSchedulerStateful.exe 800 10488 [ServiceFabric] Changing role to IdleSecondary  
    07/07/2017 10:13:21.145 i HpcSchedulerStateful.exe 800 10448 [ServiceFabric] Changing role to ActiveSecondary  

    HpcScheduler_AC_*.log:

    Date(UTC) Time(UTC) Level Source PID TID Message
    07/07/2017 10:13:17.988 i HpcSchedulerStateful.exe 10224 10516 [ServiceFabric] Changing role to IdleSecondary  
    07/07/2017 10:13:24.285 i HpcSchedulerStateful.exe 10224 10516 [ServiceFabric] Changing role to ActiveSecondary  

    Any help in fixing the root cause of this problem is welcome, so that the Linux machine can communicate with the cluster.

    Thank you very much in advance,

    best regards,

    Bobby.

    Monday, July 10, 2017 10:38 AM
  • Hello Bobby,

    It seems the Linux node failed to resolve the IP address of your head node "lu00113vma". Could you ping it from the Linux node?


    Tuesday, July 11, 2017 9:28 AM
  • Hello Sunbin,

    a ping to "lu00113vma" from the Linux node works (output was attached as a screenshot).

    So there must be another reason, right?

    Thanks for helping,

    best regards,

    Bobby

    Tuesday, July 11, 2017 9:44 AM
  • Hello Sunbin,

    when I open the URL "http://lu00113vma:8939/api/fabric/resolve/singleton/SchedulerStatefulService" in a browser on the Linux node, the response is the name of the head node (screenshot attached).

    Is this the expected answer, or is something wrong here?

    Thank you very much for your help,

    best regards,

    Bobby

    Tuesday, July 11, 2017 5:42 PM
    Yes, that is the expected answer. Did you open the browser on your Linux node?

    Could you zip the logs I asked for and send them to suzhu@microsoft.com? You don't need to convert the .bin files to text format.

    And could you run the command "hostname" on the Linux node and share the output?


    Wednesday, July 12, 2017 5:00 PM
  • Hello Sunbin,

    I have now sent the email with the requested log files and JSON files to the address you gave me.

    When I execute the command "hostname", the result is "lud354vu".

    In the meantime I noticed that I had not exported the *.cer certificate, which needs to be converted to *.crt format and registered on the Linux side.
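
    Roughly, that conversion can be done like this (a sketch with placeholder file names; it assumes the exported .cer is DER-encoded, otherwise the file is already PEM and only needs renaming):

        # convert a DER-encoded .cer export to PEM (.crt)
        openssl x509 -inform der -in hpcnodemanager.cer -out hpcnodemanager.crt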

    On a new machine I performed this step as well, but the effect is the same: this client also does not appear under Resource Management in Cluster Manager.

    I have also sent you the log files from this new node.

    Could it be that there is a problem with the certificates?

    Thank you very much for your support,

    best regards,

    Bobby

    Friday, July 14, 2017 12:06 PM
  • Hello Sunbin,

    when I replace the NamingServiceUri in the nodemanager.json file, changing "http://lu00113vma:8939/....." to the FQDN, the error message changes from:

    24271 warning: ResolveServiceLocation> HttpException occurred when fetching from http://lu00113vma:8939/api/fabric/resolve/singleton/SchedulerStatefulService, ex Error resolving address

    to:

    21786 info: ResolveServiceLocation> Resolved serviceLocationSchedulerStatefulService for LU00113VMA.

    21787 warning: HttpException occurred when RegisterReporter report to https://LU00113VMA:40003/api/lud4pcfu/registerrequested, ex Error resolving address.

    If I read this correctly, it means the first request works but the follow-up does not, because somewhere it gets the head node name without the FQDN.

    At the moment I see only one place this name could come from:

    The name "LU00113VMA" is returned by the first HTTP request, and in my view it should be returned as an FQDN.

    My assumption is that if the head node returned the FQDN, the next step, the HTTPS request, would also work.

    Is this assumption correct?
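
    One way to check the search-domain behavior on the Linux node (standard commands; <domainFQDN> stands for our real domain suffix):

        # both should resolve if the DNS search domain is applied correctly
        getent hosts LU00113VMA
        getent hosts LU00113VMA.<domainFQDN>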

    If so: I remember that at installation time I was asked for a cluster name, and there I entered only the head node name (without the FQDN).

    Was this a mistake? Should I have entered the FQDN of the head node there?

    If it was, is there a way to change the cluster name to the FQDN now (by changing a registry key value)?

    Reinstallation would also be tricky because of some known issues here....

    It would be great if you could give me feedback on my assumption, and information on what I need to change on the head node to get the Linux node running.

    Thank you very much in advance,

    best regards

    Bobby.

    Tuesday, July 18, 2017 3:47 PM
  • Hi Bobby,

    Per your latest update, it seems the HPC Linux node agent fails to resolve the IP address of the head node by host name, but can resolve it by FQDN.

    The response from the first HTTP request is correct; it is designed to return a host name, not an FQDN.

    But you said you can ping the host name; that is strange.

    Anyway, could you check the /etc/resolv.conf file on the Linux node? There should be a line

    search <domainFQDN>

    For example, if the head node FQDN is myhn.hpc.local, the line should be "search hpc.local".
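
    A minimal /etc/resolv.conf for that case could look like this (the nameserver address is only an illustration):

        # your DNS server (example address)
        nameserver 172.20.0.10
        search hpc.local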

    Wednesday, July 19, 2017 7:43 AM
  • Hi Sunbin,

    I checked the /etc/resolv.conf file, and the line you asked about is there: "search <domainFQDN>".

    I have now edited my hosts file and added two entries:

         172.20.195.139 lu00113vma

         172.20.195.139 lu00113vma.<domainFQDN>

    After restarting the HPC agent, the Linux node now appears in Cluster Manager.

    I was also able to bring the node online and execute a command on it.

    So it seems this helped. It is still unclear to me why I need to add this information to the hosts file, and why the hpcagent daemon cannot resolve the names even though a ping works.
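
    For comparison, the lookup chain both ping and the daemon go through can be inspected like this (standard commands; the commented output is the Ubuntu 16.04 default):

        # which sources glibc consults for host name lookups
        grep ^hosts /etc/nsswitch.conf
        # e.g.: hosts: files mdns4_minimal [NOTFOUND=return] dns
        # resolve the name the same way a daemon would
        getent hosts lu00113vma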

    I will also ask our local IT for support; maybe they have ideas about the root cause of this.

    I do not know whether this workaround is a real solution for me, because if the IP changes (e.g. after a reboot of the head node) I guess it will no longer work.

    Maybe this new information also helps you understand better what is going on here.

    Thanks a lot for your help.

    best regards,

    Bobby

    Wednesday, July 19, 2017 10:02 AM
  • Hi, I'm having the same problem on RHEL 7. Has this been resolved?

    I can ping the head node from the Linux node, but only with the short name, not with the FQDN.

    Thanks

    Tuesday, February 20, 2018 3:36 PM
  • Hello,

    For me it still works only with the modified hosts file; no other solution so far.

    All the best,

    Bobby

    Monday, February 26, 2018 8:52 AM