none
MSHPC 2012 R2 U3 with on-site Linux Nodes - jobs never terminate RRS feed

  • Question

  • Hi,

    Has anyone successfully setup a 2012R2 u3 headnode with on-site linux compute nodes? I am really close, I've got all the way to launching jobs, but a simple (or even empty job containing just a comment), appears to stay in the queue running forever.

    In more detail, here are my steps, which follow technet.microsoft.com/mt595803.aspx

    1. Fresh install of Windows 2012 R2, fully service-packed, and joined to our domain.
    2. Fresh install of HPC 2012 R2, Update 3 from the full ZIP file.
    3. Configure the cluster, templates, users etc.

    HOWEVER: I did note the instructions "Currently...we recommend Topology 5: All nodes only on an enterprise network". My nodes all have infiniband, so I've taken a plunge here and interpreted "recommend" lightly, and gone for Topology 3: my head node has enterprise, but everything else has private+application. I can retest this if we think it's the problem, but the results (later) looked good on this front.

    4. Copied the linux binaries to a public R/W share.
    5. Exported the default certificate into the same place. (Subject Alternative Name is "DNS Name=FI--DIDELXHN.dide.local   DNS Name=FI--DIDELXHN")

    6. On another machine - fi--didelx01, I installed Ubuntu 14.04.04 (fnode - a bit of module adding for the infiniband, but otherwise all straightforward.
    7. Connect to the share as described, and run the python script... And it all worked - IF, my -clusname was just fi--didelxhn, without a .dide.local on the end. If I did that, then the node appeared in the cluster manager in the headnode; if I included .dide.local, it didn't.
    8. The node appeared, I could take it online, and in the properties, it had got an IP address for both the private and application networks - ALTHOUGH it didn't add an entry in System32/drivers/etc/hosts for the application one.

    So everything looked quite good at this point - so, on the head node:

    job submit /jobtemplate:LinuxNodes /jobname:empty /numcores:1 #donothing

    I get a job id returned, but the job sits in the "Active" queue indefinitely. I'm not sure how to tell how far it's got - what logs to look for, etc...

    Any ideas? Many thanks - would be great to get this new feature working, and I'd be very happy to help test/debug this if I can be useful.

    Wes

    Thursday, March 17, 2016 3:33 PM

All replies

  • Found some logs in /opt/hpcnodemanager/logs, so I've cleared them down, and this captures installing one node and running one job. Had to reboot the linux node - think it dropped the inifiband card on the way, but it made no difference to the behaviour of job submission.

    nodemanager.txt:

    [03/17 16:23:27.218] 2047 info: Log system works.
    [03/17 16:23:27.218] 2047 info: Version: 1.6.13.0
    [03/17 16:23:27.220] 2047 info: Cleaning up zombie processes
    [03/17 16:23:27.238] 2047 info: Cleanup zombie result:
    [03/17 16:23:27.238] 2047 debug: Trusted CA File: /opt/hpcnodemanager/certs/nodemanager.pem
    [03/17 16:23:27.238] 2047 warning: HostsFileUri not specified, hosts manager will not be started.
    [03/17 16:23:27.238] 2057 info: Monitoring thread created. Interval 1
    [03/17 16:23:27.238] 2059 debug: ---------> Report to https://fi--didelxhn:40001/api/fi--didelx01/computenodereported with {"Availability":1936876886,"Jobs":[],"JustStarted":1,"MacAddress":"","Name":"FI--DIDELX01"}
    [03/17 16:23:27.238] 2059 debug: Create client to https://fi--didelxhn:40001/api/fi--didelx01/computenodereported, configure: timeout 5 seconds, chuck size 65536
    [03/17 16:23:27.240] 2098 info: Opening at https://0.0.0.0:40002/, result opened.
    [03/17 16:23:27.315] 2059 debug: ---------> Reported to https://fi--didelxhn:40001/api/fi--didelx01/computenodereported response code 200, value -1, interval 30
    [03/17 16:23:30.238] 2058 debug: ---------> Report to https://fi--didelxhn:40001/api/fi--didelx01/registerrequested with {"CoreCount":8,"DistroInfo":"Linux version 4.2.0-34-generic (buildd@lgw01-55) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #39~14.04.1-Ubuntu SMP Fri Mar 11 11:38:02 UTC 2016\n","IpAddress":"","MemoryMegabytes":145055,"NetworksInfo":[{"IpV4":"127.0.0.1/8","IpV6":"","IsIB":0,"MacAddress":"00:00:00:00:00:00","Name":"lo:"},{"IpV4":"129.31.26.100/24","IpV6":"","IsIB":0,"MacAddress":"00:24:e8:43:31:46","Name":"eth0:"},{"IpV4":"11.0.0.2/24","IpV6":"","IsIB":1,"MacAddress":"00:24:e8:43:31:48","Name":"eth1:"},{"IpV4":"","IpV6":"","IsIB":0,"MacAddress":"00:24:e8:43:31:4a","Name":"eth2:"},{"IpV4":"","IpV6":"","IsIB":0,"MacAddress":"00:24:e8:43:31:4c","Name":"eth3:"},{"IpV4":"","IpV6":"","IsIB":0,"MacAddress":"80:00:04:04:fe:80:00:00:00:00:00:00:00:08:f1:04:03:99:ba:d9","Name":"ib0:"}],"NodeName":"FI--DIDELX01","SocketCount":2,"Time":"Thu Mar 17 16:23:29 2016\n"}
    [03/17 16:23:30.238] 2058 debug: Create client to https://fi--didelxhn:40001/api/fi--didelx01/registerrequested, configure: timeout 5 seconds, chuck size 65536
    [03/17 16:23:30.300] 2058 debug: ---------> Reported to https://fi--didelxhn:40001/api/fi--didelx01/registerrequested response code 200, value -1, interval 300
    [03/17 16:23:30.350] 2092 info: Request: Uri /api/FI--DIDELX01/metricconfig
    [03/17 16:23:30.350] 2092 debug: Request: Uri /api/FI--DIDELX01/metricconfig, Node FI--DIDELX01, Method metricconfig
    [03/17 16:23:30.350] 2092 debug: AuthenticationKey found
    [03/17 16:23:30.350] 2092 debug: CallbackUri found udp://FI--DIDELXHN.dide.local:9894/api/e250d3b6-ea7d-4554-90c7-b5677beb4a61/metricreported
    [03/17 16:23:30.353] 2098 info: Request: Uri /api/FI--DIDELX01/metric
    [03/17 16:23:30.353] 2098 debug: Request: Uri /api/FI--DIDELX01/metric, Node FI--DIDELX01, Method metric
    [03/17 16:23:30.353] 2098 debug: AuthenticationKey found
    [03/17 16:23:30.353] 2098 debug: CallbackUri found udp://FI--DIDELXHN.dide.local:9894/api/e250d3b6-ea7d-4554-90c7-b5677beb4a61/metricreported
    [03/17 16:23:30.716] 2093 debug: Json: {"MetricCounters":[{"InstanceId":1,"InstanceName":"_Total","MetricId":1,"Path":"\\Processor\\% Processor Time"},{"InstanceId":0,"InstanceName":null,"MetricId":2,"Path":"\\Memory\\Available MBytes"},{"InstanceId":1,"InstanceName":"_Total","MetricId":3,"Path":"\\PhysicalDisk\\Disk Bytes/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":4,"Path":"\\Memory\\Pages/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":5,"Path":"\\System\\Context switches/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":6,"Path":"\\System\\System Calls/sec"},{"InstanceId":1,"InstanceName":"_Total","MetricId":7,"Path":"\\LogicalDisk\\Avg. Disk Queue Length"},{"InstanceId":0,"InstanceName":null,"MetricId":8,"Path":"\\Node Manager\\Number of Cores in use"},{"InstanceId":0,"InstanceName":null,"MetricId":9,"Path":"\\Node Manager\\Number of Running Jobs"},{"InstanceId":0,"InstanceName":null,"MetricId":10,"Path":"\\Node Manager\\Number of Running Tasks"},{"InstanceId":0,"InstanceName":null,"MetricId":11,"Path":"\\LogicalDisk\\% Free Space"},{"InstanceId":0,"InstanceName":null,"MetricId":12,"Path":"\\Network Interface\\Bytes Total/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":13,"Path":"\\GPU\\GPU Time (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":14,"Path":"\\GPU\\GPU Fan Speed (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":15,"Path":"\\GPU\\GPU Memory Usage (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":16,"Path":"\\GPU\\GPU Memory Used (MB)"},{"InstanceId":0,"InstanceName":null,"MetricId":17,"Path":"\\GPU\\GPU Power Usage (Watts)"},{"InstanceId":0,"InstanceName":null,"MetricId":18,"Path":"\\GPU\\GPU SM Clock (MHz)"},{"InstanceId":0,"InstanceName":null,"MetricId":19,"Path":"\\GPU\\GPU Temperature (degrees C)"}]}
    [03/17 16:23:30.716] 2094 debug: Json: {"MetricCounters":[{"InstanceId":1,"InstanceName":"_Total","MetricId":1,"Path":"\\Processor\\% Processor Time"},{"InstanceId":0,"InstanceName":null,"MetricId":2,"Path":"\\Memory\\Available MBytes"},{"InstanceId":1,"InstanceName":"_Total","MetricId":3,"Path":"\\PhysicalDisk\\Disk Bytes/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":4,"Path":"\\Memory\\Pages/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":5,"Path":"\\System\\Context switches/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":6,"Path":"\\System\\System Calls/sec"},{"InstanceId":1,"InstanceName":"_Total","MetricId":7,"Path":"\\LogicalDisk\\Avg. Disk Queue Length"},{"InstanceId":0,"InstanceName":null,"MetricId":8,"Path":"\\Node Manager\\Number of Cores in use"},{"InstanceId":0,"InstanceName":null,"MetricId":9,"Path":"\\Node Manager\\Number of Running Jobs"},{"InstanceId":0,"InstanceName":null,"MetricId":10,"Path":"\\Node Manager\\Number of Running Tasks"},{"InstanceId":0,"InstanceName":null,"MetricId":11,"Path":"\\LogicalDisk\\% Free Space"},{"InstanceId":0,"InstanceName":null,"MetricId":12,"Path":"\\Network Interface\\Bytes Total/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":13,"Path":"\\GPU\\GPU Time (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":14,"Path":"\\GPU\\GPU Fan Speed (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":15,"Path":"\\GPU\\GPU Memory Usage (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":16,"Path":"\\GPU\\GPU Memory Used (MB)"},{"InstanceId":0,"InstanceName":null,"MetricId":17,"Path":"\\GPU\\GPU Power Usage (Watts)"},{"InstanceId":0,"InstanceName":null,"MetricId":18,"Path":"\\GPU\\GPU SM Clock (MHz)"},{"InstanceId":0,"InstanceName":null,"MetricId":19,"Path":"\\GPU\\GPU Temperature (degrees C)"}]}
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 1, InstanceId 1, InstanceName _Total Path \Processor\% Processor Time
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 2, InstanceId 0, InstanceName  Path \Memory\Available MBytes
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 3, InstanceId 1, InstanceName _Total Path \PhysicalDisk\Disk Bytes/sec
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 4, InstanceId 0, InstanceName  Path \Memory\Pages/sec
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 5, InstanceId 0, InstanceName  Path \System\Context switches/sec
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 6, InstanceId 0, InstanceName  Path \System\System Calls/sec
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 7, InstanceId 1, InstanceName _Total Path \LogicalDisk\Avg. Disk Queue Length
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 8, InstanceId 0, InstanceName  Path \Node Manager\Number of Cores in use
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 9, InstanceId 0, InstanceName  Path \Node Manager\Number of Running Jobs
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 10, InstanceId 0, InstanceName  Path \Node Manager\Number of Running Tasks
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 11, InstanceId 0, InstanceName  Path \LogicalDisk\% Free Space
    [03/17 16:23:30.716] 2094 debug: Enabled counter MetricId 12, InstanceId 0, InstanceName  Path \Network Interface\Bytes Total/sec
    [03/17 16:23:30.716] 2094 debug: Disabled counter MetricId 13, InstanceId 0, InstanceName  Path \GPU\GPU Time (%)
    [03/17 16:23:30.716] 2094 debug: Disabled counter MetricId 14, InstanceId 0, InstanceName  Path \GPU\GPU Fan Speed (%)
    [03/17 16:23:30.716] 2094 debug: Disabled counter MetricId 15, InstanceId 0, InstanceName  Path \GPU\GPU Memory Usage (%)
    [03/17 16:23:30.716] 2093 debug: UUID variant rfc true
    [03/17 16:23:30.716] 2094 debug: Disabled counter MetricId 16, InstanceId 0, InstanceName  Path \GPU\GPU Memory Used (MB)
    [03/17 16:23:30.716] 2094 debug: Disabled counter MetricId 17, InstanceId 0, InstanceName  Path \GPU\GPU Power Usage (Watts)
    [03/17 16:23:30.716] 2094 debug: Disabled counter MetricId 18, InstanceId 0, InstanceName  Path \GPU\GPU SM Clock (MHz)
    [03/17 16:23:30.716] 2094 debug: Disabled counter MetricId 19, InstanceId 0, InstanceName  Path \GPU\GPU Temperature (degrees C)
    [03/17 16:23:30.716] 2093 info: getaddrinfo server FI--DIDELXHN.dide.local, port 9894
    [03/17 16:23:35.718] 2093 error: getaddrinfo failed Name or service not known
    [03/17 16:23:47.355] 2092 info: Request: Uri /api/FI--DIDELX01/metricconfig
    [03/17 16:23:47.355] 2092 debug: Request: Uri /api/FI--DIDELX01/metricconfig, Node FI--DIDELX01, Method metricconfig
    [03/17 16:23:47.355] 2092 debug: AuthenticationKey found
    [03/17 16:23:47.355] 2092 debug: CallbackUri found udp://FI--DIDELXHN.dide.local:9894/api/e250d3b6-ea7d-4554-90c7-b5677beb4a61/metricreported
    [03/17 16:23:47.355] 2098 debug: Json: {"MetricCounters":[{"InstanceId":1,"InstanceName":"_Total","MetricId":1,"Path":"\\Processor\\% Processor Time"},{"InstanceId":0,"InstanceName":null,"MetricId":2,"Path":"\\Memory\\Available MBytes"},{"InstanceId":1,"InstanceName":"_Total","MetricId":3,"Path":"\\PhysicalDisk\\Disk Bytes/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":4,"Path":"\\Memory\\Pages/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":5,"Path":"\\System\\Context switches/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":6,"Path":"\\System\\System Calls/sec"},{"InstanceId":1,"InstanceName":"_Total","MetricId":7,"Path":"\\LogicalDisk\\Avg. Disk Queue Length"},{"InstanceId":0,"InstanceName":null,"MetricId":8,"Path":"\\Node Manager\\Number of Cores in use"},{"InstanceId":0,"InstanceName":null,"MetricId":9,"Path":"\\Node Manager\\Number of Running Jobs"},{"InstanceId":0,"InstanceName":null,"MetricId":10,"Path":"\\Node Manager\\Number of Running Tasks"},{"InstanceId":0,"InstanceName":null,"MetricId":11,"Path":"\\LogicalDisk\\% Free Space"},{"InstanceId":0,"InstanceName":null,"MetricId":12,"Path":"\\Network Interface\\Bytes Total/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":13,"Path":"\\GPU\\GPU Time (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":14,"Path":"\\GPU\\GPU Fan Speed (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":15,"Path":"\\GPU\\GPU Memory Usage (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":16,"Path":"\\GPU\\GPU Memory Used (MB)"},{"InstanceId":0,"InstanceName":null,"MetricId":17,"Path":"\\GPU\\GPU Power Usage (Watts)"},{"InstanceId":0,"InstanceName":null,"MetricId":18,"Path":"\\GPU\\GPU SM Clock (MHz)"},{"InstanceId":0,"InstanceName":null,"MetricId":19,"Path":"\\GPU\\GPU Temperature (degrees C)"}]}
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 1, InstanceId 1, InstanceName _Total Path \Processor\% Processor Time
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 2, InstanceId 0, InstanceName  Path \Memory\Available MBytes
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 3, InstanceId 1, InstanceName _Total Path \PhysicalDisk\Disk Bytes/sec
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 4, InstanceId 0, InstanceName  Path \Memory\Pages/sec
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 5, InstanceId 0, InstanceName  Path \System\Context switches/sec
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 6, InstanceId 0, InstanceName  Path \System\System Calls/sec
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 7, InstanceId 1, InstanceName _Total Path \LogicalDisk\Avg. Disk Queue Length
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 8, InstanceId 0, InstanceName  Path \Node Manager\Number of Cores in use
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 9, InstanceId 0, InstanceName  Path \Node Manager\Number of Running Jobs
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 10, InstanceId 0, InstanceName  Path \Node Manager\Number of Running Tasks
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 11, InstanceId 0, InstanceName  Path \LogicalDisk\% Free Space
    [03/17 16:23:47.355] 2098 debug: Enabled counter MetricId 12, InstanceId 0, InstanceName  Path \Network Interface\Bytes Total/sec
    [03/17 16:23:47.355] 2098 debug: Disabled counter MetricId 13, InstanceId 0, InstanceName  Path \GPU\GPU Time (%)
    [03/17 16:23:47.355] 2098 debug: Disabled counter MetricId 14, InstanceId 0, InstanceName  Path \GPU\GPU Fan Speed (%)
    [03/17 16:23:47.355] 2098 debug: Disabled counter MetricId 15, InstanceId 0, InstanceName  Path \GPU\GPU Memory Usage (%)
    [03/17 16:23:47.355] 2098 debug: Disabled counter MetricId 16, InstanceId 0, InstanceName  Path \GPU\GPU Memory Used (MB)
    [03/17 16:23:47.355] 2098 debug: Disabled counter MetricId 17, InstanceId 0, InstanceName  Path \GPU\GPU Power Usage (Watts)
    [03/17 16:23:47.355] 2098 debug: Disabled counter MetricId 18, InstanceId 0, InstanceName  Path \GPU\GPU SM Clock (MHz)
    [03/17 16:23:47.355] 2098 debug: Disabled counter MetricId 19, InstanceId 0, InstanceName  Path \GPU\GPU Temperature (degrees C)
    [03/17 16:23:47.358] 2094 info: Request: Uri /api/FI--DIDELX01/metric
    [03/17 16:23:47.358] 2094 debug: Request: Uri /api/FI--DIDELX01/metric, Node FI--DIDELX01, Method metric
    [03/17 16:23:47.358] 2094 debug: AuthenticationKey found
    [03/17 16:23:47.358] 2094 debug: CallbackUri found udp://FI--DIDELXHN.dide.local:9894/api/e250d3b6-ea7d-4554-90c7-b5677beb4a61/metricreported
    [03/17 16:23:47.359] 2094 debug: Json: {"MetricCounters":[{"InstanceId":1,"InstanceName":"_Total","MetricId":1,"Path":"\\Processor\\% Processor Time"},{"InstanceId":0,"InstanceName":null,"MetricId":2,"Path":"\\Memory\\Available MBytes"},{"InstanceId":1,"InstanceName":"_Total","MetricId":3,"Path":"\\PhysicalDisk\\Disk Bytes/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":4,"Path":"\\Memory\\Pages/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":5,"Path":"\\System\\Context switches/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":6,"Path":"\\System\\System Calls/sec"},{"InstanceId":1,"InstanceName":"_Total","MetricId":7,"Path":"\\LogicalDisk\\Avg. Disk Queue Length"},{"InstanceId":0,"InstanceName":null,"MetricId":8,"Path":"\\Node Manager\\Number of Cores in use"},{"InstanceId":0,"InstanceName":null,"MetricId":9,"Path":"\\Node Manager\\Number of Running Jobs"},{"InstanceId":0,"InstanceName":null,"MetricId":10,"Path":"\\Node Manager\\Number of Running Tasks"},{"InstanceId":0,"InstanceName":null,"MetricId":11,"Path":"\\LogicalDisk\\% Free Space"},{"InstanceId":0,"InstanceName":null,"MetricId":12,"Path":"\\Network Interface\\Bytes Total/sec"},{"InstanceId":0,"InstanceName":null,"MetricId":13,"Path":"\\GPU\\GPU Time (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":14,"Path":"\\GPU\\GPU Fan Speed (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":15,"Path":"\\GPU\\GPU Memory Usage (%)"},{"InstanceId":0,"InstanceName":null,"MetricId":16,"Path":"\\GPU\\GPU Memory Used (MB)"},{"InstanceId":0,"InstanceName":null,"MetricId":17,"Path":"\\GPU\\GPU Power Usage (Watts)"},{"InstanceId":0,"InstanceName":null,"MetricId":18,"Path":"\\GPU\\GPU SM Clock (MHz)"},{"InstanceId":0,"InstanceName":null,"MetricId":19,"Path":"\\GPU\\GPU Temperature (degrees C)"}]}
    [03/17 16:23:57.315] 2059 debug: ---------> Report to https://fi--didelxhn:40001/api/fi--didelx01/computenodereported with {"Availability":1936876886,"Jobs":[],"JustStarted":0,"MacAddress":"","Name":"FI--DIDELX01"}
    [03/17 16:23:57.315] 2059 debug: Create client to https://fi--didelxhn:40001/api/fi--didelx01/computenodereported, configure: timeout 5 seconds, chuck size 65536
    [03/17 16:23:57.368] 2059 debug: ---------> Reported to https://fi--didelxhn:40001/api/fi--didelx01/computenodereported response code 200, value 3000000, interval 3000
    [03/17 16:24:21.036] 2097 info: Request: Uri /api/FI--DIDELX01/startjobandtask
    [03/17 16:24:21.036] 2097 debug: Request: Uri /api/FI--DIDELX01/startjobandtask, Node FI--DIDELX01, Method startjobandtask
    [03/17 16:24:21.036] 2097 debug: AuthenticationKey found
    [03/17 16:24:21.036] 2097 debug: CallbackUri found https://FI--DIDELXHN.dide.local:40001/api/FI--DIDELX01/taskcompleted
    [03/17 16:24:21.073] 2093 debug: Json: {"m_Item1":{"JobId":48,"ResIds":[69],"TaskId":50},"m_Item2":{"affinity":[1],"commandLine":"#donothing","environmentVariables":{"CCP_CLUSTER_NAME":"FI--DIDELXHN","CCP_COREIDS":"0","CCP_EXCLUSIVE":"False","CCP_ISADMIN":"1","CCP_JOBID":"48","CCP_JOBNAME":"","CCP_JOBTYPE":"Batch","CCP_MPI_NETMASK":"12.0.0.0/255.255.255.0","CCP_NODES":"1 FI--DIDELX01 1","CCP_NODES_CORES":"1 FI--DIDELX01 1","CCP_NUMCPUS":"1","CCP_OWNER_SID":"S-1-5-21-343818398-725345543-682003330-8202","CCP_REQUIREDNODES":"","CCP_RERUNNABLE":"False","CCP_RETRY_COUNT":"0","CCP_RUNTIME":"2147483647","CCP_SERVICEREGISTRATION_PATH":"\\\\FI--DIDELXHN\\HpcServiceRegistration","CCP_TASKID":"1","CCP_TASKINSTANCEID":"0","CCP_TASKSYSTEMID":"50","HPC_RUNTIMESHARE":"\\\\FI--DIDELXHN\\Runtime$","WCF_NETWORKPREFIX":"Application"},"stderr":null,"stdin":null,"stdout":null,"taskRequeueCount":0,"workingDirectory":null},"m_Item3":"DIDE\\wrhquentin","m_Item4":"Spurgeon7","m_Item5":null,"m_Item6":null}
    [03/17 16:24:21.073] 2093 debug: Job 48, Task 50.999: Run the job as root.
    [03/17 16:24:21.073] 2093 info: Job 48, Task 50.0: Change requeue count from 0 to 0, processKey 50
    [03/17 16:24:21.073] 2093 debug: Job 48, Task 50.0: , stream ? false
    [03/17 16:24:21.073] 2093 debug: Job 48, Task 50.0: StartTask for ProcessKey 50, process count 1
    [03/17 16:24:21.073] 2093 debug: Job 48, Task 50.0: Created thread 140088444860160
    [03/17 16:24:21.076] 2212 debug: Job 48, Task 50.0: 'chown -R root /tmp/nodemanager_task_50_0.mciirM', exitCode 0, output .
    [03/17 16:24:21.079] 2212 debug: Job 48, Task 50.0: 'chmod -R u+rwX /tmp/nodemanager_task_50_0.mciirM', exitCode 0, output .
    [03/17 16:24:21.099] 2212 debug: Job 48, Task 50.0: '/bin/bash PrepareTask.sh 50_0 0', exitCode 0, output .
    [03/17 16:24:21.100] 2212 debug: Job 48, Task 50.0: Monitor the forked process 2229
    [03/17 16:24:21.100] 2230 info: Started reading pipe thread
    [03/17 16:24:21.100] 2088 debug: Job 48, Task 50.0: Process started 2229
    [03/17 16:24:21.124] 2212 info: Job 48, Task 50.0: Process 2229: exite code 0
    [03/17 16:24:21.135] 2212 debug: Job 48, Task 50.0: Process 2229: Monitor ended
    [03/17 16:24:21.144] 2212 debug: Job 48, Task 50.0: '/bin/bash EndTask.sh 50_0 2229 1', exitCode 0, output .
    [03/17 16:24:21.155] 2212 debug: Job 48, Task 50.0: Statistics: 0
    0
    1671168


    [03/17 16:24:21.155] 2212 debug: Job 48, Task 50.0: WorkingSet 1671168
    [03/17 16:24:21.169] 2212 debug: Job 48, Task 50.0: '/bin/bash CleanupTask.sh 50_0 2229', exitCode 0, output .
    [03/17 16:24:21.172] 2212 debug: Job 48, Task 50.0: 'rm -rf /tmp/nodemanager_task_50_0.mciirM', exitCode 0, output .
    [03/17 16:24:21.172] 2212 debug: Job 48, Task 50.0: Callback to https://FI--DIDELXHN.dide.local:40001/api/FI--DIDELX01/taskcompleted with {"JobId":48,"NodeName":"FI--DIDELX01","TaskInfo":{"ExitCode":0,"Exited":1,"KernelProcessorTime":0,"Message":"STDOUT: \nSTDERR: \n","NumberOfProcesses":0,"PrimaryTask":1,"ProcessIds":"","TaskId":50,"TaskRequeueCount":0,"UserProcessorTime":0,"WorkingSet":1632}}
    [03/17 16:24:21.172] 2212 debug: Create client to https://FI--DIDELXHN.dide.local:40001/api/FI--DIDELX01/taskcompleted, configure: timeout 5 seconds, chuck size 65536
    [03/17 16:24:26.174] 2212 error: Job 48, Task 50.0: Exception when sending back task result. Error resolving address

    50_0_trust.txt : (which was created when I launched the job)

        adding FI--DIDELX01

    hpclinuxagent.log :

    2016/03/17 16:23:27 The command line is: /opt/hpcnodemanager/hpcagent enable
    2016/03/17 16:23:27 The command line is: /opt/hpcnodemanager/hpcagent daemon
    2016/03/17 16:23:27 The full cluster name is fi--didelxhn
    2016/03/17 16:23:27 Configure iptables to allow incoming tcp connection to 40000 and 40002.
    2016/03/17 16:23:28 HPC node manager process started
    2016/03/17 16:23:30 Daemon pid: 2031
    2016/03/17 16:23:30 HPC Linux node manager daemon is enabled

    Thursday, March 17, 2016 4:33 PM
  • So I think the problem is that fi--didelxhn.dide.local can't be pinged by the linux node - fi--didelxhn.dide.ic.ac.uk can though. I'm not quite sure why the original certificate used .dide.local instead of .dide.ic.ac.uk - that's something I can ask the domain admin... So perhaps I should try and create a new certificate with the full .dide.ic.ac.uk in the host name, instead of the default one that it set up for me...

    Sorry for thinking out loud - but hope this might be useful to someone!

    Thursday, March 17, 2016 4:44 PM