Job Manager shows incorrect memory amount

  • Question

  • These systems are running HPC Pack 2016.  The headnode is on Update 2 with the latest QFE.  All members are workstation nodes running Windows 10.  About half are on Update 1, the rest are on Update 2 with the latest QFE.

    When I look at nodes in the Cluster Manager, the specs are shown correctly: all members have 32GB RAM.  However, when anyone looks at the Job Manager (a user, or me as the admin), several nodes are displayed as having only 16GB, so users are skipping them.  The nodes with this discrepancy are running Update 2 with the QFE, though this issue existed before that was applied.

    Any ideas?  My biggest concern is whether the cluster will use the resources correctly.  If so, I can at least tell users not to skip these nodes.  It would be nice to fix the display issue as well.

    Friday, June 21, 2019 2:53 PM

All replies

  • Hi bryan.doe,

    Thanks for reporting this issue. So the HPC Cluster Manager shows the correct memory size for the workstation nodes, but the HPC Job Manager does not, right? Could you share a screenshot comparing the two? We may want to check the HpcManagement service logs as well.

    Regards,

    Yutong Sun

    Friday, July 26, 2019 3:36 PM
  • That's correct.  The screenshot has a lot going on, but I have the Cluster Manager running, and opened Job Manager separately.  In Cluster Manager, you can see each node has 4 cores and 32GB RAM.  In Job Manager, when selecting nodes to run the job on, the cores are correct but the memory is wrong.  This is the case whether "Run this job only on nodes in the following list" is selected or not.

    Monday, July 29, 2019 9:12 PM
  • Hi bryan.doe,

    Right, the cores are correct since you under-subscribed cores on some nodes. The memory values are the interesting part. Is there any configuration difference between nodes HPC-01 and HPC-02? HPC-01 shows the correct 32GB of memory, while HPC-02 shows only half. If you open the same dialogue from HPC Cluster Manager, do the nodes show the correct amount of memory?

    Regards,

    Yutong Sun

    Thursday, August 1, 2019 6:51 AM
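
When comparing what the two tools report, it can help to list only the nodes whose values disagree. The following is a minimal sketch, assuming the per-node memory figures have been copied by hand from each view; the dictionaries and the function name `find_memory_mismatches` are hypothetical, the values are the ones quoted in this thread for HPC-01 and HPC-02, and nothing here calls an HPC Pack API.

```python
# Hypothetical per-node memory (in GB), copied by hand from each tool's node list.
cluster_manager_gb = {"HPC-01": 32, "HPC-02": 32}
job_manager_gb = {"HPC-01": 32, "HPC-02": 16}


def find_memory_mismatches(reference, displayed):
    """Return {node: (reference_gb, displayed_gb)} for nodes whose values differ."""
    mismatches = {}
    for node, ref_gb in sorted(reference.items()):
        # None means the node does not appear in the second view at all.
        shown_gb = displayed.get(node)
        if shown_gb != ref_gb:
            mismatches[node] = (ref_gb, shown_gb)
    return mismatches


mismatches = find_memory_mismatches(cluster_manager_gb, job_manager_gb)
for node, (ref_gb, shown_gb) in mismatches.items():
    print(f"{node}: Cluster Manager {ref_gb} GB, Job Manager {shown_gb} GB")
```

The resulting list of mismatched nodes can then be checked against node group membership or replacement history.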
  • Correct, the display of cores isn't an issue; it's just the memory.  And yes, if I open the New Job dialogue from Cluster Manager, it's the same as in Job Manager: it displays several nodes as having less memory than they actually do.

    Looking closer, it appears that only the "HPC Machines" group (one I created) is affected.  Several of those machines have been replaced recently, but the old members were deleted.  Obviously the server knows the correct specs, since it shows them in Resource Management.

    Thanks!

    Thursday, August 1, 2019 10:49 AM
  • You mentioned the machines had been replaced recently.

    Did you take these nodes offline and bring them online again after the replacement? You must do so for changes to CPU cores or memory to take effect in the job scheduler.

    Tuesday, August 6, 2019 4:16 AM
  • Anytime I replace a node, I take it offline and delete it from the console, so when they're replaced they're newly added to the cluster.

    Tuesday, August 6, 2019 2:30 PM