Can't cancel taking node offline action.

  • Question

  • 1. I took a node offline, and it went into the Draining state because a job was still running on it.

    2. The job later finished and the cores in use dropped to 0, but the node stays in the Draining state, and cancelling the take-node-offline action does not work.

    3. Restarting the HPC agent service, the node, and the head node does not help.

    Tuesday, August 4, 2020 5:02 PM


  • Found the problem with help from the HPC experts.

    The RegisterUri and HeartbeatUri entries in /opt/hpcagent/nodemanage.json contained the wrong hostname. The issue was resolved after updating both entries with the correct hostname, deleting the Linux node from the HPC head node, and rebooting the Linux node.
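    For anyone who wants to check for the same misconfiguration, here is a minimal Python sketch. It assumes that RegisterUri and HeartbeatUri are top-level string values in the JSON file, which may not match every agent version; the function names and the example hostnames are made up for illustration.

```python
import json
from urllib.parse import urlsplit

# Keys named in the answer above. Assumption: both are top-level
# string values in the agent's JSON config file.
URI_KEYS = ("RegisterUri", "HeartbeatUri")


def find_wrong_hosts(config, expected_host):
    """Return {key: host} for every URI whose host is not expected_host.

    A key that is missing or not a string is reported with host None.
    """
    wrong = {}
    for key in URI_KEYS:
        uri = config.get(key)
        if not isinstance(uri, str):
            wrong[key] = None
            continue
        host = urlsplit(uri).hostname  # already lower-cased, or None
        if host is None or host != expected_host.lower():
            wrong[key] = host
    return wrong


def check_file(path, expected_host):
    """Load the JSON config at path and run the hostname check on it."""
    with open(path, encoding="utf-8") as f:
        return find_wrong_hosts(json.load(f), expected_host)
```

    Calling check_file("/opt/hpcagent/nodemanage.json", "<your head node name>") on the Linux node returns an empty dict when both URIs already point at the expected host. The sketch only reads the file; fixing the entries, removing the node from the head node, and rebooting are still manual steps.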

    • Edited by lijun1234 Wednesday, August 5, 2020 3:38 AM
    • Marked as answer by lijun1234 Wednesday, August 5, 2020 3:38 AM
    Wednesday, August 5, 2020 3:37 AM