Head Node with Compute Node Problem

  • Question

  • Hello,

    We use a server with 16 cores for FEM simulations via Ansys RSM on Windows HPC 2008 SP4 (Windows Server 2008 R2).
    The server is configured as a head node with the compute node role.

    This connection works well, but:

    • If we submit a job, we have to switch the node online manually, and when the job is finished the node switches offline automatically.
      Is this common?
      We have tried to solve this problem by changing the scheduler policy, without success.
      Our node runs with the head node template, so managing the online and offline times as for workstation nodes does not work either.

    Is there a way to keep our node online once we have switched it online manually?
    Ideally, the node would go online automatically when we submit a job.

    Thanks for your help.

    Torsten

    Wednesday, March 12, 2014 9:09 AM

Answers

  • Thanks for your help.

    I solved the problem.
    The head node was configured with both an enterprise and a private network, and this caused the behaviour described above.
    I noticed it when the server went offline automatically after a short job (less than an hour): it went offline at the same minute and second as the previous time, but one hour later.

    Now I have configured the head node with only an enterprise network.
    Problem solved!!

    • Marked as answer by TBley84 Wednesday, April 30, 2014 12:52 PM
    Wednesday, April 30, 2014 12:52 PM
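
The clue in the answer above is that the node went offline at the same minute and second, one hour after the previous time, regardless of when the job ended. A minimal Python sketch of that check follows. The function name `is_hourly_pattern` and the tolerance are hypothetical, and only the first timestamp comes from the thread's operation log; the others are made-up illustration values.

```python
from datetime import datetime, timedelta


def is_hourly_pattern(offline_times, tolerance=timedelta(seconds=5)):
    """Return True if every gap between consecutive offline events is a
    whole number of hours (at least one), within the given tolerance."""
    ordered = sorted(offline_times)
    if len(ordered) < 2:
        return False  # one event cannot show a pattern
    hour = timedelta(hours=1)
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        whole_hours = round(gap / hour)  # nearest whole number of hours
        if whole_hours < 1 or abs(gap - whole_hours * hour) > tolerance:
            return False
    return True


FMT = "%d.%m.%Y %H:%M:%S"

# First value is taken from the thread's log; the rest are hypothetical.
timer_like = [
    datetime.strptime("12.03.2014 16:12:39", FMT),
    datetime.strptime("12.03.2014 17:12:39", FMT),
    datetime.strptime("12.03.2014 19:12:39", FMT),
]

# Hypothetical timestamps that follow job completion times instead.
job_like = [
    datetime.strptime("12.03.2014 16:12:39", FMT),
    datetime.strptime("12.03.2014 16:47:05", FMT),
]

print(is_hourly_pattern(timer_like))
print(is_hourly_pattern(job_like))
```

If the offline events line up on whole hours like this, a periodic process is a more likely cause than the job scheduler, which matches what the poster found.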

All replies

  • If I understand correctly, you want to achieve two things: 1) bring the nodes online before submitting a job; 2) take the nodes offline after the job has finished.

    If so, you could do this with a script that uses the HPC PowerShell cmdlets.


    BR, Yizhong

    Wednesday, March 12, 2014 10:08 AM
  • Thank you for your answer.
    Regarding the node going offline automatically, you misunderstood me.

    We want to keep the node online, but it goes offline automatically after finishing a job.
    For example: I assign four jobs to the node. Then I see the four jobs in the cluster manager.
    Two of them start directly and the other two are queued because there are no free resources.
    After the first two jobs finish, the node goes offline without regard to the two queued jobs.
    They stay queued until I bring the node online manually again.

    We want the node to finish all jobs and then stay online.

    Thanks for your help.

    TBley

    Wednesday, March 12, 2014 11:52 AM
  • Thanks for clarifying. It is interesting that the node goes offline automatically after some jobs have finished.

    Please try to check which user triggers it:

    1. Open the HPC admin console and select the node in Node Management.

    2. Click the action "Operations for the nodes" in the action panel.

    3. In the list view, add the column "Operator".

    4. The list view should now show all operations and their operators. Look for a hint as to why the node goes offline.


    BR, Yizhong

    Wednesday, March 12, 2014 12:58 PM
  • I think that by HPC admin console you mean the cluster manager.
    If not, I am not sure where I can find it.

    So I tried it with the cluster manager.
    In Node Management I went to the operations log of our node.
    There I can only see the columns Last Updated, State and Name.
    Unfortunately I cannot add an additional column there, by right-clicking or in any other way.

    Is there somewhere else where I can check the log files?
    Perhaps the SQL database has some usable information.


    • Edited by TBley84 Thursday, March 13, 2014 8:24 AM
    Thursday, March 13, 2014 7:47 AM
  • The management operation logs can be found in HPC Cluster Manager (the GUI admin console) under Node Management -> Operations. Please check whether there is an operation with the name “Taking nodes offline”. When you select the operation, the detail pane below it shows the detailed operation log entries. Another way to view operation logs is to open HPC PowerShell and use the cmdlets Get-HpcOperation and Get-HpcOperationLog.

    Normally, compute nodes stay online and the node state is independent of the job state. So it is possible that the operation taking the nodes offline was performed by a system admin or by the HPC management service for some reason. The management operation logs can reveal who did this and why.

    The HPC management database does contain the management logs, but looking at the logs there is not recommended.

    Sunday, March 16, 2014 6:00 AM
  • Finally I found the logs.
    For the execution of one job I find two logs:
    1. Bringing nodes online (done by myself)
    2. Taking nodes offline
         Details: (domain\xxx=Servername)

    Time    Message
    12.03.2014 16:12:49    Moving node domain\xxx from state Draining to state Offline
    12.03.2014 16:12:39    Waiting for node domain\xxx to drain
    12.03.2014 16:12:39    Setting the scheduler state for domain\xxx to offline.
    12.03.2014 16:12:39    Taking nodes offline: xxx
    12.03.2014 16:12:39    Updating the scheduler configuration for node domain\xxx
    12.03.2014 16:12:39    Moving node domain\xxx from state Online to state Draining

    There is no sys admin for this server other than myself, so we can rule out another person taking the server offline.

    I hope this gives you an idea of what could be wrong.

    Monday, March 17, 2014 10:39 AM
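
The operation log in the post above is listed newest first, which makes the order of events easy to misread. A small Python sketch like the following could put the entries in chronological order and extract the node state transitions. The helper names `parse_log` and `state_transitions` are hypothetical, the log text is copied from the post (with the spacing after the node name normalized), and the code assumes the `dd.MM.yyyy HH:mm:ss` timestamp format shown there.

```python
import re
from datetime import datetime

# Entries copied from the post above, newest first as displayed.
RAW_LOG = r"""
12.03.2014 16:12:49    Moving node domain\xxx from state Draining to state Offline
12.03.2014 16:12:39    Waiting for node domain\xxx to drain
12.03.2014 16:12:39    Setting the scheduler state for domain\xxx to offline.
12.03.2014 16:12:39    Taking nodes offline: xxx
12.03.2014 16:12:39    Updating the scheduler configuration for node domain\xxx
12.03.2014 16:12:39    Moving node domain\xxx from state Online to state Draining
"""

TRANSITION = re.compile(r"from state (\w+) to state (\w+)")


def parse_log(raw):
    """Return (timestamp, message) tuples, oldest first."""
    entries = []
    for line in raw.strip().splitlines():
        date_part, time_part, message = line.split(None, 2)
        stamp = datetime.strptime(f"{date_part} {time_part}", "%d.%m.%Y %H:%M:%S")
        entries.append((stamp, message))
    # Reverse the newest-first listing, then sort (stable) so that entries
    # sharing a timestamp keep their chronological order.
    entries.reverse()
    entries.sort(key=lambda entry: entry[0])
    return entries


def state_transitions(entries):
    """Extract (timestamp, old_state, new_state) from 'Moving node' messages."""
    result = []
    for stamp, message in entries:
        match = TRANSITION.search(message)
        if match:
            result.append((stamp, match.group(1), match.group(2)))
    return result


entries = parse_log(RAW_LOG)
transitions = state_transitions(entries)
for stamp, old, new in transitions:
    print(stamp.isoformat(sep=" "), old, "->", new)

# Time between leaving Online and reaching Offline.
drain_seconds = (transitions[-1][0] - transitions[0][0]).total_seconds()
print("drain took", drain_seconds, "seconds")
```

Read in this order, the log shows the node moving from Online to Draining and reaching Offline ten seconds later, all within a single "Taking nodes offline" operation.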
  • Can nobody help me?
    Monday, March 31, 2014 9:01 AM
  • Hi,

    So if you don't submit a job and just bring the node online, the node always stays online, right?

    And what is your job and task type? Is it a basic task, a parametric task, or something else?

    Thursday, April 3, 2014 3:25 AM