Nodes Fail during Provisioning

  • Question

  • I am deploying compute nodes from bare metal via a WIM image, but it is failing during provisioning right after the HPC Pack is installed on the compute nodes.  So far it successfully images the node; creates the computer account for the node; joins it into the domain; and installs the .NET Framework and HPC Pack. It fails when it gets to "Checking the configuration of the computer node <domain>\<nodename>".  The next entry I get in the log says:

     "The Management service encountered an error while performing a change on the node. Access is denied to the user 'NT AUTHORITY\ANONYMOUS LOGON'. Check the operation log in the Administration Console for more information."

    It repeats this behavior five more times before it fails the operation, dissociates the template from the compute node, and reverts it. I have attempted to reassign the nodes to different templates, but I get the same thing.

    Has anyone seen this before? Any comments or suggestions would be appreciated!

    Thursday, January 22, 2009 6:54 PM

Answers

  • There seems to be something in your domain policy that might be causing this. The provisioning logs look fine for a regular deployment until the issue appears, after the compute nodes join the domain. The compute nodes run as Local System, which translates to the machine identity when they access resources on the network. They fall back to anonymous logon when the compute nodes (CNs) cannot contact the DC, so the machine identity cannot be used.
    This usually happens when the head node, which is acting as a NAT (the compute nodes are not on the public network), has some of its ports blocked by IPsec or by Group Policy in the domain.
    The periodic recurrence of this failure after you allow anonymous logon seems consistent with this theory.
    thanks
    -parmita
    pm
    Thursday, March 26, 2009 6:08 PM
    Moderator
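
The blocked-port theory above can be checked roughly from a compute node (or from the head node) with a short script. This is only a minimal sketch: the port list is an illustrative subset of what a domain member typically needs to reach on a DC (Kerberos, RPC endpoint mapper, LDAP, SMB), not an official list, and the function name is made up for this example.

```python
import socket

# Illustrative subset of TCP ports a domain member commonly needs on a DC.
# This is an assumption for the sketch, not an exhaustive or official list.
DC_TCP_PORTS = {
    88: "Kerberos",
    135: "RPC endpoint mapper",
    389: "LDAP",
    445: "SMB",
}


def check_tcp_ports(host, ports, timeout=3.0):
    """Return {port: bool} telling whether a TCP connection to host:port succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and opens a TCP connection.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name-resolution failures.
            results[port] = False
    return results
```

For example, `check_tcp_ports("dc01.xyz.local", DC_TCP_PORTS)` (the host name is hypothetical) returns a dictionary keyed by port number. A `False` entry only shows that the TCP connection failed from that machine; it does not say whether IPsec, a firewall, or routing is responsible.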

All replies

  • Hello npark,

    Has provisioning from bare metal ever been successful in your environment with the templates you are using? If it has worked in the past can you account for any changes?

    If you use a very basic or default node template to provision, the tasks that run immediately after the HPC Pack installation should have continue-on-failure set to true. Can you try provisioning with a very basic node template to see what occurs? Any time an "Access is denied" error message is logged, it points to a permissions issue. Are you using any strict GPOs or security templates to harden the environment? On one of the nodes that fails, you may want to look at the NetSetup.LOG file, located in 'C:\Windows\Debug', as it may hold some clues.

    Regards,
    Tyler

    • Proposed as answer by Josh Barnard Saturday, January 24, 2009 1:33 AM
    Friday, January 23, 2009 3:03 PM
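
A quick way to skim NetSetup.LOG for trouble is to pull out the lines that report a nonzero status. The sketch below assumes that domain-join steps are logged on lines containing the word "status" followed by a hexadecimal code such as 0x0 (success) or 0x5; that format is an assumption here, so check a few lines of your own log first. The function names are made up for this example.

```python
import re
from pathlib import Path

# Assumed line shape: "... status ... 0x<hex>"; only the first code after
# the word "status" on each line is examined.
STATUS_CODE = re.compile(r"status.*?\b0x([0-9a-fA-F]+)\b", re.IGNORECASE)


def nonzero_status_lines(lines):
    """Yield (line_number, text) for lines whose status code is not zero."""
    for number, line in enumerate(lines, start=1):
        match = STATUS_CODE.search(line)
        if match and int(match.group(1), 16) != 0:
            yield number, line.rstrip()


def scan_netsetup_log(path=r"C:\Windows\Debug\NetSetup.LOG"):
    """Read the log at the path mentioned in the post and list suspect lines."""
    text = Path(path).read_text(errors="replace")
    return list(nonzero_status_lines(text.splitlines()))
```

Running `scan_netsetup_log()` on a failing node returns a list of line numbers and text to read in context. A nonzero code is only a hint of where to look, not a diagnosis.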
  • Hi,

    I am working on a problem involving the same issue.
    I found a workaround here (http://support.microsoft.com/kb/839569) that initially seemed to solve the problem. However, if I wait approximately 12 hours and try to provision a compute node again, the same error ("The Management service encountered an error while performing a change on the node. Access is denied to the user 'NT AUTHORITY\ANONYMOUS LOGON'. Check the operation log in the Administration Console for more information.") is back.

    If I then verify that the workaround settings are still in place (which they are) and try to re-provision the node, it works.

    At this point it repeats every 12 hours.

    I have verified that the "Network Access: Allow anonymous SID/Name translation" setting is "not defined" in the Default Domain Policy by running the RSoP MMC snap-in on the HPC Master Node.

    The last thing we have attempted, which I do not yet have the results from, is to create a new OU, move the HPC Master node and the compute nodes into this OU, and then create a GPO that explicitly enables "Network Access: Allow anonymous SID/Name translation" and apply it to the OU.

    As I said, we are still waiting for the results of enabling the "Network Access: Allow anonymous SID/Name translation" setting via GPO in the domain. But I have my doubts: I previously changed it in the local security policy on the HPC Master node, and the Default Domain Policy does not define any value for the setting, so it should not have changed from what I set on the local computer.

    I would very much appreciate any other suggestions regarding this issue.


    Thanks!


    Best Regards
    Harald

    Thursday, February 5, 2009 6:46 PM
  • Hi Harald,

    Can you explain a little more why you're going with Anonymous logon for your cluster rather than using regular user accounts? This is a scenario I haven't heard much about from other users.

    Thank you,
    ryan
    Ryan Waite - Product Unit Manager - Windows HPC
    Monday, February 9, 2009 7:37 PM
  • A couple of questions for the folks seeing this issue:
    1) The head node is part of the domain, right? Can you ping the DC from the head node?

    2) How were the AD accounts created? Were they created by the template or manually?

    3) Have you had any other issues with transient network errors when connecting to the enterprise network? (This assumes that the DC is not on the head node.)

    We have seen similar errors when the CNs cannot authenticate with the DC for one reason or another. It would be good to understand what your AD setup is like.

    thanks
    -Parmita Mehta
    Program Manager, HPC

    pm
    Monday, February 9, 2009 9:13 PM
    Moderator
  • Ryan Waite said:

    Hi Harald,

    Can you explain a little more why you're going with Anonymous logon for your cluster rather than using regular user accounts? This is a scenario I haven't heard much about from other users.

    Thank you,
    ryan


    Ryan Waite - Product Unit Manager - Windows HPC


    Why? I never made that choice. I specified a domain account in the HPC Cluster Manager "To-do List" wizard...
    Tuesday, February 10, 2009 9:37 AM
  • parmita mehta said:

    A couple of questions for the folks seeing this issue:
    1) The head node is part of the domain, right? Can you ping the DC from the head node?

    2) How were the AD accounts created? Were they created by the template or manually?

    3) Have you had any other issues with transient network errors when connecting to the enterprise network? (This assumes that the DC is not on the head node.)

    We have seen similar errors when the CNs cannot authenticate with the DC for one reason or another. It would be good to understand what your AD setup is like.

    thanks
    -Parmita Mehta
    Program Manager, HPC


    pm


    1) Head node is part of a Windows 2003 Domain. Yes I can ping the DC from the head node.
    2) User account for HPC was created manually, and is part of Domain Admins Group. Computer Accounts are created by the HPC Node Template.
    3) No other problems seen when connecting to enterprise network.


    Tuesday, February 10, 2009 9:41 AM
  • OK, then can you send me the provisioning log for the compute nodes where you see this error, and the management log from the head node?
    This could be a new issue, and the logs would help greatly with diagnosing it!
    thanks
    -parmita
    pm
    Wednesday, February 11, 2009 1:19 AM
    Moderator
  • Hi,


    This is what I see in the Event Viewer on the HPC Master Node:


    Log Name:      Windows HPC Server
    Source:        HpcManagement
    Date:          29-01-2009 11:40:22
    Event ID:      6100
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      XYZ-HPC-Master.XYZ.LOCAL
    Description:
    The operation 'Assigning template W2K8 HPC Editicon x86_64 (Default) to node COMPUTE03.'  failed to run correctly. The operation was initiated by the user: PTS. The operation can be identified by the GUID: 35c66dc9-e8e6-426a-88c0-8424c2170fc4. Using this GUID a log of the operation can be obtained from the hpc powershell command: Get-HpcOperation -id 35c66dc9-e8e6-426a-88c0-8424c2170fc4 | Get-HpcOperationLog
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="HpcManagement" />
        <EventID Qualifiers="0">6100</EventID>
        <Level>2</Level>
        <Task>0</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2009-01-29T10:40:22.000Z" />
        <EventRecordID>461</EventRecordID>
        <Channel>Windows HPC Server</Channel>
        <Computer>XYZ-HPC-Master.XYZ.LOCAL</Computer>
        <Security />
      </System>
      <EventData>
        <Data>The operation 'Assigning template W2K8 HPC Editicon x86_64 (Default) to node COMPUTE03.'  failed to run correctly. The operation was initiated by the user: PTS. The operation can be identified by the GUID: 35c66dc9-e8e6-426a-88c0-8424c2170fc4. Using this GUID a log of the operation can be obtained from the hpc powershell command: Get-HpcOperation -id 35c66dc9-e8e6-426a-88c0-8424c2170fc4 | Get-HpcOperationLog</Data>
      </EventData>
    </Event>

    This is what I get when running the Get-HpcOperation command in PowerShell:


     
    Message                                  TimeCreated               Severity    
    -------                                  -----------               --------    
    Moving node XYZ\COMPUTE03 from state ... 29-01-2009 10:25:26       Information 
    Associating template W2K8 HPC Editico... 29-01-2009 10:25:26       Information 
    Initiating provisioning operations fo... 29-01-2009 10:25:26       Information 
    Connecting to DC: XYZ.LOCAL              29-01-2009 10:25:26       Information 
    Assigning template W2K8 HPC Editicon ... 29-01-2009 10:25:27       Information 
    Searching for an existing account in ... 29-01-2009 10:25:27       Information 
    Found an existing account in Active D... 29-01-2009 10:25:27       Information 
    Initiating configuration operations f... 29-01-2009 10:25:27       Information 
    Waiting for node to boot into WINPE      29-01-2009 10:25:27       Information 
    Sending PXE command to boot node to W... 29-01-2009 11:10:44       Information 
    Mounting Headnode install share          29-01-2009 11:12:34       Information 
    Copying: config\diskpart.txt             29-01-2009 11:12:37       Information 
    Configuring disk partitions              29-01-2009 11:12:39       Information 
    Copying: Images\W2K8-HPC-Edition-x86_... 29-01-2009 11:13:05       Information 
    Creating local directory for install ... 29-01-2009 11:13:48       Information 
    Extracting WIM C:\W2K8-HPC-Edition-x8... 29-01-2009 11:13:49       Information 
    Cleaning up WIM file                     29-01-2009 11:15:31       Information 
    Specializing Windows unattended insta... 29-01-2009 11:15:32       Information 
    Installing Windows (Expected time: 30... 29-01-2009 11:15:34       Information 
    Sending PXE command to boot node to t... 29-01-2009 11:24:54       Information 
    Sending PXE command to boot node to t... 29-01-2009 11:30:52       Information 
    Initiating deployment operations for ... 29-01-2009 11:33:21       Information 
    Joining domain: XYZ.LOCAL                29-01-2009 11:33:21       Information 
    Rebooting                                29-01-2009 11:33:31       Information 
    Sending PXE command to boot node to t... 29-01-2009 11:35:13       Information 
    Installing .NET Framework 3.0            29-01-2009 11:36:28       Information 
    Mounting share \\XYZ-HPC-Master\REMIN... 29-01-2009 11:39:16       Information 
    Copying: z:\setup.exe                    29-01-2009 11:39:18       Information 
    Copying: z:\en-us                        29-01-2009 11:39:19       Information 
    Copying: z:\Setup                        29-01-2009 11:39:20       Information 
    Installing Microsoft HPC Pack            29-01-2009 11:39:22       Information 
    Cleaning up Windows Install Data         29-01-2009 11:40:11       Information 
    Cleaning up HPC Pack Install Data        29-01-2009 11:40:16       Information 
    Checking the configuration of compute... 29-01-2009 11:40:17       Information 
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:20       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:20       Error       
    Failed to execute the change on the t... 29-01-2009 11:40:20       Warning     
    Checking the configuration of compute... 29-01-2009 11:40:20       Information 
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Failed to execute the change on the t... 29-01-2009 11:40:21       Warning     
    Checking the configuration of compute... 29-01-2009 11:40:21       Information 
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Failed to execute the change on the t... 29-01-2009 11:40:21       Warning     
    Checking the configuration of compute... 29-01-2009 11:40:21       Information 
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Failed to execute the change on the t... 29-01-2009 11:40:21       Warning     
    Checking the configuration of compute... 29-01-2009 11:40:21       Information 
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Could not contact node 'COMPUTE03' to... 29-01-2009 11:40:21       Error       
    Failed to execute the change on the t... 29-01-2009 11:40:21       Warning     
    The operation failed and will not be ... 29-01-2009 11:40:21       Error       
    The operation failed due to errors du... 29-01-2009 11:40:21       Warning     
    The parent operation is being rolled ... 29-01-2009 11:40:21       Warning     
    The parent operation is being rolled ... 29-01-2009 11:40:21       Warning     
    The parent operation is being rolled ... 29-01-2009 11:40:21       Warning     
    Dissasociating template from compute ... 29-01-2009 11:40:22       Information 
    Reverted                                 29-01-2009 11:40:22       Information 

    This is the provisioning log from HPC Admin Console:

    Time    Message
    16-01-2009 15:52:43    Reverted
    16-01-2009 15:52:43    Dissasociating template from compute node XYZ\COMPUTE01
    16-01-2009 15:52:43    The parent operation is being rolled back
    16-01-2009 15:52:43    The parent operation is being rolled back
    16-01-2009 15:52:43    The parent operation is being rolled back
    16-01-2009 15:52:43    The operation failed due to errors during execution.
    16-01-2009 15:52:43    The operation failed and will not be retried.
    16-01-2009 15:52:43    The compute node failed to execute the operation.
    16-01-2009 15:52:43    The Management service encountered an error while performing a change on this node. Access is denied to user 'NT AUTHORITY\ANONYMOUS LOGON'. Check the operation log in the Administration Console for more information.
    16-01-2009 15:52:43    Checking the configuration of compute node XYZ\COMPUTE01.
    16-01-2009 15:52:43    The compute node failed to execute the operation.
    16-01-2009 15:52:43    The Management service encountered an error while performing a change on this node. Access is denied to user 'NT AUTHORITY\ANONYMOUS LOGON'. Check the operation log in the Administration Console for more information.
    16-01-2009 15:52:43    Checking the configuration of compute node XYZ\COMPUTE01.
    16-01-2009 15:52:43    The compute node failed to execute the operation.
    16-01-2009 15:52:43    The Management service encountered an error while performing a change on this node. Access is denied to user 'NT AUTHORITY\ANONYMOUS LOGON'. Check the operation log in the Administration Console for more information.
    16-01-2009 15:52:43    Checking the configuration of compute node XYZ\COMPUTE01.
    16-01-2009 15:52:43    The compute node failed to execute the operation.
    16-01-2009 15:52:43    The Management service encountered an error while performing a change on this node. Access is denied to user 'NT AUTHORITY\ANONYMOUS LOGON'. Check the operation log in the Administration Console for more information.
    16-01-2009 15:52:43    Checking the configuration of compute node XYZ\COMPUTE01.
    16-01-2009 15:52:43    The compute node failed to execute the operation.
    16-01-2009 15:52:43    The Management service encountered an error while performing a change on this node. Access is denied to user 'NT AUTHORITY\ANONYMOUS LOGON'. Check the operation log in the Administration Console for more information.
    16-01-2009 15:52:43    Checking the configuration of compute node XYZ\COMPUTE01.
    16-01-2009 15:52:41    Cleaning up HPC Pack Install Data
    16-01-2009 15:52:34    Cleaning up Windows Install Data
    16-01-2009 15:51:44    Installing Microsoft HPC Pack
    16-01-2009 15:51:42    Copying: z:\Setup
    16-01-2009 15:51:41    Copying: z:\en-us
    16-01-2009 15:51:40    Copying: z:\setup.exe
    16-01-2009 15:51:38    Mounting share \\XYZ-HPC-Master\REMINST to drive z:
    16-01-2009 15:48:51    Installing .NET Framework 3.0
    16-01-2009 15:47:42    Sending PXE command to boot node to the current OS.
    16-01-2009 15:46:09    Rebooting
    16-01-2009 15:45:58    Joining domain: XYZ.LOCAL
    16-01-2009 15:45:58    Initiating deployment operations for template: W2K8 HPC Editicon x86_64 (Default)
    16-01-2009 15:43:21    Sending PXE command to boot node to the current OS.
    16-01-2009 15:37:22    Sending PXE command to boot node to the current OS.
    16-01-2009 15:28:20    Installing Windows (Expected time: 30 minutes)
    16-01-2009 15:28:18    Specializing Windows unattended installation script
    16-01-2009 15:28:17    Cleaning up WIM file
    16-01-2009 15:26:35    Extracting WIM C:\W2K8-HPC-Edition-x86_64.WIM to C:\Install
    16-01-2009 15:26:34    Creating local directory for install media
    16-01-2009 15:25:32    Copying: Images\W2K8-HPC-Edition-x86_64.WIM
    16-01-2009 15:25:06    Configuring disk partitions
    16-01-2009 15:25:04    Copying: config\diskpart.txt
    16-01-2009 15:25:01    Mounting Headnode install share
    16-01-2009 15:23:05    Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
    16-01-2009 15:22:52    Waiting for node to boot into WINPE
    16-01-2009 15:22:52    Initiating configuration operations for template: W2K8 HPC Editicon x86_64 (Default)
    16-01-2009 15:22:52    Computer account COMPUTE01 created
    16-01-2009 15:22:51    The computer account COMPUTE01 does not exist, creating a new account in Active Directory.
    16-01-2009 15:22:51    Searching for an existing account in the directory.
    16-01-2009 15:22:50    Connecting to DC: XYZ.LOCAL
    16-01-2009 15:22:50    Initiating provisioning operations for template: W2K8 HPC Editicon x86_64 (Default)
    16-01-2009 15:22:50    Associating template W2K8 HPC Editicon x86_64 (Default) with compute node XYZ\COMPUTE01
    16-01-2009 15:22:50    Moving node XYZ\COMPUTE01 from state Unknown to state Provisioning.
    16-01-2009 15:22:49    Assigning template W2K8 HPC Editicon x86_64 (Default) to node COMPUTE01.

    Friday, February 13, 2009 9:53 AM
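
The Admin Console log above is long, so a small script can make the retry pattern easier to see. The sketch below assumes the log has been copied to text in the same shape as shown here (a `DD-MM-YYYY HH:MM:SS` timestamp, whitespace, then the message); the function names are made up for this example.

```python
import re
from datetime import datetime

# One console log line: "16-01-2009 15:52:43    Message text"
LOG_LINE = re.compile(r"^\s*(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\s+(.*\S)\s*$")


def parse_provisioning_log(text):
    """Return (timestamp, message) pairs in the order they appear in the text."""
    entries = []
    for raw in text.splitlines():
        match = LOG_LINE.match(raw)
        if match:
            stamp = datetime.strptime(match.group(1), "%d-%m-%Y %H:%M:%S")
            entries.append((stamp, match.group(2)))
    return entries


def summarize(entries):
    """Count configuration checks and access-denied errors, and time the run."""
    checks = [t for t, msg in entries if msg.startswith("Checking the configuration")]
    denied = [t for t, msg in entries if "Access is denied" in msg]
    stamps = [t for t, _ in entries]
    return {
        "config_checks": len(checks),
        "access_denied": len(denied),
        "first_denied": min(denied) if denied else None,
        "duration": max(stamps) - min(stamps) if stamps else None,
    }
```

Calling `summarize(parse_provisioning_log(text))` on the pasted log gives the number of configuration checks and access-denied errors, plus the time from the first to the last entry. That makes it easy to compare runs before and after a policy change.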