Cannot reimage HPC node

  • Question

  • Hi,

    A node in a Windows HPC Server 2008 cluster failed to boot yesterday. From the hardware console I could see that some system files were corrupt, so I booted from the Windows Server DVD to try to recover the installation. From the command line I ran chkdsk and learned that many files were damaged, so I decided to reimage the node using WDS.

    When I set the node to reimage, it gets the IP 10.1.1.4 from the DHCP server and then receives the command to boot to WINPE. The HPC manager gives the following log message:

    9/16/2010 8:13:23 PM Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)

    Then the node shows the message "Windows is loading files..." and the progress bar begins to fill. Under the progress bar I read "IP: 10.1.1.1", which is the IP of my head node, where the WDS service is running.

    But when the progress bar is full it changes to an error screen with the header:

    Windows Boot Manager (Server IP: 010.001.001.001)

    And the error reported is:

    Status: 0xc0000001

    Info: The boot selection failed because a required device is inaccessible.

    It only lets me retry or reboot, with the same result. Additionally, the WDS log reports an error:

    The Following Client failed TFTP Download:

    IP: 10.1.1.4

    Filename: \Boot\x64\boot.wim

    ErrorCode: 4317

    The file C:\Program Files\Microsoft HPC Pack\Data\Boot\x64\boot.wim exists and has read permissions for SYSTEM, Administrators and Users.

    What's wrong and how can I fix it?

    Thanks

    Monday, September 20, 2010 7:42 AM

All replies

  • Did you find a fix for this?  I'm having the same issue.
    Tuesday, September 21, 2010 7:25 AM
  • Hi Steppy,

    Are you receiving the same ErrorCode (4317) from the WDS log?

    I'm waiting for some support. In the meantime I have tried repairing the MBR using the Windows Server 2008 install DVD. No success. At least I know this manages to boot into WinPE.

    Another possibility is that the boot.wim file is broken, but it seems to be OK and I don't know where I can get a new boot.wim. From the install DVD, perhaps?

    I also thought about deleting the node entry from the cluster and rediscovering it as a bare-metal node, but I think that makes no difference as far as WDS deployment is concerned.

    Any help will be appreciated.

    Thanks

    Tuesday, September 21, 2010 7:35 AM
  • Also, did you have to turn on WDS Trace logging to see the WDS Log (if not, where is the log located?).

    Thanks!

    Tuesday, September 21, 2010 7:37 AM
  • The log was on, at least in my case. I get to it through Start -> Administrative tools -> Server Manager and then navigating to Roles -> Windows Deployment Services. The first control in the summary is an event list where I can open the error events to see the code and details of the message.
    Tuesday, September 21, 2010 7:44 AM
  • Yeah, we tried the bare metal approach but got the same error.  I'll check the Server Manager logs when I'm at work tomorrow and let you know.  Unfortunately I'm at home now so I can't really troubleshoot it any further.  There is an MS article on how to enable Trace Logging for WDS. Not sure if it will provide more info, but it might be worth a shot: http://support.microsoft.com/kb/936625
    Tuesday, September 21, 2010 7:49 AM
  • The first step in the procedure didn't work for me. I got the following, perhaps because I have always managed the deployment using the HPC Cluster Manager:

     

    C:\Windows\system32>wdsutil /get-server /show:all /detailed

    Windows Deployment Services Management Utility [Version 6.0.6002.18005]
    Copyright (C) Microsoft Corporation. All rights reserved.


    An error occurred while trying to execute the command.
    Error Code: 0xC104013B
    Error Description: The Windows Deployment Services Deployment Server management
    tools are not configured.

     

    I also saw that the registry entries mentioned in the article are not set to enable tracing. Before changing anything else I would like to find out what error code 4317 means. I haven't found any reference to it so far.

    Tuesday, September 21, 2010 8:18 AM
  • We are getting the same error code - 4317.  Not much more to report though.  I'll keep looking.

    Wednesday, September 22, 2010 2:03 AM
  • Just came across this... Not sure if it applies, but it also might be worth a shot:

    http://support.microsoft.com/kb/975710/en-au

    Wednesday, September 22, 2010 2:12 AM
  • Could be, but it doesn't seem to be related. I'm connecting through a switch and it would be strange for anything to be blocked there.

    I opened a support call yesterday and asked for this hotfix. They told me we are going to try other options first, but I have had no further communication since then, so I'm still waiting.

    Any progress on your side?

    • Marked as answer by drioja Thursday, September 23, 2010 5:19 PM
    • Unmarked as answer by drioja Thursday, September 23, 2010 5:19 PM
    Thursday, September 23, 2010 8:41 AM
  • In the end the problem was that the boot.wim file was faulty. It's located in:

    c:\Program Files\Microsoft HPC Pack\data\boot\x86_64

    I replaced it with another one from a fresh installation and it works now.

    • Edited by drioja Thursday, September 23, 2010 5:18 PM completion
    • Marked as answer by drioja Thursday, September 23, 2010 5:20 PM
    Thursday, September 23, 2010 5:17 PM
  • Thanks for that, I'll give that a shot.  Did you just have to install the HPC 2008 Pack on another machine and that installs the boot.wim?
    Wednesday, September 29, 2010 9:50 PM
  • How large was the boot.wim image you replaced?  When I checked mine this morning it was over 12 GB.  This problem is listed in an MS article as being due to a bug in HPC Pack SP1.  However, when I replaced the boot.wim with one from the Windows 2008 source installation disk, it still wouldn't build.  It no longer fails the way it did before; instead it just loops, TFTP'ing Wdsnbp.com. The WDS logs just say TFTP started, TFTP completed, and that repeats every 10 seconds.  The MS article talks about changing the Network config to something else and then changing it back to fix the issue, but I couldn't do that today in case it caused an outage.

    When you replaced the boot.wim file, did you have to do anything else?

    cheers,

    Step.

    Thursday, September 30, 2010 6:20 AM
  • The boot.wim file that failed was 6 GB, which I think was far too large. I took the new one from another cluster and it was about 100 MB. The other cluster had the R2 version of WS2k8, but it worked.

    If you have no access to another cluster, I think you could install HPC Pack on any server or virtual machine running WS2k8 and then take the file from c:\Program Files\Microsoft HPC Pack\data\boot\x86_64

    • Edited by drioja Thursday, September 30, 2010 7:53 AM suggestion
    Thursday, September 30, 2010 7:49 AM
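
    The sizes reported in this thread (6 GB and 12 GB for the failing files, about 100 MB for a working one) suggest a quick sanity check. The following is only a rough sketch: the function name check_wim and the 1 GB threshold are hypothetical choices for illustration, not anything provided by HPC Pack or WDS.

    ```python
    from pathlib import Path

    # Assumption: per the thread, a working boot.wim is roughly 100 MB.
    # The 1 GB cut-off is an arbitrary illustrative threshold.
    DEFAULT_THRESHOLD = 1024 ** 3


    def check_wim(path, threshold=DEFAULT_THRESHOLD):
        """Return (size_in_bytes, is_suspiciously_large), or None if the file is missing."""
        p = Path(path)
        if not p.is_file():
            return None
        size = p.stat().st_size
        return size, size > threshold


    if __name__ == "__main__":
        import sys

        # Usage: python check_wim.py <path-to-boot.wim> [...]
        for arg in sys.argv[1:]:
            result = check_wim(arg)
            if result is None:
                print(f"{arg}: not found")
            else:
                size, too_big = result
                note = "suspiciously large" if too_big else "size looks plausible"
                print(f"{arg}: {size / 1024 ** 2:.0f} MB, {note}")
    ```

    Run it on the head node against the boot.wim in the folder mentioned above. A large file is only a hint that the image has bloated; it does not prove that this is the cause of the TFTP failure.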
  • Finally got this fixed.  I copied our 12 GB boot.wim file and gave it to one of my Server SOE guys. He opened it with the WAIK tools and exported it, which reduced it back to 100-odd MB.  I copied that back in and all is now well.  Just have to organise a time to patch the server so it doesn't happen again.  It's already starting to grow...

    Thanks for all your help.

    Step.

    Monday, November 29, 2010 10:01 AM
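
    The post above does not say which WAIK command was used for the export. One plausible candidate is ImageX's /export option, which copies an image into a new .wim file and leaves unreferenced data behind. The sketch below only builds and prints such a command line; the index 1, the image name "WinPE", the file names and the helper name build_export_command are all assumptions for illustration, so check the image index in your own file before running anything.

    ```python
    def build_export_command(src_wim, dest_wim, index=1, name="WinPE", imagex="imagex"):
        """Build an argument list for: imagex /export <src> <index> <dest> <name>.

        Assumptions: the image to keep is at `index` in the source file, and
        `name` is just a label for the exported image. Neither value comes
        from the thread.
        """
        return [imagex, "/export", str(src_wim), str(index), str(dest_wim), name]


    if __name__ == "__main__":
        cmd = build_export_command("boot.wim", "boot_exported.wim")
        # Print the command for review rather than running it blindly.
        print(" ".join(f'"{part}"' if " " in part else part for part in cmd))
    ```

    After reviewing the printed command, it could be run from a WAIK command prompt, or passed to subprocess.run(cmd, check=True). Keep a copy of the original boot.wim until the node has been reimaged successfully.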