Bare metal deployment hangs in a PXE loop

    Question

  • Hi,

    I recently upgraded my cluster from 2008 R2 to 2016 and now want to re-image the nodes. As I understand it, this should work as in previous versions:

    1. Nodes always boot to PXE first, then to disk
    2. If the node is to be imaged, the head node sends the node into PXE boot (WinPE).
    3. The OS is applied using dism.
    4. After the reboot, the server sends abortpxe to make the node boot from disk.
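
    The expected sequence above can be sketched as a small state machine. This is only an illustrative Python model of the behaviour described in the list; the names `NodeState`, `pxe_response` and `simulate_boots` are hypothetical and are not part of HPC Pack.

```python
from enum import Enum
from typing import List


class NodeState(Enum):
    NEEDS_IMAGE = "needs_image"  # template assigned, OS not yet applied
    IMAGED = "imaged"            # OS has been applied to the disk


def pxe_response(state: NodeState) -> str:
    """Hypothetical head-node answer to a single PXE request."""
    if state is NodeState.NEEDS_IMAGE:
        return "winpe"      # step 2: boot the node into WinPE for imaging
    return "abortpxe"       # step 4: let the node fall through to disk


def simulate_boots(boots: int) -> List[str]:
    """Return the head-node answers for a number of consecutive PXE boots."""
    state = NodeState.NEEDS_IMAGE
    responses = []
    for _ in range(boots):
        answer = pxe_response(state)
        responses.append(answer)
        if answer == "winpe":
            # Step 3: the OS is applied while in WinPE, so the next
            # PXE request should be answered with abortpxe.
            state = NodeState.IMAGED
    return responses


print(simulate_boots(3))
```

    In this model the node enters WinPE exactly once. The loop reported here corresponds to the state never advancing to `IMAGED`, so every PXE request is answered with `winpe` again.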

    However, my HPC Pack 2016 Update 1 installation reproducibly hangs in a loop of steps 1 to 3, i.e. the server always sends the nodes back to PXE. Eventually, the node template is disassociated after a series of failures. Has something changed in the deployment process?

    The provisioning log looks like:

    Time Message

    28.05.2018 14:13:11 Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
    28.05.2018 14:13:11 Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
    28.05.2018 14:13:11 Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
    28.05.2018 14:09:25 Installing Windows (Expected time: 30 minutes)
    28.05.2018 14:09:21 Customizing the Windows unattended installation script
    28.05.2018 14:09:14 Cleaning up WIM file
    28.05.2018 14:09:06 Extracting WIM C:\en_windows_server_2016_x64_dvd_9718492.WIM to C:\Install
    28.05.2018 14:09:01 Creating local directory for install media
    28.05.2018 14:08:42 Copying: Images\en_windows_server_2016_x64_dvd_9718492.WIM
    28.05.2018 14:08:36 Determines the installation disk using diskpart.
    28.05.2018 14:08:31 Copying: selectinstalldisk.bat
    28.05.2018 14:08:23 Configuring disk partitions
    28.05.2018 14:08:18 Copying: setup\diskpart.txt
    28.05.2018 14:08:08 Mounting the installation shared folder on the head node
    28.05.2018 14:06:59 Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
    28.05.2018 14:06:59 Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
    28.05.2018 14:06:59 Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
    28.05.2018 14:01:59 Waiting for node to boot into WINPE
    28.05.2018 14:01:59 Initiating configuration operations for template: Windows Server 2016 Vanilla Installation Template
    28.05.2018 14:01:59 Found an existing account in Active Directory: KESHIKI21
    28.05.2018 14:01:59 Searching for an existing account in Active Directory
    28.05.2018 14:01:59 Connecting to domain controller: xxx
    28.05.2018 14:01:59 Initiating provisioning operations for template: Windows Server 2016 Vanilla Installation Template
    28.05.2018 14:01:59 Associating template Windows Server 2016 Vanilla Installation Template with node XXX\KESHIKI21
    28.05.2018 14:01:59 Moving node XXX\KESHIKI21 from state Unknown to state Provisioning
    28.05.2018 14:01:58 Assigning template Windows Server 2016 Vanilla Installation Template to node KESHIKI21
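
    A log like the one above can be checked for the loop mechanically. The following is a minimal Python sketch, assuming the log has been copied to plain text in the `DD.MM.YYYY HH:MM:SS message` layout shown here; the function names are hypothetical and the script does not use any HPC Pack API.

```python
from datetime import datetime

WINPE_MSG = "Sending PXE command to boot node to WINPE"
INSTALL_MSG = "Installing Windows"


def parse_log(text):
    """Return (timestamp, message) pairs sorted oldest first."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue  # blank lines and the "Time Message" header
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1],
                                      "%d.%m.%Y %H:%M:%S")
        except ValueError:
            continue  # not a log entry
        entries.append((stamp, parts[2]))
    # The log is shown newest first, so sort by time.
    return sorted(entries, key=lambda entry: entry[0])


def winpe_after_install(entries):
    """Return the time of the first WinPE command sent after the
    Windows installation step started, or None if there is none."""
    install_seen = False
    for stamp, message in entries:
        if message.startswith(INSTALL_MSG):
            install_seen = True
        elif install_seen and message.startswith(WINPE_MSG):
            return stamp
    return None


sample = """
28.05.2018 14:13:11 Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
28.05.2018 14:09:25 Installing Windows (Expected time: 30 minutes)
28.05.2018 14:06:59 Sending PXE command to boot node to WINPE (Expected boot time: 5-15 minutes)
"""
print(winpe_after_install(parse_log(sample)))
```

    For the excerpt used as `sample`, the function reports the 14:13:11 entry: a WinPE boot command sent after the installation step had already started, which is the symptom described in this thread.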

    I reckon that the multiple PXE commands come from the Private network and from the dual-port IB adapter. I am not totally sure, but I think this was the same with 2008 R2.

    I have a redundant head node with a three-node SF cluster. The network topology is 3 (private + application isolated). The installation share is on the first head node.

    I can complete the installation by manually forcing the node to boot from disk; however, this is not practical for a cluster.

    Best regards,
    Christoph

    Monday, May 28, 2018 12:22 PM

All replies

  • Just to pre-empt the question: the selectinstalldisk.bat step forces the installation to use the disk that diskpart designates as the system disk (select disk system). Our nodes have a data disk in addition to the OS disk, and under certain conditions that I could not isolate, the wrong disk was chosen for installation. However, the problem is not related to this step; it also occurs with the plain vanilla task sequence shipped with HPC Pack 2016 Update 1.

    Best regards,
    Christoph

    Monday, May 28, 2018 12:32 PM
  • Please apply the QFE for HPC Pack 2016 Update 1, available here: https://www.microsoft.com/en-us/download/details.aspx?id=56964

    It includes the following fix:

    - Fix issue that compute nodes may enter into WinPE many times during bare metal deployment

    Qiufang Shi

    Tuesday, May 29, 2018 12:12 AM