HPCPack2012R2 - Imaging compute nodes - Manager Service unreachable

  • Question

  • Hi
    Not sure if this is the right way to get in touch. I have an MS support case (open a week now) but haven't got through to the product team yet.
    Our financial services company has been using HPC Pack for about 6 years, since Win2003, on physical hardware to run MATLAB simulations on financial data. Our researchers love it! I'm replacing our aging HPC hardware with IaaS (Amazon Web Services). I can join compute nodes to the head node if I install the HPC Pack compute node software after OS deployment. But if I clone compute nodes from a sysprepped image with HPC Pack preinstalled, the process fails after applying the node template. The compute node goes into an error state with HPC Node Manager Service unreachable and
    System.Runtime.Remoting.RemotingException: An error occurred while processing the request on the server: System.Runtime.Remoting.RemotingException: User identity is not authorized to connect to this endpoint
    I would love to get past this problem so I can quickly spin up compute nodes on demand. Let me know if this is something you've seen before. I plan to repurpose the Auto Grow and Shrink Azure Nodes scripts to use AWS instead of Azure (happy to share my HPC on AWS discoveries with you guys).
    Regards
    Alex
    Tuesday, August 25, 2015 8:47 PM

All replies

  • Hi Alex,

      What version of HPC Pack are you using?

      The error "User identity is not authorized to connect to this endpoint" means there is a trust issue between the head node and the compute nodes when they communicate. We need the exact steps you use for the clone and the compute node deployment.
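
      One generic place to start when a cloned machine shows a trust problem is its secure channel to the domain. The sketch below is only an assumption about where to look, not a diagnosis of this error: it uses the built-in `Test-ComputerSecureChannel` cmdlet, and `Test-NodeDomainTrust` is a hypothetical helper name. A healthy result rules out a broken machine account but says nothing about HPC Pack's own authorization.

```powershell
# Hypothetical helper: reports whether this machine's secure channel to its
# domain is healthy. Run it locally on the cloned compute node.
function Test-NodeDomainTrust {
    try {
        # Built-in cmdlet; returns $true when the machine account trust is intact.
        return [bool](Test-ComputerSecureChannel -ErrorAction Stop)
    }
    catch {
        # Not domain-joined, no domain controller reachable, or cmdlet unavailable.
        Write-Warning "Secure channel check failed: $($_.Exception.Message)"
        return $false
    }
}

Test-NodeDomainTrust
```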

        We now have built-in auto grow and shrink support in the system, without a script. We would be happy to learn about your discoveries. If possible, please contact hpcpack@microsoft.com


    Qiufang Shi

    Wednesday, August 26, 2015 2:06 AM
  • We're using HPC Pack 2012R2 Update2. The installation source is HPC2012R2_Update2_Full.zip from https://www.microsoft.com/en-us/download/details.aspx?id=47755

    Here are the steps we're using to do the clone and compute node deployment:

    1. Create VM from latest Win2012R2 machine image in the image marketplace
    2. Join VM to domain (compute node software will not install on a machine in a workgroup)
    3. Push HPC Pack software onto \\<NewComputeNodeReferenceVM>\C$\repo\packages
    4. Install HPC Pack using PowerShell remoting with our domain credentials (setup.exe must be run with a domain account)

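    [Scriptblock]$sb = {
    # Run setup from the package folder pushed in step 3
    Set-Location "C:\Repo\Packages\"
    # PowerShell only runs an executable from the current folder with the .\ prefix
    .\setup.exe -unattend -computenode:scheduler.companyname.com
    return $LASTEXITCODE
    }

    $exitCode = Invoke-Command -ComputerName $computerName -Authentication NegotiateWithImplicitCredential -ScriptBlock $sb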

    5. No extra steps are taken to remove uniqueness from HPC Pack before imaging. I was considering stopping the services and deleting the following registry values?
    Remove-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\HPC -Name NodeId
    Remove-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\HPC -Name ActiveRole
    Remove-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\HPC -Name InstalledRole
    Remove-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\HPC\Monitoring -Name NodeId

    6. Configure sysprep.xml to make imaged machines join the domain at startup (specialize pass, Microsoft-Windows-UnattendedJoin, domain, OU, credentials)

    7. Sysprep and shutdown (generalize stage of sysprep)

    8. Wait for VM to shut down

    9. Snapshot compute node reference VM

    10. Create compute nodes from reference VM snapshot

    11. Specialize phase of sysprep runs. New compute node VMs get a new hostname and join the domain

    12. New compute nodes appear in HPC scheduler as unassigned

    13. Apply node template (only step is activate OS)

    14. Template applies OK

    15. After about 30 seconds the compute node goes into error state. Management service unavailable.
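
    The cleanup considered in step 5 could be sketched roughly as below. This is an untested idea, not a confirmed fix: the value names are the ones listed in step 5, the `Hpc*` service wildcard is an assumption about how the HPC Pack services are named, and `Clear-HpcNodeIdentity` is a hypothetical helper name.

```powershell
# Hypothetical helper: stops the HPC services and removes the per-node registry
# values from step 5, so a cloned image does not carry the reference VM's identity.
function Clear-HpcNodeIdentity {
    [CmdletBinding(SupportsShouldProcess = $true)]
    param(
        # Registry root from step 5; the HKLM: drive prefix is required in PowerShell.
        [string]$HpcKey = 'HKLM:\SOFTWARE\Microsoft\HPC'
    )

    # Assumption: all HPC Pack service names start with "Hpc".
    Get-Service -Name 'Hpc*' -ErrorAction SilentlyContinue |
        Where-Object { $_.Status -ne 'Stopped' } |
        Stop-Service -Force

    $targets = @(
        @{ Path = $HpcKey;              Name = 'NodeId' },
        @{ Path = $HpcKey;              Name = 'ActiveRole' },
        @{ Path = $HpcKey;              Name = 'InstalledRole' },
        @{ Path = "$HpcKey\Monitoring"; Name = 'NodeId' }
    )

    foreach ($t in $targets) {
        # Only remove values that are actually present, so reruns do not error out.
        if (Get-ItemProperty -Path $t.Path -Name $t.Name -ErrorAction SilentlyContinue) {
            Remove-ItemProperty -Path $t.Path -Name $t.Name
        }
    }
}
```

    Running `Clear-HpcNodeIdentity -WhatIf` on the reference VM before step 7 lists what would be stopped and removed without changing anything.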

    Wednesday, August 26, 2015 7:19 AM
  • Thanks, Alex, for the detailed steps. I checked, and these should be the correct steps (the only difference from what we do: you configure the domain join at startup, while we do it through remote PowerShell).

    Is your environment in Azure? If so, you can share it with us and we can take a look. If it is in your on-premises environment, you can share the logs with us (located at %CCP_DATA%LogFiles\): the scheduler and node manager logs on both the head node and the compute node (I suppose the management service doesn't report an error state, as you can bring the nodes online successfully).
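
    Gathering those logs could look something like the sketch below, run once on the head node and once on the compute node. `Export-HpcLogs` and the output file name are hypothetical; only the `%CCP_DATA%LogFiles\` location comes from the reply above.

```powershell
# Hypothetical helper: zips the HPC Pack log folder so it can be shared.
function Export-HpcLogs {
    param(
        # CCP_DATA is the environment variable referenced above.
        [string]$LogRoot = (Join-Path $env:CCP_DATA 'LogFiles'),
        # Assumed output location and naming; adjust as needed.
        [string]$Destination = (Join-Path $env:TEMP ("HpcLogs_{0}_{1:yyyyMMddHHmmss}.zip" -f $env:COMPUTERNAME, (Get-Date)))
    )

    if (-not (Test-Path -Path $LogRoot -PathType Container)) {
        throw "Log folder not found: $LogRoot"
    }

    # ZipFile is part of .NET 4.5, so this does not depend on Compress-Archive.
    Add-Type -AssemblyName System.IO.Compression.FileSystem
    [System.IO.Compression.ZipFile]::CreateFromDirectory($LogRoot, $Destination)

    return $Destination
}
```

    Log files that a running service holds open exclusively may make the archive step fail; copying the folder elsewhere first is one way around that.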

    And what network environment are you using? VLAN?

    Thanks,

    Qiufang


    Qiufang Shi

    Wednesday, August 26, 2015 8:45 AM