Cannot reuse an existing HPC cluster

  • Question

  • Hello,

    After creating a cluster with the Microsoft HPC Pack IaaS Deployment script v. 4.5.2, we deploy files to it using the following PowerShell command:

    Invoke-Command -ConnectionUri $MyHeadNodeURI -Credential $MyHeadNodeCredentials -ScriptBlock $MyScriptBlock -ArgumentList $MyArgumentList

    This almost always works when the HPC cluster has just been created and is used for the first time. However, when we reuse the cluster to run another set of files, the above command almost always fails. These are typical error messages we receive in this case:

    Remoting data is missing InvocationInfo property.
        + CategoryInfo          : OperationStopped: (myheadnode.cloudapp.net:String) [], PSRemotingTransportException
        + FullyQualifiedErrorId : JobFailure
        + PSComputerName        : myheadnode.cloudapp.net
    Command has failed on node MyComputeNode-CN00. Message:Task failed during execution with exit code . Please check task's
    output for error details.
        + CategoryInfo          : NotSpecified: (Command has fai... error details.:String) [], RemoteException
        + FullyQualifiedErrorId : NativeCommandError
        + PSComputerName        : myheadnode.cloudapp.net
    [myheadnode.cloudapp.net] Connecting to remote server myheadnode.cloudapp.net failed with the following error message :
    WinRM cannot complete the operation. Verify that the specified computer name is valid, that the computer is accessible
    over the network, and that a firewall exception for the WinRM service is enabled and allows access from this computer.
    By default, the WinRM firewall exception for public profiles limits access to remote computers within the same local
    subnet. For more information, see the about_Remote_Troubleshooting Help topic.
        + CategoryInfo          : OpenError: (myheadnode.cloudapp.net:String) [], PSRemotingTransportException
        + FullyQualifiedErrorId : WinRMOperationTimeout,PSSessionStateBroken

    Restarting the head node virtual machine does not fix the problem. The only alternative we have found is to delete and recreate the cluster, which typically takes an hour.

    Any suggestion to work around this problem would be greatly appreciated.

    Thank you.


    • Edited by MarcSim Wednesday, March 22, 2017 5:29 PM
    Wednesday, March 22, 2017 5:26 PM

All replies

  • Hi Marc,

    Are you connecting to the head node or to a compute node? Some of the error messages appear to come from the head node, and some from a compute node. If you run the command from an on-premises client machine, you need to install a certificate in the Current User\Trusted Root Certification Authorities store. During deployment, the IaaS deployment script installs the certificate temporarily and removes it as soon as the deployment completes.

    You can use the following PowerShell script to fetch the certificate:

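    # Replace these placeholders with your cloud service name, head node VM name, and output path
    $ServiceName = "servicename"
    $VMName = "vmname"
    $cerSavePath = "d:\myheadnode.cer"
    # Look up the VM and the thumbprint of its default WinRM certificate
    $vm = Get-AzureVM -ServiceName $ServiceName -Name $VMName
    $winRmCertificateThumbprint = $vm.VM.DefaultWinRMCertificateThumbprint
    # Download the certificate from the cloud service
    $winRmCertificate = Get-AzureCertificate -ServiceName $ServiceName -Thumbprint $winRmCertificateThumbprint -ThumbprintAlgorithm sha1
    # Decode the Base64 certificate data and save it as a .cer file
    $certBytes = [System.Convert]::FromBase64String($winRmCertificate.Data)
    [IO.File]::WriteAllBytes($cerSavePath, $certBytes)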
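
    Once the .cer file is saved, it still has to be installed in the Current User\Trusted Root Certification Authorities store on the client machine. A minimal sketch of that step follows. It assumes the Import-Certificate cmdlet from the PKI module is available on the client (Windows 8 / Windows Server 2012 or later) and that the file path matches the $cerSavePath used above. The verification at the end is optional and only there for illustration.

    ```powershell
    # Assumption: same path the fetch script wrote the certificate to.
    $cerSavePath = "d:\myheadnode.cer"

    # Import into the current user's Trusted Root Certification Authorities store.
    # Windows may ask for confirmation before adding a root certificate.
    Import-Certificate -FilePath $cerSavePath -CertStoreLocation Cert:\CurrentUser\Root

    # Optional check: load the file and look for its thumbprint in the store.
    $cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2 $cerSavePath
    Get-ChildItem Cert:\CurrentUser\Root | Where-Object { $_.Thumbprint -eq $cert.Thumbprint }
    ```

    If the last command prints the certificate, the import succeeded and you can retry the Invoke-Command call from that client.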

    Thursday, March 23, 2017 3:31 AM