Upgrade to HPC Pack 2016 Update 1 fails because SF is screwed

    Question

  • After a direct installation of HPC Pack 2016 Update 1 failed due to weird errors in the Service Fabric (https://social.microsoft.com/Forums/en-US/b55dc78b-be84-490a-947f-a94431a4b575/hpc-pack-2016-installation-fails-when-creating-service-fabric-application?forum=windowshpcitpros#0b409d78-d42d-422e-956f-12d1ab70f69c), I installed HPC Pack 2016 RTM. Because the RTM lacks critical features like bare-metal deployment, I wanted to upgrade the cluster to Update 1. Following the instructions from https://technet.microsoft.com/en-us/library/mt829314(v=ws.11).aspx, I first upgraded the Service Fabric from the version in the HPC Pack package to 5.7.221.9494, then to 6.1.472.9494 (a direct upgrade to the latest version totally screwed the SF the last time). The final upgrade to 6.1.480.9494 yielded a weird error in "fabric:/System/UpgradeOrchestrationService" and was rolled back automatically.

    Therefore, I decided to stay with 6.1.472.9494 and continued by upgrading the secondary nodes. Although MigrateHaHN.ps1 ran without any error, it did not install anything. However, it deleted the existing configuration, making it impossible to run the script again. I therefore looked into the script, found that it basically just performs a prerequisite install with the existing certificate, and did this manually. While doing so, I found that the upgrade script looks for a property called "RuntimeShare", which does not exist in the XML backup; there, the property is called "RuntimeDataShare".

    With this fixed, I could piece together the command line for the last head node like

    .\setup.exe -unattend -keepdata -ClusterName:<name> -HeadNode -SSLThumbprint:<thumbprint> -HeadNodeList:"<node1>,<node2>,<node3>" -ServiceRegistrationShare:"<share1>" -SpoolDirShare:"<share2>" -DiagnosticsShare:"<share3>" -InstallShare:"<share4>"

    If I try to run this, nothing happens - setup.exe quickly shows up in the task manager and then presumably crashes. However, I cannot see any crashes related to setup.exe in the event viewer. What I can see are crashes of (of course) the service fabric:

    Application: FabricUOS.exe
    Framework Version: v4.0.30319
    Description: The application requested process termination through System.Environment.FailFast(string message).
    Message: RunAsync failed due to an unhandled exception causing the host process to crash: System.NullReferenceException: Object reference not set to an instance of an object.
       at Microsoft.ServiceFabric.ClusterManagementCommon.UserConfigVersion.Equals(Object other)
       at Microsoft.ServiceFabric.ClusterManagementCommon.ClusterUpgradeStateBase.IsUserInitiatedUpgrade()
       at Microsoft.ServiceFabric.ClusterManagementCommon.ClusterResourceStateMachine.TryInterruptPendingUpgrade()
       at System.Fabric.UpgradeOrchestration.Service.UpgradeOrchestrator.<StartUpgradeAsync>d__5.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at System.Fabric.UpgradeOrchestration.Service.FabricUpgradeOrchestrationService.<RunAsync>d__17.MoveNext()
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.<ExecuteRunAsync>d__22.MoveNext()
    Stack:
       at System.Environment.FailFast(System.String)
       at System.Threading.Tasks.Task.Execute()
       at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
       at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
       at System.Threading.Tasks.Task.ExecuteEntry(Boolean)
       at System.Threading.ThreadPoolWorkQueue.Dispatch()

    and

    Faulting application name: FabricUOS.exe, version: 6.1.472.9494, time stamp: 0x5a99d900
    Faulting module name: mscorlib.ni.dll, version: 4.7.2117.0, time stamp: 0x59cf513d
    Exception code: 0x80131623
    Fault offset: 0x000000000052d436
    Faulting process id: 0x2898
    Faulting application start time: 0x01d3cce4f7de8522
    Faulting application path: C:\ProgramData\SF\vesta1\Fabric\work\Applications\__FabricSystem_App4294967295\UOS.Code.Current\FabricUOS.exe
    Faulting module path: C:\windows\assembly\NativeImages_v4.0.30319_64\mscorlib\6b278bb41b219b5d3ea584606329e448\mscorlib.ni.dll
    Report Id: 72d84e52-69a6-4860-8e67-d43f25b99f46
    Faulting package full name:
    Faulting package-relative application ID:

    To be honest, I do not know whether this issue is related to the installation problem.

    My question is: how do I get this fixed to get the head node installed without re-installing the whole HPC cluster for the fourth time in two weeks? Furthermore, I need to keep the content of the database somehow.

    Thanks in advance,
    Christoph

    Thursday, 5 April 2018 1:57 PM

Answers

  • Just to answer this question: Sunbin found out that I forgot to specify the database connection strings in my command line. The full command for the upgrade install looks like

    .\setup.exe -unattend -keepdata -ClusterName:vesta -HeadNode -SSLThumbprint:... -HeadNodeList:"vesta1,vesta2,vesta3" -ServiceRegistrationShare:"..." -SpoolDirShare:"..." -RuntimeShare:"..." -DiagnosticsShare:"..." -InstallShare:"..." -DiagDbConStr:"..." -MgmtDbConStr:"..." -RptDbConStr:"..." -MonDbConStr:"..." -SchdDbConStr:"..."

    As the SF upgrade repeatedly failed, I uninstalled it and ran the standalone installation of the most recent SF (https://docs.microsoft.com/de-de/azure/service-fabric/service-fabric-cluster-creation-for-windows-server). After that, I could run the installation command above.
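
    To avoid forgetting the connection strings again, the command line can be assembled by a small helper. The following is only a sketch in Python: the flag names are copied from the command above, while the function names, the rule "several head nodes require all five connection strings" (inferred from the setup log message) and all sample values are assumptions for illustration, not documented behaviour of setup.exe.

```python
# Sketch only: flag names are taken from the command in this post; the
# validation rule is inferred from the setup log, not from documentation.
DB_FLAGS = ["DiagDbConStr", "MgmtDbConStr", "RptDbConStr", "MonDbConStr", "SchdDbConStr"]


def missing_db_flags(options):
    """Return the DB connection-string flags that are absent or empty."""
    return [flag for flag in DB_FLAGS if not options.get(flag)]


def build_setup_args(cluster_name, thumbprint, head_nodes, options):
    """Assemble the argument list for an unattended head node upgrade."""
    missing = missing_db_flags(options)
    if len(head_nodes) > 1 and missing:
        # With several head nodes there is no local database to fall back to.
        raise ValueError("missing connection strings: " + ", ".join(missing))
    args = [
        "-unattend",
        "-keepdata",
        f"-ClusterName:{cluster_name}",
        "-HeadNode",
        f"-SSLThumbprint:{thumbprint}",
        '-HeadNodeList:"' + ",".join(head_nodes) + '"',
    ]
    # Shares and connection strings are passed through as -Name:"value".
    args += [f'-{name}:"{value}"' for name, value in options.items()]
    return args


if __name__ == "__main__":
    # Hypothetical values; only the share flag name comes from the post.
    shares_only = {"SpoolDirShare": r"\\fileserver\spool"}
    print(missing_db_flags(shares_only))
```

    Calling build_setup_args with three head nodes and only the share options raises a ValueError naming the missing flags, which is the mistake I made in my first command line.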

    Monday, 23 April 2018 11:12 AM

All replies

  • In the meantime, I have also obtained some health data about SF, but I am not sure whether it is in an error state or in an OK state: all replicas are listed as OK, but the last state transition is to an error and two of the three replicas are marked as down:

    Get-ServiceFabricPartition fabric:/System/UpgradeOrchestrationService | Get-ServiceFabricPartitionHealth
    
    
    PartitionId           : 00000000-0000-0000-0000-000000006000
    AggregatedHealthState : Error
    UnhealthyEvaluations  :
                            Error event: SourceId='System.FM', Property='State'.
    
    ReplicaHealthStates   :
                            ReplicaId             : 131674131008963596
                            AggregatedHealthState : Ok
    
                            ReplicaId             : 131674087229785586
                            AggregatedHealthState : Ok
    
                            ReplicaId             : 131674129599795479
                            AggregatedHealthState : Ok
    
    HealthEvents          :
                            SourceId              : System.FM
                            Property              : State
                            HealthState           : Error
                            SequenceNumber        : 4094
                            SentAt                : 05.04.2018 15:31:15
                            ReceivedAt            : 05.04.2018 15:31:43
                            TTL                   : Infinite
                            Description           : Partition is in quorum loss.
                            UpgradeOrchestrationService 3 3 00000000-0000-0000-0000-000000006000
                              N/S Ready vesta1 131674129599795479
                              N/S Down vesta2 131674131008963596
                              N/P Down vesta3 131674087229785586
                              (Showing 3 out of 3 replicas. Total available replicas: 1)
    
                            For more information see: http://aka.ms/sfhealth
                            RemoveWhenExpired     : False
                            IsExpired             : False
                            Transitions           : Ok->Error = 05.04.2018 15:31:43, LastWarning = 01.01.0001 00:00:00
    
                            SourceId              : RunAsync
                            Property              : RunAsyncUnhandledException
                            HealthState           : Warning
                            SequenceNumber        : 131674158755926927
                            SentAt                : 05.04.2018 15:31:15
                            ReceivedAt            : 05.04.2018 15:32:06
                            TTL                   : 00:05:00
                            Description           : System.NullReferenceException: Object reference not set to an instance of an object.
                               at Microsoft.ServiceFabric.ClusterManagementCommon.UserConfigVersion.Equals(Object other)
                               at Microsoft.ServiceFabric.ClusterManagementCommon.ClusterUpgradeStateBase.IsUserInitiatedUpgrade()
                               at Microsoft.ServiceFabric.ClusterManagementCommon.ClusterResourceStateMachine.TryInterruptPendingUpgrade()
                               at System.Fabric.UpgradeOrchestration.Service.UpgradeOrchestrator.<StartUpgradeAsync>d__5.MoveNext()
                            --- End of stack trace from previous location where exception was thrown ---
                               at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
                               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
                               at System.Fabric.UpgradeOrchestration.Service.FabricUpgradeOrchestrationService.<RunAsync>d__17.MoveNext()
                            --- End of stack trace from previous location where exception was thrown ---
                               at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
                               at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
                               at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.<ExecuteRunAsync>d__22.MoveNext()
                            RemoveWhenExpired     : True
                            IsExpired             : False
                            Transitions           : Ok->Warning = 05.04.2018 15:31:51, LastError = 01.01.0001 00:00:00
    
    HealthStatistics      :
                            Replica               : 3 Ok, 0 Warning, 0 Error

    Get-ServiceFabricPartition fabric:/System/UpgradeOrchestrationService | Get-ServiceFabricReplica
    
    
    ReplicaId           : 131674087229785586
    ReplicaAddress      : {"Endpoints":{"":"70e56430c9df8af7f34a038baa482ae7:131674035396570890+00000000-0000-0000-0000-000000006000+131674087229785586+131674160360901730+"}}
    ReplicaRole         : Primary
    NodeName            : vesta3
    ReplicaStatus       : Down
    LastInBuildDuration : 00:00:03
    HealthState         : Ok
    
    ReplicaId           : 131674129599795479
    ReplicaAddress      : {"Endpoints":{"":"45559a80ed71a9a46036d80c3b1edc46:131674074345904267+00000000-0000-0000-0000-000000006000+131674129599795479+131674159371270250+"}}
    ReplicaRole         : ActiveSecondary
    NodeName            : vesta1
    ReplicaStatus       : Down
    LastInBuildDuration : 00:00:02
    HealthState         : Ok
    
    ReplicaId           : 131674131008963596
    ReplicaAddress      :
    ReplicaRole         : ActiveSecondary
    NodeName            : vesta2
    ReplicaStatus       : Ready
    LastInBuildDuration : 00:00:03
    HealthState         : Ok

    Get-ServiceFabricPartition fabric:/System/UpgradeOrchestrationService
    
    
    PartitionId            : 00000000-0000-0000-0000-000000006000
    PartitionKind          : Singleton
    PartitionStatus        : InQuorumLoss
    LastQuorumLossDuration : 00:01:03
    MinReplicaSetSize      : 3
    TargetReplicaSetSize   : 3
    HealthState            : Error
    DataLossNumber         : 131674128954434430
    ConfigurationNumber    : 5802500816896
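
    The InQuorumLoss status can be checked against the usual majority rule for replicated services. The following is only a sketch in Python, assuming that quorum means a majority of the replica set; the function names are made up for illustration, and the statuses are copied from the Get-ServiceFabricReplica output above.

```python
# Sketch only: assumes quorum is a simple majority of the replica set.
def write_quorum(replica_set_size):
    """Majority of the replica set: floor(n / 2) + 1."""
    return replica_set_size // 2 + 1


def in_quorum_loss(replica_statuses, replica_set_size):
    """True when fewer replicas are Ready than a majority requires."""
    ready = sum(1 for status in replica_statuses.values() if status == "Ready")
    return ready < write_quorum(replica_set_size)


# Node name -> ReplicaStatus for fabric:/System/UpgradeOrchestrationService.
replicas = {"vesta3": "Down", "vesta1": "Down", "vesta2": "Ready"}
print(write_quorum(3))              # 2
print(in_quorum_loss(replicas, 3))  # True
```

    Under that assumption, one Ready replica out of three is below the majority of two, which is consistent with the partition-level Error even though each replica reports HealthState Ok.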
    


    Thursday, 5 April 2018 3:35 PM
  • Hi Christoph,

    Could you share your setup log in C:\Windows\Temp\HPCSetupLogs to hpcpack@microsoft.com?

    Thanks.

    Sunday, 8 April 2018 10:48 AM
  • Hi Sunbin,

    I have done so. The mail includes the logs of the first try and the last one. I have also looked into the last one and found

    17:02:38.566 -  [CompleteAndCheckDbsInUnattendMode] Cannot specify local DB because there are multiple head nodes in this cluster.

    I assume that my hand-tailored command line is missing whatever tells the installer where to find the database ...

    Furthermore, I have created an issue for the SF problem at https://github.com/Azure/service-fabric-issues/issues/968. There, I got the reply that "we have had changes that prevent direct upgrade from 5.4 up", so the instructions at https://technet.microsoft.com/en-us/library/mt829314(v=ws.11).aspx ("Start a cluster upgrade to the latest version from the list (for example 6.0.232.9494).") cannot work with the SF installed with HPC Pack 2016 RTM.

    So I now think there are two separate problems: the first is that my SF is in a non-upgradable state and the second is the database. I think the latter is easier to fix, but I am not sure whether it makes sense to install HPC Pack into a broken SF...

    Best regards,
    Christoph

    Monday, 9 April 2018 8:14 AM