Can't provision compute nodes in HPC 2008

  • Question

  • I am using the Windows Server Beta distribution dated 3/4/2008 and the HPC version with the same date. After I have created the template, loaded and chosen the OS image, and entered the Product Key, I boot the compute node, as requested, and watch its boot output. It never receives any DHCP offers and eventually times out with "Boot Failed" and "Press Any Key". Meanwhile, on the head node, the compute node's name appears in the provisioning window with a blank check box, as if things were going normally. If I check the box and choose "Provision", the utility registers the compute node with Active Directory, then indicates it is assigning the template to the compute node and just hangs there. That is not surprising, since the compute node has long since given up listening for any PXE boot invitations on the network. If I reboot the compute node at this point, it still doesn't receive any invitations, so it eventually times out again and indicates "Boot Failed". The "Getting Started" Guide seems to indicate that there should be an initial PXE boot which then pauses and awaits further communications from the head node, which makes sense, but that is not happening. Let me point out that we also run a Windows 2003 CCS lab on these same platforms and have no problems getting a PXE boot and RIS image installation.

     

    Any assistance would be much appreciated!

    • Moved by Josh Barnard Thursday, March 26, 2009 12:39 AM (Moved from Windows HPC Server Developers - General to Windows HPC Server Deployment, Management, and Administration)
    Tuesday, April 1, 2008 9:10 PM

Answers

  • Thank you for the feedback, David.

    It does seem like a bug; you shouldn't have to reboot.

    We'll enter the issue in our database and look out for a repro. I have heard from other people who have seen this in previous builds, but we don't have a repro in more recent builds.

    Please let us know if you see this again, or have trouble deploying the rest of your cluster.

    Thanks

    -parmita

    • Marked as answer by Josh Barnard Thursday, March 26, 2009 12:38 AM
    Wednesday, April 9, 2008 9:38 PM
    Moderator

All replies

  • It seems like the compute node isn't receiving DHCP offers or PXE responses.

    Could you provide the logs (%SystemDrive%\program files\microsoft hpc pack\data\logfiles)?

    Answers to the following questions will help troubleshoot this:

    1) Did you configure DHCP in the network wizard?

    2) What are your network settings? (It would help if you could post the network summary page here.)

    3) Can you go to the Windows DHCP admin console and check whether:

    A) the service is active/healthy;

    B) there is a valid scope;

    C) any IPs have been given out to the compute nodes?

    4) Does your domain use IPsec? If so, you might need to make your head node a boundary server.
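
One way to test independently whether anything on the provisioning network answers a PXE-style DHCP request is to send a DHCPDISCOVER by hand and look for a DHCPOFFER. The following is a minimal Python sketch, not an HPC Pack tool. The function names (`build_pxe_discover`, `parse_message_type`, `probe`) are hypothetical, and the packet carries only a bare "PXEClient" vendor class, whereas a real PXE ROM sends additional options, so a server may treat it differently from an actual compute node.

```python
import random
import socket
import struct

DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"
DHCPDISCOVER = 1
DHCPOFFER = 2


def build_pxe_discover(mac: bytes, xid: int) -> bytes:
    """Build a broadcast DHCPDISCOVER carrying a 'PXEClient' vendor class."""
    if len(mac) != 6:
        raise ValueError("mac must be exactly 6 bytes")
    # Fixed 236-byte BOOTP header; short byte strings are zero-padded by struct.
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,        # op: BOOTREQUEST
        1,        # htype: Ethernet
        6,        # hlen: MAC address length
        0,        # hops
        xid,      # transaction id
        0,        # secs
        0x8000,   # flags: ask the server to broadcast its reply
        b"", b"", b"", b"",  # ciaddr, yiaddr, siaddr, giaddr
        mac,      # chaddr
        b"",      # sname
        b"",      # file
    )
    vendor_class = b"PXEClient"  # simplified; real clients append arch/UNDI info
    options = (
        bytes([53, 1, DHCPDISCOVER])                    # option 53: message type
        + bytes([60, len(vendor_class)]) + vendor_class  # option 60: vendor class
        + bytes([255])                                   # end of options
    )
    return header + DHCP_MAGIC_COOKIE + options


def parse_message_type(packet: bytes):
    """Return the value of DHCP option 53, or None if it is absent or malformed."""
    if len(packet) < 240 or packet[236:240] != DHCP_MAGIC_COOKIE:
        return None
    i = 240
    while i < len(packet):
        code = packet[i]
        if code == 255:      # end option
            break
        if code == 0:        # pad option has no length byte
            i += 1
            continue
        if i + 1 >= len(packet):
            break
        length = packet[i + 1]
        if code == 53 and length == 1 and i + 2 < len(packet):
            return packet[i + 2]
        i += 2 + length
    return None


def probe(mac: bytes, timeout: float = 5.0):
    """Send one DISCOVER and return (server_ip, message_type) for the first
    matching reply, or None on timeout. Binding to port 68 usually requires
    administrator rights and can clash with a running DHCP client."""
    xid = random.getrandbits(32)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.settimeout(timeout)
        sock.bind(("", 68))
        sock.sendto(build_pxe_discover(mac, xid), ("255.255.255.255", 67))
        try:
            reply, server = sock.recvfrom(4096)
        except socket.timeout:
            return None
    if len(reply) < 8 or struct.unpack("!I", reply[4:8])[0] != xid:
        return None  # reply belongs to some other client's transaction
    return server[0], parse_message_type(reply)
```

Calling `probe(bytes.fromhex("001122334455"))` from a machine on the same network segment (the MAC here is a placeholder) returns the responding server's address and message type if an offer arrives. A `None` result would be consistent with the "Boot Failed" timeout described above, though it does not by itself identify the cause.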

     

    Wednesday, April 2, 2008 4:22 PM
    Moderator
  • Parmita,

     

    Thanks for your response! I must apologize for my lack of admin skills; we are a UNIX shop, so Windows is very new to me. I should have mentioned that I have been playing with this for a day or so, restarting the head node and compute node at various times. I don't remember the sequence that allowed this, but at one point I opened the provisioning window and powered up the compute node. The compute node never appeared in the window as being ready for provisioning, but when I watched the compute node boot, it received the PXE boot and the provisioning proceeded without interruption (other than asking for the Product Key). So the problem does not appear to be in the DHCP setup or in IPsec. It simply appears that whatever mechanism is responsible for doing the initial PXE boot and then pausing the system there is not working. I must assume that the compute node appearing in the provisioning window as available means the head node was able to reach it at some level. As I mentioned, we use this exact same setup for a W2003 CCS lab and have no problems PXE booting the compute node and provisioning it with RIS.

     

    Looking at the admin console, DHCP is configured and on, with a valid scope. Windows Firewall is off. I'm not sure how to look at the IPs which have been given out, though. As for the log files, they are rather large and contain many identical lines which appear to be errors. Here is a summary:

     

    NodeManager:

     

    2008/04/02 12:27:12 [4][CcpNodeManager] [Error] Connection to Scheduler lost. Detected by heartbeat with error code The constructor to deserialize an object of type 'Microsoft.Hpc.Scheduler.Properties.SchedulerException' was not found.

    -----------------------------------------------------------------------------

    HpcSdm:

     

     at Microsoft.SystemDefinitionModel.Store.SdmSqlStore.ThrowStoreException(Exception e)
       at Microsoft.SystemDefinitionModel.Store.SdmSqlStore.UpdateChange(ChangeWriter change)
       at Microsoft.SystemDefinitionModel.Service.SdmStore.UpdateChange(ChangeData change)

    Inner exception:
    System.Data.SqlClient.SqlException: An instance in the change has been modified by another change.
       at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
       at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
       at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
       at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
       at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
       at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
       at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
       at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
       at Microsoft.SystemDefinitionModel.Store.StoreWriter.SaveDataWithTransaction(SqlConnection connection, SqlTransaction transaction)
       at Microsoft.SystemDefinitionModel.Store.ChangeWriter.SaveDataWithTransaction(SqlConnection connection, SqlTransaction transaction)
       at Microsoft.SystemDefinitionModel.Store.StoreWriter.SaveData(SqlConnection connection)
       at Microsoft.SystemDefinitionModel.Store.SdmSqlStore.UpdateChange(ChangeWriter change)

    2008/04/02 12:27:59 [4][Warn ][HpcSdm  ]  Its taking too long to persist the counters to the store

    ------------------------------------------------------------------------------------------

    HpcManagement:

     

     at Microsoft.ComputeCluster.Management.ClusterModel.PerformanceCounterCollector.GetNextValue()
       at Microsoft.ComputeCluster.Management.CounterCollectionManager.CollectCounters(Object sender, ElapsedEventArgs e)
    2008/04/02 12:25:32 [5][Error][HpcManagement]  Exception:
    System.ComponentModel.Win32Exception: Failed to read counter data
       at Microsoft.ComputeCluster.Management.Win32Helpers.PdhCounterCollector.GetValue(Int64 counterId)

    ---------------------------------------------------------------------------------------------

    CcpScheduler:

     

    2008/04/01 16:27:35 [4][Store] [Error] SqlException on command "SELECT Tasks_Main2.ParentJobID
    ,Tasks_Main2.ID
    ,Tasks_Main2.RequestCancel
    ,Tasks_Main2.State

    FROM Tasks_Main2

    WHERE Tasks_Main2.InstanceId>=@param0 AND Tasks_Main2.RequestCancel<>@param1 AND Tasks_Main2.State<>@param2". Error:A transport-level error has occurred when sending the request to the server. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.)
    2008/04/01 16:27:35 [4][RC] [Error] Unexpected error in event engine: A transport-level error has occurred when sending the request to the server. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.)
    2008/04/01 16:27:35 [15][Policy] [Error] Unexpected error in admin job scheduler: Cannot generate SSPI context.
    2008/04/01 16:27:35 [12][JV] [Error] Unexpected exception from validtor: System.Data.SqlClient.SqlException: Cannot generate SSPI context.
       at Microsoft.Hpc.Scheduler.Store.StoreServer.HandleException(Exception e)
       at Microsoft.Hpc.Scheduler.Store.StoreServer.RowEnum_GetRows(Int32 id, Int32 numberOfRows)
       at Microsoft.Hpc.Scheduler.Store.LocalRowEnumerator.GetRows2(Int32 numberOfRows)
       at Microsoft.Hpc.Scheduler.Store.RowEnumeratorEnumerator.MoveNext()
       at Microsoft.Hpc.Scheduler.JobValidatorNew.SingleThread.JobValidatorSingleThread.cancelJobs()
       at Microsoft.Hpc.Scheduler.JobValidatorNew.SingleThread.JobValidatorSingleThread.validateThreadMain()

    -------------------------------------------------------------------------------
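
Because the log files are large and mostly repeat the same entries, a short script can collapse them into distinct messages with counts before posting. This is a minimal Python sketch that assumes every entry begins with a timestamp and a bracketed thread number, as in the excerpts above. The `summarize` function is hypothetical and not part of HPC Pack, and lines without a timestamp (stack frames, separators) are simply skipped.

```python
import re
from collections import Counter

# Matches the leading "2008/04/02 12:27:59 [4]" part of a log entry.
_PREFIX = re.compile(r"^\s*\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\s+\[\d+\]\s*")


def summarize(lines):
    """Count log entries after stripping the timestamp and thread number.

    Returns a list of (message, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in lines:
        match = _PREFIX.match(line)
        if match:
            counts[line[match.end():].strip()] += 1
    return counts.most_common()
```

For example, `summarize(open(path, errors="replace"))` on one of the log files (with `path` set to that file) yields the distinct messages and how often each occurred.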

     

    The documentation in the Getting Started Guide does not match the sequence in the To Do list in the version I am installing, so I may be doing something out of sequence, but I don't think that is the problem.

     

    Hope this helps! ... Thanks!

     

    David Boise

     

    Wednesday, April 2, 2008 5:55 PM
  • Parmita,

     

    Well, I'm really confused now. After my reply I went into the lab and rebooted the head node. Since the provisioning of NODE01 was just sitting there at "assigning template", I deleted it, then tried provisioning again. As soon as I booted the compute node, it received the PXE boot and waited for further commands. The provisioning window now showed that "NODE02" was ready for provisioning, and when I hit the "Provision" button it started doing just that.

     

    Do I need to do a reboot of the head node after setting up the template and OS for provisioning?  It would appear that might do the trick but sounds like a bug to me.  I'm going to re-install the head node and start from scratch (again!) to see if I can figure out what I'm doing wrong (or what sequence will make things work). 

     

    Thanks again for your help!

     

    David Boise

    Wednesday, April 2, 2008 6:35 PM
  • Parmita,

     

    I have now been able to successfully provision my compute node! As I mentioned, I re-installed W2008 and HPC on the head node, went through the first 4 steps in the "To Do" list, then exited and rebooted the head node. After the reboot I opened the console and went to the provisioning step in the To Do list. As soon as I powered on the compute node, it saw the PXE boot, then paused waiting for commands. I verified it had now appeared in the provisioning window as a candidate, checked it and hit "Provision". It is now provisioning, so it appears the reboot after selecting the template and loading the OS did the trick. Not sure if anyone else has had this problem, but that appears to be a decent workaround for me.

     

    Let me know if this is a known issue or if I'm the only one who has seen it.

     

    Thanks,

     

    David Boise

     

    Wednesday, April 2, 2008 9:19 PM
  • Is there any alternative to rebooting? I'm seeing the same symptoms, and I need to provision some nodes, but I have a user logged in to the head node who's running jobs and getting work done....

    Thanks,
    -Luke
    Monday, August 17, 2009 10:26 PM