"Collecting cluster usage data" failed

    Question

  • Hi there,

    we have been getting Event ID 6100 entries in the Windows HPC Server logs for the last few weeks. They state: "The operation 'Collecting cluster usage data' failed to run correctly. The operation was initiated by the user SYSTEM."

    The operations log in the Cluster Manager console and the HPCManagement.log file also contain entries in this context:

    2009/07/30 11:44:15 [9][Error][Change  ]  Exception:
    System.NullReferenceException: Object reference not set to an instance of an object.
       at Microsoft.ComputeCluster.Management.ClusterModel.CollectSchedulerUsage.LogJobInformation(ISession session)
       at Microsoft.ComputeCluster.Management.ClusterModel.CollectSchedulerUsage.Execute(ISession session)
       at Microsoft.SystemDefinitionModel.ChangeAction.Execute(Session session)
    2009/07/30 11:44:15 [9][Error][Change  ]  Object reference not set to an instance of an object.
    2009/07/30 11:44:25 [5][Error][Change  ]  Exception:
    System.NullReferenceException: Object reference not set to an instance of an object.
       at Microsoft.ComputeCluster.Management.ClusterModel.CollectSchedulerUsage.LogJobInformation(ISession session)
       at Microsoft.ComputeCluster.Management.ClusterModel.CollectSchedulerUsage.Execute(ISession session)
       at Microsoft.SystemDefinitionModel.ChangeAction.Execute(Session session)
    2009/07/30 11:44:25 [5][Error][Change  ]  Object reference not set to an instance of an object.
    2009/07/30 11:44:35 [9][Error][Change  ]  Exception:
    System.NullReferenceException: Object reference not set to an instance of an object.
       at Microsoft.ComputeCluster.Management.ClusterModel.CollectSchedulerUsage.LogJobInformation(ISession session)
       at Microsoft.ComputeCluster.Management.ClusterModel.CollectSchedulerUsage.Execute(ISession session)
       at Microsoft.SystemDefinitionModel.ChangeAction.Execute(Session session)
    2009/07/30 11:44:35 [9][Error][Change  ]  Object reference not set to an instance of an object.
    2009/07/30 11:44:35 [9][Error][HpcManagement]  Event ChangeExecutionFailed: The operation 'Collecting cluster usage data.'  failed to run correctly. The operation was initiated by the user: SYSTEM. The operation can be identified by the GUID: becbfc85-8a05-4bcb-bdf7-7d1bf3c32b71. Using this GUID a log of the operation can be obtained from the hpc powershell command: Get-HpcOperation -id becbfc85-8a05-4bcb-bdf7-7d1bf3c32b71 | Get-HpcOperationLog


    Restarting the services or the whole head node did not solve the problem. The usage data in "Charts and reports" seems to be OK.
    I cannot find any additional information in the system logs. Any ideas what could be wrong or how to diagnose this problem?

    Best regards,
    Michael
    Thursday, July 30, 2009 11:40 AM

Answers

  • Charlie, thanks for your answer. In the meantime the problem has been solved. While looking into some other issues, we noticed that we had already deployed compute node (CN) images with Windows Server 2008 SP2 included, while the head node was still on SP1. After applying SP2 to the head node, the problem was gone.

    -Michael
    • Marked as answer by MWirtz Friday, September 11, 2009 8:11 AM
    Friday, September 11, 2009 8:10 AM

All replies

  • Can you run the PowerShell command that is mentioned in the event log entry (and repeated below) and post the output here?

    Get-HpcOperation -id becbfc85-8a05-4bcb-bdf7-7d1bf3c32b71 | Get-HpcOperationLog

    It might also help to turn on more logging for the management service: in the registry, navigate to:

    HKEY_LOCAL_MACHINE\Software\Microsoft\Hpc

    and set TraceLevel to 4. There is no need to restart the management service. Wait until the exception occurs and look at the lines that precede the exception report. They might shed some additional light on the failure (post them here as well).
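
For a long log, pulling out the lines that precede each exception can be scripted. The following is only a rough sketch in Python, assuming the entry layout seen in the excerpt above (timestamp, a bracketed number, level, source, message). The function name `context_before_errors` and the `before` parameter are made up for illustration, and the meaning of the bracketed number is not known here.

```python
import re
from collections import deque

# Matches entries shaped like the excerpt in this thread, e.g.
#   2009/07/30 11:44:15 [9][Error][Change  ]  Exception:
# The first bracketed number is captured as "tag"; its meaning is an assumption-free placeholder.
ENTRY = re.compile(
    r"^\s*(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<tag>\d+)\]\[(?P<level>\w+)\]\[(?P<source>[^\]]*)\]\s*(?P<msg>.*)$"
)


def context_before_errors(lines, before=5):
    """Yield (first_error_line, preceding_non_error_entries) for each run of Error entries.

    Hypothetical helper: 'before' is how many earlier entries to keep as context.
    """
    recent = deque(maxlen=before)  # rolling window of the latest non-error entries
    in_error_run = False
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match is None:
            # Continuation line (for example a stack trace frame); skip it.
            continue
        if match.group("level") == "Error":
            if not in_error_run:
                # Report only the first Error entry of a consecutive run.
                yield line.strip(), list(recent)
            in_error_run = True
        else:
            in_error_run = False
            recent.append(line.strip())
```

To use it, open the HPCManagement.log file from wherever it lives on the head node, pass the file object to `context_before_errors`, and print each error together with its context lines.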

    This routine is part of the Customer Experience Improvement Program, which is enabled in a number of Microsoft products. I'm speculating that you've enrolled. You can check this by running HpcClusterManager.exe, opening the Help menu and choosing "Customer Feedback Options". That brings up a dialog indicating whether you're participating or not. If you have enrolled, choose the "I don't want to join the program at this time" radio button and press the OK button. This should turn off the data collection, which is where I believe the problem lies.

    I'm going to see if I can repro it on my cluster. If not, we'd be interested in working with you to determine the cause.

    charlie

    Thursday, August 06, 2009 12:25 AM