Trouble with nodegroups and HPC-Reporting RRS feed

  • Question

  • Hi there,

    We set up a cluster with HPC Server 2008 R2 SP2 some weeks ago. Now we're seeing that the Node Availability report does not show any node groups (not even the default ones such as ComputeNodes, HeadNodes, etc.), although we have configured 4 of them.

    Some digging through the HPC service event logs shows that lots of warnings with Event ID 10 are written to the Microsoft/HPC/Reporting/Operational log, stating:

    "The tag ComputeNodes;IH-IHT of node NODE01 cannot be found in node group list. Will retry in next reporting worker."

    Deleting/creating new nodegroups or adding/removing members does not seem to solve this.

    Has anyone else seen this, or does anyone have an idea how to troubleshoot it further?

    - Michael

     

    Thursday, November 3, 2011 3:54 PM

All replies

  •  

    Short update on this:

    I don't know if this is related, but every time I modify the node group membership or create/delete a node group, the following exception is logged in the Microsoft-HPC-Scheduler/Operational log:

     

    Log Name:      Microsoft-HPC-Scheduler/Operational
    Source:        Microsoft-HPC-Scheduler
    Date:          04.11.2011 17:33:58
    Event ID:      8
    Task Category: None
    Level:         Error
    Keywords:     
    User:          SYSTEM
    Computer:      WINHPC-HN.hpc.win.rz.rwth-aachen.de
    Description:
    An unexpected exception occurred. For more information about this exception, see the Details tab.
    
     Additional data:
     UnknownError
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-HPC-Scheduler" Guid="{5B169E40-A3C7-4419-A919-87CD93F2964D}" />
        <EventID>8</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2011-11-04T16:33:58.377237600Z" />
        <EventRecordID>34828</EventRecordID>
        <Correlation />
        <Execution ProcessID="1644" ThreadID="2376" />
        <Channel>Microsoft-HPC-Scheduler/Operational</Channel>
        <Computer>WINHPC-HN.hpc.win.rz.rwth-aachen.de</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="Message">UnknownError</Data>
        <Data Name="ExceptionString">Exception detail: Microsoft.Hpc.Scheduler.Properties.SchedulerException: UnknownError
       at Microsoft.Hpc.Scheduler.Store.CachedNodeQuery.QueryNodes(IEnumerable`1 constraints)
    Current stack:    at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
       at Microsoft.Hpc.Scheduler.Store.CachedNodeQuery.QueryNodes(IEnumerable`1 constraints)
       at Microsoft.Hpc.Scheduler.Store.PropHandlers.ComputedNodeList.GetPropFromQuery(QueryContextBase ctx, PropertyId pid, StoreProperty&amp; prop)
       at Microsoft.Hpc.Scheduler.Store.QueryContextBase.GetPropFromQuery(PropertyId pid)
       at Microsoft.Hpc.Scheduler.Store.QueryContextBase.GetRowFromQuery(PropertyId[] pids)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.GetObjectProps(QueryContextBase ctx, Int32 itemId, PropertyId[] ids)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal.Object_GetProps(ConnectionToken&amp; token, ObjectType obType, Int32 obId, PropertyId[] ids, StoreProperty[]&amp; props)
       at Microsoft.Hpc.Scheduler.Store.StoreServer.Object_GetProps(ObjectType obType, Int32 obId, PropertyId[] ids, StoreProperty[]&amp; props)
       at Microsoft.Hpc.Scheduler.Store.SchedulerStoreSvc.GetPropsFromServer(ObjectType obType, Int32 itemId, PropertyId[] propertyIds)
       at Microsoft.Hpc.Scheduler.Internal.SchedulerJobInternal.LoadProps(PropertyId[] propIds)
       at Microsoft.Hpc.Scheduler.ResourceController.JobMonitor.Run()
       at Microsoft.Hpc.Scheduler.ResourceController.MonitorThread.RunMonitors()
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
    </Data>
      </EventData>
    </Event>
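
    For anyone comparing many of these events (for example, between the production and the lab head node), the interesting fields can be pulled out of the exported Event XML with a few lines of Python. This is only a rough sketch: the function name summarize_event and the trimmed SAMPLE_EVENT string are made up for illustration, and it assumes the events have been exported as XML text like the block above.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows Event XML, as seen in the <Event> element above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event above (hypothetical sample for illustration).
SAMPLE_EVENT = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>8</EventID>
    <Level>2</Level>
    <TimeCreated SystemTime="2011-11-04T16:33:58.377237600Z" />
    <Channel>Microsoft-HPC-Scheduler/Operational</Channel>
  </System>
  <EventData>
    <Data Name="Message">UnknownError</Data>
    <Data Name="ExceptionString">Exception detail: Microsoft.Hpc.Scheduler.Properties.SchedulerException: UnknownError
   at Microsoft.Hpc.Scheduler.Store.CachedNodeQuery.QueryNodes(IEnumerable`1 constraints)
Current stack:    at Microsoft.Hpc.Scheduler.SchedulerTracing.TraceException(String facility, Exception exception)
   at Microsoft.Hpc.Scheduler.Store.CachedNodeQuery.QueryNodes(IEnumerable`1 constraints)
</Data>
  </EventData>
</Event>"""


def summarize_event(xml_text):
    """Return the event ID, channel, timestamp, message and first stack frame."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)

    # Collect the named <Data> elements into a dict keyed by their Name attribute.
    data = {
        d.get("Name"): (d.text or "")
        for d in root.findall("e:EventData/e:Data", NS)
    }

    # Stack frames are the lines that start with "at " once whitespace is stripped.
    frames = [
        line.strip()
        for line in data.get("ExceptionString", "").splitlines()
        if line.strip().startswith("at ")
    ]

    return {
        "event_id": int(system.findtext("e:EventID", namespaces=NS)),
        "channel": system.findtext("e:Channel", namespaces=NS),
        "time": system.find("e:TimeCreated", NS).get("SystemTime"),
        "message": data.get("Message"),
        "top_frame": frames[0] if frames else None,
    }


if __name__ == "__main__":
    for key, value in summarize_event(SAMPLE_EVENT).items():
        print(f"{key}: {value}")
```

    Grouping the results by top_frame would show whether every occurrence originates in the same call (here CachedNodeQuery.QueryNodes) or whether several different code paths are failing.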
    

    I don't know whether this is critical, since the node group modifications seem to be carried out successfully and the scheduler distributes jobs to the specified node groups correctly.

    We have another scheduler machine for testing purposes, and that one doesn't have these problems. The only difference I can see between the two systems is that our production system uses remote databases, while the lab machine has local databases.

    - Michael

     

     

    Friday, November 4, 2011 6:22 PM