Multiple instances of taskeng.exe on compute nodes

  • Question

  • Hello,

    I have a Windows HPC Server 2008 based cluster. Recently, when I looked at the process list on one of the compute nodes, I found about 200 instances of the taskeng.exe process, each using about 1 MB of RAM and probably affecting node performance. After some research I found that the cause is the scheduled task RACAgent (the "Microsoft Reliability Analysis task"), which is scheduled to start every hour but fails to start; after each failure the taskeng.exe process for some reason remains in the process list. Furthermore, when the task fails it tries to restart itself and fails again, spawning more and more taskeng.exe processes. In the end I found up to 250 taskeng.exe processes on each node and killed them with pskill.

    Interestingly, I have another cluster that is basically the same, but on it the RACAgent task never fails, so no instances of taskeng.exe are created.

    The failures are logged with the following error codes:
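
    For anyone who wants to check their own nodes, a rough Python sketch for counting the leftover processes is shown here. It assumes the CSV output produced by `tasklist /FO CSV /NH`; the function names `count_instances` and `query_tasklist` are made up for illustration and are not part of any HPC tooling.

```python
import csv
import io
import subprocess


def count_instances(tasklist_csv: str, image_name: str = "taskeng.exe") -> int:
    """Count rows in tasklist CSV output whose image name matches."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    wanted = image_name.lower()
    # The first CSV column is the image name; non-matching lines
    # (for example the "INFO: No tasks..." message) are ignored.
    return sum(1 for row in rows if row and row[0].lower() == wanted)


def query_tasklist(image_name: str = "taskeng.exe") -> str:
    """Run tasklist on the local Windows machine and return its CSV output."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

    On a node, `count_instances(query_tasklist())` would give the number of taskeng.exe processes currently running.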
    Task Scheduler failed to start "\Microsoft\Windows\RAC\RACAgent" task for user "NT AUTHORITY\LocalService". Additional Data: Error Value: 2147549186.
    Task Scheduler failed to start "\Microsoft\Windows\RAC\RACAgent" task for user "NT AUTHORITY\LocalService". Additional Data: Error Value: 2147942658.
    Task Scheduler failed to start "\Microsoft\Windows\RAC\RACAgent" task for user "NT AUTHORITY\LOCAL SERVICE". Additional Data: Error Value: 2148007944.

    The problem with taskeng.exe is not specific to HPC Server 2008. Windows Vista and Windows Server 2008 also have it, but for a different reason (User Feed Synchronization, http://social.technet.microsoft.com/Forums/en-US/itprovistasp/thread/890865b9-32af-4799-bc20-6b8f67bdbb11/ ). For HPC Server, however, it matters a great deal because of the possible performance impact.

    1. The simple solution is to disable the RACAgent scheduled task on the problematic cluster nodes. How will that affect my HPC Server functionality?
    2. What could be causing these RACAgent failures, and how can I fix them?

    Monday, February 15, 2010 10:52 AM


  • Hello Nikita,

    There are many reasons why the RACAgent task could be failing, and additional troubleshooting would be required to narrow it down to the particular task(s). The errors posted appear to point to resource depletion.

    err 2147549186
      RPC_E_CALL_CANCELED                                            winerror.h
    # Call was canceled by the message filter.

      ERROR_FILE_NOT_FOUND                                           winerror.h
    # The system cannot find the file specified.

    err 2147942658
      WAIT_TIMEOUT                                                   winerror.h
    # The wait operation timed out.

    err 2148007944
      CO_E_SERVER_STOPPING                                           winerror.h
    # Object server is stopping when OLE service contacts it

      ERROR_NOT_ENOUGH_MEMORY                                        winerror.h
    # Not enough storage is available to process this command.
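
    The listing above pairs each value with both a full HRESULT name and a plain Win32 name because the low 16 bits of an HRESULT can also be read as a Win32 error code. A minimal Python sketch of that split is below; it assumes the standard HRESULT bit layout (failure bit 31, facility in bits 16-26, code in bits 0-15), and the helper name `decode_hresult` and the small facility table are illustrative only.

```python
# Facility numbers as defined in winerror.h (only the ones seen in this thread).
FACILITY_NAMES = {1: "FACILITY_RPC", 7: "FACILITY_WIN32", 8: "FACILITY_WINDOWS"}


def decode_hresult(value: int) -> dict:
    """Split a decimal error value from the event log into HRESULT fields."""
    value &= 0xFFFFFFFF                    # treat as unsigned 32-bit
    facility = (value >> 16) & 0x7FF       # bits 16-26
    return {
        "hex": f"0x{value:08X}",
        "failure": bool(value >> 31),      # severity bit
        "facility": facility,
        "facility_name": FACILITY_NAMES.get(facility, "unknown"),
        "code": value & 0xFFFF,            # low word, also readable as a Win32 code
    }


for error_value in (2147549186, 2147942658, 2148007944):
    print(error_value, decode_hresult(error_value))
```

    For the three values posted, this gives 0x80010002 (code 2), 0x80070102 (code 258) and 0x80080008 (code 8), which matches the names listed above.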

    Whether the task is needed depends on what you use it for. By default Windows creates RacTask (Microsoft Reliability Analysis task to process system reliability data). This task provides event information for the RAC operational events, which are used to calculate the System Stability Index. The System Stability Index is a number from 1 (least stable) to 10 (most stable) and is a weighted measurement derived from the number of specified failures seen over a rolling historical period. Reliability Events in the System Stability Report describe the specific failures. Customers typically collect this data so they can determine how reliable the server is.

    It is not necessarily needed for HPC to function, but additional investigation would be required to determine why it is failing. I recommend opening a case with product support to investigate the issue further. Since the failure is common to most nodes, I assume something may be interfering with the task, such as antivirus software or filter drivers.
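
    If you decide to disable the task on the affected nodes in the meantime, something along these lines could be used. This is only a sketch: it builds a `schtasks /Change ... /DISABLE` command line for the task path from the error messages, the helper names are hypothetical, and `NODE01` is a placeholder node name. Run it from an elevated prompt and test on a single node first.

```python
import subprocess
from typing import List, Optional

# Task path taken from the event log messages in the question.
TASK_NAME = r"\Microsoft\Windows\RAC\RACAgent"


def build_disable_command(node: Optional[str] = None,
                          task_name: str = TASK_NAME) -> List[str]:
    """Build the schtasks command that disables the scheduled task."""
    cmd = ["schtasks", "/Change", "/TN", task_name, "/DISABLE"]
    if node:
        # /S targets a remote computer instead of the local one.
        cmd += ["/S", node]
    return cmd


def disable_task(node: Optional[str] = None) -> None:
    """Run the command; raises CalledProcessError if schtasks fails."""
    subprocess.run(build_disable_command(node), check=True)
```

    For example, `disable_task("NODE01")` would attempt to disable RACAgent on the node named NODE01, and `disable_task()` on the local machine.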


    • Proposed as answer by Don Pattee Friday, February 19, 2010 8:23 PM
    • Marked as answer by Nikita Tropin Friday, March 5, 2010 5:28 AM
    Wednesday, February 17, 2010 3:56 PM