Hello,
I have a cluster based on Windows HPC Server 2008. Recently, when I looked at the process list on one of the compute nodes, I found about 200 instances of the taskeng.exe process, each using about 1 MB of RAM and probably affecting node performance. After some research I found that the cause is the scheduled task RACAgent (the "Microsoft Reliability Analysis task"), which is scheduled to start every hour but fails to start; after each failure the taskeng.exe process for some reason remains in the process list. Furthermore, after a failure the task tries to restart, fails again, and spawns more and more taskeng.exe processes. In the end I found up to 250 taskeng.exe processes on each node and killed them with pskill.
Interestingly, I have another cluster that is basically the same, but on it the RACAgent task never fails, so no instances of taskeng.exe are created.
The failures are logged with the following error codes:
Task Scheduler failed to start "\Microsoft\Windows\RAC\RACAgent" task for user "NT AUTHORITY\LocalService". Additional Data: Error Value: 2147549186.
Task Scheduler failed to start "\Microsoft\Windows\RAC\RACAgent" task for user "NT AUTHORITY\LocalService". Additional Data: Error Value: 2147942658.
Task Scheduler failed to start "\Microsoft\Windows\RAC\RACAgent" task for user "NT AUTHORITY\LOCAL SERVICE". Additional Data: Error Value: 2148007944.
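For reference, Task Scheduler logs these values in decimal; read as 32-bit HRESULTs they are 0x80010002, 0x80070102 and 0x80080008. Below is a minimal Python sketch that does the conversion and splits each value into the standard severity, facility and code fields. The helper name decode_hresult is my own and not part of any Windows tool. As far as I can tell these values correspond to RPC_E_CALL_CANCELED, a wrapped Win32 WAIT_TIMEOUT (258) and CO_E_SERVER_STOPPING, but that mapping is my reading and should be checked against winerror.h.

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its standard bit fields."""
    value &= 0xFFFFFFFF                      # keep only the low 32 bits
    return {
        "hex": f"0x{value:08X}",
        "failure": bool(value >> 31),        # severity bit (1 = failure)
        "facility": (value >> 16) & 0x7FF,   # 11-bit facility field
        "code": value & 0xFFFF,              # low 16 bits: status code
    }


# Decimal "Error Value" entries copied from the Task Scheduler log above.
ERROR_VALUES = [2147549186, 2147942658, 2148007944]

for v in ERROR_VALUES:
    print(v, decode_hresult(v))
```

The facility numbers come out as 1 (RPC), 7 (Win32) and 8 (Windows), which at least suggests that the failures are in COM/RPC activation of the task rather than in the RAC task's own logic.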
The problem with taskeng.exe is not specific to HPC Server 2008; Windows Vista and Windows Server 2008 also have it, but for a different reason (User Feed Synchronization, see http://social.technet.microsoft.com/Forums/en-US/itprovistasp/thread/890865b9-32af-4799-bc20-6b8f67bdbb11/ ). For HPC Server, however, it is critical because of the possible performance impact.
1. The simple solution is to disable the RACAgent scheduled task on the problematic cluster nodes. How will this affect HPC Server functionality?
2. What could be causing these RACAgent failures, and how can I fix them?
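Regarding option 1, this is roughly how I would disable the task, sketched in Python. It only builds the schtasks command line and does not run anything. The function name build_disable_command and the node name NODE01 are hypothetical and used for illustration only; the task path is the one from the log entries above.

```python
# Task path as it appears in the Task Scheduler log entries.
TASK_NAME = r"\Microsoft\Windows\RAC\RACAgent"


def build_disable_command(node=None):
    """Return the schtasks argument list that disables the RACAgent task.

    If `node` is given, the command targets that remote machine via /S;
    otherwise it applies to the local machine.
    """
    cmd = ["schtasks", "/Change", "/TN", TASK_NAME, "/Disable"]
    if node:
        cmd += ["/S", node]
    return cmd


# Local node, then a hypothetical remote compute node.
print(" ".join(build_disable_command()))
print(" ".join(build_disable_command("NODE01")))
```

The resulting list could be passed to subprocess.run on a Windows machine with sufficient rights; I have not yet done this on the production nodes because I do not know the side effects, which is the point of question 1.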
Thanks.