I have a SOA service which works well with small amounts of data. I recently began to test it with large amounts of data, and when I do so, after 15 minutes it slows down, and after 30 minutes it stops completely.
I have enabled the SOATRACE on the broker machine, according to http://blogs.technet.com/b/windowshpc/archive/2011/07/28/enabling-tracing-for-hpc-soa-applications.aspx and when I view the log I can see that after 15 minutes of the job running, I get the message "[ServiceJobMonitor] Timeout to receive scheduler delegation event." After this, it seems that the broker re-registers the job, but no real work happens because I get back only a few returned responses when I should expect a lot more. After another 15 minutes, I get the message "[BrokerLauncher] Close: SessionId = 5701" and shortly after, in the cluster manager, I can see that the job is set to Finished.
I had a look at the decompiled code of Microsoft.Hpc.ServiceBroker, and while there is a very good chance I don't comprehend what I am looking at, schedulerNotifyTimeoutManager has a timeout hard-coded to 15 minutes, and the timeout only gets reset when a task state or job state changes. I have a long-running SOA job, and the job and its tasks should stay running for days at a time.
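The behaviour described above can be modelled as a watchdog timer that is reset only by state-change events. The following is a minimal Python sketch of that idea, not the actual Microsoft.Hpc.ServiceBroker code: the names `StateChangeWatchdog` and `first_timeout` are hypothetical, and the only value taken from the post is the 15-minute (900-second) timeout. It shows why a job whose tasks start once and then simply keep running would trip such a timer about 15 minutes after the last state change.

```python
class StateChangeWatchdog:
    """Hypothetical model of a timer that only state changes can reset."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_reset = 0.0

    def on_state_change(self, now):
        # A task or job state change restarts the countdown.
        self.last_reset = now

    def has_timed_out(self, now):
        return now - self.last_reset >= self.timeout_s


def first_timeout(state_change_times, timeout_s, horizon_s, step_s=60.0):
    """Return the first simulated time (in seconds) at which the watchdog
    fires, or None if it never fires within horizon_s."""
    watchdog = StateChangeWatchdog(timeout_s)
    pending = sorted(state_change_times)
    now = 0.0
    while now <= horizon_s:
        # Deliver every state change that has happened up to this tick.
        while pending and pending[0] <= now:
            watchdog.on_state_change(pending.pop(0))
        if watchdog.has_timed_out(now):
            return now
        now += step_s
    return None


# Tasks start within the first minute, then run with no further state changes.
long_running = first_timeout([0, 30, 60], timeout_s=900, horizon_s=3600)

# Some task changes state every 10 minutes, so the timer keeps being reset.
churning = first_timeout(range(0, 3601, 600), timeout_s=900, horizon_s=3600)
```

In this model the long-running job trips the timer 900 seconds after its last state change, while the job with regular state changes never does. Whether the real broker treats the timeout as harmful is a separate question that the sketch does not answer.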
Any help is very much appreciated.
It would help if you could share the complete traces so that we can investigate. A job can end up in the Finished state either because the client called Session.Close or because of a broker timeout.
I've exported the trace output to an Event Viewer log, hosted on SkyDrive: SOATrace_Session5701.evtx. Let me know if there is anything else I can provide.
- Edited by krolley, September 6, 2011, 6:53
I took a look at the log, and would like to confirm:
1. [ServiceJobMonitor] Timeout to receive scheduler delegation event. ---> Is the head node heavily loaded? Could you try restarting the HpcScheduler service to see whether the problem still reproduces?
2. [BrokerLauncher] Close: SessionId = 5701 ---> This means that client code called Session.Close. I cannot find anything in the trace log indicating that SessionIdleTimeout was triggered, so the SOA job was finished by client code. Can you check whether your code calls Session.Close somewhere unexpectedly? (Note: if you are using an interactive session created by Session.Create and do not set session.AutoClose=false, the session is closed by session.Dispose when execution leaves the using block, or when the session object is garbage-collected.)
Thanks for taking a look at the log. I found out that:
- Restarting the HPC Scheduler service did not have any effect. I still always receive the "Timeout to receive scheduler delegation event" error, every 15 minutes. I'm not sure why, but it doesn't seem to have a negative effect. I thought it was the reason why my job was never completing, but now I don't think so.
- The session closed after 30 minutes because the receiveTimeout of the binding was set to 30 minutes. After increasing it, the job continues to run past 30 minutes.
The bigger problem I am now having came up during performance testing. I installed HPC Pack 2008 R2 SP2 on some bare-metal machines (all my other compute nodes are VMs). When I run the same calculation over and over in one job, the time taken to calculate each response keeps increasing, and the CPU utilisation of the HpcServiceHost process starts quite high but dwindles to very low numbers. Is there any way to profile a running HpcServiceHost, or otherwise get some information about why it might be slowing down when running the same calculation repeatedly?