High load issues

  • Question

  • Hi!

     

    We did a manual load test today, with lots of people at the office making calls at certain times. Our application responded well in the beginning, but then began to stutter and then stopped replying at all (from the event log: "The Telephony Application Proxy declined a call... ... application 'Test' at URL 'http://localhost/test/Test.speax' took longer than 4 seconds to respond."). After a few minutes it started to respond again.

     

    At the end, calls were accepted (workflow created, call answered) but then nothing happened: the application did not reach the first prompt, which comes directly after CallAnswered. The event log gives no clues. The tuning studio shows that the duration of the calls is zero and that the end status is SystemHangUp.

     

    When I restarted the SpeechServer, I got an error (from the event log: "Application Error: During shutdown of Telephony Application Host an instance of application 'test' at URL 'http://localhost/test/Test.speax' did not stop when told to do so.  There were no active TelephonySessions.").

    Does anybody have any clues?

     

    Thanks for your help!

    / Markus

    Wednesday, May 9, 2007 6:44 PM

Answers

  • I would suggest re-running your 16s workflow test.  When the response times start to get bad, attach a debugger and see what your threads are doing - I bet they're all blocked on something.  To my mind, increasing the maxWorkerThreads is a workaround not a fix, though I accept that it may be sufficient to achieve the scale required.
    Wednesday, May 16, 2007 12:29 PM

All replies

  • Just found an article (http://infosysblogs.com/microsoft/2006/11/post.html) about running workflows in ASP.NET, with some general performance guidelines regarding worker threads (set in machine.config).

     

    Has anybody else had performance problems? Do you use these settings (below)?

     

    Thanks!

    Markus

     

    Microsoft has recommended changing the default values of all of the above settings. See the following table that provides the new values.

    Configuration setting          Default value    Recommended value
    maxConnection                  2                12 * #CPUs
    maxIOThreads                   20               100
    maxWorkerThreads               20               100
    minFreeThreads                 8                88 * #CPUs
    minLocalRequestFreeThreads     4                76 * #CPUs
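
    For reference, here is a rough sketch of where the settings in the table above live in machine.config. It assumes a single-CPU machine, so the "* #CPUs" values are written out for one CPU. Whether autoConfig has to be switched off for explicit values to take effect depends on the .NET version, so treat that attribute as an assumption to verify. Note that the attribute names are case-sensitive and are spelled maxconnection and maxIoThreads in the file.

```xml
<configuration>
  <system.web>
    <!-- Assumption: autoConfig="false" so the explicit values below are used -->
    <processModel autoConfig="false"
                  maxWorkerThreads="100"
                  maxIoThreads="100" />
    <!-- 88 * #CPUs and 76 * #CPUs, written out for one CPU -->
    <httpRuntime minFreeThreads="88"
                 minLocalRequestFreeThreads="76" />
  </system.web>
  <system.net>
    <connectionManagement>
      <!-- 12 * #CPUs, written out for one CPU -->
      <add address="*" maxconnection="12" />
    </connectionManagement>
  </system.net>
</configuration>
```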

    Wednesday, May 9, 2007 7:50 PM
  • Did you collect any performance counters whilst running this test?  In particular, did you watch the CPU and memory usage?  One cause of slowness could be saturating the CPU or RAM, though SpeechService is designed to reject calls if recognition or other latencies get high.  If the "application took longer than 4 secs to respond" is the 1st error, and you weren't saturating the CPU, then looking at thread usage inside w3wp would be a good place to start.  For example, when you do your database lookups, are they synchronous or asynchronous?  If the former, you will be blocking threads, which may cause you to run out of threadpool threads.  A workaround for this is to increase the max number of worker threads, or you could use async requests to avoid blocking the threads.
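
    As a rough illustration of the difference between a blocking and a non-blocking lookup, here is a minimal sketch. It does not touch a real database: LookupDemo, LookupBlocking, LookupAsync and LookupManyAsync are hypothetical names, and Thread.Sleep and Task.Delay stand in for the database round trip. It also uses the newer async/await style rather than the Begin/End pattern that was current when this thread was written.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LookupDemo
{
    // Hypothetical stand-in for a synchronous database call: the calling
    // thread is held for the whole delay and can do no other work.
    public static string LookupBlocking(int userId, int delayMs)
    {
        Thread.Sleep(delayMs);
        return "user-" + userId;
    }

    // Hypothetical stand-in for an asynchronous database call: the thread is
    // released at the await and the method is resumed later.
    public static async Task<string> LookupAsync(int userId, int delayMs)
    {
        await Task.Delay(delayMs);
        return "user-" + userId;
    }

    // Starts all lookups first, then waits for them together, so no thread
    // sits blocked while the simulated round trips are in flight.
    public static async Task<string[]> LookupManyAsync(int count, int delayMs)
    {
        var pending = new Task<string>[count];
        for (int i = 0; i < count; i++)
        {
            pending[i] = LookupAsync(i, delayMs);
        }
        return await Task.WhenAll(pending);
    }
}
```

    With the blocking version, 50 concurrent callers need 50 threads for the duration of the lookup. With the async version, the same 50 lookups can be in flight without holding any thread while they wait. In a real application the simulated delay would be replaced by an asynchronous ADO.NET call such as SqlCommand.BeginExecuteReader or SqlCommand.ExecuteReaderAsync.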

    Here are a couple of related threads - look at my responses near the bottom of each for examples of performing async operations:

    http://forums.microsoft.com/Ocs2007publicbeta/ShowPost.aspx?PostID=1578566&SiteID=57

    http://gotspeech.net/forums/thread/2769.aspx

     

    As to stuttering: once your application has initiated a prompt, the playing of that prompt is handled by a separate process sending RTP directly to your SIP Peer.  So I wouldn't expect that threadpool saturation inside w3wp would cause stuttering of a single prompt.  If the CPU or memory aren't saturated, you should look at network congestion for the RTP traffic.  If, however, your application is initiating multiple prompts sequentially, then you could get gaps between prompts which may sound like stuttering to the caller.

     

    Regarding the last application error, you would need to look through the speech server logs to find the events for that application instance to determine why it didn't shut down.

    Thursday, May 10, 2007 10:45 AM
  • This blog focuses on the DefaultWorkflowScheduler.  Speech Server uses the ManualWorkflowScheduler and uses the threadpool efficiently.  If you find that you are running out of thread pool worker threads, attach a debugger and look to see what the threads are doing.  More than likely, there will be several blocked on some synchronous IO (e.g. a database lookup), and replacing this with async IO would yield higher scalability than simply increasing the threadpool size to work around the problem.
    Thursday, May 10, 2007 10:58 AM
  • Thanks for your input, Anthony, very much appreciated!

     

    Nope, no performance counters. I actually didn't think it would be a performance test, just a check that everything worked fine with multiple users. It just saturated really fast!

     

    I see no problem doing async db calls but am unsure if this is the problem. At the beginning of each call, a search is performed to get some user data, but this is really fast (at least under normal conditions), and a few incoming calls a minute (that is, perhaps 5 fast db accesses a minute) shouldn't be a problem.

     

    I'll try to get another test going and be a bit more prepared for what to look for. Now I just know I don't know enough.

    Thursday, May 10, 2007 6:40 PM
  • Have now run a bunch of load tests on the webserver against pages that use workflow (ManualWorkflowSchedulerService).

     

    I first tried running a shorter workflow just to test the IO latency (the workflow did the same requests as above but without any delays) and could easily simulate 500 simultaneous users without any problems, running against a SQL Express on a developer machine and creating loads of files on disk. So I do not believe that this was our problem.

     

    The next step was to run a workflow that takes 16 s to complete (about the same as a normal call to our application) and does the same IO operations (2 database lookups and 2 XML files written to disk). Pretty soon (at 40 simultaneous users or so), the response times started to rise quickly. A hard test with 200 users had an average response time of 55 seconds (to run a 16 s workflow). I changed machine.config to maxWorkerThreads="100" maxIoThreads="100", and running the same test the average response time was down to 19 seconds. During all tests, CPU load was very low (below 5%) and RAM usage was also very low, so for these tests something else was limiting performance.

     

    Even if this scenario is not a real MSS scenario, it seems that the default settings were not tuned for running many simultaneous workflows, and I hope these settings will help MSS as well.

     

    Our next load test will be run next week and I will be monitoring CPU, memory, RTP traffic, workflow and w3wp counters. If you have any suggestions for what else to look at, please let me know.

     

    Thanks,

    Markus

     

     

     

     

    Wednesday, May 16, 2007 9:54 AM
  • I would suggest re-running your 16s workflow test.  When the response times start to get bad, attach a debugger and see what your threads are doing - I bet they're all blocked on something.  To my mind, increasing the maxWorkerThreads is a workaround not a fix, though I accept that it may be sufficient to achieve the scale required.
    Wednesday, May 16, 2007 12:29 PM
  • I guess you're right: just because the figures look better doesn't mean I have solved the problem yet. I'll try some debugging with the default thread settings and see what I find. Thanks!
    Friday, May 18, 2007 12:49 PM
  • Hi!

     

    Just ran a new test and had the same troubles again: after 5 min or so, MSS stopped responding. However, after a while, a bunch of error messages appeared in the event log, all from the same line of code, which really indicates a thread lock problem, just like Anthony suggested. This line of code queries the database, so the problem might be solved by using async calls.

    However, I believe this problem is due to some bad coding on my part. The error occurred right at the beginning of the workflow, where I have an activity in which all the conditions are based on a property backed by the field mDisturbances, which should be retrieved from the database if it is null. So several conditions run code that uses the same property, which does a database query, with no lock used at all...

     

    I believe setting a lock within the get for property Disturbances should at least eliminate this problem. Would making this property a DependencyProperty make any difference (it is only used internally in this activity)?

     

    Thanks for your input!

     

    //properties
    private DisturbanceList mDisturbances;

    private DisturbanceList Disturbances
    {
        get
        {
            //begin lock?
            if (mDisturbances == null)
            {
                TBLib.BS.BSDisturbance bs = new TBLib.BS.BSDisturbance(DataCache.Settings);
                mDisturbances = bs.GetDisturbances();
            }
            //end lock
            return mDisturbances;
        }
    }

    private int DisturbanceCode
    {
        get { return Disturbances.GetFirstDisturbance(); }
    }

    //conditions
    private void conMedKod_1(object sender, ConditionalEventArgs e)
    {
        e.Result = (DisturbanceCode == 1);
    }

    private void conMedKod_2(object sender, ConditionalEventArgs e)
    {
        e.Result = (DisturbanceCode == 2);
    }

    private void conMedKod_3_nn(object sender, ConditionalEventArgs e)
    {
        e.Result = (DisturbanceCode >= 3);
    }

    //dynamic statements
    private void T03_04_TurnStarting(object sender, TurnStartingEventArgs e)
    {
        if (Disturbances != null && Disturbances.Count > 0)
        {

    Friday, May 25, 2007 1:39 PM
  • What is mDisturbances a property on?  If it is on your workflow then, since it is not static, it is not shared between application instances.  Events raised by the Speech Server are serialized for a given application instance so there should be no multi-threaded access to Disturbances, so a lock will achieve nothing (unless you have threads coming from elsewhere).  If it hangs off a static object, then it is shared between application instances, and a lock would be necessary.
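
    To illustrate the static case described above, here is a minimal sketch of a lazily loaded value that is shared between all instances and therefore guarded by a lock. SharedDisturbances, Gate and LoadCount are hypothetical names, and List<int> stands in for the DisturbanceList type from the question, whose definition is not shown in this thread.

```csharp
using System;
using System.Collections.Generic;

public static class SharedDisturbances
{
    // One lock object and one cached list for the whole process.
    private static readonly object Gate = new object();
    private static List<int> cached;
    private static int loadCount;

    // Number of times the loader actually ran (for illustration only).
    public static int LoadCount
    {
        get { lock (Gate) { return loadCount; } }
    }

    // Returns the shared list, running the loader at most once even when
    // many threads call Get at the same time.
    public static List<int> Get(Func<List<int>> load)
    {
        lock (Gate)
        {
            if (cached == null)
            {
                cached = load();
                loadCount++;
            }
            return cached;
        }
    }
}
```

    The loader runs while the lock is held, so concurrent callers wait for the first load instead of each issuing their own query. For a non-static field on the workflow, as discussed above, this lock would add nothing.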

     

    As to your question regarding making it an InstanceDependency property, what I suggest you do is put a breakpoint on "mDisturbances = bs.GetDisturbances();" and see when it is hit.  You can use the Debugger SipPhone for this test.  If mDisturbances is null when you are expecting it to be non-null, then that suggests that an InstanceDependency is needed.
    Wednesday, May 30, 2007 9:27 AM
  • Hi Markus,

    Can you let us know the status of your issue? Thanks.

    Wednesday, June 13, 2007 10:47 PM
  •  

    Hi!

     

    Current status is that performance has improved. We found that db access performance was low and made the following changes:

    1. db lookups that were supposed to be real time are cached for 2 minutes, which in practice lowers the amount of db access by 50%

    2. all database access tries to get a connection from a connection pool of 120 connections (before, the connection pool, with no size set, rapidly rose to 60). If a connection is not received within 2 s, a non-pooled connection is created.

    3. data required for if-else branches (the conditions used properties that performed db lookups) is fetched before the if-else is reached, to avoid locks and extra fetches.

    4. maxIoThreads and maxWorkerThreads in machine.config were set to 200

     

    1 and 4 had great impact and 2 is just an extra precaution. 3 was not tested separately and I do not believe this had any real impact, but it looks safer anyway.
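
    The time-based caching in change 1 could look roughly like the sketch below. This is not the code we actually run: TimedCache and its members are hypothetical names, and the clock is passed in as a delegate only so that the expiry logic can be exercised without waiting two minutes.

```csharp
using System;

public sealed class TimedCache<T>
{
    private readonly object gate = new object();
    private readonly TimeSpan timeToLive;
    private readonly Func<T> load;
    private readonly Func<DateTime> clock;
    private T value;
    private DateTime expiresAt = DateTime.MinValue;

    public TimedCache(TimeSpan timeToLive, Func<T> load, Func<DateTime> clock)
    {
        this.timeToLive = timeToLive;
        this.load = load;
        this.clock = clock;
    }

    // Returns the cached value, reloading it only when it has expired.
    public T Get()
    {
        lock (gate)
        {
            DateTime now = clock();
            if (now >= expiresAt)
            {
                value = load();
                expiresAt = now + timeToLive;
            }
            return value;
        }
    }
}
```

    A cache for a db lookup would be created with something like new TimedCache<MyData>(TimeSpan.FromMinutes(2), LoadFromDb, () => DateTime.UtcNow), where MyData and LoadFromDb are placeholders for the real type and query.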

     

    So I guess I do not need the Timeout, even though I am still curious how this could be done.

     

    Thanks for your help!

    Monday, June 18, 2007 7:30 AM