ServicesResourcesConferencesOur TeamWeblogsAboutContact
   

Buddhike's Weblog

{binary mind}


IO Worker Threads - Performance Uncovered!

Writing my last post about WCF threading internals compelled me to reveal some of the tests I did sometime back. So today we are going to see how we can optimize the CPU utilization by using IO worker threads in CLR thread pool.
We are talking about CPU. Therefore I wrote this simple method to be used in my tests which does some CPU intensive operations. Note that I don't do any IO at all, not even a Console.WriteLine() and this is *very* important to assure the accuracy of the results.
In my CpuIntensiveMethod method I calculate the largest Fibonacci number that is below 100 and I run this calculation for int.MaxValue number of times (I will tell you why this number is preferred).

static void CpuIntensiveMethod(int testNumber)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
                                    
    // Let's perform some math to keep our CPU busy here.
    for (int i = 0; i < iterations; i++)
    {
        // Here we calculate the maximum Fibonacci number less than 100.
        int fn = 0;
        int preFn = 0;

        while (fn <= 100)
        {
            if (fn < 2)
            {
                preFn = fn;
                fn++;
                continue;
            }
            int next = fn + preFn;
            preFn = fn;
            fn = next;
        }
    }

    sw.Stop();
    timings[testNumber, 0] = Thread.CurrentThread.ManagedThreadId;
    timings[testNumber, 1] = sw.ElapsedMilliseconds;
    timings[testNumber, 2] = sw.ElapsedTicks;
}

In addition to the Fibonacci number calculation I also measure the time taken for computation using a StopWatch (newly available utility since .NET 2.0 bits) and record results in a multi-dimensional array. The reason is again, I did not want to do any IO in this method and thus stayed away from even using Console.WriteLine(). BTW: Since there is no interactive way to see whether the method has completed; I monitored the CPU usage using ProcessExplorer (or TaskManager) and then printed the values in the multi-dimensional array once the CPU usage spike is normalized.

In the first test, I simply ran CpuIntensiveMethod() in a single thread by simple calling this method from my main(). And the result was something like this.

Test 0 ran in managed thread id 1: Milliseconds 32942 Ticks 117920861

During the run I could see my CPU was at 100% which was quite expected. So at this point someone might say, that this is the optimal result that I should ever expect. I would definitely agree if I just want to use a single processor in my entire life. Apparently when I ran this test in my dual core box the process was consuming only 50% of my total processing power. This is the reason why that this model does not fit into server applications world. In fact to put this story in to server applications perspective, let's assume that CpuIntensiveMethod is a function in a server application which invoked by multiple clients. So if we had only one thread, we would take ~33 seconds to service one client. And everyone else would not be taken care of until we are done with current one.

So how can we solve this? We talked about one thread per client scenario in my last post (i.e. distributing work among multiple threads). Would it work? Sit back, we are about to try it now. To simulate this scenario I did my next test by running CpuIntensiveMethod method from several thread pool threads. But before talking about the test, let me tell you a little bit about some important CLR thread pool internals. CLR thread pool has worker threads (IO and non-IO) and a special thread called Gate thread. The gate thread is responsible for spinning up new worker threads. The algorithm it uses to determine when to spin a new thread is actually based on CPU utilization, GC frequency and worker queue size. So currently (to be more precise; as of last time I looked at win32threadpool.cpp back in 2006) this thread spins up for every 500ms. So if CpuIntensiveMethod method does not take too long it might get executed on the same thread pool thread and we would not be able to simulate concurrent clients as we originally intended. This is actually why I wanted to run my test int.MaxValue number of times (so, dear dev fellows, if you are running this code in a beefy box, please forgive me now :)). Getting back to our second test, I could successfully run my method in 10 thread pool threads concurrently and here are the numbers that I ended up with.

Test 0 ran in managed thread id 3: Milliseconds 116489 Ticks 416980652
Test 1 ran in managed thread id 4: Milliseconds 141677 Ticks 507140436
Test 2 ran in managed thread id 5: Milliseconds 149974 Ticks 536840900
Test 3 ran in managed thread id 6: Milliseconds 132707 Ticks 475034210
Test 4 ran in managed thread id 7: Milliseconds 156416 Ticks 559901108
Test 5 ran in managed thread id 8: Milliseconds 137661 Ticks 492765340
Test 6 ran in managed thread id 9: Milliseconds 149318 Ticks 534491916
Test 7 ran in managed thread id 10: Milliseconds 132840 Ticks 475507986
Test 8 ran in managed thread id 11: Milliseconds 136866 Ticks 489920458
Test 9 ran in managed thread id 12: Milliseconds 119222 Ticks 426763067

Looks horrible! Isn't it??? Each thread has taken roughly more than 110 seconds for the computation. This is more than 3 times of the time taken by first test. But don't panic. In fact I was happy the fact that it produced results I wanted to see. The reason is simple. In this test I had 10 threads, each striving to take 100% CPU. This means more work for the scheduler. Switcing contexts, swapping pages etc. cause the delay we are seeing in the above table. Consequently this proves that having one thread per client does not help us achieving the best CPU utilization.

In my final test I tried to service the same 10 concurrent calls that I had in the previous test but this time using thread pool's IOCP as shown below.

for (int i = 0; i < 10; i++)
{
    unsafe
    {
        Overlapped overlapped = new Overlapped(0, 0, IntPtr.Zero, null);
        NativeOverlapped* noverlapped = overlapped.Pack(new IOCompletionCallback
            (IoThreadProc));
       noverlapped->OffsetHigh = i;
       ThreadPool.UnsafeQueueNativeOverlapped(noverlapped);                
    }
}

Before I mention my results I must show you some nits in the above code snippet. Creating a NativeOverlapped structure is an expensive operation. Therefore you should try to create only one and reuse it if that fits bill (wanna be happy? Then listen. This is what WCF IOThreadScheduler does). However, in my case I wanted to have a way to pass the test number to CpuIntensiveMethod method. Therefore I used a simple hack by passing it in the OffsetHigh filed in the NativeOverlapped instance. Also you must make sure that you release the memory block held by NativeOverlapped structure by calling Overlapped.Free() function. Otherwise you will notice a rapidly growing working set and eventually your program will end up with an OOM exception (but here that's fine as this is a demo and I have only 10 instances of NativeOverlapped structure). OK, moving to the test results, this is what I got after running test 3.

Test 0 ran in managed thread id 3: Milliseconds 33614 Ticks 120325281
Test 1 ran in managed thread id 4: Milliseconds 35301 Ticks 126363625
Test 2 ran in managed thread id 3: Milliseconds 32928 Ticks 117867839
Test 3 ran in managed thread id 4: Milliseconds 34606 Ticks 123877189
Test 4 ran in managed thread id 3: Milliseconds 33359 Ticks 119411572
Test 5 ran in managed thread id 4: Milliseconds 34528 Ticks 123596836
Test 6 ran in managed thread id 3: Milliseconds 33502 Ticks 119922606
Test 7 ran in managed thread id 4: Milliseconds 34434 Ticks 123260605
Test 8 ran in managed thread id 3: Milliseconds 34454 Ticks 123331465
Test 9 ran in managed thread id 4: Milliseconds 34645 Ticks 124013520

Pretty impressive! Isn't it??? Now each method has taken roughly around 33 seconds which was the same number we saw during test 1. So we were able to serve 10 concurrent calls while using our optimal processing power. Also notice that each method is executed on a thread with MT id either 3 or 4. What does this mean? When CLR thread pool creates its IOCP, it calculates the number of CPUs in the box and passes that value to NumberOfConcurrentThreads parameter of CreateIoCompletionPort function. So in my case, I ran these tests in a dual core box and thus IOCP only allowed only two threads to be active concurrently. That's why we see only manage thread ids 3 and 4.

Concluding this post, I hope that these tests will help you to understand the optimizations that could be achieved by using IO worker threads.
Wanna try out the tests yourself? Here are the sources.

Have fun!

posted on Sunday, August 05, 2007 9:04 PM

Powered by Community Server, by Telligent Systems