Mindcraft Redux

On Mindcraft's April 1999 Benchmark

"First they ignore you, then they laugh at you, then they fight you, then you win." -- Gandhi (?)

Executive summary: The Mindcraft benchmark proved to be a wake-up call, and the Linux community responded effectively. Several problems which caused Apache to run slowly on Linux were found and resolved.

I've split the summary of recent benchmark results (including the Mindcraft benchmarks) off into a separate page, "NT vs. Linux Server Benchmark Graphs", at the request of readers who liked the data but weren't interested in the history of the Linux kernel hacking community's response to the benchmarks.


Introduction

In March 1999, Microsoft commissioned Mindcraft to carry out a comparison between NT and Linux; the resulting report concluded that NT was 2 to 3 times faster than Linux. This provoked responses from Apacheweek.org, Eric Lee Green, Jeremy Allison, and Linux Weekly News, among others. Microsoft posted a rebuttal to the responses (evidently Microsoft takes this very seriously).

The responses generally claimed that Mindcraft had not configured Linux properly, and gave specific examples. Both Mindcraft and the Linux community agree that good tuning information for Linux is hard to find.

Why the Outcry?

The Linux community responded to Mindcraft's announcements with hostility, at least partly because of Mindcraft's attitude. Mindcraft stated "We posted notices on various Linux and Apache newsgroups and received no relevant responses." and concluded that the Linux 2.2 kernel wasn't as well supported as NT.

Mindcraft did not seem to take the time to become familiar with all the appropriate forums for discussion, and apparently did not respond to requests for further information (see section III of Eric Green's response). Others have had better success; in particular, three key kernel improvements all came about in the course of normal support activities on Usenet, the linux-kernel mailing list, and the Apache bug tracking database. I believe the cases illustrated below indicate that free 2.2.x kernel support is better than Mindcraft concluded.

Also, Mindcraft's April 13th press release neglected to mention that Microsoft sponsored the Mindcraft benchmarks, that the tests were carried out at a Microsoft lab, and that Mindcraft had access to the highest level of NT support imaginable.

Finally, Mindcraft did not try to purchase a support contract for Linux from e.g. LinuxCare or Red Hat, both of whom were offering commercial support at the time of Mindcraft's tests.

Mindcraft proposed repeating the benchmarks at an independent testing lab to address concerns that their testing was biased, but they have not yet addressed concerns that their conclusions about Linux kernel support are biased.

Truth or FUD?

Mindcraft probably did tune NT well and Linux poorly -- but rather than assume this fully accounts for Linux's poor showing, let's look for other things that could have contributed. I'm going to focus on the web tests, since that's what I'm familiar with.

Previous Benchmarks

Although Apache was designed for flexibility and correctness rather than raw performance, it has done quite well in benchmarks in the past. In fact, Ziff-Davis's January 1999 benchmarks showed that "Linux with Apache beats NT 4.0 with IIS, hands down". (Also, Apache currently beats Microsoft IIS at the single processor SPECWeb96 benchmark, but this is with special caching software.)

Yet Mindcraft found that Apache's performance falls off dramatically when there are more than 160 clients (see graph). Is this a contradiction?

Not really. The benchmarks done by Jef Poskanzer, the author of the high-performance server 'thttpd', showed that Apache 1.3.0 (among other servers) has trouble above 125 connections on Solaris 2.6. The number of clients served by Apache in the Ziff-Davis tests above was 40 or less, below the knee found by Poskanzer. By contrast, in the Mindcraft tests (and in the IIS SPECWeb96 results), the server was asked to handle over 150 clients, above the point where Poskanzer saw the dropoff.

Also, the January Ziff-Davis benchmarks used a server with only 64 megabytes of RAM, not enough to hold both the server code and the 60 megabyte WebBench 2.0 document tree used by both Mindcraft and Ziff-Davis, whereas Mindcraft used 960 megabytes of RAM.

So it's not surprising that the Jan '99 Ziff-Davis and April '99 Mindcraft tests of Apache got different results.

Does it matter?

These benchmarks are done on static pages, using very little of Apache's dynamic page generation power. Christopher Lansdown points out that the performance levels reached in the test correspond to sites that receive over a million hits per day on static pages. It's not clear that the results of such a test have much relevance to typical big sites, which tend to use a lot of dynamically generated pages.

Another objection to these benchmarks is that they don't accurately reflect the real world of many slow connections. A realistic benchmark for a heavily-trafficked Web server would involve 500 or 1000 clients all restricted to 28.8 or 56 Kbps, not 100 or so clients connected via 100baseT.
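To see why slow clients change the picture, here is a rough back-of-the-envelope sketch. The 10 KB response size and the 200 requests/second load are made-up illustrative numbers, not figures from any of the benchmarks, and the function names are mine. It estimates how long one response occupies a connection at a given link speed, and from that how many connections are open at once (request rate times transfer time).

```python
def transfer_seconds(size_bytes, link_bits_per_sec):
    """Time to push one response through a client's link.

    Ignores latency and protocol overhead, so it is a lower bound.
    """
    return size_bytes * 8 / link_bits_per_sec


def concurrent_connections(requests_per_sec, size_bytes, link_bits_per_sec):
    """Average number of connections in flight (rate x time per request)."""
    return requests_per_sec * transfer_seconds(size_bytes, link_bits_per_sec)


if __name__ == "__main__":
    SIZE = 10 * 1024   # hypothetical response size in bytes
    RATE = 200         # hypothetical request rate per second
    links = [
        ("28.8 Kbps modem", 28_800),
        ("56 Kbps modem", 56_000),
        ("100baseT", 100_000_000),
    ]
    for name, bps in links:
        in_flight = concurrent_connections(RATE, SIZE, bps)
        print(f"{name:16s} {in_flight:8.1f} connections in flight")
```

With these assumed numbers the modem clients keep hundreds of connections open at once, while the 100baseT clients keep well under one, so a handful of fast clients does not exercise the server the way a crowd of slow ones would.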

A benchmark that aims to deal with both of these concerns is the new SPECWeb99 benchmark. When it becomes available, it looks like it will set the standard for how web servers should be benchmarked.

Nevertheless, Linus seems to feel that until more realistic benchmarks (like SPECWeb99) become available, benchmarks like the one Mindcraft ran are an understandable if dirty compromise.

Kernel issue #1: TCP bug

Why does Apache fall off above 160 active connections in the tests above? It appears the steep falloff may have been due to a TCP stack problem reported by ariel@sgi.com and later by Karthik Prabhakar:

Kernel issue #2: Wake-One vs. the Thundering Herd

(Note: According to the Linux Scalability Project's paper on the thundering herd problem, a "task exclusive" wake-one patch is now integrated into the 2.3 kernel; however, according to Andrea, as of 2.4.0-test10, it still wakes up processes in the same order they were put to sleep, which is not optimal from a caching point of view. The reverse order would be better.

See also Nov 2000 measurements by Andrew Morton (andrewm@uow.edu.au); post 1, post 2, and Linus' reply.)
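For readers new to the term, here is a toy model of the thundering herd. It is not the kernel patch, just a simulation of a wait queue, and it assumes each request is finished (and its server process asleep again) before the next connection arrives. The function name and the 16-process, 1000-connection figures are mine, chosen only for illustration. It counts how many process wakeups each policy costs, and how many distinct processes end up doing the work, which is a crude stand-in for cache friendliness.

```python
from collections import deque


def simulate(n_procs, n_conns, wake_one, lifo=False):
    """Return (total wakeups, distinct processes that served a connection)."""
    sleepers = deque(range(n_procs))  # oldest sleeper on the left
    wakeups = 0
    servers = set()
    for _ in range(n_conns):
        if wake_one:
            # Wake exactly one sleeper: newest first (LIFO) or oldest first (FIFO).
            winner = sleepers.pop() if lifo else sleepers.popleft()
            wakeups += 1
        else:
            # Wake-all: every sleeper runs, one wins the connection,
            # and the rest go straight back to sleep.
            wakeups += len(sleepers)
            winner = sleepers.popleft()
        servers.add(winner)
        # The winner handles the request, then sleeps again as the newest sleeper.
        sleepers.append(winner)
    return wakeups, len(servers)


if __name__ == "__main__":
    print("wake-all      ", simulate(16, 1000, wake_one=False))
    print("wake-one FIFO ", simulate(16, 1000, wake_one=True))
    print("wake-one LIFO ", simulate(16, 1000, wake_one=True, lifo=True))
```

In this model wake-all costs one wakeup per sleeping process per connection, wake-one costs one per connection, and the reverse (LIFO) order keeps handing work to the same recently run process instead of cycling through all of them, which is the caching point Andrea makes above.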

The latest version of the wake-one patch is listed below.

See also:

Kernel issue #3: SMP Bottlenecks in 2.2 Kernel

Other Apache users getting help solving performance problems

Kernel issue #4: Interrupt Bottlenecks

According to Zach, the Mindcraft benchmark's use of four Fast Ethernet cards and a quad SMP system exposes a bottleneck in Linux's interrupt processing; the kernel spent a lot of time in synchronize_bh(). (A single Gigabit Ethernet card would stress this bottleneck much less.) According to Mingo, TCP throughput scales much better with number of CPUs in 2.3.9 than it did in 2.2.10, although he hasn't tried it with multiple Ethernets yet.

See also comments on reducing interrupts under heavy load by Steve Underwood and Steven Guo.

See also Linus's "State of Linux" talk at Usenix '99 where he talks about the Mindcraft benchmark and SMP scalability.

See also SCT's Jan 2000 comments about progress in scalability.

Softnet is coming! Kernel 2.3.43 adds the new softnet networking changes. Softnet changes the interface to the networking cards, so every single driver needs updating, but in return network performance should scale much better on large SMP systems. (For more info, see Alexy's readme.softnet, his softnet-howto, or his Feb 15 post about how to convert old drivers.)

The Feb '00 thread Gigabit Ethernet Bottlenecks (especially its second week) has lots of interesting tidbits about what interrupt (and other) bottlenecks remain, and how they are being addressed in the 2.3 kernel.

Ingo Molnar's post of 27 Feb 2000 describes interrupt-handling improvements in the IA32 code in great detail. These will be moving into the core kernel in 2.5, it seems.

Kernel issue #5: Mysterious network slowdown

This one is a bug, not a scalability issue.
Several 2.2 users have reported that sometimes networking slows down to 1 to 10% of normal, with high ping times, and that cycling the interface fixes the problem temporarily.

Kernel issue #6: 2.2.x/NT TCP slowdown

Petru Paler, July 10 1999, in linux-kernel ( [BUG] TCP connections between Linux and NT ) reported that any kind of TCP connection between Linux (2.2.10) and an NT Server 4 (Service Pack 5) slows down to a crawl. The problem was much milder (6kbytes/sec) with 2.0.37. He included a log of a slow connection made with tcpdump, which helped Andi Kleen see that NT was taking a long time to ACK a data packet, which was causing Linux to throttle back.
Solved: false alarm! It wasn't Linux's fault at all. Turns out NT needed to be told to not use full duplex mode on the ethernet card.

Kernel issue #7: Scheduler

Phil Ezolt, 22 Jan 2000, in linux-kernel ( Re: Interesting analysis of linux kernel threading by IBM):
When I run SPECWeb96 tests here, I see both a large number of running process and a huge number of context switches. ... Here's a sample of the vmstat data:
   procs                         memory    swap          io      system         cpu
 r  b  w   swpd    free    buff   cache  si  so    bi    bo    in    cs  us  sy  id
...
24  0  0   2320 2066936  590088 1061464   0   0     0     0  8680  7402   3  96   1
24  0  0   2320 2065752  590664 1061464   0   0     0  1095 11344 10920   3  95   1
Notice. 24 running process and ~7000 context switches.

That is a lot of overhead. Every second, 7000*24 goodnesses are calculated. Not the (20*3) that a desktop system sees. This is a scalability issue. A better scheduler means better scalability.

Don't tell me benchmark data is useless. Unless you can give me data using a real system and where it's faults are, benchmark data is all we have.

SPECWeb96 pushes Linux until it bleeds. I'm telling you where it bleeds. You can fix it or bury your head in the sand. It might not be what your system is seeing today, but it will be in the future.

Would you rather fix it now or wait until someone else how thrown down the performance gauntelet?

...

Here's a juicy tidbit. During my runs, I see 98% contention on the [2.2.14] kernel lock, and it is accessed ALOT. I don't know how 2.3.40 compares, because I don't have big memory support for it. Hopefully, Andrea will be kind enough give me a patch, and then I can see if things have improved.

[Phil's data is for the web server undergoing the SPECWeb96 test, which is an ES40 4 CPU alpha EV6 running Redhat 6.0 w/kernel v2.2.14 and Apache-v1.3.9 w/SGI performance patches; the interfaces receiving the load are two ACENic gigabit ethernet cards.]
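Phil's arithmetic is worth spelling out. The sketch below just multiplies context switches per second by the number of runnable processes, on the reading that each reschedule evaluates goodness() once per runnable process; the function name is mine, and the desktop figures are taken from his "(20*3)" in the same way.

```python
def goodness_calls_per_sec(context_switches_per_sec, runnable_processes):
    """Goodness evaluations per second if every reschedule scans the run queue."""
    return context_switches_per_sec * runnable_processes


server = goodness_calls_per_sec(7000, 24)   # figures from the vmstat sample
desktop = goodness_calls_per_sec(20, 3)     # Phil's desktop comparison
print(server, desktop, server // desktop)
```

That is 168,000 evaluations per second on the loaded server against 60 on the desktop, a factor of 2,800, and the cost grows with both the switch rate and the run queue length.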

Kernel issue #8: SMP Bottlenecks in 2.4 kernel

Manfred Spraul, April 21, 2000, in linux-kernel ( [PATCH] f_op->poll() without lock_kernel()):
kumon@flab.fujitsu.co.jp noticed that select() caused a high contention for the kernel lock, so here is a patch that removes lock_kernel() from poll(). [tested] with 2.3.99-pre5.
There was some discussion about whether this was wise at this late date, but Linus and David Miller were enthusiastic. Looks like one more bottleneck bites the dust.

On 26 April 2000, kumon@flab.fujitsu.co.jp posted benchmark results in Linux-Kernel with and without the lock_kernel() in poll(). The followups included a kernel patch to improve checksum performance and a patch for Apache 1.3 to force it to align its buffers to 32-word boundaries. The latter patch, by Dean Gaudet, earned praise from Linus, who relayed rumors that this can speed up SPECWeb results by 3%. This was an interesting thread.

See also LWN's coverage, and the paragraph below, in which Kumon presents some benchmark results and another patch.

Kernel issue #9: csum_partial_copy_generic

kumon@flab.fujitsu.co.jp, 19 May 2000, in linux-kernel ( [PATCH] Fast csum_partial_copy_generic and more ) reports a 3% reduction in total CPU time compared to 2.3.99-pre8 on i686 by optimizing the cache behavior of csum_partial_copy_generic. The workload was ZD's WebBench. He adds
The benchmark we used has almost same setting as the MINDCRAFT ones, but the apache setting is [changed] slightly not to use symlink checking.

We used maximum of 24 independent clients and number of apache processes is 16. A four-way XEON procesor system is used, and the performance is twice and more than a single CPU performance.

Note that in ZD's benchmarks with 2.2.6, a 4 CPU system only achieved a 1.5x speedup over a single CPU. Kumon is reporting a > 2x speedup. This appears to be about the same speedup NT 4.0sp3 achieved with 4 CPUs at that number of clients (24). It's encouraging to hear that things may have improved in the 11 months since the 2.2.6 tests. When I asked him about this, Kumon said
Major improvement is between pre3 and pre5, poll optimization. Until pre4 (I forget exact version), kernel-lock prevents performance improvement.

If you can retrieve l-k mails around Apr 20-25, the following mails will help you understand the background.

subject: namei() query
subject: [PATCH] f_op->poll() without lock_kernel()
subject: lockless poll() (was Re: namei() query)
subject: "movb" for spin-unlock (was Re: namei() query)

On 4 Sept 2000, kumon posted again, noting that his change still hadn't made it into the kernel.

Kernel issue #10: getname(), poll() optimizations

On 22 May 2000, Manfred Spraul posted a patch on linux-kernel which optimized kmalloc(), getname(), and select() a bit, speeding up apache by about 1.5% on 2.3.99-pre8.

Kernel issue #11: Reducing lock contention, poll overhead in 2.4

On 30 May 2000, Alexander Viro posted a patch that got rid of a big lock in close_flip() and _fput(), and asked for testing. kumon ran a benchmark, and reported:
I measured viro's ac6-D patch with WebBench on 4cpu Xeon system. I applied to 2.4.0-test1 not ac6. The patch reduced 50% of stext_lock time and 4% of the total OS time. ... Some part of kmalloc/kfree overhead is come from do_select, and it is easily eliminated using small array on a stack.
kumon then posted a patch that avoids kmalloc/kfree in select() and poll() when the number of fds involved is under 64.

Kernel issue #12: Poor disk seek behavior in 2.2, new elevator code in 2.4

On 20 July 2000, Robert Cohen (robert@coorong.anu.edu.au) posted a report in Linux-kernel listing netatalk (AppleTalk file sharing) benchmarks comparing 2.0, 2.2, and several versions of 2.4.0-pre. The elevator code in 2.4 seems to help (some versions of 2.4 can handle 5 benchmark clients instead of 2) but ...
The more recent test4 and test5pre2 don't fair quite so well. They handle 2 clients on a 128 Meg server fine, so they're doing better than 2.2 but they choke and go seek bound with 4 clients. So something has definitely taken a turn for the worse since test1-ac22.
Here's an update. The *only* 2.4 kernel versions that could handle 5 clients were 2.4.0-test1-ac22-riel and 2.4.0-test1-ac22-class 5+; everything before and after (up to 2.4.0-test5pre4) can only handle 2.

On 26 Sept 2000, Robert Cohen posted an update which included a simple program to demonstrate the problem, which appears to be in the elevator code. Jens Axboe (axboe@suse.de) responded that he and Andrea had a patch almost ready for 2.4.0-test9-pre5 that fixes this problem.

On 4 Oct 2000, Robert Cohen posted an update with benchmark results for many kernels, showing that the problem still exists in 2.4.0-test9.

Kernel issue #13: Fast Forwarding / Hardware Flow Control

On 18 Sept 2000, Jamal (hadi@cyberus.ca) posted a note in Linux-kernel describing proposed changes to the 2.4 kernel's network driver interface; the changes add hardware flow control and several other refinements. He says
Robert Olson and I decided after the OLS that we were going to try to hit the 100Mbps(148.8Kpps) routing peak by year end. I am afraid the bar has been raised. Robert is already hitting with 2.4.0-test7 ~148Kpps with a ASUS CUBX motherboard carrying PIII 700 MHZ coppermine with about 65% CPU utilization. With a single PII based Dell machine i was able to get a consistent value of 110Kpps.

So the new goal is to go to about 500Kpps ;-> (maybe not by year end, but surely by that next random Linux hacker conference)

A sample modified tulip driver (hacked by Alexey for 2.2 and mod'ed by Robert and myself over a period of time) is supplied as an example on how to use the feedback values. ...

I believe we could have done better with the mindcraft tests with these changes in 2.2 (and HW FC turned on).

[update] BTW, I am informed that Linux people were _not_ allowed to change the hardware for those tests, so I dont think they could have used these changes if they were available back then.

Kernel tuning issue: hitting TIME_WAIT

On 30 March 2000, Takashi Richard Horikawa posted a report in Linux-Kernel listing SPECWeb96 results for both the 2.2.14 and 2.3.41 kernels. Performance between a 2.2.14 client and a 2.2.14 server was poor because the range of local ports in use was so small that port numbers came up for reuse while their previous connections were still in TIME_WAIT. The moral of the story may be to tune the client and servers to use as large a port range as possible, e.g. with
echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range
to avoid bumping into this situation when trying to simulate large numbers of clients with a small number of client machines.

On 2 April 2000, Mr. Horikawa confirmed that increasing the local port range with the above command solved the problem.
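A rough estimate shows why the port range matters. This sketch assumes a TIME_WAIT period of 60 seconds and a default local port range of 1024 to 4999, both of which are my assumptions about kernels of that era rather than figures from Mr. Horikawa's report. If every connection from one client machine to one server address and port ties up a local port for the whole TIME_WAIT period, the sustainable rate of new connections is the number of ports divided by that period.

```python
def max_conn_rate(low_port, high_port, time_wait_seconds=60):
    """Upper bound on new connections/second from one client to one server port.

    Assumes each closed connection holds its local port for the full
    TIME_WAIT period before the port number can be reused.
    """
    ports = high_port - low_port + 1
    return ports / time_wait_seconds


if __name__ == "__main__":
    for low, high in [(1024, 4999), (1024, 65535)]:
        print(f"{low}-{high}: about {max_conn_rate(low, high):.0f} connections/second")
```

Under these assumptions the narrow range caps a single client machine at well under a hundred new connections per second, while the full range allows roughly a thousand, which matters when a few client machines stand in for many users.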

Suggestions for future benchmarks

Become familiar with linux-kernel and the Apache mailing lists as well as the Linux newsgroups on Usenet (try DejaNews power searches in forums matching '*linux*').
Post your proposed configuration and see whether people agree with it. Also, be open about your benchmark; post intermediate results, and see if anyone has suggestions for improvements. You should probably expect to spend a week or so mulling over ideas with these mailing lists during the course of your tests.

If possible, use a modern benchmark like SPECWeb99 rather than the simple ones used by Mindcraft.

It might be interesting to inject latency into the path between the server and the clients to more realistically model the situation on the Internet.

Benchmark both single and multiple CPUs, and single and multiple Ethernet interfaces, if possible. Be aware that the networking performance of version 2.2.x of the Linux kernel does not scale well as you add more CPUs and Ethernet cards. This applies mostly to static pages and cached dynamic pages; noncached dynamic pages usually take a fair bit of CPU time, and should scale very well when you add CPUs. If possible, use a cache to save commonly generated pages; this will bring the dynamic page speeds closer to the static page speeds.

When testing dynamic content: Don't use the old model of running a separate process for each request; nobody running a big web site uses that interface anymore, as it's too slow. Always use a modern dynamic content generation interface (e.g. mod_perl for Apache).
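The caching suggestion above can be sketched in a few lines. This is only an illustration of the idea, not how any particular Apache module does it; render_page is a hypothetical stand-in for an expensive page generator, and a real cache would also need some way to expire stale pages.

```python
from functools import lru_cache

render_count = 0  # how many times the expensive generator actually ran


@lru_cache(maxsize=128)
def render_page(path):
    """Hypothetical stand-in for an expensive dynamic page generator."""
    global render_count
    render_count += 1
    return f"<html><body>Generated content for {path}</body></html>"


# Simulate 2000 requests spread over two popular pages.
for _ in range(1000):
    render_page("/index")
    render_page("/news")

print("requests: 2000, pages actually generated:", render_count)
```

Only the first request for each page pays the generation cost; the rest are served from memory, which is what brings dynamic page speeds closer to static ones.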

Configuring Linux

Tuning problems probably resulted in a performance decrease of less than 20% in Mindcraft's test, so as of 3 October 1999, most people will be happy with a stock 2.2.13 kernel or whatever comes with Red Hat 6.1. The 2.4 kernel, when it's available, will help with SMP performance.

Here are some notes if you want to see what people going for the utmost were trying in June:

Configuring Apache

Related reading


Copyright 1999-2003 Dan Kegel
dank@alumni.caltech.edu
Last updated: 16 Jan 2003