How to test container scalability
Posted by sdo on May 02, 2007 at 11:38 AM | Comments (3)
Recently, I've been asked a lot about Covalent Technologies' report that Tomcat 6 can scale to 16,000 users and what that means for glassfish. Since glassfish can easily scale to 16,000 users as well (as Covalent found out once they properly configured glassfish), my reply has usually been accompanied by a shrug: we've known for quite some time that NIO scales well.
But what does it mean to scale to N number of users, where N is large? The answer is highly dependent on your benchmark, and in particular on the think time that your benchmark uses. It's very easy to scale to 16,000 users if they each make a request every 90 seconds: that's on the order of 180 requests/second. On the other hand, if there's no think time in the equation, then continually handling 16,000 requests is quite difficult, particularly on small machines. Closely related to this is the response time of your requests: handling 16,000 requests with an average response time of 10 seconds isn't particularly helpful to your end users. But the most difficult aspect of scaling to 16,000 users is finding sufficient client horsepower to make sure that the clients themselves aren't the bottleneck. Otherwise, any conclusions you draw about the throughput or performance of the server are simply wrong: the conclusions apply to the performance of the clients. So in this blog, I'll explore some of the considerations you need to examine in order to benchmark a large system properly.
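For the record, the arithmetic behind those numbers is just Little's law: a closed population of N users, each cycling through a response time R and a think time Z, can generate at most about N / (R + Z) requests per second. Here's that calculation for the two cases above; the code is purely illustrative, not part of any benchmark driver:

public class OfferedLoad {
    // Offered load from a closed population of users: X <= N / (R + Z),
    // where R is the response time and Z is the think time, in seconds.
    static double offeredLoad(int users, double responseTimeSec, double thinkTimeSec) {
        return users / (responseTimeSec + thinkTimeSec);
    }

    public static void main(String[] args) {
        // 16,000 users who each make a request every 90 seconds (R + Z = 90): easy.
        System.out.printf("one request per 90s: %.0f req/s%n", offeredLoad(16000, 0.0, 90.0));
        // The same 16,000 users with no think time and 10-second responses:
        // the server sees only 1,600 req/s, but every user waits 10 seconds.
        System.out.printf("no think time, 10s responses: %.0f req/s%n", offeredLoad(16000, 10.0, 0.0));
    }
}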
I've written before about why the Apache Benchmark can't handle this situation (surprisingly enough, I'd been ranting against ab long before Covalent published their benchmark; it's just fortuitous timing that they brought ab's failings to light at the same time I was fed up with questions about ab benchmarks from my colleagues). So for the tests I'll describe here, I used Faban's new Common Driver. I've also previously written about how Faban is a great, configurable benchmarking framework, but the new common driver is a simple, command-line program that can benchmark requests of a single URL. I ran the tests on a partitioned SunFire T2000. This particular machine has 24 logical CPUs (6 cores with 4 hardware threads each, but for our purposes, simply 24 CPUs), which I partitioned into a server set of 4 CPUs and a client set of 20 CPUs. Yes, it takes 20 CPUs to drive some of the tests I ran, and so for consistency, I kept that configuration for all of them. But it's a crucial point: if the client is a bottleneck, you're measuring the client performance, not the server performance. Using a set of processors on a single machine allowed me to run the tests bypassing the network, which also removes a potential bottleneck from measuring the server performance. Given that there are only 4 CPUs for the server, I configured all containers to use 2 acceptor threads and 20 worker threads, and otherwise followed Sun's and Covalent's blog entries on configuring the containers.
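For those who haven't looked inside an NIO connector: the acceptor/worker split works because a small number of threads polling a Selector can watch thousands of mostly-idle connections, handing off to a worker thread only when a request actually arrives. Here's a stripped-down sketch of that pattern, with a single thread doing the accepting and polling where the real connectors use a couple -- this is not the actual GlassFish or Tomcat connector code, just the shape of the idea:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal acceptor/worker sketch: one selector thread multiplexes all the
// connections; requests that have data ready are handed to a worker pool.
public class NioSketch {
    public static void main(String[] args) throws IOException {
        ExecutorService workers = Executors.newFixedThreadPool(20); // "worker threads"

        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                        // block until something is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {             // new connection: register it, no dedicated thread
                    SocketChannel ch = server.accept();
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {        // request bytes available: let a worker handle it
                    key.interestOps(0);               // drop read interest while the worker runs
                    SocketChannel ch = (SocketChannel) key.channel();
                    workers.submit(() -> handleRequest(ch));
                }
            }
        }
    }

    static void handleRequest(SocketChannel ch) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(8192);
            ch.read(buf);                             // read the request (simplified)
            ch.write(ByteBuffer.wrap(
                "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok".getBytes()));
            ch.close();
        } catch (IOException e) { /* ignored in this sketch */ }
    }
}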
I started with a simple test:
java -d64 -classpath $JAVA_HOME/lib/tools.jar:fabancommon.jar:fabandriver.jar -Xmx3500m -Xms3500m com.sun.faban.driver.cd -c 30 http://localhost/tomcat.gif
This runs 30 separate clients (each in its own thread), each of which continually requests tomcat.gif with no think time. You'll notice we're using a 64-bit JVM for the test; eventually we'll be creating 16,000 threads, which will require more than 4GB of address space. So to make it easier for me, I used that JVM for all my tests. Have I mentioned that driving a big client load requires a lot of resources so that the client doesn't become the bottleneck?
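If you haven't used a closed-loop driver before, each simulated user is essentially just a thread running a request/measure loop with no pause between requests. Here's a stripped-down sketch of that pattern -- not Faban's actual code, just enough to show what those 30 client threads are doing:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicLong;

// Bare-bones closed-loop load generator: one thread per simulated user,
// each issuing requests back-to-back (zero think time) and recording times.
public class MiniDriver {
    static final AtomicLong requests = new AtomicLong();
    static final AtomicLong totalNanos = new AtomicLong();

    public static void main(String[] args) throws Exception {
        int users = 30;                                   // -c 30
        URL url = new URL("http://localhost/tomcat.gif");
        for (int i = 0; i < users; i++) {
            new Thread(() -> runUser(url)).start();
        }
        Thread.sleep(60_000);                             // measurement interval
        long n = requests.get();
        System.out.printf("%.1f ops/sec, avg resp %.3f s%n",
                n / 60.0, totalNanos.get() / 1e9 / n);
        System.exit(0);
    }

    static void runUser(URL url) {
        while (true) {
            long start = System.nanoTime();
            try {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                try (InputStream in = conn.getInputStream()) {
                    byte[] buf = new byte[8192];
                    while (in.read(buf) != -1) { /* drain the response */ }
                }
            } catch (Exception e) {
                // a real driver counts this as an error
            }
            totalNanos.addAndGet(System.nanoTime() - start);
            requests.incrementAndGet();
        }
    }
}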
The common driver reports three pieces of information: the number of requests served per second, the average response time per request, and the 90th percentile response time (90% of requests were served in that time or less). It will also report the number of errors observed and some error conditions I'll discuss a little later. I varied this test for different numbers of clients to see these results:
# Users   Glassfish        Tomcat
30        7552.9/0.004     7614.6/0.003
100       10004.6/0.009    7680.4/0.013
1000      12434.7/0.079    6880.3/0.145
5000      8942.7/0.534     7589.0/0.654

The results here are operations per second and the average response time in seconds. I'd assume that I've misconfigured Tomcat's file cache here, but the point isn't to make a comparison of the products' absolute performance; rather, it is to explore issues around scalable benchmarking. For static content, we get decent scaling, though at some point there are enough requests that the throughput of the server suffers: just what we would expect. So what about a dynamic test? Here are some numbers from surfing to http://localhost/Ping/PingServlet -- which is just a simple servlet that prints out 4 html strings and returns.
# Users   Glassfish        Tomcat
30        5033.3/0.005     7154.0/0.004
100       6359.5/0.015     7459.5/0.013
1000      7411.2/0.134     6483.2/0.154
5000      6060.1/0.818     6976.5/0.712
16000     6144.3/2.544     5263.0/2.375

Here the numbers are fairly close. At the low end, glassfish pays a penalty for being a full Java EE container, which requires it to do some additional work for the simple servlet. [Though the fact that the glassfish ops/sec increases so much with more users is an indication that there's probably some bottleneck we could fix in the code at 30 users; hmm...a performance engineer's work is never done.] That result at 5000 users? I'll discuss it later, but it's an anomaly. But first: what about 16,000 connections? In addition to producing low throughput, the tomcat run also reported:
ERROR: Little's law failed verification: 16000 users requested; 13092.3455 users simulated.
In essence: almost 20% of the connections weren't serviced as expected (glassfish reported a similar error). I could repeat the test, and sometimes it would pass; sometimes it would fail. But I'm clearly at the limit of the hardware and software here. In this scenario, most of the errors are connection timeouts: the server is too saturated in this test to accept new connections. Note that this wouldn't happen with something like ab, because ab's single-threaded nature inherently introduces an arbitrary (and unmeasured) amount of think time into the equation. The amount of think time is crucial, in that it drastically reduces the load on the server; and an arbitrary amount of think time is fatal, because we no longer know what we're measuring.
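That error message, by the way, is the driver applying Little's law to its own run: the number of users it effectively simulated should be close to the measured throughput times the average cycle time (response time plus think time), and when connections time out or stall, the implied number comes up short of what was requested. Here's a rough sketch of that kind of check -- it won't reproduce Faban's exact 13,092 figure, since Faban computes it from per-request data, but it shows the idea:

// Little's law: users actually simulated ~= throughput * (response time + think time).
// If that comes up well short of the users requested, the run wasn't really
// driving the load it claimed to drive, and the driver flags it.
public class LittlesLawCheck {
    static double impliedUsers(double throughput, double avgRespTime, double thinkTime) {
        return throughput * (avgRespTime + thinkTime);
    }

    public static void main(String[] args) {
        // Tomcat's 16,000-user run with no think time (numbers from the table above):
        double implied = impliedUsers(5263.0, 2.375, 0.0);
        System.out.printf("16000 users requested; %.1f users implied by the measurements%n", implied);
    }
}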
To test this scenario properly, we introduce a deterministic think time into the driver by including a -W 2000 parameter, which says each client should have a 2 second (2000 ms) think time between requests. Now for 16,000 users, each server gave me these results:
                 Glassfish   Tomcat
ops/second       6988.9      6615.3
Avg. resp time   0.242       0.358
Max resp time    1.519       3.693
90% resp time    0.6         0.75

Now both containers are handling the 16,000 users, and the data we get regarding throughput and response time is valid.
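As a sanity check, Little's law says 16,000 users with a 2-second think time can generate at most about 16,000 / (2 + R) requests per second, and the measured throughput sits just under that ceiling for both containers -- which is what you'd expect from a run where the clients really were driving the load they claimed to drive:

// Throughput ceiling for a closed population: X <= N / (Z + R).
// The response times below are from the 16,000-user table above.
public class ThinkTimeCheck {
    public static void main(String[] args) {
        double users = 16000, think = 2.0;
        System.out.printf("glassfish ceiling: %.0f req/s (measured 6988.9)%n", users / (think + 0.242));
        System.out.printf("tomcat ceiling:    %.0f req/s (measured 6615.3)%n", users / (think + 0.358));
    }
}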
Back to that result at 5000 users. The other interesting output from the Faban common driver for the glassfish result was:
ERROR: Think time deviation too high; request 0; actual is 1.0
Or in the case of tomcat, the actual was 6.0 (accounting for their better score) -- but the point is, although we didn't want think time on the client, the client had some bottleneck that didn't allow it to keep up, and hence the benchmark result suffered. In effect, we ended up benchmarking the client again, having yet again introduced an arbitrary, non-deterministic think time. So even for 5000 users, we need to use some think time to get an accurate assessment of the server behavior. And so here are the results at 5000 users with a 500 millisecond think time:
                 Glassfish   Tomcat
ops/sec          7607.25     7224.1
Avg. resp time   0.149       0.182
Max resp time    0.737       2.626
90% resp time    0.25        0.25

So does any of this mean that glassfish is better than tomcat? For some applications, probably. For others, probably not. The real point to take away from this is an understanding of how important it is to understand what you're measuring when you measure performance. The tests I've run are much too simple to draw any conclusions from: the only realistic benchmark is your own application. But hopefully, you now have a better understanding of how to approach large-scale testing of your own application.
The Common Driver for Faban is brand new code, so it hasn't yet been integrated into Faban's build schedule -- in fact, there is an issue with how it handles POST requests, which is what is delaying its integration. For now, you can download the fabancommon.jar and fabandriver.jar files I used for testing. If you find any problems with it (other than trying a POST request), be sure to let me know!