Coarse-grained benchmarking

Vladimir Sor, October 8th, 2013 (Last Updated: October 8th, 2013)

While developing our software, we are all about metrics, to the point where I am pretty sure at least 10% of our posts contain the phrase “measure, don’t guess”. One of the metrics we keep a close watch on is performance. Or, to be more precise, the extra CPU cycles we burn and the extra heap we use while running your application with Plumbr attached.

The set of tests used to measure the overhead is somewhat complex, containing both smaller synthetic tests and real-world applications. The former were relatively easy to put together: for example, the SPECjvm benchmarks gave us a good foundation to build our solution upon. Combined with our own benchmarking tests, they made us confident that on the micro level our test set was thorough enough to trust the results.

Building a representative set of tests for real-world applications proved to be more of a struggle. Even though many applications were available for such use, the sheer complexity of setting them up on a multi-platform test matrix was a task too scary for us to tackle. Just try to imagine the dependencies and complex set-up guidelines and you might start to understand the pain.

So we were kind of stuck until, a year or so ago, we found our saviour. This time the saviour took the shape of a pre-packaged testing library called the “DaCapo Benchmark Suite”. The suite consists of a set of open-source, real-world applications and libraries. The authors have configured them to run non-trivial test cases, such as the following samples:

batik – produces a number of Scalable Vector Graphics (SVG) images based on the unit tests in Apache Batik
eclipse – executes some of the (non-GUI) JDT performance tests for the Eclipse IDE
lusearch – uses Lucene to do a text search of keywords over a corpus of data comprising the works of Shakespeare and the King James Bible
tomcat – runs a set of queries against a Tomcat server, retrieving and verifying the resulting webpages
tradebeans – runs the DayTrader benchmark via Java Beans against a Geronimo backend with an in-memory H2 as the underlying database

The full test set consists of 14 different applications and libraries, which together represent well the different types of applications built on the JVM. The tests themselves are also non-trivial, so we can now trust our coarse-grained benchmarks as well.

The benchmark is free to download and use. Setting it up is as easy as downloading dacapo-9.12-bach.jar and executing a specific benchmark:

java -jar dacapo-9.12-bach.jar tomcat

or, in our case, where we wish to see the overhead of our -javaagent:

java -javaagent:$path_to/plumbr.jar -jar dacapo-9.12-bach.jar tomcat

Now comparing the outputs of the two runs, the first a “naked” run and the second with our memory leak detection agent attached, we see that the test took 281 ms, or roughly 10%, longer with Plumbr attached:

java -jar dacapo-9.12-bach.jar tomcat
===== DaCapo 9.12 tomcat starting =====
Loading web application
Creating client threads
Waiting for clients to complete
Client threads complete ... unloading web application
===== DaCapo 9.12 tomcat PASSED in 2699 msec =====
Server stopped ... iteration complete

java -javaagent:$path_to/plumbr.jar -jar dacapo-9.12-bach.jar tomcat
===== DaCapo 9.12 tomcat starting =====
Loading web application
Creating client threads
Waiting for clients to complete
Client threads complete ... unloading web application
===== DaCapo 9.12 tomcat PASSED in 2980 msec =====
Server stopped ... iteration complete
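The comparison of the two runs above can also be scripted rather than done by eye. The following is only a minimal sketch in Python: the helper names passed_msec and overhead are made up for illustration, and it assumes the summary line always has the "PASSED in N msec" shape shown in the output above.

```python
import re

# Assumption: the timed iteration is reported on a line containing
# "PASSED in <N> msec", as in the two outputs shown in the article.
PASSED_LINE = re.compile(r"PASSED in (\d+) msec")


def passed_msec(output):
    """Return the duration (msec) of the timed iteration in a run's output."""
    match = PASSED_LINE.search(output)
    if match is None:
        raise ValueError("no 'PASSED in N msec' line found")
    return int(match.group(1))


def overhead(baseline_msec, agent_msec):
    """Return the slowdown as (absolute msec, percent of the baseline)."""
    delta = agent_msec - baseline_msec
    return delta, 100.0 * delta / baseline_msec


# Summary lines copied from the two runs above.
naked = "===== DaCapo 9.12 tomcat PASSED in 2699 msec ====="
with_agent = "===== DaCapo 9.12 tomcat PASSED in 2980 msec ====="

delta, percent = overhead(passed_msec(naked), passed_msec(with_agent))
print(f"{delta} msec, {percent:.1f}% overhead")  # 281 msec, 10.4% overhead
```

The percentage is taken relative to the naked run, which is why 281 msec on top of 2699 msec comes out at a little over 10%.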

Equipped with this knowledge, it is now easy for us to tackle the biggest problems first. For example, on the first runs we imposed more than 100% overhead on some of the tests. Now we can safely say that in all cases we stay between 5% and 20%, and we keep improving.

While running the tests, bear in mind that, as always, you should never make decisions based on a single run: every test needs a warm-up period before its results become meaningful. With DaCapo you can either trust the library itself to run the test until the results converge by specifying the -C parameter, or set the number of runs yourself by adding the -n parameter to the startup command.
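As a rough sketch of how the two variants could be launched from a script, the Python helper below assembles the command line for a naked run or an agent run, with either -C or -n. The function names, the iteration count of 10 and the agent path are hypothetical, and the sketch assumes the harness accepts its options between the jar name and the benchmark name.

```python
import subprocess

DACAPO_JAR = "dacapo-9.12-bach.jar"  # jar name used in the commands above


def dacapo_command(benchmark, iterations=None, converge=False, agent_jar=None):
    """Build the argument list for one DaCapo invocation."""
    command = ["java"]
    if agent_jar is not None:
        # Attach the agent whose overhead we want to measure.
        command.append(f"-javaagent:{agent_jar}")
    command += ["-jar", DACAPO_JAR]
    if converge:
        command.append("-C")  # let DaCapo iterate until timings converge
    if iterations is not None:
        command += ["-n", str(iterations)]  # fixed number of iterations
    command.append(benchmark)
    return command


def run(command):
    """Run one invocation and return everything it printed."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    # Both streams are returned so the summary line is found either way.
    return result.stdout + result.stderr


# Only print the command here; call run(...) to actually execute it.
print(" ".join(dacapo_command("tomcat", iterations=10)))
```

Building the naked and the agent command from the same function keeps the two runs identical apart from the -javaagent flag, so any difference in timing can be attributed to the agent.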

There are some concerns, though. As the benchmark has not been updated since 2009, some of the tests are becoming outdated, either because the technologies themselves have been rendered obsolete (SVG, for example) or because the benchmarked application versions are no longer used in production (such as an early 6.0 Tomcat build).

Reference: Coarse-grained benchmarking from our JCG partner Vladimir Sor at the Plumbr blog.
