Microbenchmarking comes to Java 9

I have not written an article here for a few months, and this hiatus will continue with this one exception. I plan to return to writing around March next year. There is an explanation at the end of this article. Wait! Not exactly at the end, because you could just scroll down. It is somewhere towards the end of the article. Just read on!

Three years ago I wrote about how the Java compiler optimizes the code it executes. Or rather, how javac does not do that while the JIT does. I made some benchmarks, some really bad ones, as Esko Luontola pointed out. These benchmarks were meant to show that the JIT optimizes even before it can gather significant statistical data about the execution of the code.

The article was created in January 2013, and the very first source code upload of JMH (Java Microbenchmark Harness) happened two months later. Since then the harness has developed a lot, and next year it becomes part of the next release of Java. I have a contract to write a book about Java 9, and its chapter 5 should cover Java 9 microbenchmarking possibilities, among other things. It is a good reason to start playing around with JMH.

Before getting into the details of how to use JMH and what it is good for, let's talk a bit about microbenchmarking.

Microbenchmarking

Microbenchmarking measures the performance of some small code fragment. It is rarely needed, and before starting a microbenchmark in a real commercial environment we have to think twice. Remember that premature optimization is the root of all evil. Some developers generalized this statement, saying that optimization itself is the root of all evil, which may be true. Especially if we mean microbenchmarking.

Microbenchmarking is a luring tool to optimize something small without knowing whether it is worth optimizing that code. When we have a huge application with several modules, running on several servers, how can we be sure that improving some special part of the application drastically improves the performance? Will it pay back in increased revenue that generates so much profit that it covers the cost we sank into the performance testing and development? I am reluctant to say that you cannot know that, but only because such a statement would be too broad. Statistically it is almost sure that such an optimization, including microbenchmarking, will not pay off most of the time. It will hurt, you just may not notice it, or even enjoy it, but that is a totally different story.

When to use microbenchmarking? I can see three areas:

  1. You write an article about microbenchmarking.
  2. You identified the code segment that eats most of the resources in your application, and the improvement can be tested by microbenchmarks.
  3. You cannot identify the code segment that will eat most of the resources in an application, but you suspect it.

The first area is a joke. Or not: you can play around with microbenchmarking to understand how it works, and then to understand how Java code works, what runs fast and what does not. Last year Takipi posted an article where they tried to measure the speed of lambdas. Read it; it is a very good article and clearly demonstrates the major advantage of blogging over writing something for print. Readers commented and pointed out errors, and they were corrected in the article.

The second is the usual case. Okay, before a reader comments and corrects me: the second should be the usual case. The third is when you develop a library and you just do not know all the applications that will use it. In that case you try to optimize the part that you think is the most crucial for most of the imagined, suspected applications. Even in that case it is better to take some sample applications.

Pitfalls

What are the pitfalls of microbenchmarking? Benchmarking is done as an experiment. The first programs I wrote were TI calculator code, and I could just count the number of steps the program made to factor two large (10-digit at that time) prime numbers. Even then I used an old Russian stopwatch to measure the time, being too lazy to calculate the number of steps. Experiment and measurement were easier.

Today you could not calculate the number of steps the CPU makes. There are so many small factors that may change the performance of the application, so far out of the control of the programmer, that it is impossible to calculate the steps. We are left with measurement, and we gain all the problems that come with measurements.

What is the biggest problem with measurements? We are interested in something, say X, and we usually cannot measure that. So we measure Y instead and hope that the values of Y and X are coupled. We want to measure the length of the room, but instead we measure the time it takes for a laser beam to travel from one end to the other. In this case the length X and the time Y are strongly coupled. Many times X and Y only correlate more or less. Most of the time, when people do measurements, the values X and Y have no relation to each other at all. Still, people put their money and more on decisions backed by such measurements. Think about political elections as an example.

Microbenchmarking is no different. It is hardly ever done well. If you are interested in the details and the possible pitfalls, Aleksey Shipilev has a good one-hour video. The first question is how to measure execution time. A small piece of code runs for a short time, and System.currentTimeMillis() may just return the same value when the measurement starts and when it ends, because we are still in the same millisecond. Even if the execution takes 10ms, the error of the measurement is still at least 10%, purely because of the quantization of the time as we measure it. Luckily there is System.nanoTime(). We happy, Vincent?
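
Before we answer: to see the millisecond quantization in action, here is a deliberately naive sketch. The loop is just an arbitrary small workload made up for this example; on a typical machine it prints zero elapsed milliseconds:

public class QuantizationDemo {

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        long sum = 0;
        for (int i = 0; i < 100_000; i++) {
            sum += i; // a small code fragment, typically done well within one millisecond
        }
        long elapsed = System.currentTimeMillis() - start;
        // start and end usually fall into the same millisecond, so elapsed is 0
        System.out.println("sum=" + sum + ", elapsed=" + elapsed + "ms");
    }
}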

Not really. nanoTime() returns "the current value of the running Java Virtual Machine's high-resolution time source, in nanoseconds", as the documentation says. What is "current"? When the invocation was made? When the value was returned? Or some time in between? Select the one you want and you may still fail. That current value could have been the same during the last 1000ns; that is all Java implementations are required to guarantee.

And another caveat before using nanoTime(), from the documentation: differences in successive calls that span greater than approximately 292 years (2^63 nanoseconds) will not correctly compute elapsed time due to numerical overflow.

292 years? Really?
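
The figure checks out, by the way. A long can hold at most 2^63 - 1 nanoseconds, which is indeed roughly 292 years:

public class OverflowCheck {

    public static void main(String[] args) {
        double seconds = Long.MAX_VALUE / 1e9;          // 2^63 - 1 nanoseconds, in seconds
        double years = seconds / (365.25 * 24 * 3600);  // using Julian years
        System.out.printf("about %.0f years%n", years); // prints: about 292 years
    }
}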

There are other problems as well. When you start up Java code, the first few thousand executions will be interpreted or executed without run-time optimization. The JIT has an advantage over the compilers of statically compiled languages like Swift, C, C++ or Golang: it can gather run-time information from the execution of the code, and when it sees that the compilation it performed last time could have been better based on the recent run-time statistics, it compiles the code again. The same may be true for the garbage collector, which also tries to use statistics to tune its operational parameters. Because of this, well-written server applications gain a bit of performance over time. They start up a bit slower and then they just become faster. If you restart the server, the whole iteration starts again.

If you do microbenchmarks, you should care about this behavior. Do you want to measure the performance of the application during warm-up, or how it really executes during operation?
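
You can observe the warm-up effect even without any harness. The following naive sketch (the batch size and the number of batches are arbitrary) typically reports noticeably higher times for the first few batches than for the later ones, as the JIT gradually kicks in:

public class WarmupDemo {

    public static void main(String[] args) {
        for (int batch = 0; batch < 10; batch++) {
            long start = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) {
                sum += i;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // the early batches run interpreted or less optimized, the later ones JIT compiled
            System.out.println("batch " + batch + ": " + elapsedMs + " ms (sum=" + sum + ")");
        }
    }
}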

The solution is a microbenchmarking harness that tries to take all these caveats into account. The one that gets into Java 9 is JMH.

What is JMH?

“JMH is a Java harness for building, running, and analyzing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.” (quote from the official site of JMH)

You can run JMH as a separate project, independent of the actual project you measure, or you can just store the measurement code in a separate directory. The harness compiles against the production class files and executes the benchmark. The easiest way, as I see it, is to use the Gradle plugin to execute JMH. You store the benchmark code in a directory called jmh (at the same level as main and test) and create a main class that can start the benchmark:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.io.IOException;

public class MicroBenchmark {

    public static void main(String... args) throws IOException, RunnerException {
        Options opt = new OptionsBuilder()
                .include(MicroBenchmark.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }

    // ... the benchmark methods, shown in the next section, go here ...
}

There is a nice builder interface for the configuration and a Runner class that can execute the benchmarks.
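
The builder can also take care of the warm-up caveat discussed earlier. Here is a sketch of a slightly fuller configuration inside the main method shown above; the iteration counts are arbitrary example values, not recommendations:

Options opt = new OptionsBuilder()
        .include(MicroBenchmark.class.getSimpleName())
        .forks(1)                   // run the benchmark in one forked JVM
        .warmupIterations(5)        // executed first, results discarded
        .measurementIterations(20)  // these iterations produce the reported score
        .build();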

Playing a bit

In the book Java 9 Programming By Example one of the examples is the Mastermind game. Chapter 5 is all about solving the game in parallel to speed up the guessing. (If you do not know the game, please read about it on Wikipedia. I do not want to explain it here, but you will need it to understand the following.)

The normal guessing is simple. There is a hidden secret: four pegs of four different colors out of six colors. When we guess, we take the possible color variations one after the other and ask the table the question: if this selection were the secret, would all the answers given so far be correct? In other words: can this guess be the secret, or is there some contradiction with the answers to some previous guesses? If this guess can be the secret, then we give it a try, putting the pegs on the table. The answer may be 4/0 (alleluia) or something else. In the latter case we go on searching. This way the six-color, four-column table can be solved in five steps.
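
Here is a sketch of that consistency check, assuming the pegs are stored as arrays of color codes. The class and method names are made up for illustration; the real code in the book is structured differently:

import java.util.List;

public class ConsistencyCheck {

    // Score a guess against a hypothetical secret: full matches (right color in
    // the right position) and color-only matches, as usual in Mastermind.
    static int[] score(int[] secret, int[] guess) {
        int full = 0, color = 0;
        boolean[] usedSecret = new boolean[secret.length];
        boolean[] usedGuess = new boolean[guess.length];
        for (int i = 0; i < secret.length; i++) {
            if (secret[i] == guess[i]) {
                full++;
                usedSecret[i] = usedGuess[i] = true;
            }
        }
        for (int i = 0; i < guess.length; i++) {
            if (usedGuess[i]) continue;
            for (int j = 0; j < secret.length; j++) {
                if (!usedSecret[j] && secret[j] == guess[i]) {
                    color++;
                    usedSecret[j] = true;
                    break;
                }
            }
        }
        return new int[]{full, color};
    }

    // A candidate can be the secret only if it would have produced exactly the
    // answers we already got for every guess that is on the table.
    static boolean canBeSecret(int[] candidate, List<int[]> guesses, List<int[]> answers) {
        for (int k = 0; k < guesses.size(); k++) {
            int[] a = score(candidate, guesses.get(k));
            if (a[0] != answers.get(k)[0] || a[1] != answers.get(k)[1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] answer = score(new int[]{9, 8, 7, 6, 5, 4}, new int[]{1, 2, 3, 4, 5, 6});
        System.out.println(answer[0] + " full, " + answer[1] + " color matches"); // 1 full, 2 color matches
    }
}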

For the sake of simplicity and visualization we name the colors with numbers, like 0123456789 (we have ten colors in the JMH benchmark, since six colors are just not enough), and six pegs. The secret we use is 987654, because this is the last guess as we go from 123456, 123457 and so on.

When I first coded this game in August 1983 on a Swedish school computer (ABC80) in BASIC, each guess took 20 to 30 seconds on the Z80 processor running at 40MHz, with 6 colors and 4 positions. Today my MacBook Pro can play the whole game approximately 7 times a second using a single thread, with 10 colors and 6 pegs. But that is not enough when I have 4 processors in the machine supporting 8 parallel threads.

To speed up the execution, I split the guess space into equal intervals and started separate guessers, each spitting guesses into a blocking queue. The main thread reads from the queue and puts the guesses on the table as they come. Some post-processing may be needed in case a thread creates a guess that becomes outdated by the time the main thread tries to use it, but still we expect a huge speedup.
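
Here is a minimal sketch of that structure, using plain integers as stand-ins for the real guess objects; the actual ParallelGamePlayer in the book's repository is of course more involved:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ParallelGuessingSketch {

    public static void main(String[] args) throws InterruptedException {
        final int nrThreads = 4;
        final int queueSize = 10;
        final int guessSpace = 1_000; // stand-in for the number of possible variations
        final BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(queueSize);

        // split the guess space into equal intervals, one guesser thread per interval
        for (int t = 0; t < nrThreads; t++) {
            final int from = t * guessSpace / nrThreads;
            final int to = (t + 1) * guessSpace / nrThreads;
            Thread guesser = new Thread(() -> {
                for (int guess = from; guess < to; guess++) {
                    try {
                        queue.put(guess); // blocks while the queue is full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
            guesser.setDaemon(true);
            guesser.start();
        }

        // the main thread consumes the guesses as they come; in the real game it
        // re-checks each guess against the current table, because the guess may
        // have become outdated since the worker created it
        for (int i = 0; i < guessSpace; i++) {
            int guess = queue.take();
            // ... check the guess and put it on the table ...
        }
        System.out.println("consumed all " + guessSpace + " guesses");
    }
}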

Does it really speed up the guessing? That is what JMH is here for.

To run the benchmark we need some code that actually executes the game:

    @State(Scope.Benchmark)
    public static class ThreadsAndQueueSizes {
        @Param(value = {"1", "4", "8", "16", "32"})
        String nrThreads;
        @Param(value = {"1", "10", "100", "1000000"})
        String queueSize;
    }

    @Benchmark
    @Fork(1)
    public void playParallel(ThreadsAndQueueSizes t3qs) throws InterruptedException {
        int nrThreads = Integer.valueOf(t3qs.nrThreads);
        int queueSize = Integer.valueOf(t3qs.queueSize);
        new ParallelGamePlayer(nrThreads, queueSize).play();
    }

    @Benchmark
    @Fork(1)
    public void playSimple() {
        new SimpleGamePlayer().play();
    }

The JMH framework executes the code several times, measuring the time to run with the different parameters. The method playParallel is executed to run the algorithm for 1, 4, 8, 16 and 32 threads, each with a maximum queue length of 1, 10, 100 and one million. When the queue is full, the individual guessers stop guessing until the main thread pulls at least one guess off the queue.

I suspected that if we have many threads and we do not limit the length of the queue, then the worker threads fill the queue with initial guesses that are based only on an empty table and thus do not deliver much value. What do we see after almost 15 minutes of execution?

Benchmark                    (nrThreads)  (queueSize)   Mode  Cnt   Score   Error  Units
MicroBenchmark.playParallel            1            1  thrpt   20   6.871 ± 0.720  ops/s
MicroBenchmark.playParallel            1           10  thrpt   20   7.481 ± 0.463  ops/s
MicroBenchmark.playParallel            1          100  thrpt   20   7.491 ± 0.577  ops/s
MicroBenchmark.playParallel            1      1000000  thrpt   20   7.667 ± 0.110  ops/s
MicroBenchmark.playParallel            4            1  thrpt   20  13.786 ± 0.260  ops/s
MicroBenchmark.playParallel            4           10  thrpt   20  13.407 ± 0.517  ops/s
MicroBenchmark.playParallel            4          100  thrpt   20  13.251 ± 0.296  ops/s
MicroBenchmark.playParallel            4      1000000  thrpt   20  11.829 ± 0.232  ops/s
MicroBenchmark.playParallel            8            1  thrpt   20  14.030 ± 0.252  ops/s
MicroBenchmark.playParallel            8           10  thrpt   20  13.565 ± 0.345  ops/s
MicroBenchmark.playParallel            8          100  thrpt   20  12.944 ± 0.265  ops/s
MicroBenchmark.playParallel            8      1000000  thrpt   20  10.870 ± 0.388  ops/s
MicroBenchmark.playParallel           16            1  thrpt   20  16.698 ± 0.364  ops/s
MicroBenchmark.playParallel           16           10  thrpt   20  16.726 ± 0.288  ops/s
MicroBenchmark.playParallel           16          100  thrpt   20  16.662 ± 0.202  ops/s
MicroBenchmark.playParallel           16      1000000  thrpt   20  10.139 ± 0.783  ops/s
MicroBenchmark.playParallel           32            1  thrpt   20  16.109 ± 0.472  ops/s
MicroBenchmark.playParallel           32           10  thrpt   20  16.598 ± 0.415  ops/s
MicroBenchmark.playParallel           32          100  thrpt   20  15.883 ± 0.454  ops/s
MicroBenchmark.playParallel           32      1000000  thrpt   20   6.103 ± 0.867  ops/s
MicroBenchmark.playSimple            N/A          N/A  thrpt   20   6.354 ± 0.200  ops/s

(For the score, more is better.) It shows that we get the best performance if we start 16 threads and if we somewhat limit the length of the queue. Running the parallel algorithm on one thread (a master and a worker) is somewhat slower than the single-thread implementation. This seems to be okay: we have the overhead of starting a new thread and of the communication between the threads. The maximum performance is around 16 threads. Since this machine supports 8 parallel threads, we expected the peak around 8. Why is that?

What happens if we replace the standard secret 987654 (which is boring after a while even for a CPU) with something random?

Benchmark                    (nrThreads)  (queueSize)   Mode  Cnt   Score   Error  Units
MicroBenchmark.playParallel            1            1  thrpt   20  12.141 ± 1.385  ops/s
MicroBenchmark.playParallel            1           10  thrpt   20  12.522 ± 1.496  ops/s
MicroBenchmark.playParallel            1          100  thrpt   20  12.516 ± 1.712  ops/s
MicroBenchmark.playParallel            1      1000000  thrpt   20  11.930 ± 1.188  ops/s
MicroBenchmark.playParallel            4            1  thrpt   20  19.412 ± 0.877  ops/s
MicroBenchmark.playParallel            4           10  thrpt   20  17.989 ± 1.248  ops/s
MicroBenchmark.playParallel            4          100  thrpt   20  16.826 ± 1.703  ops/s
MicroBenchmark.playParallel            4      1000000  thrpt   20  15.814 ± 0.697  ops/s
MicroBenchmark.playParallel            8            1  thrpt   20  19.733 ± 0.687  ops/s
MicroBenchmark.playParallel            8           10  thrpt   20  19.356 ± 1.004  ops/s
MicroBenchmark.playParallel            8          100  thrpt   20  19.571 ± 0.542  ops/s
MicroBenchmark.playParallel            8      1000000  thrpt   20  12.640 ± 0.694  ops/s
MicroBenchmark.playParallel           16            1  thrpt   20  16.527 ± 0.372  ops/s
MicroBenchmark.playParallel           16           10  thrpt   20  19.021 ± 0.475  ops/s
MicroBenchmark.playParallel           16          100  thrpt   20  18.465 ± 0.504  ops/s
MicroBenchmark.playParallel           16      1000000  thrpt   20  10.220 ± 1.043  ops/s
MicroBenchmark.playParallel           32            1  thrpt   20  17.816 ± 0.468  ops/s
MicroBenchmark.playParallel           32           10  thrpt   20  17.555 ± 0.465  ops/s
MicroBenchmark.playParallel           32          100  thrpt   20  17.236 ± 0.605  ops/s
MicroBenchmark.playParallel           32      1000000  thrpt   20   6.861 ± 1.017  ops/s

The performance increases since we do not need to go through all the possible variations. In the case of one thread the increase is double. In the case of multiple threads the gain is not that big. And note that this does not speed up the code itself; it only measures more realistically, using statistical, random secrets. What we can also see is that the gain of 16 threads over 8 threads is no longer significant. It is significant only when we select a secret that is towards the end of the variations. Why? From what you have seen here and from the source code available on GitHub, you can give an answer to that.

Summary

The book Java 9 Programming By Example is planned to be released in February 2017. But since we are living in an open source world, you can get access, controlled by the publisher, to the 1.x.x-SNAPSHOT versions. Now I have told you the preliminary GitHub URL that I use while I develop the code for the book, and you can also preorder the eBook and give feedback, helping me to create a better book.

Reference: Microbenchmarking comes to Java 9 from our JCG partner Peter Verhas at the Java Deep blog.