Garbage Collection: increasing the throughput

Nikita Salnikov TarnovskiSeptember 23rd, 2014Last Updated: September 23rd, 2014

1 166 5 minutes read

The inspiration for this post came after stumbling upon “Pig in the Python” definition in the memory management glossary. Apparently, this term is used to explain the situation where GC repeatedly promotes large objects from generation to generation. The effect of doing so is supposedly similar to that of a python swallowing its prey in whole only to become immobilised during digestion.

For the next 24 hours I just could not get the picture of choking pythons out of my head. As the psychiatrists say, the best way to let go of your fears is to speak about them. So here we go. But instead of the pythons, the rest of the story will be about garbage collection tuning. I promise.

Garbage Collection pauses are well known by their potential of becoming a performance bottleneck. Modern JVMs do ship with advanced garbage collectors, but as I have experienced, finding optimal configuration for a particular application is still darn difficult. To even stand a chance in manually approaching the issue, one would need to understand exact mechanics of garbage collection algorithms. This post might be able to help you in this regard, as I am going to use an example to demonstrate how small changes in JVM configuration can affect the throughput of your application.

Example

The application we use to demonstrate the GC impact on throughput is a simple one. It consists of just two threads:

PigEater – simulating a situation where the python keeps eating one pig after another. The code achieves this via adding 32MB of bytes into a java.util.List and sleeping 100ms after each attempt.
PigDigester – simulating an asynchronous digesting process. The code implements digestion by just nullifying that list of pigs. As this is a rather tiring process, then this thread sleeps for 2000ms after each reference cleaning.

Both threads will run in a while loop, continuing to eat and digest until the snake is full. This happens at around 5,000 pigs eaten.

package eu.plumbr.demo;

public class PigInThePython {
  static volatile List pigs = new ArrayList();
  static volatile int pigsEaten = 0;
  static final int ENOUGH_PIGS = 5000;

  public static void main(String[] args) throws InterruptedException {
    new PigEater().start();
    new PigDigester().start();
  }

  static class PigEater extends Thread {

    @Override
    public void run() {
      while (true) {
        pigs.add(new byte[32 * 1024 * 1024]); //32MB per pig
        if (pigsEaten > ENOUGH_PIGS) return;
        takeANap(100);
      }
    }
  }

  static class PigDigester extends Thread {
    @Override
    public void run() {
      long start = System.currentTimeMillis();

      while (true) {
        takeANap(2000);
        pigsEaten+=pigs.size();
        pigs = new ArrayList();
        if (pigsEaten > ENOUGH_PIGS)  {
          System.out.format("Digested %d pigs in %d ms.%n",pigsEaten, System.currentTimeMillis()-start);
          return;
        }
      }
    }
  }

  static void takeANap(int ms) {
    try {
      Thread.sleep(ms);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Now lets define the throughput of this system as the “number of pigs digested per second”. Taking into account that the pigs are stuffed into the python after each 100ms, we see that theoretical maximal throughput this system can thus reach up to 10 pigs / second.

Configuring the GC example

Lets see how the system behaves using two different configuration. In all situations, the application was run using a dual-core Mac (OS X 10.9.3) with 8G of physical memory.

First configuration:

4G of heap (-Xms4g –Xmx4g)
Using CMS to clean old (-XX:+UseConcMarkSweepGC) and Parallel to clean young generation -XX:+UseParNewGC)
Has allocated 12,5% of the heap (-Xmn512m) to young generation, further restricting the sizes of Eden and Survivor spaces to equally sized.

Second configuration is a bit different:

2G of heap (-Xms2g –Xmx2g)
Using Parallel GC to conduct garbage collection both in young and tenured generations(-XX:+UseParallelGC)
Has allocated 75% of the heap to young generation (-Xmn1536m)

Now it is time to make bets, which of the configurations performed better in terms of throughput (pigs eaten per second, remember?). Those of you laying your money on the first configuration, I must disappoint you. The results are exactly reversed:

First configuration (large heap, large old space, CMS GC) is capable of eating 8.2 pigs/second
Second configuration (2x smaller heap, large young space, Parallel GC) is capable of eating 9.2 pigs/second

Now, let me put the results in perspective. Allocating 2x less resources (memory-wise) we achieved 12% better throughput. This is something so contrary to common knowledge that it might require some further clarification on what was actually happening.

Interpreting the GC results

The reason in what you face is not too complex and the answer is staring right at you when you take a more closer look to what GC is doing during the test run. For this, you can use the tool of your choice, I peeked under the hood with the help of jstat, similar to the following:

jstat -gc -t -h20 PID 1s

Looking at the data, I noticed that the first configuration went through 1,129 garbage collection cycles (YGCT+FGCT) which in total took 63.723 seconds:

Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT   
594.0 174720.0 174720.0 163844.1  0.0   174848.0 131074.1 3670016.0  2621693.5  21248.0 2580.9   1006   63.182  116 	0.236   63.419
595.0 174720.0 174720.0 163842.1  0.0   174848.0 65538.0  3670016.0  3047677.9  21248.0 2580.9   1008   63.310  117 	0.236   63.546
596.1 174720.0 174720.0 98308.0 163842.1 174848.0 163844.2 3670016.0   491772.9  21248.0 2580.9   1010   63.354  118 	0.240   63.595
597.0 174720.0 174720.0  0.0   163840.1 174848.0 131074.1 3670016.0   688380.1  21248.0 2580.9   1011   63.482  118 	0.240   63.723

The second configuration paused in total of 168 times (YGCT+FGCT) for just 11.409 seconds.

Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       PC     PU    YGC     YGCT    FGC    FGCT     GCT   
539.3 164352.0 164352.0  0.0    0.0   1211904.0 98306.0   524288.0   164352.2  21504.0 2579.2 	27    2.969  141 	8.441   11.409
540.3 164352.0 164352.0  0.0    0.0   1211904.0 425986.2  524288.0   164352.2  21504.0 2579.2 	27    2.969  141 	8.441   11.409
541.4 164352.0 164352.0  0.0    0.0   1211904.0 720900.4  524288.0   164352.2  21504.0 2579.2 	27    2.969  141 	8.441   11.409
542.3 164352.0 164352.0  0.0	0.0   1211904.0 1015812.6  524288.0   164352.2  21504.0 2579.2 	27	2.969  141 	8.441   11.409

Considering that the work needed to carry out in both cases was equivalent in regard that – with no long-living objects in sight the duty of the GC in this pig-eating exercise is just to get rid of everything as fast as possible. And using the first configuration, the GC is just forced to run ~6.7x more often resulting in ~5,6x longer total pause times.

So the story fulfilled two purposes. First and foremost, I hope I got the picture of a choking python out of my head. Another and more significant take-away from this is – that tuning GC is a tricky exercise at best, requiring deep understanding of several underlying concepts. Even with the truly trivial application used in this blog post, the results you will be facing can have significant impact on your throughput and capacity planning. In real-world applications the differences are even more staggering. So the choice is yours, you can either master the concepts, or, focus on your day-to-day work and let Plumbr to find out suitable GC configuration according to your needs.

Reference:

Garbage Collection: increasing the throughput from our JCG partner Nikita Salnikov Tarnovski at the Plumbr Blog blog.