Using Intel Performance Counters To Tune Garbage Collection

Introduction

I have to admit that I was shocked, indeed quite shaken, when I realised this advent calendar post would be about garbage collection. The topic of GC has raised such passion amongst the advocates of Java and those who believe memory management should be manual. Many an article has been written about tiny, subtle changes to strange-looking command line arguments which have fractions of a percentage point of performance impact on Java applications. How could I add to this huge body of work?

I hope this post will not add to the GC hot air but rather be a breath of fresh air instead. Let us not look at the CPU time consumed by the garbage collector or at pause times; how about looking at a hidden yet potentially critical aspect of memory management in general, and garbage collection in particular: data caching. Data caching is one of the major challenges of modern computer software design (the others being instruction caching and multi-core work distribution).

Modern CPUs run so fast that main memory has no hope of keeping up. Some of this catastrophic performance penalty can be clawed back by caching. Memory is pulled, in parallel, into high-speed cache memory, and the CPU then accesses this cache. If we are lucky, and the code causes the CPU to read and write the same memory a few times (inside a loop, for example), then the CPU can happily access the cache and is saved from waiting for loads and stores to and from main memory.
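To make that concrete, here is a small illustration of my own (not from any particular JVM): summing the same 2D array in two different orders gives the same answer, but the row-by-row order uses every byte of each cache line it loads, while the column-by-column order wastes most of each fetch.

```java
// Sketch: the same work, two memory access orders. Row-major order walks
// consecutive ints, so each (typically 64-byte) cache line loaded is fully
// consumed; column-major order touches one int per line and jumps on.
public class CacheOrder {
    static final int N = 1024;

    static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int i = 0; i < N; i++)        // outer loop over rows
            for (int j = 0; j < N; j++)    // inner loop walks consecutive ints
                sum += a[i][j];
        return sum;
    }

    static long sumColumnMajor(int[][] a) {
        long sum = 0;
        for (int j = 0; j < N; j++)        // outer loop over columns
            for (int i = 0; i < N; i++)    // inner loop strides a whole row each step
                sum += a[i][j];
        return sum;
    }

    public static void main(String[] args) {
        int[][] a = new int[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = i + j;
        // Same answer either way; only the cache behaviour differs.
        System.out.println(sumRowMajor(a) == sumColumnMajor(a)); // prints "true"
    }
}
```

On a large enough array the row-major version is usually noticeably faster, purely because of the cache.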

“How does garbage collection impact the performance of caches?” one might ask. There are many ways, some of them very subtle; however, here is a grab bag of some important ones:

The garbage collector traverses references in memory. This causes cache lines (blocks of memory in the cache) to contain the memory surrounding the reference and hence no longer hold other data which the program is using.

Whilst we call it a garbage collector, it is actually an allocator, mover and collector. This really matters when we think about data caching:

  • Allocation: Based on hardware rules, memory addresses are matched to cache lines. If pieces of memory share a cache line but are actually accessed from different threads we get an effect called false sharing. However, if little bits of data are spread out but accessed from the same thread we get poor cache utilisation.
  • Moving: Objects are not left in one place throughout their lifetimes. The garbage collector avoids memory fragmentation by moving objects around. This has the interesting effect of guaranteeing that the cache lines associated with the object will no longer be associated with it after a move.
  • Collecting: The funny thing is that collecting is the easy bit. It can be as simple as just marking the memory as available for reuse. It is the traversal of the object graphs (multiple roots) to find out what can be collected which is going to cause data cache line loads and thus evict lines from the cache which were being read from or written to by user code.
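The false sharing mentioned under allocation can be sketched as follows. This is a hypothetical toy of my own (JIT and hardware details vary): both threads write only their own counter, yet with indices 0 and 1 the counters share one cache line, so each write forces the line to bounce between cores; putting the counters a full line apart removes the contention.

```java
// Toy demonstration of false sharing: two independent counters that happen
// to live in the same 64-byte cache line. Correctness is unaffected; only
// throughput suffers, because each write invalidates the other core's copy.
public class FalseSharing {
    // 16 longs = 128 bytes: enough room to place two counters a line apart.
    static final long[] counters = new long[16];

    static void hammer(int index, long iterations) {
        for (long n = 0; n < iterations; n++) {
            counters[index]++; // each increment dirties the containing line
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final long iters = 10_000_000L;
        Thread a = new Thread(() -> hammer(0, iters));
        Thread b = new Thread(() -> hammer(1, iters)); // try index 8 to pad apart
        a.start();
        b.start();
        a.join();
        b.join();
        // Each thread owns its slot, so the totals are exact either way.
        System.out.println(counters[0] + " " + counters[1]); // prints "10000000 10000000"
    }
}
```

Timing the two variants (index 1 versus index 8) is left as an experiment; the interesting point for GC is that the allocator, not the programmer, usually decides which objects end up sharing a line.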

So we can now see that the design of the garbage collector is critical to the operation of the data cache. Swapping which collector we use will not only have an impact on GC pauses and other obvious issues; it will also affect, at a low level and in a fundamental way, all user code.

An Example

I am not going to present an exhaustive scientific paper on this concept. The purpose of this post is to show an alternative way of approaching JVM tuning. So, I ran a simple, short, multi-threaded patch in my personal synthesiser program Sonic-Field. The patch uses feedback, resonance and a bunch of other concepts to synthesise string instruments and then convolution to place the sounds in an acoustic environment.

The reason for picking Sonic Field is not just because it is of reasonable complexity, highly threaded and uses Spring, but because I recently found I could get better performance from it using the CMS garbage collector. Latency is of no interest with Sonic Field because it is a batch processor. However, the standard Java 7 garbage collector interacted badly with the way Sonic Field writes swap files out to disk when running low on memory. I tried CMS because it keeps memory use down the whole time (in theory – don’t flame me) by constantly attempting small garbage collections alongside the user threads.

If we put all this together we might well come up with a reasonable theory: “The CMS garbage collector might give fewer pauses and might be able to keep memory use down, but in so doing it will almost certainly cause more data cache misses.” Constantly traversing the reference graph in memory to try and collect dead objects is going to cause cache loads, and those loads will cause other data to be flushed from the cache (it has a finite size). Thus, when user threads come to read again they will cause more cache misses, and so on.
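To see why that traversal is cache-hostile, here is a toy mark phase (a deliberately naive sketch of my own; production collectors are far more sophisticated). The point is the access pattern: marking must visit every live object once, and each visit reads the object's reference fields, dragging that object's cache line in and potentially evicting application data.

```java
// A naive mark phase: breadth-first traversal from the roots. Everything
// reached is live; everything else is collectable. Note how the traversal
// touches every live object exactly once, wherever it sits in memory.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyMark {
    static class Obj {
        final List<Obj> refs = new ArrayList<>(); // outgoing references
    }

    static Set<Obj> mark(List<Obj> roots) {
        Set<Obj> live = new HashSet<>();
        Deque<Obj> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj o = pending.pop();
            if (live.add(o)) {          // first visit: o's cache line is loaded
                pending.addAll(o.refs); // and so are its reference fields
            }
        }
        return live; // anything not in this set can be reclaimed
    }

    public static void main(String[] args) {
        Obj a = new Obj(), b = new Obj(), dead = new Obj();
        a.refs.add(b);
        b.refs.add(a);    // cycles are fine: marking, unlike refcounting, handles them
        dead.refs.add(a); // reachability runs from the roots, not towards them
        System.out.println(mark(Arrays.asList(a)).size()); // prints "2"
    }
}
```

A concurrent collector like CMS runs work of this shape alongside the user threads, which is exactly why the theory above predicts extra cache misses.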

Does it matter? That answer will be entirely down to the application and the hardware and the load on the application. I am not, I repeat not, advocating one garbage collector over another! Nevertheless, it is a question I would like to answer so let’s answer it for my little test patch.

These data cache effects of the garbage collector are not visible from the normal JVM profiling tools. This means that they do not get discussed much in the JVM community, and they get considered in JVM tuning even less. However, there is a tool (actually several – but I am going to talk about the easiest to use) which can shed some light on the topic: Intel’s PCM (Performance Counter Monitor). It can be used for code tuning as well, but I thought talking about the GC would be more fun today.

A Worked Example

pcm is just a command line tool. We pass the command line to run Java to it in quotes and it takes its measurements. With other tooling, the performance counters can be used to get all sorts of other detail about an application. The benefit of the pcm command line tool is its simplicity and lack of intrusion into the overall application run. The disadvantage is that it will also measure the JVM and application warm-up phases. However, for server style applications or batch processors (like Sonic Field) these overheads are usually trivial compared to the actual application run.

I ran my patch on my personal MacBook Pro Retina (2012) with 16GB of RAM. The JVM was:

java version "1.8.0-ea"
Java(TM) SE Runtime Environment (build 1.8.0-ea-b61)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b05, mixed mode)

Readings from pcm are simply written to standard out when the application exits. I compared runs with no settings for the garbage collector (the default therefore) and with my currently preferred set of tweaks. To be honest, I am not sure if the tweaks are optimal; I kind of lifted them from a bunch of online articles…

Here is the launch script for Java:

/Users/alexanderturner/x/IntelPerformanceCounterMonitorV2.5.1\ 2/pcm.x "java \
 -Xmx12G -Xms12G -DsonicFieldTemp=/Users/alexanderturner/temp -DsonicFieldThreads=12 -DsonicFieldSwapLimit=4.0 \
 -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops -XX:ParallelGCThreads=8 -XX:+CMSParallelRemarkEnabled \
 -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly \
 -classpath \
bin:\
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.asm-3.1.2.RELEASE.jar:\
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.beans-3.1.2.RELEASE.jar:\
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.core-3.1.2.RELEASE.jar:\
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-3.1.2.RELEASE.jar:\
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-support-3.1.2.RELEASE.jar:\
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.expression-3.1.2.RELEASE.jar:\
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.test-3.1.2.RELEASE.jar:\
spring/otherJars/commons-logging-1.1.1.jar \
    com.nerdscentral.sfpl.RenderRunner $1
"

Hopefully it is clear just how simple running Java under IntelPerformanceCounterMonitorV2 really is. So, here is the output:

Standard GC

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE | TEMP

   0    0     0.53   0.75   0.70    1.31     422 M    621 M    0.32    0.32    0.14    0.01     N/A     N/A     32
   1    0     0.22   0.69   0.32    1.31     144 M    192 M    0.25    0.28    0.11    0.01     N/A     N/A     32
   2    0     0.56   0.77   0.73    1.31     346 M    466 M    0.26    0.31    0.11    0.01     N/A     N/A     28
   3    0     0.21   0.68   0.31    1.31     135 M    171 M    0.21    0.28    0.10    0.01     N/A     N/A     28
   4    0     0.55   0.77   0.71    1.31     332 M    410 M    0.19    0.38    0.11    0.01     N/A     N/A     22
   5    0     0.19   0.68   0.29    1.31     133 M    155 M    0.14    0.30    0.11    0.00     N/A     N/A     22
   6    0     0.61   0.79   0.78    1.32     343 M    382 M    0.10    0.35    0.10    0.00     N/A     N/A     27
   7    0     0.18   0.68   0.26    1.30     124 M    134 M    0.08    0.30    0.11    0.00     N/A     N/A     27
 -------------------------------------------------------------------------------------------------------------------
 SKT    0     0.38   0.75   0.51    1.31    1982 M   2533 M    0.22    0.33    0.11    0.01     N/A     N/A     22
 -------------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.38   0.75   0.51    1.31    1982 M   2533 M    0.22    0.33    0.11    0.01     N/A     N/A     N/A

 Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 %
 C1 core residency: 23.92 %; C3 core residency: 0.01 %; C6 core residency: 0.00 %; C7 core residency: 37.02 %
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

 PHYSICAL CORE IPC                 : 1.49 => corresponds to 37.36 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.76 => corresponds to 19.12 % core utilization over time interval

Concurrent Mark Sweep

Rather than the defaults, this run used the following collector options:

  -XX:+UseConcMarkSweepGC
  -XX:+UseCompressedOops
  -XX:+CMSParallelRemarkEnabled
  -XX:ParallelGCThreads=8
  -XX:CMSInitiatingOccupancyFraction=60
  -XX:+UseCMSInitiatingOccupancyOnly

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE | TEMP

   0    0     0.53   0.69   0.76    1.31     511 M    781 M    0.35    0.35    0.17    0.02     N/A     N/A     26
   1    0     0.31   0.66   0.47    1.30     207 M    285 M    0.27    0.26    0.11    0.01     N/A     N/A     26
   2    0     0.54   0.71   0.75    1.31     418 M    586 M    0.29    0.40    0.14    0.01     N/A     N/A     29
   3    0     0.30   0.66   0.46    1.30     198 M    258 M    0.23    0.27    0.11    0.01     N/A     N/A     29
   4    0     0.59   0.73   0.81    1.31     397 M    504 M    0.21    0.46    0.12    0.01     N/A     N/A     29
   5    0     0.30   0.66   0.45    1.30     188 M    225 M    0.16    0.28    0.11    0.01     N/A     N/A     29
   6    0     0.58   0.73   0.79    1.31     414 M    466 M    0.11    0.49    0.13    0.00     N/A     N/A     30
   7    0     0.30   0.66   0.45    1.30     186 M    204 M    0.09    0.29    0.11    0.00     N/A     N/A     30
 -------------------------------------------------------------------------------------------------------------------
 SKT    0     0.43   0.70   0.62    1.31    2523 M   3313 M    0.24    0.38    0.13    0.01     N/A     N/A     25
 -------------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.43   0.70   0.62    1.31    2523 M   3313 M    0.24    0.38    0.13    0.01     N/A     N/A     N/A

 Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %
 C1 core residency: 17.84 %; C3 core residency: 0.01 %; C6 core residency: 0.01 %; C7 core residency: 34.92 %
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

 PHYSICAL CORE IPC                 : 1.39 => corresponds to 34.83 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.86 => corresponds to 21.51 % core utilization over time interval

All the information given here is of interest; however, there is so much of it that I figure the best thing to do is cut to the chase and test my assertion about the CMS collector. To do that we can look at just two lines from the output of each run:

Default:

SKT    0     0.38   0.75   0.51    1.31    1982 M   2533 M    0.22    0.33    0.11    0.01     N/A     N/A     22

Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC):  773 Gticks ; C0 (active,non-halted) core residency: 39.04 %

CMS:

SKT    0     0.43   0.70   0.62    1.31    2523 M   3313 M    0.24    0.38    0.13    0.01     N/A     N/A     25

Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC):  708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

Discussion

We can see that under the CMS collector there were substantially more cache misses: L3 misses were 27% greater and L2 misses 31% greater than with the default collector. Nevertheless, the total time taken in giga-ticks (708 for CMS against 773 for the default) shows that all this extra cache missing did not hurt overall performance at all; if anything, the CMS run finished a little sooner. I guess this means that a lot more research could and should be done before drawing any conclusions as to the correct approach for this application!
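As a sanity check on those figures, here is the simple arithmetic on the two SKT rows quoted above (a throwaway sketch; the inputs come straight from the pcm output):

```java
// Relative change in L3 misses, L2 misses and total TSC ticks between the
// default-collector run and the CMS run, from the quoted pcm SKT rows.
public class GcDelta {
    static double pctChange(double base, double value) {
        return (value - base) / base * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("L3 misses: %+.0f%%%n", pctChange(1982, 2523)); // ~+27%
        System.out.printf("L2 misses: %+.0f%%%n", pctChange(2533, 3313)); // ~+31%
        System.out.printf("Run time : %+.0f%%%n", pctChange(773, 708));   // ~-8%
    }
}
```

So CMS missed the caches more, yet the whole run still took about 8% fewer ticks – the trade-off paid off here, but there is no guarantee it would for another workload.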

If you leave this post thinking that I did not fully discuss the subject, you are correct. My intention here has been to get the reader interested in thinking about this aspect of Java performance and opening the door to a new approach.
 

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd