This post will discuss a technique to reduce the burden garbage collection pauses put on the latency of your application. As I have written couple of years ago, disabling garbage collection is not possible in JVM. But there is a clever trick that can be used to significantly reduce the length and frequency of the long pauses.
As you are aware, there are two different GC events taking place within the JVM, called minor and major collections. There is a lot of material available about what takes place during those collections, so I will not focus on describing the mechanics in detail. I will just remind that in Hotspot JVM – during minor collection, eden and survivor spaces are collected, in major collection the tenured space also gets cleaned and (possibly) compacted.
If you turn on the GC logging (-XX:+PrintGCDetails for example) then you immediately notice that the major collections are the ones you should focus. The length of a major garbage collection taking place is typically several times larger than the one cleaning young space. During a major GC there are two aspects requiring more time to complete. First and foremost, the survivors from young space are copied to old. Next, besides cleaning the unused references from the old generation, most of the GC algorithms also compact the old space, again requiring precious CPU cycles to be burnt.
Having lots of objects in old space also increases the likelihood of having more references from old space to young space. This results in larger card tables, keeping track of the references and increases the length of the minor GC pauses, when these tables are checked to decide whether objects in young space are eligible for GC.
So, if we cannot turn off the garbage collection, can we make sure these lengthy major GCs run less often and the reference count from Tenured space to Young stays low?
The answer is yes. There are even some crazy configurations which have managed to get rid of the major GC altogether. Getting rid of major GC events is truly a complex exercise, but reducing the frequency of those long pauses is something every deployment can achieve.
The strategy we are looking at is limiting the number of objects which get tenured. In a typical web application for example, most of the objects created are useful only during the HttpRequest. There is and always will be shared state having longer life span, but key is in the fact that there is a very high ratio of short lived objects versus long lived shared state.
The tricky part for any deployment out there now is to understand how much elbow room to give for the short-lived objects, so that
- You can guarantee that the short lived objects do not get promoted to Tenured space
- You are not over-provisioning, increasing the cost of your infrastructure
On conceptual level, achieving this is easy. You just need to measure the amount of memory allocated for short-lived objects during the requests and multiply it with the peak load time. What you will end up is the amount of memory you would want to fit either into eden or into a single survivor space. This will allow the GC to run truly efficiently without any accidental promotions to tenured. Zooming in from the conceptual level surfaces several complex technical issues, which I will open up in the forthcoming posts.
So what to conclude from here? First and foremost – determining the perfect GC configuration for your application is a complex exercise. This is both bad and good news. Bad in regard that – it needs a lot of experiments from your side. Good in regard that – we like difficult problems and we are currently crafting experiments to investigate the domain further. Some day, not too far in the future, Plumbr is able to do it for you, saving you from boring plumbing job and allowing you to focus on the actual problem at hand.