The term ‘Cobra effect’ stems from an anecdote set in India under British colonial rule. The British government was concerned about the number of venomous cobras and therefore offered a reward for every dead snake. Initially this was a successful strategy, as large numbers of snakes were killed for the reward. Eventually, however, Indians began to breed cobras for the income.
When this was realized, the reward was canceled, but the cobra breeders set the snakes free and the wild cobra population consequently multiplied. The apparent solution to the problem made the situation even worse.
So how is Java heap size related to colonial India and venomous snakes? Bear with me and I’ll guide you through the analogy, using a real-life story as a reference.
You have created an amazing application. So amazing that it becomes truly popular, and the sheer amount of traffic to your new service starts to bring your application to its knees. Digging through the performance metrics, you decide that the amount of heap available to your application will soon become a bottleneck.
So you take the time to launch new infrastructure with six times the original heap. You test your application to verify that it works. You then launch it on the new infrastructure. And immediately complaints start flowing in: your application has become less responsive than it was with your original tiny 2GB heap. Some of your users wait several minutes for your application to respond. What has just happened?
There can be numerous reasons, of course. But let’s focus on the most likely suspect: the heap size change. It has several possible side effects, such as longer cache warm-up times and problems with fragmentation. But judging from the symptoms, you are most likely facing latency problems in your application during full GC runs.
What this means is that, as Java is a garbage-collected language, the heap your application uses is regularly cleaned up by JVM internal processes. And as one might expect, if you have a larger room to clean, it tends to take the janitor more time to clean it. The very same applies to cleaning unused objects from memory.
When running applications on small heaps (below 4GB) you often do not need to think about GC internals. But when increasing heap sizes to tens of gigabytes you should definitely be aware of the potential stop-the-world pauses induced by a full GC. The very same pauses existed with the small heap as well, but they were significantly shorter: pauses that now last for more than a minute might originally have spanned only a few hundred milliseconds.
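Before blaming the heap, it helps to check how much time the JVM really spends collecting. The following is a minimal sketch, not a full monitoring solution: it reads the standard `GarbageCollectorMXBean` counters before and after a workload. The class name `GcPauseProbe` and the allocation loop are made up for illustration. Also note that the reported collection time is accumulated elapsed time per collector, which only approximates pause time when a concurrent collector is in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcPauseProbe {

    /** Sums the accumulated collection time (in ms) over all collectors of this JVM. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** Sums the number of collections over all collectors of this JVM. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long timeBefore = totalGcTimeMillis();
        long countBefore = totalGcCount();

        // Hypothetical workload: lots of short-lived garbage plus a few survivors.
        List<byte[]> survivors = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) {
            byte[] chunk = new byte[64 * 1024];
            if (i % 100 == 0) {
                survivors.add(chunk);
            }
        }

        System.out.println("Survivors kept: " + survivors.size());
        System.out.println("Collections during workload: " + (totalGcCount() - countBefore));
        System.out.println("GC time during workload (ms): " + (totalGcTimeMillis() - timeBefore));

        // Per-collector breakdown; the collector names depend on the JVM and its GC settings.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}
```

Running the same probe with the small and the large heap gives you numbers to compare instead of a hunch. For the length of individual pauses you would still want to look at the GC logs.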
So what can you do in cases when you really need more heap for your application?
- The first option would be to consider scaling horizontally instead of vertically. For our current case this means that if your application is either stateless or easily partitionable, you can just add more small nodes and balance the load between them. In this case you could stick with 32-bit architectures, which also impose a smaller memory footprint.
- If horizontal scaling is not possible, then you should focus on your GC configuration. If latency is what you are after, then you should forget about the throughput-oriented stop-the-world GCs and start looking for alternatives, which you will soon find to be limited to the Concurrent Mark Sweep (CMS) and Garbage-First (G1) collectors. The saddest news is that the best choice between those two collector types, and among the other heap configuration parameters, can only be found by experimenting. So do not make choices just by reading something; go out there and try it out with your actual production load.
But be aware of their limitations as well. Both of those collectors impose a throughput overhead on your application, and G1 especially tends to show worse throughput numbers than the stop-the-world alternatives. And when the CMS garbage collector is not fast enough to finish its work before the tenured generation is full, it falls back to the standard stop-the-world GC. So you can still face pauses of 30 seconds or more for heaps of 16 GB and beyond.
- If you cannot scale horizontally or fail to achieve the required latency with the garbage collectors shipping with Oracle’s JVM, then you might also look into the Zing JVM built by Azul Systems. One of the features making Zing stand out is its pauseless garbage collector (C4), which might be exactly what you are looking for. Full disclosure though: we haven’t yet tried C4 in practice. But it does sound cool.
- Your last option is something for the truly hardcore out there. You can allocate memory outside the heap. Those allocations obviously aren’t visible to the garbage collector and thus will not be collected. It might sound scary, but ever since Java 1.4 we have had access to the java.nio.ByteBuffer class, which provides the allocateDirect() method for off-heap memory allocations. This allows us to create very large data structures without bumping into multi-second GC pauses. This solution is not too uncommon: many BigMemory implementations use ByteBuffers under the hood, Terracotta BigMemory and Apache DirectMemory for example.
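To make the off-heap option less abstract, here is a minimal sketch of a fixed-size array of longs backed by `ByteBuffer.allocateDirect()`. The class `OffHeapLongArray` and its methods are hypothetical names made up for this illustration and are not taken from any of the libraries mentioned above. Real off-heap stores add serialization, memory management, and eviction on top of this idea.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-size array of longs whose contents live outside the Java heap. */
public class OffHeapLongArray {

    private final ByteBuffer buffer;
    private final int length;

    public OffHeapLongArray(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must not be negative: " + length);
        }
        this.length = length;
        // A direct buffer keeps its contents in native memory; it starts out zero-filled.
        // multiplyExact guards against int overflow, since a single buffer is capped at 2 GB.
        this.buffer = ByteBuffer.allocateDirect(Math.multiplyExact(length, Long.BYTES));
    }

    public void set(int index, long value) {
        buffer.putLong(byteOffset(index), value); // absolute put, does not move the position
    }

    public long get(int index) {
        return buffer.getLong(byteOffset(index)); // absolute get, does not move the position
    }

    public int length() {
        return length;
    }

    /** Translates an element index into a byte offset, rejecting out-of-range indexes. */
    private int byteOffset(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        return index * Long.BYTES;
    }

    public static void main(String[] args) {
        // One million longs: roughly 8 MB of native memory, only a small wrapper object on the heap.
        OffHeapLongArray squares = new OffHeapLongArray(1_000_000);
        for (int i = 0; i < squares.length(); i++) {
            squares.set(i, (long) i * i);
        }
        System.out.println("Last element: " + squares.get(squares.length() - 1));
    }
}
```

The garbage collector only sees the small `ByteBuffer` wrapper object, not the data it points to, which is why the size of the structure no longer adds to collection work. The price is that everything has to be encoded into bytes by hand, and the amount of direct memory a JVM may allocate is capped separately from the heap (on HotSpot via `-XX:MaxDirectMemorySize`).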
To conclude: even when making changes backed by good intentions, be aware of both the alternatives and the consequences. The Government of India learned this the hard way back in the day, when it offered rewards for dead cobras.