You would think that if you wanted your application to go faster you would start with CPU profiling. However, when looking for quick wins, it’s the memory profiler I target first.
Allocating memory is cheap
Allocating memory has never been cheaper. Memory is cheaper; you can get machines with thousands of GBs of memory. You can buy 16 GB for less than $200.
The memory allocation operation is cheaper than in the past, and it’s multi-threaded so it scales reasonably well.
However, memory allocation is not free. Your CPU cache is a precious resource, especially if you are trying to use multiple threads. While you can buy 16 GB of main memory easily, you might only have 2 MB of cache per logical CPU. If you want these CPUs to run independently, you want to spend as much time as possible within the 256 KB L2 cache.
|Cache level||Size||access time in clock cycles||concurrency|
|1||32 KB data 32 KB instruction||1||cores independent|
|2||256 KB||3||cores independent|
|3||3 MB – 32 MB||10-20||sockets independent|
|main memory||4 MB – 4 TB||200+||each memory region separate|
Allocating memory is not linear
Allocating memory on the heap is not linear. The CPU is very good at doing things in parallel. This means that if memory bandwidth is not your main bottleneck, the rate at which you produce garbage has less impact than whatever your bottleneck is. However, if the allocation rate is high enough (and in most Java systems it is high), it will be a serious bottleneck.
You can tell if the allocation rate is a bottleneck if:
- You are close to the maximum allocation rate of the machine. Write a small test which creates lots of garbage and measure the allocation rate. If you are close to this, you have a problem.
- When you reduce the garbage produced by, say, 10%, the 99th percentile latency of the application becomes 10% faster, and yet the allocation rate hardly drops. This means your application has sped up until it reached the same bottleneck again.
- You have very long pause times, e.g. into the seconds. At this point, your memory consumption has a very high impact on your performance, and reducing the memory consumption and allocation rate can improve scalability (how many requests you can process concurrently) as well as reduce your worst-case jitter.
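The "small test which creates lots of garbage" from the first point might look something like the sketch below. It is only a rough illustration: the class name `AllocationRateTest`, the method `measureBytesPerSecond`, and the 64-byte array size are assumptions made for this example. It runs on a single thread and counts only array payload bytes (object headers are ignored), so it gives a lower bound for one thread, not the maximum for the whole machine. To approach the machine's limit you would run one copy per core.

```java
class AllocationRateTest {
    // Written on every iteration so the JIT cannot remove the allocation via escape analysis.
    static volatile Object sink;

    /**
     * Allocates short-lived byte arrays for roughly durationMillis and
     * returns the payload bytes allocated per second on this thread.
     */
    static double measureBytesPerSecond(long durationMillis, int arraySize) {
        long allocatedBytes = 0;
        long startNanos = System.nanoTime();
        long deadline = startNanos + durationMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            // Allocate in batches so the clock is not read on every allocation.
            for (int i = 0; i < 1_000; i++) {
                sink = new byte[arraySize];
                allocatedBytes += arraySize;
            }
        }
        long elapsedNanos = System.nanoTime() - startNanos;
        return allocatedBytes * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        measureBytesPerSecond(1_000, 64); // warm up so the loop is compiled
        double rate = measureBytesPerSecond(5_000, 64);
        System.out.printf("Allocation rate: %.1f MB/s on one thread%n", rate / 1e6);
    }
}
```

Compare the figure this prints with the allocation rate your memory profiler or GC logs report for the real application on the same machine.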
Is there a way to see CPU and memory at the same time?
After reducing the allocation rate, I look at the CPU consumption with memory tracing turned on. This gives more weight to the memory allocations and will give you a different view from looking at CPU alone. Only when this CPU & Memory view looks clean, or at least has no quick wins, do I look at CPU profiling alone.
Using these techniques as a starting point, my aim is typically to reduce the 99th percentile latency (the worst 1%) by a factor of 10. However, this approach can also increase the throughput of each thread, as well as allow you to run more threads concurrently in an efficient manner.
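To track that 99th percentile figure before and after a change, you need to record per-request latencies and pick the value below which 99% of the samples fall. Here is a minimal sketch using the nearest-rank method. The class and method names are hypothetical, and a real system would more likely use a histogram than sort every sample.

```java
import java.util.Arrays;

class LatencyPercentile {
    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * pct percent of all samples are less than or equal to it.
     */
    static long percentile(long[] latencies, double pct) {
        if (latencies.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latencies.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical samples in microseconds: 1, 2, ..., 100.
        long[] samples = new long[100];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i + 1;
        }
        System.out.println("99th percentile: " + percentile(samples, 99) + " us");
    }
}
```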