You have an optimization task at hand. Which is great – you finally have the possibility to do something interesting instead of implementing yet another invoice processing screen.
The optimization task is (hopefully) linked to some non-functional requirement, such as the number of transactions your application has to process per second or the maximum time per transaction allowed.
Now, you have identified the part of your application that is the source of the performance problem. This part of your application uses a particular algorithm. You are considering switching the solution to another implementation and want to measure the performance impact. As discussed in one of our previous articles, you wisely choose the number of user transactions per second as the metrics to measure. You run your stress test with old implementation, write down the number of operations per second achieved, then run the very same stress test with new implementation, write down new number of operations per second.
Next, you compare the numbers and make some decisions. Some of these decisions may include the requirement for further performance measurements and/or optimizations of the new implementation. Let me present some very simplistic example:
- With your old algorithm 1,000 transactions took 100 seconds
- With your improved algorithm the same 1,000 transactions took 90 seconds
Great! The new version is faster! The changes you have made to the algorithm gave you 10% better performance. You now might need to make further decisions based on that information concerning next steps while optimizing your algorithm. But, let us bring one more piece of information into account: the time spent on garbage collection. From the GC log you can extract total GC pauses times. That is the amount of time when your application was stopped doing stop-the-world GC work. Then we can have the following picture:
- With your old algorithm 1,000 transactions took 100 seconds, out of which GC pauses used 20 seconds
- With improved algorithm 1,000 transactions took 90 seconds, from which GC pauses used 27 seconds
What can we deduce from this information? First of all, your algorithm’s running time has decreased from 100 – 20 = 80 seconds down to 90 – 27 = 63 seconds, 21% speedup. Secondly, the GC takes about 30% of your CPU time. Based on that your further algorithm’s optimization plans should focus not only on running speed, but also on decreasing memory usage and GC time.
How should you decide which direction to take? To answer that, we should look into your performance requirements. Maybe you have already fulfilled your requirement on how many transactions per seconds you need to process. But are all the transactions processed in short enough time? Maybe your original wise idea of measuring only the number of transactions per second might not have been the best one?
Those requirements in technical terms can be translated into two different aspects, namely – throughput and latency. Your current work has improved the throughput, but may have penalized the latency by introducing more and/or longer GC pauses.
Further optimizations of your algorithm would be steered by the NFRs. Let’s imagine you still have not reached your throughput requirements. What could be your next steps?
Considering that your implementation spends 30% of time on GC pauses, this fact alone should raise a red flag. On a typical case (yes, I know its like measuring the average temperature in a hospital and making decisions based on that) the GC pauses should not take more than 5% of the total time. And if you exceed 10% it is extremely likely that you should do something about it.
The first step to take in reducing GC pauses is investigating your JVM configuration. Maybe you just need to increase the maximum heap size
(-Xmx)? Maybe you should tune your generation sizes (
-XX:SurvivorRatio, …)? Maybe you should experiment with different garbage collectors (
-XX:+UseParallelGC, …)? Depending on your application, the right answer could be a change in the combination of the above, in some of them or not to provide any help at all.
When configuring the JVM does not provide sufficient results, the next step is to look into your data structures. Maybe you can reduce the GC pauses by making changes to your source code. Maybe you can get rid of all the wrapper classes around primitives and significantly reduce the overhead? Maybe you could take a look in the Collection classes used and reduce the overhead posed? Or for some exotic cases when your algorithm constantly keeps creating and destroying the same objects, perhaps object pooling might be a good idea?
And in the case above, only after this you might actually want to start reducing the overhead posed by your algorithm itself. Which in some exotic cases might lead you to discover that division is too slow. And you need to find a clever way to replace this with less expensive operation supported by your data structures. Or that memory barrier accompanying each write to
java.util.concurrent.atomic.AtomicBoolean is too expensive. But let’s leave those cases for another story when I will describe some of the weirdest CPU wasters I have dealt with in my life.
In conclusion – if you take up a quest to optimize your code, make sure you have thought through the requirements to both throughput and latency. And that you won’t stick to just one technique of optimization. Hopefully after reading the article you now have more tools in your arsenal.