Generally speaking, we expect our servers to be able to handle more than one request at a time. If we just ran everything on a single thread, we might see our CPU utilization sitting at a low percentage and feel like we’re wasting money paying for the uptime of an underused machine.
However, how do you work out how to tune the multiple parameters you end up with when you’re running a container with something like a Java app inside it? Those parameters include:
- CPU allocation – what fraction of a whole CPU
- Container memory
- Java heap within the container’s memory – TBH, with a modern Java version (i.e. a container-aware one – Java 10 onwards), you should probably use the defaults
- Number of parallel threads to do a particular task
- Amount of caching to apply to downstream resources
- Number of replicas of the container to run in parallel
The combination of tuning parameters will depend on the exact task you’re running and may even change radically between versions of your own software.
The general approach to tuning would be to find a way to experiment with running the system under consistent, high load, and then monitor it while trying to find some sensible ball-park tuning parameters. Once you find the right approximate sizing, you can do some fine-tuning closer to the final version you’re going to deploy.
Nothing good will come from tuning a system to 100% resource usage… it will suffer dips in performance with any variation in load, or whenever background jobs need some time. Better to tune to around 80%.
Where to Start?
It’s probably easier to work out some things not to do.
With a CPU-light application that mainly uses some webservices to do its job: don’t run a handful of threads on a machine with loads of memory.
Why? Clearly, this mythical application would need to have loads and loads of threads to fill the capacity of its CPU and memory, and during periods of lower load it would be wasting machine resource.
With a CPU and memory heavy application, don’t use a huge number of threads on a machine with minimal CPU and memory.
Why? Clearly, this bruiser of an application will spend most of its time with thread contention and garbage collection overhead.
With a mid-sized application, don’t run a single thread on a tiny container.
Why? Well… maybe this will be great… or maybe you’ll come to think that most of the fraction of a CPU available will be spent on processing the very existence of the app: OS overheads/background jobs, task switching between the minimum number of threads in an app, etc.
So Don’t Start With A Silly Start
Essentially, you need to get an idea of the size of a task within your app and size the container so that doing one of them is pretty straightforward. Let’s assume a baseline CPU of 0.5 of a whole CPU and a baseline memory of a gigabyte or so. Let’s also assume a concurrency of 10 threads. It’s unlikely to be 200 threads on a machine this size, and it’s unlikely to be 1, so 10’s a fair start. Let’s also assume that we can have as many replicas as it takes to meet the performance criteria once we know how much one replica can do.
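As a rough sketch of that baseline in Java: the class below runs some stand-in work on a fixed pool of 10 threads, prints what the JVM thinks it has been given, and works out how many replicas you’d need once you’ve measured one. The class name, the `handleRequest` stand-in and the throughput figures (1000 requests per second wanted, 120 measured per replica) are all made up for illustration; the 0.8 is the "tune to around 80%" idea from earlier.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BaselineSizing {
    // Starting concurrency from the text; a guess to be tuned, not a recommendation.
    static final int BASELINE_THREADS = 10;

    // Replicas needed to carry the target load while each replica runs at only
    // `targetUtilisation` (e.g. 0.8) of the throughput it managed under test.
    static int replicasNeeded(double targetRps, double measuredRpsPerReplica,
                              double targetUtilisation) {
        if (targetRps < 0 || measuredRpsPerReplica <= 0
                || targetUtilisation <= 0 || targetUtilisation > 1) {
            throw new IllegalArgumentException("inputs out of range");
        }
        double usablePerReplica = measuredRpsPerReplica * targetUtilisation;
        return Math.max(1, (int) Math.ceil(targetRps / usablePerReplica));
    }

    // Hypothetical stand-in for a real request that mostly waits on I/O.
    static void handleRequest() {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // What the JVM believes the container has given it.
        Runtime rt = Runtime.getRuntime();
        System.out.println("CPUs visible to the JVM: " + rt.availableProcessors());
        System.out.println("Max heap (MiB): " + rt.maxMemory() / (1024 * 1024));

        // Push 100 stand-in requests through the baseline pool.
        ExecutorService pool = Executors.newFixedThreadPool(BASELINE_THREADS);
        for (int i = 0; i < 100; i++) {
            pool.execute(BaselineSizing::handleRequest);
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);

        // Invented figures: 1000 req/s wanted, 120 req/s measured on one replica.
        System.out.println("Replicas needed: " + replicasNeeded(1000, 120, 0.8));
    }
}
```

With those invented figures, each replica is trusted with 96 requests per second, so the answer comes out at 11 replicas. The real value of `measuredRpsPerReplica` can only come from running your own app under load.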
Now we need to think about what the app does and move these levers appropriately.
- Does it rely on external reference data? – maybe we need a cache to avoid constant fetching
- Does it need a lot of caching? – if so, we need more memory
- Is it expensive to populate the cache (a lot of network time or demand on downstream resources)? – if so we need to consider fewer replicas and get the benefits of each instance of the cache by increasing the number of requests one instance can handle
- Are requests heavy users of memory? – grow the memory with the number of concurrent threads
- Are requests heavy users of CPU? – more CPU, or fewer concurrent tasks
- Are requests very heavy users of CPU? – this justifies favouring reduction in concurrency OVER adding CPU.
- Does the service depend on a constrained downstream service? – this suggests having fewer replicas, with more concurrency per replica, and using a connection pool over the constrained resource.
- Does the service depend on an expensive cacheable resource? – this suggests using a caching service, rather than an onboard cache. The caching service means that you can have more replicas and not increase the load on the downstream resource.
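For the constrained downstream service case, here’s a minimal sketch of the idea behind a connection pool: plenty of request threads, but only a few of them allowed to talk to the downstream resource at once. It uses a plain `java.util.concurrent.Semaphore` rather than a real pooling library, and the class name, the limit of 4 and the 5ms pretend call are assumptions for illustration only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class DownstreamLimiter {
    private final Semaphore permits;

    public DownstreamLimiter(int maxConcurrentCalls) {
        // Fair semaphore: waiting threads get permits in arrival order.
        this.permits = new Semaphore(maxConcurrentCalls, true);
    }

    // Runs the call only once a permit is free, and always hands the permit back.
    public <T> T call(Supplier<T> downstreamCall) throws InterruptedException {
        permits.acquire();
        try {
            return downstreamCall.get();
        } finally {
            permits.release();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Demo: many threads, few permits; returns the highest concurrency observed.
    static int maxObservedConcurrency(int threads, int limit, int calls) {
        DownstreamLimiter limiter = new DownstreamLimiter(limit);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Callable<Integer> task = () -> limiter.call(() -> {
            int now = inFlight.incrementAndGet();
            maxSeen.accumulateAndGet(now, Math::max);
            sleepQuietly(5); // pretend to wait on the downstream service
            inFlight.decrementAndGet();
            return now;
        });
        for (int i = 0; i < calls; i++) {
            pool.submit(task);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for demo", e);
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        // 10 request threads, but at most 4 concurrent downstream calls.
        System.out.println("Max concurrent downstream calls: "
                + maxObservedConcurrency(10, 4, 40));
    }
}
```

The point of the design is that the replica’s own concurrency and the pressure it puts on the downstream resource become two separate levers, so you can raise one without raising the other.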
With the above rules of thumb, you can multiply up and down the initial values of CPU, memory, replicas etc, and come to some initial tuning… and the rules will give you a starting point… maybe…
But More, Much More Than This?
When I set out to write this, I considered that I might try to put some useful numbers in. I thought I might even say something like if it’s CPU heavy, add 15% to the CPU, if it’s memory hungry, divide the concurrency by 3…
But… there’s really no meaning to inventing numbers cold. The thing you need to do is run your app at consistent high (not breaking) load and see how it performs. Monitor the CPU and heap. If they’re running hot, then cool the app down, or add resources. Think about the ratio of overheads to throughput of valuable work and account for it.
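If you don’t have monitoring tooling to hand, a crude in-process check is possible with the standard `java.lang.management` beans. The sketch below is an assumption about how you might do it, not a substitute for proper metrics: the class name and the 0.8 threshold are mine, and the system load average is only a rough signal (it’s unavailable on some platforms, where the JVM returns a negative value, and inside a container it may reflect the host rather than your slice of it).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

public class ResourceSnapshot {
    // "Tune to around 80%": anything above this counts as running hot.
    static final double TARGET_UTILISATION = 0.8;

    // Fraction of the maximum heap currently in use, or -1 if no maximum is defined.
    static double heapUtilisation() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();
        return max > 0 ? (double) heap.getUsed() / max : -1;
    }

    // One-minute load average per visible CPU, or -1 where the platform has none.
    static double loadPerCpu() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        return load >= 0 ? load / os.getAvailableProcessors() : -1;
    }

    static boolean runningHot(double utilisation) {
        return utilisation > TARGET_UTILISATION;
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample once a second for a few seconds while the load test runs.
        for (int i = 0; i < 5; i++) {
            double heap = heapUtilisation();
            double cpu = loadPerCpu();
            System.out.printf("heap=%.2f hot=%b | load/cpu=%.2f hot=%b%n",
                    heap, runningHot(heap), cpu, runningHot(cpu));
            Thread.sleep(1000);
        }
    }
}
```

Heap usage naturally saw-tooths between garbage collections, so a single hot sample means little; it’s the trend across a sustained run that tells you whether to cool the app down or add resources.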
I hope this helps.