Before going into detail on this subject, I should point out that one experience does not itself make for a powerful data set. I may come back to this and revise my observations. However, I’ve recently applied a technique that I felt ought to work and had the results I expected, so let’s assume my observations are correct.
We had a multi-threaded application that was running at 100% CPU on a small-size instance, delivering 1400 operations/minute. We wanted to scale that up.
There were a few likely performance bottlenecks:
- Threads that wrote to a stream were blocked until the stream had finished writing – so we needed more threads – and the stream itself wasn’t necessarily assembling full batches for its output calls
- It was calculating a data conversion for each message, which is CPU-bound work, so too many threads would cause thrashing
- It was writing to a queue one message at a time rather than in batches – so it made inefficient use of the queue API, and each thread had to block until the operation completed
- It was deleting each input message from its queue one at a time – same problem as above
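To make the batching point concrete, here’s a minimal sketch (not the actual application code) of grouping pending message IDs into batches of 10, the maximum batch size for SQS batch operations, so that 25 deletes become 3 API calls instead of 25:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDrain {
    // Splits a list of message IDs into batches of at most `batchSize` entries.
    public static List<List<String>> toBatches(List<String> ids, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                ids.subList(i, Math.min(i + batchSize, ids.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 25; i++) ids.add("msg-" + i);
        List<List<String>> batches = toBatches(ids, 10);
        System.out.println(batches.size());        // 3 batches instead of 25 calls
        System.out.println(batches.get(2).size()); // final partial batch of 5
    }
}
```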
We had a thread pool of 6 threads to start with.
With blocking code, threads spend time blocked, so you need more of them to keep the CPU busy. Add too many, though, and the CPU gets overloaded.
While downsizing the thread pool on its own might have improved throughput, I didn’t feel it solved the underlying problem of inefficient batch operations.
The solution I picked for this was asynchronously streaming the input over the various tools. I chose to write a streaming library for this myself, rather than use Spring Reactive, Java Flows, or anything else, as the job needed a mix of different techniques, none of which had ready-made support in any one common library.
The idea was that the source message reader would read from the source queue and push each message into a non-blocking streaming operation, which would use callbacks to mark the message as done. In order to avoid sucking the whole queue into memory, back pressure was applied using a buffer which stopped new messages entering the process while it was full, and removed messages from the buffer as part of their success and failure callbacks.
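The back-pressure idea can be sketched with nothing more than a semaphore – this is a toy illustration, not the actual library: permits cap how many messages are in flight, the reader blocks when the buffer is full, and each message’s completion callback releases a permit so the next one can enter.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BackPressureDemo {
    // Pushes `messages` items through a small worker pool while allowing
    // at most `maxInFlight` to be buffered at once; returns the count completed.
    public static int process(int messages, int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger done = new AtomicInteger();
        try {
            for (int i = 0; i < messages; i++) {
                permits.acquire();        // reader blocks while the buffer is full
                pool.submit(() -> {
                    try {
                        done.incrementAndGet();  // downstream work would happen here
                    } finally {
                        permits.release();       // completion callback frees a slot
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(process(100, 8)); // 100
    }
}
```

The key property is that the source queue is never drained faster than downstream completions free up buffer slots, so memory stays bounded no matter how deep the input queue is.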
The target stream – Kinesis – has a non-blocking producer library, and I wrote a similar tool for non-blocking producing to batches for SQS write and delete operations.
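As a rough sketch of what such a non-blocking batching tool can look like (again, an illustrative assumption, not the real implementation): callers add entries without blocking, a full batch triggers one send for the whole group, and each caller gets a future that completes when its entry’s batch has gone out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class MicroBatcher {
    private final int batchSize;
    private final Consumer<List<String>> sender;  // e.g. wraps a batch API call
    private List<String> buffer = new ArrayList<>();
    private List<CompletableFuture<Void>> waiters = new ArrayList<>();

    public MicroBatcher(int batchSize, Consumer<List<String>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Adds an entry; returns a future completed when its batch is sent.
    public synchronized CompletableFuture<Void> add(String entry) {
        buffer.add(entry);
        CompletableFuture<Void> future = new CompletableFuture<>();
        waiters.add(future);
        if (buffer.size() >= batchSize) flush();
        return future;
    }

    // Sends whatever is buffered and completes the per-entry futures.
    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        List<String> batch = buffer;
        List<CompletableFuture<Void>> toComplete = waiters;
        buffer = new ArrayList<>();
        waiters = new ArrayList<>();
        sender.accept(batch);                      // one call for the whole batch
        toComplete.forEach(f -> f.complete(null)); // per-message success callbacks
    }
}
```

With a batch size of 10, twenty-five writes collapse into two full batches plus one partial batch on the final flush, and the per-message futures are what drive the success/failure callbacks mentioned above.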
The idea with async is to keep one thread extremely busy. Because of a few compromises I made over which threads flushed things, I couldn’t quite get it down to a single thread, but I dropped the thread pool from the original 6 threads to 3 and road-tested the solution.
Performance went up from 1400/min to 1700/min. Fewer threads, possibly less thrashing, and no doubt some benefit from better batching.
However, the main aim was to make it easier to scale up to a larger CPU allocation without having to fine-tune the number of threads.
We doubled the CPU and found that performance went up to 3600/min. This was, by the way, the required speed for the process – the process normally runs on two nodes, so with this scaling, we’d have plenty of headroom.
What’s worth noting is that at double the CPU but the same number of threads, the CPU usage was still maxed out, and the system was delivering more than twice as much throughput.
Writing a homemade async/reactive library might not have been the best move, but it worked for me.
Keeping one thread busy yet still doing all the work scales rather well.
These streams are cumbersome to compose and test. Single-threaded blocking code is easier to work with.