A common question I face is: how do you scale a Chronicle-based system if it is single writer, multiple readers? While there are solutions to this problem, it is far more likely that it won’t be a problem at all.
Micro-cloud is the term I have been using to describe a single thread doing the work currently being done by multiple servers (or the opposite of the trend of deploying a single application to multiple machines).
There is an assumption that the only way to scale a system is to partition it or to run multiple copies. This doesn’t always scale that well unless you have multiple systems as well. All this adds complexity to the development, deployment and maintenance of the system. A common problem I see is that developers are no longer able to test the end-to-end system on their workstations or in unit/functional tests.
Chronicle-based processing engines take a different approach. They are typically designed to handle as much load as you can give them, i.e. the bottleneck is elsewhere, such as the touch points with external systems, gateways and databases. A Chronicle-based processing engine can handle 100,000 to 1,000,000 inbound and outbound events per second with latencies of between 1 and 10 microseconds. This is more than the rate at which gateways can feed data in and out of the engine, so there is rarely a good use case for partitioning. It is an option, but it is one you should not assume you need.
The benefit of this is that you have a deterministic system which is very simple architecturally. The system is deterministic because you have a record of every inbound and outbound message, which can have microsecond timings, and the behaviour of the system is completely reproducible (and restartable). Note: one of the consequences of this is that just restarting a failed service should not help. If it failed once in a particular way on a particular message, it should fail on restart in exactly the same way, every time. This makes reproducing issues and testing easier.
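To make the replay idea concrete, here is a minimal sketch, assuming a toy engine whose only state is a position and whose inbound events are plain strings. The names ReplayableEngine, onEvent and replay are hypothetical and are not Chronicle's API; the in-memory list stands in for the recorded journal of inbound messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: illustrative names only, not Chronicle's API.
final class ReplayableEngine {
    private long position; // the only state, touched by a single thread

    // Handles one inbound event and returns the outbound message it produces.
    String onEvent(String event) {
        String[] parts = event.split(" ");
        long quantity = Long.parseLong(parts[1]);
        if (parts[0].equals("BUY")) {
            position += quantity;
        } else if (parts[0].equals("SELL")) {
            position -= quantity;
        } else {
            // A bad message fails the same way on every replay.
            throw new IllegalArgumentException("Unknown event: " + event);
        }
        return "position=" + position;
    }

    // Replays a recorded list of inbound events through a fresh engine.
    static List<String> replay(List<String> journal) {
        ReplayableEngine engine = new ReplayableEngine();
        List<String> outbound = new ArrayList<>();
        for (String event : journal) {
            outbound.add(engine.onEvent(event));
        }
        return outbound;
    }

    public static void main(String[] args) {
        List<String> journal = Arrays.asList("BUY 100", "SELL 30", "BUY 5");
        List<String> firstRun = replay(journal);
        List<String> secondRun = replay(journal);
        System.out.println(firstRun);                   // [position=100, position=70, position=75]
        System.out.println(firstRun.equals(secondRun)); // true
    }
}
```

Because the output depends only on the recorded input and on state owned by one thread, a journal containing a message the engine cannot handle would throw at the same point on every replay, which is the "restarting should not help" behaviour described above.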
Why is it so fast?
A processing engine running in a single thread has:
- No locking
- No TCP delays
- The ability to be locked to a core or CPU, improving caching behaviour, i.e. it never context switches or gives up the CPU.
- Reusable mutable objects, which are more practical in a single thread, so you can avoid churning your caches with garbage and your memory accesses can be 5x faster. (With the bonus that you no longer have GC pauses.)
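The last point can be sketched as follows. This is a minimal illustration, assuming a hypothetical PriceUpdate message with three fields; none of these names come from Chronicle. One object is allocated up front and overwritten for each event, which is only safe because a single thread owns it.

```java
// Hypothetical sketch: ReuseDemo and PriceUpdate are illustrative names.
final class ReuseDemo {
    // A mutable message whose fields are overwritten for each event.
    static final class PriceUpdate {
        long instrumentId;
        double bid;
        double ask;

        // Overwrites the fields in place rather than allocating a new object.
        PriceUpdate set(long instrumentId, double bid, double ask) {
            this.instrumentId = instrumentId;
            this.bid = bid;
            this.ask = ask;
            return this;
        }

        double mid() {
            return (bid + ask) / 2;
        }
    }

    // Sums mid prices while reusing a single PriceUpdate for every event.
    static double sumOfMids(int events) {
        PriceUpdate update = new PriceUpdate(); // allocated once, outside the loop
        double sum = 0;
        for (int i = 0; i < events; i++) {
            update.set(i, 100.0 + i, 101.0 + i); // no allocation per event
            sum += update.mid();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfMids(3)); // 304.5
    }
}
```

With several threads sharing the object you would need locking or defensive copies, which is why this style is more practical in a single-threaded engine.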
The real reason to avoid producing garbage
A common problem in most Java applications is the GC pause time. You generally want to keep this to a minimum. I advocate reducing the garbage produced, but I am not concerned about GC pauses as I haven’t written a system which pauses while trading for four years now. I do this by having an Eden size which is greater than the amount of garbage I produce in a day, and I do a System.gc() in a maintenance window (in the middle of the night or at the weekend as appropriate). The main reason to avoid garbage in low latency applications is that creating garbage causes the caches to be scrolled through or flushed. If you have a system (all JVMs on a socket) which is producing 200 MB/s of garbage and you have a 20 MB L3 cache, you are effectively filling the cache with garbage every 0.1 seconds. Memory accesses to the L3 cache can be 5 times faster than accesses to main memory.
Note: the L3 cache is shared across all cores in your socket, so if you want scalability across all your cores you want to be working in your L1 cache, which is typically 32 KB. If you have a busy thread creating 32 MB/s of garbage, that cache is being filled with garbage every millisecond.
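The two fill times quoted above are simply cache size divided by garbage rate. A small sketch of that arithmetic, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and the hypothetical names CacheFillTime and secondsToFill:

```java
// Hypothetical sketch of the arithmetic; decimal units are an assumption.
final class CacheFillTime {
    static final double KB = 1e3;
    static final double MB = 1e6;

    // Seconds for newly created garbage to write as many bytes as the cache holds.
    static double secondsToFill(double cacheBytes, double garbageBytesPerSecond) {
        return cacheBytes / garbageBytesPerSecond;
    }

    public static void main(String[] args) {
        // 20 MB L3 cache with 200 MB/s of garbage across the socket.
        double l3Seconds = secondsToFill(20 * MB, 200 * MB);
        // 32 KB L1 cache with one busy thread creating 32 MB/s of garbage.
        double l1Seconds = secondsToFill(32 * KB, 32 * MB);
        System.out.printf("L3 filled every %.1f ms%n", l3Seconds * 1e3);
        System.out.printf("L1 filled every %.1f ms%n", l1Seconds * 1e3);
    }
}
```

This is a rough model: it treats every allocated byte as displacing a cached byte and ignores everything else competing for the cache, so it gives an order of magnitude rather than a measurement.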
For this reason, I have started using the term micro-cloud, i.e. doing the work of a whole cloud in one thread. I have a couple of clients who are doing just that. What was done by many JVMs across multiple systems can be done by just one, very fast thread. Obviously, one thread is much easier to test on one workstation and takes up far less rack space in your data centre in production.