The way to make any data structure or algorithm as fast as possible is for the code to do exactly what you want and no more. The problem with building a data store which does every thing anyone could want is that it won’t do anything particularly well.
What can you achieve with a custom data store in terms of performance?
You can support;
- read/write latencies of around 75 nano-seconds.
- throughputs of 40 million operations per second.
- with binary encoding and compression reduce the size of your data by a factor of 100 or more. This saves memory and increases scalability.
- Control how replication utilised your network, or is synchronized with your database.
Do we really need a customizable data store?
Most developers are not too concerned about how efficient their data store is and generic data stores work well enough and hide the details of how they really work. This can save developers a lot of time worrying about the details of how a data store works.
There are times when the choice of data store and how it operates really matter. If a data store is heavily used, how the data is arranged, the functionality it provides, and just as importantly, what it doesn’t provide really matters. You don’t want to be paying the overhead of supporting functionality you are not using.
Why do reactive systems have greater demands?
Reactive systems have greater demands on timeliness, needed to see events/updates within milli-seconds or even micro-seconds of them being committed.
Reactive systems are more likely to care about how the data got to it’s final state. Unlike polling systems where you are more likely to see just the final result of multiple changes, a reactive system might need to see exactly what changes ere made in what order.
Low latency, high throughput
A simple thread safe, segmented key-value store can have latencies around 75 nano-seconds, and support 40 million accesses (gets or puts) per second. Adding support for more functionality will impact performance so you only want to add the functionality you need if performance is also critical.
Even simple thing like adding a timestamp which can take 30 nano-second sounds quick but can mean operations take 50% longer.
What options would you like to be able to customize?
Do you need total ordering, store based ordering, segment based ordering or key based ordering?
The ordering constraints are closely related to locking or serialization of events. Locking is easier to implement and supports richer functionality, however lock free algorithms can be not only faster be more scalable with more consistent latencies.
In a data store, with total ordering you will see all changes in a consistent order. While this is the safest option, it places a global serialization requirement on all the data. This dramatically limits the options for concurrent updates. This does simplify locking as you have a global lock on all the data.
An alternative is to have ordering for a data store. This means you will know the exact order of all changes to the store, but not record changes between stores. (You can add timestamps to get an ideal of when changes happened)
To allow concurrency within a store, you can use segments or page based ordering. When you update an entry which is assigned to a segment, that segment is locked but other segments can be updated. You can get the order of all events within that segment but not between segments.
The greatest concurrency can be achieved by only limiting the order of changes to individual keys. This way any number of keys can be updated concurrently, but at least you know what a key was updated to last.
Finally, you might not need any of this. This is particularly useful if an entry is never changed, it either exists or it doesn’t. You might want to prevent any record be altered. i.e. records can only be added. If the same record with the same details is added twice this may be acceptable and ignored as a duplicate.
Shared memory data store
A feature we have found particularly useful is being able to share data between JVM on the same machine. This allows all JVMs to access the data at in memory speeds.
While this feature doesn’t slow down the solution it does place some constraints on the design to allow this to work. In particular, Java doesn’t support a heap shared between JVMs, to share memory you need to use off heap memory.
There is a number of ways to replicate data.
- Eventual consistency. We favour this model as it handles split brain situations gracefully.
- Transactional updates. An event is either visible to all nodes in a cluster or none of them.
- At least one backup. An update is saved to at least two nodes. If one fails the data is not lost. This can be faster than ensuring every node has accepted the update.
- Multi-cluster replication. While data might be freely replicated within a local cluster, you may want controls over which data is replicated to between regions and how this is performed.
- Traffic shaping you may want to control the rate of updates, or the bandwidth used, and whether compression is used.
Synchronous or asynchronous persistence
Our solutions try very hard to be as fast synchronously as most solutions performing updates asynchronously. This helps reduce overhead and complexity.
Typically a write to a memory mapped file isn’t flushed to disk immediately so the choice of disk subsystem doesn’t matter provided you haven’t overloaded it. In terms of throughput, it’s you bandwidth utilisation which matters. If you use even a fraction of your bandwidth on a sustained basis, you are likely to run out of disk space pretty quickly. If you are writing even a very modest 12 MB/s sustained, that is over 1 TB per day.
The operating systems we have tested don’t hide the disk sub-system from you entirely. For one in every ten or one in every hundred writes the latency will depend on the type of disk sub-system you have. If you care about 99% tile latencies, your choice of disk sub-system still matters.
You would assume that anyone who cares about performance would be using SSD if not PCI-SSD, because they have latencies which are around 100 times faster than spinning disk. The number of IOPS (IOs per second) for enterprise SSD is around 100 times higher as well. Desktop SSD can be 1000 times higher so you can expect this will become the norm for enterprise disk as well.
Unfortunately, it’s not that simple in large organisations and getting in SSD drives could take a very long time e.g. 6 to 12 months, if they can get approval at all.
One work around is to write data asynchronously to memory and have this spooled to disk in another thread.
Should the data be stored as text or binary?
Binary data is usually more efficient than text, unless the data is already in a text format. Some gains can be made by transforming highly verbose formats like XML, or JSon into a binary format which is turned back into text when retrieved. This is a format specific compression which can work well even when compared generic compression (See next)
Converting to a binary format can reduce the size of data by a factor from 3 to 10 times. If the format can be lossy, you can save even more space. (e.g. can white space be dropped) If generic compression is also used you can get compression ratios of 20 to 200 times.
Should the data be compressed?
Compressing data is a trade of between CPU and space consumed. There is a number of compression strategies which either use less CPU but don’t compress as well, to strategies which use more CPU and compact data further.
This can not only save disk space but memory consumption as well. This allows you to scale the amount of data you can store efficiently.
If you have plenty of memory, you might want to avoid compression to save CPU.
If your data entries are large, compressing each individual entry can work well. If your data entries are small, you can get significant gains by compressing blocks of entries.
You might even need a hybrid approach where recent data is not compressed but longer term data is compressed asynchronously.
If you use generic compression, you can get compression ratios of between 5 to 50 times.
In a reactive system, can a consumer consolidate the updates it missed?
If you have a slow consumer in your system, you want a simple way to catch up. You will always have consumers which are momentarily behind, but in some systems, they might be a long way behind. In Chronicle Queue for example, a consumer can be more than main memory behind the producer as it never drops updates.
If you drop updates, to can quickly catch up assuming there is lots of updates for the same key, or there is a simple consolidation strategy.
There are times when you need see every event/message/alteration no matter how old they are. This is useful for auditing purposes.
You might need a hybrid approach where every event is being recorded but some consumers can skip to the latest update for a key.
In a transactional data with a high per transaction overhead, using batching can really help. Batching is also useful for IO operations again to reduce overhead.
Most of our solutions try to have a very low overhead per transaction to minimise latency so the addition of batch can introduce more overhead than it saves.
More robust security models
You may need to be able to control access to individual sets, but you might also need to add Access Control Lists to every individual key.
You might need access to entries to be based on the contents of those entries. For example employees in New York might be able to update entries with a location=New York. Employees in a region, organisation or team can manage their own data.
Time stamping changes
Should updates/event be timestamped. This can be useful, but a non-trivial overhead if not used.
Auditing information and simplified security
When a change is made you may need to record additional informations such as; who made the change, when, from which client. This is useful for auditing purposes and simplifying your security model.
Instead of having strict security controls, which users may see as more of an obstacle to be avoided than useful, users can be made aware that they can do what they need to do, but all changes are being recorded so that users might think more carefully about what they should do. If you also have the ability to reverse/correct changes made, this can be another way to deal with mistakes.
Is Chronicle Journal open source?
We have two open source data storage solutions, Chronicle Queue and Chronicle Map which work for specific use cases very well and you may wish to try these first to see if they do what you need.
Chronicle Journal is designed to be more customizable which in turn needs more consulting to realise the solution. As such it is on GitHub but only accessible to clients with a support agreement.
If you are interested is getting support for Chronicle including Journal please contact email@example.com