About Jakub Holy

Jakub is an experienced Java[EE] developer working for a lean & agile consultancy in Norway. He is interested in code quality, developer productivity, testing, and in how to make projects succeed.

Making Sense Out of Datomic, The Revolutionary Non-NoSQL Database

I have finally managed to understand one of the most unusual databases of today, Datomic, and would like to share it with you. Thanks to Stuart Halloway and his workshop!

Why? Why?!?

As we shall see shortly, Datomic is very different from traditional RDBMSs as well as the various NoSQL databases. It isn’t even a database – it is a database on top of a database. I couldn’t wrap my head around that until now. The key to understanding Datomic and its unique design and advantages is actually simple.

The mainstream databases (and languages) were designed around the following constraints of the 1970s:

  • memory is expensive
  • storage is expensive
  • it is necessary to use dedicated, expensive machines

Datomic is essentially an exploration of what database we would have designed if we did not have these constraints. What design would we choose given gigabytes of RAM, networks with bandwidth and speed matching or exceeding hard disk access, and the ability to spin up and kill servers on a whim?

But Datomic isn’t an academic project. It is pragmatic: it wants to fit into our existing environments and make it easy for us to start using its futuristic capabilities now. And it is not as fresh and green as it might seem. Rich Hickey, the mastermind behind Clojure and Datomic, has reportedly thought about both projects for years, and the designs have been really well thought through.

The Weird Architecture of Datomic

  1. Datomic is a database on top of another database (or rather storage) – in-memory, a file system, a traditional RDBMS, Amazon DynamoDB.
  2. You do not send your query to the server and get back the result. Instead, you get back all the data you need to execute the query and run the query – and all subsequent queries – locally. Thus, “joins” are pretty cheap and you can do plenty of otherwise impossible things (combine data from multiple databases and local data structures, run any code on them, …). Each application using Datomic – a “peer” – will have the data it needs, based on its unique needs and usage patterns, close to itself.
  3. All writes go through one component, called the Transactor, which essentially serializes the writes, thus ensuring ACID. It might sound like a bottleneck, but it isn’t for most practical purposes[1] given the design and typical application needs. (Reportedly, Datomic could handle all transactions for all credit cards in the world. Listen to the experiences of Room Key with their rather write-heavy load in the Relevance Podcast with Kurt Zimmer (Podcast Episode 033).)
  4. Datomic works quite similarly to a version control system such as Git. It never overwrites data; there are no updates. You only mark the data as no longer valid and add new data, which produces a new version of the database (think of a git hash / svn revision number). You can then query the latest state of the database or the state as of a particular version. (Of course the whole database isn’t copied whenever you add a fact to it. Datomic is smart and efficient.)
  5. It is not a single, monolithic server; the storage, transactor, and peers are physically separate pieces.
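
To make points 3 and 4 more concrete, here is a minimal sketch in Python. It is not Datomic's API: the `Fact` tuple, the `Transactor` class, and the `value_as_of` helper are hypothetical names made up for this illustration, and the model assumes one value per attribute. It only shows the idea of a single writer that appends facts to a log, and of reading the state as of a particular version.

```python
from collections import namedtuple

# Hypothetical fact shape (an assumption for this sketch): entity, attribute,
# value, transaction number, and whether the fact was added or retracted.
Fact = namedtuple("Fact", "entity attribute value tx added")


class Transactor:
    """Single writer: each transaction gets the next version number."""

    def __init__(self):
        self.log = []        # append-only; existing entries are never changed
        self.latest_tx = 0

    def transact(self, changes):
        """changes: iterable of (entity, attribute, value, added) tuples."""
        self.latest_tx += 1
        for entity, attribute, value, added in changes:
            self.log.append(Fact(entity, attribute, value, self.latest_tx, added))
        return self.latest_tx


def value_as_of(log, entity, attribute, tx):
    """Return the value that was valid at version tx, or None."""
    current = None
    for fact in log:
        if fact.tx > tx:
            break  # the log is ordered by tx, so later facts are not visible
        if fact.entity == entity and fact.attribute == attribute:
            current = fact.value if fact.added else None
    return current


transactor = Transactor()
t1 = transactor.transact([("user-1", "email", "old@example.com", True)])
# An "update" is a retraction of the old fact plus an addition of the new one.
t2 = transactor.transact([
    ("user-1", "email", "old@example.com", False),
    ("user-1", "email", "new@example.com", True),
])

print(value_as_of(transactor.log, "user-1", "email", t1))  # old@example.com
print(value_as_of(transactor.log, "user-1", "email", t2))  # new@example.com
```

Because nothing is ever overwritten, asking for the state as of `t1` keeps returning the same answer no matter how many transactions follow.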

What has made this possible?

  • Network access as fast as or faster than disk access => can fetch all the data over the network
  • Plenty of memory => can store a substantial subset of it on each peer according to its actual needs
  • Storage is huge and cheap => we can easily store historical data
  • Experiences with efficient, immutable, “persistent” data structures used in modern FP languages => cheap creation of new “database values”
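
The last point is easy to underestimate, so here is a small sketch of what a “persistent” data structure means. It is a plain path-copying binary search tree written for this article, not the structure Datomic actually uses; `Node`, `insert`, and `lookup` are hypothetical names. Adding a key produces a new tree that shares all untouched parts with the old one, which is why a new “version” is cheap.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node; a tree value is just a reference to its root."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key, value):
    """Return a new tree containing key; the old tree is left untouched."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)


def lookup(node, key):
    """Return the value stored under key, or None if it is absent."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


v1 = None
for k, v in [(50, "a"), (30, "b"), (70, "c")]:
    v1 = insert(v1, k, v)

v2 = insert(v1, 80, "d")     # copies only the nodes on the path 50 -> 70

print(lookup(v1, 80))        # None: the old version is unchanged
print(lookup(v2, 80))        # d
print(v2.left is v1.left)    # True: the untouched subtree is shared
```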

The Unique Value Proposition And Capabilities of Datomic

We have now learned about and hopefully understood the unique design of Datomic. But what does it give us, and what distinguishes it from other databases?

The architecture, together with a few other design decisions, provides the following key characteristics:

  • Programmability – data, schema, query input/output, transaction metadata are all just elementary data structures that you have fully available at the peer and can thus combine and process in powerful ways unimaginable before
  • Persistence/accountability – you never lose history, can annotate transactions with metadata about who/why etc., support for finding out how things were, how they have been changing, performing what-if analysis
  • Elastic scalability – since a lot of the load has been pushed to the peers
  • Flexibility – no rigid schema, easy to navigate and combine and cache data based on each peer’s unique needs, extensibility via data functions
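
A rough illustration of the programmability point, again in plain Python rather than Datomic's query language: once the facts are at the peer as ordinary data, they can be combined with local data and extended with hypothetical facts for a what-if analysis. The fact layout, the order data, and the function names below are all invented for this sketch.

```python
# Hypothetical database value at the peer: an immutable tuple of
# (entity, attribute, value) facts.
db = (
    ("order-1", "customer", "alice"),
    ("order-1", "total", 120),
    ("order-2", "customer", "bob"),
    ("order-2", "total", 80),
)

# Local data that was never stored in the database (discount in percent).
discount_percent = {"alice": 10}


def totals_by_customer(facts):
    """Join each order's total with its customer and sum per customer."""
    customer_of = {e: v for e, a, v in facts if a == "customer"}
    totals = {}
    for e, a, v in facts:
        if a == "total":
            name = customer_of[e]
            totals[name] = totals.get(name, 0) + v
    return totals


def discounted_totals(facts, discounts):
    """Combine database facts with the local discount table."""
    return {
        name: total * (100 - discounts.get(name, 0)) // 100
        for name, total in totals_by_customer(facts).items()
    }


# What-if: add speculative facts locally; nothing is written anywhere.
what_if = db + (("order-3", "customer", "bob"), ("order-3", "total", 40))

print(discounted_totals(db, discount_percent))       # {'alice': 108, 'bob': 80}
print(discounted_totals(what_if, discount_percent))  # {'alice': 108, 'bob': 120}
print(len(db))                                       # 4: the original is unchanged
```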

Closing Notes

Datomic has goals similar to those of relational databases (especially ACID) and could be used in similar use cases. Performance-wise, if writes are more important than reads, if you need to continuously write a very large amount of data each second, or if you have billions of “rows”, then you might prefer another solution. Thanks to the design and the recommended architecture for heavily loaded installations, i.e. with memcached in front of the storage, the performance of the storage backend isn’t so important (as the peers have the data they need locally or get it from memcached), so it should be selected based more on its usage-related characteristics.
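
Why the backend matters so little can be sketched as a chain of read-through caches. This is a toy model under the assumption that stored chunks of data are immutable, so a cached copy never needs invalidating; `Storage`, `CachedReader`, and the segment keys are hypothetical names, and the middle layer merely stands in for memcached.

```python
class Storage:
    """Stand-in for the storage backend; counts how often it is read."""

    def __init__(self, segments):
        self.segments = dict(segments)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.segments[key]


class CachedReader:
    """Read-through cache in front of another reader.

    No invalidation logic is needed in this sketch because a stored
    segment is assumed never to change once written.
    """

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.backend.get(key)
        return self.cache[key]


storage = Storage({"seg-1": ("fact-a", "fact-b"), "seg-2": ("fact-c",)})
shared_cache = CachedReader(storage)   # plays the role of memcached
peer_a = CachedReader(shared_cache)    # each peer also caches locally
peer_b = CachedReader(shared_cache)

peer_a.get("seg-1")
peer_a.get("seg-1")   # served from peer_a's own cache
peer_b.get("seg-1")   # served from the shared cache

print(storage.reads)  # 1
```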

Summary

The design of Datomic – peers fetching data and running queries locally, a single coordinator of writes (the transactor), building on existing databases/storage tools (and keeping all the history) – seemed very strange and perhaps inefficient to me until I realized that traditional databases are designed around constraints that no longer exist. Datomic now makes sense to me and seems like a tool with intriguing capabilities and great potential. I hope you see it the same way now.

I have left out some interesting topics such as what data structures can be stored in Datomic and the data model and query model used. To learn about these and more about Datomic, head to Datomic for Five Year Olds and Datomic’s home page.

Bonus Links

[1] Harizopoulos, S., Abadi, D. J., Madden, S., & Stonebraker, M. (2008, June). OLTP through the looking glass, and what we found there. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 981-992). ACM. – this paper shows that traditional RDBMSs spend nearly 30% of their time on locking and latching, which could be eliminated with single-threaded access, as is also done in VoltDB. See also the VoltDB whitepaper.
 
