Trying to understand CAP

Adam WarskiSeptember 27th, 2012Last Updated: October 22nd, 2012

0 34 3 minutes read

The CAP theorem, stated by Brewer and proved by Gilbert and Lynch specifies a property of distributed systems. It states that such a system cannot guarantee at the same time Consistency, Availability and Partition tolerance. It is also often said as a catchy phrase:

Consistency, Availability, Partition Tolerance – pick any two

used mostly when talking about NoSQL databases and suggesting that a distributed system can be characterized as either CA, AP or CP (see e.g. here).

For some time I’ve been trying to understand the different combinations and what do they mean in practice; having some time at various airports I caught up on reading, and here’s what I came up with.

What’s C, A and P?

First let’s define how I understand the three guarantees, basing on some of the articles I’ve read.

Consistency is the easiest one. It roughly means that the clients get the same view of data. By saying that a system is consistent we often mean strong consistency, but it also can come in different flavors, e.g. casual.

Availability is a property saying that every request to a non-failing node will return a (meaningful) response. The response may not contain all data (so the harvest will not be 100%, see the appropriate section in [3]), but it should be useful for the client.

Partition tolerance means that the system will continue working even if any number of messages sent between nodes is lost. This can be e.g. a network failure between two datacenters, where nodes in each datacenter form a partition. Also note that a failure of any number of nodes forms a partition (it is not possible to distinguish between a network failure and a node failing and stopping to respond to messages).

The hardest part for me is understanding the difference between Availability and Partition tolerance. Also, the various articles don’t specify what they mean by saying that a system is “working” after being e.g. partitioned – does it mean that every request gets a response with useful data, or are responses “Sorry, I can’t give you data right now” acceptable also?

P+C?

Let’s assume that a system is partition tolerant and that it has more than one node. If a partition is formed, splitting the system in two, the system should continue working. Hence both partitions allow clients to write. But then, how to guarantee consistency? If one client writes to partition 1, and another to partition 2? Hence: P => ~C.

A+~P?

Suppose now that a system is available and that we have more than one node. As the system is available, it should respond to requests even if some nodes die. As noted above, some nodes dying are equivalent to a partition. So if the system is still working, we have partition tolerance. Hence: A => P.

A+C?

Summarizing the two implications above (A => P and P => ~C), we get: A => ~C, so that an available system cannot be consistent (if it has more than one node). In practice however, there are of course AC systems, e.g. single-node RDBMS. Or even master-slave/master-master replicated RDBMS, provided there’s a central router knowing which nodes live and directing client appropriately. Such a router is then a single point of failure (SPoF).

Relax?

I suspect that in reality, when e.g. NoSQL/NewSQL systems are characterized with the CAP properties, they assume some relaxed form of C/A/P. Unfortunately, all of the definitions flying around seem to be pretty vague and are more of hand-waving than proper, precise statements. I think it would be much easier to explore the ever-growing ecosystem of new datastores if they could be more easily characterized; maybe the CAP vocabulary is just not enough?

Please correct me if I’m wrong somewhere, and I probably am! :)

And don’t forget to share!

Reference: Trying to understand CAP from our JCG partner Adam Warski at the Blog of Adam Warski blog.