Since the Strangeloop talks won’t be available for a few months, I recorded a new version of the talk as a Google Hangout.
Cassandra is a Dynamo system; like Riak, it divides a hash ring into a several chunks, and keeps N replicas of each chunk on different nodes. It uses tunable quorums, hinted handoff, and active anti-entropy to keep replicas up to date. Unlike the Dynamo paper and some of its peers, Cassandra eschews vector clocks in favor of a pure last-write-wins approach.
If you read the Riak article, you might be freaking out at this point. In Riak, last-write-wins resulted in dropping 30-70% of writes, even with the strongest consistency settings (R=W=PR=PW=ALL), even with a perfect lock service ensuring writes did not occur simultaneously. To understand why, I’d like to briefly review the problem with last-write-wins in asynchronous networks.
Kafka is a messaging system which provides an immutable, linearizable, sharded log of messages. Throughput and storage capacity scale linearly with nodes, and thanks to some impressive engineering tricks, Kafka can push astonishingly high volume through each node; often saturating disk, network, or both. Consumers use Zookeeper to coordinate their reads over the message log, providing efficient at-least-once delivery–and some other nice properties, like replayability.
The CAP conjecture, I am convinced, is false and can be proven false.
The CAP conjecture has been a theoretical millstone around the neck of all ACID systems. Good riddance.
This is the first wooden stake for the heart of the noSQL movement. There are more coming.
Zookeeper, or ZK for short, is a distributed CP datastore based on a consensus protocol called ZAB. ZAB is similar to Paxos in that it offers linearizable writes and is available whenever a majority quorum can complete a round, but unlike the Paxos papers, places a stronger emphasis on the role of a single leader in ensuring the consistency of commits.
Because Zookeeper uses majority quorums, in an ensemble of five nodes, any two can fail or be partitioned away without causing the system to halt. Any clients connected to a majority component of the cluster can continue to make progress safely. In addition, the linearizability property means that all clients will see all updates in the same order–although clients may drift behind the primary by an arbitrary duration.
I wish I could make more concrete policy recommendations, but in this case all I can say is “this looks troubling.” Here’s the letter I sent to my representatives today:
Dear Senator Feinstein,
In 2006, we learned that the NSA had secretly tapped all internet traffic flowing through AT&T’s San Francisco peering point. Now, the Guardian’s leaks suggest that the NSA has accrued phone and email records–some metadata, some full content–for millions of US citizens, and stored them for targeted analysis. The criteria for retention and analysis remain poorly understood.
If you, as a database vendor, implement a few features in your API, I can probably offer repeatable automated tests of your DB’s partition tolerance through Jepsen.
The outcome of these tests would be a set of normalized metrics for each DB like “supports linearizability”, “available for writes when a majority partition exists”, “available for writes when no majority available”, “fraction of writes successful”, “fraction of writes denied”, “fraction of writes acked then lost”, “95th latency during condition X”, and so forth. I’m thinking this would be a single-page web site–a spreadsheet, really–making it easy to compare and contrast DBs and find one that fits your safety needs.
At a minimum, I need to know:
I’ve been discussing Jepsen and partition tolerance with Peter Bailis over the past few weeks, and I’m honored to present this post as a collaboration between the two of us. We’d also like to extend our sincere appreciation to everyone who contributed their research and experience to this piece.
Network partitions are a contentious subject. Some claim that modern networks are reliable and that we are too concerned with designing for theoretical failure modes. They often accept that single-node failures are common but argue that we can reliably detect and handle them. Conversely, others subscribe to Peter Deutsch’s Fallacies of Distributed Computing and disagree. They attest that partitions do occur in their systems, and that, as James Hamilton of Amazon Web Services neatly summarizes, “network partitions should be rare but net gear continues to cause more issues than it should.” The answer to this debate radically affects the design of distributed databases, queues, and applications. So who’s right?
A key challenge in this dispute is the lack of evidence. We have few normalized bases for comparing network and application reliability–and even less data. We can track link availability and estimate packet loss, but understanding the end-to-end effect on applications is more difficult. The scant evidence we have is difficult to generalize: it is often deployment-specific and closely tied to particular vendors, topologies, and application designs. Worse, even when an organization has clear picture of their network’s behavior, they rarely share specifics. Finally, distributed systems are designed to resist failure, which means noticeable outages often depend on complex interactions of failure modes. Many applications silently degrade when the network fails, and resulting problems may not be understood for some time–if they are understood at all.
In response to my earlier post on Redis inconsistency, Antirez was kind enough to help clarify some points about Redis Sentinel’s design.
First, I’d like to reiterate my respect for Redis. I’ve used Redis extensively in the past with good results. It’s delightfully fast, simple to operate, and offers some of the best documentation in the field. Redis is operationally predictable. Data structures and their performance behave just how you’d expect. I hear nothing but good things about the clarity and quality of Antirez' C code. This guy knows his programming.
So far we’ve examined systems which aimed for the CP side of the CAP theorem, both with and without failover. We learned that primary-secondary failover is difficult to implement safely (though it can be done; see, for example, ZAB or Raft). Now I’d like to talk about a very different kind of database–one derived from Amazon’s Dynamo model.