Previously in Jepsen, we discussed Riak. Now we’ll review and integrate our findings.
This was a capstone post for the first four Jepsen posts; it is not the last post in the series. I’ve continued this work in the years since and produced several more posts.
We started this series with an open problem: what actually happens to a distributed system when the network fails? As notorious computer expert Joe Damato explains: “Literally no one knows.”
We’ve pushed the boundaries of our knowledge a little, though. By building a simple application which models a sequence of causally dependent writes, recording a log of that app’s view of the world, and comparing that log to the final state of the database, we were able to verify–and challenge–our assumptions about the behavior of various distributed systems. In this talk we discussed one particular type of failure mode: a stable network partition which isolated one or more primary nodes–and explored its consequences in depth.
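Here’s a rough sketch of that approach in Python, just to make the shape of the idea concrete; it is not the actual Jepsen harness, and `db.write` / `db.read_all` stand in for whatever client API your database exposes. Each client records which writes were acknowledged and which were indeterminate, and after the partition heals we diff that log against the database’s final state.

```python
# Rough sketch of the test approach (not the actual Jepsen code).
# `db.write` and `db.read_all` are hypothetical client calls.

acknowledged = set()   # writes the database claimed to accept
indeterminate = set()  # writes whose outcome we couldn't determine

def run_client(db, values):
    for v in values:
        try:
            db.write("test-key", v)     # e.g. add v to a set under one key
            acknowledged.add(v)
        except TimeoutError:
            indeterminate.add(v)        # "I don't know" -- not a failure

def check_final_state(db):
    final = set(db.read_all("test-key"))
    lost = acknowledged - final         # acknowledged but missing: data loss
    surprises = final - acknowledged - indeterminate
    print(f"acknowledged: {len(acknowledged)}  lost: {len(lost)}  "
          f"unexpectedly present: {len(surprises)}")
```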
In each case, the system did something… odd. Maybe we hadn’t fully thought through the consequences of the design, even when they were documented. Maybe the marketing or documentation were misleading, or flat-out lies. We saw design flaws, like the Redis Sentinel protocol. Some failures involved bugs, like MongoDB’s WriteConcern.MAJORITY treating network errors as successful acknowledgements. Other times we uncovered operational caveats, like Riak’s high latencies before fallback vnodes are set up. Every time, the unexpected behavior taught us something new about the challenge of building correct distributed systems.
In this series, we chose a simple network failure which we know happens to real production systems. The test encoded specific assumptions about concurrency, throughput, latency, timeouts, error handling, and conflict resolution. The results demonstrate one point in a high-dimensional parameter space. The fraction of dropped writes in these Jepsen demos can vary wildly for all of these reasons, which means we can’t make general assertions about how bad write loss will be in practice. Mongo could lose almost all of your writes, or none at all. It depends entirely on the nature of your network, application, server topology, hardware, load, and the failure itself.
To apply these findings to your systems–especially in fuzzy, probabilistic ways–you’ll need to measure your assumptions about how your system behaves. Write an app that hits your API and records responses. Cause some failures and see whether the app’s log of what happened lines up with the final state of the system. The results may be surprising.
Measurement isn’t something you do just once. Ideally, your production systems should be instrumented continuously for performance and correctness. Some of these failure modes leave traces you can detect.
Some people claim that partitions don’t happen to them. If you run in EC2 or other virtualized environments, noisy neighbors, network congestion, and outright network failures are well-known problems. Running your own hardware doesn’t make you immune either: Amazon, with some of the best datacenter engineers on the planet, considers partitions a serious enough problem that they were willing to design and build Dynamo. You are probably not Amazon.
Even if your network is reliable, logical failures can act as partitions, too. Nodes which become so busy they fail to respond to heartbeats are a common cause of failover. Virtual machines can do all kinds of weird things to your network and clocks. Restoring from a backup can look like a partition resolving. These failures are hard to detect, so many people don’t even know they occurred. You just… get slow for a while, or run across data corruption weeks or years later, and wonder what happened.
Aiming for correctness
We’ve learned a bunch of practical lessons from these examples, and I’d like to condense them briefly:
Network errors mean “I don’t know,” not “It failed.” Make the difference between success, failure, and indeterminacy explicit in your code and APIs. Consider extending consistency algorithms through the boundaries of your systems. Hand TCP clients ETags or vector clocks. Extend CRDTs to the browser itself.
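As a sketch of what that might look like in code (the names here are illustrative, not from any particular client library): represent the outcome of a network operation as a three-valued result rather than a boolean.

```python
from enum import Enum

class Outcome(Enum):
    OK = "ok"                  # the server definitively acknowledged the write
    FAILED = "failed"          # the server definitively rejected it
    INDETERMINATE = "unknown"  # timeout or dropped connection: it may have applied

def write_with_outcome(client, key, value):
    try:
        resp = client.put(key, value)   # hypothetical client call
        return Outcome.OK if resp.ok else Outcome.FAILED
    except (TimeoutError, ConnectionError):
        # The write may or may not have taken effect. Callers must retry
        # idempotently or reconcile later -- never assume it simply failed.
        return Outcome.INDETERMINATE
```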
Even well-known, widely deployed algorithms like two-phase commit have some caveats, like false negatives. SQL transactional consistency comes in several levels. You’re probably not using the stronger ones, and if you are, your code needs to handle conflicts. It’s not usually a big deal, but keep it on your mental checklist.
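If you do opt into the stronger isolation levels, conflict handling can be as simple as a retry loop. Here’s a hedged sketch against Postgres with psycopg2; the table and column names are invented, but the pattern of catching serialization failures (SQLSTATE 40001) and retrying is the important part.

```python
import psycopg2

def transfer(dsn, src, dst, amount, retries=5):
    conn = psycopg2.connect(dsn)
    conn.set_session(isolation_level="SERIALIZABLE")
    for attempt in range(retries):
        try:
            with conn.cursor() as cur:
                cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                            (amount, src))
                cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                            (amount, dst))
            conn.commit()
            return
        except psycopg2.Error as e:
            conn.rollback()
            if e.pgcode != "40001":  # 40001 = serialization_failure; anything else is real
                raise
    raise RuntimeError("transaction kept conflicting; giving up")
```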
Certain problems are hard to solve well, like maintaining a single authoritative record of data with primary failover. Consistency is a property of your data, not of your nodes. Avoid systems which assume node consensus implies data consistency.
Wall clocks are only useful for ensuring responsiveness in the face of deadlock, and even then they’re not a positive guarantee of correctness. Our clocks were completely synchronized in this demo and we still lost data. Even worse things can happen if a clock gets out of sync, or a node pauses for a while. Use logical clocks on your data. Distrust systems which rely on the system time, unless you’re running GPS or atomic clocks on your nodes. Measure your clock skew anyway.
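For example, a Lamport clock is about the smallest possible logical clock: it orders causally related events without trusting wall-clock time at all. (Vector clocks extend the same idea to detect concurrent updates.) A minimal sketch:

```python
class LamportClock:
    """A logical clock: orders causally related events without wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Call before every local event or outgoing message.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # Call when a message arrives, merging the sender's timestamp.
        self.time = max(self.time, remote_time) + 1
        return self.time
```

Attach the resulting timestamps (plus a node ID to break ties) to your writes, and resolve conflicts with those instead of the system time.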
Avoid home-grown distributed algorithms. Where correctness matters, rely on techniques with a formal proof and review in the literature. There’s a huge gulf between a theoretically correct algorithm and living, breathing software, especially with respect to latency, but a buggy implementation of a correct algorithm is typically better than a correct implementation of a terrible algorithm. Bugs you can fix. Designs are much harder to re-evaluate.
Choose the right design for your problem space. Some parts of your architecture demand consistency, and there is software for that. Other parts can sacrifice linearizability while remaining correct, like CRDTs. Sometimes you can afford to lose data entirely. There is often a tradeoff between performance and correctness: think, experiment, and find out.
Restricting your system with particular rules can make it easier to attain safety. Immutability is an incredibly useful property, and can be combined with a mutable CP data store for powerful hybrid systems. Use idempotent operations as much as possible: they enable all sorts of queuing and retry semantics. Go one step further, if practical, and use full CRDTs.
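As an illustration, here’s a sketch of a state-based G-Counter, one of the simplest CRDTs: each node increments only its own slot, and merge takes element-wise maxima, so merges are associative, commutative, and idempotent, and replicas converge no matter how updates are delivered, reordered, or duplicated.

```python
class GCounter:
    """A grow-only counter CRDT (sketch)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}   # node id -> highest count observed from that node

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Element-wise max: safe to apply in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())
```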
Preventing write loss in some weakly consistent databases, like Mongo, requires a significant latency tradeoff. It might be faster to just use Postgres. Sometimes buying ridiculously reliable network and power infrastructure is cheaper than scaling out. Sometimes not.
Replication between availability zones or between data centers is much more likely to fail than the network within a rack or behind an aggregation switch in your own datacenter. Microsoft estimates their WAN links offer 99.5% availability (if I recall correctly) and their LANs 99.95%. Design your system accordingly.
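A quick back-of-the-envelope calculation shows what those availability figures mean in practice:

```python
HOURS_PER_YEAR = 365 * 24

for name, availability in [("WAN link at 99.5%", 0.995),
                           ("LAN at 99.95%", 0.9995)]:
    downtime = (1 - availability) * HOURS_PER_YEAR
    print(f"{name}: roughly {downtime:.1f} hours of downtime per year")

# WAN link at 99.5%: roughly 43.8 hours of downtime per year
# LAN at 99.95%: roughly 4.4 hours of downtime per year
```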
Embracing failure
All this analysis, measuring, and designing takes hard work. You may not have the money, experience, hardware, motivation, or time. Every system entails risk, and not quantifying that risk is a strategy in itself.
With that in mind, consider allowing your system to drop data. Spew data everywhere and repair it gradually with bulk processes. Garbage-collect structures instead of ensuring their correctness on every operation. Not everyone needs correct behavior right now. Some people don’t ever need correct behavior. Look at the Facebook feed, or Twitter’s DMs.
Code you can reason about is better than code you can’t. Rely on libraries written and tested by other smart people to reduce the insane quantity of stuff you have to understand. If you don’t get how to test that your merge function is associative, commutative, and idempotent, maybe you shouldn’t be writing your own CRDTs just yet. Implementing two-phase commit on top of your database may be a warning sign.
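Testing those three merge laws is mechanical once you write them down. Here’s a sketch using the hypothesis property-testing library, with set union standing in for a G-Set’s merge function:

```python
from hypothesis import given
import hypothesis.strategies as st

def merge(a, b):
    return a | b   # G-Set merge: the union of two replicas' elements

replicas = st.frozensets(st.integers())

@given(replicas, replicas, replicas)
def test_merge_laws(a, b, c):
    assert merge(a, merge(b, c)) == merge(merge(a, b), c)   # associative
    assert merge(a, b) == merge(b, a)                       # commutative
    assert merge(a, a) == a                                 # idempotent
```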
Consistent, highly available systems are usually slow. There are proofs about the minimum number of network hops required to commit an operation in a CP system. You may want to trade correctness for performance for cost reasons, or to deliver a more responsive user experience.
I hope this work inspires you to test and improve your own distributed systems. The only reason I can talk about these mistakes is because I keep making them, over and over again. We’re all in this together. Good luck. :)
http://github.com/aphyr/jepsen
Thanks
Jepsen has consumed almost every hour of my life outside work for the last three months. I’m several hundred hours into the project now–and I couldn’t have done it without the help and encouragement of friends and strangers.
My sincerest thanks to my fellow Boundary alumni Dietrich Featherston and Joe Damato for the conversations which sparked this whole endeavor. Salvatore Sanfilippo, Jordan West, Evan Vigil-McClanahan, Jared Rosoff, and Joseph Blomstedt were instrumental in helping me understand how these databases actually work. Stephen Strowes and someone whose name I’ve sadly forgotten helped me demonstrate partitions on a local cluster in the first place. My deepest appreciation to the Postgres team, the Redis project, 10Gen and Basho for their support, and for making cool databases available to everyone for free.
Sean Cribbs and Reid Draper clued me in to CRDTs and the problems of LWW. Tom Santero and Mark Phillips invited me to give this talk at RICON East. Jepsen wouldn’t have existed without their encouragement, and I am continuously indebted to the pair. Zach Tellman, John Muellerleile, Josh O’Brien, Jared Morrow, and Ryan Zezeski helped refine my arguments and slides.
Hope I didn’t forget anyone–if so, please drop me a line. Thanks for reading.