Jepsen: Redis redux

2013-12-10

In a recent blog post, antirez detailed a new operation in Redis: WAIT. WAIT is proposed as an enhancement to Redis’ replication protocol to reduce the window of data loss in replicated Redis systems; clients can block awaiting acknowledgement of a write to a given number of nodes (or time out if the given threshold is not met). The theory here is that positive acknowledgement of a write to a majority of nodes guarantees that write will be visible in all future states of the system.

As I explained earlier, any asynchronously replicated system with primary-secondary failover allows data loss. Optional synchronous replication, antirez proposes, should make it possible for Redis to provide strong consistency for those operations.

WAIT means that if you run three nodes A, B, C where every node contains a Sentinel instance and a Redis instance, and you “WAIT 1” after every operation to reach the majority of slaves, you get a consistent system.

WAIT can be also used, by improving the failover procedure, in order to have a strong consistent system (no writes to the older master from the point the failure detection is positive, to the end of the failover when the configuration is updated, or alternative, disconnect the majority of slaves you can reach during the failure detection so that every write will fail during this time).

Antirez later qualified these claims:

I understand this not the “C” consistency of “CAP” but, before: the partition with clients and the (old) master partitioned away would receive writes that gets lost. after: under certain system models the system is consistent, like if you assume that crashed instances never start again. Of course, the existence of synchronous replication does not prove that the system is linearizable; only some types of failover preserve the ordering of writes.

As I showed in Call me maybe: Redis, Redis Sentinel will enter split-brain during network partitions, causing significant windows of data loss. Exactly how much data loss depends on the sentinel configuration and the failure topology. Antirez finally suggested that if we replace Redis Sentinel with a strongly consistent coordination service for failover, Redis WAIT could provide full linearizability.

The failover proposal

In a five-node cluster, assume every write is followed by WAIT 2 to ensure that a majority of nodes have received the write. In the event of a failure, a strong external coordinator goes through the following election process:

Totally partition the old primary P1.
Of all reachable nodes, identify the node with the highest replication offset. Let that node be P2.
Promote P2.
Inform all reachable nodes that they are to follow P2.
Have all reachable clients switch to the new primary.

There are several serious problems with this design. I hinted at these issues in the mailing list with limited success. Kelly Sommers pointed out repeatedly that this design has the same issues as Cassandra’s CL.ALL. Replication alone does not ensure linearizability; we have to be able to roll back operations which should not have happened in the first place. If those failed operations can make it into our consistent timeline in an unsafe way, perhaps corrupting our successful operations, we can lose data.

… surprisingly I think that transactional rollbacks are totally irrelevant.

Ultimately I was hoping that antirez and other contributors might realize why their proposal for a custom replication protocol was unsafe nine months ago, and abandon it in favor of an established algorithm with a formal model and a peer-reviewed proof, but that hasn’t happened yet. Redis continues to accrete homegrown consensus and replication algorithms without even a cursory nod to formal analysis.

OK, fine. Let’s talk about the failover coordinator.

The coordinator

Redis Sentinel is not linearizable; nor are its proposed improvements. Whatever failover system you’re planning to use here is going to need something stronger. In fact, we can’t even guarantee safety using a strong coordination service like ZooKeeper to serialize the failover operations, because ZooKeeper cannot guarantee the mutual exclusion of two services in the presence of message delays and clock skews. Let’s paper over that issue by introducing large delays and carefully ordering our timeouts.

It gets worse. Even if we did have a perfect mutex around the coordinator, two coordinators could issue messages to the same Redis nodes which arrive out of order. TCP does not guarantee ordering between two distinct TCP streams, which means we might see coordinator A initiate a failover process then time out halfway; followed by coordinator B which begins the failover process, only to be interrupted on some nodes by messages en-route through the network from coordinator A. Don’t believe me? TCP message delays have been reported in excess of ninety seconds. That one took out Github.

It gets even worse. If the original primary P1 is isolated from the coordinator, the coordinator will not be able to force P1 to step down. Indeed, P1 could remain a primary for the entire duration of a failover, accepting writes, making state changes, and attempting to replicate those changes to other nodes. This is dangerous because we cannot atomically guarantee that the new majority of nodes will reject those writes.

A client writes to P1, which replicates to secondaries S2, S3, S4, and S5.
The coordinator attempts to elect a new primary, and sees S2, S3, S4, and S5.
Without loss of generality, assume S2 has the highest replication offset. The coordinator promotes S2 to P2.
P1 receives acks from S3, S4, and S5, and, having reached a majority, returns success to the client.
The coordinator reparents S3, S4, and S5 to P2, destroying the write.

You might try to solve this by forcing S2–S5 into a read-only, non-replicating mode before attempting to promote a new primary, but that gets into a whole other morass of issues around multiple state transitions and partial failures. Suffice it to say: it’s difficult to solve this by simply pausing nodes first. Maybe impossible? I’m not sure.

Typically, replication protocols solve this problem by guaranteeing that writes from S1 can not be accepted after S2–S5 acknowledge to the coordinator that they will participate in a new cohort. This often takes the form of a ballot (Paxos), epoch (ZAB, Viewstamped Replication), or term (RAFT). Redis has no such construct, and antirez seems to eschew it as unnecessary:

In this model, it is possible to reach linearizability? I believe, yes, because we removed all the hard part, for which the strong protocols like Raft use epochs.

This brings us to a different, but related series of problems.

The servers

By using the offset in the replication log as the determining factor in which nodes are promotable, the proposed failover design opens the door for significant data loss.

Imagine the following sequence:

The primary P1, with log offset O1, becomes isolated from S3, S4, and S5.
Clients writing to P1 see their operations using WAIT 2 fail.
S3 is promoted to P3, with offset O1=O3. Clients writing to P3 see their writes succeed, replicated to S4 and S5.
More operations occur on P1 than on P3. O1 becomes greater than O3.
The partition heals; the coordinator can see both P1 and P3.
The coordinator sees that O1 is higher than O3, and chooses P1 as the new primary.
P3 is demoted, and all its acknowledged writes are destroyed.

Don’t believe me? Here, let’s try it. Here’s a function which implements (more or less) the proposed coordinator algorithm. Note that we’re not demoting the original primary because it may not be reachable.

(defn elect!
  "Forces an election among the given nodes. Picks the node with the highest
  replication offset, promotes it, and re-parents the secondaries."
  [nodes]
  (let [highest (highest-node nodes)]
    (log "Promoting" highest)
    (with-node highest
      (redis/slaveof "no" "one"))
    (doseq [node (remove #{highest} nodes)]
      (log "Reparenting" node "to" highest)
      (with-node node
        (redis/slaveof highest 6379)))))

And in the test, we’ll use WAIT to ensure that only writes which are successfully replicated to 2 or more replicas are considered successful:

      (add [app element]
        (try
          (redis/with-conn pool spec
            (redis/sadd key element))
          ; Block for 2 secondaries (3 total) to ack.
          (let [acks (redis/with-conn pool spec
                       (taoensso.carmine.protocol/send-request! "WAIT" 2 1000))]
            (if (< acks 2)
              (do
                (log "not enough copies: " acks)
                error)
              ok))
          (catch Exception e
            (if (->> e .getMessage (re-find #"^READONLY"))
              error
              (throw e))))

I’m gonna punt on informing clients which node is the current primary; we’ll just issue set-add requests to each node independently. Jepsen only cares about whether successful writes are lost, so we’ll let those writes fail and log ’em as unsuccessful.

Initially, the offset for all 5 nodes is 15. Writes complete successfully on P1 and fail on S2–S5.

We cut off P1 and S2 from S3, S4, and S5. S3, S4, and S5 all have equal offsets (1570), so we promote S3 to P3. As soon as the partition takes effect, writes to P1 begin to fail–we see not enough copies: 1, and an :error status for write 110, 115, and so on. Latencies on P1 jump to 1 second, since that’s how long we’re blocking for using WAIT.

Writes complete successfully on P3, since it can see a majority of nodes: itself, S4, and S5. We heal the partition and initiate a second election. Since P1’s offset (8010) is higher than P3’s (6487), we preserve P1 as a primary and demote all other nodes to follow it. All P3’s writes accepted during the partition are silently destroyed.

Note that there’s actually a window here where writes can successfully take place on either of P1 or P2 in a mixed sequence, depending on the order in which the secondaries are reparented. Both 560 and 562 complete successfully, even though 562 was written to S3, which was demoted at that point in time. Some weird opportunity for timing anomalies there.

These results are catastrophic. In a partition which lasted for roughly 45% of the test, 45% of acknowledged writes were thrown away. To add insult to injury, Redis preserved all the failed writes in place of the successful ones.

Additional issues

Two bugs amplify this problem. Note that this is the unstable branch, so this isn’t a huge deal right now:

First, Redis secondaries return -1 for their offset when they detect the primary is down. Returning a special status code makes sense… but not if you’re using the offset to determine which nodes become the primary. This could cause the highest nodes to appear the lowest, and vice versa. If a fresh node has offset 0, and all other nodes return offset -1, this could cause a cluster to erase all data ever written to it.

Second, Redis resets the replication offset to zero every time a node is promoted. Again, a reasonable choice in isolation, but it actually maximizes the chances that this particular failure mode will occur. The current design is biased towards data loss.

Even if these bugs were corrected, the problem could still occur. All that’s required is for more operations to happen on P1 than P3 after the two diverge.

Going forward

Distributed systems design is really hard, but engineers continue to assume otherwise:

However I think that distributed systems are not super hard, like kernel programming is not super hard, like C system programming is not super hard. Everything new or that you don’t do in a daily basis seems super hard, but it is actually different concepts that are definitely things everybody here in this list can master.

For sure a few months of exposure will not make you able to provide work like Raft or Paxos, but the basics can be used in order to try to design practical systems, that can be improved over time.

I assert just the opposite: we need formal theory, written proofs, computer verification, and experimental demonstration that our systems make the tradeoffs we think they make. Throughout the Redis criticism thread and discussion on Twitter, I see engineers assuming that they understand the tradeoffs despite the presence of gaping holes in the system’s safety net.

This behavior endangers users.

These list threads and blog posts are the sources that users come to, years later, to understand the safety properties of our systems. They’ll read our conjectures and idle thoughts and tease out some gestalt, and use that to build their systems on top of ours. They’ll miss subtle differences in phrasing and they won’t read every reply. Most won’t do any reading at all; they’re not even aware that these problems could exist.

Engineers routinely characterize Redis’s reliability as “rock solid”.

This is part of why I engage in these discussions so vocally. As systems engineers, we continually struggle to erase the assumption of safety before that assumption causes data loss or downtime. We need to clearly document system behaviors so that users can make the right choices.

We must understand our systems in order to explain them–and distributed systems are hard to understand. That’s why it’s so important that we rely on formal models, on proofs, instead of inventing our own consensus protocols–because much of the hard work of understanding has been done already. We can build on that work. Implementing a peer-reviewed paper is vastly simpler than trying to design and verify an algorithm from scratch–or worse, evolving one piecemeal, comprised of systems which encode subtly different assumptions about their responsibilities to the world. Those designs lead to small gaps which, viewed from the right angle, become big enough to drive a truck through.

I wholeheartedly encourage antirez, myself, and every other distributed systems engineer: keep writing code, building features, solving problems–but please, please, use existing algorithms, or learn how to write a proof.

david karapetyan on 2013-12-11

How long was it before people stopped writing their own encryption protocols? How many blog posts had to be written to convince application programmers that peer reviewed encryption schemes are better than whatever cockamamie algorithm they could come up with. I salute your efforts to educate and raise awareness but alas I fear you have embarked on a decade long journey then again even if your efforts get to at least one other person I think you can consider it a success.

antirez on 2013-12-11

Hello, thanks to Aphyr for spending the time to try stuff, but the model he tried here is not what I proposed:

https://gist.github.com/antirez/7901666

The model I suggest is non realistic, and a toy (it was just an exercise in distributed systems), but it is so strong that it is basically impossible to avoid it to be consistent because of the presence of the strong coordiantor. The test performed by Aphyr failed because it does not capture the semantics of what I proposed, and also because the current Redis implementation is broken in regard to the model (for example, when a slave is promoted to master it reset its replication offset to zero).

The point here is that the partitioned master, when the coordinator dictates it is unreachable, is no longer writable, and will be converted into a replica later. “We assume that a re-appearing master, or other slaves that are again available after partitions heal, are capable of understand what the new master is”.

The idea is that when you design a toy system like that, you can take pieces of the magical strong coordinator and can insert its capabilities one after the other into the actual nodes, with the same guarantees of liveness and safety.

For example, it is not easy to partition away the master completely in practice, how to do that? By versioning changes in the system (likely with an epoch or term) and say the slaves during the election process to don’t acknowledge writes anymore until a new configuration is delivered (I already suggested this in the google group thread linked here).

Another example, how do you deal in practice with the partitioned master incrementing its offset? Nodes claiming to be masters may not be selected, assuming this still guarantees liveness, but it is just an example to show the point, you start tearing apart the magic strong coordinator in order to end with a system that can actually be implemented.

Even if you don’t have any plan to write a distributed system like the above, the toy system is still useful to analyze. It is like when you try to analyze a block cipher with less rounds to understand the basic properties of your system. Studying it a bit I found two bugs in the replication offset handling that when fixed will result in practical better guarantees in the Redis failover process (but not strong consistency as repeated an infinite number of times).

Regards, Salvatore

@zinyando on 2013-12-11

I agree lets use proven algorithms instead of using ‘untested home grown solutions’. But I do get where antirez is coming from distributed systems are hard and he is doing the best he can with what he has learnt. I admire his willingness to learn.

antirez on 2013-12-11

Zinyando: thanks for your comment. Please note that the toy system was not something that I needed to implement, because Redis, as already said multiple times, is not going to provide strong consistency. My point was to show that “WAIT” per se is not broken not right, it is just a low-level replication tool. On top of this replication tool you can build things. For example in an AP system something like WAIT may improve durability by reaching more replicas. In Redis Sentinel elections WAIT makes more likely to pick a slave with a small or non-existent “hole” compared to the previous master, and so forth.

My feeling is that while theory is super important, people tend to stop at theory. I want to learn more distributed system theory so I try to read papers and books every time it is possible, but I don’t want to stop thinking about how the building blocks that are parts of many proven protocols work, can be combined, and so forth.

antirez on 2013-12-11

Sorry last comment, I read the post entirely only now, and it is very disappointing how this was presented.

Basically this was a sub-thread in the Redis mailing list thread where I informally proposed a model (that is misused and was not understood at all in the context of this article), just to show that WAIT per se is not consistent or not, it is simply not a “system” per se. Not only it was the model not real, just to show a point, but there was no intent to implement it.

The article here instead cherry-pick quotes in the mailing list thread to construct a story that does not exist. Actually a blog post I published recently, stated the exact contrary, that there is no interest for strong consistency in Redis Cluster.

This is very disappointing in my opinion, and highly unfair. Redis is believed to be rock solid because the implementation is solid, and the tradeoffs allow people to solve real problems. Nobody believes Redis is features strong consistency as a distributed system, nor it is going to do this.

If Aphyr was interested in a real discussion, I could just agree about the obvious, that if you can totally control the orchestration of the system then synchronous replication is obviously a building block that can be part of a strong consistent system. Apparently he as just interested in spreading FUD. Congratulations.

Pierre Chapuis on 2013-12-11

Antirez: this is the right way to do it. You cannot implement those protocols without understanding how they work anyway (*especially* consensus protocols).

However I agree with Aphyr that if you intend to use something non-standard in Redis Cluster it would be better to design it more formally and submit it to community scrutiny - it is obvious there are people out there who are very interested in trying to break your designs :) I don’t think anybody can get this kind of algorithm right by themselves the first time, and even without taking outside feedback into account, thinking formally helps find flaws in your own reasoning.

Here is what I mean by “more formal design”:

1) Define what is expected of the protocol (e.g. no loss of acknowledged writes).

2) Specify a failure model (How does the network fail? How do the nodes fail?).

3) Explicit safety properties the protocol depends on. Since you know Raft: I mean something like figure 3 page 5 of the paper (https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).

4) Write the outline of a proof of why your design achieves 1) under 2) thanks to 3).

After that, maybe use a model checker relying on formal temporal logic to validate parts of the algorithm. Aphyr suggested Lamport’s Temporal Logic of Actions. Personally I have only used Linear Temporal Logic (used in Spin and LTSA) so far. It is not always possible (or at least easy) to express all of the system in it, but you can at least validate parts of it. I would say it helps a lot to improve confidence in what you are doing (analogous to a test suite in software development).

That being said: personally I see Redis Cluster as something that replaces Sentinel (infrastructure management) and Twemproxy (help with sharding, consistent hashing, etc). But those are two very different systems with different needs. The main problem that requires consensus is master failover and Raft has been designed for that. If I understand well you like Raft so your solution should probably be something very similar…

As for WAIT, if you need something low-level like that to design your algorithms then fine, but I don’t see why it should be exposed to users, most of whom are not able to understand the trade offs it makes anyway. So I think it should not be part of the API, exactly like (if I am not mistaken) you do not provide a command to force a fsync.

Pierre Chapuis on 2013-12-11

Hmm, there was a race condition between me and Antirez :)

My comment was an answer to the one ending with:

I want to learn more distributed system theory so I try to read papers and books every time it is possible, but I don’t want to stop thinking about how the building blocks that are parts of many proven protocols work, can be combined, and so forth.

Re. Antirez’s latest comment: I do not think that Aphyr (or Kellabyte for that matter) is interested in “spreading FUD”. There are people who are somehow dishonest about Redis IMO and it was the original reason for the infamous thread, but I think these two are sincerely interested in improving something that isn’t really there yet (Cluster).

Pierre Chapuis on 2013-12-11

Last comment before I stop: some people on Twitter (*not* Aphyr or Kellabyte afaik) have written things that I consider really disrespectful towards Antirez.

Before saying things like “he doesn’t know CS”, consider that he has been a security researcher and invented a port scanning technique that is now mainstream, written a TCL interpreter, implemented Joy in TCL and, oh, written a little piece of software that stores data and, no matter what you think of it, is being used by many of the largest websites and web applications in the world.

So before you criticize him directly, ask yourself what you have done so far that gives you the legitimacy to mock.

Kelly Sommers on 2013-12-11

Redis WAIT is not a “low-level replication tool” as Antirez states. This is false.

This taken from Antirez blog is what is called a transaction protocol. http://antirez.com/news/66

redis 127.0.0.1:9999> set foo bar OK redis 127.0.0.1:9999> incr mycounter (integer) 1 redis 127.0.0.1:9999> wait 5 100 (integer) 7

Redis WAIT behaves similar as Cassandra’s ConsistencyLevel.ALL in that there are no real guarantees provided. It’s a best effort. As an optimization, Cassandra stores a hinted handoff for replicas who fail so that they can repair much quicker when they come back to being available to the rest of the cluster after partitions or intermittent failures. Cassandra is an eventually consistent database so this is part of its overall design goals. Redis WAIT could do the same thing but I’m not saying it should.

I want to be very clear here, WAIT is not a replication tool. It is a transaction protocol. Both return success or failure depending on whether the operation succeeded or failed (timeout). Your applications will act on these responses.

Since Aphyr proved Redis can preserved all the failed writes in place of the successful ones, this creates a difficult perspective for applications who act on the transaction responses.

antirez on 2013-12-11

Kelly: WAIT per se just returns an information about an event that happened, that is, if at least N replicas received a write. In a system that has certain properties, this may lead to consider the writes that reached the majority as safe. In other systems, including Redis Cluster, the returned value does not assure any special consistency property, you can still lose your data. However it is possible to show that there are specific failure modes where replicating to more nodes synchronous make the it less likely, probabilistically speaking, to lose data. No Guarantees.

About the proposed model, it was just an extreme, non practical example, to show that what the system provides with WAIT depends on the rest of the system, and indeed if you assume very strong properties of the rest, the system may feature strong consistency. The simplest system you can talk about, just to make a point, that has strong properties, is a magical system where a super-coordinator exists that avoids all the races that otherwise make the process hard.

Real consensus systems are all about doing the same without the super-coordinator, but using other means to reach the same safety level.

Ross B. on 2013-12-11

SPIN validation model or it didn’t happen.

Duarte Nunes on 2013-12-11

Side note on “ZooKeeper cannot guarantee the mutual exclusion of two services in the presence of message delays and clock skews.”: Can you elaborate a bit on this? Are you talking about the typical lock recipe?

Gustav on 2013-12-12

Aphyr: Why make a post about something from an unstable branch? The WAIT-command is not even partly tested or verified. It should be considered experimental. Antirez claims are very bold, I agree. However I don’t think this post is any constructive hanging-out Redis, if your point is that async replication does not belong in a distributed system you could just have spelled that out.

C. Scott Ananian on 2013-12-13

What Ross B. said. Spending time with SPIN will upend, and then repair, your brain.

I did my PhD on transactional memory systems. I spent several years designing lock-free algorithms before finally being prodded to use SPIN to verify them for the thesis. It turns out distributed (and lock-free) systems are perverse in unimaginable ways. I found so many subtle bugs when I sat down to formalize the system, that I would never consider writing an algorithm without SPIN (or some other such tool) by my side again. And further – I found significant bugs in other published papers, including fundamental papers in the lock-free space. I wasn’t dumb, and neither were the authors of the fundamental papers. It’s just that writing these algorithms is hard. The combinatorics of N processes interleaving their steps are just too hard to validate unaided.

john on 2015-03-28

thank you.

Jepsen: Redis redux

The failover proposal

The coordinator

The servers

Additional issues

Going forward

Post a Comment