Baishampayan Ghose
Baishampayan Ghose, on

Bravo, Kyle. Awesome work.

Alex Baranosky
Alex Baranosky, on

Hi,

Just throwing this out there as an interesting alternative way to implement one of your examples, rather than using the keep/when combo… I could easily have messed it up, because I haven’t executed it, but here goes anyway:

(defn client-write-complete
  "A reachable primary node can inform a client that its desired
  replication offset has been reached."
  [system]
  (->> (:clients system)
       (filter (partial valid-client? system))
       (map (juxt identity (fn [client] (majority-acked-offset system client))))
       (filter (fn [[client offset]] offset))
       (map (fn [[client _]]
              (-> system
                  (assoc-in [:clients (:name client) :waiting] nil)
                  (assoc-in [:clients (:name client) :writing] nil)
                  (log (ok-op (:name client) :write (:writing client))))))))

Best, Alex

antirez
antirez, on

Hello Aphyr,

13 days ago I wrote this in the Redis mailing list, about this model:

About non-linearizability: perhaps it does not apply to the case where a strong coordinator exists, but in the general case one issue is that we can’t just read, because a stale master could reply with stale data, breaking linearizability. There is a trick to force the read to be acknowledged that could work:

MULTI
INCR somecounter
GET data
EXEC
WAIT

I’m a distributed systems newbie, but I believe it was never breaking news that in asynchronously replicated systems, where the master applies a write as soon as it receives it, you can read non-acknowledged data, so indeed this was an obvious issue. The same problem also applies in well-designed strongly consistent systems and is well documented in the literature: in general you can’t just read from what you believe is the current master/leader. Conceptually, reads should follow the same path as writes, even if sometimes there are optimizations that can be used to make reads cheaper.
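For concreteness, here is a rough sketch of the MULTI/INCR/GET/EXEC/WAIT trick above from the client’s side, written in Clojure against the Jedis Java client. This is only my illustration of the idea, not code from the thread; the key name, replica count, and timeout are placeholders.

(import '[redis.clients.jedis Jedis])

(defn acked-read
  "Piggyback a read on a throwaway write inside MULTI/EXEC, then WAIT
  for the transaction to reach n replicas. Only return the read value
  if the master that served it could still replicate that widely."
  [^Jedis conn k n-replicas timeout-ms]
  (let [t     (.multi conn)
        _     (.incr t "somecounter")  ; dummy write that must be acked
        reply (.get t k)               ; queued; value available after exec
        _     (.exec t)
        acked (.waitReplicas conn n-replicas timeout-ms)]
    (when (>= acked n-replicas)
      (.get reply))))                  ; deref the queued GET's reply

If WAIT times out or reaches fewer than n replicas, the caller learns nothing about the read’s freshness and must retry or fail.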

But regardless of that, this model, which you already wrote two blog posts about, was never the interesting part of the discussion; it was just an argument about “WAIT is just a low-level primitive; the behavior of the system is the sum of all the parts”.

The original thread was about Redis Cluster, which is a system that does not feature strong consistency, only heuristics that try to bound how much replicas can diverge, and heuristics to select the history that likely has more writes when a failover is performed.

Ross B.
Ross B., on

@Aphyr Thanks for the clarification.

By the way, I completely agree with this: “This is the kind of argument you need to make, as a database engineer, before asserting a given system is linearizable” – it’s true for any type of asynchronous concurrent system really (lock-free algorithms, distributed message passing, etc). There are just too many places for bugs to hide to rely on intuition and common-case modelling. Anyone who doubts the need for some kind of formal method or automated verification should read the first few chapters of “Design and Validation of Computer Protocols” by Gerard J. Holzmann (creator of SPIN) – faulty protocols have a long and colorful history.

Aphyr
Aphyr, on

I mean that I tried several approaches, including one similar to the PARAGLIDER paper’s. I abandoned the brute-force exploration of the state space because I hit limits similar to the ones the PARAGLIDER paper describes. From page 8:

A straightforward way to automatically check whether a concurrent history has a corresponding linearization is to simply try all possible permutations of the concurrent history until we either find a linearization and stop, or fail to find one and report that the concurrent history is not linearizable. We refer to this approach as Automatic Linearization. While this approach is conceptually simple, its worst-case time and space complexity is exponential in the length of the concurrent history.

I call that the “brute force” approach, though SPIN does do all sorts of optimizations internally.
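To make that concrete, here is a toy brute-force checker for a single register, in Clojure. It is only a sketch of the permutation search described in the quote above, not Knossos’s algorithm; the operation representation (maps with :f, :value, and real-time :start/:end indices) is my own.

(defn precedes?
  "Real-time order: a completed before b was invoked."
  [a b]
  (< (:end a) (:start b)))

(defn step
  "Sequential register spec: a write sets the value; a read must see it."
  [reg op]
  (case (:f op)
    :write {:ok? true, :reg (:value op)}
    :read  {:ok? (= reg (:value op)), :reg reg}))

(defn linearizable?
  "Depth-first search over every permutation of the pending operations
  that respects real-time order; exponential in the history length, as
  the quote says. Assumes operations are distinct maps."
  ([history] (linearizable? nil history))
  ([reg pending]
   (or (empty? pending)
       (boolean
        (some (fn [op]
                (let [others (remove #{op} pending)]
                  (and (not-any? #(precedes? % op) others)
                       (let [{:keys [ok? reg]} (step reg op)]
                         (and ok? (linearizable? reg others))))))
              pending)))))

For example, (linearizable? [{:f :write, :value 1, :start 0, :end 3} {:f :read, :value 1, :start 1, :end 2}]) returns true, because the read overlaps the write and may be ordered after it.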

I don’t know the details of their use of SPIN, but I can tell you that where PARAGLIDER tops out at 20 operations for automatic extraction of linearization points, Knossos can check a 6-thread, 600-operation register history in a few seconds, and that’s without any symmetry reduction or commutative history profiling. I could be misinterpreting the paper though!

Ross B.
Ross B., on

Very interesting post. I’m still reading, but may I please request clarification on something? You wrote:

“Microsoft Research’s PARAGLIDER, in the absence of known linearization points, relies on this brute-force approach using the SPIN model checker. …[SNIP]… In my experiments, this approach started to break down around 12 to 16-operation histories.”

When you say “in my experiments” do you mean that you actually tried SPIN, or that you experimented with a “brute-force” approach?

SPIN applies a number of rather clever mathematical transforms to reduce the search space; it can therefore hardly be described as a “brute force” approach. Perhaps you mean “provably exhaustive”? So again: did you try SPIN, or just a naive brute-force search?

Thanks.

Harish N
Harish N, on

Thank you for this Clojure series. The exercises are also a great help, as they really test one’s learning.

Marko Bonaci
Marko Bonaci, on

I love you dude. And I’m straight! :)

Adit
Adit, on

You use type everywhere, but class in (class 'str). Is there a reason for this?

leaf
leaf, on

Thanks for your article. Would you please share your testing code? And how can I simulate a network partition? We are evaluating MongoDB for our project. Thanks.

Deps
Deps, on

Holy crap, this is great - gorgeous piece! And thanks for all the details. I’m so new to this woodworking thing, but raw wood slabs are getting me more and more excited all the time!

Aphyr
Aphyr, on

Does ‘defmacro’ have some kind of pattern matching for resolving these zero-argument and one-argument cases before moving on to the multi-argument case?

You’re right, Elf: defmacro (like defn) has arity dispatch, provided by the compiler. As for the recursive macro call, where or is defined in terms of itself, that’s also handled by the Clojure compiler. Inside (defn foo [] …), foo is lexically bound to the function itself by the compiler.
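For reference, clojure.core defines or in essentially this shape (docstring abbreviated here); the three arities are dispatched on argument count, and the recursive or in the syntax-quoted body is expanded again at macro-expansion time, consuming one form per expansion:

(defmacro or
  "Returns the first logically true value, or the last value."
  ([] nil)
  ([x] x)
  ([x & next]
   `(let [or# ~x]
      (if or# or# (or ~@next)))))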

Juan Manuel
Juan Manuel, on

Why don’t you publish this as a book? The contents deserve it.

C. Scott Ananian
C. Scott Ananian, on

What Ross B. said. Spending time with SPIN will upend, and then repair, your brain.

I did my PhD on transactional memory systems. I spent several years designing lock-free algorithms before finally being prodded to use SPIN to verify them for the thesis. It turns out distributed (and lock-free) systems are perverse in unimaginable ways. I found so many subtle bugs when I sat down to formalize the system that I would never again consider writing an algorithm without SPIN (or some other such tool) by my side. And further: I found significant bugs in other published papers, including fundamental papers in the lock-free space. I wasn’t dumb, and neither were the authors of those fundamental papers. It’s just that writing these algorithms is hard. The combinatorics of N processes interleaving their steps are just too hard to validate unaided.

Gustav
Gustav, on

Aphyr: Why make a post about something from an unstable branch? The WAIT command is not even partly tested or verified; it should be considered experimental. Antirez’s claims are very bold, I agree. However, I don’t think this post’s hanging Redis out to dry is constructive; if your point is that async replication does not belong in a distributed system, you could just have spelled that out.

Duarte Nunes
Duarte Nunes, on

Side note on “ZooKeeper cannot guarantee the mutual exclusion of two services in the presence of message delays and clock skews.”: Can you elaborate a bit on this? Are you talking about the typical lock recipe?

Ross B.
Ross B., on

SPIN validation model or it didn’t happen.

antirez
antirez, on

Kelly: WAIT per se just returns information about an event that happened; that is, whether at least N replicas received a write. In a system that has certain properties, this may lead one to consider the writes that reached the majority as safe. In other systems, including Redis Cluster, the returned value does not assure any special consistency property; you can still lose your data. However, it is possible to show that there are specific failure modes where replicating to more nodes synchronously makes it less likely, probabilistically speaking, to lose data. No guarantees.

About the proposed model: it was just an extreme, impractical example to show that what the system provides with WAIT depends on the rest of the system, and indeed, if you assume very strong properties of the rest, the system may feature strong consistency. The simplest system you can talk about that has strong properties, just to make a point, is a magical system where a super-coordinator exists that avoids all the races that otherwise make the process hard.

Real consensus systems are all about doing the same without the super-coordinator, but using other means to reach the same safety level.

Kelly Sommers
Kelly Sommers, on

Redis WAIT is not a “low-level replication tool” as Antirez states. This is false.

This, taken from Antirez’s blog, is what is called a transaction protocol. http://antirez.com/news/66

redis 127.0.0.1:9999> set foo bar
OK
redis 127.0.0.1:9999> incr mycounter
(integer) 1
redis 127.0.0.1:9999> wait 5 100
(integer) 7

Redis WAIT behaves similarly to Cassandra’s ConsistencyLevel.ALL in that there are no real guarantees provided; it’s a best effort. As an optimization, Cassandra stores a hinted handoff for replicas that fail, so that they can repair much more quickly when they become available to the rest of the cluster again after partitions or intermittent failures. Cassandra is an eventually consistent database, so this is part of its overall design goals. Redis WAIT could do the same thing, but I’m not saying it should.

I want to be very clear here: WAIT is not a replication tool. It is a transaction protocol. Both return success or failure depending on whether the operation succeeded or failed (timed out). Your applications will act on these responses.

Since Aphyr proved Redis can preserve all the failed writes in place of the successful ones, this creates a difficult situation for applications that act on the transaction responses.

Pierre Chapuis
Pierre Chapuis, on

Last comment before I stop: some people on Twitter (*not* Aphyr or Kellabyte afaik) have written things that I consider really disrespectful towards Antirez.

Before saying things like “he doesn’t know CS”, consider that he has been a security researcher, invented a port scanning technique that is now mainstream, written a Tcl interpreter, implemented Joy in Tcl and, oh, written a little piece of software that stores data and, no matter what you think of it, is being used by many of the largest websites and web applications in the world.

So before you criticize him directly, ask yourself what you have done so far that gives you the legitimacy to mock.

Pierre Chapuis
Pierre Chapuis, on

Hmm, there was a race condition between me and Antirez :)

My comment was an answer to the one ending with:

I want to learn more distributed systems theory, so I try to read papers and books every time it is possible, but I don’t want to stop thinking about how the building blocks that are part of many proven protocols work, can be combined, and so forth.

Re. Antirez’s latest comment: I do not think that Aphyr (or Kellabyte, for that matter) is interested in “spreading FUD”. There are people who are somewhat dishonest about Redis, IMO, and that was the original reason for the infamous thread, but I think these two are sincerely interested in improving something that isn’t really there yet (Cluster).

Pierre Chapuis
Pierre Chapuis, on

Antirez: this is the right way to do it. You cannot implement those protocols without understanding how they work anyway (*especially* consensus protocols).

However I agree with Aphyr that if you intend to use something non-standard in Redis Cluster it would be better to design it more formally and submit it to community scrutiny - it is obvious there are people out there who are very interested in trying to break your designs :) I don’t think anybody can get this kind of algorithm right by themselves the first time, and even without taking outside feedback into account, thinking formally helps find flaws in your own reasoning.

Here is what I mean by “more formal design”:

1) Define what is expected of the protocol (e.g. no loss of acknowledged writes).

2) Specify a failure model (How does the network fail? How do the nodes fail?).

3) Make explicit the safety properties the protocol depends on. Since you know Raft, I mean something like figure 3, page 5 of the paper (https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).

4) Write the outline of a proof of why your design achieves 1) under 2) thanks to 3).

After that, maybe use a model checker relying on formal temporal logic to validate parts of the algorithm. Aphyr suggested Lamport’s Temporal Logic of Actions. Personally I have only used Linear Temporal Logic (used in Spin and LTSA) so far. It is not always possible (or at least easy) to express all of the system in it, but you can at least validate parts of it. I would say it helps a lot to improve confidence in what you are doing (analogous to a test suite in software development).
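As a purely hypothetical illustration (my own, not from the thread) of how step 1) can be phrased in LTL: “no loss of acknowledged writes” for a given write w could be stated as the safety formula

$\Box\,(\mathit{acked}(w) \rightarrow \Box\,\lnot\mathit{lost}(w))$

read as “globally, once w has been acknowledged, it is never lost afterwards”. A checker like Spin then verifies such a formula against every interleaving that the failure model from step 2) permits.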

That being said: personally I see Redis Cluster as something that replaces Sentinel (infrastructure management) and Twemproxy (help with sharding, consistent hashing, etc). But those are two very different systems with different needs. The main problem that requires consensus is master failover and Raft has been designed for that. If I understand well you like Raft so your solution should probably be something very similar…

As for WAIT: if you need something low-level like that to design your algorithms, then fine, but I don’t see why it should be exposed to users, most of whom are not able to understand the trade-offs it makes anyway. So I think it should not be part of the API, exactly like (if I am not mistaken) you do not provide a command to force an fsync.

antirez
antirez, on

Sorry, one last comment. I read the post in its entirety only now, and it is very disappointing how this was presented.

Basically, this was a sub-thread in the Redis mailing list thread where I informally proposed a model (one that is misused and was not understood at all in the context of this article), just to show that WAIT per se is neither consistent nor inconsistent; it is simply not a “system” per se. Not only was the model not real, just there to show a point, but there was no intent to implement it.

The article here instead cherry-picks quotes from the mailing list thread to construct a story that does not exist. Actually, a blog post I published recently stated the exact contrary: that there is no interest in strong consistency for Redis Cluster.

This is very disappointing in my opinion, and highly unfair. Redis is believed to be rock solid because the implementation is solid, and the tradeoffs allow people to solve real problems. Nobody believes Redis features strong consistency as a distributed system, nor is it going to.

If Aphyr was interested in a real discussion, I could just agree about the obvious: that if you can totally control the orchestration of the system, then synchronous replication is obviously a building block that can be part of a strongly consistent system. Apparently he was just interested in spreading FUD. Congratulations.

antirez
antirez, on

Zinyando: thanks for your comment. Please note that the toy system was not something that I needed to implement, because Redis, as already said multiple times, is not going to provide strong consistency. My point was to show that WAIT per se is neither broken nor right; it is just a low-level replication tool. On top of this replication tool you can build things. For example, in an AP system something like WAIT may improve durability by reaching more replicas. In Redis Sentinel elections, WAIT makes it more likely to pick a slave with a small or non-existent “hole” compared to the previous master, and so forth.

My feeling is that while theory is super important, people tend to stop at theory. I want to learn more distributed systems theory, so I try to read papers and books every time it is possible, but I don’t want to stop thinking about how the building blocks that are part of many proven protocols work, can be combined, and so forth.

@zinyando
@zinyando, on

I agree: let’s use proven algorithms instead of “untested home-grown solutions”. But I do get where antirez is coming from: distributed systems are hard, and he is doing the best he can with what he has learnt. I admire his willingness to learn.

antirez
antirez, on

Hello, thanks to Aphyr for spending the time to try stuff, but the model he tried here is not what I proposed:

https://gist.github.com/antirez/7901666

The model I suggested is not realistic, and a toy (it was just an exercise in distributed systems), but it is so strong that it is basically impossible for it not to be consistent, because of the presence of the strong coordinator. The test performed by Aphyr failed because it does not capture the semantics of what I proposed, and also because the current Redis implementation is broken with regard to the model (for example, when a slave is promoted to master, it resets its replication offset to zero).

The point here is that the partitioned master, when the coordinator dictates it is unreachable, is no longer writable, and will be converted into a replica later. “We assume that a re-appearing master, or other slaves that are again available after partitions heal, are capable of understanding what the new master is.”

The idea is that when you design a toy system like that, you can take pieces of the magical strong coordinator and insert its capabilities one after the other into the actual nodes, with the same guarantees of liveness and safety.

For example, it is not easy to partition away the master completely in practice. How do you do that? By versioning changes in the system (likely with an epoch or term) and telling the slaves, during the election process, not to acknowledge writes anymore until a new configuration is delivered (I already suggested this in the Google group thread linked here).
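A minimal sketch of that epoch rule, purely my own reading of the paragraph above (the field and function names are invented, not Redis Cluster code):

(defn start-election
  "A slave that takes part in an election at epoch e stops acknowledging
  writes until a configuration at that epoch (or later) is delivered."
  [slave e]
  (assoc slave :voted-epoch e, :acking? false))

(defn install-config
  "Delivering a new master configuration at a high-enough epoch
  re-enables acknowledgements."
  [slave config]
  (if (>= (:epoch config) (or (:voted-epoch slave) 0))
    (assoc slave :master (:master config), :acking? true)
    slave))

(defn ack-write?
  "The partitioned old master no longer receives acks from this slave."
  [slave]
  (boolean (:acking? slave)))

Under this rule, the old master, stuck at a lower epoch, stops collecting acknowledgements as soon as the slaves enter the election, which is what keeps its WAITs from reaching a majority during the partition.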

Another example: how do you deal in practice with the partitioned master incrementing its offset? Nodes claiming to be masters may not be selected, assuming this still guarantees liveness. But it is just an example to show the point: you start tearing apart the magic strong coordinator in order to end up with a system that can actually be implemented.

Even if you don’t have any plan to write a distributed system like the above, the toy system is still useful to analyze. It is like when you analyze a block cipher with fewer rounds to understand the basic properties of your system. Studying it a bit, I found two bugs in the replication offset handling that, when fixed, will result in better practical guarantees in the Redis failover process (but not strong consistency, as repeated an infinite number of times).

Regards, Salvatore

david karapetyan
david karapetyan, on

How long was it before people stopped writing their own encryption protocols? How many blog posts had to be written to convince application programmers that peer-reviewed encryption schemes are better than whatever cockamamie algorithm they could come up with? I salute your efforts to educate and raise awareness, but alas, I fear you have embarked on a decade-long journey. Then again, even if your efforts reach at least one other person, I think you can consider it a success.

Metin Amiroff
Metin Amiroff, on

Wow, what an expressive and powerful language Clojure is! The last code example really nailed it for me. Thanks for this series; as others have already pointed out, it’s the best on the net. Keep it coming, please…

Elf M. Sternberg
Elf M. Sternberg, on

Kyle, awesome explanation, but one thing is bothering me. The (defmacro or…) example doesn’t make sense to me. It shows three expressions in a row. If ‘or’ already existed, I would understand how the first two (the empty set and a single set) short-circuit to their definitions, and the actual quasiquoted stuff doesn’t kick in until we get more than one expression to evaluate, but ‘or’ doesn’t exist; you’re defining it here.

Maybe my LISP is out of date (most of my Lisp experience is within Emacs), but I don’t see how this code resolves. Does ‘defmacro’ have some kind of pattern matching for resolving these zero-argument and one-argument cases before moving on to the multi-argument case?

x
x, on

+1 from me as well, looking forward to more.

eigenlicht
eigenlicht, on

Hey Kyle, great post. You mentioned the -> and ->> threading macros. How about mentioning the as-> macro too? One situation where it comes in handy is, for example, this: http://dpassen1.github.io/software/2013/11/17/embracing-the-as–macro/

It also helps in situations where you’d otherwise wrap certain function calls in fns, because those one or two calls don’t fit the pattern of the others inside the ->/->> form.

Wayne Conrad
Wayne Conrad, on

Perfectly paced: not too fast, not too slow, and your explanations are easy to understand. Thanks for writing these.

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.