jsosic, the problem is that masters can be different for every queue, so which node would you configure your haproxy (or any other load-balancing mechanism) to point to?
That is difficult to maintain.
And if you decide to create your queues on the same node so they share the same master, this scales really badly, as all the resource management will occur on that one node.
Philip O'Toole, on
Interesting article – thanks. Interestingly, I have very recently heard about a split-brained, correctly-configured ES cluster. I don’t know much else, but I know for a fact the cluster was following the advice in my blog post, and was running a recent release.
I have some feedback and questions.
You mention early on “…those which are read-only copies–termed “data nodes”–but this is primarily a performance optimization”. I’ve never heard of read-only nodes. Replicas, yes, which can serve search traffic but not index. Is this what you mean? Also, I understand data nodes to mean any node that holds shards – in contrast to node clients, which just serve to route indexing and search requests.
Also in parts of the article you refer to “primary”, when I think you mean “master”. For example, you mention the election of a “primary”. In the context of ES, “primary” usually refers to a primary copy of a replicated shard, as opposed to the replica of the shard.
It was really wonderful to see how etcd maintainers were so responsive to your findings.
To me, as a potential etcd user, the fact that they decided not only to change the docs, but also implementation details, is very encouraging, despite some arguable comments down the thread. I’d like to be able to say the same thing about Redis, though.
Keep going, Aphyr, you’re doing a remarkably important job here.
If I may, I’d like to suggest a couple of future “Jepsen adversaries”, from different families:
Thanks, Marko
Is it possible to evaluate against the zookeeper discovery plugin? I think this fork has more up-to-date compatibility.
https://github.com/imotov/elasticsearch-zookeeper
I have seen lots of remarks mentioning that the zk plugin solves a lot of issues, including split brain. And it is not officially endorsed by the ES team, I think.
W. Andrew Loe III, on
I believe this sentence is inverted:
“We can use the atomic constraint of linearizability to mutate state safely. We can define an operation like compare-and-set, in which we set the value of a register to a new value if, and only if, the register currently has some other value.”
Do you mean “… in which we set the value of a register to a new value if, and only if, the register has the value we are comparing”?
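To illustrate the semantics in question, here is a toy sketch using Java’s AtomicReference – nothing from the post itself, just the register behaviour being described:

    import java.util.concurrent.atomic.AtomicReference;

    public class CasExample {
        public static void main(String[] args) {
            AtomicReference<String> register = new AtomicReference<>("old");

            // Succeeds only because the register currently holds the expected value "old".
            // (compareAndSet compares references; the interned string literals make this work here.)
            boolean first = register.compareAndSet("old", "new");    // true

            // Fails: the register now holds "new", not the expected "old".
            boolean second = register.compareAndSet("old", "other"); // false

            System.out.println(first + " " + second + " " + register.get()); // true false new
        }
    }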
Eric, on
Great article, expertly augmented by Most Popular Girls in School.
State! State! State! State!
Chris, on
Love your pop culture references. I watched the whole thing just to see them all.
Xiao Yu, on
Thanks for the post, just wanted to note one thing.
It looks like you may have been using the wrong setting for ping timeouts; the correct setting appears to be discovery.zen.ping_timeout (note the underscore, not a dot, between ping and timeout), as it appears in the docs and code. It appears the default config file distributed with Elasticsearch is wrong.
Jason Kolb, on
Thanks for writing this up. We’re using Elasticsearch and I’ve raised these issues to several people there. Hopefully they’ll publish a response soon.
On another note, holy f*&k are those gifs annoying.
If you want to reduce the time for failure detection, you’d need to decrease discovery.zen.fd.ping_timeout and ping_retries instead of discovery.zen.ping_timeout: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#fault-detection
The flip side to this (Captain Obvious to the rescue!) is that false positives may occur during small glitches like stop-of-the-world GC.
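For concreteness, a sketch of how the two families of settings differ, written with the ES 1.x-era Java settings builder from memory (the timeout values are arbitrary, for illustration only, not recommendations):

    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.settings.Settings;

    public class ZenTimeouts {
        public static void main(String[] args) {
            Settings settings = ImmutableSettings.settingsBuilder()
                // Discovery phase: how long to wait for ping responses during master election.
                .put("discovery.zen.ping_timeout", "3s")
                // Fault detection: how quickly an already-formed cluster notices a dead node.
                .put("discovery.zen.fd.ping_timeout", "10s")
                .put("discovery.zen.fd.ping_retries", 3)
                .build();
            System.out.println(settings.getAsMap());
        }
    }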
Jason, on
Excellent article - thanks for writing these.
Mo, on
Very nice article. Thank you for the effort spent.
Jonathan Channon, on
Is RavenDB next on the list?
Jerome, on
Thanks a lot for this post (and all the others). Now how do we convince you to write a similar one for Solr? It would be really insightful to compare the two leading distributed search systems.
Chris, on
Just wanted to say that this series has been an exceptional contribution to discourse around distributed systems; thanks for all you’ve done so far, it’s been amazing reading.
Levi Figueira, on
I came here for the Barbie GIFs.
I like. :)
Alex, on
I love the animated gifs. That other guy just needs some epilepsy meds.
Nat, on
Awesome post! (But I hate those annoying gifs – a big distraction while reading this blog; please avoid them in the future.)
ss, on
I like Jon :)
Jon, on
Animated images on articles = NO
Anders Hovmöller, on
“reduces the odd of losing” <- should be “odds”.
Great article! Scary. But great :P
shikhar, on
Seems like the ES team is moving in the right direction with testing this stuff.
shikhar, on
I’ve been Jepsen-testing ES using my discovery plugin (http://github.com/shikhar/eskka) and I tend to see far fewer writes lost (0-4 typically, using different partitioning strategies), which I attribute to it avoiding the split-brains with multiple masters that Zen is currently prone to. But of course there should really be no acked writes lost.
It might be worth investigating that write path to make sure it won’t ack a write until a majority confirm.
Agreed! TransportShardReplicationOperationAction, where this stuff is happening, goes like this, in case you are running with write consistency level quorum + sync writes:
Seems like this should work if the same cluster state is used throughout and it actually fails hard on each step. However, from what I see there is a bunch of logic in performReplicas() (https://github.com/elasticsearch/elasticsearch/blob/a06fd46a72193a387024b00e226241511a3851d0/src/main/java/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java#L557-L676) where it decides to take into account updated cluster state, and there seem to be exceptions for certain kinds of failures being tolerated.
I think this would be so much more straightforward if a write were fanned out and then blocked for up to the timeout while checking that the required number of replicas succeeded (with success on the primary being required).
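Not ES’s actual write path, but a toy sketch of the fan-out-then-wait approach described above (all names here are made up for illustration):

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class QuorumWriteSketch {
        interface Replica {
            boolean write(byte[] doc) throws Exception;
        }

        // Fan the write out to every copy, then block for at most timeoutMs until
        // `required` copies have acked; fail the write otherwise.
        // (Simplification: this doesn't single out the primary as a mandatory ack.)
        static boolean quorumWrite(List<Replica> replicas, byte[] doc,
                                   int required, long timeoutMs) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
            CountDownLatch acks = new CountDownLatch(required);
            try {
                for (Replica r : replicas) {
                    pool.submit(() -> {
                        try {
                            if (r.write(doc)) {
                                acks.countDown(); // count only successful acks
                            }
                        } catch (Exception e) {
                            // a failed replica simply never acks
                        }
                    });
                }
                return acks.await(timeoutMs, TimeUnit.MILLISECONDS);
            } finally {
                pool.shutdownNow();
            }
        }
    }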
jsosic, on
You suggest a union of queues after reconnection of partitions, but wouldn’t that mean that messages already consumed on the majority partition would be re-queued from the minority partition(s)?
Also, would setting up haproxy in front of the RabbitMQ nodes help, in a way that haproxy would always point to a single node while the other nodes are marked as backups? That way clients would always publish to the master via haproxy, and if a partition occurs the master would go down, so nothing could be published until haproxy switches over to a backup node, which has to be in an active partition (if the RabbitMQ cluster is set up with pause_minority).
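Something like the following is the routing behaviour I have in mind – a toy sketch, not haproxy itself: always send traffic to the first healthy node in a fixed preference order, so the backups only take over when the preferred node is down.

    import java.util.List;
    import java.util.Optional;

    public class ActiveBackupRouter {
        interface Node {
            String name();
            boolean healthy();
        }

        // Pick the first node (in preference order: preferred node first, backups after)
        // that currently passes its health check.
        static Optional<Node> pick(List<Node> nodesInPreferenceOrder) {
            return nodesInPreferenceOrder.stream()
                    .filter(Node::healthy)
                    .findFirst();
        }
    }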
Randall, on
Far afield from the topic, but can’t not note:
“Go’s GC is not particularly sophisticated yet, and there are production reports of ten second garbage collection pauses.”
Later in the thread, the poster upgrades to a newer (but still 2012-vintage) Go with (partly) parallel GC and says “now GC duration reduced to 1~2 seconds”. Apparently on a 16GB heap too – wowza, don’t try this at home, kids.
Of course, your point is that random long pauses happen and mess up logic that depends on timing, and that’s obviously still true. It’s also still true that Go doesn’t do, say, generational GC like the JVM and loses to it in GC-focused benchmarks. I’m just noting that the report at the top of that thread is not the latest.
Aphyr, on
“It’s a buggy Raft implementation that considers itself leader while in the minority side of a partition.”
Naw, this is fine. Raft allows multiple simultaneous leaders; the exclusivity invariant only holds for a given term. Leaders with different terms can run concurrently–and indeed, this is expected behavior during a partition.
This is why the appendEntries index+term constraint is so important–it guarantees that contemporary leaders don’t give rise to conflicting committed entries in the log. :)
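For anyone who hasn’t read the paper, here is a toy sketch of that consistency check (my own illustration, not any particular implementation): a follower only accepts AppendEntries if its log already has an entry at prevLogIndex with a matching prevLogTerm, which is what keeps leaders from different terms from committing conflicting entries at the same index.

    import java.util.List;

    public class AppendEntriesCheck {
        record Entry(long term, String command) {}

        // Returns true if a follower with this log may accept entries that follow
        // (prevLogIndex, prevLogTerm). Raft log indices are 1-based in the paper.
        static boolean accepts(List<Entry> log, long prevLogIndex, long prevLogTerm) {
            if (prevLogIndex == 0) {
                return true;                       // appending from the very start of the log
            }
            if (prevLogIndex > log.size()) {
                return false;                      // follower's log is missing the prefix
            }
            return log.get((int) (prevLogIndex - 1)).term() == prevLogTerm;
        }
    }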
Brian Olson, on
It’s a buggy Raft implementation that considers itself leader while in the minority side of a partition. (I just read the Raft paper this week.)
A Raft node can be a candidate during a partition, but doesn’t get to be leader until it gets votes from a majority of the cluster.
Maybe there’s a bug in their leader logic or in the cluster membership change logic?
The Raft paper says that all client requests should go through the leader and reads should recognize only committed results (that have been accepted by a majority of the cluster).
A ‘read anything’ mode will obviously improve availability and throughput, but then it’s not strict Raft anymore.
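A toy illustration of that majority rule (my own sketch, not taken from any implementation): a candidate only wins with votes from a strict majority of the whole cluster, so the minority side of a partition can never elect a leader.

    public class MajorityVote {
        static boolean winsElection(int votesReceived, int clusterSize) {
            return votesReceived > clusterSize / 2; // strict majority of the full membership
        }

        public static void main(String[] args) {
            // A 5-node cluster partitioned 3/2: the minority side can collect at most 2 votes.
            System.out.println(winsElection(2, 5)); // false -> no leader in the minority
            System.out.println(winsElection(3, 5)); // true  -> the majority side can elect one
        }
    }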
Aphyr, on
Haven’t heard an official response yet, Ed, but Rabbit’s been aware of these problems for years. I think it’s just gonna take time; either redesigning RabbitMQ clustering, or improving federation/shovel to the point where they can support the same use cases.
Ed Fine, on
Thanks for an interesting article. What were the responses of the RabbitMQ team to your suggestions? Do they have an explanation for the anomalies you saw?
It would be interesting to see the test result with the elasticsearch-zookeeper plugin. I have made a version compatible with ES 1.2 and zookeeper 3.4.6 (https://github.com/grmblfrz/elasticsearch-zookeeper/releases/download/v1.2.0/elasticsearch-zookeeper-1.2.0.zip). I’m using this because of repeated split brains with multiple masters, despite setting discovery.zen.minimum_master_nodes to n/2 + 1. Now the split-brain conditions are gone…
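For reference, the quorum arithmetic behind that setting (a trivial sketch, not ES code): minimum_master_nodes should be a strict majority of the master-eligible nodes.

    public class QuorumSize {
        // floor(n / 2) + 1 master-eligible nodes form a strict majority.
        static int minimumMasterNodes(int masterEligibleNodes) {
            return masterEligibleNodes / 2 + 1;
        }

        public static void main(String[] args) {
            System.out.println(minimumMasterNodes(3)); // 2
            System.out.println(minimumMasterNodes(5)); // 3
        }
    }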
getting 404s on several of your links:
https://github.com/aphyr/jepsen/blob/master/src/jepsen/cassandra.clj#L35-L80 (Jepsen Test)
https://github.com/aphyr/jepsen/blob/master/src/jepsen/cassandra.clj#L116-L149 (Adding Elements to a CQL set)
The correct link for Solr is http://lucene.apache.org/solr/.