Jonathan
Jonathan, on

jsosic, the problem is that the master can be different for every queue, so which node would you configure your haproxy (or any other load-balancing mechanism) to point to? That is difficult to maintain. And if you decide to create all your queues on the same node so they share a master, this scales really badly, as all the resource management will occur on that one node.

Philip O'Toole
Philip O'Toole, on

Interesting article – thanks. Interestingly, I have very recently heard about a split-brained, correctly-configured ES cluster. I don’t know much else, but I know for a fact the cluster was following the advice in my blog post, and was running a recent release.

I have some feedback and questions.

You mention early on “…those which are read-only copies–termed “data nodes”–but this is primarily a performance optimization”. I’ve never heard of read-only nodes. Replicas, yes, which can serve search traffic but not index. Is this what you mean? Also, I understand data-nodes to mean any node that has shards – in contrast to node-clients, which just serve to route indexing and search requests.

Also in parts of the article you refer to “primary”, when I think you mean “master”. For example, you mention the election of a “primary”. In the context of ES, “primary” usually refers to a primary copy of a replicated shard, as opposed to the replica of the shard.

Swen Thümmler
Swen Thümmler, on

It would be interesting to see the test results with the elasticsearch-zookeeper plugin. I have made a version compatible with ES 1.2 and zookeeper 3.4.6 (https://github.com/grmblfrz/elasticsearch-zookeeper/releases/download/v1.2.0/elasticsearch-zookeeper-1.2.0.zip). I’m using this because of repeated split brains with multiple masters despite setting discovery.zen.minimum_master_nodes to n/2 + 1. Now the split brain conditions are gone…
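For a three-node cluster, for example, that formula works out to the following line in elasticsearch.yml:

    discovery.zen.minimum_master_nodes: 2   # floor(3/2) + 1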

mbonaci
mbonaci, on

The correct link for Solr is http://lucene.apache.org/solr/.

mbonaci
mbonaci, on

It was really wonderful to see how responsive the etcd maintainers were to your findings. To me, as a potential etcd user, the fact that they decided to change not only the docs but also implementation details is very encouraging, despite some arguable comments down the thread. I’d like to be able to say the same thing about Redis, though.

Keep going, Aphyr, you’re doing a remarkably important job here.

If I may, I’d like to suggest a couple of future “Jepsen adversaries”, from different families:

Thanks, Marko

mbdas
mbdas, on

Is it possible to evaluate against the zookeeper discovery plugin? I think this fork has more up-to-date compatibility: https://github.com/imotov/elasticsearch-zookeeper. I have seen lots of remarks mentioning that the zk plugin solves a lot of issues, including split brain. And it is not officially endorsed by the ES team, I think.

W. Andrew Loe III
W. Andrew Loe III, on

I believe this sentence is inverted:

“We can use the atomic constraint of linearizability to mutate state safely. We can define an operation like compare-and-set, in which we set the value of a register to a new value if, and only if, the register currently has some other value.”

Do you mean “… in which we set the value of a register to a new value if, and only if, the register currently has the value we are comparing against”?
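For what it’s worth, here is a minimal sketch of those semantics using Java’s AtomicReference (just an illustration of mine, not code from the post): the write succeeds only when the register currently holds the expected value.

    import java.util.concurrent.atomic.AtomicReference;

    public class CasExample {
        public static void main(String[] args) {
            AtomicReference<String> register = new AtomicReference<>("old");

            // Succeeds: the register currently holds the expected value "old".
            boolean first = register.compareAndSet("old", "new");

            // Fails: the register now holds "new", not the expected "old".
            boolean second = register.compareAndSet("old", "newer");

            System.out.println(first + " " + second + " " + register.get()); // true false new
        }
    }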

Eric
Eric, on

Great article, expertly augmented by Most Popular Girls in School.

State! State! State! State!

Chris
Chris, on

Love your pop culture references. I watched the whole thing just to see them all.

Xiao Yu
Xiao Yu, on

Thanks for the post, just wanted to note one thing.

It looks like you may have been using the wrong setting for ping timeouts; the correct setting appears to be discovery.zen.ping_timeout (note the underscore, not a dot, between ping and timeout), as it appears in the docs and code. It appears the default config file distributed with Elasticsearch is wrong.
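For example, in elasticsearch.yml (3s is, I believe, the documented default; the value is shown only for illustration):

    discovery.zen.ping_timeout: 3s   # underscore between "ping" and "timeout", not a dot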

Jason Kolb
Jason Kolb, on

Thanks for writing this up. We’re using Elasticsearch and I’ve raised these issues to several people there. Hopefully they’ll publish a response soon.

On another note, holy f*&k are those GIFs annoying.

Radu Gheorghe
Radu Gheorghe, on

If you want to reduce the time for failure detection, you’d need to decrease discovery.zen.fd.ping_timeout and ping_retries instead of discovery.zen.ping_timeout: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#fault-detection

The flip side to this (Captain Obvious to the rescue!) is that false positives may occur during small glitches like stop-the-world GC.
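For example, something like this in elasticsearch.yml (values purely illustrative) shortens the failure-detection window:

    discovery.zen.fd.ping_timeout: 10s   # how long to wait for each fault-detection ping
    discovery.zen.fd.ping_retries: 3     # failed pings before a node is considered dead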

Jason
Jason, on

Excellent article - thanks for writing these.

Mo

Very nice article. Thank you for the effort spent.

Jonathan Channon
Jonathan Channon, on

Is RavenDB next on the list?

Jerome
Jerome, on

Thanks a lot for this post (and all the others). Now how do we convince you to write a similar one for Solr? It would be really insightful to compare the two leading distributed search systems.

Chris
Chris, on

Just wanted to say that this series has been an exceptional contribution to discourse around distributed systems; thanks for all you’ve done so far, it’s been amazing reading.

Levi Figueira
Levi Figueira, on

I came here for the Barbie GIFs.

I like. :)

Alex
Alex, on

I love the animated gifs. That other guy just needs some epilepsy meds.

Nat
Nat, on

Awesome post! (But I hate those annoying GIFs; they’re a big distraction while reading this blog. Please avoid them in the future.)

ss

I like Jon :)

Jon
Jon, on

Animated images on articles = NO

Anders Hovmöller
Anders Hovmöller, on

“reduces the odd of losing” <- should be “odds”.

Great article! Scary. But great :P

shikhar
shikhar, on

Seems like the ES team is moving in the right direction with testing this stuff.

shikhar
shikhar, on

I’ve been Jepsen-testing ES using my discovery plugin (http://github.com/shikhar/eskka) and I tend to see far fewer writes lost (0-4 typically, using different partitioning strategies), which I attribute to it avoiding the split-brains with multiple masters that Zen is currently prone to. But of course there should really be no acked writes lost.

It might be worth investigating that write path to make sure it won’t ack a write until a majority confirm.

Agreed! TransportShardReplicationOperationAction, where this stuff is happening, goes like this, in case you are running with write consistency level quorum + sync writes:

  • check quorum
  • perform the write on primary
  • perform the sync write on replicas

Seems like this should work if the same cluster state is used throughout and it actually fails hard on each step. However, from what I see there is a bunch of logic in performReplicas() (https://github.com/elasticsearch/elasticsearch/blob/a06fd46a72193a387024b00e226241511a3851d0/src/main/java/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java#L557-L676) where it decides to take updated cluster state into account, and there seem to be exceptions where certain kinds of failures are tolerated.

I think this would be so much more straightforward if a write were fanned out and then blocked for up to the timeout, checking that the required number of replicas succeeded (with success on the primary being required).
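Roughly something like this, as a hypothetical sketch (Shard, sendAsync and friends just stand in for ES internals; this is not the actual TransportShardReplicationOperationAction code):

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.function.Consumer;

    class QuorumWriteSketch {
        // Stand-in for a shard copy that accepts a write and reports success or failure.
        interface Shard { void sendAsync(Object write, Consumer<Boolean> onDone); }

        static boolean replicate(Object write, Shard primary, List<Shard> replicas,
                                 int requiredCopies, long timeoutMillis) throws InterruptedException {
            AtomicInteger acks = new AtomicInteger();
            AtomicBoolean primaryOk = new AtomicBoolean(false);
            CountDownLatch done = new CountDownLatch(1 + replicas.size());

            // Fan the write out to the primary and every replica up front.
            primary.sendAsync(write, ok -> {
                if (ok) { primaryOk.set(true); acks.incrementAndGet(); }
                done.countDown();
            });
            for (Shard r : replicas) {
                r.sendAsync(write, ok -> {
                    if (ok) acks.incrementAndGet();
                    done.countDown();
                });
            }

            // Block for up to the timeout, then ack the client only if the primary
            // succeeded and enough copies in total confirmed the write.
            done.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return primaryOk.get() && acks.get() >= requiredCopies;
        }
    }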

jsosic
jsosic, on

You suggest a union of queues after partitions reconnect, but wouldn’t that mean messages already consumed on the majority partition would be re-queued from the minority partition(s)?

Also, would setting up haproxy in front of the RabbitMQ nodes help, such that haproxy always points to a single node while the other nodes are marked as backups? That way clients would always publish to the master via haproxy, and if a partition occurs, the master would go down, so nothing could be published until haproxy switches over to a backup node, which has to be in an active partition (if the RabbitMQ cluster is set up with pause_minority).
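Something like the following haproxy backend is what I have in mind (hostnames and ports are just placeholders): all traffic goes to one node, and the others only take over once it is marked down.

    # haproxy.cfg sketch -- hostnames and ports are placeholders
    backend rabbitmq
        server rabbit1 10.0.0.1:5672 check
        server rabbit2 10.0.0.2:5672 check backup
        server rabbit3 10.0.0.3:5672 check backup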

Randall
Randall, on

Far afield from the topic, but can’t not note:

Go’s GC is not particularly sophisticated yet, and there are production reports of ten second garbage collection pauses.

Later in the thread, the poster upgrades to a newer (but still 2012-vintage) Go with (partly) parallel GC and says “now GC duration reduced to 1~2 seconds”. Apparently on a 16GB heap too – wowza, don’t try this at home, kids.

Of course, your point is that random long pauses happen and mess up logic that depends on timing, and that is obviously still true. It is also still true that Go doesn’t do, say, generational GC like the JVM, and loses to it in GC-focused benchmarks. Just noting that the report at the top of that thread is not the latest.

Aphyr
Aphyr, on

It’s a buggy Raft implementation that considers itself leader while in the minority side of a partition.

Naw, this is fine. Raft allows multiple simultaneous leaders; the exclusivity invariant only holds for a given term. Leaders with different terms can run concurrently–and indeed, this is expected behavior during a partition.

This is why the appendEntries index+term constraint is so important–it guarantees that contemporary leaders don’t give rise to conflicting committed entries in the log. :)
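A rough sketch of that check, paraphrasing the Raft paper rather than any particular implementation: a follower accepts an AppendEntries call only if the leader’s term is current and the follower’s own log already holds an entry at prevLogIndex with a matching prevLogTerm, which is what stops concurrent leaders from committing conflicting entries.

    import java.util.List;

    // Sketch of the AppendEntries consistency check from the Raft paper; not any
    // particular implementation. The follower's log is represented by the terms
    // of its entries, 1-indexed conceptually.
    class AppendEntriesCheck {
        static boolean accept(List<Long> logTerms, long currentTerm,
                              long leaderTerm, long prevLogIndex, long prevLogTerm) {
            // Reject leaders with a stale term outright.
            if (leaderTerm < currentTerm) return false;
            // prevLogIndex == 0 means the leader is appending from the very start.
            if (prevLogIndex == 0) return true;
            // Otherwise our log must already contain an entry at prevLogIndex
            // whose term matches prevLogTerm.
            return prevLogIndex <= logTerms.size()
                && logTerms.get((int) prevLogIndex - 1) == prevLogTerm;
        }
    }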

Brian Olson
Brian Olson, on

It’s a buggy Raft implementation that considers itself leader while in the minority side of a partition. (I just read the Raft paper this week.) A Raft node can be a candidate during partition, but doesn’t get to be leader until it gets a vote from the majority of the cluster. Maybe there’s a bug in their leader logic or in the cluster membership change logic? The Raft paper says that all client requests should go through the leader and reads should recognize only committed results (that have been accepted by a majority of the cluster). ‘read anything’ mode will obviously improve availability and throughput, but then it’s not strict Raft anymore.
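As a sketch of that election rule (again paraphrasing the paper, not etcd’s actual code): a candidate only becomes leader once the votes it has collected, including its own, form a strict majority of the cluster, so a node cut off in a minority partition can never win.

    // Sketch of Raft's "votes from a majority" rule, paraphrasing the paper.
    class ElectionSketch {
        static boolean wonElection(int votesReceived, int clusterSize) {
            // e.g. 3 of 5 wins; a candidate isolated with only 2 of 5 nodes
            // reachable can gather at most 2 votes and stays a candidate.
            return votesReceived > clusterSize / 2;
        }
    }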

Aphyr
Aphyr, on

Haven’t heard an official response yet, Ed, but Rabbit’s been aware of these problems for years. I think it’s just gonna take time; either redesigning RabbitMQ clustering, or improving federation/shovel to the point where they can support the same use cases.

Ed Fine
Ed Fine, on

Thanks for an interesting article. What were the responses of the RabbitMQ team to your suggestions? Do they have an explanation for the anomalies you saw?

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.