Evan

“Losing 28% of your supposedly committed data is not serializable by any definition. Next question.”

Would like an explanation of why this is happening. Are you sure the read is also occurring at quorum consistency?

Spud

Pretty sure Chubby does a lot more than basic Paxos (master failover). Cass is doing a very basic Paxos state machine, although it is sharded, and I’m sure adding/removing nodes to the cluster adds additional testing fun :-).

Christopher Smith

nanoTime() is supposed to be a reliable measure of elapsed time, so if you calibrate its value against System.currentTimeMillis() it should work reliably.

Any differences between core TSCs should be well within the margin of error on the timestamp anyway (particularly considering we are normally dealing with differences between nodes and even networks).

Christopher Smith

I created a bug and submitted a patch to correct it: https://issues.apache.org/jira/browse/CASSANDRA-6106

Jeremiah Gowdy

Is System.nanoTime() safe to call between threads/CPUs? I believe on x86 System.nanoTime() uses the TSC, which isn’t always synced properly between different cores and/or sockets on some implementations. And System.currentTimeMillis() uses the wallclock? It seems like you’d need something based on hardware like HPET to get the consistency of timestamps you’re talking about.

Christopher Smith

I’m pretty sure I found the issue. It’s in org.apache.cassandra.service.QueryState. It’s bad, but not quite as bad as you might think:

    public long getTimestamp()
    {
        long current = System.currentTimeMillis() * 1000;
        clock = clock >= current ? clock + 1 : current;
        return clock;
    }

The mistake is not initializing current/clock from System.nanoTime(). If that had been done, it would have avoided this problem, and frankly the code could have been simpler. There is obviously an attempt to avoid collisions between timestamps from the same node, but with multiple nodes this won’t really help.
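To make that concrete, here is a minimal sketch of the approach being suggested (not the actual CASSANDRA-6106 patch; the class and field names are purely illustrative): calibrate the wall clock against System.nanoTime() once, then advance a monotonic microsecond counter from it.

    public final class MicrosecondClock {
        // Calibrated once: wall-clock microseconds and the nanoTime() sample they correspond to.
        private static final long EPOCH_MICROS = System.currentTimeMillis() * 1000;
        private static final long EPOCH_NANOS  = System.nanoTime();

        private long lastMicros = 0;

        // Returns a strictly increasing microsecond timestamp for this process.
        public synchronized long nextTimestamp() {
            long nowMicros = EPOCH_MICROS + (System.nanoTime() - EPOCH_NANOS) / 1000;
            lastMicros = nowMicros > lastMicros ? nowMicros : lastMicros + 1;
            return lastMicros;
        }
    }

As with the original code, this only guards against collisions within a single node; it does nothing for ordering across nodes.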

Christopher Smith

I was able to isolate the problem with the millisecond timestamps. If you specify a timestamp, you definitely get microsecond precision back from writetime(), so I am pretty sure the problem is with server-side generated timestamps. This is definitely a bug, but at the same time, if you have highly concurrent writes to a record, you should be using client-generated timestamps.
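For illustration, a small CQL sketch of the client-generated-timestamp approach (the keyspace, table, and column names here are hypothetical):

    -- Supply an explicit client-side timestamp, in microseconds since the epoch.
    INSERT INTO ks.users (id, name) VALUES (1, 'jane')
    USING TIMESTAMP 1380000000000123;

    -- writetime() then returns the full microsecond value supplied above,
    -- rather than a server-generated, millisecond-resolution timestamp.
    SELECT name, writetime(name) FROM ks.users WHERE id = 1;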

Liza

I would really like you to do this kind of analysis on Couchbase 2.0. Of course, I would like this because I’m using Couchbase… but I think its solutions to many of these issues are well thought out. I don’t know for sure, though.

Aphyr

Jay Kreps has written a great follow-up post with more details.

Andrew

“Kafka holds a new election and promotes any remaining node–which could be arbitrarily far behind the original leader. That node begins accepting requests and replicating them to the new ISR.”

This seems to violate the whole concept of the ISR set, no? It’d be great if your recommendation #2 makes it into 0.8. I’d much rather get pages about produce errors than have to figure out how to clean up inconsistent partitions.

Duarte Nunes

Finally, a success story :) Could you elaborate though on the false positives? Does the leader also proxy writes? Thanks!

tobi

Great post, but the animations are very distracting. Far worse than any blinking ad. I had to remove them one by one using Chrome Inspector.

Rajiv

I wrote a similar letter to Senator Feinstein and received a response full of platitudes and unverified accounts of the success of the mass-scale spying. Besides providing support to the EFF, I don’t know what I can do. We are slowly descending into an Orwellian society that will be very difficult to recover from. Writing to Feinstein is a symbol of defiance at best, given that she is a big believer in the program and always has been.

Howard Lee

Interesting read. I have always wondered what would happen to data once the primary goes down.

I would like to simulate this in my tests. Would you mind sharing the code?

Dave

Nice work! Giving me some inspiration…

Gates VP

On the one hand, Mongo advocates usually tell me “but network partitions are exceedingly rare in practice.” Then I talk to Mongo users who report their cluster fails over on a weekly basis.

At a top level, I think this points to another key problem with MongoDB, which is maintainability.

MongoDB is very complicated to administer. When a node in the system dies, that exact node has to be replaced. You can’t just bring up another piece of hardware and have it “join the cluster”; you need to replace the node that died, restore it from a backup, and then update the replica set to which it belonged.

And during this multi-minute or multi-hour process, you have to hope that no other node in the replica set also fails because then you’re in trouble.

If the DB is neither CP nor AP, and it’s hard to maintain, that really makes you wonder about its usability at all. http://qr.ae/T2iX4

Alan Robertson

A network partition doesn’t even have to be caused by network issues. Red Hat 2.6.18-2.6.20 had scheduler bugs that would stop heartbeats from being sent for hours - and it only happened under very light load. I’ve seen similar issues more recently in RHEL6 with some Linux boot parameters related to whether to sleep or go into the idle loop when there was nothing to do.

One feature we put into Linux-HA which got pulled after I left the project was the idea of having a single node be allowed to act as a tie-breaker for dozens or even hundreds of clusters. The original reason for its creation was split-site clusters, but it would have certainly solved many of the simultaneous STONITH shootings that Github saw.

Bogdan Matei

Hello! First of all, I would like to express my respect for your work!

Then I would like to ask you something (as I’m not familiar with Clojure). Where do those clients 0-4 connect in order to try to make those insert operations? On “n1”? If you partition “n1” out, then your clients should fail until it comes back. Then what’s the point of discussing the election of a new “master”? Sorry, I guess I’m missing your clustering model and how your clients interact with your cluster. Thank you.

Peter Odding

Kyle, I loved this series! Thank you so much for putting in the time & effort to create it and make it available. I came here from http://www.infoq.com/articles/jepsen hoping to find more detailed information and I was not disappointed! :-)

Google’s F1 was mentioned above; I know it as Spanner (the underlying core, IIUC). I agree with the previous poster; I couldn’t stop thinking about Spanner while reading this series.

It’s also interesting that you mention ZooKeeper here and there; I’ve been looking into it for a while. Your articles only confirm my intuition that it deserves my attention :-)

worldmind

$hash_of_hashes->{key}{key2}

is cleaner.

Sean

Regarding the Duke University “experiment”: it was not the experiment (which was fully within the BGP spec) but a bug in specific Cisco systems that caused the problem. http://www.cisco.com/en/US/products/csa/cisco-sa-20100827-bgp.html

Further details can be found at: https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment

Broc Seib

Awesome work Kyle.

Related and interesting: Google’s F1, using highly synchronized clocks… https://plus.google.com/u/0/+JuliaFerraioli/posts/Kz9qrac78cx

Also, I think you are spot on about the programmer-culture / social-engineering aspect of all this.** The tools and languages everyone uses don’t present this level of thinking to us while we code. I might like a language that exposes these tradeoff decisions right in my code so that I may write my application at a systems level. I don’t want to compile to a machine architecture anymore. I want to compile to a distributed architecture.

**Isn’t this all a reflection of the Law of Conservation of Complexity? “Complexity cannot be destroyed, only displaced…”

chiradeep

Also, the usually reliable long-distance telephone network went down in 1990: http://www.phworld.org/history/attcrash.htm

ChadF

Very informative (it definitely makes you think when pondering data/service recovery methods).

And to paraphrase a wise man… “With great STONITH power comes great responsibility.”

Marcel Kincaid

Nice survey article.

chiradeep

There’s also the AWS S3 outage from 2008 due to bit corruption in the gossip messages [http://status.aws.amazon.com/s3-20080720.html]. As a result, servers were not able to determine the true system state.

eas

Just a guess at one of the “right academic papers”: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf

Thanks for writing this series of articles.

Tom Stuart

‘If you write to Postgres, depending on your transaction’s consistency level, you might be able to guarantee that the write will be visible to everyone, just to yourself, or “eventually”.’

Doesn’t this depend on your readers' transaction isolation levels?

Matteo

Thank you for the excellent post!

Could you please recommend some papers on the subject? I am curious to know which “right academic papers” you are referring to.

Eric P

Really good stuff. I hope to see you look at Cassandra and HBase at some point.

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.