“Losing 28% of your supposedly committed data is not serializable by any definition. Next question.”
I’d like an explanation of why this is happening. Are you sure the read is also occurring at quorum consistency?
Spud, on
Pretty sure Chubby does a lot more than basic Paxos (master failover). Cassandra is doing a very basic Paxos state machine, although it is sharded, and I’m sure adding/removing nodes to the cluster adds additional testing fun :-).
Christopher Smith, on
nanoTime() is supposed to be a reliable measure of elapsed time, so if you calibrate its value against System.currentTimeMillis() it should work reliably.
Any differences between core TSCs should be well within the margin of error on the timestamp anyway (particularly considering we are normally dealing with differences between nodes and even networks).
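A minimal sketch of that calibration idea (a hypothetical helper class, not anything from Cassandra itself): capture both clocks once at startup, then derive microsecond timestamps from the elapsed System.nanoTime().

public final class MicrosClock
{
    // Wall-clock anchor (microseconds since the epoch) and monotonic anchor, captured together.
    private static final long WALL_MICROS_AT_START = System.currentTimeMillis() * 1000L;
    private static final long NANOS_AT_START = System.nanoTime();

    // Current time in microseconds since the epoch, with sub-millisecond resolution.
    public static long nowMicros()
    {
        return WALL_MICROS_AT_START + (System.nanoTime() - NANOS_AT_START) / 1000L;
    }
}

This only keeps timestamps comparable across nodes to the extent that the wall clocks were in sync when the anchors were captured, which is exactly the between-nodes caveat above.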
Is System.nanoTime() safe to call between threads/CPUs? I believe on x86 System.nanoTime() uses the TSC, which isn’t always synced properly between different cores and/or sockets on some implementations. And System.currentTimeMillis() uses the wallclock? It seems like you’d need something based on hardware like HPET to get the consistency of timestamps you’re talking about.
Christopher Smith, on
I’m pretty sure I found the issue. It’s in org.apache.cassandra.service.QueryState. It’s bad, but not quite as bad as you might think:
public long getTimestamp()
{
    long current = System.currentTimeMillis() * 1000;
    clock = clock >= current ? clock + 1 : current;
    return clock;
}
The mistake is not initializing current/clock from System.nanoTime(). If that had been done it would avoid this problem, and frankly the code could have been simpler. There is obviously an attempt to avoid collisions between timestamps from the same node, but with multiple nodes this won’t really help.
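Concretely, the suggestion amounts to something like this hedged rewrite (re-using the hypothetical MicrosClock sketch above; this is not the patch that actually shipped):

public long getTimestamp()
{
    // Seed from a microsecond-resolution clock derived from nanoTime()
    // instead of currentTimeMillis() * 1000; clock is the same long field
    // as in the original, so the monotonic-bump behavior is preserved.
    long current = MicrosClock.nowMicros();
    clock = clock >= current ? clock + 1 : current;
    return clock;
}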
Christopher Smith, on
I was able to isolate the problem with the millisecond timestamps. If you specify a timestamp you definitely get microsecond precision back from “writetime()”, so I am pretty sure the problem is with server-side generated timestamps. This is definitely a bug, but at the same time, if you have highly concurrent writes to a record, you should be using client-generated timestamps.
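For reference, CQL lets the client pick the write timestamp via USING TIMESTAMP; a hypothetical example (table, column, and values invented for illustration):

public class ClientTimestampExample
{
    public static void main(String[] args)
    {
        // The value after USING TIMESTAMP is microseconds since the epoch,
        // chosen by the client instead of the coordinator's clock.
        String cql = "UPDATE users USING TIMESTAMP 1380000000000000 "
                   + "SET email = 'a@example.com' WHERE id = 42";
        System.out.println(cql);
    }
}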
Liza, on
I really would like for you to do this kind of analysis on Couchbase 2.0. Of course, I would like this because I’m using Couchbase… but I think its solutions to many of these issues are well thought out, though I don’t know for sure.
Kafka holds a new election and promotes any remaining node–which could be arbitrarily far behind the original leader. That node begins accepting requests and replicating them to the new ISR.
This seems to violate the whole concept of the ISR set, no? It’d be great if your recommendation #2 makes it into 0.8. I’d much rather get paged about produce errors than have to figure out how to clean up inconsistent partitions.
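For context, later Kafka releases did expose settings for exactly this trade-off; a hedged sketch follows (the config names are from post-0.8 releases and the broker address is hypothetical, so this is not what the 0.8 beta discussed here offered):

import java.util.Properties;

public class KafkaDurabilityKnobs
{
    public static void main(String[] args)
    {
        // Producer side: only treat a write as committed once the full in-sync
        // replica set has acknowledged it.
        Properties producer = new Properties();
        producer.put("bootstrap.servers", "broker1:9092");
        producer.put("acks", "all");
        System.out.println(producer);

        // Broker/topic side (server.properties), shown as comments since it is not Java:
        //   unclean.leader.election.enable=false   (never promote a replica from outside the ISR)
        //   min.insync.replicas=2                  (fail acks=all writes if the ISR shrinks below this)
    }
}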
Duarte Nunes, on
Finally, a success story :) Could you elaborate though on the false positives? Does the leader also proxy writes? Thanks!
tobi, on
Great post, but the animations are very distracting. Far worse than any blinking ad. I had to remove them one by one using Chrome Inspector.
Rajiv, on
I wrote a similar letter to senator Feinstein and received a response full of platitudes and unverified accounts of the success of the mass scale spying. Besides providing support to the EFF I don’t know what I can do. We are slowly descending into an Orwellian society that will be very difficult to recover from. Writing to Feinstein is a symbol of defiance at best given how she is a big believer in the program and has always been one.
Howard Lee, on
Interesting read. I have always wondered what would happen to the data once the primary goes down.
I would like to simulate this in my tests. Would you mind sharing the code?
Dave, on
Nice work! Giving me some inspiration…
Gates VP, on
On the one hand, Mongo advocates usually tell me “but network partitions are exceedingly rare in practice.” Then I talk to Mongo users who report their cluster fails over on a weekly basis.
At a top level, I think this points to another key problem with MongoDB, which is maintainability.
MongoDB is very complicated to administer. When a node in the system dies, that exact node has to be replaced. You can’t just bring up another piece of hardware and have it “join the cluster”: you need to replace the node that died, restore it from a backup, and then update the replica set to which it belonged.
And during this multi-minute or multi-hour process, you have to hope that no other node in the replica set also fails because then you’re in trouble.
If the DB is neither CP nor AP and it’s hard to maintain, that really makes you wonder about its usability at all.
http://qr.ae/T2iX4
A network partition doesn’t even have to be caused by network issues. Red Hat 2.6.18-2.6.20 had scheduler bugs that would stop heartbeats from being sent for hours - and it only happened under very light load. I’ve seen similar issues more recently in RHEL6 with some Linux boot parameters related to whether to sleep or go into the idle loop when there was nothing to do.
One feature we put into Linux-HA which got pulled after I left the project was the idea of having a single node be allowed to act as a tie-breaker for dozens or even hundreds of clusters. The original reason for its creation was split-site clusters, but it would have certainly solved many of the simultaneous STONITH shootings that Github saw.
Bogdan Matei, on
Hello! First of all, I would like to express my respect for your work!
Then I would like to ask you something (as I’m not familiar with Clojure). Where do those clients 0-4 connect in order to make those insert operations? To “n1”? If you partition “n1” out, then your clients should fail until it comes back. Then what’s the point of discussing the election of a new “master”? Sorry, I guess I’m missing your clustering model and how your clients interact with your cluster. Thank you.
Peter Odding, on
Kyle, I loved this series! Thank you so much for putting in the time & effort to create it and make it available. I came here from http://www.infoq.com/articles/jepsen hoping to find more detailed information and I was not disappointed! :-)
Google’s F1 was mentioned above; I know it as Spanner (the underlying core, IIUC). I agree with the previous poster; I couldn’t stop thinking about Spanner while reading this series.
Also interesting that you mention Zookeeper here and there, I’ve been looking into it for a while. Your articles only confirm my intuition that it deserves my attention :-)
Also, I think you are spot on about the programmer-culture / social-engineering aspect to all this.** The tools and languages everyone uses don’t present this level of thinking to us while we code. I might like a language that exposes these tradeoff decisions right in my code so that I may write my application at a systems level. I don’t want to compile to a machine architecture anymore. I want to compile to a distributed architecture.
**Isn’t this all a reflection of the Law of Conservation of Complexity? “Complexity cannot be destroyed, only displaced…”
Very informative (it definitely makes you think when pondering data/service recovery methods).
And to paraphrase a wise man: “With great STONITH power comes great responsibility”.
Marcel Kincaid, on
Nice survey article.
chiradeep, on
There’s also the AWS S3 outage from 2008 due to bit corruption in the gossip messages [http://status.aws.amazon.com/s3-20080720.html]. As a result, servers were not able to determine the true system state.
‘If you write to Postgres, depending on your transaction’s consistency level, you might be able to guarantee that the write will be visible to everyone, just to yourself, or “eventually”.’
Doesn’t this depend on your readers' transaction isolation levels?
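A hedged JDBC sketch of the two settings in play here (connection string, credentials, and table are hypothetical): the writer’s commit setting on one side, and the reader’s isolation level the question refers to on the other.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PostgresVisibilitySketch
{
    public static void main(String[] args) throws Exception
    {
        // Writer-side dial: how synchronously a COMMIT is acknowledged (local WAL flush,
        // synchronous standby, etc.) affects when the write is durable and visible to others.
        try (Connection writer = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "pass");
             Statement st = writer.createStatement())
        {
            st.execute("SET synchronous_commit TO on");
            st.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)");
        }

        // Reader-side dial: the transaction isolation level governs which committed
        // snapshot this transaction observes, which is the questioner's point.
        try (Connection reader = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "pass"))
        {
            reader.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            reader.setAutoCommit(false);
            try (Statement st = reader.createStatement())
            {
                st.executeQuery("SELECT balance FROM accounts WHERE id = 1");
            }
            reader.commit();
        }
    }
}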
Matteo, on
Thank you for the excellent post!
Could you please recommend some papers on the subject? I was curious to know which are “the right academic papers” you are referring to.
Eric P, on
Really good stuff. I hope to see you look at Cassandra and HBase at some point.
I created a bug report and submitted a patch to correct the server-side timestamp precision issue: https://issues.apache.org/jira/browse/CASSANDRA-6106
Jay Kreps has written a great follow-up post with more details.
Another partition in the wild: http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html
$hash_of_hashes->{key}{key2}
more clean
Regarding the Duke University “experiment”: it was not the experiment, which was fully within the BGP spec, but a bug in certain Cisco systems that caused the problem. http://www.cisco.com/en/US/products/csa/cisco-sa-20100827-bgp.html
Further details can be found at: https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment
Awesome work Kyle.
Related and interesting: Google’s F1, using highly synchronized clocks… https://plus.google.com/u/0/+JuliaFerraioli/posts/Kz9qrac78cx
Also, the usually reliable long-distance telephone network went down in 1990: http://www.phworld.org/history/attcrash.htm
Just a guess at one of the “right academic papers”: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
Thanks for writing this series of articles.
http://rethinkdb.com/