Jeremy
Jeremy, on

Riak implements vector clocks, and passes them to the client if last_write_wins is set to False and allow_mult is set to True. The default is allow_mult=False and last_write_wins=False, which still uses vector clocks but tries to figure out the merge itself rather than passing the vector clock to the client. If last_write_wins=True, then it has the behaviour that you mentioned above, but that’s not the default.
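The allow_mult behaviour described above can be sketched with a toy model: represent each vector clock as a dict of per-actor counters, and surface siblings whenever neither clock descends from the other. This is an illustration of the concept only, not Riak’s actual vclock encoding or API.

```python
# Toy vector clocks as {actor: counter} dicts; not Riak's real encoding.

def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def resolve(v1, v2):
    """With allow_mult=True, concurrent versions come back as siblings
    for the client to merge; a strict descendant simply wins."""
    if descends(v1["clock"], v2["clock"]):
        return [v1]
    if descends(v2["clock"], v1["clock"]):
        return [v2]
    return [v1, v2]  # concurrent: client must merge the siblings

a = {"clock": {"n1": 2}, "value": "x"}
b = {"clock": {"n1": 1, "n2": 1}, "value": "y"}
print(len(resolve(a, b)))  # 2: concurrent writes surface as siblings
```

With allow_mult=False, the database performs this resolution internally instead of handing both siblings back.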

ybrs
ybrs, on

nice read, but Riak doesn’t use timestamps. If my memory serves me correctly, they were strongly arguing against and blaming Cassandra for using timestamps…

http://docs.basho.com/riak/latest/theory/concepts/Vector-Clocks/

Lindsey Kuper
Lindsey Kuper, on

This post contains answers to the exact questions that my Ph.D. advisor has been asking me about CRDTs, but I’m embarrassed to show it to him because it’s too NSFW for academia.

Debbie
Debbie, on

Beautiful job! I love the linseed oil/mineral spirits idea. I think it looks more natural and has a smooth feel to it. I am inspired as well!

rwoodsmall
rwoodsmall, on

Excellent post. Having hardware GPS clock(s) running with centralized, slurred NTP services on hardware machines with stable RTCs in DST-less UTC doesn’t guarantee that time won’t flow backwards. It’s the subtle issues that pop up after this hits, like a scheduled job running multiple times (or never running, or overlapping itself when you’ve not implemented good locking procedures and a system eats its own tail, or…) that make this a scary problem. Bolting a scheduler onto an existing system exposes poor assumptions that eventually become hard problems, albeit semi-solved ones that require a lot of elbow grease.

Steve Allen
Steve Allen, on

Cassandra and Riak on virtualized platforms have looser requirements for precise time than systems which are tracking motion of physical objects. One way to work with a leaping wall clock is to note that POSIX does not really want to care what the meaning of “day” is, and POSIX also requires that there be a mechanism that can emit the time that everyone wants. Details are at http://www.ucolick.org/~sla/leapsecs/right+gps.html along with the preprint and slides from this scheme.

Aphyr
Aphyr, on

Along the lines of Zookeeper, how about something like Twitter’s Snowflake?

Flake ensures uniqueness (and therefore a total order), but since it relies on wall clocks instead of logical clocks, it may not be the total order you want. It gives you monotonicity if you always talk to the same flake server, but timestamps could jump back and forth arbitrarily if you talk to different nodes.
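The per-server monotonicity and cross-server reordering described above can be seen directly in the ID layout. The sketch below uses an illustrative Snowflake-style split (milliseconds, then worker ID, then sequence); the exact field widths are assumptions, not Twitter’s actual constants.

```python
# Illustrative Snowflake-style ID: timestamp | worker | sequence.
WORKER_BITS, SEQ_BITS = 10, 12

def flake_id(ms, worker, seq):
    return (ms << (WORKER_BITS + SEQ_BITS)) | (worker << SEQ_BITS) | seq

# One worker's IDs are monotonic: ms and seq only move forward.
assert flake_id(100, 1, 0) < flake_id(100, 1, 1) < flake_id(101, 1, 0)

# Across workers, order follows the wall clock, not real time:
second_write_slow_clock = flake_id(99, 2, 0)   # issued later, clock behind
first_write_fast_clock = flake_id(100, 1, 0)   # issued earlier, clock ahead
assert second_write_slow_clock < first_write_fast_clock
```

Because the timestamp occupies the high bits, any clock skew between nodes dominates the comparison, which is exactly why the total order may not be the one you want.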

anonymous
anonymous, on

Yes, the trouble with spammers: they’ve a sad pathetic life.

Justin Mason
Justin Mason, on

One way to help ameliorate the problem with NTP and LWW is to use the “-x” switch to ntpd, which forbids stepping, and enforcing that time can never move backwards (at least):

-x Normally, the time is slewed if the offset is less than the step threshold, which is 128 ms by default, and stepped if above the threshold. This option forces the time to be slewed in all cases. If the step threshold is set to zero, all offsets are stepped, regardless of value and regardless of the -x option. In general, this is not a good idea, as it bypasses the clock state machine which is designed to cope with large time and frequency errors. Note: Since the slew rate is limited to 0.5 ms/s, each second of adjustment requires an amortization interval of 2000 s. Thus, an adjustment of many seconds can take hours or days to amortize. This option can be used with the -q option.
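The convergence times implied by that slew limit are worth spelling out. A quick check of the arithmetic from the man page excerpt:

```python
# ntpd's maximum slew rate: 0.5 ms of correction per second of real time.
SLEW_RATE = 0.0005  # seconds corrected per second

def amortization_seconds(offset_s):
    """Wall-clock time needed to slew away a given offset."""
    return offset_s / SLEW_RATE

print(amortization_seconds(1))          # 2000.0 s per second of offset
print(amortization_seconds(60) / 3600)  # a 1-minute offset takes ~33 hours
```

So -x trades stepping for potentially very long periods of running slightly off true time, which is usually the right trade for LWW systems.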

For systems built on the JVM, this is pretty important, as backwards-stepping clocks can cause concurrency primitives to hang: http://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/

This doesn’t solve the other problems with use of timestamps, of course.

Alexander Sicular
Alexander Sicular, on

Along the lines of Zookeeper, how about something like Twitter’s Snowflake?

anonymous
anonymous, on

you’re shit

Khushhal Farooqi
Khushhal Farooqi, on

what is kinetosphere?

Aphyr
Aphyr, on

Yeah, that bug with MAJORITY acks made it in to 2.4.4, if I recall correctly.

Aphyr
Aphyr, on

Since the write of X fails in this case

I’m talking about both X and Y as successful writes.

anonymous
anonymous, on

Thank you for sharing this.

anonymous
anonymous, on

That’s not quite my understanding (http://regal.csep.umflint.edu/~swturner/Classes/csc577/Online/Chapter06/img23.html) but I’m not sure it matters.

Since the write of X fails in this case and the time of commit is not known, there is no specific expected ordering with regard to Y, but the system will eventually enforce some globally consistent order.

Aphyr
Aphyr, on

That’s sequential, aka strong, consistency, so we’re good… right? You seem to want it to be linearizable, aka strict, consistency which I didn’t think was being proposed. I guess the Cassandra docs are confused on this point.

I don’t think so. Writing X, then Y, then reading X is prohibited by sequential consistency, as well as session and monotonic consistency models.

Pierre
Pierre, on

Just a small comment to the atomicity issue in Cassandra. The error even occurs on a single computer when there is one writer and multiple readers. The writer creates a new row with two columns, and some readers see a single column …

Ben McCann
Ben McCann, on

Would love to see a Jepsen post for ElasticSearch!

anonymous
anonymous, on

it might help to read the initial Jepsen articles on what network partitions are, and what guarantees are achievable by asynchronous networks.

Yeah, I’m sure I’m still missing something. Will investigate.

including those where a single client in a single session writes X then Y at ConsistencyLevel.ALL, and reads X for all future reads at any consistency level.

That’s sequential, aka strong, consistency, so we’re good…right? You seem to want it to be linearizable, aka strict, consistency which I didn’t think was being proposed. I guess the Cassandra docs are confused on this point.

Aphyr
Aphyr, on

Yet quorum was satisfied for immediately prior reads of the same key?

Yes. Not all nodes are symmetric, and not all nodes are always fully connected. I’m not really sure how to explain this one any clearer–it might help to read the initial Jepsen articles on what network partitions are, and what guarantees are achievable by asynchronous networks.

https://issues.apache.org/jira/browse/CASSANDRA-2494

This only provides monotonicity if your timestamps are monotonic. Cassandra does not enforce or provide monotonic timestamps, which means that pretty much all causal histories are allowed, including those where a single client in a single session writes X then Y at ConsistencyLevel.ALL, and reads X for all future reads at any consistency level.
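The anomaly described above, where a session writes X then Y but reads X forever after, falls straight out of last-write-wins resolution over non-monotonic timestamps. A toy model (not Cassandra’s storage code):

```python
# Last-write-wins: the cell with the highest timestamp wins,
# regardless of the real-time order of the writes.

def lww_merge(cell_a, cell_b):
    return max(cell_a, cell_b, key=lambda c: c["ts"])

# One session writes X, then Y, but the second write is stamped by a
# coordinator whose clock lags the first coordinator's:
x = {"ts": 1000, "value": "X"}
y = {"ts": 999,  "value": "Y"}
print(lww_merge(x, y)["value"])  # "X": the later write Y is silently lost
```

No consistency level can repair this after the fact: every replica agrees that the cell with timestamp 1000 wins.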

Moonblade
Moonblade, on

It’s 2.4.6 now. Is the problem gone now? :D

Marco Serafini
Marco Serafini, on

Nice post! About the differences between Paxos and Zab, it is mainly due to the fact that Zookeeper uses passive replication. You can have a look here for more details http://arxiv.org/pdf/1308.2979v3

Jim Starkey
Jim Starkey, on

Any discussion of the CAP “theorem” revolves around the A – exactly what does availability mean?

The narrow sense is that availability means that all surviving nodes in all partitions continue to process work. This reduces the CAP idea to nothing more than “there is no consistency without communication.” Well, duh. This is of interest only to third-rate academics a few papers short of a tenure package.

A more useful interpretation is one that maximizes availability while ensuring that a) at most one partition survives, b) no transaction committed in the surviving partition is lost, and c) no transaction not actually committed in the surviving partition is reported or treated as committed.

A useful discussion of a database system vis-à-vis CAP is whether or not it maintains strict and robust ACID properties in the face of an arbitrary partition event, and the degree to which the surviving partition provides availability. This is not a low bar.

The design of NuoDB had these properties: at most one partition could survive a partition event, no transaction committed in the surviving partition could be lost, and no transaction in a non-surviving partition could be reported as committed (admittedly, this required the DBA to set the highest level of commit synchronization). As I left the company before V1 shipped, I will leave the discussion of whether the implementation was faithful to the design to others.

anonymous
anonymous, on

This test run (like most in the Jepsen series) did come from a partitioned cluster.

Ok, this wasn’t clear from the article.

This is a common misunderstanding of Dynamo systems. No such property exists.

https://issues.apache.org/jira/browse/CASSANDRA-2494

Because Quorum was not satisfiable by the coordinator for some writes during the partition.

Yet quorum was satisfied for immediately prior reads of the same key?

Aphyr
Aphyr, on

For now I’m advising people they should always use client generated timestamps if they have rows that are likely to have multiple writes with overlapping fields within a few seconds (allowing for shifts in the clocks).

Probably the best thing for those folks to do is use a strong coordinator like Zookeeper for their timestamps. Could encode a fixed membership ID in the lower timestamp bits, for instance, to guarantee participants don’t collide. Either that or deal with the probability of data corruption. It’s a tricky problem to talk about in general because the collision probabilities are dependent on the write timing, time sources, and so on.
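The fixed-membership-ID suggestion above can be sketched as a bit-packing scheme: take a counter from a strong coordinator (e.g. a Zookeeper sequence node) and reserve the low bits for a fixed member ID, so no two participants can ever emit the same timestamp. The field width here is an illustrative assumption.

```python
MEMBER_BITS = 8  # supports up to 256 fixed members; width is illustrative

def ordered_ts(coordinator_seq, member_id):
    """Pack a coordinator-issued sequence number above a fixed member ID."""
    assert 0 <= member_id < (1 << MEMBER_BITS)
    return (coordinator_seq << MEMBER_BITS) | member_id

# Two clients holding the same sequence still get distinct timestamps:
assert ordered_ts(42, 1) != ordered_ts(42, 2)
# And a higher coordinator sequence always dominates the member ID bits:
assert ordered_ts(43, 0) > ordered_ts(42, 255)
```

The cost, of course, is a round trip to the coordinator per timestamp, which is exactly the availability trade-off the post discusses.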

Christopher Smith
Christopher Smith, on

isn’t it the case that you’re constrained in precision by what the CPU can handle? The value returned in a “nano-second” timestamp may have misleading precision. By which I mean, if you get 2 nano-second timestamps 1ns apart, they may well show the same value anyway. It doesn’t matter how precise you think you’re being - if you get collisions at the millisecond level, using ‘nano’ times won’t necessarily save you.

I think you misunderstand the bug. Cassandra has logic to avoid collisions between two queries sent to the same node. The problem occurs with queries sent to two different nodes if those two nodes happen to assign the same timestamps to them. In this scenario, it really doesn’t matter how accurate the clocks are, but rather how much entropy exists in the timestamps. So, to a certain degree you are right about the “precision”, but only in the mathematical sense. Regardless of the precision of the system’s clock, by using nanoTime(), you maximize that entropy (as compared to using currentTimeMillis(), which ironically favours getting the “correct” time over the maximal entropy in the timestamp). You could do almost as well using currentTimeMillis() * 1000 + (new Random()).nextInt(1000).

Secondly, the bug isn’t that bad things can happen when the timestamps match (that’s true, but that is a larger design problem that at least is intrinsic to the Cassandra approach). The bug is that by using milliseconds instead of microseconds, you significantly increase the probability of that scenario happening that users might not anticipate.
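A back-of-the-envelope birthday bound shows how much that extra resolution buys. The write counts below are illustrative, not measurements from the post:

```python
import math

def collision_probability(n_writes, slots):
    """Birthday-bound approximation: chance that at least two of n_writes
    draw the same timestamp from `slots` equally likely values."""
    return 1 - math.exp(-n_writes * (n_writes - 1) / (2 * slots))

# 100 concurrent writes landing within the same second:
print(collision_probability(100, 1_000))      # ms resolution: ~99%
print(collision_probability(100, 1_000_000))  # us resolution: ~0.5%
```

Three extra decimal digits of timestamp turn a near-certain collision into a rare one, which is the whole substance of the milliseconds-vs-microseconds bug.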

For now I’m advising people they should always use client generated timestamps if they have rows that are likely to have multiple writes with overlapping fields within a few seconds (allowing for shifts in the clocks). The server timestamps are just broken.

Aphyr
Aphyr, on

The Cassandra cluster wasn’t partitioned

This test run (like most in the Jepsen series) did come from a partitioned cluster. If you don’t want a network partition, you could just wait for a node to fail, high latencies, compaction, node addition, node removal, etc.

guaranteeing repeatable read semantics from a single client operating at quorum, at least.

This is a common misunderstanding of Dynamo systems. No such property exists.

Why were only half the writes acknowledged in the first place?

Because Quorum was not satisfiable by the coordinator for some writes during the partition.

possibly the timestamp is reused if the client is fast?

Writes in Jepsen are scheduled, not running as fast as possible. IIRC that run was doing something like 1 write/sec/client, plus a bit of skew to simulate clients running on disparate nodes.

Evan
Evan, on

I have read the Riak article and read the source. The Cassandra cluster wasn’t partitioned so it should have behaved the same as the Riak cluster with the perfect lock, which succeeded.

Quorum reads in Cassandra synchronously propagate the value with highest timestamp to a quorum of nodes before returning, guaranteeing repeatable read semantics from a single client operating at quorum, at least.

The test generates a millisecond timestamp without adding a sequence counter, so possibly the timestamp is reused if the client is fast? Why were only half the writes acknowledged in the first place?

Paul Evans
Paul Evans, on

I think you mean “false negatives” in the paragraph near the end:

It also typically yields 0-2 false positives: likely due to writes proxied through n1 and n2 just prior to the partition, such that the write committed, but the acknowledgement was not received by the proxying node.

Writes that were successful, but the requesting client was never informed of such.

Aphyr
Aphyr, on

“Losing 28% of your supposedly committed data is not serializable by any definition. Next question.”

Would like an explanation of why this is happening.

Ah, you’re in luck! See the linked Riak article for a detailed explanation; in short, you have no guarantee that any given write attempt will atomically succeed or fail–it might come back from the dead at a later point and destroy conflicting writes.

Are you sure the read is also occurring at quorum consistency?

I could tell you “yes” again, but if you don’t believe me, you could read the source linked to in that paragraph. ;-) https://github.com/aphyr/jepsen/blob/master/src/jepsen/cassandra.clj#L57

struff
struff, on

Regarding timestamps: isn’t it the case that you’re constrained in precision by what the CPU can handle? The value returned in a “nano-second” timestamp may have misleading precision.

By which I mean, if you get 2 nano-second timestamps 1ns apart, they may well show the same value anyway. It doesn’t matter how precise you think you’re being - if you get collisions at the millisecond level, using ‘nano’ times won’t necessarily save you.

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.