Ian

This was a while ago; any chance we can get a redux for Cassandra?

Aphyr

Sam, transactions are implicitly begun at the first SELECT, and WITH CONSISTENT SNAPSHOT is redundant; consistent snapshots are automatically taken at the first SELECT from a transactional table.

Just to double-check, I re-ran the test with an explicit START TRANSACTION WITH CONSISTENT SNAPSHOT, and the results are unchanged.
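
For reference, a minimal sketch of the explicit form in MySQL/InnoDB (the surrounding queries are illustrative):

SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION WITH CONSISTENT SNAPSHOT;
-- Reads in this transaction see the snapshot taken at START TRANSACTION,
-- rather than one taken at the first SELECT.
SELECT * FROM accounts;
COMMIT;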

Bobby

Another fervent +1 for Hazelcast. As always, great post!

bernard

Cool!

Sam Wilson

I may be completely wrong, but is

SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET autocommit=0;

the right way to enable SI? I would have thought it would be:

SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET autocommit=0;
START TRANSACTION WITH CONSISTENT SNAPSHOT;

svar

Sorry, the last comment was mine, not anonymous!

anonymous

Federico,

The SELECT is there to prove that the total amount of money in the database is constant: in an SI system, no transaction should be able to read a state that could never have existed without concurrency or clustering. A thread or a remote node is much the same issue as far as consistency is concerned. Each implementation falls into an isolation level, and Galera definitely does not fall into SI.
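
A minimal sketch of that invariant-checking read, assuming a bank-style accounts table like the one in the post (names are illustrative):

-- Under SI, this read should always observe the same total,
-- no matter which transfers are in flight:
SELECT SUM(balance) FROM accounts;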

I agree with you that most users don’t really care, or dare to care, as long as they have a workaround. You have summed up the situation by asking to use SELECT … FOR UPDATE on a single node.

So if one database implementation is better adapted to 99% of the usage, it will be more popular; end of story. Users will learn the workarounds the hard way. That does not take away from the quality of the analysis, and it highlights what shortcut Galera took to attack an unsolvable problem: running concurrent transactions from far-away locations without losing too much performance. If I had to drive a Galera node inside the Opportunity rover on Mars, and someone asked where the robot is now: Galera tells you its last known position, and surely not where it is. With wsrep_causal_read, it tells you: wait until it gives me its next position. But by the time you receive it, the robot has already moved on. Certification asks the robot to stop moving; then, knowing where it is, we can ask it to go somewhere, but it would cost too much to stop it every time some blogger wants to know where it is. Or worse, to tell the robot to go back to the point I reported to that reader, because I may have given him the wrong information.

Noah

Wow. Usually hardcore technical blogs like Aphyr’s are an exception to “it’s the Internet, oh God don’t read the comments.”

But this seems to be much closer to the normal level of Internet comments. It’s clear quite a lot of them haven’t read the actual post, for instance, and they’re fine with completely wrong documentation on Galera’s part if there might kinda-sorta be a workaround.

So: thank you Aphyr. I’m sorry you wound up with this amount of flak. But I appreciate this post very much, and I suspect I’m not the only one.

Federico Razzoli

I’m not sure about the point of this post.

Galera has limitations? Yes. These limits are well-documented, so one can decide which applications can or cannot run on Galera. For example, you should not use MyISAM. Unfortunately, having no limitations is impossible for any software.

But your post fails to be merely useless: it is obviously misleading. Why “obviously”? Because it is meant to be read by developers, not marketing people. So, you cannot expect your reader to think “hey, I could need a query like that!”.

Your article has several levels of “badness”:

  • You show a SELECT + UPDATE. But the SELECT seems to be unnecessary.
  • But if you want to run useless queries, you could at least do it in the only logical way: UPDATE the field in place (SET field = field + num). Your problem will magically disappear.
  • Ok, let’s ignore logic. But I still expect you to know SELECT … FOR UPDATE. Yes, it’s locking, but you wanted this lock, and anyway, I don’t think it’s a problem. (Both are sketched below.)
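
A minimal sketch of both alternatives, assuming the bank-style accounts table from the post (names are illustrative):

-- In-place increment, with no preceding SELECT:
UPDATE accounts SET balance = balance + 10 WHERE id = 1;

-- Or, if the read is genuinely needed, take the row lock explicitly:
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
UPDATE accounts SET balance = 90 WHERE id = 1;
COMMIT;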

Andrei Elkin

A bit off-topic, but it’s unfair to call Oracle’s Group Replication a reimplementation of anything. It’s a genuine solution.

Indeed, both systems model results of research from the Swiss Federal Institute of Technology (EPFL); see the section on certification-based replication in “Understanding Replication in Databases and Distributed Systems”, M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, G. Alonso, 1999.

Aphyr

You are spending weeks or months demonstrating that a system is not SERIALIZABLE

Galera does not claim to be serializable, and this article does not evaluate that claim. Galera does claim to provide Snapshot Isolation, which is why I’ve tested it for Snapshot Isolation.

… when this has been a well-known design since 2011. But you don’t want to spend time demonstrating what we advocate as the correct deployment. MaxScale and HAProxy are on GitHub, and the failover setup is well documented.

The topology used in this test is one of Galera’s recommended deployment variants. Adding a proxy layer won’t fix any of the problems I discussed here: it’ll just shuffle the client->server connections.

Anyway, if I’m wrong here, Hadoop may also be wrong by design, and I don’t see any post on that topic yet.

I have no idea what you’re trying to say here. Hadoop is not a relational database. It has a completely different architecture, use case, and invariants.

svar

You are spending weeks or months demonstrating that a system is not SERIALIZABLE, when this has been a well-known design since 2011. But you don’t want to spend time demonstrating what we advocate as the correct deployment. MaxScale and HAProxy are on GitHub, and the failover setup is well documented. Anyway, if I’m wrong here, Hadoop may also be wrong by design, and I don’t see any post on that topic yet.

Gavin

Finally, within some databases you can obtain strong consistency by taking note of the current transaction ID

What is necessary for this to work? Is Postgres one of those databases? I imagine XID wraparound might be problematic.
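
One common Postgres shape of this idea tracks WAL positions rather than XIDs, which sidesteps wraparound; a rough sketch, assuming Postgres 10+ with a streaming replica:

-- On the primary, after the write commits, note the WAL position:
SELECT pg_current_wal_lsn();

-- On a replica, poll until that position has been replayed before
-- serving the read ('0/15E68D8' stands in for the value captured above):
SELECT pg_last_wal_replay_lsn() >= '0/15E68D8'::pg_lsn;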

Aphyr

I’m surprised by the shortcut; that response is not a proof.

Jepsen analyses take weeks to months of time, and you’ve given me a hand-waving description of an in-house system that I can’t read the source to, let alone test, and even if you had fully described your solution, I don’t have the time to write a full analysis. Proof of your system’s correctness is not my responsibility. If you think it works, that’s fantastic. I just can’t advocate for it myself.

svar

re,

“systems which rely on leader state, rather than causal chains associated with data operations themselves, tend to lose data during leader transitions.”

I’m surprised by the shortcut; that response is not a proof. Leader election takes place outside the causality of the cluster’s events. That design means that during a leader transition the system becomes CP, since a transaction started on the old leader would not be able to commit. Galera certification would ensure that the new leader is up to date with all writes. I would be happy for you to prove me wrong.

Galera provides a certain causal chain that is not SI; it has never claimed to be serializable, and for good reasons. Using it your way, it is an AP design with all the pain that comes with it. A proxy that claims to virtualize a Galera Cluster using leader election, but without forwarding errors (transaction replay), would risk breaking C, since R(x) would no longer be causal. I guess many users would choose AP during a leader transition instead of CP.

Aphyr

I feel that adopting last-write-wins (LWW) is dangerous.

I completely agree; LWW is a terrible idea. The only case in which it’s safe for Riak is when your data never changes, in which case OWW (one write wins) is semantically equivalent and protects you from accidentally losing data by trying to change it. The reason the test is structured this way is because even if you happen to have totally synchronized clocks, fallback vnodes can still induce data loss in Riak with LWW–it’s a weaker hypothesis than requiring unsynchronized clocks for equivalent results. :)

Ibrahim

Thank you for an excellent post.

“Riak is an excellent open-source adaptation of the Dynamo model. It includes a default conflict resolution mode of last-write-wins, which means that every write includes a timestamp, and when conflicts arise, it picks the one with the higher timestamp. If our clocks are perfectly synchronized, this ensures we pick the most recent value.”

I feel that adopting last-write-wins (LWW) is dangerous, because we can’t guarantee picking the most recent value even if we perfectly synchronize our wall clocks; we can never guarantee synchronized wall clocks, for many reasons:

  1. Clients and Riak/Cassandra servers are located in different regions, resulting in different timestamps.

  2. Riak/Cassandra servers (or replicas) can also have different timestamps among themselves, so the above scenario can occur. Even if we use NTP to synchronize clocks, differences of at least a few milliseconds can remain.

What do you think?

Seppo Jaakola

Indeed, the “first committer wins” rule was traded away for performance, and Galera Cluster does not support Snapshot Isolation any more. We have to fix the documentation accordingly. The performance impact of aborting SI ‘victims’ was remarkable in our benchmark profiles; this is because of InnoDB rollback, which is a rather heavy operation. But a lesson learned from your experiments is that we could support SI as a session property.

BTW, MariaDB Galera Cluster has a session variable, ‘wsrep_sync_wait’, for providing read causality. It does not bring you SI, but it delays readers until in-flight replicated transactions have committed. So with that, readers should see all earlier transactions that were accepted by the cluster.
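
A minimal example of the setting Seppo describes (the value 1 applies the causality wait to reads; the table name is illustrative):

SET SESSION wsrep_sync_wait = 1;
-- This read now waits until writes replicated before it
-- have been applied on this node:
SELECT SUM(balance) FROM accounts;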

Will Newby

I really enjoy these articles; they’re a great test of product documentation, and they make admins aware of how their systems can fail. Some requests for future tests:

Databases:

  • HBase
  • MS SQL Server’s Availability Groups

Distributed filesystems/storage:

  • Gluster
  • Possibly Ceph/RADOS

Keep up the great work!

Bhupinder

Oh man, I tried so hard to get the various MySQL cluster flavors set up and had to abort after weeks of spinning my wheels. Maybe someday!

Could you please describe the details: any specific issues you faced, or ways it didn’t work as you expected? Perhaps I can help.

anonymous

test

James Baiera

Great article, as usual. Just finished up Jay’s response article that he published. He does make an interesting point here: there’s both a Zookeeper quorum and a Kafka quorum in play.

It seems that the nature of the failure mode you’re demonstrating here is that the ISR’s replicated nodes are unreachable from the leader, but those replicas are still part of some group in Zookeeper that is allowed to partake in leader elections. Perhaps it makes more sense that, as soon as the ISR leader decides that some following node is unreachable, it should notify ZK that the lost follower should not be allowed to run for election.

This could also serve to notify the existing leader if it is the one that’s been ‘lost’. If it’s capable of requesting the removal of that node from the group in ZK, even if ZK is partitioned, it can only succeed in that request if it’s on the majority side of ZK’s partition. If ZK responds that it’s in read-only mode due to a partition, then the leader knows up front that it is the odd man out (even if it can still call most of its ISRs) and yields accordingly.

At that point, if the leader is valid and truly alone and on the majority side of ZK’s quorum, leadership becomes entirely dependent on whether or not the link with Zookeeper remains open (should ZK partition further, or the partition shift the majority); but should those quorums be split across multiple partitions, well, then it comes back to Jay’s point about the difference between being incorrect and alive vs. correct and dead.

Ultimately, the fact that this question has to be asked just further proves the confusion around classifying it as a “CA” system in the first place…

Jannis

The link to the test code seems broken; I get a 404 from GH.

Aphyr

The official “Group Replication” feature from MySQL, which is currently a lab/preview feature, is essentially a reimplementation of Galera and should be easy to test by just duplicating your test.

Oh man, I tried so hard to get the various MySQL cluster flavors set up and had to abort after weeks of spinning my wheels. Maybe someday!

It is worth noting that Percona XtraDB Cluster is the same thing as MariaDB Galera Cluster, based on exactly the same underlying tech from Galera. I am no longer affiliated with Percona, but I believe from my experience at VividCortex that it is far more widely deployed. My claim is based on the number of various kinds of databases our customers are monitoring with VividCortex. There is a surprising (even to me) amount of Percona XtraDB Cluster running in production out there.

You know, I just started looking into Percona yesterday–haven’t managed to get it installed yet, but I expect that it’ll show the same anomalies. Amusingly their documentation claims Percona is CA–e.g. fully available and linearizable–both of which should be readily disprovable. :)

Aphyr

I think your proof might be missing a case; couldn’t transactions start at the exact same time? (FWIW, I don’t think it changes the conclusion at all, just wanted to be super-pedantic.)

That’s a good catch–one of the subtler points of SI from the Berenson and Adya papers that I elided in the post is that start times and commit times are always distinct. There are more specific monotonicity requirements too, but I don’t think they matter for the proof. :)

Aphyr

It’s not a scoop, RTFM: The SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE.

It looks like you may have missed the extensive discussion of this “fucking documentation” in the post. I suggest you refer to Berenson, et al, remark 9: “REPEATABLE READ »« Snapshot Isolation.”

When used with a single leader it’s very close to a full CAP system: zero-data-loss failover and distributed committed reads are already a nice improvement over asynchronous replication for some read/write-splitting workloads.

In more general terms, systems which rely on leader state, rather than causal chains associated with data operations themselves, tend to lose data during leader transitions. This is a long-running theme in Jepsen: see Aerospike, Mongo, Elasticsearch, etc. Without more rigorous evidence of correctness, I can’t advocate for that kind of strategy here.

svar

A Jepsen education is always good; thanks for the delightful read!

Galera taught us a very good lesson about our dogmatic position on ACID: despite being only A?ID, easy failover and provisioning with minimal performance impact made it very popular!

We advise using it via leader election, and balancing transactions only across disjoint conflicting write domains; that’s why we bring a failover proxy, a.k.a. MaxScale, into this game: to bring back ACID capabilities for the 0.001% of SQL statements that need C.

When managing state transitions (scores, auctions, stocks), the implementation needs extra caution: pick a leader node, or loop until you get an affected row from a compare-and-set of the form SET W(R'(x) + 1) WHERE R(x) = R'(x) (sketched below).
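
A minimal SQL sketch of that compare-and-set loop (table and column names are illustrative; :old is the value read just before):

-- Read the current value:
SELECT balance FROM accounts WHERE id = 1;

-- Attempt the transition; it succeeds only if the value is unchanged:
UPDATE accounts SET balance = :old + 1 WHERE id = 1 AND balance = :old;

-- If zero rows were affected, another writer won: re-read and retry.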

It’s not a scoop, RTFM: The SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE. The reason for this is that there is no support for the SERIALIZABLE transaction isolation level in the multi-master use case, in either STATEMENT or ROW format.

If I get your point correctly: when used for writing everywhere, the isolation should not be declared SNAPSHOT-ISOLATION, but rather classified as an AP type of storage engine?

When used with a single leader it’s very close to a full CAP system: zero-data-loss failover and distributed committed reads are already a nice improvement over asynchronous replication for some read/write-splitting workloads.

In your conclusion, “the probability of data corruption” should probably read “the probability of data inconsistency”.

Tx, can’t wait for your next post!

Baron Schwartz

The official “Group Replication” feature from MySQL, which is currently a lab/preview feature, is essentially a reimplementation of Galera and should be easy to test by just duplicating your test.

It is worth noting that Percona XtraDB Cluster is the same thing as MariaDB Galera Cluster, based on exactly the same underlying tech from Galera. I am no longer affiliated with Percona, but I believe from my experience at VividCortex that it is far more widely deployed. My claim is based on the number of various kinds of databases our customers are monitoring with VividCortex. There is a surprising (even to me) amount of Percona XtraDB Cluster running in production out there.

zytek

Oh, this perfectly demonstrates how, amid the hype around “distributed, fault-tolerant systems”, things are not as their authors promise, and how they forget that even when every piece is fault-tolerant, the combined pieces, a.k.a. the ‘system’, may not be. Awesome job.

Anonymous

I think your proof might be missing a case; couldn’t transactions start at the exact same time? (FWIW, I don’t think it changes the conclusion at all, just wanted to be super-pedantic.)

clrnd

Nice article, loved the proof.

Regarding D3.js: in general, it mutates the underlying objects. That has bothered me a couple of times, but for such performance-sensitive work I can understand it.

sergei

Great article. I do use Mongo exactly as a “log bin”, and I’d agree that the documentation for Mongo is tricky: you really have to read it all to make sure you get what you want.

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.