Jepsen: Percona XtraDB Cluster

Percona’s CTO Vadim Tkachenko wrote a response to my Galera Snapshot Isolation post last week. I think Tkachenko may have misunderstood some of my results, and I’d like to clear those up now. I’ve ported the MariaDB tests to Percona XtraDB Cluster, and would like to confirm that using exclusive write locks on all reads, as Tkachenko recommends, can recover serializable histories. Finally, we’ll address Percona’s documentation.

I didn’t use the default isolation levels

But there I need to add quite IMPORTANT addition: it may leave data in inconsistent state if you use SPECIAL TYPE of transactions in default isolation levels that Aphyr uses in his test.

My tests did not use the default isolation levels. I was quite explicit that every transaction in these tests ran with Serializable isolation. Most of Tkachenko’s response addresses InnoDB’s interpretation of Repeatable Read and does not (or rather, should not) apply to the Serializable transactions used in the test.

SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE set autocommit=0 select * from accounts where id = 0 select * from accounts where id = 1 UPDATE accounts SET balance = 8 WHERE id = 0 UPDATE accounts SET balance = 12 WHERE id = 1 COMMIT

Using Repeatable Read to test Galera’s support for Snapshot Isolation would have been inappropriate; Repeatable Read allows A3 phantom anomalies which are prohibited by Snapshot Isolation. Depending on the transactions involved, it would be legal for Galera to fail a Snapshot Isolation test if we used an isolation level lower than than Serializable.

This is not a single-node problem

Moreover, if we test the same workload on a simple single instance InnoDB, we will get the same result.

I’m guessing Tkachenko didn’t actually try this, because running the Snapshot Isolation test I described against a single node results in serializable histories every time:

INFO jepsen.core - Everything looks good! (‘ー`) {:valid? true, :perf {:valid? true, :latency-graph {:valid? true}, :rate-graph {:valid? true}}, :bank {:valid? true, :bad-reads []}}

Percona XtraDB Cluster performs these transactions correctly on a single node, but does not isolate them correctly when multiple nodes (or failover) are involved. This is likely not InnoDB’s fault; I believe this is a consequence of Galera’s replication not implementing Snapshot Isolation correctly. Specifically, it may be a bug in Galera’s handling of shared read-locks, which is how MySQL normally provides serializability.

Since Percona XtraDB Cluster does not provide the same invariants as a single XtraDB node, Percona should document this behavior.

There are workarounds!

I want to be very clear: this is a bug. Serializable transactions are not supposed to interleave in this way, even if Serializable means Snapshot Isolation under the hood. MySQL implements Serializability by promoting all selects to acquire shared locks, using LOCK IN SHARE MODE. Shared read locks should be sufficient, and are sufficient on a single node, to prevent data corruption in this workload. Across multiple nodes, shared read locks–either implicitly added by MySQL or explicitly included in the query–do not yield serializable histories.

The Galera ticket and Tkachenko’s post suggest two workarounds:

  1. Promote shared read locks to exclusive locks using SELECT ... FOR UPDATE.

  2. Rephrase transactions to do updates in-place, e.g. SET balance=balance-25.

The first appears to be valid, but as I noted earlier, isn’t supposed to be necessary. Shared locks are perfectly sufficient for serializable execution on a single node; their failure in Percona XtraDB Cluster likely stems from a bug in Galera. Also note that you must obtain exclusive locks even for pure-read transactions to avoid A5A Read Skew.

The second strategy… honestly, I’m not sure about. It does seem to pass, but I suspect it allows a few anomalous behaviors. First, Read Skew means the code which only applies the balance transfer if the resulting balances would remain positive is probably incorrect. Second, I’m not sure what mechanism in the query planner makes this execution correct–I suspect that this only passes because the window of concurrency for the read/write cycle is very short. In Jepsen we try to design transactions with long windows of concurrency because many anomalies are masked by short update cycles. Finally, this strategy also does not allow consistent reads; you’ll see Read Skew anomalies unless you apply SELECT ... FOR UPDATE.

What does Percona claim, anyway?

Tkachenko followed up on twitter

@tsantero Get your facts straight. Percona is not affiliated with http://galeracluster.com . Percona never claimed support SNAPSHOT ISOLATION

Agreed. Percona claimed something much stronger.

It’s been taken down now, but up until my post they did have a page discussing their safety guarantees in terms of CAP. This was the only documentation I could find on XtraDB’s transactional isolation.

Let’s take look into the well known CAP theorem for Distributed systems. Characteristics of Distributed systems:

  • C - Consistency (all your data is consistent on all nodes),
  • A - Availability (your system is AVAILABLE to handle requests in case of failure of one or several nodes ),
  • P - Partitioning tolerance (in case of inter-node connection failure, each node is still available to handle requests).

I know I sound like a broken record at this point, but this is not what the CAP Theorem says.

Gilbert & Lynch define consistency as linearizability–in an SQL system I suppose 1SR: Strong Serializability might be more apt. We have related proofs, however, classifying Cursor Stability, Repeatable Read, Snapshot Isolation, Serializability, and Strong Serializability as not totally-available in asynchronous networks.

Availability in CAP specifically refers to total availability: every request to a non-crashed node must succeed. This is different from partial availability, where some fraction of nodes can continue during a network failure.

Partition tolerance doesn’t require every node still be available to handle requests. It just means that partitions may occur. If you deploy on a typical IP network, partitions will occur; partition tolerance in these environments is not optional.

They went on to say:

Percona XtraDB Cluster has: Consistency and Availability.

Look at your life. Look at your choices.

OK first, claiming a system is CA means you’re claiming a property about the network: namely, that it can’t arbitrarily delay or drop packets. That claim is off the table the instant you let strangers run your software on their network, because chances are their network can and will partition.

Percona uses Galera for replication, and Galera’s documentation explains that a quorum is required for write availability. You need a majority component–optionally determined by node weights–to certify writesets. No majority, no writes. This is not an A system.

Galera allows stale reads so it definitely doesn’t provide linearizability, so it can’t have 1SR. That alone is sufficient to rule out the CAP Theorem’s C. But even if Percona had interpreted C as something weaker, say, Serializability or Snapshot Isolation, we’ve seen that neither of those is correct in Percona XtraDB Cluster. You can obtain Snapshot Isolation, at least in this particular workload, by limiting yourself to what Tkachenko calls a “SPECIAL TYPE of transactions”–adding exclusive locks to every read. But it certainly doesn’t provide these invariants in general.

Percona claimed to be CA, but actually provides neither C nor A.

When I mentioned this on Twitter they quickly took the page down, which is terrific! But now there’s nothing I can find in Percona’s docs describing what safety guarantees they actually do offer, and certainly no discussion of Read Skew and improper lock handling.

As a potential buyer of Percona XtraDB Cluster, I’m disappointed.

tobi
tobi, on

Denial on their part is a PR screw up. It only adds oil to the fire and makes even more people notice the problems. A public flamewar is the last thing they should engage in.

Denying severe bugs that clearly exist does not inspire confidence in a database product.

Jonathan
Jonathan, on

Agree with Tobi. It has gotten to the point where these bugs are commonplace enough (not saying that’s good) that the vendor should at least feel confident addressing it head on. I hope they will.

Felx
Felx, on

So, literally neither C nor A nor P.

What the fuck.

Henrik Ingo
Henrik Ingo, on

Hi Kyle

The inconsistency is easy to reproduce manually on a single node with REPEATABLE READ isolation level. Here…

Client 1:

mysql> select * from test; +------+------+ | id | a | +------+------+ | 1 | 10 | | 2 | 10 | | 3 | 10 | +------+------+ 3 rows in set (0.00 sec) mysql> begin; Query OK, 0 rows affected (0.00 sec) mysql> select * from test where id=1; +------+------+ | id | a | +------+------+ | 1 | 10 | +------+------+ 1 row in set (0.00 sec) mysql> select * from test where id=2; +------+------+ | id | a | +------+------+ | 2 | 10 | +------+------+ 1 row in set (0.00 sec) mysql> update test set a=11 where id=1; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0 mysql> update test set a=9 where id=2; Query OK, 1 row affected (0.01 sec) Rows matched: 1 Changed: 1 Warnings: 0

Client 2:

mysql> begin; Query OK, 0 rows affected (0.00 sec) mysql> select * from test where id=2; +------+------+ | id | a | +------+------+ | 2 | 10 | +------+------+ 1 row in set (0.00 sec) mysql> select * from test where id=3; +------+------+ | id | a | +------+------+ | 3 | 10 | +------+------+ 1 row in set (0.01 sec)

Client 1:

mysql> commit; Query OK, 0 rows affected (0.00 sec) mysql> select * from test; +------+------+ | id | a | +------+------+ | 1 | 11 | | 2 | 9 | | 3 | 10 | +------+------+ 3 rows in set (0.00 sec)

Client 2:

mysql> update test set a=15 where id = 2; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0 mysql> update test set a=5 where id = 3; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0 mysql> commit; Query OK, 0 rows affected (0.00 sec) mysql> select * from test; +------+------+ | id | a | +------+------+ | 1 | 11 | | 2 | 15 | | 3 | 5 | +------+------+ 3 rows in set (0.00 sec)

If you use FOR UPDATE, then Client 2 will block at SELECT ... WHERE id=2 FOR UPDATE until Client 1 commits.

henrik

mysql> commit; Query OK, 0 rows affected (0.00 sec)

mysql> select * from test; +——+——+ | id | a | +——+——+ | 1 | 11 | | 2 | 9 | | 3 | 10 | +——+——+ 3 rows in set (0.00 sec)

Dave
Dave, on

CRAP perhaps?

Mortine
Mortine, on

Hello,

to add oil to the fire (I like it, I am bad) if this guy can be CTO, me-self I can be dark-vador ; halloween party 8-) @Henrik, yes it’s obvious ; anyone having a little experience would jump from his chair, reading the answer and the silly demo of Tkachenko ; this is a joke and a sad one.

Gents have a good night.

– The very bad.

Post a Comment

Please avoid writing anything here unless you are a computer: This is also a trap:

Supports github-flavored markdown for [links](http://foo.com/), *emphasis*, _underline_, `code`, and > blockquotes. Use ```clj on its own line to start a Clojure code block, and ``` to end the block.

Copyright © 2017 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.