Jepsen: MariaDB Galera Cluster

2015-09-01

Previously, on Jepsen, we saw Chronos fail to run jobs after a network partition. In this post, we’ll see MariaDB Galera Cluster allow transactions to read partially committed state.

Galera Cluster extends MySQL (and MySQL’s fork, MariaDB) to clusters of machines, all of which support reads and writes. It uses a group communication system to broadcast writesets and certify each for use. Unlike most Postgres replication systems, it handles the failure and recovery of all nodes automatically, and unlike MySQL Cluster, it has only one (as opposed to three) types of node. The MariaDB Galera packages are particularly easy to install and configure.

Galera Cluster uses the normal InnoDB isolation levels locally–but we’re interested in cluster-wide consistency guarantees. Between nodes, Galera claims to implement Snapshot Isolation–a reasonably strong consistency model.

Transaction isolation also occurs at the cluster level. Between transactions processing on separate nodes, Galera Cluster implements a transaction level called SNAPSHOT-ISOLATION. The SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE.

This is not strictly speaking correct–but to understand why, we’ll need to know a little more about SQL isolation levels.

SQL Consistency Models

The standard ANSI SQL isolation models define four isolation levels for transactions: read uncommitted, read committed, repeatable read, and serializable. Each level prevents an additional kind of unwanted phenomena: dirty reads, fuzzy reads, and phantoms. The standard’s definition of these phenomena is somewhat ambiguous, and has been extended to a more rigorous spec by Adya and others.

For instance, Read Uncommitted (RU) must prohibit Dirty Write, which Adya terms P0. In P0, one transaction modifies a record written by another before that transaction commits.

Dirty Write

(P0):

w₁(x) ... w₂(x)

In this notation, r₁(x) means “transaction T₁ read x”. A w denotes a write, a means abort, and c indicates transaction commit.

Read Committed (RC) prohibits P0 and also P1: Dirty Read. Repeatable Read (RR) goes one step further and prohibits P2: “Fuzzy Read”. Finally, Serializability (1SR) adds one final exclusion: P3, or “Phantom”. Serializability is a very strong constraint: all transactions must appear to execute atomically–results should be equivalent to a system in which the transactions ran strictly one after the next, rather than concurrently.

Dirty Write	(P0):	w₁(x) ... w₂(x)	Prohibited by RU, RC, RR, 1SR
Dirty Read	(P1):	w₁(x) ... r₂(x)	Prohibited by RC, RR, 1SR
Fuzzy Read	(P2):	r₁(x) ... w₂(x)	Prohibited by RR, 1SR
Phantom	(P3):	r₁(P) ... w₂(y in P)	Prohibited by 1SR

There’s a neat kind of symmetry here: P1 and P2 are duals of each other, preventing a read from seeing an uncommitted write, and preventing a write from clobbering an uncommitted read, respectively. P0 prevents two writes from stepping on each other, and we could imagine its dual r₁(x) … r₂(x)–but since reads don’t change the value of x they commute, and we don’t need to prevent them from interleaving. Finally, preventing P3 ensures the stability of a predicate P, like a where clause–if you read all people named “Maoonga”, no other transaction can sneak in and add someone with the same name until your transaction is done.

If you’re having trouble figuring out what these isolation levels actually allow, you’re not alone. The anomalies prevented (and allowed!) by Read Uncommitted, Read Committed, etc are derived from specific implementation strategies. If you use locks for concurrency control, and lock records which are written until the transaction commits (a “long” lock), you prevent P0. If you add a short lock on reads (just for the duration of the read, not until commit time), you prevent P1. If you acquire long locks on both writes and reads you prevent P2, and locking predicates prevents P3. The standard doesn’t really guarantee understandable behavior–it just codifies the behavior given by existing, lock-oriented databases.

Locks aren’t the only ways we can protect transactional integrity. Some MVCC databases, like Oracle and Postgres, offer a different kind of model: Snapshot Isolation.

Snapshot Isolation

Snapshot Isolation arises from Berenson et al’s critique of the ANSI SQL isolation levels. The paper is subtle and dense, but the gist of it is that there are some anomalies allowed by the original SQL definitions that are undesirable, and we might define a new isolation level that gives us most, but not all, of the benefits of Serializability, while still being relatively efficient.

In Snapshot Isolation, every transaction operates on an isolated snapshot of the committed data, taken at any time prior to the transaction’s first read. The transaction can freely modify its snapshot, and those values are visible within the transaction, but are not visible to other transactions unless the transaction commits.

A transaction T₁ may commit if the data it plans to write has not been written by any other transaction which committed after T₁’s initial snapshot was taken. If a committed transaction conflicts with T₁’s writes, T₁ must abort. This is called First-committer-wins, and prevents transactions from stepping on each other’s updates without seeing them. This is the “lost update” anomaly: P4.

P4: r₁(x) … w₂(x) … w₁(x) … c₁

Like Repeatable Read, Snapshot Isolation prohibits P0, P1, and P2, but allows some types of Phantom anomalies (P3). It is therefore weaker than Serializability, and strictly stronger than Read Committed. However, it differs from Repeatable Read in two respects:

Repeatable Read allows an anomaly called A3–a type of Phantom, which SI prevents.
Snapshot isolation allows an anomaly called A5B, called Write Skew, which RR prevents.

A3 occurs when T₁ reads some predicate P, T₂ modifies that predicate set and commits, and T₁ subsequently reads that predicate set again and commits.

A3: r₁(P) … w₂(y in P) … c₂ … r₁(P) … c₁.

Note that A3 is a specific case of P3: it only applies if both transactions commit, whereas P3 includes all cases where a predicate set is read and modified by another transaction. Snapshot Isolation precludes A3 (but still allows some other anomalies in P3) because T₁ always reads the data from its snapshot. So with respect to this particular anomaly, Snapshot Isolation is stronger than Repeatable Read.

On the other hand, SI allows A5B, or Write Skew. Write Skew occurs when two transactions read two distinct values, then write on top of the other transaction’s read. Formally:

A5B: r₁(x) … r₂(y) … w₁(y) … w₂(x) … (c₁ and c₂)

A5B is prevented by Repeatable Read, but allowed by Snapshot Isolation since the write sets of both transactions do not overlap. We also know of one additional anomaly allowed by Snapshot Isolation, in which a read-only transaction can sneak in between two transactions which legally commit out of order because their write sets are disjoint.

A6: r₂(x) … w₁(y) … c₁ … r₃(x) … r₃(y) … c₃ … w₂(x) … c₂

Diagram from Berenson et al showing the relationship between isolation levels

In light of this, we should revisit Galera’s claim that “SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE.” This is not quite correct: Berenson and Adya are clear that SNAPSHOT ISOLATION lies between Read Committed and Serializable, but is neither a superset nor subset of Repeatable Read.

Given Snapshot Isolation allows these anomalies, a natural question to ask is whether we can prevent the anomalies, obtaining full serializability, by restricting the classes of transaction that can run on the system. It turns out the answer is yes: we can force the transactions to conflict by promoting some reads to writes (forcing the write set to intersect and preventing commit), or by analyzing dependency cycles in transactions. These techniques allow us to turn a Snapshot Isolation system into a Serializable one. This is how Postgresql’s Serializable isolation level is implemented–on top of an underlying Snapshot Isolation system.

So: we’ve learned about the ANSI SQL isolation levels (RU, RC, RR, and 1SR), and seen that Snapshot Isolation fits in alongside Repeatable Read, and below Serializability. It prevents Dirty Writes, Dirty Reads, and Lost Updates, and some types of Phantoms. However, it still allows an anomaly called Write Skew, so not all transactions in an SI system are guaranteed to serialize correctly. In order to verify Galera’s claims of Snapshot Isolation, we’ll have to design a test that fits within those guarantees.

Designing a test

Jepsen has a linearizability checker in Knossos, but that won’t work here–Snapshot Isolation, and even Serializability, don’t require that operations take place now. They simply have to take place atomically at some point in the history–maybe in the past or future. We need a different kind of checker!

Imagine a system of two bank accounts, each with a balance of $10.

create table if not exists accounts
  (id      int    not null primary key,
   balance bigint not null);
INSERT INTO accounts ( id, balance ) VALUES ( 0, 10 );
INSERT INTO accounts ( id, balance ) VALUES ( 1, 10 );

Then we’ll generate and apply transactions that transfer random amounts of money from one account to the other, so long as no account goes negative:

SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
set autocommit=0
select * from accounts where id = 0
select * from accounts where id = 1
UPDATE accounts SET balance = 8 WHERE id = 0
UPDATE accounts SET balance = 12 WHERE id = 1
COMMIT

Because these transactions write every record they read, they must be serializable under Snapshot Isolation. This is not the same as Galera offering serializability for all transactions–we’re just asserting that these particular transactions appear to take place atomically. By serializable, remember, we mean any history which is equivalent to the transactions executing in some sequential order. It’s OK for operations from two transactions to interleave, so long as their outcomes are the same as if T₁ had executed by itself, then T₂ had executed after.

Claim: Given a set of transactions, each beginning with a read and writing every value they read, a Snapshot Isolation system must always produce serializable histories.

Proof by contradiction: assume these transactions may not always serialize. Then there must exist some possible history in which one transaction T₁ appears to execute interleaved with another T₂.

Lemma 1: Since the start time for a transaction must precede its first read, and the first operation in every transaction is a read, every snapshot’s start time must occur before any of its operations. Similarly, each transaction’s commit time must occur after its last operation. Therefore, the [start-time, commit-time] interval for a transaction covers all its operations.

Without loss of generality, assume T₁ starts before T₂ starts.

Case 1: T₁ commits before T₂’s start time. Operations from T₁ and T₂ cannot interleave, by Lemma 1, because their intervals do not overlap.
Case 2: T₁ and T₂ operate on disjoint sets of accounts. They serialize trivially.
Case 3: T₁ and T₂ operate on intersecting sets of accounts, and T₁ commits before T₂ commits. Then T₁ wrote data that T₂ also wrote, and committed in T₂’s interval, which violates First-committer-wins. T₂ must abort.
Case 4: T₁ and T₂ operate on intersecting sets of accounts, and T₁ commits after T₂ commits. Then T₂ wrote data that T₁ also wrote, and committed in T₁’s interval, which violates First-committer-wins. T₁ must abort.

Cases 1 and 2 contradict our assumption that T₁ and T₂ do not serialize. Cases 3 and 4 contradict the Snapshot Isolation invariants. Therefore, if the system provide Snapshot Isolation, T₁ and T₂ must serialize. Since every pair of transactions must serialize, the total history of transactions must also comprise a serializable history. ∎

Note that we’ve restricted ourselves to transactions that write every value they read in order to avoid phenomena A5B and A6. If we had transactions with intersecting read sets and disjoint writes sets, A5B and A6 might occur, and nonserializable histories could result!

Now introduce read-only transactions, listing all balances:

SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
set autocommit=0
select * from accounts
COMMIT

Read-only transactions trivially serialize with one another. Do they serialize with respect to transfer transactions? The answer is yes: since every read-only transaction sees only committed data in a Snapshot Isolation system, and commits no data itself, it must appear to take place atomically at some time between other transactions.

So the set of all transfer and read transactions must serialize. This gives us two important invariants:

Since each transfer conserves money, the total amount of money in the system remains constant.
Since read transactions serialize and read every balance, every read should see the same total.

With invariant 2, we can write a function to verify that every read sees the correct total balance. Then we generate a mix of randomized transfer and read operations, and apply them to our Galera cluster.

Results

At low levels of concurrency–say five clients, performing about one op per second, things are fine.

{:valid? true,
 :perf ...
 :bank {:valid? true, :bad-reads []}}

But increase the probability of conflicting transactions by running, say, 20 clients pushing an aggregate ~150 transactions/sec for about a minute, and things go terribly, terribly wrong.

INFO  jepsen.core - Analysis invalid! (ﾉಥ益ಥ）ﾉ ┻━┻

{:valid? false,
 :perf ...
 :bank
 {:valid? false,
  :bad-reads
  [{:type :wrong-total,
    :expected 20,
    :found 18,
    :op {:value [6 12], :time 1717930325, :process 15, :type :ok, :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 16,
    :op {:value [2 14], :time 3253699251, :process 13, :type :ok, :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 17,
    :op {:value [8 9], :time 5110345929, :process 17, :type :ok, :f :read}}
   ...

Every transaction should have seen a total balance of 20–but instead we read different values, like 18, 16, or 17. This means our read transactions don’t see a consistent snapshot of the database. They actually see intermediate results from transfer transactions.

These inconsistent reads aren’t just limited to read-only transactions–transfers can see them too. Here’s an excerpt from the query log, showing a transfer transaction which saw an inconsistent snapshot of the world.

66 Connect   jepsen@192.168.122.1 as anonymous on jepsen
66 Query show variables like 'max_allowed_packet'
66 Query SELECT @@tx_isolation
66 Query SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
66 Query set autocommit=0
66 Query select * from accounts where id = 1
66 Query select * from accounts where id = 0
66 Query UPDATE accounts SET balance = 8 WHERE id = 1
66 Query UPDATE accounts SET balance = 9 WHERE id = 0
66 Query COMMIT
66 Query ROLLBACK
66 Query set autocommit=1
66 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
66 Quit

8 + 9 is only 17: the client read an inconsistent snapshot of the two accounts, tried to move some money from one to the other, resulting in new balances 8 and 9, and tried to write those values back. Luckily this transaction failed to commit: it rolled back–as did every inconsistent transfer transaction in this test case. In the ticket I filed for this issue, I initially believed the problem might be limited to read-only transactions–not a great result, but not as catastrophic as permanent data corruption.

Unfortunately, not all inconsistent transfer transactions correctly abort. Some can commit, permanently creating or destroying money. In this run, for instance, an inconsistent read in a transfer transaction fabricates $2 out of thin air:

   ...
   {:type :wrong-total,
    :expected 20,
    :found 22,
    :op {:value [10 12], :time 200102098751, :process 12, :type :ok, :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 22,
    :op {:value [6 16], :time 200109803013, :process 7, :type :ok, :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 22,
    :op {:value [10 12], :time 200113103237, :process 6, :type :ok, :f :read}}
   {:type :wrong-total,
    :expected 20,
    :found 22,
    :op {:value [6 16], :time 200128852818, :process 3, :type :ok, :f :read}}]}}

The transfer transactions should have kept the total amount of money at $20, but by the end of the test the totals all sum to $22. And in this run, 25% of the funds in the system mysteriously vanish. These results remain stable after all other transactions have ended–they are not a concurrency anomaly.

   ...
   {:type :wrong-total,
    :expected 20,
    :found 15,
    :op {:value [15 0], :time 130519175659, :process 14, :type :ok, :f :read}}]}}

In summary, Galera does not provide Snapshot Isolation. A transaction does not operate on an isolated snapshot of the world; other transactions may modify the data it’s reading.

Dirty reads?

So reads can see data that’s being modified by a transaction. Is this P1: Dirty Read? We saw this behavior in MongoDB, where a read could see invalid data from a transaction that never committed.

To measure this, we’ll design a similar test. This time, each operation writes a unique value to every row in a single transaction, allowing us to identify precisely which transaction was responsible for the values a read sees.

If Snapshot Isolation held, every read would see the same number in every row. We know that Galera allows inconsistent reads, so we expect to see a mixture of different numbers in each row. But we can go one step further, and distinguish between values that were written by successful transactions, and those written by failed transactions. If we see data from transactions that never committed, that would be P1: Dirty Reads.

Luckily, the test results suggest that Galera does not allow Dirty Reads.

INFO  jepsen.core - Everything looks good! ヽ(‘ー`)ノ

{:valid? true,
 :perf ...
 :dirty-reads
 {:valid? true,
  :inconsistent-reads
  [[21462 21466 21466 21466]
   [21462 21466 21466 21466]
   ...
   [34449 34449 34460 34460]
   [34460 34460 34463 34463]],
  :dirty-reads []}}

We have plenty of inconsistent reads here–writes in this test always set every row to the same value, but we see different values in reads. However, none of the values we read came from a transaction which did not (eventually) commit. So: there are no dirty reads–at least, not in this test. There might be other conditions that allow Galera to expose uncommitted data, but I haven’t found them yet. This suggests that Galera could support Read Committed.

Given these results, I suspect the inconsistent reads we’re seeing could be A5A: Read Skew. In a Read Skew anomaly, a transaction reads x, but before it can read y, a second transactions sneaks in and updates x and y together. Because both transactions commit this isn’t a Dirty Read, but it has similar effects: two records which should only change together can be changing independently.

A5A: r₁(x) … w₂(x) … w₂(y) … c₂ … r₁(y) … (c₁ or a₁).

Snapshot Isolation prohibits A5A by taking isolated snapshots for reads. This anomaly is supposed to be impossible in Galera–but, as we’ve seen, snapshots aren’t correctly isolated.

The Galera team responded by explaining that Galera does not honor first-committer-wins for performance reasons. No first-committer-wins, no snapshot isolation. No snapshot isolation, well… I’m not sure exactly what Galera does guarantee, but it’s not what it says on the tin.

Recommendations

Galera is easy to install–I spent weeks trying to set up MySQL Cluster to no avail, and got Galera Cluster running in a matter of hours. It offers support contracts, reasonable documentation, and doesn’t require you to navigate the choose-your-own-adventure of Postgres replication strategies. With homogenous nodes, simple configuration, and the wealth of MySQL tooling available, it seems like a solid choice from an operational perspective.

Unfortunately, even in totally healthy clusters, with no node failures or network failures, Galera Cluster does not satisfy its claims of Snapshot Isolation. At moderate concurrency, you should expect to read inconsistent state from other transactions, and to be able to write that state back to the database. Designing your applications to avoid write skew is insufficient: even if writes completely cover the read set, SI fails to hold.

Since Galera appears to forbid reading uncommitted data, I suspect (but cannot show) Galera supports Read Committed isolation.

The probability of data corruption scales with client concurrency, with the duration of transactions, and with the increased probability of intersecting working sets. Placing nodes further apart (for example, on the other side of the planet) will dramatically increase commit times, raising the probability of consistency errors.

I’m not aware of a workaround for these issues. I assumed materializing transaction conflicts or promoting reads to writes would be sufficient, but those techniques failed. If you’re a Galera user with a support contract, you can try asking them to address the issue. You might adopt a different database–though since Galera is the first distributed SQL system I’ve analyzed, and FoundationDB disapparated, I’m not sure what to recommend yet. Or you may decide that your transactions don’t overlap often enough, or aren’t high-value enough, to be a serious problem. Not all consistency errors are critical!

Galera has indicated that they may provide actual Snapshot Isolation, and possibly full Serializability, in future releases. So… stay tuned!

This work is a part of my research at Stripe, where we’re trying to raise the industry’s standards for distributed systems safety. My thanks to sjaakola from the Galera team for his timely response to the bug report, and to Caitie McCaffrey, Peter Bailis, Coda Hale, Jason Pellerin, Camille Fournier, Jorge Ortiz, and Peter Alvaro for their feedback.

Anonymous on 2015-09-05

I think you proof might be missing a case; couldn’t transactions start at the exact same time? (Fwiw, I don’t think it changes the conclusion at all, just wanted to be super-pedantic.)

Baron Schwartz on 2015-09-05

The official “Group Replication” feature from MySQL, which is currently a lab/preview feature, is essentially a reimplementation of Galera and should be easy to test by just duplicating your test.

It is worth noting that Percona XtraDB Cluster is the same thing as MariaDB Galera Cluster, based on exactly the same underlying tech from Galera. I am no longer affiliated with Percona, but I believe from my experience at VividCortex that it is far more widely deployed. My claim is based on the number of various kinds of databases our customers are monitoring with VividCortex. There is a surprising (even to me) amount of Percona XtraDB Cluster running in production out there.

svar on 2015-09-05

Jepsen education is always good, thanks for the delight in reading it!

Galera gave us a very good lesson on our dogmatic position on ACID: Despite A?ID, easy failover and provisioning with minimum impact on performance made it very popular!

We advice usage via leader election and to balance transactions only for disjoint conflicting write domains, that’s the reason we bring in the equation a failover proxy aka MaxScale in this game, to bring back ACID capabilities for the 0.001% SQL statement that need C.

When managing state transitions (scores, auctions, stocks) the implementation need extra caution: pick leader node or loop until getting an affected row for SET W(R'(x) +1) WHERE R(x)=R'(x)

It’s not a scoop, RTFM : The SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE. The reason for this is that there is no support in the SERIALIZABLE transaction isolation level for the multi-master use case, neither in the STATEMENT nor the ROW formats.

If i get your point correctly when used for write everywhere, the isolation should not be declare a SNAPSHOT-ISOLATION but more classify as AP type of storage engine ?

When used with single leader it’s very close to a full CAP system, 0 data lost failover and distributed committed read is already a nice improvement over asynchronous replication for some read write splitting workload.

In your conclusion “The probability of data corruption” , should probably better read “The probability of data inconsistency”

Tx, can’t wait for your next post!

Aphyr on 2015-09-05

It’s not a scoop, RTFM : The SNAPSHOT-ISOLATION level occurs between REPEATABLE-READ and SERIALIZABLE.

It looks like you may have missed the extensive discussion of this “fucking documentation” in the post. I suggest you refer to Berenson, et al, remark 9: “REPEATABLE READ »« Snapshot Isolation.”

When used with single leader it’s very close to a full CAP system, 0 data lost failover and distributed committed read is already a nice improvement over asynchronous replication for some read write splitting workload.

In more general terms, systems which rely on leader state, rather than causal chains associated with data operations themselves, tend to lose data during leader transitions. This is a long-running theme in Jepsen: see Aerospike, Mongo, Elasticsearch, etc. Without more rigorous evidence of correctness, I can’t advocate for that kind of strategy here.

Aphyr on 2015-09-05

I think you proof might be missing a case; couldn’t transactions start at the exact same time? (Fwiw, I don’t think it changes the conclusion at all, just wanted to be super-pedantic.)

That’s a good catch–one of the subtler points of SI from the Berenson and Adya papers that I elided in the post is that start times and commit times are always distinct. There are more specific monotonicity requirements too, but I don’t think they matter for the proof. :)

Aphyr on 2015-09-05

The official “Group Replication” feature from MySQL, which is currently a lab/preview feature, is essentially a reimplementation of Galera and should be easy to test by just duplicating your test.

Oh man, I tried so hard to get the various MySQL cluster flavors set up and had to abort after weeks of spinning my wheels. Maybe someday!

It is worth noting that Percona XtraDB Cluster is the same thing as MariaDB Galera Cluster, based on exactly the same underlying tech from Galera. I am no longer affiliated with Percona, but I believe from my experience at VividCortex that it is far more widely deployed. My claim is based on the number of various kinds of databases our customers are monitoring with VividCortex. There is a surprising (even to me) amount of Percona XtraDB Cluster running in production out there.

You know, I just started looking into Percona yesterday–haven’t managed to get it installed yet, but I expect that it’ll show the same anomalies. Amusingly their documentation claims Percona is CA–e.g. fully available and linearizable–both of which should be readily disprovable. :)

Bhupinder on 2015-09-07

Oh man, I tried so hard to get the various MySQL cluster flavors set up and had to abort after weeks of spinning my wheels. Maybe someday!

Could you please describe the details , any specific issues you had faced and or it didnt work as you expected. Perhaps I can help.

Will Newby on 2015-09-07

I really enjoy these articles, they’re a great test for product documentation and making admins aware of how their systems can fail. Some requests for future tests:

Databases: HBase MS SQL Server’s Availability Groups

Distributed Filesystems/storage: Gluster Possibly Ceph/RADOS

Keep up the great work!

Seppo Jaakola on 2015-09-07

Indeed, the “first committer wins” rule was traded for performance, and Galera Cluster does not support Snapshot Isolation any more. We have to fix the documentation accordingly. The performance impact for aborting SI ‘victims’ was remarkable, in our benchmark profiles. This is because of InnoDB rollback, which is rather heavy operation. But, lesson learned from your experiments is that, we could support SI as session property.

BTW, MariaDB Galera Cluster has session variable: ‘wsrep_sync_wait’ for providing read causality. It does not bring you SI, but will delay readers to wait until on-the-fly replicated transactions have committed. So with that, readers should see all earlier transactions that were accepted by the cluster.

svar on 2015-09-08

re,

“systems which rely on leader state, rather than causal chains associated with data operations themselves, tend to lose data during leader transitions. ”

I’m surprise by the shortcut, that response is not a prove. Leader election is taking place outside the causality of events of the cluster. That design impact that during leader transition the system become CP as a transaction started on old leader would not be able able to commit. Galera certification would ensure that the new leader is just up to date with any writes. I would be happy that you prove me wrong.

Galera provide a certain causal chains that is not SI, it never claimed to be serializable so far and for good reasons. Using it your way, it is an AP design with all the pain that come with it. A proxy that claims to virtualize a Galera Cluster using leader election but without forwarding errors (transaction replay) would take the risk to break C as R(x) would not be causal any more. I guess that many users would chose AP during leader transition instead of CP.

Aphyr on 2015-09-09

I’m surprise by the shortcut, that response is not a prove.

Jepsen analyses take weeks to months of time, and you’ve given me a hand-waving description of an in-house system that I can’t read the source to, let alone test, and even if you had fully described your solution, I don’t have the time to write a full analysis. Proof of your system’s correctness is not my responsibility. If you think it works, that’s fantastic. I just can’t advocate for it myself.

svar on 2015-09-14

You are spending week or month to demonstrate that a system is not SERIALIZABLE when this is a well know design since 2011 But you don’t wan’t to spend time for demonstrating on what we advocate as correct deployment. Maxscale and Haproxy are on github and failover setting is well documented. Anyway if i’m wrong here hadoop may also be wrong by design and i don’t yet see any post on that topic yet.

Aphyr on 2015-09-14

You are spending week or month to demonstrate that a system is not SERIALIZABLE

Galera does not claim to be serializable, and this article does not evaluate that claim. Galera does claim to provide Snapshot Isolation, which is why I’ve tested it for Snapshot Isolation.

… when this is a well know design since 2011 But you don’t wan’t to spend time for demonstrating on what we advocate as correct deployment. Maxscale and Haproxy are on github and failover setting is well documented.

The topology used in this test is one of Galera’s recommended deployment variants. Adding a proxy layer won’t fix any of the problems I discussed here: it’ll just shuffle the client->server connections.

Anyway if i’m wrong here hadoop may also be wrong by design and i don’t yet see any post on that topic yet.

I have no idea what you’re trying to say here. Hadoop is not a relational database. It has a completely different architecture, use case, and invariants.

Andrei Elkin on 2015-09-16

A bit of offtopic, but it’s unfair to call the Oracle’s Group replication as a reimplementation of anything. It’s a genuine solution.

Indeed either system models results of research of Swiss Federal Institute of Technology (EPFL), see section Certification-based Replication, in Understanding Replication in Databases and Distributed Systems M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, G. Alons, 1999.

Federico Razzoli on 2015-09-17

I’m not sure about the point of this post.

Galera has limitations? Yes. These limits are well-documented, so one can decide which applications can or cannot run on Galera. For example, you should not use MyISAM. Unfortunately, having no limitations is impossible for any software.

But your post fails to be merely useless: it is obviously misleading. Why “obviously”? Because it is meant to be read by developers, not marketing people. So, you cannot expect your reader to think “hey, I could need a query like that!”.

Your article has several levels of “badness”:

You show a SELECT + UPDATE. But the SELECT seems to be unnecessary.
But if you want to run useless queries, you could at least do it in the only logical way: UPDATE field + num. Your problem will magically disappear.
Ok, let’s ignore logic. But I still expect you to know SELECT … FOR UPDATE. Yes, it’s locking, but you wanted this lock - and anyway, I don’t think it’s a problem.

Noah on 2015-09-18

Wow. Usually hardcore technical blogs like Aphyr’s are an exception to “it’s the Internet, oh God don’t read the comments.”

But this seems to be much closer to the normal level of Internet comments. It’s clear quite a lot of them haven’t read the actual post, for instance, and they’re fine with completely wrong documentation on Galera’s part if there might kinda-sorta be a workaround.

So: thank you Aphyr. I’m sorry you wound up with this amount of flak. But I appreciate this post very much, and I suspect I’m not the only one.

anonymous on 2015-09-18

Federico,

The select is here to prove that the total amount of money in the database is constant, in a SI system no transaction should be able to read a state that should have never existed without concurrency or clustering. A thread or a remote node is kine of same issue regarding consistency. Each implementation fail into an isolation level and definitively Galera does not fail in SI .

I agree with you most user don’t really care or dare to care as long as they have a workaround. You have resume the situation asking to use select for update on a single node.

So if one database implementation is more adapted to 99% of the usage it will be more popular, end of the story. Users will learn the workaround the hard way . Now it does not remove quality of the analyze and it highlight what shortcut was taken by Galera to achieve an insolvable issue, run concurrent transactions from far away locations without loosing to much of performance. If i have to move a galera node inside opportunity on mars, and someone ask you where is the robot now. Galera tells you where it was last position , and surely not where it is. But using wsrep_causal_read it will tell, wait until it give me the next position. but when you receive it , the robot is already away. Certification ask the robot to stop moving , and knowing where it is we can ask him to go somewhere , but would cost to much to tell him to stop every time one blogger wan’t to know. Or worth to tell the robot to come back to point i told that reader, because i may i gave him the wrong information.

svar on 2015-09-18

sorry last comment was mine not anonymous!

Sam Wilson on 2015-09-18

I may be completely wrong, but is

SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE
SET autocommit=0

the right way to enable SI? I would have thought it would be:

SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
SET autocommit=0
START TRANSACTION WITH CONSISTENT SNAPSHOT

Aphyr on 2015-09-24

Sam, transactions are implicitly begun at the first SELECT, and WITH CONSISTENT SNAPSHOT is redundant; consistent snapshots are automatically taken at the first SELECT from a transactional table.

Just to double-check, I re-ran the test with an explicit START TRANSACTION WITH CONSISTENT SNAPSHOT, and the results are unchanged.

Jay Pipes on 2015-09-30

Awesome post (as always), Kyle.

It’s good to see transparent communication from the Galera team, and a forthright acknowledgement that snapshot isolation is indeed not supported by Galera (at this time) and that the docs should be updated accordingly.

Best, -jay

summivox on 2015-10-21

Any plans to test MemSQL?

Aaron S on 2015-11-08

Nice post!

I was little confused by the SET statements. Getting “Snapshot” Isolation in MySQL just involves using REPEATABLE-READ as the transaction level. This is idiosyncratic, but the same in Oracle/Postgres, though not for MS SQL Server and some others.

In any case, regular MySQL doesn’t honor first-commit-wins. I actually never heard of that aspect of the definition of snapshot until reading that PDF so I was confused for a while why that was expected of it.

MySQL SERIALIZABLE should just be snapshot+(row/gap locks on read), so I’m still baffled by one the Galera bug comments saying that the bank test somehow works for REPEATABLE-READ but not SERIALIZABLE. I get there there is wsrep bug somewhere affecting the later, but I don’t see how the former could ever work (since there is no FCW rule).

Aphyr on 2015-11-10

Getting “Snapshot” Isolation in MySQL just involves using REPEATABLE-READ as the transaction level.

No, it doesn’t. Repeatable Read doesn’t acquire the right kind of locks to provide Snapshot Isolation–and the standard doesn’t require it to. That’s why these tests use Serializable isolation.

In any case, regular MySQL doesn’t honor first-commit-wins. I actually never heard of that aspect of the definition of snapshot until reading that PDF so I was confused for a while why that was expected of it.

This is precisely why MySQL’s Repeatable Read does not provide Snapshot Isolation. Regular MySQL (and, for that matter, single-node Galera clusters) does provide first-committer-wins under Serializable isolation,.

Aaron S on 2015-11-14

Yeah, that’s why I put snapshot in quotes, since it only enforces some of the definition. Sorry if that wasn’t clear.

I don’t know what to call that isolation level. Most DBs give you this quasi-snapshot isolation when you ask for REPEATABLE-READ rather than just giving you ANSI REPEATABLE-READ. Postgres does this to. Is there an academically accepted term for this concept?

Speaking of Postgres, it also has SSI (serializable snapshot isolation), so we have the “snapshot” isolations: 1) the defacto one in lots of DBs that does rules 1 & 2 in the paper (PDF) 2) the first-committer version in the paper with the 3 rules (SELECT is from a point in time, see your own changes, first committer) 3) the SSI version (which also has papers) which does proper CAS checks on all rows read, including via predicate queries

The combination of this and almost nothing actually using literal ANSI REPEATABLE-READ makes for some serious confusion I suspect.

Aaron S. on 2015-11-14

Correction to my comment the above: Postgres REPEATABLE-READ does honor first-committer. I must have missed that from the documentation last time I read it months back.

MS SQL server has a READ_COMMITTED_SNAPSHOT option for READ_COMMITTED, though it has nothing to do with SNAPSHOT isolation (nor the mysql snapshots). They also have a proper SNAPSHOT level which seems to honor first-committer according to their documentation page.

It might seem silly, but I feel like writing “SNAPSHOT isolation” in that exact casing avoids ambiguity of other “snapshots” of sorts.

Driv3l on 2015-11-15

Great reviews on the various DBs thus far!

I also had issues trying to get MySQL Cluster up and running… it’s a bit of a pain. Here are a couple of tips… firstly, if using virtualization, make sure you’re using VMs and not containers. For some reason, I had issues getting it working using containers but with a VM, things went much smoother.

Secondly, severalnines has a free tool called clustercontrol which can manage the entire setup of MySQL cluster for you. They also have an online tool for generating the config files for the various nodes. This makes things a lot simpler.

Took me a while to figure it all out, but I finally got a basic cluster up and running….

It would be fantastic to see a review of MySQL Cluster if you managed to get that up and running!

Vladimir on 2016-04-20

Thanks for interesting articles!

I have 2 working (huh, usually working) MySQL NDB clusters and now I’m trying to test circular replication between them. If you want to test MySQL Cluster I can share configuration files and interesting articles, and can try to answer questions.

I tried to test it by myself but it take too long time for me to figure out how to configure Jepsen (it was several months ago, may be now it more user-friendly) and I had decided to leave it.

Dan Perillo on 2017-09-15

Do we know if this has been resolved in the latest releases yet or is this still a problem with MariaDB?

Jepsen: MariaDB Galera Cluster

SQL Consistency Models

Snapshot Isolation

Designing a test

Results

Dirty reads?

Recommendations

Post a Comment