In May of 2013, we showed that MongoDB 2.4.3 would lose acknowledged writes at all consistency levels. Every write concern less than MAJORITY loses data by design due to rollbacks–but even WriteConcern.MAJORITY lost acknowledged writes, because when the server encountered a network error, it returned a successful, not a failed, response to the client. Happily, that bug was fixed a few releases later.

Since then I’ve improved Jepsen significantly and written a more powerful analyzer for checking whether or not a system is linearizable. I’d like to return to Mongo, now at version 2.6.7, to verify its single-document consistency. (Mongo 3.0 was released during my testing, and I expect they’ll be hammering out single-node data loss bugs for a little while.)

In this post, we’ll see that Mongo’s consistency model is broken by design: not only can “strictly consistent” reads see stale versions of documents, but they can also return garbage data from writes that never should have occurred. The former is (as far as I know) a new result which runs contrary to all of Mongo’s consistency documentation. The latter has been a documented issue in Mongo for some time. We’ll also touch on a result from the previous Jepsen post: almost all write concern levels allow data loss.

This analysis is brought to you by Stripe, where I now work on safety and data integrity–including Jepsen–full time. I’m delighted to work here and excited to talk about new consistency problems!

We’ll start with some background, methods, and an in-depth analysis of an example failure case, but if you’re in a hurry, you can skip ahead to the discussion.

Write consistency

First, we need to understand what MongoDB claims to offer. The Fundamentals FAQ says Mongo offers “atomic writes on a per-document-level”. From the last Jepsen post and the write concern documentation, we know Mongo’s default consistency level (Acknowledged) is unsafe. Why? Because operations are only guaranteed to be durable after they have been acknowledged by a majority of nodes. We must use write concern Majority to ensure that successful operations won’t be discarded by a later rollback.

Why can’t we use a lower write concern? Because rollbacks are only OK if any two versions of the document can be merged associatively, commutatively, and idempotently; i.e. they form a state-based CRDT. Our documents likely don’t have this property.

For instance, consider an increment-only counter, stored as a single integer field in a MongoDB document. If you encounter a rollback and find two copies of the document with the values 5 and 7, the correct value of the counter depends on when they diverged. If the initial value was 0, we could have had five increments on one primary, and seven increments on another: the correct value is 5 + 7 = 12. If, on the other hand, the value on both replicas was 5, and only two increments occurred on an isolated primary, the correct value should be 7. Or it could be any value in between!
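If, instead of a bare integer, each replica kept its own count in a map (a state-based G-counter CRDT), the two versions would merge unambiguously. A toy sketch, purely illustrative and nothing to do with Mongo's API:

(defn merge-counters
  "Merges two G-counters by taking, for each replica, the highest count
  either side has seen. The counter's value is the sum of those counts."
  [a b]
  (merge-with max a b))

;; Five increments on one primary, seven on another, after diverging from 0:
(merge-counters {:n1 5} {:n5 7})          ; => {:n1 5, :n5 7}, value 12
;; Both replicas at 5 before the split; two more increments on the isolated primary:
(merge-counters {:n1 5} {:n1 5, :n5 2})   ; => {:n1 5, :n5 2}, value 7
;; With bare integers 5 and 7 there is no way to tell these cases apart.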

And this assumes your operations (e.g. incrementing) commute! If they’re order-dependent, Mongo will allow flat-out invalid writes to go through, like claiming the same username twice, document ID conflicts, transferring $40 + $30 = $70 out of a $50 account which is never supposed to go negative, and so on.

Unless your documents are state-based CRDTs, the following table illustrates which Mongo write concern levels are actually safe:

Write concern              | Also called   | Safe?
Unacknowledged             | NORMAL        | Unsafe: doesn’t even bother checking for errors
Acknowledged (new default) | SAFE          | Unsafe: not even on disk or replicated
Journaled                  | JOURNAL_SAFE  | Unsafe: ops could be illegal or just rolled back by another primary
Fsynced                    | FSYNC_SAFE    | Unsafe: ditto, constraint violations & rollbacks
Replica Acknowledged       | REPLICAS_SAFE | Unsafe: ditto, another primary might overrule
Majority                   | MAJORITY      | Safe: no rollbacks (but check the fsync/journal fields)

So: if you use MongoDB, you should almost always be using the Majority write concern. Anything less is asking for data corruption or loss when a primary transition occurs.
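For illustration, here's roughly what pinning those defaults looks like from Clojure via the 2.x Java driver. The host, database name, and interop details are my own assumptions, not the exact Jepsen client code:

(import '[com.mongodb MongoClient WriteConcern ReadPreference])

(let [client (MongoClient. "n1")
      db     (.getDB client "jepsen")]
  ;; Every write through this DB handle must be acknowledged by a majority
  ;; of replica set members before the driver reports success.
  (.setWriteConcern db WriteConcern/MAJORITY)
  ;; Read from the primary (the default, but worth being explicit).
  (.setReadPreference db (ReadPreference/primary)))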

Read consistency

The FAQ Fundamentals doesn’t just promise atomic writes, though: it also claims “fully-consistent reads”. The Replication Introduction goes on:

The primary accepts all write operations from clients. Replica set can have only one primary. Because only one member can accept write operations, replica sets provide strict consistency for all reads from the primary.

What does “strict consistency” mean? Mongo’s glossary defines it as

A property of a distributed system requiring that all members always reflect the latest changes to the system. In a database system, this means that any system that can provide data must reflect the latest writes at all times. In MongoDB, reads from a primary have strict consistency; reads from secondary members have eventual consistency.

The Replication docs agree: “By default, in MongoDB, read operations to a replica set return results from the primary and are consistent with the last write operation”, as distinct from “eventual consistency”, where “the secondary member’s state will eventually reflect the primary’s state.”

Read consistency is controlled by the Read Preference, which emphasizes that reads from the primary will see the “latest version”:

By default, an application directs its read operations to the primary member in a replica set. Reading from the primary guarantees that read operations reflect the latest version of a document.

And goes on to warn that:

…. modes other than primary can and will return stale data because the secondary queries will not include the most recent write operations to the replica set’s primary.

All read preference modes except primary may return stale data because secondaries replicate operations from the primary with some delay. Ensure that your application can tolerate stale data if you choose to use a non-primary mode.

So, if we write at write concern Majority, and read with read preference Primary (the default), we should see the most recently written value.

CaS rules everything around me

When writes and reads are concurrent, what does “most recent” mean? Which states are we guaranteed to see? What could we see?

Concurrent operations can occur in either order

For concurrent operations, the absence of synchronized clocks prevents us from establishing a total order. We must allow each operation to take effect just before, or just after, any other in-flight operations. If we write a, then initiate a write of b and, while it is still in flight, perform a read, we could see either a or b, depending on which operation takes effect first.

After operations complete, visibility is guaranteed

On the other hand, we obviously shouldn’t interact with a value from the future–we have no way to tell what those operations will be. The latest possible state an operation can see is the one just prior to the response received by the client.

And because we need to see the “most recent write operation”, we should not be able to interact with any state prior to the ops concurrent with the most recently acknowledged write. It should be impossible, for example, to write a, write b, then read a: since the writes of a and b are not concurrent, the second should always win.

This is a common strong consistency model for concurrent data structures called linearizability. In a nutshell, every successful operation must appear to occur atomically at some time between its invocation and response.

So in Jepsen, we’ll model a MongoDB document as a linearizable compare-and-set (CaS) register, supporting three operations:

  • write(x'): set the register’s value to x'
  • read(x): read the current value x. Only succeeds if the current value is actually x.
  • cas(x, x'): if and only if the value is currently x, set it to x'.

We can express this consistency model as a singlethreaded datatype in Clojure. Given a register containing a value, and an operation op, the step function returns the new state of the register–or a special inconsistent value if the operation couldn’t take place.

(defrecord CASRegister [value]
  Model
  (step [r op]
    (condp = (:f op)
      :write (CASRegister. (:value op))
      :cas   (let [[cur new] (:value op)]
               (if (= cur value)
                 (CASRegister. new)
                 (inconsistent (str "can't CAS " value " from " cur " to " new))))
      :read  (if (or (nil? (:value op))
                     (= value (:value op)))
               r
               (inconsistent (str "can't read " (:value op)
                                  " from register " value))))))

Then we’ll have Jepsen generate a mix of read, write, and CaS operations. Over the course of a few minutes, five clients apply those random ops to a five-node Mongo cluster, while a special nemesis process creates and resolves network partitions to induce cluster transitions.
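The invocations themselves are just maps. Something like this sketch (the generator plumbing is elided; these shapes are close to, but not necessarily identical with, what the test emits):

(defn r   [] {:type :invoke, :f :read,  :value nil})
(defn w   [] {:type :invoke, :f :write, :value (rand-int 5)})
(defn cas [] {:type :invoke, :f :cas,   :value [(rand-int 5) (rand-int 5)]})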

Finally, we’ll have Knossos analyze the resulting concurrent history of all clients' operations, in search of a linearizable path through the history.

Inconsistent reads

Surprise! Even when all writes and CaS ops use the Majority write concern, and all reads use the Primary read preference, operations on a single document in MongoDB are not linearizable. Reads taking place just after the start of a network partition demonstrate impossible behaviors.

In this history, an anomaly appears shortly after the nemesis isolates nodes n1 and n3 from n2, n4, and n5. Each line shows a singlethreaded process (e.g. 2) performing (e.g. :invoke) an operation (e.g. :read), with a value (e.g. 3).

An :invoke indicates the start of an operation. If it completes successfully, the process logs a corresponding :ok. If it fails (by which we mean the operation definitely did not take place) we log :fail instead. If the operation crashes–for instance, if the network drops, a machine crashes, a timeout occurs, etc.–we log an :info message, and that operation remains concurrent with every subsequent op in the history. Crashed operations could take effect at any future time.
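In Clojure terms, each log line below is just a map. A successful write and a crashed CaS, for instance, look roughly like this (values illustrative):

[{:process 4, :type :invoke, :f :write, :value 3}
 {:process 4, :type :ok,     :f :write, :value 3}
 {:process 0, :type :invoke, :f :cas,   :value [0 4]}
 {:process 0, :type :info,   :f :cas,   :value :network-error}]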

Not linearizable. Linearizable prefix was:

2        :invoke :read  3
4        :invoke :write 3
...
4        :invoke :read  0
4        :ok     :read  0
:nemesis :info   :start "Cut off {:n5 #{:n3 :n1}, :n2 #{:n3 :n1}, :n4 #{:n3 :n1}, :n1 #{:n4 :n2 :n5}, :n3 #{:n4 :n2 :n5}}"
1        :invoke :cas   [1 4]
1        :fail   :cas   [1 4]
3        :invoke :cas   [4 4]
3        :fail   :cas   [4 4]
2        :invoke :cas   [1 0]
2        :fail   :cas   [1 0]
0        :invoke :read  0
0        :ok     :read  0
4        :invoke :read  0
4        :ok     :read  0
1        :invoke :read  0
1        :ok     :read  0
3        :invoke :cas   [2 1]
3        :fail   :cas   [2 1]
2        :invoke :read  0
2        :ok     :read  0
0        :invoke :cas   [0 4]
4        :invoke :cas   [2 3]
4        :fail   :cas   [2 3]
1        :invoke :read  4
1        :ok     :read  4
3        :invoke :cas   [4 2]
2        :invoke :cas   [1 1]
2        :fail   :cas   [1 1]
4        :invoke :write 3
1        :invoke :read  3
1        :ok     :read  3
2        :invoke :cas   [4 2]
2        :fail   :cas   [4 2]
1        :invoke :cas   [3 1]
2        :invoke :write 4
0        :info   :cas   :network-error
2        :info   :write :network-error
3        :info   :cas   :network-error
4        :info   :write :network-error
1        :info   :cas   :network-error
5        :invoke :write 1
5        :fail   :write 1
... more failing ops which we can ignore since they didn't take place ...
5        :invoke :read  0

Followed by inconsistent operation:

5        :ok     :read  0
A diagram of the concurrent operations from this history

This is a little easier to understand if we sketch a diagram of the last few operations–the events occurring just after the network partition begins. In this notation, time moves from left to right, and each process is a horizontal track. A green bar shows the interval between :invoke and :ok for a successful operation. Processes that crashed with :info are shown as yellow bars running to infinity–they’re concurrent with all future operations.

In order for this history to be linearizable, we need to find a path which always moves forward in time, touches each OK (green) operation once, and may touch crashed (yellow) operations at most once. Along that path, the rules we described for a CaS register should hold–writes set the value, reads must see the current value, and compare-and-set ops set a new value iff the current value matches.

An initial path

The very first operation here is process 2’s read of the value 0. We know the value must be zero when it completes, because there are no other concurrent ops.

The next read we have to satisfy is process 1’s read of 4. The only operation that could have taken place between those two reads is process 0’s crashed CaS from 0 to 4, so we’ll use that as an intermediary.

Note that this path always moves forward in time: linearizability requires that operations take effect sometime between their invocation and their completion. We can choose any point along the operation’s bar for the op to take effect, but can’t travel through time by drawing a line that moves back to the left! Also note that along the purple line, our single-threaded model of a register holds: we read 0, change 0->4, then read 4.

Process 1 goes on to read a new value: 3. We have two operations which could take place before that read. Executing a compare-and-set from 4 to 2 would be legal because the current value is 4, but that would conflict with a subsequent read of 3. Instead we apply process 4’s write of 3 directly.

Changing 4 to 2, then an invalid read of 3
Writing 3, then reading 3

Now we must find a path leading from 3 to the final read of 0. We could write 4, and optionally CaS that 4 to 2, but neither of those values is consistent with a read of 0.

Writing 4, then an invalid read of 0
Writing 4, changing 4 to 2, then an invalid read of 0
Changing 3 to 1, then an invalid read of 0

We could also CaS 3 to 1, which would give us the value 1, but that’s not 0 either! And applying the write of 4 and its dependent paths won’t get us anywhere either; we already showed those are inconsistent with reading 0. We need a write of 0, or a CaS to 0, for this history to make sense–and no such operation could possibly have happened here.

Each of these failed paths is represented by a different “possible world” in the Knossos analysis. At the end of the file, Knossos comes to the same conclusion we drew from the diagram: the register could have been 1, 2, 3, or 4–but not 0. This history is illegal.
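To make the register rules concrete, here is a toy, self-contained check. This is not Knossos, which searches over every legal ordering; it just folds one candidate ordering through a register:

(defn apply-op
  "Applies a single op to the register value, or returns ::impossible."
  [value {:keys [f] :as op}]
  (case f
    :write (:value op)
    :read  (if (= value (:value op)) value ::impossible)
    :cas   (let [[cur new] (:value op)]
             (if (= cur value) new ::impossible))))

(defn consistent-order?
  "True iff this particular ordering of ops is legal for a CaS register."
  [init ops]
  (not= ::impossible
        (reduce (fn [v op]
                  (let [v' (apply-op v op)]
                    (if (= ::impossible v') (reduced v') v')))
                init ops)))

;; The path we just traced: read 0, CaS 0->4, read 4, write 3, read 3,
;; then process 5's final read of 0.
(consistent-order? 0 [{:f :read,  :value 0}
                      {:f :cas,   :value [0 4]}
                      {:f :read,  :value 4}
                      {:f :write, :value 3}
                      {:f :read,  :value 3}
                      {:f :read,  :value 0}])
;; => false: no write or CaS to 0 precedes that final read.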

It’s almost as if process 5’s final read of zero is connected to the state that process 2 saw, just prior to the network partition. As if the state of the system weren’t linear after all–a read was allowed to travel back in time to that earlier state. Or, alternatively, the state of the system split in two–one for each side of the network partition–writes occurring on one side, and on the other, the value remaining 0.

Dirty reads and stale reads

We know from this test that MongoDB does not offer linearizable CaS registers. But why?

One possibility is that this is the result of Mongo’s read-uncommitted isolation level. Although the docs say “For all inserts and updates, MongoDB modifies each document in isolation: clients never see documents in intermediate states,” they also warn:

MongoDB allows clients to read documents inserted or modified before it commits these modifications to disk, regardless of write concern level or journaling configuration…

For systems with multiple concurrent readers and writers, MongoDB will allow clients to read the results of a write operation before the write operation returns.

This means clients can, in fact, read documents in intermediate states. Imagine the network partitions, and for a brief time, there are two primary nodes–each sharing the initial value 0. Only one of them, connected to a majority of nodes, can successfully execute writes with write concern Majority. The other, which can only see a minority, will eventually time out and step down–but this takes a few seconds.

Dirty reads take place against modified data on the minority primary. Stale reads take place against unchanged data on the minority primary, after new data has been written to the majority primary.

If a client writes (or CaS’s) the value on the minority primary to 1, MongoDB will happily modify that primary’s local state before confirming with any secondaries that the write is safe to perform! It’ll be changed back to 0 as a part of the rollback once the partition heals–but until that time, reads against the minority primary will see 1, not 0! We call this anomaly a dirty read because it exposes garbage temporary data to the client.

Alternatively, no changes could occur on the minority primary–or whatever writes do take place leave the value unchanged. Meanwhile, on the majority primary, clients execute successful writes or CaS ops, changing the value to 1. If a client executes a read against the minority primary, it will see the old value 0, even though a write of 1 has taken place.

Because the minority primary keeps serving reads during the partition, Mongo allows not only dirty reads but also clean-but-stale reads against it. Both of these anomalies are present in MongoDB.

Dirty reads are a known issue, but to my knowledge, the fact that Mongo allows stale reads at the strongest consistency settings has not been reported before. Since Mongo’s documentation repeatedly states this anomaly should not occur, I’ve filed SERVER-17975.

What does that leave us with?

What Mongo's documentation claims to offer, vs what it actually provides

Even with Majority write concern and Primary read preference, reads can see old versions of documents. You can write a, then write b, then read a back. This invalidates a basic consistency invariant for registers: Read Your Writes.

Mongo also allows you to read alternating fresh and stale versions of a document–successive reads could see a, then b, then a again, and so on. This violates Monotonic Reads.

Because both of these consistency models are implied by the PRAM memory model, violating them means MongoDB cannot offer PRAM consistency–which in turn rules out causal, sequential, and linearizable consistency.

Because Mongo allows you to read uncommitted garbage data, we also know MongoDB can’t offer Read Committed, Cursor Stability, Monotonic Atomic View, Repeatable Read, or serializability. On the map of consistency models to the right, this leaves Writes Follow Reads, Monotonic Write, and Read Uncommitted–all of which, incidentally, are totally-available consistency models even when partitions occur.

Conversely, Mongo’s documentation repeatedly claims that one can read “the latest version of a document”, and that Mongo offers “immediate consistency”. These properties imply that reads, writes, and CaS against a single document should be linearizable. Linearizable consistency implies that Mongo should also ensure:

  • Sequential consistency: all processes agree on op order
  • Causal consistency: causally related operations occur in order
  • PRAM: a parallel memory model
  • Read Your Writes: a process can only read data from after its last write
  • Monotonic Read: a process’s successive reads must occur in order
  • Monotonic Write: a process’s successive writes must occur in order
  • Write Follows Read: a process’s writes must logically follow its last read

Mongo 2.6.7 (and presumably 3.0.0, which has the same read-uncommitted semantics) only ensures the last two invariants. I argue that this is neither “strict” nor “immediate” consistency.

How bad are dirty reads?

Read uncommitted allows all kinds of terrible anomalies we probably don’t want as MongoDB users.

For instance, suppose we have a user registration service keyed by a unique username. Now imagine a partition occurs, and two users–Alice and Bob–try to claim the same username–one on each side of the partition. Alice’s request is routed to the majority primary, and she successfully registers the account. Bob, talking to the minority primary, will see his account creation request time out. The minority primary will eventually roll his account back to a nonexistent state, and when the partition heals, accept Alice’s version.

But until the minority primary detects the failure, Bob’s invalid user registration will still be visible for reads. After registration, the web server redirects Alice to /my/account to show her the freshly created account. However, this HTTP request winds up talking to a server whose client still thinks the minority primary is valid–and that primary happily responds to a read request for the account with Bob’s information.

Alice’s page loads, and in place of her own data, she sees Bob’s name, his home address, his photograph, and other things that never should have been leaked between users.

You can probably imagine other weird anomalies. Temporarily visible duplicate values for unique indexes. Locks that appear to be owned by two processes at once. Clients seeing purchase orders that never went through.

Or consider a reconciliation process that scans the list of all transactions a client has made to make sure their account balance is correct. It sees an attempted but invalid transaction that never took place, and happily sets the user’s balance to reflect that impossible transaction. The mischievous transaction subsequently disappears on rollback, leaving customer support to puzzle over why the numbers don’t add up.

Or, worse, an admin goes in to fix the rollback, assumes the invalid transaction should have taken place, and applies it to the new primary. The user sensibly retried their failed purchase, so they wind up paying twice for the same item. Their account balance goes negative. They get hit with an overdraft fine and have to spend hours untangling the problem with support.

Read-uncommitted is scary.

What if we just fix read-uncommitted?

[Diagram: even without dirty reads, stale reads remain possible on the minority primary]

Mongo’s engineers initially closed the stale-read ticket as a duplicate of SERVER-18022, which addresses dirty reads. However, as the diagram illustrates, only showing committed data on the minority primary will not solve the problem of stale reads. Even if the minority primary accepts no writes at all, successful writes against the majority node will allow us to read old values.

That invalidates Monotonic Read, Read Your Writes, PRAM, causal, sequential, and linearizable consistency. The anomalies are less terrifying than Read Uncommitted, but still weird.

For instance, let’s say two documents are always supposed to be updated one after the other–like creating a new comment document, and writing a reference to it into a user’s feed. Stale reads allow you to violate the implicit foreign key constraint there: the user could load their feed and see a reference to comment 123, but looking up comment 123 in Mongo returns Not Found.

A user could change their name from Charles to Désirée, have the change go through successfully, but when the page reloads, still see their old name. Seeing old values could cause users to repeat operations that they should have only attempted once–for instance, adding two or three copies of an item to their cart, or double-posting a comment. Your clients may be other programs, not people–and computers are notorious for retrying “failed” operations very fast.

Stale reads can also cause lost updates. For instance, a web server might accept a request to change a user’s profile photo information, write the new photo URL to the user record in Mongo, and contact the thumbnailer service to generate resized copies and publish them to S3. The thumbnailer sees the old photo, not the newly written one. It happily resizes it, and uploads the old avatar to S3. As far as the backend and thumbnailer are concerned, everything went just fine–but the user’s photo never changes. We’ve lost their update.

What if we just pretend this is how it’s supposed to work?

Mongo then closed the ticket, claiming it was working as designed.

What are databases? We just don’t know.

So where does that leave us?

Despite claiming “immediate” and “strict consistency”, with reads that see “the most recent write operations”, and “the latest version”, MongoDB, even at the strongest consistency levels, allows reads to see old values of documents or even values that never should have been written. Even once Mongo supports Read Committed isolation, the stale read problem will likely persist without a fundamental redesign of the read process.

SERVER-18022 targets consistent reads for Mongo 3.1. SERVER-17975 has been re-opened, and I hope Mongo’s engineers will consider the problem and find a way to address it. Preventing stale reads requires coupling the read process to the oplog replication state machine in some way, which will probably be subtle–but Consul and etcd managed to do it in a few months, so the problem’s not insurmountable!

In the meantime, if your application requires linearizable reads, I can suggest a workaround! CaS operations appear (insofar as I’ve been able to test) to be linearizable, so you can perform a read, then try to findAndModify the current value, changing some irrelevant field. If that CaS succeeds, you know the document had that state at some time between the invocation of the read and the completion of the CaS operation. You will, however, incur an IO and latency penalty for the extra round-trips.
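Here is a rough sketch of the shape of that workaround, via the 2.x Java driver from Clojure. The collection layout, field names, and nonce trick are my own illustration rather than a vetted recipe; see also the note at the end of this post about Mongo optimizing away identity CaS operations:

(import '[com.mongodb BasicDBObject])

(defn read-verified
  "Reads a document's value, then issues a findAndModify which only matches
  if the value is unchanged, bumping an unrelated nonce field. If the CaS
  matched, the value we read was current at some point between the read's
  invocation and the CaS's completion."
  [coll id]
  (let [doc    (.findOne coll (BasicDBObject. "_id" id))
        value  (.get doc "value")
        query  (doto (BasicDBObject.)
                 (.put "_id" id)
                 (.put "value" value))
        update (BasicDBObject. "$inc" (BasicDBObject. "nonce" 1))]
    (when (.findAndModify coll query update)
      value)))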

And remember: always use the Majority write concern, unless your data is structured as a CRDT! If you don’t, you’re looking at lost updates, and that’s way worse than any of the anomalies we’ve discussed here!

Finally, read the documentation for the systems you depend on thoroughly–then verify their claims for yourself. You may discover surprising results!

Next, on Jepsen: Elasticsearch 1.5.0

My thanks to Stripe and Marc Hedlund for giving me the opportunity to perform this research, and a special thanks to Peter Bailis, Michael Handler, Coda Hale, Leif Walsh, and Dan McKinley for their comments on early drafts. Asya Kamsky pointed out that Mongo will optimize away identity CaS operations, which prevents their use as a workaround.

Previously in Jepsen, we discussed Redis. In this post, we'll see MongoDB drop a phenomenal amount of data.

MongoDB is a document-oriented database with a similar distribution design to Redis. In a replica set, there exists a single writable primary node which accepts writes, and asynchronously replicates those writes as an oplog to N secondaries. However, there are a few key differences.

First, Mongo builds in its leader election and replicated state machine. There's no separate system which tries to observe a replica set in order to make decisions about what it should do. The replica set decides among itself which node should be primary, when to step down, how to replicate, etc. This is operationally simpler and eliminates whole classes of topology problems.

Second, Mongo allows you to ask that the primary confirm successful replication of a write by its disk log, or by secondary nodes. At the cost of latency, we can get stronger guarantees about whether or not a write was successful.

What happens when a primary becomes inaccessible?


The remaining secondaries will gradually detect the failed connection and attempt to come to a consensus about what to do. If they have a majority (and remember, there can be only one majority in a cluster, so this suggests we're heading towards a CP system), they'll select the node with the highest optime (a monotonic clock maintained by each node) and promote it to be a new primary. Simultaneously, the minority nodes will detect that they no longer have a quorum, and demote the primary to a secondary so it can't accept writes.


If our primary is on n1, and we cut off n1 and n2 from the rest of the cluster, we expect either n3, n4, or n5 to become the new primary. Because this architecture demotes the original primary on n1, we won't find ourselves in the same split-brain problem we saw with Redis.

Consistency

So is MongoDB CP? There's a popular notion that MongoDB is a CP system, including exchanges like this, where all kinds of nuanced technical assertions about strong consistency are thrown around. At the same time, Mongo's documentation for replica sets explains carefully that Mongo may “revert operations”:

In some failover situations primaries will have accepted write operations that have not replicated to the secondaries after a failover occurs. This case is rare and typically occurs as a result of a network partition with replication lag. When this member (the former primary) rejoins the replica set and attempts to continue replication as a secondary the former primary must revert these operations or “roll back” these operations to maintain database consistency across the replica set.

“Revert” certainly doesn't sound like linearizability to me, but that bit about “maintain[ing] database consistency” doesn't sound so bad. What actually happens? Let's find out!

For this example, we'll be adding integers to a list in a MongoDB document by using the update command in a CaS loop–just like you'd use with any transactionally isolated database. Yes, we could use $addToSet, but I'm using this app as an example of atomic updates in general, and they have different oplog dynamics.
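Concretely, the update loop looks something like this sketch, written against the 2.x Java driver from Clojure. Field names and error handling are simplified, and the write concern is whatever the run under test uses:

(import '[com.mongodb BasicDBObject WriteConcern])

(defn cas-add!
  "Reads the current list of elements, then issues an update which only
  matches if the list is unchanged, appending n. Retries if another writer
  got there first (getN returns 0 when no document matched)."
  [coll id n write-concern]
  (loop []
    (let [doc      (.findOne coll (BasicDBObject. "_id" id))
          elements (.get doc "elements")
          query    (doto (BasicDBObject.)
                     (.put "_id" id)
                     (.put "elements" elements))
          change   (BasicDBObject. "$push" (BasicDBObject. "elements" n))
          res      (.update coll query change false false write-concern)]
      (when (zero? (.getN res))
        (recur)))))

;; e.g. (cas-add! coll "jepsen" 42 WriteConcern/MAJORITY)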

Unacknowledged


Up until recently, clients for MongoDB didn't bother to check whether or not their writes succeeded, by default: they just sent them and assumed everything went fine. This goes about as well as you'd expect.

lein run mongo-unsafe -n 6000
salticid jepsen.partition

For a while, writes continue to complete against n1. Then we see errors as the replica set fails over, like

3186 No replica set members available in [
  { address:'n3/10.10.3.101:27017', ok:true, ping:0.8954104, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216 },
  { address:'n4/10.10.3.95:27017', ok:true, ping:0.681164, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216 },
  { address:'n5/10.10.3.32:27017', ok:true, ping:0.6231328, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216 },
  { address:'n2/10.10.3.52:27017', ok:true, ping:0.51316977, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216 },
  { address:'n1/10.10.3.242:27017', ok:true, ping:0.37008655, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216 }
] for { "mode" : "primary"}

During this time, the majority nodes (n3, n4, n5) are still secondaries, but they've agreed that the old primary is inaccessible. They compare optimes and race to elect a leader:

$ salticid mongo.rs_stat
22:09:08 Starting...
22:09:08 MongoDB shell version: 2.4.1
22:09:08 connecting to: test
22:09:08 n1:27017 (not reachable/healthy) 1368940104/56
22:09:08 n2:27017 (not reachable/healthy) 1368940103/458
22:09:08 n3:27017 SECONDARY               1368940104/89
22:09:08 n4:27017 SECONDARY               1368940104/89
22:09:08 n5:27017 SECONDARY               1368940104/102
22:09:08 true
22:09:08 Finished
22:09:23 n1:27017 (not reachable/healthy) 1368941926/66
22:09:23 n2:27017 (not reachable/healthy) 1368941961/70
22:09:23 n3:27017 SECONDARY               1368941962/9
22:09:23 n4:27017 SECONDARY               1368941961/45
22:09:23 n5:27017 PRIMARY                 1368941963/11

N5 wins the race, and proceeds to accept writes. If we heal the partition with salticid jepsen.heal, and wait a few seconds, the nodes will detect the fully connected cluster and the new primary will step down, to allow n1 to resume its place. Now that the cluster has stabilized, we hit enter to check how many of our writes survived:

Hit enter when ready to collect results.
Writes completed in 93.608 seconds

6000 total
5700 acknowledged
3319 survivors
2381 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
469 474 479 484 489 494 ... 3166 3168 3171 3173 3178 3183
0.95 ack rate
0.4177193 loss rate
0.0 unacknowledged but successful rate

42% write loss. Well, to some extent, this shouldn't be surprising, because we weren't checking to see whether the server was successful in applying our writes. Those 300 errors only came about when we tried to write to a secondary. But we never actually crashed a node, and we didn't see any signs of a split-brain condition with two simultaneous primaries–so why did Mongo drop data?

Remember those writes that completed on n1 just after the partition started? Those writes are still on n1, but never made it to n5. N5 proceeded without them. Now n1 and n5 are comparing notes, and n1 realizes that n5's optime is higher. N1 figures out the last point where the two agreed on the oplog, and rolls back to that point.

22:09:33 Sun May 19 05:09:33.032 [rsHealthPoll] replSet member n5:27017 is now in state PRIMARY
22:09:33 Sun May 19 05:09:33.207 [initandlisten] connection accepted from 10.10.3.95:37718 #6154 (23 connections now open)
22:09:33 Sun May 19 05:09:33.417 [rsBackgroundSync] replSet syncing to: n5:27017
22:09:33 Sun May 19 05:09:33.438 [rsBackgroundSync] replSet our last op time fetched: May 19 05:08:37:2
22:09:33 Sun May 19 05:09:33.438 [rsBackgroundSync] replset source's GTE: May 19 05:09:26:1
22:09:33 Sun May 19 05:09:33.438 [rsBackgroundSync] replSet rollback 0
22:09:33 Sun May 19 05:09:33.438 [rsBackgroundSync] replSet ROLLBACK
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet rollback 1
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet info rollback our last optime: May 19 05:08:37:2
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet info rollback their last optime: May 19 05:09:33:32
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet info rollback diff in end of log times: -56 seconds
22:09:35 Sun May 19 05:09:33.621 [initandlisten] connection accepted from 10.10.3.32:59066 #6155 (24 connections now open)
22:09:35 Sun May 19 05:09:35.221 [rsBackgroundSync] replSet rollback found matching events at May 19 05:08:24:66
22:09:35 Sun May 19 05:09:35.221 [rsBackgroundSync] replSet rollback findcommonpoint scanned : 3798
22:09:35 Sun May 19 05:09:35.221 [rsBackgroundSync] replSet replSet rollback 3 fixup
22:09:35 Sun May 19 05:09:35.222 [rsBackgroundSync] replSet rollback 3.5
22:09:35 Sun May 19 05:09:35.222 [rsBackgroundSync] replSet rollback 4 n:1
22:09:35 Sun May 19 05:09:35.222 [rsBackgroundSync] replSet minvalid=May 19 05:09:35 51985e8f:19
22:09:35 Sun May 19 05:09:35.222 [rsBackgroundSync] replSet rollback 4.6
22:09:35 Sun May 19 05:09:35.223 [rsBackgroundSync] replSet rollback 4.7
22:09:35 Sun May 19 05:09:35.223 [rsBackgroundSync] replSet rollback 5 d:0 u:1
22:09:35 Sun May 19 05:09:35.224 [rsBackgroundSync] replSet rollback 6
22:09:35 Sun May 19 05:09:35.236 [rsBackgroundSync] replSet rollback 7
22:09:35 Sun May 19 05:09:35.238 [rsBackgroundSync] replSet rollback done
22:09:35 Sun May 19 05:09:35.238 [rsBackgroundSync] replSet RECOVERING

During a rollback, all the writes the old primary accepted after the common point in the oplog are removed from the database and written to a BSON file in Mongo's rollbacks directory. If you're a sysadmin, you could go look at the rollback files to try and reconstruct the writes that the database dropped.

Well, theoretically. In my tests, it only does this in 1 out of 5 runs or so. Mostly, it just throws those writes away entirely: no rollback files, no nothing. I don't really know why.


This leads to an important discovery: it doesn't matter whether or not there were two primaries at the same time. We can still get conflicting writes if the old primary's state is causally unconnected from the new primary. A primary/secondary system, by itself, is not sufficient. We have to actually track causality on the writes themselves in order to be CP. Otherwise, newly elected primaries could diverge from the old one.

Safe

Aha! But that was with the old “unsafe” write concern! We should use the Safe write concern!

lein run mongo-safe -n 6000
...
6000 total
5900 acknowledged
3692 survivors
2208 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
458 463 468 473 478 483 ... 3075 3080 3085 3090 3095 3100
0.98333335 ack rate
0.3742373 loss rate
0.0 unacknowledged but successful rate

Replicas-safe

WriteConcern.SAFE only verifies that the write was accepted by the primary. We need to make sure that the replicas have received our write before considering it a success.

lein run mongo-replicas-safe -n 6000
...
6000 total
5695 acknowledged
3768 survivors
1927 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
712 717 722 727 732 737 ... 2794 2799 2804 2809 2814 2819
0.94916666 ack rate
0.338367 loss rate
0.0 unacknowledged but successful rate

Mongo still rolled back our writes. Why? Because REPLICAS_SAFE only checks to see if the write took place against two replicas. Our cluster has five nodes, so it's possible for writes to exist only on n1 and n2. A new primary can be elected without having seen our write. We need to wait until our write has been acknowledged by a majority of nodes.

Majority

lein run mongo -n 6000

Using WriteConcern.MAJORITY, we notice an improvement! When we cause the partition, writes pause immediately. The clients are blocked, waiting for the primary to confirm acknowledgement on nodes which will never respond. Eventually they time out. This is a hallmark of a CP system: we shouldn't be able to make progress without talking to a majority of nodes.

Writes completed in 157.425 seconds

6000 total
5700 acknowledged
5701 survivors
2 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
(596 598)
3 unacknowledged writes found! ヽ(´ー`)ノ
(562 653 3818)
0.95 ack rate
1.754386E-4 loss rate
5.2631577E-4 unacknowledged but successful rate

So 3 writes which supposedly failed actually succeeded. That's not so bad. On the other hand, Mongo still dropped two “successful” writes. Writes which were supposedly acknowledged by a majority of nodes.


I've been talking with 10gen, and they think this is a bug. When the network partitions, the server just checks off the “OK” field for the client's WriteConcern request, and sends it back. The client sees the “OK” message and… sensibly presumes the write was OK. This should be fixed in master, but is still present in 2.4.3, the most recent release.

Even if this bug is fixed, Mongo still isn't consistent. Those three writes which “failed” but showed up in the result set? Those are writes which were replicated to a majority node just prior to the partition, but never had the chance to acknowledge. Single writes are not atomic without a proper consensus protocol: those failed writes could materialize never, now, or some time in the future; potentially overwriting valid data.

Strategies for working with Mongo

On the one hand, Mongo advocates usually tell me “but network partitions are exceedingly rare in practice.” Then I talk to Mongo users who report their cluster fails over on a weekly basis. One thing to keep in mind is that heavy load–like seasonal writes, recovering from a crash, or performing a rollback–can slow a node down to the point where other nodes declare it dead. This is a partition. I've seen my test cluster perform dozens of rollbacks as nodes go unavailable attempting to elect a new primary. You should probably instrument your cluster to watch for these events in production.

As we've discussed before, one option is simply to accept data loss. Not all applications need consistency.

At the same time, you should watch those rollback files. Sometimes they don't appear even though they're supposed to, and not all data types will actually be rolled back. Conflicts in capped collections, for example, appear to simply discard all data in the collection past the conflict point by design.

People use capped collections for distributed queues. Think about that for a minute.

Moreover, a rollback file doesn't give you enough information to actually reconstruct the correct state of the system–at least in general. It's just a snapshot of “some state” the database had to discard. Because there's no well-defined ordering for these writes, you'll have to decide what that means for your particular data structures. If you can structure your documents as CRDTs and write a merge function, you'll be able to safely merge. If there's no conflicting copy of the document in the database, and you never delete those kinds of documents, you can restore it automatically. Immutable records can always be recovered, too.

Finally, you can drastically reduce the probability of write loss by using WriteConcern.MAJORITY. This is gonna impose a big performance hit. That's another hallmark of more-available CP systems.

To recap: MongoDB is neither AP nor CP. The defaults can cause significant loss of acknowledged writes. The strongest consistency offered has bugs which cause false acknowledgements, and even if they're fixed, doesn't prevent false failures.

In the next post, we'll talk about a database which emphasizes availability and partition tolerance: Riak.

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.