Please note: our followup analysis of 3.4.0-rc3 revealed additional faults in MongoDB’s replication algorithms which could lead to the loss of acknowledged documents–even with Majority Write Concern, journaling, and fsynced writes.
In May of 2013, we showed that MongoDB 2.4.3 would lose acknowledged writes at all consistency levels. Every write concern less than MAJORITY loses data by design due to rollbacks–but even WriteConcern.MAJORITY lost acknowledged writes, because when the server encountered a network error, it returned a successful, not a failed, response to the client. Happily, that bug was fixed a few releases later.
Since then I’ve improved Jepsen significantly and written a more powerful analyzer for checking whether or not a system is linearizable. I’d like to return to Mongo, now at version 2.6.7, to verify its single-document consistency. (Mongo 3.0 was released during my testing, and I expect they’ll be hammering out single-node data loss bugs for a little while.)
In this post, we’ll see that Mongo’s consistency model is broken by design: not only can “strictly consistent” reads see stale versions of documents, but they can also return garbage data from writes that never should have occurred. The former is (as far as I know) a new result which runs contrary to all of Mongo’s consistency documentation. The latter has been a documented issue in Mongo for some time. We’ll also touch on a result from the previous Jepsen post: almost all write concern levels allow data loss.
This analysis is brought to you by Stripe, where I now work on safety and data integrity–including Jepsen–full time. I’m delighted to work here and excited to talk about new consistency problems!
We’ll start with some background, methods, and an in-depth analysis of an example failure case, but if you’re in a hurry, you can skip ahead to the discussion.
Write consistency
First, we need to understand what MongoDB claims to offer. The Fundamentals FAQ says Mongo offers “atomic writes on a per-document-level”. From the last Jepsen post and the write concern documentation, we know Mongo’s default consistency level (Acknowledged) is unsafe. Why? Because operations are only guaranteed to be durable after they have been acknowledged by a majority of nodes. We must use write concern Majority to ensure that successful operations won’t be discarded by a rollback later.
Why can’t we use a lower write concern? Because rollbacks are only OK if any two versions of the document can be merged associatively, commutatively, and idempotently; i.e. they form a state-based CRDT. Our documents likely don’t have this property.
For instance, consider an increment-only counter, stored as a single integer field in a MongoDB document. If you encounter a rollback and find two copies of the document with the values 5 and 7, the correct value of the counter depends on when they diverged. If the initial value was 0, we could have had five increments on one primary, and seven increments on another: the correct value is 5 + 7 = 12. If, on the other hand, the value on both replicas was 5, and only two increments occurred on an isolated primary, the correct value should be 7. Or it could be any value in between!
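For contrast, here’s a sketch of a counter that is safe to merge across rollbacks: a state-based grow-only counter, which tracks one entry per node rather than a single integer. This is my own illustration in Clojure (the language Jepsen is written in), not anything Mongo provides:

```clojure
;; A minimal state-based G-counter sketch: each node increments only its
;; own entry, so two diverged copies can always be merged with an
;; element-wise max, and no increments are lost.
(defn increment [counter node]
  (update counter node (fnil inc 0)))

(defn merge-counters [a b]
  (merge-with max a b))

(defn counter-value [counter]
  (reduce + (vals counter)))

;; Two primaries diverge from a shared state of 5 (all increments on n1):
(let [a      (-> {:n1 5} (increment :n1) (increment :n1)) ; => {:n1 7}
      b      (-> {:n1 5} (increment :n2))                 ; => {:n1 5, :n2 1}
      merged (merge-counters a b)]                        ; => {:n1 7, :n2 1}
  (counter-value merged))
;; => 8: every increment on both sides survives the merge
```

With a bare integer there’s simply not enough information to write such a merge function; a rollback can only guess, and guessing loses data.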
And this assumes your operations (e.g. incrementing) commute! If they’re order-dependent, Mongo will allow flat-out invalid writes to go through, like claiming the same username twice, document ID conflicts, transferring $40 + $30 = $70 out of a $50 account which is never supposed to go negative, and so on.
Unless your documents are state-based CRDTs, the following table illustrates which Mongo write concern levels are actually safe:
| Write concern | Also called | Safe? |
|---|---|---|
| Unacknowledged | NORMAL | Unsafe: doesn’t even bother checking for errors |
| Acknowledged (new default) | SAFE | Unsafe: not even on disk or replicated |
| Journaled | JOURNAL_SAFE | Unsafe: ops could be illegal or just rolled back by another primary |
| Fsynced | FSYNC_SAFE | Unsafe: ditto; constraint violations & rollbacks |
| Replica Acknowledged | REPLICAS_SAFE | Unsafe: ditto; another primary might overrule |
| Majority | MAJORITY | Safe: no rollbacks (but check the fsync/journal fields) |
So: if you use MongoDB, you should almost always be using the Majority write concern. Anything less is asking for data corruption or loss when a primary transition occurs.
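For example, with the 2.x-era Java driver (callable from Clojure via interop), write concern is set on the connection or DB handle. A hedged sketch; the host, database, and collection names here are placeholders:

```clojure
(import '(com.mongodb MongoClient WriteConcern BasicDBObject))

;; Set Majority write concern on the DB handle, so every write is only
;; acknowledged once a majority of replica set members have it.
(let [client (MongoClient. "n1")
      db     (.getDB client "jepsen")
      coll   (.getCollection db "registers")]
  (.setWriteConcern db WriteConcern/MAJORITY)
  (.save coll (BasicDBObject. "_id" 0)))
```

Note that Majority still doesn’t make reads safe, as we’ll see below.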
Read consistency
The Fundamentals FAQ doesn’t just promise atomic writes, though: it also claims “fully-consistent reads”. The Replication Introduction goes on:
The primary accepts all write operations from clients. Replica set can have only one primary. Because only one member can accept write operations, replica sets provide strict consistency for all reads from the primary.
What does “strict consistency” mean? Mongo’s glossary defines it as
A property of a distributed system requiring that all members always reflect the latest changes to the system. In a database system, this means that any system that can provide data must reflect the latest writes at all times. In MongoDB, reads from a primary have strict consistency; reads from secondary members have eventual consistency.
The Replication docs agree: “By default, in MongoDB, read operations to a replica set return results from the primary and are consistent with the last write operation”, as distinct from “eventual consistency”, where “the secondary member’s state will eventually reflect the primary’s state.”
Read consistency is controlled by the Read Preference, which emphasizes that reads from the primary will see the “latest version”:
By default, an application directs its read operations to the primary member in a replica set. Reading from the primary guarantees that read operations reflect the latest version of a document.
It goes on to warn that:
…modes other than primary can and will return stale data because the secondary queries will not include the most recent write operations to the replica set’s primary.
All read preference modes except primary may return stale data because secondaries replicate operations from the primary with some delay. Ensure that your application can tolerate stale data if you choose to use a non-primary mode.
So, if we write at write concern Majority, and read with read preference Primary (the default), we should see the most recently written value.
CaS rules everything around me
When writes and reads are concurrent, what does “most recent” mean? Which states are we guaranteed to see? What could we see?
For concurrent operations, the absence of synchronized clocks prevents us from establishing a total order. We must allow each operation to come just before, or just after, any other in-flight writes. If we write `a`, initiate a write of `b`, then perform a read, we could see either `a` or `b` depending on which operation takes place first.
On the other hand, we obviously shouldn’t interact with a value from the future–we have no way to tell what those operations will be. The latest possible state an operation can see is the one just prior to the response received by the client.
And because we need to see the “most recent write operation”, we should not be able to interact with any state prior to the ops concurrent with the most recently acknowledged write. It should be impossible, for example, to write `a`, write `b`, then read `a`: since the writes of `a` and `b` are not concurrent, the second should always win.
This is a common strong consistency model for concurrent data structures called linearizability. In a nutshell, every successful operation must appear to occur atomically at some time between its invocation and response.
So in Jepsen, we’ll model a MongoDB document as a linearizable compare-and-set (CaS) register, supporting three operations:

- `write(x')`: set the register’s value to `x'`.
- `read(x)`: read the current value `x`. Only succeeds if the current value is actually `x`.
- `cas(x, x')`: if and only if the value is currently `x`, set it to `x'`.
We can express this consistency model as a single-threaded datatype in Clojure. Given a register containing a `value`, and an operation `op`, the `step` function returns the new state of the register–or a special `inconsistent` value if the operation couldn’t take place.
```clojure
;; Model and inconsistent are provided by Knossos, Jepsen's linearizability
;; checker.
(defrecord CASRegister [value]
  Model
  (step [r op]
    (condp = (:f op)
      ; A write always succeeds, and replaces the current value.
      :write (CASRegister. (:value op))
      ; A CaS succeeds iff the current value matches.
      :cas   (let [[cur new] (:value op)]
               (if (= cur value)
                 (CASRegister. new)
                 (inconsistent (str "can't CAS " value " from " cur
                                    " to " new))))
      ; A read succeeds iff it observes the current value; a read with a
      ; nil value places no constraint on the register.
      :read  (if (or (nil? (:value op))
                     (= value (:value op)))
               r
               (inconsistent (str "can't read " (:value op)
                                  " from register " value))))))
```
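For example, stepping a fresh register through a few operations by hand (results shown as comments; this exercises only the model, no Mongo involved):

```clojure
(-> (CASRegister. 0)
    (step {:f :write, :value 3})     ; => (CASRegister. 3)
    (step {:f :cas,   :value [3 1]}) ; => (CASRegister. 1)
    (step {:f :read,  :value 0}))
;; => an inconsistent state: "can't read 0 from register 1"
```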
Then we’ll have Jepsen generate a random mix of read, write, and CaS operations, and have five clients apply those ops to a five-node Mongo cluster over the course of a few minutes, while a special nemesis process creates and resolves network partitions to induce cluster transitions.
Finally, we’ll have Knossos analyze the resulting concurrent history of all clients’ operations, in search of a linearizable path through the history.
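For the curious, the schedule looks roughly like this in Jepsen’s generator language (a sketch; the exact timings in the real test differ):

```clojure
(require '[jepsen.generator :as gen])

;; Each client op is an :invoke map; Knossos checks the completed history.
(defn r   [_ _] {:type :invoke, :f :read,  :value nil})
(defn w   [_ _] {:type :invoke, :f :write, :value (rand-int 5)})
(defn cas [_ _] {:type :invoke, :f :cas,   :value [(rand-int 5) (rand-int 5)]})

;; Clients run a random mix of reads, writes, and CaS ops, while the
;; nemesis alternately cuts the network and heals it.
(->> (gen/mix [r w cas])
     (gen/stagger 1)
     (gen/nemesis
       (gen/seq (cycle [(gen/sleep 30)
                        {:type :info, :f :start}
                        (gen/sleep 60)
                        {:type :info, :f :stop}])))
     (gen/time-limit 300))
```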
Inconsistent reads
Surprise! Even when all writes and CaS ops use the Majority write concern, and all reads use the Primary read preference, operations on a single document in MongoDB are not linearizable. Reads taking place just after the start of a network partition demonstrate impossible behaviors.
In this history, an anomaly appears shortly after the nemesis isolates nodes n1 and n3 from n2, n4, and n5. Each line shows a single-threaded process (e.g. `2`) performing (e.g. `:invoke`) an operation (e.g. `:read`), with a value (e.g. `3`).
An `:invoke` indicates the start of an operation. If it completes successfully, the process logs a corresponding `:ok`. If it fails (by which we mean the operation definitely did not take place) we log `:fail` instead. If the operation crashes–for instance, if the network drops, a machine crashes, a timeout occurs, etc.–we log an `:info` message, and that operation remains concurrent with every subsequent op in the history. Crashed operations could take effect at any future time.
```
Not linearizable. Linearizable prefix was:
2 :invoke :read 3
4 :invoke :write 3
...
4 :invoke :read 0
4 :ok :read 0
:nemesis :info :start "Cut off {:n5 #{:n3 :n1},
                                :n2 #{:n3 :n1},
                                :n4 #{:n3 :n1},
                                :n1 #{:n4 :n2 :n5},
                                :n3 #{:n4 :n2 :n5}}"
1 :invoke :cas [1 4]
1 :fail :cas [1 4]
3 :invoke :cas [4 4]
3 :fail :cas [4 4]
2 :invoke :cas [1 0]
2 :fail :cas [1 0]
0 :invoke :read 0
0 :ok :read 0
4 :invoke :read 0
4 :ok :read 0
1 :invoke :read 0
1 :ok :read 0
3 :invoke :cas [2 1]
3 :fail :cas [2 1]
2 :invoke :read 0
2 :ok :read 0
0 :invoke :cas [0 4]
4 :invoke :cas [2 3]
4 :fail :cas [2 3]
1 :invoke :read 4
1 :ok :read 4
3 :invoke :cas [4 2]
2 :invoke :cas [1 1]
2 :fail :cas [1 1]
4 :invoke :write 3
1 :invoke :read 3
1 :ok :read 3
2 :invoke :cas [4 2]
2 :fail :cas [4 2]
1 :invoke :cas [3 1]
2 :invoke :write 4
0 :info :cas :network-error
2 :info :write :network-error
3 :info :cas :network-error
4 :info :write :network-error
1 :info :cas :network-error
5 :invoke :write 1
5 :fail :write 1
... more failing ops which we can ignore since they didn't take place ...
5 :invoke :read 0
Followed by inconsistent operation:
5 :ok :read 0
```
This is a little easier to understand if we sketch a diagram of the last few operations–the events occurring just after the network partition begins. In this notation, time moves from left to right, and each process is a horizontal track. A green bar shows the interval between `:invoke` and `:ok` for a successful operation. Processes that crashed with `:info` are shown as yellow bars running to infinity–they’re concurrent with all future operations.
In order for this history to be linearizable, we need to find a path which always moves forward in time, touches each OK (green) operation once, and may touch crashed (yellow) operations at most once. Along that path, the rules we described for a CaS register should hold–writes set the value, reads must see the current value, and compare-and-set ops set a new value iff the current value matches.
The very first operation here is process 2’s read of the value `0`. We know the value must be zero when it completes, because there are no other concurrent ops.
The next read we have to satisfy is process 1’s read of `4`. The only operation that could have taken place between those two reads is process 0’s crashed CaS from `0` to `4`, so we’ll use that as an intermediary.
Note that this path always moves forward in time: linearizability requires that operations take effect sometime between their invocation and their completion. We can choose any point along the operation’s bar for the op to take effect, but can’t travel through time by drawing a line that moves back to the left! Also note that along the purple line, our single-threaded model of a register holds: we read `0`, change `0` to `4`, then read `4`.
Process 1 goes on to read a new value: `3`. We have two operations which could take place before that read. Executing a compare-and-set from `4` to `2` would be legal because the current value is `4`, but that would conflict with a subsequent read of `3`. Instead we apply process 4’s write of `3` directly.
Now we must find a path leading from `3` to the final read of `0`. We could write `4`, and optionally CaS that `4` to `2`, but neither of those values is consistent with a read of `0`.
We could also CaS `3` to `1`, which would give us the value `1`, but that’s not `0` either! And applying the write of `4` and its dependent paths won’t get us anywhere either; we already showed those are inconsistent with reading `0`. We need a write of `0`, or a CaS to `0`, for this history to make sense–and no such operation could possibly have happened here.
Each of these failed paths is represented by a different “possible world” in the Knossos analysis. At the end of the file, Knossos comes to the same conclusion we drew from the diagram: the register could have been `1`, `2`, `3`, or `4`–but not `0`. This history is illegal.
It’s almost as if process 5’s final read of zero is connected to the state that process 2 saw, just prior to the network partition. As if the state of the system weren’t linear after all–a read was allowed to travel back in time to that earlier state. Or, alternatively, the state of the system split in two–one for each side of the network partition–writes occurring on one side, and on the other, the value remaining `0`.
Dirty reads and stale reads
We know from this test that MongoDB does not offer linearizable CaS registers. But why?
One possibility is that this is the result of Mongo’s read-uncommitted isolation level. Although the docs say “For all inserts and updates, MongoDB modifies each document in isolation: clients never see documents in intermediate states,” they also warn:
MongoDB allows clients to read documents inserted or modified before it commits these modifications to disk, regardless of write concern level or journaling configuration…
For systems with multiple concurrent readers and writers, MongoDB will allow clients to read the results of a write operation before the write operation returns.
Which means that clients can read documents in intermediate states. Imagine the network partitions, and for a brief time, there are two primary nodes–each sharing the initial value `0`. Only one of them, connected to a majority of nodes, can successfully execute writes with write concern Majority. The other, which can only see a minority, will eventually time out and step down–but this takes a few seconds.
If a client writes (or CaS’s) the value on the minority primary to `1`, MongoDB will happily modify that primary’s local state before confirming with any secondaries that the write is safe to perform! It’ll be changed back to `0` as part of the rollback once the partition heals–but until that time, reads against the minority primary will see `1`, not `0`! We call this anomaly a dirty read because it exposes garbage temporary data to the client.
Alternatively, no changes could occur on the minority primary–or whatever writes do take place leave the value unchanged. Meanwhile, on the majority primary, clients execute successful writes or CaS ops, changing the value to `1`. If a client executes a read against the minority primary, it will see the old value `0`, even though a write of `1` has taken place.
Because Mongo allows dirty reads from the minority primary, we know it must also allow clean but stale reads against the minority primary. Both of these anomalies are present in MongoDB.
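To make the two anomalies concrete, here are a pair of hypothetical mini-histories in the op notation from above (illustrative sketches, not output from an actual run):

```clojure
;; Dirty read: a write applied on the minority primary is visible to a
;; reader before it's acknowledged, and is later rolled back.
[{:process 0, :type :invoke, :f :write, :value 1}  ; sent to minority primary
 {:process 1, :type :invoke, :f :read}
 {:process 1, :type :ok,     :f :read,  :value 1}  ; dirty: sees unacked 1
 {:process 0, :type :info,   :f :write, :value 1}] ; times out; 1 rolled back

;; Stale read: a majority-acknowledged write is invisible to a reader
;; still talking to the minority primary.
[{:process 0, :type :invoke, :f :write, :value 1}
 {:process 0, :type :ok,     :f :write, :value 1}  ; committed on majority side
 {:process 1, :type :invoke, :f :read}
 {:process 1, :type :ok,     :f :read,  :value 0}] ; stale: the old value
```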
Dirty reads are a known issue, but to my knowledge, nobody is aware that Mongo allows stale reads at the strongest consistency settings. Since Mongo’s documentation repeatedly states this anomaly should not occur, I’ve filed SERVER-17975.
What does that leave us with?
Even with Majority write concern and Primary read preference, reads can see old versions of documents. You can write `a`, then write `b`, then read `a` back. This invalidates a basic consistency invariant for registers: Read Your Writes.
Mongo also allows you to read alternating fresh and stale versions of a document–successive reads could see `a`, then `b`, then `a` again, and so on. This violates Monotonic Reads.
Because both of these consistency models are implied by the PRAM memory model, we know MongoDB cannot offer PRAM consistency–which in turn rules out causal, sequential, and linearizable consistency.
Because Mongo allows you to read uncommitted garbage data, we also know MongoDB can’t offer Read Committed, Cursor Stability, Monotonic Atomic View, Repeatable Read, or serializability. On the map of consistency models to the right, this leaves Writes Follow Reads, Monotonic Writes, and Read Uncommitted–all of which, incidentally, are totally-available consistency models even when partitions occur.
Conversely, Mongo’s documentation repeatedly claims that one can read “the latest version of a document”, and that Mongo offers “immediate consistency”. These properties imply that reads, writes, and CaS against a single document should be linearizable. Linearizable consistency implies that Mongo should also ensure:
- Sequential consistency: all processes agree on op order
- Causal consistency: causally related operations occur in order
- PRAM: the pipelined RAM memory model
- Read Your Writes: a process can only read data from after its last write
- Monotonic Reads: a process’s successive reads must occur in order
- Monotonic Writes: a process’s successive writes must occur in order
- Writes Follow Reads: a process’s writes must logically follow its last read
Mongo 2.6.7 (and presumably 3.0.0, which has the same read-uncommitted semantics) only ensures the last two invariants. I argue that this is neither “strict” nor “immediate” consistency.
How bad are dirty reads?
Read uncommitted allows all kinds of terrible anomalies we probably don’t want as MongoDB users.
For instance, suppose we have a user registration service keyed by a unique username. Now imagine a partition occurs, and two users–Alice and Bob–try to claim the same username, one on each side of the partition. Alice’s request is routed to the majority primary, and she successfully registers the account. Bob, talking to the minority primary, will see his account creation request time out. The minority primary will eventually roll his account back to a nonexistent state, and when the partition heals, accept Alice’s version.
But until the minority primary detects the failure, Bob’s invalid user registration will still be visible for reads. After registration, the web server redirects Alice to `/my/account` to show her the freshly created account. However, this HTTP request winds up talking to a server whose client still thinks the minority primary is valid–and that primary happily responds to a read request for the account with Bob’s information.
Alice’s page loads, and in place of her own data, she sees Bob’s name, his home address, his photograph, and other things that never should have been leaked between users.
You can probably imagine other weird anomalies. Temporarily visible duplicate values for unique indexes. Locks that appear to be owned by two processes at once. Clients seeing purchase orders that never went through.
Or consider a reconciliation process that scans the list of all transactions a client has made to make sure their account balance is correct. It sees an attempted but invalid transaction that never took place, and happily sets the user’s balance to reflect that impossible transaction. The mischievous transaction subsequently disappears on rollback, leaving customer support to puzzle over why the numbers don’t add up.
Or, worse, an admin goes in to fix the rollback, assumes the invalid transaction should have taken place, and applies it to the new primary. The user sensibly retries their failed purchase, so they wind up paying twice for the same item. Their account balance goes negative. They get hit with an overdraft fine and have to spend hours untangling the problem with support.
Read-uncommitted is scary.
What if we just fix read-uncommitted?
Mongo’s engineers initially closed the stale-read ticket as a duplicate of SERVER-18022, which addresses dirty reads. However, as the diagram illustrates, only showing committed data on the minority primary will not solve the problem of stale reads. Even if the minority primary accepts no writes at all, successful writes against the majority node will allow us to read old values.
That invalidates Monotonic Read, Read Your Writes, PRAM, causal, sequential, and linearizable consistency. The anomalies are less terrifying than Read Uncommitted, but still weird.
For instance, let’s say two documents are always supposed to be updated one after the other–like creating a new comment document, and writing a reference to it into a user’s feed. Stale reads allow you to violate the implicit foreign key constraint there: the user could load their feed and see a reference to comment `123`, but looking up comment `123` in Mongo returns `Not Found`.
A user could change their name from Charles to Désirée, have the change go through successfully, but when the page reloads, still see their old name. Seeing old values could cause users to repeat operations that they should have only attempted once–for instance, adding two or three copies of an item to their cart, or double-posting a comment. Your clients may be other programs, not people–and computers are notorious for retrying “failed” operations very fast.
Stale reads can also cause lost updates. For instance, a web server might accept a request to change a user’s profile photo information, write the new photo URL to the user record in Mongo, and contact the thumbnailer service to generate resized copies and publish them to S3. The thumbnailer sees the old photo, not the newly written one. It happily resizes it, and uploads the old avatar to S3. As far as the backend and thumbnailer are concerned, everything went just fine–but the user’s photo never changes. We’ve lost their update.
What if we just pretend this is how it’s supposed to work?
Mongo then closed the ticket, claiming it was working as designed.
What are databases? We just don’t know.
So where does that leave us?
Despite claiming “immediate” and “strict consistency”, with reads that see “the most recent write operations” and “the latest version”, MongoDB, even at the strongest consistency levels, allows reads to see old values of documents, or even values that never should have been written. Even once Mongo supports Read Committed isolation, the stale read problem will likely persist without a fundamental redesign of the read process.
SERVER-18022 targets consistent reads for Mongo 3.1. SERVER-17975 has been re-opened, and I hope Mongo’s engineers will consider the problem and find a way to address it. Preventing stale reads requires coupling the read process to the oplog replication state machine in some way, which will probably be subtle–but Consul and etcd managed to do it in a few months, so the problem’s not insurmountable!
In the meantime, if your application requires linearizable reads, I can suggest a workaround! CaS operations appear (insofar as I’ve been able to test) to be linearizable, so you can perform a read, then try to findAndModify the current value, changing some irrelevant field. If that CaS succeeds, you know the document had that state at some time between the invocation of the read and the completion of the CaS operation. You will, however, incur an IO and latency penalty for the extra round-trips.
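Sketched against the 2.x Java driver, such a verified read might look like the following. The `verified-read` helper, the field names, and the random `token` trick are my own hypothetical illustration (bumping an irrelevant field avoids the identity-CaS optimization mentioned below):

```clojure
(import '(com.mongodb BasicDBObject))

;; Read a document, then findAndModify it: match on the value we just read,
;; and bump an irrelevant token field. If the CaS matches, the document
;; really held that value at some point between the read's invocation and
;; the CaS's completion.
(defn verified-read [coll id]
  (let [doc (.findOne coll (BasicDBObject. "_id" id))]
    (when doc
      (let [value  (.get doc "value")
            query  (doto (BasicDBObject. "_id" id)
                     (.append "value" value))
            update (BasicDBObject. "$set"
                     (BasicDBObject. "token" (rand-int Integer/MAX_VALUE)))]
        ;; findAndModify returns nil when nothing matched, i.e. when the
        ;; document no longer holds the value we read.
        (when (.findAndModify coll query update)
          value)))))
```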
And remember: always use the Majority write concern, unless your data is structured as a CRDT! If you don’t, you’re looking at lost updates, and that’s way worse than any of the anomalies we’ve discussed here!
Finally, read the documentation for the systems you depend on thoroughly–then verify their claims for yourself. You may discover surprising results!
Next, on Jepsen: Elasticsearch 1.5.0
My thanks to Stripe and Marc Hedlund for giving me the opportunity to perform this research, and a special thanks to Peter Bailis, Michael Handler, Coda Hale, Leif Walsh, and Dan McKinley for their comments on early drafts. Asya Kamsky pointed out that Mongo will optimize away identity CaS operations, which prevents their use as a workaround.