So good to have you back roaming your natural habitat, oh destroyer of databases.
Juan, on
Hi Kyle, a very interesting blog.
It is good to see people doing thorough tests like this, good job.
I have been using MongoDB for a while and for obvious reasons I am very interested in your findings.
I’m happy to take your finding about the dirty read scenario at face value, where a Majority write can be read before it is fully propagated. Even though it’s not ideal, this is something I could probably live with, and I hope the MongoDB team fixes it.
The finding that worries me more is the one where you state that MongoDB reads are not linearizable. Looking at the details above, I feel that I am missing additional information before I can say that I agree with your conclusions. Don’t get me wrong, you may still be right, but the information provided is not enough for me.
For example, looking at the case where process 5 reads a 0, when I look in the attached full log I can see the following just before:

9  :fail    :cas    [3 0]
6  :fail    :write  4
7  :fail    :write  4
5  :invoke  :read   0

Followed by the inconsistent operation:

5  :ok      :read   0
Now, the thing that is bugging me is that process 9 has just tried to change the value to 0. Your logs state that the attempt failed, but did it really? Maybe the write worked but the response got lost because a new network partition cut it off before the client got the result. Unfortunately there is no way to tell whether that is what happened.
One way this could be resolved is by ensuring that each write (write or CaS) writes not only the value but also a UUID. MongoDB is a document DB after all, so it shouldn’t be too difficult to add the UUID as an extra field without affecting the essence of your test.
The writes would log the UUID on the output, and the reads would display the UUID from the record they read. This way you would always know with certainty which write produced the result of each read, and thus it would be a lot easier to prove whether the writes are linearizable or not.
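In rough code terms, what I have in mind is something like this (a sketch only, using the monger Clojure client purely for illustration; the collection and field names are made up, and I can’t vouch for the exact API):

```clojure
(ns tagged-writes.sketch
  (:require [monger.core :as mg]
            [monger.collection :as mc]
            [monger.operators :refer [$set]])
  (:import java.util.UUID))

(defn tagged-write!
  "Write value together with a unique tag and log both, so a later
  read can be matched to the exact write that produced its value."
  [db value]
  (let [tag (str (UUID/randomUUID))]
    (mc/update db "registers" {:_id 0}
               {$set {:value value :uuid tag}}
               {:upsert true})
    (println :write value :uuid tag)
    tag))

(defn tagged-read
  "Read the register and report which write's tag it carries."
  [db]
  (let [doc (mc/find-one-as-map db "registers" {:_id 0})]
    (println :read (:value doc) :uuid (:uuid doc))
    doc))

;; usage sketch: (tagged-write! (mg/get-db (mg/connect) "jepsen") 4)
```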
Do you think you could tweak your test with a UUID as described above? I’m afraid Clojure is not my thing, and I wouldn’t feel confident making these changes properly myself.
Kind regards,
Juan
Alvaro, on
Does this problem affect only distributed Mongo databases? (i.e. on more than one server, with sharding, replicas or whatever)
How reliable is MongoDB when used as a single instance on one server?
Is it consistent then, as @march says?
Klaus Mogensen, on
MongoDB is simply not for primary data storage. It’s fine for secondary data storage for special purposes.
geodge, on
You mean the consensus of Elasticsearch is provided by CRDTs?
Giorgio Sironi, on
Would quorum reads (contacting a majority of nodes in the replica set) solve the problem on the MongoDB driver side, assuming the values are returned with some kind of metadata, such as the oplog version?
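Something along these lines is what I imagine (a toy sketch in plain Clojure; the response shape with an :optime field is an assumption on my part, not any real driver API):

```clojure
;; A client-side "quorum read": given one response per node, wait for at
;; least a majority of them, then take the value with the highest oplog
;; position, so a lagging or partitioned node cannot win with stale data.
(defn quorum-read
  [responses majority]
  (when (>= (count responses) majority)
    (:value (apply max-key :optime responses))))

;; With a 3-node replica set (majority = 2), the stale node n1 is outvoted:
(quorum-read [{:node "n1" :value 0 :optime 12}
              {:node "n2" :value 1 :optime 15}
              {:node "n3" :value 1 :optime 15}]
             2)
;; => 1
```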
Garry Groucer, on
This is really a nice article.
Alexander, on
@john: Although I agree that Mongo has a bad track record, you are marginalizing the efforts of the Mongo team by not acknowledging that it’s significantly more difficult to achieve serializability in a distributed system than in a centralized one such as SQL Server.
IMO, the reception of Kyle’s ticket says more about how much you should trust Mongo. After all, most databases started out simple and established themselves in their niche by continuously improving and adapting to customer requirements. The key here is to be attentive. Wanting to close a potentially very valuable improvement ticket at all costs is an indication of a really dangerous attitude in the business of data integrity.
On the other hand, Mongo’s niche was always ease of use and easy distributed setup, at the cost of decreased safety and data integrity. So it is possible that they are making the right business decision in downplaying this issue. Even so, I applaud Kyle for making this abundantly clear and calling them out on their deceptive marketing.
march, on
First of all, MongoDB claims atomic writes at the document level. That’s been true for a long time, and nothing here shows it not to be true. It means that if you change two fields in the same document in a single update, other clients will never see the document with only one field changed: if you see the new document, you see it with all of the update applied.
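For instance (a minimal sketch with the monger Clojure client; the connection details and names are illustrative):

```clojure
(require '[monger.core :as mg]
         '[monger.collection :as mc]
         '[monger.operators :refer [$set]])

(let [conn (mg/connect)                ; assumes a mongod on localhost
      db   (mg/get-db conn "test")]
  ;; Both fields change in one document-level update: a concurrent
  ;; reader sees either the old document or the new one, never a
  ;; document with only one of the two fields changed.
  (mc/update db "people" {:_id 42}
             {$set {:given "X" :family "Y"}}
             {:upsert true}))
```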
For those of you saying your favorite RDBMS doesn’t have problems with stale reads, etc.: come back and say that again when your RDBMS supports a distributed topology and still preserves your strong consistency during network partitions.
If you run a single MongoDB server, you can’t have more than one primary (duh) and you have your consistency. Surprisingly, people seem to want HA as well though.
Dave, on
Mongo did not start offering atomic writes at the per-document level until version 3.0. You even post the 3.0 FAQ URL. But then you say “I’m testing 2.6.7.” You have successfully proven that the 3.0 features do not exist in 2.6.7.
Alvaro, on
For those interested, there’s ToroDB. It is an open source database that implements the Mongo protocol and stores data on reliable PostgreSQL. As of today (April 2015) it is still under heavy development, but we will go through Jepsen tests ASAP. We build on read committed transactions for writes and repeatable read for reads, which should protect against most of the issues encountered here. We will post updates about this ASAP.
ku, on
Thank you for the update about 2.6, Nickola.
no, on
“I work for MarkLogic and we are the only ACID compliant document database”
Postgres 9.4 has JSONB (and JSON) data types for fields, so this isn’t true. You can store documents in Postgres fields now without needing an entire NoSQL database to get that functionality.
Kash Badami, on
First off, thank you so much for writing such a detailed, insightful post. It is simply fantastic, and we are very grateful for the time you took to break down the consistency issues in general, not just with MongoDB. Building a database that is highly distributed and yet supports ACID transactions is difficult. My understanding is that Oracle did not have ACID transactions until version 7 of the product; in other words, it lost data in version 6, when it was only sorta ACID compliant.
MarkLogic has been around for 13 years, and we compete with MongoDB. We are not open source, and it is painful to watch folks spend a huge amount of time and money only to discover that these consistency problems won’t allow them to actually go into production and run their businesses on the technology. Obviously I work for MarkLogic and we are the only ACID compliant document database, running over 600 large-scale applications in production for major banks, trading companies, healthcare, the DoD, and intelligence agencies.
I would recommend that anyone who requires ACID transactions, horizontal scale, a document model, and schema-agnostic storage look at MarkLogic. It’s easy to set up, with a 5-minute install and 5-minute scale-out. It’s been around for a long time, it scales beautifully, and it has been ACID compliant since version 1 of the product.
Kenneth Owens, on
Disregard the above, please. I was able to reproduce this condition trivially by introducing a network partition in which the previous primary was now in a minority partition. The majority partition elected a new primary before the primary in the minority partition stepped down. Thus, in the interval between the successful primary election in the majority and the primary step-down in the minority, the system (there were now effectively two distributed systems operating independently of each other) maintained two simultaneous primaries for the same data set.
The statement in the Mongo documentation is therefore correct. My interpretation, and my assumption that the system maintained either availability or consistency under network partition, were erroneous. I do, however, feel that this product should include a big red warning label to the effect of “Not Consistent or Available Under Network Partition” or “Not a Distributed System by the Traditional Definition”.
Kenneth Owens, on
After reading this post and the Mongo documentation, I’m a bit confused. Your post claims that during a network partition, a minority subset of a replica set can host a primary member.
“Imagine the network partitions, and for a brief time, there are two primary nodes–each sharing the initial value 0. Only one of them, connected to a majority of nodes, can successfully execute writes with write concern Majority. The other, which can only see a minority, will eventually time out and step down–but this takes a few seconds.”
This seems to strongly contradict the Mongo documentation pertaining to leader election.
“A replica set member cannot become primary unless it can connect to a majority of the members in the replica set. For the purposes of elections, a majority refers to the total number of votes, rather than the total number of members.”
I read the above as, “A primary can exist if and only if it has a connection to all members of a majority partition of the cluster”
After examining the multiple misleading discrepancies you’ve pointed out in the Mongo documentation, I am not strongly inclined to assume its veracity. However, can you please confirm your claim with respect to the behavior of a replica set under partition? Could you also describe the circumstances under which you were able to make this occur?
Thanks,
-Ken
Davide, on
Thanks. Just thanks. This article is just perfect for showing why any company of any size should evaluate new technologies very carefully before jumping on them just because they are “cool” or, better, “webscale”.
mtrycz, on
john, can you run MS SQL Server through Jepsen and post the results?
Richard, on
Blame the investment industry, not an individual company!
Venture capital is interested in a quick multiple return on its investment. To achieve this it has to generate or fuel “bubbles”. It really doesn’t matter what the bubble is about: last year NoSQL, this year containers (e.g. Docker) and platforms. It really doesn’t matter whether the technologies actually address real issues. The only things of importance to the investor are:
1: Is the technology trendy?
2: Can it damage an existing vendor?
MongoDB was trendy, and MongoDB could have taken revenue from the established database folks. Hence lots of investment.
However, reality eventually bites. It takes time, effort, and talented engineers to build robust distributed software systems. Screw that! Where’s the next bubble?
BTW - awesome site. (I’m not US West Coast - I’m a Brit - and so I do use that word sparingly). ;)
Prakash, on
Regarding the question “How do the ‘ifs’ in your serializability example force an order?”:
You said that “They force an order because there is only one sequence in which those operations could take place.”
How is that possible when we are considering execution in a concurrent environment? For example, we can have different threads: one checking whether x is 3 and then printing it, and another setting x to 2. Can you please explain how the sequence is maintained in this scenario?
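To make my scenario concrete (a toy sketch; I may well be misreading the example):

```clojure
(def x (atom 3))

;; Thread A: if x is 3, print it.
(def a (future (let [v @x]
                 (when (= 3 v)
                   (println v)))))

;; Thread B: set x to 2.
(def b (future (reset! x 2)))
```

Depending on which thread runs first, the print may or may not happen, so I don’t see how a single sequence is forced.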
john, on
See? This is why I hate these teenager-fresh-out-of-school-into-the-database-business startup kiddies. This is why startup culture sucks. Microsoft SQL Server 2012 is not the best because it’s Microsoft behind it, no. It’s the best because it has 25+ years behind it. Trust me, at this point all its features have been tested so much, and serializability is such a basic ACID requirement, that you can be sure you won’t have problems even if the whole data center gets bombed by Iraqi soldiers. That’s the difference. MongoDB? A bunch of script kiddies decided they could create a database over a cup of coffee and a few beers. This is the result. Basic database functionality is completely flawed. Not just flawed - completely, utterly flawed.
or29544, on
OK, so from what you are saying… Mongo should be banned. Because I can’t use a database that does not work with the default settings. I mean, I am not going to start configuring my transactions to use Majority. This is just unacceptable. What I understand from your article is that Mongo is fundamentally flawed, and that for minimally reliable work I need to use Majority writes, which are not even the default. And even worse, ALL the other write concerns are bad and insufficient. WHAT THE FUCK IS THIS? Who the fuck wrote this piece of shit database? And then, as a conclusion, you talk about workarounds? SERIOUSLY? And you give your readers a pat on the back saying that the problem is not that hard to fix? Are you fucking joking, lad? This Mongo behavior is SO FLAWED that Mongo should disappear from the database scene this instant. You don’t just FIX this kind of problem. Maybe I was planning to use Mongo for a fucking banking application, or to gather radio signals for SETI - you think I am interested in the workaround?
anonymous, on
Stripe totally uses Mongo :P
Fred Smith, on
Yikes!
As a Stripe customer, I sure hope you’re not using Mongo to keep track of my transactions.
Max G. Faraday, on
Hey,
Love this… a great, exceptional body of work you have here. I am sure I am not the only one with this sentiment: THANK YOU. I have been coding for 24 years now, and in Clojure for the last few. Thanks, really good job.
Dan Jay, on
Awesome post. Love it.
ignace, on
Hi Kyle,
I’m stuck on the log macro (#4), not because I haven’t tried (I did try to get into what’s being asked) but because I have a problem understanding the particulars of the question. I don’t want tips, but would it be possible to phrase the problem another way, as if for a totally new person who wants to learn Clojure?
Ankit Malpani, on
Could you redo these tests on Elasticsearch 1.5, which was released recently? Several resiliency and consistency issues have been fixed since your tests.
leandro moreira, on
Hi,
Thanks for this amazing series. I think you might need to fix one thing:
“After the second def, it had one entry.” should read “After the second def, it had two entries.”
Tony, on
Would protocol buffers work just as well as option maps in a language like Go, where creating a new PB looks a lot like your option map code?
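For reference, by option maps I mean something like this (a trivial sketch; the names are made up):

```clojure
;; An option map: callers pass only the keys they care about,
;; and :or supplies defaults for the rest.
(defn connect
  [host & [{:keys [port timeout-ms]
            :or   {port 27017 timeout-ms 5000}}]]
  {:host host :port port :timeout-ms timeout-ms})

(connect "db1.example.com")               ; all defaults
(connect "db1.example.com" {:port 27018}) ; override a single option
```

In Go, the rough equivalent would be populating only some fields of a generated protobuf struct and relying on zero values for the rest.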
Bill Smith, on
@Paul I think he used the correct term. A “false positive” is an outcome that seems to find a problem when in fact there is no problem. The client perceived an error, but as you pointed out, the write succeeded.
tofubland, on
Thank you for this general overview of programming. I’m currently learning Java and Unity3D related tools in an effort to one day build with my imagination. Just remember, it isn’t talent, experience, or even skill which drives us to do things; it is passion, that which keeps us going.