The legend on the latency graph isn’t really well placed IMO.
DD, on
upside down?
Juan, on
Clients balance requests between multiple nodes and may retry an operation against multiple nodes; this isn’t a particularly useful piece of information for Mongo.
As it stands, from the contents of the article alone you cannot prove that your conclusions are correct; they are just one of several possibilities, including some in which Mongo is working as expected.
For example:
Or, alternatively, the state of the system could have split in two, one for each side of the network partition, with writes occurring on one side and the value remaining 0 on the other.
If you had attached a UUID to each operation, as I suggested, along with the node each operation goes to (or used any other approach that produces similar data), you would know what happened there instead of having to hypothesise.
In fact, based on my Mongo knowledge, that is exactly what happened, as I mentioned in my second post. In a partition, the minority partition will go into read-only mode: it won’t accept writes, but it will accept reads. Mongo is normally CP, but during a partition the minority partition(s) behave more like AP. These are the kinds of trade-offs you have to make when choosing a DB for your use case. It doesn’t mean Mongo is broken; reading a stale value is more desirable in a lot of use cases than getting an error.
Kind regards,
Juan
Don’t use a UUID if you don’t want to, but you should provide a way to identify the node that takes each operation.
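A minimal sketch of the kind of per-operation tagging being suggested here, with hypothetical names rather than the actual Jepsen or driver code: each write carries a client-generated UUID and records which node it was sent to, so a lost response can still be matched against the server-side oplog afterwards.

import java.util.UUID;

// Hypothetical sketch: tag every operation with a client-side UUID and the
// node it was sent to, so a lost response can still be matched against the
// oplog afterwards. Names here are stand-ins, not the real driver API.
public class TaggedWrite {
    final UUID opId = UUID.randomUUID();  // unique per attempt
    final String targetNode;              // recorded before sending, e.g. "n1"
    final int value;

    TaggedWrite(String targetNode, int value) {
        this.targetNode = targetNode;
        this.value = value;
    }

    public static void main(String[] args) {
        TaggedWrite w = new TaggedWrite("n1", 4);
        // Log this line on the client before sending the document
        // {value: 4, opId: w.opId} to w.targetNode. If the response is lost,
        // searching the oplog for opId tells us whether the write took effect.
        System.out.println("attempting op=" + w.opId
            + " node=" + w.targetNode + " value=" + w.value);
    }
}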
Larry Weya, on
Great post. ES is my go-to search DB, but after some mishaps with MongoDB I made it a rule to always store my primary data in an RDBMS (read: Postgres) and resync whenever necessary. I had no idea ES had such reliability issues. Thanks for sharing.
Elliott Clark, on
So here’s my quick trace of the read() syscall, to try and validate that read() will always hit the page cache.
The read syscall is defined here:
github.com/torvalds/linux/blob/master/fs/read_write.c#L562
That then goes to vfs_read:
github.com/torvalds/linux/blob/master/fs/read_write.c#L440
vfs_read calls __vfs_read, which will call new_sync_read.
new_sync_read will ask the underlying filesystem implementation for a read_iter. Most filesystems will use the generic one:
github.com/torvalds/linux/blob/master/fs/ext4/file.c#L625
github.com/torvalds/linux/blob/master/mm/filemap.c#L1689
generic_file_read_iter will then (assuming no O_DIRECT) call do_generic_file_read
github.com/torvalds/linux/blob/master/mm/filemap.c#L1461
do_generic_file_read will then get all of its data from the page cache. So read should always get the dirty page from the page cache.
OK, so looking into it more, it looks like ES is actually buffering in user space before calling into the kernel’s write():
github.com/elastic/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/translog/fs/BufferingFsTranslogFile.java
The BufferingFsTranslogFile uses WrapperOutputStream, which just copies data into an array:
github.com/elastic/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/translog/fs/BufferingFsTranslogFile.java#L247
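A toy illustration of the distinction above, under the assumption that the trace is right (plain java.io, not Elasticsearch’s actual translog classes): bytes sitting in a user-space buffer are invisible to other readers, while bytes handed to the kernel via write() are served back out of the page cache even before any fsync.

import java.io.*;
import java.nio.file.*;

public class BufferVsPageCache {
    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("translog-demo", ".log");

        // 1. Data held only in a user-space buffer: nothing has reached the
        //    kernel yet, so another reader sees an empty file.
        BufferedOutputStream buffered =
            new BufferedOutputStream(new FileOutputStream(log.toFile()), 8192);
        buffered.write("buffered entry\n".getBytes());
        System.out.println("before flush: " + Files.size(log) + " bytes visible");

        // 2. flush() pushes the bytes into the kernel via write(2). No fsync
        //    has happened, but reads are now served from the page cache.
        buffered.flush();
        System.out.println("after flush:  " + Files.size(log) + " bytes visible");
        System.out.println("read back:    " + Files.readAllLines(log));

        // 3. Only an fsync (FileOutputStream.getFD().sync()) would make the
        //    bytes durable across a machine crash; killing just this process
        //    would not lose them.
        buffered.close();
        Files.delete(log);
    }
}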
Tehmasp Chaudhri, on
Thanks for sharing your notes! I’ve been hearing more about this issue on my Twitterverse for the last ~2 weeks. Your post was a great explanation. [Of course I’m assuming it is 100% accurate ;)]
Cheers,
Tehmasp
Aphyr, on
Great work. I’m wondering if you’ve had any experience with frequency and severity of network partitioning in AWS?
Thank you! I’ve talked to quite a few folks running clusters on AWS about the frequency of app-affecting partitions (or symptoms like “random leader election and rollback files in our mongo cluster”), and I’d say the frequency is somewhere between “daily” and “every few months”.
Aphyr, on
From my understanding, all reads for that file will read from the dirty page, so I’m not sure that fsync really has anything to do with the crashed datanodes.
You know, I’ve asked around about this behavior and heard quite a few stories about what’s supposed to happen, depending on the stdlib and file options! It might be the case that Elasticsearch isn’t even issuing flush or write commands to the transaction log until those five seconds are up! I haven’t had time to dig in and figure out the full story.
What we definitely know is that it’s not fsyncing, so I stuck with that explanation here. :)
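If that guess is right, the pattern would look roughly like the sketch below: writes are acknowledged as soon as they land in an in-memory buffer, and a background task flushes and fsyncs every few seconds. This is a generic illustration of “buffer plus periodic sync” with made-up names, not Elasticsearch’s actual translog code; the five-second figure is just the interval discussed above.

import java.io.*;
import java.util.concurrent.*;

// Generic sketch of a buffered, periodically-fsynced log. Anything written
// in the window between syncs lives only in user space (or the page cache)
// and can be lost on a crash, even though the caller was already "acked".
public class PeriodicSyncLog implements Closeable {
    private final FileOutputStream file;
    private final BufferedOutputStream buffer;
    private final ScheduledExecutorService syncer =
        Executors.newSingleThreadScheduledExecutor();

    public PeriodicSyncLog(File path, long syncIntervalMillis) throws IOException {
        this.file = new FileOutputStream(path, true);
        this.buffer = new BufferedOutputStream(file, 64 * 1024);
        // Flush to the kernel and fsync on a timer, not on every append.
        syncer.scheduleAtFixedRate(this::syncQuietly,
            syncIntervalMillis, syncIntervalMillis, TimeUnit.MILLISECONDS);
    }

    // Returns as soon as the bytes are in the user-space buffer: fast,
    // but not durable until the next scheduled sync.
    public synchronized void append(byte[] entry) throws IOException {
        buffer.write(entry);
    }

    private synchronized void syncQuietly() {
        try {
            buffer.flush();        // user space -> kernel (write)
            file.getFD().sync();   // kernel -> disk (fsync)
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public synchronized void close() throws IOException {
        syncer.shutdown();
        buffer.flush();
        file.getFD().sync();
        buffer.close();
    }
}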
SleeperSmith, on
Great work. I’m wondering if you’ve had any experience with frequency and severity of network partitioning in AWS?
Btw, you need the table flipping back in Jepsen
(╯°□°)╯︵ ┻━┻
anonymous, on
Killing a process with a signal will not cause un-fsynced data to be lost. The Linux page cache will eventually flush to disk as long as the kernel is up and working. From my understanding, all reads for that file will read from the dirty page, so I’m not sure that fsync really has anything to do with the crashed datanodes.
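One hedged way to check that claim (an approximation, not proof): write to a file without fsync, kill the JVM as abruptly as Java allows, and read the file back from a second process. As long as the kernel stays up the data is there, because write() already put it in the page cache; none of this protects against power loss or a kernel crash. The file path and class name here are made up for the demo.

import java.io.FileOutputStream;
import java.nio.file.*;

// Run with "write" to append a line and die without any fsync, then run
// with no arguments from a fresh process to read it back.
public class KillWithoutFsync {
    static final Path FILE = Paths.get("/tmp/kill-demo.log");

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("write")) {
            // FileOutputStream.write() calls straight into the kernel's
            // write(2); there is no user-space buffering here.
            FileOutputStream out = new FileOutputStream(FILE.toFile(), true);
            out.write("written, never fsynced\n".getBytes());
            // No sync(), no close(): die as abruptly as the JVM allows.
            // (A real kill -9 from a shell makes the same point.)
            Runtime.getRuntime().halt(0);
        } else {
            // The line is visible: the kernel still holds the dirty page.
            System.out.println(Files.readAllLines(FILE));
        }
    }
}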
Denis Gobo, on
Love these posts, especially the MongoDB ones, keep them coming
Does this problem affect only distributed Mongo databases? (i.e. on more than one server, with sharding, replicas or whatever)
How reliable is MongoDB when used in a single instance in one server?
Is it consistent then, as @march says?
Mongo says it’s read-uncommitted, so I wouldn’t trust it to offer single-node read-committed semantics. That said, it does have a database-global lock, right? So you might get away with it on a single node.
If you’re running a single node, though, why not just use Postgres? ducks
Aphyr, on
Great post in a series of great posts. One thing that bugs me a bit though, is that you hardly ever respond to questions and compliments.
Sorry, been really busy with Craft Conf in Budapest, haha. Didn’t have laptop power for most of the trip. ;-)
Aphyr, on
Also, another thing that would be useful besides the UUID would be to print which node each request goes to. Then you can always correlate the operation with the MongoDB operation log.
Clients balance requests between multiple nodes and may retry an operation against multiple nodes; this isn’t a particularly useful piece of information for Mongo.
Aphyr, on
Now, the thing that is bugging me is that process 9 has just tried to change the value to 0; your logs state that the attempt failed, but did it really? Maybe the write worked but the response got lost because a new network partition cut it off before the client got the result. Unfortunately, there is no way to say whether that is what happened.
Take another look at the “Inconsistent reads” section–Jepsen distinguishes definite from indefinite failures.
Martin Grotzke, on
I’d also be interested to see what Jepsen can say about PostgreSQL multi-master replication!
sejarah judi, on
great article and good post..
mark, on
It is an interesting question whether SQL Server Mirroring (and Availability Groups) are AP or CP. See the DBA Stack Exchange question “What happens in this strange SQL Server Mirroring situation?”. (Use a search engine; I can’t post a link here.)
The answers and commentary did not want to admit that there might be a problem. Forbidden thoughts.
Marko Bonaci, on
Great post in a series of great posts. One thing that bugs me a bit though, is that you hardly ever respond to questions and compliments.
Oh, our dearest Vektor Klokov, we praise you :)
Gleb, on
Just confirms my suspicions that MongoDB is really overrated. Analyzing SQL Server that way would be really interesting.
Juan, on
Kyle,
Also, another thing that would be useful besides the UUID would be to print which node each request goes to. Then you can always correlate the operation with the MongoDB operation log.
Another valid scenario where process 5 could return 0 is if it read from the minority partition, which would have gone into read-only mode once it found itself in the minority.
Looking at your full logs, it seems that the last successful write before the partition was actually a 0. Because you are using the majority write concern, the minority partition would never be able to complete a write after that 0, even before it realised it was in a partition, as it wouldn’t get enough nodes to acknowledge the write; so it would remain on the last write before the partition, in this case 0?
Kind regards,
Juan
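The quorum arithmetic behind this argument can be made concrete: under a majority write concern, a write needs acknowledgements from more than half of the replica set, which a minority partition can never provide. A tiny sketch (plain Java for illustration, not driver code):

// Tiny illustration of the majority-write-concern arithmetic: a write is
// committable only if the partition containing the primary holds a strict
// majority of the replica set.
public class MajorityQuorum {
    static boolean canCommit(int replicaSetSize, int reachableNodes) {
        int majority = replicaSetSize / 2 + 1;
        return reachableNodes >= majority;
    }

    public static void main(String[] args) {
        // A 5-node replica set split 3 / 2 by a partition:
        System.out.println("majority side (3 nodes): " + canCommit(5, 3)); // true
        System.out.println("minority side (2 nodes): " + canCommit(5, 2)); // false
        // So the minority side can still serve (possibly stale) reads, but any
        // write issued there can never reach majority acknowledgement.
    }
}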
Bair, on
Thanks @Alexander and @march for your balancing comments. Oh, and thanks to the author for your efforts and insights!
All this might(!) sound like it’s coming from an ivory tower, only because we can’t find a word about different use cases and their “impact” on the technical base of MongoDB. Different uses of MongoDB must yield different results, also with regard to the problems, which you have described in a very well-founded way.
Yeah, +1 for testing hazelcast!
RSS appears to be down and I am sad. Should I send a pull request somewhere?
Would love to see something on RethinkDB. Sorry - I know it’s out of context here but I’d really like some reliability info on RethinkDB. Cheers!
“Next, process 4 writes 4 successfully” - shouldn’t that be process 10?
Wonderful as always!
Fervent test request: Hazelcast?
Shouldn’t this be process 10 writes 4?
“Next, process 4 writes 4 successfully–and process 12 executes a compare-and-set...” Great write-up!
Patches welcome. :)
Caitie and I eloped in Budapest!
“my lovely wife”
wat. confused.
Postgres Ducks.
I would like 7.
Thank you.