Gary
Gary, on

Yeah, +1 for testing hazelcast!

JM

RSS appears to be down and I am sad. Should I open a pull request somewhere?

Dan Stroot
Dan Stroot, on

Would love to see something on RethinkDB. Sorry - I know it’s out of context here but I’d really like some reliability info on RethinkDB. Cheers!

V

“Next, process 4 writes 4 successfully” - shouldn’t that be process 10?

HCUser
HCUser, on

Wonderful as always!

Fervent test request: Hazelcast?

Carl Sverre
Carl Sverre, on

Shouldn’t this be process 10 writes 4?

Next, process 4 writes 4 successfully–and process 12 executes a compare-and-set...

Great write up!

Andy
Andy, on

The legend on the latency graph isn’t really well placed IMO.

DD

upside down?

Juan
Juan, on

Clients balance requests between multiple nodes and may retry an operation against multiple nodes; this isn’t a particularly useful piece of information for Mongo.

As it stands, and from the contents of the article alone, you cannot prove that your conclusions are correct; they are just one of several possibilities, including some under which Mongo is working as expected.

For example:

Or, alternatively, the state of the system split in two–one for each side of the network partition–writes occurring on one side, and on the other, the value remaining 0.

If you had put a UUID on each operation as I suggested, along with the node each operation goes to (or used any other approach that produces similar data), you would know what happened there, instead of having to hypothesise.

In fact, and based on my Mongo knowledge, that is exactly what happened, as I mentioned in my second post. In a partition situation, the minority partition will go into read-only mode and won’t accept writes, but will accept reads. Mongo is normally CP, but during a partition the minority partition(s) behave more like AP. These are the kinds of trade-offs you have to make when choosing a DB for your use case. It doesn’t mean Mongo is broken; reading a stale value is more desirable in a lot of use cases than getting an error.

Kind regards, Juan

Don’t use a UUID if you don’t want to, but you should provide a way to identify the node that takes each operation.

Larry Weya
Larry Weya, on

Great post. ES is my go-to search DB, but after some mishaps with MongoDB I made it a rule to always store my primary data in an RDBMS (read: Postgres) and resync whenever necessary. I had no idea ES had such reliability issues, though. Thanks for sharing.

Elliott Clark
Elliott Clark, on

So here’s my quick tracing of the read() syscall to try and validate that it will always hit the page cache.

The read syscall is defined here:

github.com/torvalds/linux/blob/master/fs/read_write.c#L562

That then goes to vfs_read

github.com/torvalds/linux/blob/master/fs/read_write.c#L440

vfs_read calls __vfs_read which will call new_sync_read

new_sync_read will ask the underlying filesystem for its read_iter. Most filesystems will use the generic one: github.com/torvalds/linux/blob/master/fs/ext4/file.c#L625 github.com/torvalds/linux/blob/master/mm/filemap.c#L1689

generic_file_read_iter will then (assuming no O_DIRECT) call do_generic_file_read github.com/torvalds/linux/blob/master/mm/filemap.c#L1461

do_generic_file_read will then get all of its data from the page cache. So read should always get the dirty page from the page cache.

OK, so looking into it more, it looks like ES is actually buffering in user space before calling into the kernel’s write(): github.com/elastic/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/translog/fs/BufferingFsTranslogFile.java

The BufferingFsTranslogFile uses a WrapperOutputStream, which just copies data into an array: github.com/elastic/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/translog/fs/BufferingFsTranslogFile.java#L247
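
To make that concrete, here’s a minimal sketch of what user-space buffering ahead of write() means for crash safety. The class, method, and file names are mine, not Elasticsearch’s; it just illustrates the pattern, assuming a plain byte[] buffer in front of a file descriptor:

```java
// A sketch only: names are invented, not Elasticsearch's. Data sits in a
// user-space byte[] until flush() hands it to the kernel via write(); a
// SIGKILL before that loses the buffered bytes, even though the page cache
// itself survives a process crash.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public final class BufferedTranslogSketch implements AutoCloseable {
    private final OutputStream out;   // the real file descriptor
    private final byte[] buffer;      // user-space buffer, gone on SIGKILL
    private int position = 0;

    BufferedTranslogSketch(String path, int bufferSize) throws IOException {
        this.out = new FileOutputStream(path);
        this.buffer = new byte[bufferSize];
    }

    // Copies into the in-memory array; nothing reaches the kernel yet.
    synchronized void add(byte[] data) throws IOException {
        if (data.length >= buffer.length) { // oversized entries bypass the buffer
            flush();
            out.write(data);
            return;
        }
        if (position + data.length > buffer.length) {
            flush();                        // spill to the kernel when full
        }
        System.arraycopy(data, 0, buffer, position, data.length);
        position += data.length;
    }

    // write(2): after this the bytes live in the page cache and survive a
    // process crash -- but not a power failure, since we never fsync here.
    synchronized void flush() throws IOException {
        out.write(buffer, 0, position);
        position = 0;
    }

    @Override
    public synchronized void close() throws IOException {
        flush();
        out.close();
    }

    public static void main(String[] args) throws IOException {
        try (BufferedTranslogSketch log =
                 new BufferedTranslogSketch("translog-sketch.log", 8192)) {
            log.add("op-1".getBytes());
            // A SIGKILL right here loses "op-1": it only exists in the byte[]
            // above and never reached the kernel.
        } // close() flushes, so on a clean exit the data hits the page cache
    }
}
```

So even if the page-cache argument holds, anything still sitting in that array when the process dies never makes it to the kernel at all.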

Tehmasp Chaudhri
Tehmasp Chaudhri, on

Thanks for sharing your notes! I’ve been hearing more about this issue on my Twitterverse for the last ~2 weeks. Your post was a great explanation. [Of course I’m assuming it is 100% accurate ;)]

Cheers, Tehmasp

Aphyr
Aphyr, on

Great work. I’m wondering if you’ve had any experience with frequency and severity of network partitioning in AWS?

Thank you! I’ve talked to quite a few folks running clusters on AWS about the frequency of app-affecting partitions (or symptoms like “random leader election and rollback files in our mongo cluster”), and I’d say the frequency is somewhere between “daily” and “every few months”.

Aphyr
Aphyr, on

From my understanding all reads for that file will read from the dirty page. So I’m not sure that fsync really has anything to do with the crashed datanodes

You know, I’ve asked around about this behavior and heard quite a few stories about what’s supposed to happen, depending on the stdlib and file options! It might be the case that Elasticsearch isn’t even issuing flush or write commands to the transaction log until those five seconds are up! I haven’t had time to dig in and figure out the full story.

What we definitely know is that it’s not fsyncing, so I stuck with that explanation here. :)
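
For concreteness, here’s roughly how the layers stack up in plain java.nio terms; this is my own sketch, not Elasticsearch’s translog code, and the file name is made up:

```java
// My own illustration in plain java.nio, not Elasticsearch's translog code.
// Each step protects against a different failure mode.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DurabilityLayers {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(
                Paths.get("translog-sketch.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {

            ByteBuffer entry = ByteBuffer.wrap("op-42".getBytes());

            // 1. Data in a user-space buffer (`entry`, or ES's byte[]): lost
            //    if the process is killed before the next call.
            channel.write(entry);

            // 2. After write(2) the bytes are in the kernel page cache: they
            //    survive a process crash, but not a kernel panic or power loss.
            channel.force(false);

            // 3. After fsync (force) the bytes are on disk, modulo lying
            //    hardware: they survive power loss too.
        }
    }
}
```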

SleeperSmith
SleeperSmith, on

Great work. I’m wondering if you’ve had any experience with frequency and severity of network partitioning in AWS?

Btw, you need the table flipping back in Jepsen (╯°□°)╯︵ ┻━┻

anonymous
anonymous, on

Killing a process with a signal will not cause un-fsync’d data to be lost. The Linux page cache will eventually flush to disk as long as the kernel is up and working. From my understanding all reads for that file will read from the dirty page. So I’m not sure that fsync really has anything to do with the crashed datanodes.

Denis Gobo
Denis Gobo, on

Love these posts, especially the MongoDB ones, keep them coming

Aphyr
Aphyr, on

Caitie and I eloped in Budapest!

a random fan
a random fan, on

“my lovely wife”

wat. confused.

Adam Mullin
Adam Mullin, on

Postgres Ducks.

I would like 7.

Thank you.

Aphyr
Aphyr, on

Does this problem affect only distributed Mongo databases? (i.e. on more than one server, with sharding, replicas or whatever)

How reliable is MongoDB when used in a single instance in one server?

Is it consistent then, as @march says?

Mongo says it’s read-uncommitted, so I wouldn’t trust it to offer single-node read-committed semantics. That said, it does have a database-global lock, right? So you might get away with it on a single node.

If you’re running a single node, though, why not just use Postgres? ducks

Aphyr
Aphyr, on

Great post in a series of great posts. One thing that bugs me a bit though, is that you hardly ever respond to questions and compliments.

Sorry, been really busy with Craft Conf in Budapest, haha. Didn’t have laptop power for most of the trip. ;-)

Aphyr
Aphyr, on

Also, another thing that would be useful besides the UUID would be to print what node each requests go to. Then you can always correlate the operation with the MongoDB operational log.

Clients balance requests between multiple nodes and may retry an operation against multiple nodes; this isn’t a particularly useful piece of information for Mongo.

Aphyr
Aphyr, on

Now, the thing that is bugging me is that process 9 has just tried to change the value to 0, your logs state that the attempt failed, but did it really? Maybe the write worked but the response got lost because a new network partition cut it before the client got the results. Unfortunately there is no way to say whether that was what happened.

Take another look at the “Inconsistent reads” section–Jepsen distinguishes definite from indefinite failures.

Martin Grotzke
Martin Grotzke, on

I’d also be interested to see what Jepsen can say about PostgreSQL multi-master replication!

sejarah judi
sejarah judi, on

Great article and good post.

mark
mark, on

It is an interesting question whether SQL Server Mirroring (and Availability Groups) are AP or CP. See the DBA Stack Exchange question “What happens in this strange SQL Server Mirroring situation?”. (Use a search engine; I can’t post a link here.)

The answers and commentary did not want to admit that there might be a problem. Forbidden thoughts.

Marko Bonaci
Marko Bonaci, on

Great post in a series of great posts. One thing that bugs me a bit though, is that you hardly ever respond to questions and compliments.

Oh, our dearest Vektor Klokov, we praise you :)

Gleb
Gleb, on

Just confirms my suspicions that MongoDB is really overrated. Analyzing SQL Server that way would be really interesting.

Juan
Juan, on

Kyle,

Also, another thing that would be useful besides the UUID would be to print what node each requests go to. Then you can always correlate the operation with the MongoDB operational log.

Another valid scenario where process 5 could return 0 is if it read it from the minority partition, which would have gone into read-only mode once it found itself in the minority.

Looking at your full logs, it seems that the last successful write before the partition was actually a 0. Because you are using the majority write concern, the minority partition would never be able to complete a write after that 0, even before it realised it was in a partition, as it wouldn’t get enough nodes to accept the write; so it would remain at the last write before the partition, in this case 0?
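
As a rough illustration of what I mean (my own sketch, with made-up hostnames and collection names, using a recent MongoDB Java driver, not your actual test code): a majority-concern write on the minority side can’t be acknowledged, while a secondary read on that same side can still return the last pre-partition value.

```java
// A rough sketch only: hostnames, database, and collection names are
// invented; it just shows the asymmetry on the minority side of a partition.
import com.mongodb.MongoException;
import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.concurrent.TimeUnit;

public class MinorityPartitionSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create(
                "mongodb://n1,n2,n3,n4,n5/?replicaSet=jepsen")) {

            MongoCollection<Document> coll = client
                .getDatabase("jepsen")
                .getCollection("cas");

            // A read allowed to hit a secondary on the minority side returns
            // whatever value replicated before the partition -- e.g. 0.
            Document stale = coll
                .withReadPreference(ReadPreference.secondaryPreferred())
                .find(Filters.eq("_id", 0))
                .first();
            System.out.println("read on minority side: " + stale);

            // A majority-concern write on the same side can never be
            // acknowledged by a majority, so it times out instead of succeeding.
            try {
                coll.withWriteConcern(
                        WriteConcern.MAJORITY.withWTimeout(5, TimeUnit.SECONDS))
                    .updateOne(Filters.eq("_id", 0), Updates.set("value", 3));
            } catch (MongoException e) {
                System.out.println("majority write failed: " + e.getMessage());
            }
        }
    }
}
```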

Kind regards, Juan

Bair
Bair, on

Thanks @Alexander and @march for your balancing comments. Oh, and thanks to the author for your efforts and insights!

All this might(!) sound like it’s coming from an ivory tower, only because there isn’t a word about different use cases and their “impact” on the technical base of MongoDB. Different uses of MongoDB will yield different results, also with regard to the problems you have described so thoroughly.

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.