Did anyone get to test whether all those issues, especially the stale-read one, are resolved when using TokuMX?
Since it uses its own consensus algorithm (Ark), one might guess those issues would be at least mitigated.
milleniumkenny, on
Hi, I wanted to add my thanks for writing a really informative how-to article.
I’ve recently built a dining table of my own, using a solid acacia wood slab and 10' 6x6 cedar posts from Lowe’s cut down to true 4x4s for the legs and beams. I wasn’t sure about using cedar, but was determined to work with a true 4x4 rather than the nominal 3.5 inches, for a hefty look.
From there I took many of my tips from your article on how to build it. As a novice, I found that creating the mortise and tenon joints was the hardest part. After fussing with chisels and drills and not getting the results I hoped for, I ended up using a multi-tool Dremel with an e-blade to cut the holes. The e-blade is easy to plunge in and makes clean edges; quick work.
I used PL to glue the joints for extra strength. I’ve read varying reports on using glue; purists may object, but let me tell you, those joints won’t ever move. While gluing the last joints, a ratchet strap came in handy to hold everything together.
The slab is finished with a water-based urethane, satin finish, both sides. Again, I read varying reports on whether to seal both sides, but I decided it would be better to seal it from the air completely to prevent any further checking. I worked about ½ cup into a couple of the cracks, so there’s no risk of water getting in there. I’m extremely pleased with the result; the table still has a good natural feel without the “plastic” look I see on some glossy finishes.
Aphyr, on
the animated gifs made my eyes bleed
Please see an optometrist immediately!
Kevin Burton, on
Thanks for posting this. I was actually able to read it this time as the animated gifs made my eyes bleed :-P
Really sad that this is still the case. Right now we have dual Cassandra and Elasticsearch nodes. We replay our firehose API on top of Elasticsearch, doing upserts exactly like you said. This lets us decouple both systems, but of course it’s more expensive.
Aphyr, on
The guarantees made by RC are simply insufficient in practice.
I said it was possible, not useful, haha. ;-)
Aphyr, on
The cases generally discussed fall in the category of “n = n + 1”, where the update is a fixed value. An update like “n = n * 1.1” is not at all idempotent.
Sure it is. Addition and multiplication are both associative and commutative but not idempotent; PN-counters give you the extra context necessary to recover idempotent merges. Just replace the sum with multiplication in a PN-counter, and n = n * 1.1 works just fine.
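Here’s a minimal sketch of that idea in Haskell. This is my own illustration, not from any particular CRDT library; it takes the G-counter half of the construction, swapping sum for product, and assumes every factor is at least 1 so each per-node slot only grows (handling factors below 1 would take a second map, PN-counter style).

import qualified Data.Map.Strict as M

type NodeId = String

newtype PCounter = PCounter (M.Map NodeId Double) deriving Show

-- Apply "n = n * k" at a given node by scaling that node's own slot.
mulBy :: NodeId -> Double -> PCounter -> PCounter
mulBy node k (PCounter m) = PCounter (M.insertWith (*) node k m)

-- Merge is commutative, associative, and idempotent: elementwise max.
-- Safe because every slot only grows, given factors >= 1.
merge :: PCounter -> PCounter -> PCounter
merge (PCounter a) (PCounter b) = PCounter (M.unionWith max a b)

-- The counter's value is the product of all per-node slots.
value :: PCounter -> Double
value (PCounter m) = product (M.elems m)

main :: IO ()
main = do
  let a = mulBy "n1" 1.1 (PCounter M.empty)
      b = mulBy "n2" 1.1 (PCounter M.empty)
      r = merge a b
  print (value r)           -- ~1.21: both updates applied exactly once
  print (value (merge r r)) -- still ~1.21: redelivery is harmless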
Vlad Dragos, on
Hello,
Can you please do a follow-up and redo the tests using Kafka 0.8.x?
Best regards.
DC, on
“MarkLogic has been around for 13 years and…” Have you run the Jepsen tests, and can you post a link to your results? If not, any hope Kyle has MarkLogic in his crosshairs? It sure would be nice to hear the results for the closed-source vendor competing in this distributed document space.
Zed, on
“Read-committed is achievable” True, but what good is RC when it can be extremely out of date (such as an empty database)? The guarantees made by RC are simply insufficient in practice. You at least want “kind of” current data.
Rick G, on
First of all, thanks for this terrific series of blogs. I thought I was the only one who thought so much about consistency and availability, but you proved me wrong.
The concept of CRDTs is new to me, and they seem useful, but there appear to be some pretty significant limits on their use. The first is that many updates are not at all idempotent. The cases generally discussed fall in the category of “n = n + 1”, where the update is a fixed value. An update like “n = n * 1.1” is not at all idempotent.
Secondly, as is acknowledged, the concept of “eventual” is essentially incompatible with the idea of consistency, since replications arriving in a different order may cause the appearance of data that never existed in the real world. Say you had an insert from node A being replicated to node B, and an insert from node C, which occurred after the node A insert, also being replicated to node B, but arriving at node B first. You would have a situation where the node C insert existed in the node B environment and the earlier node A insert did not - I would call that just wrong data. So even though the operations commute, you would lose data integrity, and have no way of knowing that you had. This is bad.
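To make that scenario concrete, here is a minimal sketch with a grow-only set in Haskell (illustrative only; the set union stands in for any commutative, order-independent merge): replica B converges to the same final state either way, but its intermediate state can show the later insert without the earlier one.

import qualified Data.Set as S

-- A grow-only set: merge is union, which is commutative, associative,
-- and idempotent, so replicas converge no matter the delivery order.
merge :: Ord a => S.Set a -> S.Set a -> S.Set a
merge = S.union

main :: IO ()
main = do
  let fromA = S.fromList ["insert-from-A"] -- happened first
      fromC = S.fromList ["insert-from-C"] -- happened second
      -- C's replication delta reaches node B before A's:
      b1 = merge S.empty fromC
      b2 = merge b1 fromA
  print b1 -- fromList ["insert-from-C"]: the later insert, alone
  print b2 -- both inserts present: B converges despite the reordering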
I think that the drawing that you have of the node partition is a little confusing. It seems like after a node partition occurs, you would have the complementary c nodes in the partition - like n1, n2, c3, c4 and c5. When I think of the partition this way, it makes more sense.
Thanks again for these great explorations. I intend to read them all!
Aphyr, on
Cassandra is less reliant on the network
I’m not sure I would say that. They have to do roughly the same amount of traffic for equivalent safety.
stribik, on
I mean 2 ^ 61.
stribik, on
122 of the 128 bits of a UUID are random, so you should start seeing collisions at around 2^61 inserts, not a few thousand.
The UUID collisions could be the result of extremely low-entropy conditions. This can happen if you are using VMs for testing, especially if the saved random seed is cloned. VMs have no access to disk seek timings and keypress timings - the primary sources of entropy. To make things worse, I’ve seen virtualization code emulate RDRAND by calling rand(3).
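As a rough sanity check on that 2^61 figure, a back-of-the-envelope birthday-bound calculation (my own, nothing more):

-- Birthday bound: with N equally likely values, a 50% collision chance
-- arrives after roughly sqrt(2 * ln 2 * N) draws. A version-4 UUID has
-- N = 2^122 possible random values.
main :: IO ()
main = do
  let n   = 2 ** 122 :: Double
      p50 = sqrt (2 * log 2 * n)
  print p50             -- ~2.7e18 inserts
  print (logBase 2 p50) -- ~61.2, i.e. about 2^61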
sturrockad, on
Hey, thanks for the great read.
So how would you compare Aerospike and Cassandra in a scenario like this, since they seem to have a similar issue in your conclusions? From reading both, it sounds like Cassandra is less reliant on the network but still loses data, though there are more ways to cope with this in the newer versions, whereas Aerospike is always going to be reliant on the network?
Aphyr, on
Unclear whether these methods return a new chart with different parameters, or mutate the existing parameters in scope. I generally prefer the first because it reduces mutability, reduces the need for defensive copying, etc.
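A minimal sketch of the first style in Haskell, with a hypothetical Chart record of my own invention; each “setter” returns a new chart, so a shared configuration can never be clobbered:

-- A hypothetical chart configuration; setters return new values rather
-- than mutating in place.
data Chart = Chart { width :: Int, height :: Int } deriving Show

setWidth :: Int -> Chart -> Chart
setWidth w c = c { width = w }

setHeight :: Int -> Chart -> Chart
setHeight h c = c { height = h }

main :: IO ()
main = do
  let base = Chart { width = 640, height = 480 }
      big  = setHeight 600 (setWidth 960 base)
  print base -- unchanged: Chart {width = 640, height = 480}
  print big  -- Chart {width = 960, height = 600}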
walterra, on
What do you think of Bostock’s getter-setter methods? http://bost.ocks.org/mike/chart/
Heather, on
Interesting – so the node gets booted from the cluster quickly, but the primary still takes 90s to failover to another node?
Aphyr, on
Yo this is from, like, 2005. Use the ATOM URIs; they’re linked as rel=alternate.
Aphyr, on
For example: That “90-second hardcoded timeout” can be reduced by setting discovery.zen.fd.ping_timeout, discovery.zen.fd.ping_interval or discovery.zen.fd.ping_retries. I’d be curious to know if the increased detection time is worth the added churn in an uncertain network.
I wish it could! I lowered every configurable timeout and spent a long time in the code before working out that you can’t tune this parameter. Here’s the config I finally came up with; it detects failure sooner, but you’re still looking at 90 seconds of downtime minimum. :(
# Set the time to wait for ping responses from other nodes when discovering.
# Set this option to a higher value on a slow or congested network
# to minimize discovery failures:
#
discovery.zen.ping.timeout: 3s
# After joining, send a join request with this timeout? Docs are kinda unclear
# here.
discovery.zen.join_timeout: 5s
# Fault detection timeouts. Our network is super reliable and uncongested
# except when we explicitly simulate failure, so I'm gonna lower these to speed
# up cluster convergence.
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 1s
discovery.zen.fd.ping_retries: 2
# Publish cluster state updates more frequently
discovery.zen.publish_timeout: 5s
Heather, on
Great post, thanks for the informative deep dive. Have you had a chance to try tweaking some of ElasticSearch’s settings while running these tests?
For example:
That “90-second hardcoded timeout” can be reduced by setting discovery.zen.fd.ping_timeout, discovery.zen.fd.ping_interval or discovery.zen.fd.ping_retries. I’d be curious to know if the increased detection time is worth the added churn in an uncertain network.
You can also force ElasticSearch to fsync after every write by setting the index.gateway.local.sync value to 0.
Vados, on
I’d second the request for an article on RethinkDB.
Mao Geng, on
Grrrr… I don’t know why my last comment was posted but my previous comments were not. I want to thank you - I learned a lot about distributed systems from your articles. Also, I am curious whether you will test Couchbase, which is kind of similar to Aerospike?
Mao Geng, on
Sorry for commenting so many times. I didn’t mean to. I just realized you are probably moderating comments. Sorry.
Saurajeet, on
rss.cgi is throwing route errors from Sinatra.
Ratnesh, on
Thanks for your post! ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In other words, ZooKeeper is a replicated synchronization service with eventual consistency. More at www.youtube.com/watch?v=1jMR4cHBwZE
Martin, on
Thanks for your great write-up, and thanks also for leaving on a high note. I think the marketing material claiming ACID needs to be clarified / revisited. It’s not good business practice to promise one thing to customers and deliver something different. The distributed database world is a better place for your excellent contributions.
zlosim, on
Great article!
+1 for hazelcast
Julian, on
I hate that our syntax clouds our intent in programming. Quick, someone, write some intent-translating software, then the syntax won’t matter so much. Oh, yeah, right, Charles Simonyi already did, but it’s closed source. Sadtimes.
Aphyr, on
rss appears to be down and I am sad. Should I pull request somewhere?
There hasn’t been an RSS feed here in like 8 years, but the ATOM feeds look fine to me.
Luis, on
The problem with maps is that if they’re statically typed, then the values must be homogeneously typed (which routinely leads to stringly typing), but if they’re dynamically typed then they’re not type-safe. The middle ground here is, surprise surprise, record types.
A pattern you often see in Haskell is to model an option set as a monoid, where the identity element is the default options, and the monoid’s operator combines options. This ties in with both the record types and the builder approach, because (a) a record type whose field types are all monoids is also a monoid, and (b) a chain of setter invocations is a monoid as well.
Named parameters with default values are equivalent to record types + monoids too.
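A small sketch of that pattern (my own illustration, using Last from Data.Monoid so later settings win when both sides set a field):

import Data.Monoid (Last (..))

-- An option record whose fields are all monoids is itself a monoid:
-- mempty means "no preferences", and (<>) layers later settings over
-- earlier ones, Last keeping the right-hand value when both are set.
data Options = Options
  { optWidth :: Last Int
  , optTitle :: Last String
  }

instance Semigroup Options where
  Options w1 t1 <> Options w2 t2 = Options (w1 <> w2) (t1 <> t2)

instance Monoid Options where
  mempty = Options mempty mempty

-- Defaults live in one place; user settings are combined on top.
defaults :: Options
defaults = Options (Last (Just 640)) (Last (Just "untitled"))

main :: IO ()
main = do
  let user  = mempty { optTitle = Last (Just "Throughput") }
      final = defaults <> user
  print (getLast (optWidth final), getLast (optTitle final))
  -- (Just 640, Just "Throughput"): defaults survive where unset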
Forgot to add the link for TokuMX’s Ark:
//www.tokutek.com/2014/07/introducing-ark-a-consensus-algorithm-for-tokumx-and-mongodb