Thanks a lot, I learned a ton. Keep up the good work.
Aphyr, on
Russelldb: You’re right–if you retry operations against a CRDT you can still end up with double-increment. That’s what I was trying to get at with “state-based”: it’s safe to retry writing the updated state of the CRDT, but you can’t retry the transformation from scratch.
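A minimal sketch of that distinction, using a toy state-based G-counter (a map of node -> count, merged by pointwise max); this is an illustrative assumption, not Riak's or any other library's implementation:

```clojure
(defn g-increment
  "The operation: bump this node's entry in the counter."
  [counter node]
  (update counter node (fnil inc 0)))

(defn g-merge
  "State-based merge: pointwise max of per-node counts."
  [a b]
  (merge-with max a b))

(defn g-value [counter]
  (reduce + 0 (vals counter)))

;; Retrying the *state write* is safe: merging the same updated state again
;; changes nothing, because max is idempotent.
(let [c  {:n1 3 :n2 1}
      c' (g-increment c :n1)]
  (= (g-merge c c')
     (g-merge (g-merge c c') c')))
;; => true

;; Retrying the *operation* from scratch is not: applying g-increment a
;; second time would double-count.
```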
Jasper A. Visser, on
Thanks, this is an excellent read. I’m gonna have to digest the TLA+ bit for a bit longer to understand it. :)
For the purpose of continuity of discussion, antirez posted another reply here:
http://antirez.com/news/56
Ulrik Skyt, on
Thanks for a very good post!
I agree that allow_mult=true together with a custom conflict resolution function is the only way for most general cases.
Using Riak, one of several important new things you need to worry about (coming from a traditional transactional database) is the merge function. It can turn out to be non-trivial and full of hard business decisions if your use case is one where data correctness matters.
Scott, on
Prescient. The w=safe scenario you show (including extra fails during rollback/re-election) happened to us today when EC2 West region had network issues that caused a network partition that separated PRIMARY from its 2 SECONDARIES in a 3 node replset. 2 hours later the old primary rejoined and rolled back everything on the new primary. Our bad for not using w=majority. Love this series. Thanks much.
Dave Peticolas, on
This is an excellent series, thank you for writing it.
Kresten Krab Thorup, on
Thank you very much for a great series of posts. This kind of research is pure gold.
Henrik Ingo, on
“So 3 writes which supposedly failed actually succeeded. That’s not so bad.”
Actually, this can happen in any kind of cluster. In fact, it can happen even with a single-server failure. It simply means that there is a short moment where data is committed (to disk, to a cluster… any definition of committed) but the acknowledgement has not yet been sent back to the client, at which point the server (or client) crashes. So the commit happened properly; you just didn’t hear about it.
Admittedly, with MongoDB w=majority this is probably more likely to happen than in some other systems, i.e., the “moment” is longer.
Russell Brown, on
Outstanding post. And a great talk at Ricon (I watched the live stream.) One small argument:
You said “False negatives are OK, though, because state-based CRDTs are idempotent.” Which is not true of any of the counters currently described in the literature. Retrying an increment on a counter that was partially successful will lead to a double count. You can easily create an idempotent counter by using a G-Set (or two!) and using a unique ID for each operation, taking the set cardinality as the counter’s value, but the space trade-off is pretty steep. Maybe you could bound the time a unique ID may be reused after failure, but this is still a reasonably difficult engineering problem, that involves some sort of consensus to shrink the G-Set.
I’m working on something at the moment that uses append only logs, transformed periodically into state based CRDTs that might be good for an idempotent counter…it is a long way from done and I have little time.
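A minimal sketch of the G-Set-backed counter Russell describes, with hypothetical operation ids; the value is the set's cardinality, so replaying the same id after an ambiguous failure doesn't double-count:

```clojure
(defn increment
  "Record an increment identified by a unique operation id."
  [g-set op-id]
  (conj g-set op-id))

(defn merge-counters
  "State-based merge is just set union."
  [a b]
  (into a b))

(defn counter-value [g-set]
  (count g-set))

;; Retrying "txn-42" after an ambiguous failure is safe:
(-> #{}
    (increment "txn-41")
    (increment "txn-42")
    (increment "txn-42")   ; retry of the same operation
    counter-value)
;; => 2
```

The space trade-off he mentions is visible here: the set keeps every id ever used, which is why shrinking it needs some form of consensus.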
Nolan, on
This series has been absolutely top-notch. Thanks for putting in what is obviously an amazing amount of time and effort. I’ll definitely be revisiting these in the future.
Henrik, on
Well written. Well argued.
Jepsen seems like a great tool for QA engineering, will definitely spend some hours with it!
Lucian, on
Thank you very much for this excellent analysis. Even though it vaguely matched what I had expected, it is very useful to see actual proof.
antirez, on
Kyle, thanks for your incredible series of blog posts!
My reply about Redis Sentinel is here -> http://antirez.com/news/55
I googled the same thing, ended up here. Although in my case there has always been a bright white flashing light that lasts for the fraction of a second that the sudden noise does. Right after that my heart starts beating faster for a few seconds; I guess it’s normal given the unexpected nature of a loud noise, which spikes our adrenaline in case of any imminent danger…
Aphyr, on
The California exclusion only applies to work which does not “relate in any way” to the company’s business or R&D–in practice, getting a company to nail down exactly what classes of ideas and algorithms relate to their business is a tricky matter.
Geert-Jan Brits, on
Excellent read.
Your last diagram “blocks of dynos and 1 logical LB per block” got me thinking.
You state: “This has real tradeoffs! For one, an imbalance in the random routing topology means that some min-conns routers will have more load than their neighbors–and they can’t re-route requests to dynos outside their pool. ”
But in a multi-tenant system (like Heroku), couldn’t we use knowledge of the separate tenants (each tenant has its own subset of dedicated dynos) to partition the blocks of dynos more efficiently?
Let’s, for the sake of argument, go with the simplest case: each tenant maps one-to-one to a block of dynos, with one min-conn router covering each block (i.e., serving one tenant).
In this scenario we need an extra LB layer in front of the described setup, in which a request is mapped to a tenant and, by extension, to the correct logical LB that exists solely to serve that tenant.
The flow would thus be: request -> DNS latency (or geo) routing -> DC front load balancer to determine the tenant (and the min-conn router that serves it) -> min-conn router dispatches the request to one of the tenant’s dynos.
This would be optimal for each tenant: each tenant’s dynos would be optimally utilized.
Moreover, because of the per-tenant partitioning, each min-conn router only has to concern itself with a fraction of the data to make its decision: namely, the queue depths of a particular tenant’s dynos.
This in turn would make it far easier (?) to make each of the min-conn routers load-balanced and CP while still keeping the latency to communicate state reasonable.
Any flaw in this thinking? I’m really not pretending to be an expert in this field (far from it) but this sounds pretty reasonable in my head. Any debunking welcome :)
Not sure if this multi-tenancy aspect could be elegantly modeled with Timelike, but I’d surely like to try.
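As a toy sketch of the two-level routing described above (not Timelike's API; all names are hypothetical), a front layer maps each request to its tenant, and a per-tenant min-conn router picks the least-loaded dyno within that tenant's block:

```clojure
(def dynos-by-tenant
  ;; tenant -> block of dynos, each with a mutable queue depth
  {"tenant-a" (mapv (fn [i] {:id (str "a-" i) :depth (atom 0)}) (range 4))
   "tenant-b" (mapv (fn [i] {:id (str "b-" i) :depth (atom 0)}) (range 2))})

(defn route
  "Map the request to its tenant's block, then pick the dyno with the
   smallest queue depth (min-conn, but only within that block)."
  [request]
  (let [block (dynos-by-tenant (:tenant request))
        dyno  (apply min-key #(deref (:depth %)) block)]
    (swap! (:depth dyno) inc)   ; request enqueued on that dyno
    dyno))

(defn done
  "Call when a dyno finishes its request."
  [dyno]
  (swap! (:depth dyno) dec))

;; (route {:tenant "tenant-a" :path "/"})
;; => the least-loaded dyno in tenant-a's block
```

Each router only ever reads the queue depths of its own tenant's block, which is exactly the state reduction the comment is after.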
Andreas, on
Hello Kyle,
thanks for this deep and insightful article!
I do like the title, too :-) In the end, the title clearly states what it comes down to, IMHO.
Greetings,
Andreas
Matt Stump, on
California law states that work done on your own hardware and on your own time is owned by you. Also, I’m starting to see appendices to employment contracts that specifically state the same. That of course only helps if you’re in California.
Sergey, on
Amazing shot!
AndreasS, on
Couple of things here:
it’s kind of hard to see what the purpose of the blog post is; there are different things I would say could help/change that, like a different title:
“what I like about clojure macros” | “dissolving syntax patterns using (clojure) macros”
-when explaining the doto form I was reminded of combinators:
https://github.com/raganwald/homoiconic/blob/master/2008-10-29/kestrel.markdown
I think a side step to explain how the general concept is named would have added value to your post
-you state: “Scala, for instance, includes special syntactic rules for a broad variety of constructs (XML literals, lexical closures, keyword arguments, implicit scope, variable declaration, types)—and often several syntaxes for the same construct (method invocation, arguments, code blocks). When there are many syntax rules, understanding how those rules interact with each other can be difficult.”
your examples of macros do not solve this issue for me in a clear way
==> All together I liked the post, but it seemed a little rushed. Also, when you speak of “power”, I like it when the axes on which that is measured are described in a clear and general way.
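For reference, the doto/kestrel connection raised in the second point above: doto evaluates its first form once, threads the resulting object through the remaining (side-effecting) calls, and returns that object, much as the kestrel (K combinator) returns its first argument. A small Clojure illustration:

```clojure
(doto (java.util.HashMap.)
  (.put "a" 1)
  (.put "b" 2))
;; => the HashMap itself ("a" -> 1, "b" -> 2), not the result of the last .put

;; roughly what the macro expands to:
(let [m (java.util.HashMap.)]
  (.put m "a" 1)
  (.put m "b" 2)
  m)
```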
Mike Robinson, on
thank you.
Justin Mason, on
This is a fantastic writeup.
BTW, +1 on trying out Marc Brooker’s algorithm, if you get a chance – I’ve seen what he’s worked on and written inside Amazon. He knows what he’s talking about ;)
Aphyr, on
I’d be interested in your thoughts and/or models on using forced redirects as a sort of ‘ghetto smart load balancer’.
I think it’s promising–but you’re right, as you start approaching fully loaded dynos the load balancers will spend more and more of their time retrying connections in sequence. You’ll want to instrument your dynos to make sure you always have a reasonable probability of selecting a free node.
The load balancers… don’t necessarily have to do requests in sequence–if the TCP stack is up to it the load balancer can initiate a TCP SYN to N dynos in parallel, take the first accepted connection, and close any others that accept. Might be prohibitive, not sure.
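A rough sketch of that parallel-connect idea, assuming plain TCP sockets and hypothetical dyno addresses (not how Heroku's routers actually work):

```clojure
(import '(java.net Socket))

(defn connect-first
  "Open connections to every [host port] pair in parallel; return the first
   socket that connects, closing any others that also succeed."
  [backends]
  (let [winner (promise)]
    (doseq [[host port] backends]
      (future
        (try
          (let [sock (Socket. host (int port))]
            ;; deliver returns nil if someone else already won the race
            (when-not (deliver winner sock)
              (.close sock)))
          (catch Exception _ nil))))
    ;; wait up to a second for a winner; a real router would also have to
    ;; reap connections that succeed after this deadline
    (deref winner 1000 nil)))

;; (connect-first [["dyno-1.example.com" 80]
;;                 ["dyno-2.example.com" 80]
;;                 ["dyno-3.example.com" 80]])
```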
Aphyr, on
There’s a syntax error in the first code block
Thanks Ewing, fixed now. :)
Aphyr, on
Could you do a no-partition, two-random least-con simulation?
I’d actually love to, but I’m honestly pretty pressed for time right now–got a journal article, a Riemann release, a new job, and 2 talks to deliver in the next few months. If I get a chance I’ll do a followup–and I’m happy to link to anything you find out if you give it a try!
Jim Wrubel, on
Disclosure - like Rap Genius I’m running a significant Rails stack on Heroku. Single-threaded, although we’re using Unicorn so at least we get a little bump from the multi-process webserver. We have had decent results by cranking the Unicorn backlog down (in the low teens at this point). Heroku’s routers will respect a ‘full’ queue and try a different dyno (though at some point they eventually give up: https://devcenter.heroku.com/articles/error-codes#h21-backend-connection-refused). I’d be interested in your thoughts and/or models on using forced redirects as a sort of ‘ghetto smart load balancer’. There would clearly be some overhead in retrying the request if the router randomly hits a full dyno, but as you pointed out, there’s a lot of overhead you can give up before you end up back at the worst case random algorithm.
Drew, on
Your analysis is good, but you completely missed the point: the problem is NOT that Heroku has an inferior or underperforming product. They can have as crappy a product as they want. The problem is that what they were selling was a lie. They claimed to provide one thing, took people’s money, and then provided something else entirely.
jrheard, on
There’s a syntax error in the first code block - give faulty-dyno an arglist please :)
Could you do a no-partition, two-random least-con simulation?
Leslie landberg, on
It’s refreshing to know that the USA is still #1! Number one in homicides, number one in western infant mortality rates, number one in the first world in illiteracy and income disparity and poverty!! Go Team America! And when you feel that good about yourself, naturally you want to celebrate by going out and killing somebody with your precious guns. The Beatles said it best: happiness is a warm gun.
On another note, the problems with your basic premise and sources seem insurmountable, but A for effort, and, no, I am not being sarcastic. It’s a great start, let’s keep going! I just know there is a pony under this mountain of confounding statistical shit, but we’ll have to continue to work together to unearth it.
Aphyr, on
You’re right, Ryan, but for queued services over, say, 50% capacity, the latency cost of even 2 least-conns routers per block can be pretty high. If I were designing a system like this, I’d probably track the block->dyno mapping in an eventually-consistent, geographically distributed store, with the understanding that so long as migration times are significantly longer than the store’s convergence times, you’re unlikely to see double-counts. Then you could gradually rebalance dynos to accommodate persistently uneven load or long-lasting failures.
Hello Kyle,
Nice program and description. I’d be interested to see how a simulated “tied requests” implementation as described in this article would compare: http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext
Timelike looks very useful. I’ve done these things in practice, and the numbers you’re coming up with feel right in my past experience.
Reading through the thread on reddit however, I found an approach I’d never seen in practice, and the math looks neat: http://brooker.co.za/blog/2012/01/17/two-random.html
Could you do a no-partition, two-random least-con simulation?
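A quick sketch of the two-random-choices selection rule from Brooker's post, with a hypothetical in-memory dyno representation (an actual comparison would still need Timelike's simulation machinery):

```clojure
(def dynos
  ;; hypothetical dynos with mutable connection counts
  (mapv (fn [i] {:id i :conns (atom 0)}) (range 10)))

(defn two-random-least-conn
  "Pick two dynos uniformly at random and route to whichever has fewer
   open connections."
  [pool]
  (let [[a b] (take 2 (shuffle pool))]
    (min-key #(deref (:conns %)) a b)))

;; (two-random-least-conn dynos)
```

The appeal is that each routing decision needs the load of only two dynos, rather than a consistent global view.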