Timelike 2: everything fails all the time

2013-02-18

In the previous post, I described an approximation of Heroku’s Bamboo routing stack, based on their blog posts. Hacker News, as usual, is outraged that the difficulty of building fast, reliable distributed systems could prevent Heroku from building a magically optimal architecture. Coda Hale quips:

Really enjoying @RapGenius’s latest mix tape, “I Have No Idea How Distributed Systems Work”.

Coda understands the implications of the CAP theorem. This job is too big for one computer–any routing system we design must be distributed. Distribution increases the probability of a failure, both in nodes and in the network itself. These failures are usually partial, and often take the form of degradation rather than the system failing as a whole. Two nodes may be unable to communicate with each other, though a client can see both. Nodes can lie to each other. Time can flow backwards.

CAP tells us that under these constraints, we can pick two of three properties (and I’m going to butcher them in an attempt to be concise):

Consistency: nodes agree on the system’s state.
Availability: the system accepts requests.
Partition tolerance: the system runs even when the network delays or drops some messages.

In the real world, partitions are common, and failing to operate during a partition is essentially a failure of availability. We must choose CP or AP, or some probabilistic blend of the two.

There’s a different way to talk about the properties of a distributed system–and I think Peter Bailis explains it well. Liveness means that at every point, there exists a sequence of operations that allows the “right thing” to happen–e.g. “threads are never deadlocked” or “you never get stuck in an infinite loop”. Safety means the system fails to do anything bad. Together, safety and liveness ensure the system does good things on time.

With this in mind, what kind of constraints apply to HTTP request routing?

The system must be partition tolerant.
The system must be available–as much as possible, anyway. Serving web pages slower is preferable to not serving them at all. In the language of CAP, our system must be AP.
But we can’t wait too long, because requests which take more than a minute to complete are essentially useless. We have a liveness constraint.
Requests must complete correctly, or not at all. We can’t route an HTTP POST to multiple servers at once, or drop pieces of requests on the floor. We have a safety constraint.

It’s impossible to do this perfectly. If all of our data centers are nuked, there’s no way we can remain available. If the network lies to us, it can be impractical to guarantee correct responses. And we can let latencies rise to accommodate failure: the liveness constraint is flexible.

Finally, we’re real engineers. We’re going to make mistakes. We have limited time and money, limited ability to think, and must work with existing systems which were never designed for the task at hand. Complex algorithms are extraordinarily difficult to prove–let alone predict–at scale, or under the weird failure modes of distributed systems. This means it’s often better to choose a dumb but predictable algorithm over an optimal but complex one.

What I want to make clear is that Heroku is full of smart engineers–and if they’re anything like the engineers I know, they’re trying their hardest to adapt to a rapidly changing problem, fighting fires and designing new systems at the same time. Their problems don’t look anything like yours or mine. Their engineering decisions are driven by complex and shifting internal constraints which we can’t really analyze or predict. When I talk about “improved routing models” or “possible alternatives”, please understand that those models may be too complex, incompatible, or unpredictable to build in a given environment.

Dealing with unreliability

Returning to our Bamboo stack simulation, I’d like to start by introducing failure dynamics.

Real nodes fail. We’ll make our dynos unreliable with the faulty function, which simulates a component which stays online for an exponentially-distributed time before crashing, then returns error responses instead of allowing requests to pass through. After another exponentially-distributed outage time, it recovers, and the process continues. You can interpret this as a physical piece of hardware, or a virtual machine, or a hot-spare scenario where another node spins up to take the downed one’s place, etc. This is a fail-fast model–the node returns failure immediately instead of swallowing messages indefinitely. Since the simulations we’re running are short-lived, I’m going to choose relatively short failure times so we can see what happens under changing dynamics.

(defn faulty-dyno []
  (cable 2
    ; Mean time before failure of 20 seconds, and
    ; mean time before resolution of one second.
    (faulty 20000 1000
      (queue-exclusive
        (delay-fixed 20
          (delay-exponential 100
            (server :rails))))))

Again, we’re using a pool of 250 dynos and a poisson-distributed load function. Let’s compare an even load balancer with a pool of perfect dynos vs a pool of faulty ones:

(test-node "Reliable min-conn -> pool of faulty dynos."
  (lb-min-conn
    (pool pool-size
      (faulty-dyno)))))

.                     Ideal dynos       95% available dynos
Total reqs:           100000            100000
Selected reqs:        50000             50000
Successful frac:      1.0               0.62632
Request rate:         678.2972 reqs/s   679.6156 reqs/s
Response rate:        673.90894 reqs/s  676.74567 reqs/s
Latency distribution:
Min:                  24.0              4.0
Median:               93.0              46.5
95th %:               323.0             272.0
99th %:               488.0             438.0
Max:                  1044.0            914.0

Well that was unexpected. Even though our pool is 95% available, over a third of all requests fail. Because our faulty nodes fail immediately, they have smaller queues on average–and the min-conns load balancer routes more requests to them. Real load balancers like HAProxy keep track of which nodes fail and avoid routing requests to them. Haproxy uses active health checks, but for simplicity I’ll introduce a passive scheme: when a request fails, don’t decrement that host’s connection counter immediately. Instead, wait for a while–say 1 second, the mean time to resolution for a given dyno. We can still return the error response immediately, so this doesn’t stop the load balancer from failing fast, but it will reduce the probability of assigning requests to broken nodes.

(lb-min-conn :lb {:error-hold-time 1000}
  (pool pool-size
    (faulty-dyno)))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.98846
Request rate:     678.72076 reqs/s
Response rate:    671.3302 reqs/s
Latency distribution:
Min:     4.0
Median:  92.0
95th %:  323.0
99th %:  486.0
Max:     1157.0

Throughput is slightly lower than the ideal, perfect pool of dynos, but we’ve achieved 98% reliability over a pool of nodes which is only 95% available, and done it without any significant impact on latencies. This system is more than the sum of its parts.

This system has an upper bound on its reliability: some requests must fail in order to determine which dynos are available. Can we do better? Let’s wrap the load balancer with a system that retries requests on error, up to three requests total:

(test-node "Retry -> min-conn -> faulty pool"
  (retry 3 
    (lb-min-conn :lb {:error-hold-time 1000}
      (pool pool-size
        (faulty-dyno))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99996
Request rate:     676.8098 reqs/s
Response rate:    670.16046 reqs/s
Latency distribution:
Min:     12.0
Median:  94.0
95th %:  320.0
99th %:  484.0
Max:     944.0

The combination of retries, least-conns balancing, and diverting requests away from failing nodes allows us to achieve 99.996% availability with minimal latency impact. This is a great building block to work with. Now let’s find a way to compose it into a large-scale distributed system.

Multilayer routing

Minimum-connections and round-robin load balancers require coordinated state. If the machines which comprise our load balancer are faulty, we might try to distribute the load balancer itself in a highly available fashion. That would require state coordination with low latency bounds–and the CAP theorem tells us this is impossible to do. We’d need to make probabilistic tradeoffs under partitions, like allowing multiple requests to flow to the same backend.

What if we punt on AP min-conns load balancers? What if we make them single machines, or CP clusters? As soon as the load balancer encountered a problem, it would become completely unavailable.

(defn faulty-lb
  [pool]
  (faulty 20000 1000
    (retry 3
      (lb-min-conn :lb {:error-hold-time 1000}
        pool))))

Let’s model the Bamboo architecture again: a stateless, random routing layer on top, which allocates requests to a pool of 10 faulty min-conns load balancers, all of which route over a single pool of faulty dynos:

(test-node "Random -> 10 faulty lbs -> One pool"
  (let [dynos (dynos pool-size)]
    (lb-random
      (pool 10
        (cable 5
          (faulty-lb
            dynos)))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.9473
Request rate:     671.94366 reqs/s
Response rate:    657.87744 reqs/s
Latency distribution:
Min:     10.0
Median:  947.0
95th %:  1620.0
99th %:  1916.0
Max:     3056.0

Notice that our availability dropped to 95% in the two-layer distributed model. This is a consequence of state isolation: because the individual least-conns routers don’t share any state, they can’t communicate about which nodes are down. That increases the probability that we’ll allocate requests to broken dynos. A load-balancer which performed active state-checks wouldn’t have this problem; but we can work around it by adding a second layer of retries on top of the stateless random routing layer:

(let [dynos (pool pool-size (faulty-dyno))]
      (retry 3
        (lb-random
          (pool 10
            (cable 5
              (faulty-lb
                dynos))))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99952
Request rate:     686.97363 reqs/s
Response rate:    668.2616 reqs/s
Latency distribution:
Min:     30.0
Median:  982.0
95th %:  1639.0
99th %:  1952.010000000002
Max:     2878.0

This doesn’t help our latency problem, but it does provide three nines availability! Not bad for a stateless routing layer on top of a 95% available pool. However, we can do better.

Isolating the least-conns routers from each other is essential to preserve liveness and availability. On the other hand, it means that they can’t share state about how to efficiently allocate requests over the same dynos–so they’ll encounter more failures, and queue multiple requests on the same dyno independently. One way to resolve this problem is to ensure that each least-conns router has a complete picture of its backends’ state. We isolate the dynos from one another:

This has real tradeoffs! For one, an imbalance in the random routing topology means that some min-conns routers will have more load than their neighbors–and they can’t re-route requests to dynos outside their pool. And since our min-conns routers are CP systems in this architecture, when they fail, an entire block of dynos is unroutable. We have to strike a balance between more dynos per block (efficient least-conns routing) and more min-conn blocks (reduced impact of a router failure).

Let’s try 10 blocks of 25 dynos each:

  (test-node "Retry -> Random -> 10 faulty lbs -> 10 pools"
    (retry 3
      (lb-random
        (pool 10
          (cable 5
            (faulty-lb
              (pool (/ pool-size 10)
                (faulty-dyno)))))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99952
Request rate:     681.8213 reqs/s
Response rate:    677.8099 reqs/s
Latency distribution:
Min:     30.0
Median:  104.0
95th %:  335.0
99th %:  491.0
Max:     1043.0

Whoah! We’re still 99.9% available, even with a stateless random routing layer on top of 10 95% available routers. Throughput is slightly down, but our median latency is nine times lower than the homogenous dyno pool.

I think system composition is important in distributed design. Every one of these components is complex. It helps to approach each task as an isolated system, and enforce easy-to-understand guarantees about that component’s behavior. Then you can compose different systems together to make something bigger and more useful. In these articles, we composed an efficient (but nonscalable) CP system with an inefficient (but scalable) AP system to provide a hybrid of the two.

If you have awareness of your network topology and are designing for singlethreaded, queuing backends, this kind of routing system makes sense. However, it’s only going to be efficient if you can situate your dynos close to their least-conns load balancer. One obvious design is to put one load balancer in each rack, and hook it directly to the rack’s switch. If blocks are going to fail as a group, you want to keep those blocks within the smallest network area possible. If you’re working in EC2, you may not have clear network boundaries to take advantage of, and correlated failures across blocks could be a real problem.

This architecture also doesn’t make sense for concurrent servers–and that’s a growing fraction of Heroku’s hosted applications. I’ve also ignored the problem of dynamic pools, where dynos are spinning up and exiting pools constantly. Sadly I’m out of time to work on this project, but perhaps a reader will chime in a model for for distributed routing over concurrent servers–maybe with a nonlinear load model for server latencies?

Thanks for exploring networks with me!

Sean Cribbs on 2013-02-19

The strict definition of liveness is: for any partial execution, there exists a sequence of events/steps which will achieve the desired outcome. That does not imply something completes in a finite amount of time – that’s a safety property, because there is a specific point in time at which it can be violated. Liveness properties are violated when there is a state in which the system can never achieve the desired outcome.

Also, temporality is hard.

Sean Cribbs on 2013-02-19

On the other hand, having a bound on the time a request can take before aborting is a way to achieve liveness, i.e. that a response happens, even if it is a failure state. A non-live system would let the request continue even if it had no chance of completing. In that case, the liveness property is “eventual service”. Aborting the request because of timeout ensures that a future resubmission has the possibility of completing successfully.

Aphyr on 2013-02-19

Ah, thanks Sean. I’ll update the post. Formal definitions are hard!

Ryan King on 2013-02-19

You can solve the un-routable blocks problem by having each of the second tier load balancers send traffic to more than one block (but less than all the blocks).

I haven’t run the simulation, but it seems like that might be a good middle ground.

Aphyr on 2013-02-19

You’re right, Ryan, but for queued services over, say, 50% capacity, the latency cost of even 2 least-conns routers per block can be pretty high. If I were designing a system like this, I’d probably track the block->dyno mapping in an eventually-consistent geographically distributed store, with the understanding that so long as migration times are significantly longer than the store’s convergence times, you’re unlikely to see double-counts. Then you could gradually rebalance dynos to accomodate persistently uneven load or long-lasting failures.

Crutcher Dunnavant on 2013-02-20

Timelike looks very useful. I’ve done these things in practice, and the numbers you’re coming up with feel right in my past experience.

Reading through the thread on reddit however, I found an approach I’d never seen in practice, and the math looks neat: http://brooker.co.za/blog/2012/01/17/two-random.html

Could you do a no-partition, two-random least-con simulation?

jrheard on 2013-02-20

There’s a syntax error in the first code block - give faulty-dyno an arglist please :)

Drew on 2013-02-20

Your analysis is good, but you completely missed the point: The problem is NOT that heorku has an inferior or underperforming product. They can have a crappy a product as they want. The problem is what they were selling was a lie. They claimed to provide one thing, took people’s money, and then provided something else entirely.

Aphyr on 2013-02-20

Could you do a no-partition, two-random least-con simulation?

I’d actually love to, but I’m honestly pretty pressed for time right now–got a journal article, a Riemann release, a new job, and 2 talks to deliver in the next few months. If I get a chance I’ll do a followup–and I’m happy to link to anything you find out if you give it a try!

Aphyr on 2013-02-21

There’s a syntax error in the first code block

Thanks Ewing, fixed now. :)

Justin Mason on 2013-02-21

This is a fantastic writeup.

BTW, +1 on trying out Marc Brooker’s algorithm, if you get a chance – I’ve seen what he’s worked on and written inside Amazon. he knows what he’s talking about ;)

Geert-Jan Brits on 2013-04-06

Excellent read.

Your last diagram “blocks of dyno’s and 1 logical LB per block” got me thinking.

You state: “This has real tradeoffs! For one, an imbalance in the random routing topology means that some min-conns routers will have more load than their neighbors–and they can’t re-route requests to dynos outside their pool. ”

But, in a multi-tenant system (like Heroku), couldn’t we use the knowledge of the separate tenants (each tenant has it’s own subset of dedicated dyno’s) to more efficiently partition the blocks of dyno’s?

Let’s for sake of argument go with the simplest case: each tenant would map 1 on 1 to a block of dyno’s. On top of that 1 min-conn router to cover 1 block ( i.e: serves 1 tenant) .

In this scenario we need an extra LB layer in front of the described setup in which a request is mapped to a tenant and by extension is mapped to the correct Logical LB which exists solely to serve said tenant.

The flow would thus be: request -> dns latency (or geo) routing -> DC front loadbalancer to determine tenant (and min-conn router that serves tenant) -> min conn router sets out request with dyno of tenant.

This would be optimal for each tenant: dyno’s for each tenant would be optimally utilized.

Moreover, because of the partitioning per tenant each of the min-conn routers only has to concern itself with a fraction of the data to make it’s decision, namely: only the queue depth of all dyno’s of a particular tenant.

This in turn would make it far easier (?) to make each of the min-conn routers loadbalanced and CP while still keeping reasonable latency to communicate state.

Any flaw in this thinking? I’m really not pretending to be an expert in this field (far from it) but this sounds pretty reasonable in my head. Any debunking welcome :)

Not sure if this multi-tenancy aspect could be elegantly modeled with Timelike, but I’d surely like to try.

Robert on 2013-11-13

When we start retrying failed requests we can repeat a request which failed just before returning a response, so they need to be idempotent. That got me wondering: would backup requests help? I mean sending a request to a second dyno if it takes long enough. Upon receiving a response from either of them, we send the response back and try to cancel the other one (if it is still in the queue). This increases load, but decreases latency dispersion if the distribution of latencies of different dynos are independent.

vbitdpyuki on 2021-03-22

Muchas gracias. ?Como puedo iniciar sesion?

Timelike 2: everything fails all the time

Dealing with unreliability

Multilayer routing

Post a Comment