For more on Timelike and routing simulation, check out part 2 of this article: everything fails all the time. There’s also more discussion on Reddit.
RapGenius is upset about Heroku’s routing infrastructure. RapGenius, like many web sites, uses Rails, and Rails is notoriously difficult to operate in a multithreaded environment. Heroku operates at large scale, and made engineering tradeoffs which gave rise to high latencies–latencies with adverse effects on customers. I’d like to explore why Heroku’s Bamboo architecture behaves this way, and help readers reason about their own network infrastructure.
To start off with, here’s a Rails server. Since we’re going to be discussing complex chains of network software, I’ll write it down as an s-expression:
(server :rails)
Let’s pretend that server has some constant request-parsing overhead–perhaps 20 milliseconds–and an exponentially-distributed processing time with a mean of 100 milliseconds.
(delay-fixed 20
  (delay-exponential 100
    (server :rails)))
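As an aside, there’s nothing exotic about these delays. Here’s a rough sketch of how one might sample an exponentially distributed delay with a given mean via inverse-transform sampling (a hypothetical helper, not necessarily what delay-exponential does internally):

(defn rand-exponential
  "Sample an exponentially distributed delay, in milliseconds, with the given mean."
  [mean]
  (* (- mean) (Math/log (- 1.0 (rand)))))

(rand-exponential 100) ; => mostly short delays, with an occasional very long one

That long tail is what makes queueing interesting.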
Heroku runs a Rails application in a virtual machine called a Dyno, on EC2. Since the Rails server can only do one thing at a time, the dyno keeps a queue of HTTP requests, and applies them sequentially to the rails application. We’ll talk to the dyno over a 2-millisecond-long network cable.
(defn dyno []
  (cable 2
    (queue-exclusive
      (delay-fixed 20
        (delay-exponential 100
          (server :rails))))))
This node can process an infinite queue of requests at the average rate of 1 every 124 milliseconds (2 + 20 + 100 + 2). But some requests take longer than others. What happens if your request lands behind a different, longer request? How long do you, the user, have to wait?
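Before simulating anything, a back-of-the-envelope example with made-up service times shows the problem. Each request’s completion time is just the running sum of everything ahead of it in the queue:

; Hypothetical service times, in ms, for four queued requests
(def service-times [40 500 30 120])

; When each request finishes, measured from the moment the first one starts
(reductions + service-times)
; => (40 540 570 690)

The 30 ms request doesn’t take 30 ms; it finishes at 570 ms, having spent 540 ms stuck behind a slow neighbor. The simulation below measures exactly this effect, at scale.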
Introducing Timelike
Surprise! This way of describing network systems is also executable code. Welcome to Timelike.
(cable 2 ...) returns a function which accepts a request, sleeps for 2 milliseconds, then passes the request to a child function–in this case, a queuing function returned by queue-exclusive. Then cable sleeps for 2 more milliseconds to simulate the return trip, and returns the response from queue-exclusive. The request (and response) are just a list of events, each one timestamped. The return value of each function, or “node”, is the entire history of a request as it passes through the pipeline.
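Conceptually, then, a node like cable is just a higher-order function. Here’s a simplified sketch of the idea (my illustration, not Timelike’s actual implementation):

(defn cable-sketch
  "Wrap a downstream node in a fixed two-way network latency of ms milliseconds."
  [ms downstream]
  (fn [req]
    (Thread/sleep ms)              ; outbound trip
    (let [resp (downstream req)]   ; hand the request to the child node
      (Thread/sleep ms)            ; return trip
      resp)))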
Network node composition is function composition–and since they’re functions, we can run them.
(let [responses (future*
                  ; In a new thread, generate poisson-distributed
                  ; requests. We want 10,000 total, spaced roughly
                  ; 150 milliseconds apart. Apply them to a single
                  ; dyno.
                  (load-poisson 10000 150 req (dyno)))]
  (prn (first @responses))
  (pstats @responses))
Timelike doesn’t actually sleep for 150 milliseconds between requests. The OpenJDK and Oracle schedulers aren’t that reliable to begin with–and we don’t actually need to wait that long to compute the value of this function. We just virtualize time for every thread in the network (in this case, a thread per request). All operations complete “immediately” according to the virtual clock, and the clock only advances when threads explicitly sleep. We can still exploit parallelism whenever two threads wake up at the same time, and advance the clock whenever there’s no more work to be done at a given time. The scheduler will even detect deadlocks and allow the clock to advance when active threads are blocked waiting to acquire a mutex held by a thread which won’t release it until the future… though that’s a little slow. ;-)
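The core trick is easier to see without real threads. Here’s a drastically simplified, single-threaded sketch of the idea (my illustration, not Timelike’s scheduler): keep an agenda of pending wake-ups sorted by time, and jump the clock straight to the next one instead of actually waiting.

; A toy discrete-event loop. "Sleeping" just means scheduling a thunk at a
; later virtual time; the clock leaps forward instead of waiting.
(defn run-agenda
  [events]
  (loop [clock  0
         agenda (sort-by first events)]
    (if-let [[t thunk] (first agenda)]
      (do (println "t =" t "ms:" (thunk))
          (recur t (rest agenda)))
      clock)))

(run-agenda [[124 (fn [] :first-response)]
             [271 (fn [] :second-response)]
             [305 (fn [] :third-response)]])
; Returns 305 virtual milliseconds, almost instantly in wall-clock time.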
The upside of all this ridiculous lisp is that you can simulate concurrent systems where the results are independent of wall-clock time, which makes it easier to compare parallel systems at different scales. You can simulate one machine or a network of thousands, and the dynamics are the same.
Here’s an example request, and some response statistics. We discard the first and last parts of the request logs to avoid measuring the warm-up or cool-down period of the dyno queue.
[{:time 0} {:node :rails, :time 66}]
Total reqs: 10000
Selected reqs: 5000
Successful frac: 1.0
Request rate: 6.6635394 reqs/s
Response rate: 6.653865 reqs/s
Latency distribution:
Min: 22.0
Median: 387.0
95th %: 1728.0
99th %: 2894.1100000000024
Max: 3706.0
Since the request and response rates are close, we know the dyno was stable during this time–it wasn’t overloaded or draining its queue. But look at that latency distribution! Our median request took 3 times the mean, and some requests blocked for multiple seconds. Requests which stack up behind each other have to wait, even if they could complete quickly. We need a way to handle more than one request at a time.
How do you do that with a single-threaded Rails server? You run more server processes at once. On Heroku, you add more dynos. Each runs in parallel, so with n dynos you can (optimally) process n requests at a time.
(defn dynos
  "A pool of n dynos"
  [n]
  (pool n (dyno)))
There are those funny macros again: pool is a macro, so it can evaluate (dyno) n times and give us n distinct dynos.
Now you have a new problem: how do you get requests to the right dynos? Remember, whatever routing system we design needs to be distributed–multiple load balancers have to coordinate about the environment.
Random routing
Random load balancers are simple. When you get a new request, you pick a random dyno and send the request over there. In the infinite limit this is fine: a uniform distribution spreads an infinite number of requests evenly across the cluster. But our systems aren’t infinite. A random LB will sometimes send two, or even a hundred, requests to the same dyno while its neighbors sit idle. That dyno’s queue backs up, and everyone in it has to wait for all the requests ahead of them.
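Stripped of everything Timelike does around it, a random balancer is essentially a one-liner (a hypothetical sketch in the same node-as-function style):

(defn lb-random-sketch
  "Route each request to a uniformly random backend."
  [backends]
  (fn [req]
    ((rand-nth backends) req)))

Here’s the Timelike version, balancing across 250 dynos: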
(lb-random
  (dynos 250))
Total reqs: 100000
Selected reqs: 50000
Successful frac: 1.0
Request rate: 1039.7172 reqs/s
Response rate: 1012.6787 reqs/s
Latency distribution:
Min: 22.0
Median: 162.0
95th %: 631.0
99th %: 970.0
Max: 1995.0
A cool thing about random LBs is that they require little coordinated state. You don’t have to agree with your peers about where to route a request. They also compose freely: a layer of random load balancers over another layer of random load balancers has exactly the same characteristics as a single random load balancer, assuming perfect concurrency. On the other hand, leaving nodes unused while piling up requests on a struggling dyno is silly. We can do better.
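That composition property is easy to express with the combinators we’ve already seen. For instance, a two-tier random topology over a shared pool (a hypothetical illustration) might look like this:

(let [dynos (dynos 250)]
  (lb-random
    (pool 4
      (cable 5
        (lb-random
          dynos)))))

Aside from the extra 10 milliseconds of cable delay, its latency distribution should match a single random balancer’s. Swap the inner layer for a stateful balancer, though, and that equivalence breaks down, which is where Heroku’s story picks up below.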
Round-Robin routing
Round-robin load balancers write down all their backends in a circular list (also termed a “ring”). The first request goes to the first backend in the ring; the second request to the second backend, and so forth, around and around. This has the advantage of evenly distributing requests, and it’s relatively simple to manage the state involved: you only need to know a single number, telling you which element in the list to point to.
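In the same sketch style, the only state a round-robin balancer needs is a single counter (hypothetical, not Timelike’s lb-rr):

(defn lb-rr-sketch
  "Route requests to backends in a fixed rotating order."
  [backends]
  (let [i (atom -1)]
    (fn [req]
      (let [backend (nth backends (mod (swap! i inc) (count backends)))]
        (backend req)))))

And the Timelike round-robin balancer, over 250 dynos: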
(lb-rr
  (dynos 250))
Total reqs: 100000
Selected reqs: 50000
Successful frac: 1.0
Request rate: 1043.9939 reqs/s
Response rate: 1029.6116 reqs/s
Latency distribution:
Min: 22.0
Median: 105.0
95th %: 375.0
99th %: 560.0
Max: 1173.0
Compared to random routing, round-robin cut our 95th percentile latency from 631 to 375 ms, and median request time by roughly a third. RR balancers have a drawback, though. Most real-world requests–like the ones in our model–take a variable amount of time. When that variability is large enough (relative to pool saturation), round-robin balancers can put two long-running requests on the same dyno. Queues back up again.
Least-connections routing
A least-connections (min-conn) load balancer keeps track of the number of connections it currently has open to each backend. When a new request arrives, it picks the backend with the fewest open connections. For single-threaded servers, this also corresponds to the server with the shortest queue (in terms of request count, not time).
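A sketch of the bookkeeping involved (again hypothetical, and ignoring the virtual clock): an atom maps each backend to its in-flight connection count; each request increments its chosen backend’s count on the way in and decrements it on the way out.

(defn lb-min-conn-sketch
  "Route each request to the backend with the fewest connections in flight."
  [backends]
  (let [conns (atom (zipmap backends (repeat 0)))]
    (fn [req]
      ; Note: a real implementation would choose and increment atomically.
      (let [backend (key (apply min-key val @conns))]
        (swap! conns update backend inc)
        (try
          (backend req)
          (finally
            (swap! conns update backend dec)))))))

And the Timelike version: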
(lb-min-conn
  (dynos 250))
Total reqs: 100000
Selected reqs: 50000
Successful frac: 1.0
Request rate: 1049.7806 reqs/s
Response rate: 1041.1244 reqs/s
Latency distribution:
Min: 22.0
Median: 92.0
95th %: 322.0
99th %: 483.0
Max: 974.0
Our 95th percentile latency has gone from 631 ms (random), to 375 ms (round-robin), to 322 ms (least-conns). This algorithm is significantly more efficient over our simulated dynos than random or round-robin balancing–though it’s still not optimal. An optimal algorithm would predict the future and figure out how long each request will take before allocating it–so it could avoid stacking two long-running requests in the same queue.
Least-conns also means keeping track of lots of state: a number for every dyno, at least. All that state has to be shared between the load balancers in a given cluster, which can be expensive. On the other hand, we could afford up to a 200-millisecond delay on each connection, and still be more efficient than a random balancer. That’s a fair bit of headroom.
Meanwhile, in the real world
Heroku can’t use round-robin or min-conns load balancers for their whole infrastructure–it’s just too big a problem to coordinate. Moreover, some of the load balancers are far apart from each other so they can’t communicate quickly or reliably. Instead, Heroku uses several independent least-conns load balancers for their Bamboo stack. This has a drawback: with two least-conns routers, you can load the same dyno with requests from both routers at once–which increases the queue depth variability.
Let’s hook up a random router to a set of min-conns routers, all backed by the same pool of 250 dynos. We’ll separate the random routing layer from the min-conns layer by a 5-millisecond-long network cable.
(defn bamboo-test
  [n]
  (test-node (str "Bamboo with " n " routers")
    (let [dynos (dynos pool-size)]
      (lb-random
        (pool n
          (cable 5
            (lb-min-conn
              dynos)))))))
(deftest ^:bamboo bamboo-2
  (bamboo-test 2))

(deftest ^:bamboo bamboo-4
  (bamboo-test 4))

(deftest ^:bamboo bamboo-8
  (bamboo-test 8))

(deftest ^:bamboo bamboo-16
  (bamboo-test 16))
This plot sums up, in a nutshell, why RapGenius saw terrible response times. Latencies in this model–especially those killer 95th and 99th percentile times–rise linearly with additional least-conns routers (asymptotically bounded by the performance of a random router). As Heroku’s Bamboo cluster grew, so did the variability of dyno queue depths.
This is not the only routing topology available. In part 2, I explore some other options for distributed load balancing. If you want to experiment with Timelike for yourself, check out the github project.