Chronos is a distributed task scheduler (cf. cron) for the Mesos cluster management system. In this edition of Jepsen, we’ll see how simple network interruptions can permanently disrupt a Chronos+Mesos cluster.
Chronos relies on Mesos, which has two flavors of node: master nodes, and slave nodes. Ordinarily in Jepsen we’d refer to these as “primary” and “secondary” or “leader” and “follower” to avoid connotations of, well, slavery, but the master nodes themselves form a cluster with leaders and followers, and terms like “executor” have other meanings in Mesos, so I’m going to use the Mesos terms here.
Mesos slaves connect to masters and offer resources like CPU, disk, and memory. Masters take those offers and make decisions about resource allocation using frameworks like Chronos. Those decisions are sent to slaves, which actually run tasks on their respective nodes. Masters form a replicated state machine with a persistent log. Both masters and slaves rely on Zookeeper for coordination and discovery. Zookeeper is also a replicated persistent log.
Chronos runs on several nodes, and uses Zookeeper to discover Mesos masters. The Mesos leading master offers CPU, disk, etc to Chronos, which in turn attempts to schedule jobs at their correct times. Chronos persists job configuration in Zookeeper, and may journal additional job state to Cassandra. Chronos has its own notion of leader and follower nodes, independent from both Mesos and Zookeeper.
There are, in short, a lot of moving parts here–which leads to the question at the heart of every Jepsen test: will it blend?
Designing a test
Zookeeper will run across all 5 nodes. Our production Mesos installation separates control from worker nodes, so we’ll run Mesos masters on n1, n2, and n3; and Mesos slaves on n4 and n5. Finally, Chronos will run across all 5 nodes. We’re working with Zookeeper version 3.4.5+dfsg-2, Mesos 0.23.0-1.0.debian81, and Chronos 2.3.4-1.0.81.debian77; the most recent packages available in Wheezy and the Mesosphere repos as of August 2015.
Jepsen works by generating random operations and applying them to the system, building up a concurrent history of operations. We need a way to create new, randomized jobs, and to see what runs have occurred for each job. To build new jobs, we’ll write a stateful generator which emits jobs with a unique integer :name, a :start time, a repetition :count, a run :duration, an :epsilon window allowing jobs to run slightly late, and finally, an :interval between the start of each window.
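Here’s a rough sketch of what such a generator might look like; the numeric ranges, the now argument, and head-start are purely illustrative stand-ins, not the test’s actual parameters.

(def job-id
  "Monotonically increasing job names."
  (atom -1))

(defn next-job
  "Emit a fresh, randomized job map. All values are in seconds; the specific
  ranges, `now`, and `head-start` are hypothetical."
  [now head-start]
  (let [duration (rand-int 10)                ; how long each run sleeps
        epsilon  (+ 10 (rand-int 10))]        ; window in which a run may begin
    {:name     (swap! job-id inc)             ; unique integer name
     :start    (+ now head-start)             ; first target time
     :count    (inc (rand-int 5))             ; number of repetitions
     :duration duration
     :epsilon  epsilon
     :interval (+ epsilon 30 duration)}))     ; spacing between target windows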
This may seem like a complex way to generate tasks, and indeed earlier generators were much simpler–however, they led to failed constraints. Chronos takes a few seconds to spin up a task, which means that a task could run slightly after its epsilon window. To allow for this minor fault, we add an additional epsilon-forgiveness as padding, allowing Chronos to fudge its guarantees somewhat. Chronos also can’t run tasks immediately after their submission, so we add a small head-start delaying the beginning of each job. Finally, Chronos tries not to run tasks concurrently, which bounds the interval between targets: we ensure that the interval is large enough that a task could run at the end of the target’s epsilon window, plus that epsilon forgiveness, and still complete before the next window begins.
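As a concrete statement of that bound (a sketch; epsilon-forgiveness is the padding constant described above):

(defn interval-ok?
  "A run that begins at the very end of its window (epsilon plus forgiveness
  after the target) must still finish before the next window opens."
  [{:keys [interval epsilon duration]} epsilon-forgiveness]
  (<= (+ epsilon epsilon-forgiveness duration) interval))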
Once jobs are generated, we transform them into a suitable JSON representation and make an HTTP POST to submit them to Chronos. Only successfully acknowledged jobs are required for the analysis to pass.
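A sketch of submission using clj-http and Cheshire. The /scheduler/iso8601 endpoint appears in the error traces later in this post; the exact field names, and the command and iso8601-schedule helpers, are illustrative rather than authoritative.

(require '[clj-http.client :as http]
         '[cheshire.core   :as json])

(defn add-job!
  "POST a job to the Chronos scheduler on `host`. clj-http throws on non-2xx
  responses, so only acknowledged jobs make it into the analysis."
  [host job]
  (http/post (str "http://" host ":4400/scheduler/iso8601")
             {:content-type :json
              :body (json/generate-string
                      {:name     (str (:name job))
                       :command  (command job)             ; hypothetical: shell command, sketched below
                       :epsilon  (str "PT" (:epsilon job) "S")
                       :schedule (iso8601-schedule job)})})) ; hypothetical: "R<count>/<start>/PT<interval>S"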
We need a way to identify which tasks ran and at what times. Our jobs will open a new file and write their job ID and current time, sleep for some duration, then, to indicate successful completion, write the current time again to the same file. We can reconstruct the set of all runs by parsing the files from all nodes. Runs are considered complete iff they wrote a final timestamp. In this particular test, all node clocks are perfectly synchronized, so we can simply union times from each node without correction.
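A sketch of both halves, with invented paths and formats: the shell command each job runs, and a parser for the resulting run files.

(require '[clojure.string :as str])

(defn command
  "Shell command for a run: create a fresh file, log the job name and start
  time, sleep for the job's duration, then log the completion time."
  [job]
  (str "f=/tmp/chronos-test/" (:name job) "-$(date +%s%N); "
       "echo " (:name job) " $(date +%s) >> $f; "
       "sleep " (:duration job) "; "
       "date +%s >> $f"))

(defn parse-run
  "Parse one run file into a map; a run without a final timestamp is
  incomplete."
  [contents]
  (let [[header end] (str/split-lines contents)
        [job start]  (str/split header #"\s+")]
    {:job       (Long/parseLong job)
     :start     (Long/parseLong start)
     :end       (when end (Long/parseLong end))
     :complete? (boolean end)}))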
With the basic infrastructure in place, we’ll write a client which takes add-job and read operations and applies them to the cluster. As with all Jepsen clients, this one is specialized via (setup! client test node) into a client bound to a specific node, ensuring we route requests to both leaders and non-leaders.
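A minimal sketch of such a client, using the three-method client protocol Jepsen had at the time; add-job! is sketched above, and read-runs is a hypothetical helper that gathers and parses run files from every node.

(require '[jepsen.client :as client])

(defn chronos-client
  "A client bound to a single Chronos node (nil before setup!)."
  ([] (chronos-client nil))
  ([node]
   (reify client/Client
     (setup! [_ test node]
       ; Bind a copy of the client to this particular node
       (chronos-client node))
     (invoke! [_ test op]
       (case (:f op)
         :add-job (do (add-job! node (:value op))
                      (assoc op :type :ok))
         :read    (assoc op :type :ok, :value (read-runs test))))
     (teardown! [_ test]))))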
Finally, we bind together the database, OS, client, and generators into a single test. Our generator emits add-job operations with a 30 second delay between each, randomly staggered by up to 30 seconds. Meanwhile, the special nemesis process cycles between creating and resolving failures every 200 seconds. This phase proceeds for some time, after which the nemesis resolves any ongoing failures and we allow the system to stabilize. Last, we have a single client read the current runs.
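Roughly, in the generator combinators of that era (a sketch: add-jobs stands for a generator wrapping next-job above, and the time limits are illustrative):

(require '[jepsen.generator :as gen])

(def main-phase
  (->> add-jobs                                  ; emits {:type :invoke, :f :add-job, :value job}
       (gen/delay 30)                            ; at least 30 seconds between jobs
       (gen/stagger 30)                          ; plus up to roughly 30 seconds of jitter
       (gen/nemesis
         (gen/seq (cycle [(gen/sleep 200)
                          {:type :info, :f :start}
                          (gen/sleep 200)
                          {:type :info, :f :stop}])))
       (gen/time-limit 500)))

(def generator
  (gen/phases
    main-phase
    (gen/nemesis (gen/once {:type :info, :f :stop})) ; heal any ongoing failure
    (gen/sleep 60)                                   ; let the cluster stabilize
    (gen/clients (gen/once {:type :invoke, :f :read}))))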
In order to evaluate the results, we need a checker, which examines the history of add-job and read operations and identifies whether Chronos did what it was supposed to.
How do you measure a task scheduler?
What does it mean for a cron system to be correct?
The trivial answer is that tasks run on time. Each task has a schedule, which specifies the times–call them targets–at which a job should run. The scheduler does its job iff, for every target time, the task is run.
Since we aren’t operating in a realtime environment, there will be some small window of time during which the job should run–call that epsilon. And because we can’t control how long tasks run for, we just want to ensure that the run begins somewhere between the target time t and t + epsilon–we’ll allow tasks to complete at their leisure.
Because we can only see runs that have already occurred, not runs from the future, we need to limit our targets to those which must have completed by the time the read began.
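As predicates, using the same epsilon-forgiveness padding introduced earlier (a sketch):

(defn satisfies?
  "Does `run` satisfy a target at time `t` for `job`? The run must begin
  within the target's window, padded by epsilon-forgiveness."
  [job epsilon-forgiveness t run]
  (and (<= t (:start run))
       (<= (:start run) (+ t (:epsilon job) epsilon-forgiveness))))

(defn must-have-completed?
  "Could a run for this target, started as late as allowed, have finished
  before the read began? Only such targets are required to be satisfied."
  [job epsilon-forgiveness read-time t]
  (<= (+ t (:epsilon job) epsilon-forgiveness (:duration job)) read-time))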
Since this is a distributed, fault-tolerant system, we should expect multiple, possibly concurrent runs for a single target. If a task doesn’t complete successfully, we might need to retry it–or a node running a task could become isolated from a coordinator, forcing the coordinator to spin up a second run. It’s a lot easier to recover from multiple runs than no runs!
So, given some set of jobs acknowledged by Chronos, and a set of runs for each job, we expand each job into a set of targets, attempt to map each target to some run, and consider the job valid iff every target is satisfied.
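Expanding a job into targets is just arithmetic over its schedule; a sketch using the helpers above:

(defn targets
  "Target start times for `job`: one per repetition, spaced by the interval,
  restricted to targets whose runs must have completed before the read."
  [job epsilon-forgiveness read-time]
  (->> (range (:count job))
       (map #(+ (:start job) (* % (:interval job))))
       (filter (partial must-have-completed? job epsilon-forgiveness read-time))))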
Assigning values to possibly overlapping bins is a constraint logic problem. We can use Loco, a wrapper around the Choco constraint solver, to find a unique mapping from targets to runs. In the degenerate case where targets don’t overlap, we can simply sort both targets and runs and riffle them together. This approach is handy for getting partial solutions when the entire constraint problem can’t be satisfied.
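The Loco formulation is too long to reproduce here, but the degenerate non-overlapping case is easy to sketch: sort both sequences and greedily pair each target with the earliest unused run that begins in its window.

(defn riffle-solve
  "Returns a map of target time -> run (or nil, if no run satisfies that
  target), assuming targets' windows do not overlap."
  [job epsilon-forgiveness ts runs]
  (loop [ts       (sort ts)
         runs     (sort-by :start runs)
         solution {}]
    (if-let [t (first ts)]
      (let [run (first (filter (partial satisfies? job epsilon-forgiveness t)
                               runs))]
        (recur (rest ts)
               (remove #(identical? % run) runs)
               (assoc solution t run)))
      solution)))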
This allows us to determine whether a set of runs satisfies a single job. To check multiple jobs, we simply group all runs by their job ID and solve each job independently, and consider the system valid iff every job is satisfiable by its runs.
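The whole-history check then reduces to a map over jobs (a sketch; solvable? stands in for the per-job constraint solve described above):

(defn valid-system?
  "Every acknowledged job must have all of its targets satisfied by the runs
  recorded under that job's ID."
  [jobs runs]
  (let [runs-by-job (group-by :job runs)]
    (every? (fn [job]
              (solvable? job (get runs-by-job (:name job) [])))
            jobs)))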
Finally, we have to transform the history of operations–all those add-job operations followed by a read–into a set of jobs and a set of runs, and identify the time of the read so we can compute the targets that should have been satisfied. We can use the mappings of job targets to runs to compute overall correctness results, and to build graphs showing the behavior of the system over time.
With our test framework in place, it’s time to go exploring!
Results
To start, Chronos’ error messages are less than helpful. In response to an invalid job (one with a malformed date, for instance), it simply returns HTTP 400 with an empty body.
{:orig-content-encoding nil,
:request-time 121
:status 400
:headers {"Server" "Jetty(8.y.z-SNAPSHOT"
"Connection" "close"
"Content-Length" "0"
"Content-Type" "text/html;charset=ISO-8859-1"
"Cache-Control" "must-revalidate,no-cache,no-store"}
:body ""}
Chronos can also crash when proxying requests to the leader, causing invalid HTTP responses:
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 1290; received: 0)
Or the brusque:
org.apache.http.NoHttpResponseException: n3:4400 failed to respond
Or the delightfully enigmatic:
{:orig-content-encoding nil,
:trace-redirects ["http://n4:4400/scheduler/iso8601"]
:request-time 19476
:status 500
:headers {"Server" "Jetty(8.y.z-SNAPSHOT"
"Connection" "close"
"Content-Length" "1290"
"Content-Type" "text/html;charset=ISO-8859-1"
"Cache-Control" "must-revalidate,no-cache,no-store"}
:body "<html>\n
<head>\n
<meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\"/>\n
<title>Error 500 Server Error</title>\n
</head>\n
<body>\n
<h2>HTTP ERROR: 500</h2>\n
<p>Problem accessing /scheduler/iso8601. Reason:\n
<pre> Server Error</pre></p>\n
<hr /><i><small>Powered by Jetty://</small></i>\n \n \n ... \n</html>\n"}
In other cases, you may not get a response from Chronos at all, because Chronos’ response to certain types of failures–for instance, losing its Zookeeper connection–is to crash the entire JVM and wait for an operator or supervising process, e.g. upstart, to restart it. This is particularly vexing because the Mesosphere Debian packages for Chronos don’t include a supervisor, and service chronos start isn’t idempotent, which makes it easy to run zero or dozens of conflicting copies of the Chronos process.
Chronos is the only system tested under Jepsen which hard-crashes in response to a network partition. The Chronos team asserts that allowing the process to keep running would allow split-brain behavior, making this expected, if undocumented, behavior. As it turns out, you can also crash the Mesos master with a network partition, and the Mesos maintainers say this is not how Mesos should behave, so this “fail-fast” philosophy may play out differently depending on which Mesos components you’re working with.
If you schedule jobs with intervals that are too short–even if their runs don’t overlap–Chronos can fail to run jobs on time, because the scheduler loop can’t handle granularities finer than --schedule_horizon, which is 60 seconds by default. Lowering the schedule horizon to 1 second allows Chronos to satisfy all executions for intervals around 30 seconds–so long as no network failures occur.
However, if the network does fail (for instance, if a partition cleanly isolates two nodes from the other three), Chronos will fail to run any jobs–even after the network recovers.
This plot shows targets and runs for each job over time. Targets are thick bars, and runs are narrow, darker bars. Green targets are satisfied by a run beginning in their time window, and red targets show where a task should have run but didn’t. The Mesos master dies at the start of the test and no jobs run until a failover two minutes later.
The gray region shows the duration of a network partition isolating [n2 n3] from [n1 n4 n5]. Chronos stops accepting new jobs for about a minute just after the partition begins, then recovers. ZK can continue running in the [n1 n4 n5] component, as can Chronos, but the Mesos masters live on [n1 n2 n3], and to preserve a majority of them, Mesos can only allow a leading master in [n2 n3]. Isolating Chronos from the Mesos leading master prevents job execution during the partition–hence every target during the partition is red.
This isn’t the end of the world–it does illustrate the fragility of a system with three distinct quorums, all of which must be available and connected to one another, but there will always be certain classes of network failure that can break a distributed scheduler. What one might not expect, however, is that Chronos never recovers when the network heals. It continues accepting new jobs, but won’t run any jobs at all for the remainder of the test–every target is red even after the network heals. This behavior persists even when we give Chronos 1500+ seconds to recover.
The timeline here is roughly:
- 0 seconds: Mesos on n3 becomes leading master
- 15 seconds: Chronos on n1 becomes leader
- 224 seconds: A partition isolates [n1 n4] from [n2 n3 n5]
- 239 seconds: Chronos on n1 detects ZK connection loss and does not crash
- 240 seconds: A few rounds of elections; n2 becomes Mesos leading master
- 270 seconds: Chronos on n3 becomes leader and detects n2 as Mesos leading master
- 375 seconds: The partition heals
- 421 seconds: Chronos on n1 recovers its ZK connection and recognizes n3 as new Chronos leader.
This is bug #520: after Chronos fails over, it registers with Mesos as an entirely new framework instead of re-registering. Mesos assumes the original Chronos framework still owns every resource in the cluster, and refuses to offer resources to the new Chronos leader. Why did the first leader consume all resources when it only needed a small fraction of them? I’m not really sure.
I0812 12:13:06.788936 12591 hierarchical.hpp:955] No resources available to allocate!
Studious readers may also have noticed that in this test, Chronos leader and non-leader nodes did not crash when they lost their connections, but instead slept and reconnected at a later time. This contradicts the design statements made in #513, where a crash was expected and necessary behavior. I’m not sure what lessons to draw from this, other than that operators should expect the unexpected.
As a workaround, the Chronos team recommended setting --offer_timeout (I chose 30secs) to allow Mesos to reclaim resources from the misbehaving Chronos framework. They also recommended automatically restarting both Chronos and Mesos processes–both can recover from some kinds of partitions, but others cause them to crash.
With these changes in place, the cluster may recover some jobs but not others. Just after the partition resolves, it runs most (but not all!) jobs outside their target times. For instance, Job 14 runs twice in too short a window, just after the partition ends. Job 9, on the other hand, never recovers at all.
Or maybe you’ll get some jobs that run during a partition, followed by a wave of failures a few minutes after resolution–and sporadic scheduling errors later on.
I’m running out of time to work on Chronos and can’t explore much further, but you can follow the Chronos team’s work in #511.
Recommendations
In general, the Mesos and Chronos documentation is adequate for developers but lacks operational guidance; for instance, it omits that Chronos nodes are fragile by design and must be supervised by a daemon to restart them. The Mesosphere Debian packages don’t provide these supervisory daemons; you’ll have to write and test your own.
Similar conditions (e.g. a network failure) can lead to varied failure modes: for instance, both Mesos and Chronos can sleep and recover from some kinds of network partitions isolating leaders from Zookeeper, but not others. Error messages are unhelpful and getting visibility into the system is tricky.
In Camille Fournier’s excellent talk on consensus systems, she advises that “Zookeeper Owns Your Availability.” Consensus systems are a necessary and powerful tool, but they add complexity and new failure modes. Specifically, if the consensus system goes down, you can’t do work any more. In Chronos’s case, you’re not just running one consensus system, but three. If any one of them fails, you’re in for a bad time. An acquaintance notes that at their large production service, their DB has lost two of its three quorum nodes twice this year.
Transient resource or network failures can completely disable Chronos. Most systems tested with Jepsen return to some sort of normal operation within a few seconds to minutes after a failure is resolved. In no Jepsen test has Chronos ever recovered completely from a network failure. As an operator, this fragility does not inspire confidence.
Production users confirm that Chronos handles node failure well, but can get wedged when ZK becomes unavailable.
If you are evaluating Chronos, you might consider shipping cronfiles directly to redundant nodes and having tasks coordinate through a consensus system–depending on your infrastructure’s reliability and need for load-balancing, that could be simpler and more reliable. Several engineers suggest that Aurora is more robust than Chronos, though more difficult to set up. I haven’t evaluated Aurora yet, but it’s likely worth looking into.
If you already use Chronos, I suggest you:
- Ensure your Mesos and Chronos processes are surrounded with automatic-restart wrappers
- Monitor Chronos and Mesos uptime to detect restart loops
- Ensure your Chronos schedule_horizon is shorter than job intervals
- Set Mesos’ --offer_timeout to some reasonable (?) value
- Instrument your jobs to identify whether they ran or not
- Ensure your jobs are OK with being run outside their target windows
- Ensure your jobs are OK with never being run at all
- Avoid network failures at all costs
I still haven’t figured out how to get Chronos to recover from a network failure; presumably some cycle of total restarts and clearing ZK can fix a broken cluster state, but I haven’t found the right pattern yet. When Chronos fixes this issue, it’s likely that it will still refuse to run jobs during a partition. Consider whether you would prefer multiple or zero runs during network disruption–if zero is OK, Chronos may still be a good fit. If you need jobs to keep running during network partitions, you may need a different system.
This work is a part of my research at Stripe, where we’re trying to take systems reliability more seriously. My thanks to Siddarth Chandrasekaran, Brendan Taylor, Shale Craig, Cosmin Nicolaescu, Brenden Matthews, Timothy Chen, and Aaron Bell, and to their respective teams at Stripe, Mesos, and Mesosphere for their help in this analysis. I’d also like to thank Caitie McCaffrey, Kyle Conroy, Ines Sombra, Julia Evans, and Jared Morrow for their feedback.