So there’s a blog post that advises every method should, when possible, return self. I’d like to suggest you do the opposite: wherever possible, return something other than self.

Mutation is hard

Mutation makes code harder to reason about. Mutable objects make equality comparisons tricky: if you use a mutable object as the key in a hashmap, for instance, then change one of its fields, what happens? Can you look up the value with the mutated key? With an equal copy of the key’s original state? What about a set? An array? For a fun time, try these in various languages. Try it with mutable primitives, like Strings, if the language makes a distinction. Enjoy the results.
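Here’s one such experiment in Clojure, using a mutable Java list as a hash-map key. (A sketch: hash-map forces a real hash table, since small map literals use a linear equality scan; exact results vary by language and data structure.)

user=> (def k (java.util.ArrayList. [1 2]))
#'user/k
user=> (def m (hash-map k :treasure))
#'user/m
user=> (get m k)
:treasure
user=> (.add k 3)
true
user=> (get m k)
nil

Mutating the key changed its hashCode, so the lookup probes the wrong bucket: the entry is still in the map, but unreachable by its own key.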

If you call a function with a mutable object as an argument, you have very few guarantees about the object’s value once the call returns. It’s up to you to enforce invariants like “certain fields must be read together”.

If you have two threads interacting with mutable objects concurrently, things get weird fast.

Now, nobody’s arguing that mutability is always bad. There are really good reasons to mutate: your program ultimately must change state; must perform IO, to be meaningful. Mutation is usually faster, reduces GC pressure, and can be safe! It just comes with costs! The more of your program deals with pure values, the easier it is to reason about. If you compare two objects now, you know they’ll compare the same later. You can pass arguments to functions without ever having to worry that they’ll be changed out from underneath you. It gets easier to reason about thread safety.

Moreover, you don’t need a fancy type system like Haskell to experience these benefits: even in the unityped default-mutable wonderland of Ruby, having a culture that makes mutation explicit (for instance, gsub vs gsub!), a culture where not clobbering state is the default, can make our jobs a little easier. Remember, we don’t have to categorically prevent bugs; just make them less likely. Every bit helps.
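Clojure shares that convention: a trailing bang marks functions that mutate state, so pure and impure variants are visually distinct. A minimal sketch:

(def balance (atom 100))

; Pure: computes a new value, changes nothing.
(defn credited
  [b amount]
  (+ b amount))

; The bang warns us: swap! mutates the atom in place.
(swap! balance credited 50)
; => 150; balance now holds 150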

Returning nil, void, or self strongly suggests impurity

Any time you see a method like

public void foo(String X) { ... }

function(a, b) {
  ...
  return undefined;
}

def foo(args)
  ...
  self
end

you should read: “This function probably mutates state!” In an object oriented language, it might mutate the receiver (self or this). It might mutate any of its arguments. It might mutate variables in lexical scope. It might mutate the computing environment, by setting a global variable, or writing to the filesystem, or sending a network packet.

The hand-wavy argument for this is that there is exactly one meaningful pure function for each of these three return types: the constant void function, the constant nil function, and the identity function(s). If you see this signature used over and over, it’s a hint you’re staring at a big ball of mutable state.

Proof

We aim to show there is only one pure function returning void, one pure function returning nil, etc. In general, we wish to show for any value r you might care to return, there exists exactly one pure function which always returns r.

I’m going to try to write this for folks without a proofs background, but I will use some notation:

  • Capital letters, e.g. X, denote sets
  • f(x) is function application
  • a iff b means “a if, and only if, b”
  • | means “such that”
  • ∀ x means “for all x”
  • ∃ x means “there exists an x”
  • x ∈ X means “x is an element of the set X”
  • (x, y) is an ordered pair, like a tuple
  • X x Y is the Cartesian product: all ordered pairs of (x, y) taken from X and Y respectively.

Definitions

I’m going to depart slightly from the usual set-theoretic definitions to simplify the proof and reduce confusion with common CS terms. We’re interested in functions which might:

  • Take a receiver (e.g. this, self)
  • Take arguments
  • Return values
  • Throw exceptions
  • Depend on an environment
  • Mutate their environment

Let’s simplify.

  • A receiver is simply the first argument to a function.
  • Zero or multiple arguments can be represented as an ordered tuple: (), (arg1), (arg1, arg2, arg3, …).
  • Returning multiple return values (as in go) can be modeled by returning tuples.
  • Exceptions can be modeled as a special set of return values, e.g. (“exception”, “something bad!”)
  • In addition to mapping an argument to a return value, the function will map an initial environment e to a (possibly identical) final environment e'. The environment encapsulates IO, global variables, dynamic scope, mutable state, etc.

Now we adapt the usual set-theoretic graph definition of a function to our model:

Definition 1. A function f in an environment set E, from an input set X (the “domain”), to a set of return values Y (the “codomain”), written f: E, X -> Y, is the set of ordered tuples (e, e', x, y) where e and e' ∈ E, x ∈ X, and y ∈ Y, with two constraints:

  1. Completeness. ∀ x ∈ X, e ∈ E: ∃ (e, e', x, y) ∈ f.
  2. Determinism. ∀ (e₁, e₁', x₁, y₁), (e₂, e₂', x₂, y₂) ∈ f: e₁' = e₂' and y₁ = y₂ if e₁ = e₂ and x₁ = x₂

Completeness simply means that the function must return a value for all environments and x’s. Determinism just means that the environment and input x uniquely determine the new environment and return value. Nondeterministic functions are modeled by state in the environment.

We write function application in this model as f(e, x) = (e', y). Read: “Calling f on x in environment e returns y and changes the environment to e'.”

Definition 2. A function is pure iff ∀ (e, e', x, y) ∈ f, e = e'; i.e., its initial and final environments are identical.

There can be only one

We wish to show that for any value r, there is only one pure function which always returns r. Assume there exist two distinct pure functions f and g, over the same domain X, returning r. Remember, these functions are pure, so their initial and final environments are the same:

  • ∀ e ∈ E, x ∈ X: f(e, x) -> (e, r)
  • ∀ e ∈ E, x ∈ X: g(e, x) -> (e, r)

But by definition 1, f and g are simply:

  • f = {(e, e, x, r) | e ∈ E, x ∈ X}
  • g = {(e, e, x, r) | e ∈ E, x ∈ X}

… which are identical sets. We obtain a contradiction: f and g cannot be distinct; therefore, in any environment E and over any input set X, there exists only a single pure function that always returns r. ∎

You can make the exact same argument for functions that return their first (or nth) argument: they’re just variations on the identity function, one version for each arity:

  • (e, e, (x), x)
  • (e, e, (x, a), x)
  • (e, e, (x, a, b), x)
  • (e, e, (x, a, b, …), x)

Redundancy of functions over different domains

Given two pure single-valued functions over different domains f: E, X1 -> {r} and g: E, X2 -> {r}, let h be the set of all tuples in either f or g: h = f ∪ g.

Since f is pure, ∀ (e, e', x, y) ∈ f, e = e'; and the same for g. Therefore, ∀ (e, e', x, y) ∈ h, e = e' as well: h does not mutate its environment.

Since f has a mapping for all combinations of environments in E and inputs in X1, so does h. And the same goes for g: h has mappings for all combinations of environments in E and inputs in X2. h is therefore complete over E and X1 ∪ X2.

Since f and g always return r, ∀ (e, e', x, y) ∈ h, y = r too. Because h can never have multiple values for y (and because it does not mutate its environment), it is deterministic per definition 1.

Therefore, h is a pure function in E over X1 ∪ X2–and is therefore a pure function over either X1 or X2 alone. You can safely replace any instance of f or g with h: there isn’t really a point to having more than one pure function returning void, nil, etc. in your program, unless you’re doing it for static type safety.

Don’t believe me? Here’s a single Clojure function that can replace any pure function returning its first argument. Works on integers, strings, other functions… whatever types you like.

user=> (def selfie (fn [self & args] self))
#'user/selfie
user=> (selfie 3)
3
user=> (selfie "channing" "tatum")
"channing"

Returning self suggests impurity

You can write the same function more than one way. Here are two pure functions in Ruby that both return self:

def meow
  self
end

def stretch
  nil
  ENV["USER"] + " in spaaace"
  5.3 / 3
  self
end

meow is just identity–but so is stretch, and, by our proof above, so is every other pure function returning self. The only difference is that stretch has useless dead code, which any compiler, linter, or human worth their salt will strip out. Writing code like this is probably silly. You can construct weird cases (interfaces, etc) where you want a whole bunch of identity functions, or (constantly nil), etc, but I think those are pretty rare.

What about calling a function then returning self?

def foo
  enjoy("http://shirtless-channing-tatum.biz")
  self
end

There are only two cases. If enjoy is pure, so is foo, and we can replace the function by

def foo
  self
end

If enjoy is impure (and let’s face it: shirtless Channing Tatum induces side effects in most callers), then foo is also impure, and we’re back to square one: mutation.

Final thoughts

When you see functions that return void, nil, or self, ask “what is this mutating?” If you have a pure function (say, returning the number of explosions in a film) and follow the advice of returning self as much as possible, you are turning a pure function into an impure one. You have to add state and mutability to the system. You should strive to do the opposite: reduce mutation wherever possible.

I assure you, return values are OK.

Writing software can be an exercise in frustration. Useless error messages, difficult-to-reproduce bugs, missing stacktrace information, obscure functions without documentation, and unmaintained libraries all stand in our way. As software engineers, our most useful skill isn’t so much knowing how to solve a problem as knowing how to explore a problem that we haven’t seen before. Experience is important, but even experienced engineers face unfamiliar bugs every day. When a problem doesn’t bear a resemblance to anything we’ve seen before, we fall back on general cognitive strategies to explore–and ultimately solve–the problem.

There’s an excellent book by the mathematician George Polya: How to Solve It, which tries to catalogue how successful mathematicians approach unfamiliar problems. When I catch myself banging my head against a problem for more than a few minutes, I try to back up and consider his principles. Sometimes, just taking the time to slow down and reflect can get me out of a rut.

  1. Understand the problem.
  2. Devise a plan.
  3. Carry out the plan.
  4. Look back.

Seems easy enough, right? Let’s go a little deeper.

Understanding the problem

Well obviously there’s a problem, right? The program failed to compile, or a test spat out bizarre numbers, or you hit an unexpected exception. But try to dig a little deeper than that. Just having a careful description of the problem can make the solution obvious.

Our audit program detected that users can double-withdraw cash from their accounts.

What does your program do? Chances are your program is large and complex, so try to isolate the problem as much as possible. Find preconditions where the error holds.

The problem occurs after multiple transfers between accounts.

Identify specific lines of code from the stacktrace that are involved, specific data that’s being passed around. Can you find a particular function that’s misbehaving?

The balance transfer function sometimes doesn’t increase or decrease the account values correctly.

What are that function’s inputs and outputs? Are the inputs what you expected? What did you expect the result to be, given those arguments? It’s not enough to know “it doesn’t work”–you need to know exactly what should have happened. Try to find conditions where the function works correctly, so you can map out the boundaries of the problem.

Trying to transfer $100 from A to B works as expected, as does a transfer of $50 from B to A. Running a million random transfers between accounts, sequentially, results in correct balances. The problem only seems to happen in production.

If your function–or functions it calls–uses mutable state, like an agent, atom, or ref, the value of those references matters too. This is why you should avoid mutable state wherever possible: each mutable variable introduces another dimension of possible behaviors for your program. Print out those values when they’re read, and after they’re written, to get a description of what the function is actually doing. I am a huge believer in sprinkling (prn x) throughout one’s code to print how state evolves when the program runs.

Each balance is stored in a separate atom. When two transfers happen at the same time involving the same accounts, the new value of one or both atoms may not reflect the transfer correctly.
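Here’s a hypothetical sketch of that scenario, with the balances in two atoms and prn calls sprinkled in to watch the state evolve. Each write is atomic on its own, but the read and the two writes are not jointly atomic:

(def a (atom 10))
(def b (atom 10))

(defn transfer!
  "Moves amount from a to b. Two concurrent transfers can interleave
  between the reads and the writes, losing money."
  [amount]
  (let [av @a
        bv @b]
    (prn :read av bv)
    (reset! a (- av amount))
    (reset! b (+ bv amount))
    (prn :wrote @a @b)))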

Look for invariants: properties that should always be true of a program. Devise a test to look for where those invariants are broken. Consider each individual step of the program: does it preserve all the invariants you need? If it doesn’t, what ensures those invariants are restored correctly?

The total amount of money in the system should be constant–but sometimes changes!
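We might encode that invariant directly as a test; a sketch, assuming the atoms and transfer! from above:

(require '[clojure.test :refer [deftest is]])

(deftest conservation-of-money
  ; Run 100 transfers concurrently, wait for them all, then check
  ; that the total across both accounts is unchanged.
  (let [total (+ @a @b)
        txns  (doall (repeatedly 100 #(future (transfer! 1))))]
    (doseq [t txns] @t)
    (is (= total (+ @a @b)))))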

Draw diagrams, and invent a notation to talk about the problem. If you’re accessing fields in a vector, try drawing the vector as a set of boxes, and drawing the fields it accesses, step by step on paper. If you’re manipulating a tree, draw one! Figure out a way to write down the state of the system: in letters, numbers, arrows, graphs, whatever you can dream up.

Transferring $5 from A to B in transaction 1, and $5 from B to A in transaction 2:

Transaction  |  A  |  B
-------------+-----+-----
txn1 read    | 10  | 10   ; Transaction 1 sees 10, 10
txn1 write A |  5  | 10   ; A and B now out-of-sync
txn2 read    |  5  | 10   ; Transaction 2 sees 5, 10
txn1 write B |  5  | 15   ; Transaction 1 completes
txn2 write A | 10  | 15   ; Transaction 2 writes based on out-of-sync read
txn2 write B | 10  |  5   ; Should have been 10, 10!

This doesn’t solve the problem, but helps us explore the problem in depth. Sometimes this makes the solution obvious–other times, we’re just left with a pile of disjoint facts. Even if things look jumbled-up and confusing, don’t despair! Exploring gives the brain the pieces; it’ll link them together over time.

Armed with a detailed description of the problem, we’re much better equipped to solve it.

Devise a plan

Our brains are excellent pattern-matchers, but not that great at tracking abstract logical operations. Try changing your viewpoint: rotating the problem into a representation that’s a little more tractable for your mind. Is there a similar problem you’ve seen in the past? Is this a well-known problem?

Make sure you know how to check the solution. With the problem isolated to a single function, we can write a test case that verifies the account balances are correct. Then we can experiment freely, and have some confidence that we’ve actually found a solution.

Can you solve a related problem? If only concurrent transfers trigger the problem, could we solve the issue by ensuring transactions never take place concurrently–e.g. by wrapping the operation in a lock? Could we solve it by logging all transactions, and replaying the log? Is there a simpler variant of the problem that might be tractable–maybe one that always overcounts, but never undercounts?
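Serializing transfers through a lock is the bluntest version of that idea; a sketch, building on the hypothetical transfer! from earlier:

(def transfer-lock (Object.))

(defn transfer-serialized!
  "Only one transfer may run at a time, so reads and writes
  can no longer interleave."
  [amount]
  (locking transfer-lock
    (transfer! amount)))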

Consider your assumptions. We rely on layers of abstraction in writing software–that changing a variable is atomic, that lexical variables don’t change, that adding 1 and 1 always gives 2. Sometimes, parts of the computer fail to guarantee those abstractions hold. The CPU might–very rarely–fail to divide numbers correctly. A library might, for supposedly valid input, spit out a bad result. A numeric algorithm might fail to converge, and spit out wrong numbers. To avoid questioning everything, start in your own code, and work your way down to the assumptions themselves. See if you can devise tests that check the language or library is behaving as you expect.

Can you avoid solving the problem altogether? Is there a library, database, or language feature that does transaction management for us? Is integrating that library worth the reduced complexity in our application?
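Clojure itself ships such a feature: refs and dosync coordinate changes to several identities in one STM transaction. A sketch of the transfer rebuilt on refs instead of atoms:

(def a (ref 10))
(def b (ref 10))

(defn transfer!
  "Moves amount from a to b. The STM retries the whole transaction
  if either ref changes underneath us, so the balances stay in sync."
  [amount]
  (dosync
    (alter a - amount)
    (alter b + amount)))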

We’re not mathematicians; we’re engineers. Part theorist, yes, but also part mechanic. Some problems take a more abstract approach, and others are better approached by tapping it with a wrench and checking the service manual. If other people have solved your problem already, using their solution can be much simpler than devising your own.

Can you think of a way to get more diagnostic information? Perhaps we could log more data from the functions that are misbehaving, or find a way to dump and replay transactions from the live program. Some problems disappear when instrumented; these are the hardest to solve, but also the most rewarding.

Combine key phrases in a Google search: the name of the library you’re using, the type of exception thrown, any error codes or log messages. Often you’ll find a StackOverflow result, a mailing list post, or a Github issue that describes your problem. This works well when you know the technical terms for your problem–in our case, that we’re performing an atomic, transactional transfer between two variables. Sometimes, though, you don’t know the established names for your problem, and have to resort to blind queries like “variables out of sync” or “overwritten data”–which are much more difficult.

When you get stuck exploring on your own, try asking for help. Collect your description of the problem, the steps you took, and what you expected the program to do. Include any stacktraces or error messages, log files, and the smallest section of source code required to reproduce the problem. Also include the versions of software used–in Clojure, typically the JVM version (java -version), Clojure version (project.clj), and any other relevant library versions.

If the project has a Github page or public issue tracker, like Jira, you can try filing an issue there. Here’s a particularly well-written issue filed by a user on one of my projects. Note that this user included installation instructions, the command they ran, and the stacktrace it printed. The more specific a description you provide, the easier it is for someone else to understand your problem and help!

Sometimes you need to talk through a problem interactively. For that, I prefer IRC–many projects have a channel on the Freenode IRC network where you can ask basic questions. Remember to be respectful of the channel’s time; there may be hundreds of users present, and they have to sort through everything you write. Paste your problem description into a pastebin like Gist, then mention the link in IRC with a short–say a few sentences–description of the problem. I try asking in a channel devoted to a specific library or program first, then back off to a more general channel, like #clojure. There’s no need to ask “Can I ask a question” first–just jump in.

Since the transactional problem we’ve been exploring seems like a general issue with atoms, I might ask in #clojure:

aphyr > Hi! Does anyone know the right way to change multiple atoms at the same time?
aphyr > This function and test case (http://gist.github.com/...) seems to double- or under-count when invoked concurrently.

Finally, you can join the project’s email list, and ask your question there. Turnaround times are longer, but you’ll often find a more in-depth response to your question via email. This applies especially if you and the maintainer are in different time zones, or if they’re busy with life. You can also ask specific problems on StackOverflow or other message boards; users there can be incredibly helpful.

Remember, other engineers are taking time away from their work, family, friends, and hobbies to help you. It’s always polite to give them time to answer first–they may have other priorities. A sincere thank-you is always appreciated–as is paying it forward by answering other users' questions on the list or channel!

Dealing with abuse

Sadly, some women, LGBT people, and so on experience harassment on IRC or in other discussion circles. They may be asked inappropriate personal questions, insulted, threatened, assumed to be straight, to be a man, and so on. Sometimes other users will attack questioners for inexperience. Exclusion can be overt (“Read the fucking docs, faggot!”) or more subtle (“Hey dudes, what’s up?”). It only takes one hurtful experience like this to sour someone on an entire community.

If this happens to you, place your own well-being first. You are not obligated to fix anyone else’s problems, or to remain in a social context that makes you uncomfortable.

That said, be aware the other people in a channel may not share your culture. English may not be their main language, or they may have said something hurtful without realizing its impact. Explaining how the comment made you feel can jar a well-meaning but unaware person into reconsidering their actions.

Other times, people are just mean–and it only takes one to ruin everybody’s day. When this happens, you can appeal to a moderator. On IRC, moderators are sometimes identified by an @ sign in front of their name; on forums, they may have a special mark on their username or profile. Large projects may have an official policy for reporting abuse on their website or in the channel topic. If there’s no policy, try asking whoever seems in charge for help. Most projects have a primary maintainer or community manager with the power to mute or ban malicious users.

Again, these ways of dealing with abuse are optional. You have no responsibility to provide others with endless patience, and it is not your responsibility to fix a toxic culture. You can always log off and try something else. There are many communities which will welcome and support you–it may just take a few tries to find the right fit.

If you don’t find community, you can build it. Starting your own IRC channel, mailing list, or discussion group with a few friends can be a great way to help each other learn in a supportive environment. And if trolls ever come calling, you’ll be able to ban them personally.

Now, back to problem-solving.

Execute the plan

Sometimes we can make a quick fix in the codebase, test it by hand, and move on. But for more serious problems, we’ll need a more involved process. I always try to get a reproducible test suite–one that runs in a matter of seconds–so that I can continually check my work.

Persist. Many problems require grinding away for some time. Mix blind experimentation with sitting back and planning. Periodically re-evaluate your work–have you made progress? Identified a sub-problem that can be solved independently? Developed a new notation?

If you get stuck, try a new tack. Save your approach as a comment or using git stash, and start fresh. Maybe using a different concurrency primitive is in order, or rephrasing the data structure entirely. Take a reading break and review the documentation for the library you’re trying to use. Read the source code for the functions you’re calling–even if you don’t understand exactly what it does, it might give you clues to how things work under the hood.

Bounce your problem off a friend. Grab a sheet of paper or whiteboard, describe the problem, and work through your thinking with that person. Their understanding of the problem might be totally off-base, but can still give you valuable insight. Maybe they know exactly what the problem is, and can point you to a solution in thirty seconds!

Finally, take a break. Go home. Go for a walk. Lift heavy, run hard, space out, drink with your friends, practice music, read a book. Just before sleep, go over the problem once more in your head; I often wake up with a new algorithm or new questions burning to get out. Your unconscious mind can come up with unexpected insights if given time away from the problem!

Some folks swear by time in the shower, others by hiking, or with pen and paper in a hammock. Find what works for you! The important thing seems to be giving yourself time away from struggling with the problem.

Look back

Chances are you’ll know as soon as your solution works. The program compiles, transactions generate the correct amounts, etc. Now’s an important time to solidify your work.

Bolster your tests. You may have made the problem less likely, but not actually solved it. Try a more aggressive, randomized test; one that runs for longer, that generates a broader class of input. Try it on a copy of the production workload before deploying your change.

Identify why the new system works. Pasting something in from StackOverflow may get you through the day, but won’t help you solve similar problems in the future. Try to really understand why the program went wrong, and how the new pieces work together to prevent the problem. Is there a more general underlying problem? Could you generalize your technique to solve a related problem? If you’ll encounter this type of issue frequently, could you build a function or library to help build other solutions?

Document the solution. Write down your description of the problem, and why your changes fix it, as comments in the source code. Use that same description of the solution in your commit message, or attach it as a comment to the resources you used online, so that other people can come to the same understanding.

Debugging Clojure

With these general strategies in mind, I’d like to talk specifically about debugging Clojure code–especially understanding its stacktraces. Consider this simple program for baking cakes:

(ns scratch.debugging)

(defn bake
  "Bakes a cake for a certain amount of time, returning a cake with
  a new :tastiness level."
  [pie temp time]
  (assoc pie :tastiness
         (condp (* temp time) <
           400 :burned
           350 :perfect
           300 :soggy)))

And in the REPL

user=> (bake {:flavor :blackberry} 375 10.25)

ClassCastException java.lang.Double cannot be cast to clojure.lang.IFn  scratch.debugging/bake (debugging.clj:8)

This is not particularly helpful. Let’s print a full stacktrace using pst:

user=> (pst)
ClassCastException java.lang.Double cannot be cast to clojure.lang.IFn
    scratch.debugging/bake (debugging.clj:8)
    user/eval1223 (form-init4495957503656407289.clj:1)
    clojure.lang.Compiler.eval (Compiler.java:6619)
    clojure.lang.Compiler.eval (Compiler.java:6582)
    clojure.core/eval (core.clj:2852)
    clojure.main/repl/read-eval-print--6588/fn--6591 (main.clj:259)
    clojure.main/repl/read-eval-print--6588 (main.clj:259)
    clojure.main/repl/fn--6597 (main.clj:277)
    clojure.main/repl (main.clj:277)
    clojure.tools.nrepl.middleware.interruptible-eval/evaluate/fn--591 (interruptible_eval.clj:56)
    clojure.core/apply (core.clj:617)
    clojure.core/with-bindings* (core.clj:1788)

The first line tells us the type of the error: a ClassCastException. Then there’s some explanatory text: we can’t cast a java.lang.Double to a clojure.lang.IFn. The indented lines show the functions that led to the error. The first line is the deepest function, where the error actually occurred: the bake function in the scratch.debugging namespace. In parentheses is the file name (debugging.clj) and line number (8) from the code that caused the error. Each following line shows the function that called the previous line. In the REPL, our code is invoked from a special function compiled by the REPL itself–with an automatically generated name like user/eval1223, and that function is invoked by the Clojure compiler, and the REPL tooling. Once we see something like Compiler.eval at the repl, we can generally skip the rest.

As a general rule, we want to look at the deepest (earliest) point in the stacktrace that we wrote. Sometimes an error will arise from deep within a library or Clojure itself–but it was probably invoked by our code somewhere. We’ll skim down the lines until we find our namespace, and start our investigation at that point.

Our case is simple: bake, on line 8 of debugging.clj, seems to be the culprit.

(condp (* temp time) <

Now let’s consider the error itself: ClassCastException: java.lang.Double cannot be cast to clojure.lang.IFn. This implies we had a Double and tried to cast it to an IFn–but what does “cast” mean? For that matter, what’s a Double, or an IFn?

A quick google search for java.lang.Double reveals that it’s a class (a Java type) with some basic documentation. “The Double class wraps a value of the primitive type double in an object” is not particularly informative–but the “class hierarchy” at the top of the page shows that a Double is a kind of java.lang.Number. Let’s experiment at the REPL:

user=> (type 4)
java.lang.Long
user=> (type 4.5)
java.lang.Double

Indeed: decimal numbers in Clojure appear to be doubles. One of the expressions in that condp call was probably a decimal. At first we might suspect the literal values 300, 350, or 400–but those are Longs, not Doubles. The only Double we passed in was the time duration 10.25–which appears in condp as (* temp time). That first argument was a Double, but should have been an IFn.

What the heck is an IFn? Its source code has a comment:

IFn provides complete access to invoking any of Clojure’s API’s. You can also access any other library written in Clojure, after adding either its source or compiled form to the classpath.

So IFn has to do with invoking Clojure’s API. Ah–Fn probably stands for function–and this class is chock full of things like invoke(Object arg1, Object arg2). That suggests that IFn is about calling functions. And the I? Google suggests it’s a Java convention for an interface–whatever that is. Remember, we don’t have to understand everything–just enough to get by. There’s plenty to explore later.

Let’s check our hypothesis in the repl:

user=> (instance? clojure.lang.IFn 2.5)
false
user=> (instance? clojure.lang.IFn conj)
true
user=> (instance? clojure.lang.IFn (fn [x] (inc x)))
true

So Doubles aren’t IFns–but Clojure built-in functions, and anonymous functions, both are. Let’s double-check the docs for condp again:

user=> (doc condp)
-------------------------
clojure.core/condp
([pred expr & clauses])
Macro
  Takes a binary predicate, an expression, and a set of clauses.
  Each clause can take the form of either:

  test-expr result-expr

  test-expr :>> result-fn

  Note :>> is an ordinary keyword.

  For each clause, (pred test-expr expr) is evaluated. If it returns
  logical true, the clause is a match. If a binary clause matches, the
  result-expr is returned, if a ternary clause matches, its result-fn,
  which must be a unary function, is called with the result of the
  predicate as its argument, the result of that call being the return
  value of condp. A single default expression can follow the clauses,
  and its value will be returned if no clause matches. If no default
  expression is provided and no clause matches, an
  IllegalArgumentException is thrown.

That’s a lot to take in! No wonder we got it wrong! We’ll take it slow, and look at the arguments.

(condp (* temp time) <

Our pred was (* temp time) (a Double), and our expr was the comparison function <. For each clause, (pred test-expr expr) is evaluated, so that would expand to something like

((* temp time) 400 <)

Which evaluates to something like

(123.45 400 <)

But this isn’t a valid Lisp program! It starts with a number, not a function. We should have written (< 123.45 400). Our arguments are backwards!

(defn bake
  "Bakes a cake for a certain amount of time, returning a cake with
  a new :tastiness level."
  [pie temp time]
  (assoc pie :tastiness
         (condp < (* temp time)
           400 :burned
           350 :perfect
           300 :soggy)))

user=> (use 'scratch.debugging :reload)
nil
user=> (bake {:flavor :chocolate} 375 10.25)
{:tastiness :burned, :flavor :chocolate}
user=> (bake {:flavor :chocolate} 450 0.8)
{:tastiness :perfect, :flavor :chocolate}

Mission accomplished! We read the stacktrace as a path to a part of the program where things went wrong. We identified the deepest part of that path in our code, and looked for a problem there. We discovered that we had reversed the arguments to a function, and after some research and experimentation in the REPL, figured out the right order.

An aside on types: some languages have a stricter type system than Clojure’s, in which the types of variables are explicitly declared in the program’s source code. Those languages can detect type errors–when a variable of one type is used in place of another, incompatible, type–and offer more precise feedback. In Clojure, the compiler does not generally enforce types at compile time, which allows for significant flexibility–but requires more rigorous testing to expose these errors.

Higher order stacktraces

The stacktrace shows us a path through the program, moving downwards through functions. However, that path may not be straightforward. When data is handed off from one part of the program to another, the stacktrace may not show the origin of an error. When functions are handed off from one part of the program to another, the resulting traces can be tricky to interpret indeed.

For instance, say we wanted to make some picture frames out of wood, but didn’t know how much wood to buy. We might sketch out a program like this:

(defn perimeter
  "Given a rectangle, returns a vector of its edge lengths."
  [rect]
  [(:x rect)
   (:y rect)
   (:z rect)
   (:y rect)])

(defn frame
  "Given a mat width, and a photo rectangle, figure out the size of the
  frame required by adding the mat width around all edges of the photo."
  [mat-width rect]
  (let [margin (* 2 rect)]
    {:x (+ margin (:x rect))
     :y (+ margin (:y rect))}))

(def failure-rate
  "Sometimes the wood is knotty or we screw up a cut. We'll assume we
  need a spare segment once every 8."
  1/8)

(defn spares
  "Given a list of segments, figure out roughly how many of each distinct
  size will go bad, and emit a sequence of spare segments, assuming we
  screw up `failure-rate` of them."
  [segments]
  (->> segments
       ; Compute a map of each segment length to the number of
       ; segments we'll need of that size.
       frequencies
       ; Make a list of spares for each segment length,
       ; based on how often we think we'll screw up.
       (mapcat (fn [[segment n]]
                 (repeat (* failure-rate n) segment)))))

(def cut-size
  "How much extra wood do we need for each cut? Let's say a mitred cut
  for a 1-inch frame needs a full inch."
  1)

(defn total-wood
  "Given a mat width and a collection of photos, compute the total linear
  amount of wood we need to buy in order to make frames for each, given a
  2-inch mat."
  [mat-width photos]
  (let [segments (->> photos
                      ; Convert photos to frame dimensions
                      (map (partial frame mat-width))
                      ; Convert frames to segments
                      (mapcat perimeter))]

    ; Now, take segments
    (->> segments
         ; Add the spares
         (concat (spares segments))
         ; Include a cut between each segment
         (interpose cut-size)
         ; And sum the whole shebang.
         (reduce +))))

(->> [{:x 8  :y 10}
      {:x 10 :y 8}
      {:x 20 :y 30}]
     (total-wood 2)
     (println "total inches:"))

Running this program yields a curious stacktrace. We’ll print the full trace (not the shortened one that comes with pst) for the last exception *e with the .printStackTrace function.

user=> (.printStackTrace *e)
java.lang.ClassCastException: clojure.lang.PersistentArrayMap cannot be cast to java.lang.Number, compiling:(scratch/debugging.clj:73:23)
    at clojure.lang.Compiler.load(Compiler.java:7142)
    at clojure.lang.RT.loadResourceScript(RT.java:370)
    at clojure.lang.RT.loadResourceScript(RT.java:361)
    at clojure.lang.RT.load(RT.java:440)
    at clojure.lang.RT.load(RT.java:411)
    ...
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: clojure.lang.PersistentArrayMap cannot be cast to java.lang.Number
    at clojure.lang.Numbers.multiply(Numbers.java:146)
    at clojure.lang.Numbers.multiply(Numbers.java:3659)
    at scratch.debugging$frame.invoke(debugging.clj:26)
    at clojure.lang.AFn.applyToHelper(AFn.java:156)
    at clojure.lang.AFn.applyTo(AFn.java:144)
    at clojure.core$apply.invoke(core.clj:626)
    at clojure.core$partial$fn__4228.doInvoke(core.clj:2468)
    at clojure.lang.RestFn.invoke(RestFn.java:408)
    at clojure.core$map$fn__4245.invoke(core.clj:2557)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$map$fn__4245.invoke(core.clj:2551)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$apply.invoke(core.clj:624)
    at clojure.core$mapcat.doInvoke(core.clj:2586)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at scratch.debugging$total_wood.invoke(debugging.clj:62)
    ...

First: this trace has two parts. The top-level error (a CompilerException) appears first, and is followed by the exception that caused the CompilerException: a ClassCastException. This makes the stacktrace read somewhat out of order, since the deepest part of the trace occurs in the first line of the last exception. We read C B A then F E D. This is an old convention in the Java language, and the cause of no end of frustration.

Notice that this representation of the stacktrace is less friendly than (pst). We’re seeing the Java Virtual Machine (JVM)’s internal representation of Clojure functions, which look like clojure.core$partial$fn__4228.doInvoke. This corresponds to the namespace clojure.core, in which there is a function called partial, inside of which is an anonymous function, here named fn__4228. Calling a Clojure function is written, in the JVM, as .invoke or .doInvoke.

So: the root cause was a ClassCastException, and it tells us that Clojure expected a java.lang.Number, but found a PersistentArrayMap. We might guess that PersistentArrayMap is something to do with the map data structure, which we used in this program:

user=> (type {:x 1})
clojure.lang.PersistentArrayMap

And we’d be right. We can also tell, by reading down the stacktrace looking for our scratch.debugging namespace, where the error took place: scratch.debugging$frame, on line 26.

(let [margin (* 2 rect)]

There’s our multiplication operation *, which we might assume expands to clojure.lang.Numbers.multiply. But the path to the error is odd.

(->> photos
     ; Convert photos to frame dimensions
     (map (partial frame mat-width))

In total-wood, we call (map (partial frame mat-width) photos) right away, so we’d expect the stacktrace to go from total-wood to map to frame. But this is not what happens. Instead, total-wood invokes something called RestFn–a piece of Clojure plumbing–which in turn calls mapcat.

    at clojure.core$mapcat.doInvoke(core.clj:2586)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at scratch.debugging$total_wood.invoke(debugging.clj:62)

Why doesn’t total-wood call map first? Well it did–but map doesn’t actually apply its function to anything in the photos vector when invoked. Instead, it returns a lazy sequence–one which applies frame only when elements are asked for.

user=> (type (map inc (range 10)))
clojure.lang.LazySeq
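You can watch the deferral happen by printing from inside the mapped function. Nothing prints when the sequence is defined; note also that Clojure realizes chunked sequences up to 32 elements at a time, so a single first may realize several elements at once:

user=> (def xs (map (fn [x] (prn :realizing x) (inc x)) (range 3)))
#'user/xs
user=> (first xs)
:realizing 0
:realizing 1
:realizing 2
1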

Inside each LazySeq is a box containing a function. When you ask a LazySeq for its first value, it calls that function to return a new sequence–and that’s when frame gets invoked. What we’re seeing in this stacktrace is the LazySeq internal machinery at work–mapcat asks it for a value, and the LazySeq asks map to generate that value.

    at clojure.core$partial$fn__4228.doInvoke(core.clj:2468)
    at clojure.lang.RestFn.invoke(RestFn.java:408)
    at clojure.core$map$fn__4245.invoke(core.clj:2557)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$map$fn__4245.invoke(core.clj:2551)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:484)
    at clojure.core$seq.invoke(core.clj:133)
    at clojure.core$apply.invoke(core.clj:624)
    at clojure.core$mapcat.doInvoke(core.clj:2586)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at scratch.debugging$total_wood.invoke(debugging.clj:62)

In fact we pass through map’s laziness twice here: a quick peek at (source mapcat) shows that it expands into a map call itself, and then there’s a second map: the one we created in total-wood. Then an odd thing happens–we hit something called clojure.core$partial$fn__4228.

(map (partial frame mat-width) photos)

The frame function takes two arguments: a mat width and a photo. We wanted a function that takes just one argument: a photo. (partial frame mat-width) took mat-width and generated a new function which takes one arg–call it photo–and calls (frame mat-width photo). That automatically generated function, returned by partial, is what map uses to generate new elements of its sequence on demand.

user=> (partial + 1)
#<core$partial$fn__4228 clojure.core$partial$fn__4228@243634f2>
user=> ((partial + 1) 4)
5

That’s why we see control flow through clojure.core$partial$fn__4228 (an anonymous function defined inside clojure.core/partial) on the way to frame.

Caused by: java.lang.ClassCastException: clojure.lang.PersistentArrayMap cannot be cast to java.lang.Number
    at clojure.lang.Numbers.multiply(Numbers.java:146)
    at clojure.lang.Numbers.multiply(Numbers.java:3659)
    at scratch.debugging$frame.invoke(debugging.clj:26)
    at clojure.lang.AFn.applyToHelper(AFn.java:156)
    at clojure.lang.AFn.applyTo(AFn.java:144)
    at clojure.core$apply.invoke(core.clj:626)
    at clojure.core$partial$fn__4228.doInvoke(core.clj:2468)

And there’s our suspect! scratch.debugging/frame, at line 26. To return to that line again:

(let [margin (* 2 rect)]

* is a multiplication, and 2 is obviously a number, but rect is a map here. Aha! We meant to multiply the mat-width by two, not the rectangle.

(defn frame
  "Given a mat width, and a photo rectangle, figure out the size of the
  frame required by adding the mat width around all edges of the photo."
  [mat-width rect]
  (let [margin (* 2 mat-width)]
    {:x (+ margin (:x rect))
     :y (+ margin (:y rect))}))

I believe we’ve fixed the bug, then. Let’s give it a shot!

The unbearable lightness of nil

There’s one more bug lurking in this program. This one’s stacktrace is short.

user=> (use 'scratch.debugging :reload)

CompilerException java.lang.NullPointerException, compiling:(scratch/debugging.clj:73:23)
user=> (pst)
CompilerException java.lang.NullPointerException, compiling:(scratch/debugging.clj:73:23)
    clojure.lang.Compiler.load (Compiler.java:7142)
    clojure.lang.RT.loadResourceScript (RT.java:370)
    clojure.lang.RT.loadResourceScript (RT.java:361)
    clojure.lang.RT.load (RT.java:440)
    clojure.lang.RT.load (RT.java:411)
    clojure.core/load/fn--5066 (core.clj:5641)
    clojure.core/load (core.clj:5640)
    clojure.core/load-one (core.clj:5446)
    clojure.core/load-lib/fn--5015 (core.clj:5486)
    clojure.core/load-lib (core.clj:5485)
    clojure.core/apply (core.clj:626)
    clojure.core/load-libs (core.clj:5524)
Caused by: NullPointerException
    clojure.lang.Numbers.ops (Numbers.java:961)
    clojure.lang.Numbers.add (Numbers.java:126)
    clojure.core/+ (core.clj:951)
    clojure.core.protocols/fn--6086 (protocols.clj:143)
    clojure.core.protocols/fn--6057/G--6052--6066 (protocols.clj:19)
    clojure.core.protocols/seq-reduce (protocols.clj:27)
    clojure.core.protocols/fn--6078 (protocols.clj:53)
    clojure.core.protocols/fn--6031/G--6026--6044 (protocols.clj:13)
    clojure.core/reduce (core.clj:6287)
    scratch.debugging/total-wood (debugging.clj:69)
    scratch.debugging/eval1560 (debugging.clj:81)
    clojure.lang.Compiler.eval (Compiler.java:6703)

On line 69, total-wood calls reduce, which dives through a series of functions from clojure.core.protocols before emerging in +: the function we passed to reduce. Reduce is trying to combine two elements from its collection of wood segments using +, but one of them was nil. Clojure calls this a NullPointerException. In total-wood, we constructed the sequence of segments this way:

(let [segments (->> photos
                    ; Convert photos to frame dimensions
                    (map (partial frame mat-width))
                    ; Convert frames to segments
                    (mapcat perimeter))]

  ; Now, take segments
  (->> segments
       ; Add the spares
       (concat (spares segments))
       ; Include a cut between each segment
       (interpose cut-size)
       ; And sum the whole shebang.
       (reduce +))))

Where did the nil value come from? The stacktrace doesn’t say, because the sequence reduce is traversing didn’t have any problem producing the nil. reduce asked for a value and the sequence happily produced a nil. We only had a problem when it came time to combine the nil with the next value, using +.

A stacktrace like this is something like a murder mystery: we know the program died in the reducer, that it was shot with a +, and the bullet was a nil–but we don’t know where the bullet came from. The trail runs cold. We need more forensic information–more hints about the nil’s origin–to find the culprit.

Again, this is a class of error largely preventable with static type systems. If you have worked with a statically typed language in the past, it may be interesting to consider that almost every Clojure function behaves as if it took Option[A] and returned Option[B]: nil can flow in and out of nearly any call, usually doing something more-or-less sensible along the way. Whether the error propagates as a nil or an Option, there can be similar difficulties in localizing the original source of the problem.

Let’s try printing out the state as reduce goes along:

(->> segments
     ; Add the spares
     (concat (spares segments))
     ; Include a cut between each segment
     (interpose cut-size)
     ; And sum the whole shebang.
     (reduce (fn [acc x]
               (prn acc x)
               (+ acc x))))))

user=> (use 'scratch.debugging :reload)
12 1
13 14
27 1
28 nil

CompilerException java.lang.NullPointerException, compiling:(scratch/debugging.clj:73:56)

Not every value is nil! There’s a 14 there which looks like a plausible segment for a frame, and two one-inch buffers from cut-size. We can rule out interpose because it inserts a 1 every time, and that 1 reduces correctly. But where’s that nil coming from? Is it from segments or (spares segments)?

(let [segments (->> photos
                    ; Convert photos to frame dimensions
                    (map (partial frame mat-width))
                    ; Convert frames to segments
                    (mapcat perimeter))]
  (prn :segments segments)

user=> (use 'scratch.debugging :reload)
:segments (12 14 nil 14 14 12 nil 12 24 34 nil 34)

It is present in segments. Let’s trace it backwards through the sequence’s creation. It’d be handy to have a function like prn that returned its input, so we could spy on values as they flowed through the ->> macro.

(defn spy
  [& args]
  (apply prn args)
  (last args))

(let [segments (->> photos
                    ; Convert photos to frame dimensions
                    (map (partial frame mat-width))
                    (spy :frames)
                    ; Convert frames to segments
                    (mapcat perimeter))]

user=> (use 'scratch.debugging :reload)
:frames ({:x 12, :y 14} {:x 14, :y 12} {:x 24, :y 34})
:segments (12 14 nil 14 14 12 nil 12 24 34 nil 34)

Ah! So the frames are intact, but the perimeters are bad. Let’s check the perimeter function:

(defn perimeter
  "Given a rectangle, returns a vector of its edge lengths."
  [rect]
  [(:x rect)
   (:y rect)
   (:z rect)
   (:y rect)])

Spot the typo? We wrote :z instead of :x. Since the frame didn’t have a :z field, it returned nil! That’s the origin of our NullPointerException. The fix is a single character:
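(defn perimeter
  "Given a rectangle, returns a vector of its edge lengths."
  [rect]
  [(:x rect)
   (:y rect)
   (:x rect)
   (:y rect)])

With the bug fixed, we can re-run and find: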

user=> (use 'scratch.debugging :reload)
total inches: 319

Whallah!

Recap

As we solve more and more problems, we get faster at debugging–at skipping over irrelevant log data, figuring out exactly what input was at fault, knowing what terms to search for, and developing a network of peers and mentors to ask for help. But when we encounter unexpected bugs, it can help to fall back on a family of problem-solving tactics.

We explore the problem thoroughly, localizing it to a particular function, variable, or set of inputs. We identify the boundaries of the problem, carving away parts of the system that work as expected. We develop new notation, maps, and diagrams of the problem space, precisely characterizing it in a variety of modes.

With the problem identified, we search for extant solutions–or related problems others have solved in the past. We trawl through issue trackers, mailing list posts, blogs, and forums like StackOverflow–or, for more theoretical problems, academic papers, MathWorld, and Wikipedia. If searching reveals nothing, we try rephrasing the problem, relaxing the constraints, adding debugging statements, and solving smaller subproblems. When all else fails, we ask for help from our peers, or from the community in IRC or on mailing lists, or just take a break.

We learned to explore Clojure stacktraces as a trail into our programs, leading to the place where an error occurred. But not all paths are linear, and we saw how lazy operations and higher-order functions create inversions and intermediate layers in the stacktrace. Then we learned how to debug values that were distant from the trace, by adding logging statements and working our way closer to the origin.

Programming languages and us, their users, are engaged in a continual dialogue. We may speak more formally, verbosely, with many types and defensive assertions–or we may speak quickly, generally, in fuzzy terms. The more precise we are with the specifications of our program’s types, the more the program can assist us when things go wrong. Conversely, those specifications harden our programs into strong but rigid forms, and rigid structures are harder to bend into new shapes.

In Clojure we strike a more dynamic balance: we speak in generalities, but we pay for that flexibility. Our errors are harder to trace to their origins. While the Clojure compiler can warn us of some errors, like mis-spelled variable names, it cannot (without a library like core.typed) tell us when we have incorrectly assumed an object will be of a certain type. Even very rigid languages, like Haskell, cannot identify some errors, like reversing the arguments to a subtraction function. Some tests are always necessary, though types are a huge boon.

No matter what language we write in, we use a balance of types and tests to validate our assumptions, both when the program is compiled and when it is run.

The errors that arise in compilation or runtime aren’t rebukes so much as hints. Don’t despair! They point the way towards understanding one’s program in more detail–though the errors may be cryptic. Over time we get better at reading our language’s errors and making our programs more robust.

Earlier versions of Jepsen found glaring inconsistencies, but missed subtle ones. In particular, Jepsen was not well equipped to distinguish linearizable systems from sequentially or causally consistent ones. When people asked me to analyze systems which claimed to be linearizable, Jepsen could rule out obvious classes of behavior, like dropping writes, but couldn’t tell us much more than that. Since users and vendors are starting to rely on Jepsen as a basic check on correctness, it’s important that Jepsen be able to identify true linearization errors.

[Figure: etcd-jepsen-set-test.jpg]

To understand why Jepsen was not a complete test of linearizability, we have to understand the structure of its original tests. Jepsen assumed, originally, that every system could be modeled as a set of integers. Each client would gradually add a sequence of integers–disjoint from every other client’s–to the database’s set; then perform a final read. If any elements which had supposedly succeeded were missing, we know the system dropped data.

The original Jepsen tests were designed for AP systems, like Riak, without a linear order; using a set is appropriate because its contents are fundamentally unordered, and because addition to the set is associative and idempotent. To test a linearizable system, we implement set addition by performing a compare-and-set, replacing the old set with the current value plus the number being written. If a given CAS was successful, then that element should appear in the final read.
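Set addition via compare-and-set might look like this sketch, modeling the database’s register as a Clojure atom (the real tests, of course, spoke to a database over the network):

(def db (atom #{}))

(defn cas-add!
  "Reads the current set, then tries to swap in the set plus e.
  Returns true if the CAS succeeded, false if we raced another writer."
  [e]
  (let [old @db]
    (compare-and-set! db old (conj old e))))

Every element for which cas-add! returned true should appear in the final read.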

This does verify sequential consistency, and to varying degrees linearizability, but has limited power. The database may choose, for instance, to delay the visibility of changes, so long as they become visible before the final read. We can’t test operations other than a CAS. We can’t, for instance, test deletions. It’s also not clear how to verify systems like mutexes, queues, or semaphores.

Furthermore, if a test does fail, it’s not clear why. A missing number from the final set might be caused by a problem with that particular CAS–or a CAS executed hours later which happened to destroy the effects of a preceding write. Ideally, we’d like to know exactly why the system failed to linearize. With this in mind, I set out to design a linearizability checker suitable for analyzing both formal models and real software with no internal visibility.

Knossos

In the introduction to Knossos, I couched Knossos as a model checker, motivated by a particular algorithm discussed on the Redis mailing list. This was slightly disingenuous: in fact, I designed Knossos as a model checker for any type of history, including those recorded from real databases. This means that Jepsen can generate a series of random operations, execute them against a database, and verify that the resulting history is valid with respect to some model.

Given a sequence of operations that a database might go through–say, two processes attempting to acquire a mutex:

{:process 1, :type :invoke, :f :acquire, :value nil}
{:process 2, :type :invoke, :f :acquire, :value nil}
{:process 1, :type :ok,     :f :acquire, :value nil}
{:process 2, :type :fail,   :f :acquire, :value "lock failed; already held"}

… and a singlethreaded model of the system, like

(defrecord Mutex [locked?]
  Model
  (step [mutex op]
    (condp = (:f op)
      :acquire (if locked?
                 (inconsistent "already held")
                 (Mutex. true))
      :release (if locked?
                 (Mutex. false)
                 (inconsistent "not held")))))

… Knossos can identify if the given concurrent history linearizes–that is, whether there exists some equivalent history in which every operation appears to take place atomically, in a well-defined order, between the invocation and completion times.

[Figure: jepsen-model.jpg]

Linearizability, like sequential and serializable consistency, requires that every operation take place in some specific order; that there appears to be only one “true” state for the system at any given time. Therefore we can model any linearizable system as a single state, plus a function, called step, which applies an operation to that state and returns a new state.

In Clojure, we represent this model with a simple protocol, called Model, which defines a function (step current-model-state operation), and returns the new state. In our mutex example, there are four possibilities, depending on whether the operation is :acquire or :release, and whether the state locked? is true. If we try to lock an unlocked mutex, we return a new Mutex with the state true. If we try to lock a mutex which is already locked, we return a special kind of state: an inconsistent state.
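The protocol itself is tiny; a sketch of its shape (knossos.model also provides the inconsistent constructor used above):

(defprotocol Model
  (step [model op]
    "Applies op to the model, returning the resulting state."))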

Inconsistent states allow us to verify that a singlethreaded history is valid. We simply (reduce step initial-state operations); if the result is inconsistent, we know that sequence of operations was prohibited by the model. The model formally expresses our definition of the allowable causal histories.
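As a sketch, a single-threaded checker over the Mutex model above might look like this, assuming an inconsistent? predicate that recognizes those special states:

(defn valid-history?
  "Steps the model through ops in order. The history is valid iff
  we never reach an inconsistent state."
  [model ops]
  (->> ops
       (reduce (fn [state op]
                 (let [state' (step state op)]
                   (if (inconsistent? state')
                     (reduced state') ; bail out early: invalid history
                     state')))
               model)
       inconsistent?
       not))

; (valid-history? (Mutex. false) [{:f :acquire} {:f :release}])
; => true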

The plot thickens

[Figure: jepsen-histories.jpg]

But we don’t have a singlethreaded history to test. We have a multithreaded history, with any number of operations in play concurrently. Each client is invoking, waiting for, and then discovering the result of its operations. Our history contains pairs of :invoke, :ok messages, when an operation succeeds, or :invoke, :fail when the operation is known to not have taken place, or :invoke, :info, when we simply don’t know what happened.

If an operation times out, or the server returns an indeterminate response, we may never find out whether the operation really took place. In the history to the right, process 5 has hung and will never recover. Its operation could take place at any time, even years into the future. In general, a hung process is concurrent with every other subsequent operation.

[Figure: jepsen-invalid-history.jpg]
[Figure: jepsen-valid-history.jpg]

Given a model, we know how to test if a particular sequence of operations is valid. But in a concurrent history, the ordering is ambiguous; each operation could take place at any time between its invocation and completion. One possible interleaving might be read 1, write 1, read 2, write 2, which is obviously incorrect. On the other hand, we could evaluate write 1, read 1, write 2, read 2 instead–which is a valid history for a register. This history is linearizable–but in order to prove that fact, we have to find a particular valid order.

Imagine something like a game of hopscotch: one must land on each cell in turn, always moving from left to right, finding a path in which the model’s constraints hold. Where there are many cells at the same time, finding a path becomes especially difficult. We must consider every possible permutation of those concurrent cells, which is O(n!). That’s the kind of hopscotch that, even when played by computer, makes one re-evaluate one’s life choices.

So what do we do, presented with a huge space of possibilities?

Exploit degeneracy

I’m a degenerate sort of person, so my first inclination is to look for symmetries in the state space. The key observation to make is that whether a given operation is valid or not depends solely on the current state of the model, not its history.

step(state, op) -> state'
[Figure: jepsen-degeneracy.jpg]

It doesn’t matter how we got to the state; if you give me two registers containing the value 2, and ask me to apply the same operation to both, we only need to check one of the registers, because the results will be equivalent!

Unlike a formal model-checker or proof assistant, Knossos doesn’t know the structure of the system it’s analyzing; it can’t perform symmetry reduction based on the definition of step. What we can do, however, is look for cases where we come back to the same state and the same future series of operations–and when that occurs, drop all but one of the cases immediately. This turns out to be equivalent to a certain class of symmetry reduction: in particular, we can compact interchangeable orders like concurrent reads, or writes that lead to the same value. We keep a cache of visited worlds and avoid exploring any world that has been seen before.

Laziness

[Figure: jepsen-laziness.jpg]

Remember, we’re looking for any linearization, not all of them. If we can find a shortcut by not evaluating some highly-branching history, by not taking some expensive path, we can skip huge parts of the search. Like a lightning bolt feeling its way down the path of least resistance, we evaluate only those paths which seem easiest–coming back to the hard ones later. If the history is truly not linearizable, we’re forced to return to those expensive branches and check them, but if the history is valid, we can finish as soon as a single path is found.

Lazy evaluation is all about making control flow explicit instead of implicit. We use a data structure to describe where to explore next, instead of following the normal program flow. In Knossos, we represent the exploration of a particular order of operations as a world, which sits at some index along the multithreaded history. Each world carries with it a fixed history–the specific order of operations that occurred in that possible universe. The fixed history leads to a current model state. Finally, each world has a set of pending operations: operations that have been invoked, but have not yet taken effect.

For example, a world might have a fixed history of lock, unlock, lock, leading to a model state where locked is true, and a second lock attempt might be pending but not yet applied. An unlock operation could arrive and allow the pending lock to take place.
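
As a sketch, such a world might be represented as a plain Clojure map. The field names here are illustrative, not Knossos' exact definitions:

{:fixed-history [{:f :acquire} {:f :release} {:f :acquire}] ; the order chosen so far
 :model         (Mutex. true)                ; the state that history leads to
 :pending       #{{:f :acquire, :process 2}} ; invoked, not yet applied
 :index         3}                           ; position in the multithreaded history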

By representing the entire state of the computation as a data structure, we can write a single function that takes a world and explores it, returning a set of potential future worlds. We can explore those worlds in parallel.

Parallelization

[Figure: jepsen-parallelize.jpg]

Because our states are immutable representations of the computation, and the function we use to explore any given state is pure and deterministic, we can trivially parallelize the exploration process. Early versions of Knossos reduced over each operation in the history, applying that operation to every outstanding world by whacking it with a parallel map.

This parallelization strategy has a serious drawback, though: by exploring the state space one index at a time, we effectively perform a breadth-first search. We want to take shortcuts through the state space, running many searches at once. We don’t just want depth-first, either; instead, we want to explore those worlds which have the lowest branching factor, because those worlds are the cheapest to explore.

So instead of exploring the history one operation at a time, we spawn lots of threads and have each consume from a priority queue of worlds, ranked by how awful those worlds are to explore. As each explorer thread discovers new consequent worlds, it inserts them back into the pool. If any thread finds a world that encompasses every operation in the history, we’ve demonstrated the history is linearizable.
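
Here's a sketch of such a queue, assuming worlds are maps like the one sketched earlier, ranked by the number of pending operations--a rough proxy for branching factor. This is not Knossos' exact implementation:

(import 'java.util.concurrent.PriorityBlockingQueue)

(def explore-queue
  ; Worlds to explore, cheapest (fewest pending operations) first.
  (PriorityBlockingQueue. 11
    (comparator (fn [a b]
                  (< (count (:pending a))
                     (count (:pending b)))))))

(.offer explore-queue {:index 3, :pending #{{:f :acquire}}})
(.offer explore-queue {:index 1, :pending #{{:f :acquire} {:f :release}}})
(.poll explore-queue) ; => the world with only one pending operation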

We pay some cost in synchronization: queues aren’t cheap, and java.util.concurrent.PriorityBlockingQueue has some particularly nasty contention costs for both enqueues and dequeues. Luckily, the queue will usually contain plenty of elements, so we can stripe it into several subqueues, each with thread affinity. Affinity for each queue reduces lock contention, which dramatically reduces the time threads spend waiting to enqueue or dequeue worlds. When a thread exhausts its local queue, it steals worlds from its neighbors.

This approach costs us some degree of memory locality: transferring records through the queue tends to push them out of the CPU cache. We can tune how far each explorer thread will take a particular world to reduce the locality cost: if work is too chunky, threads can starve awaiting worlds to explore–but if work is too fine-grained, synchronization and cache misses dominate.

Memoization

Making control flow explicit (some might even say monadic) allows us to memoize computation as well. At RICON East, in 2013, Margo Seltzer gave a phenomenal talk on automatically parallelizing singlethreaded x86 programs. She pointed out that x86 can be thought of as a very large, very complicated function that transforms a bit-vector of all the registers and all of memory into some subsequent state–depending on the instruction pointer, contents of registers, and so on. It’s a very large value, but if you compress it and make even some modest predictions, you can cache the results of computations that haven’t even happened yet, allowing the program to jump forward when it encounters a known state.

[Figure: jepsen-memoization.jpg]

This works because parallel programs usually don’t change the entire memory space; they often read and write only a small portion of memory. for(i = 0; i < 100; i++) { arr[i]++ }, for instance, independently increments each number in arr. In that sense, the memory space is degenerate outside each particular element. That degeneracy allows speculative execution to have a chance of predicting an equivalent future state of the program: we can increment each number concurrently.

In Knossos we have a similarly degenerate state space; all fixed histories may be collapsed so long as the model and pending operations are identical. We also have a speculative and lazy execution strategy: operations are simultaneously explored at various points in the multiverse. Hence we can apply a similar memoization strategy: by caching visited worlds, we can avoid exploring equivalent paths twice.

In fact we don’t even need to store the results of the exploration, simply that we have reached that world. Think of exploring a maze with several friends, all looking for a path through. When anyone reaches a dead end, they can save time for everyone by coloring in the path they took. When someone comes to a branch in the maze, they only take the paths that nobody has colored in. We simply abort any exploration of a world equivalent to one already visited. This optimization is nondeterministic but synchronization-free, allowing memoization checks to be extremely cheap. Even though cache hit rates are typically low, each hit prunes an exponential number of descendant worlds, dramatically reducing runtimes.
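
Here's a sketch of that visited-worlds cache, using a concurrent set. The key we deduplicate on--model, pending operations, and position in the history--follows the collapsing rule described above; the exact key Knossos uses may differ:

(def seen
  ; Worlds we've already visited, shared across all explorer threads.
  (java.util.concurrent.ConcurrentHashMap/newKeySet))

(defn explore-if-new!
  "Explores world w unless an equivalent world was already visited.
  .add returns false when the key is already present."
  [explore w]
  (when (.add seen (select-keys w [:model :pending :index]))
    (explore w)))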

Immutability and performance


When we explore a world, we’ll typically encounter many branching paths. Given two concurrent writes a and b, we need to explore [], [a], [b], [a b], and [b a], and in turn, each of those worlds will fork into hundreds, then thousands, then millions of consequent worlds. We have to make a lot of copies.

At this point in the essay, Haskell enthusiasts are nodding their heads sagely and muttering things about Coyoneda diffeomorphisms and trendofunctors. Haskell offers excellent support for immutable data structures and parallel execution of pure functions, which would make it an ideal choice for building this kind of checker.


But I am, sadly, not a Haskell wizard. When you get right down to it, I’m more of a Clojure Sith Lord. And as it turns out, this is a type of problem that Clojure is also well-suited for. We express the consistency model as a pure function over immutable models, and use Clojure’s immutable maps, vectors, and sets to store the state of each world, its histories, its pending operations, and so on. Forking the world into distinct paths doesn’t require copying the entire state; rather, Clojure uses a reference to the original data structure, and stores a delta on top. We can fork millions of worlds cheaply.
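
For instance, forking a world by assoc'ing in a new model doesn't copy the rest of the world: both versions share the same underlying history structure.

user=> (def w {:fixed-history [:acquire :release], :model :unlocked})
#'user/w
user=> (def w' (assoc w :model :locked))
#'user/w'
user=> (identical? (:fixed-history w) (:fixed-history w'))
true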

Because worlds are immutable, we can share them freely between threads. Because the functions that explore a world, returning subsequent possible worlds, are pure, we can explore worlds on any thread, at any time, and take advantage of memoization. But in order to execute that search process in parallel, we need that priority queue of worlds-at-the-edge: a fundamentally mutable data structure. The memoizing cache is also mutable: it must be, to share state between threads. We also need some book-keeping state: how far has the algorithm explored; have we reached the end; how large is the cache.

So as a layer atop the immutable core, we make limited use of mutable structures: a striped java.util.PriorityQueue for keeping track of which worlds are up next, a concurrent hashmap to memoize results, Clojure’s atoms for bookkeeping, and some java.util.concurrent.atomic references for primitive CAS. Because this code is wildly nondeterministic, it’s the most difficult portion of Knossos to reason about and debug–yet that nondeterminism is a critical degree of freedom for parallel execution. By broadening the space of allowable execution orders, we reduce the need for inter-core synchronization.


Reducing synchronization is especially important because while I was working on Knossos, Comcast offered me a research grant specifically for Jepsen. As one does when offered unlimited resources by a galactic empire, I thought big.

I used Comcast’s grant to build a 24-core (48 HT) Xeon with 128GB of ECC RAM, effectively demolishing the parallelism and heap barriers that limited earlier verification efforts. Extensive profiling with YourKit (another great supporter of open-source projects) helped reduce the lock and CAS contention that had limited scalability to ~4 cores; a few weeks’ work removed almost all thread stalls and improved performance by two orders of magnitude.

The result is that Knossos can check 5-process, 150–200-element histories in a matter of minutes, not days–and it can do it on 48 cores.

[Figure: cpus.png]

There are several optimizations I haven’t made yet; for instance, detecting crashed processes and optimistically inserting a world in which that crashed process' operation never takes place. However, Knossos at this stage is more than capable of detecting linearization errors in real-world histories.

Proud of this technological terror I’d constructed, I consulted the small Moff Tarkin that lives in my head on what database to test next. “You would prefer another target? An open-source target? Then name the distributed system!”


“RabbitMQ. They’re on RabbitMQ.”

Previously: Logistics

Until this point in the book, we’ve dealt primarily in specific details: what an expression is, how math works, which functions apply to different data structures, and where code lives. But programming, like speaking a language, painting landscapes, or designing turbines, is about more than the nuts and bolts of the trade. It’s knowing how to combine those parts into a cohesive whole–and this is a skill which is difficult to describe formally. In this part of the book, I’d like to work with you on an integrative tour of one particular problem: modeling a rocket in flight.

We’re going to reinforce our concrete knowledge of the standard library by using maps, sequences, and math functions together. At the same time, we’re going to practice how to represent a complex system; decomposing a problem into smaller parts, naming functions and variables, and writing tests.

So you want to go to space

First, we need a representation of a craft. The obvious properties for a rocket are its dry mass (how much it weighs without fuel), fuel mass, position, velocity, and time. We’ll create a new file in our scratch project–src/scratch/rocket.clj–to talk about spacecraft.

For starters, let’s pattern our craft after an Atlas V launch vehicle. We’ll represent everything in SI units–kilograms, meters, newtons, etc. The Atlas V carries 284,450 kg (627,105 lbs) of LOX/RP-1 fuel; subtracting that from its total mass of 334,500 kg leaves only 50,050 kg of mass which isn’t fuel. It develops 4152 kilonewtons of thrust and runs for 253 seconds, with a specific impulse (effectively, exhaust velocity) of 3.05 kilometers/sec. Real rockets develop varying amounts of thrust depending on the atmosphere, but we’ll pretend it’s constant in our simulation.

(defn atlas-v []
  {:dry-mass      50050
   :fuel-mass     284450
   :time          0
   :isp           3050
   :max-fuel-rate (/ 284450 253)
   :max-thrust    4.152e6})

How heavy is the craft?

(defn mass "The total mass of a craft." [craft] (+ (:dry-mass craft) (:fuel-mass craft)))

What about the position and velocity? We could represent them in Cartesian coordinates–x, y, and z–or we could choose spherical coordinates: a radius from the planet’s center, an angle from the pole, and an angle from 0 degrees longitude. I’ve got a hunch that spherical coordinates will be easier for position, but accelerating the craft will be simplest in x, y, and z terms. The center of the planet is a natural choice for the coordinate system’s origin (0, 0, 0). We’ll choose z along the north pole, and x and y in the plane of the equator.

Let’s define a space center where we launch from–let’s say it’s initially on the equator at y=0. To figure out the x coordinate, we’ll need to know how far the space center is from the center of the earth. The earth’s equatorial radius is ~6378 kilometers.

(def earth-equatorial-radius "Radius of the earth, in meters" 6378137)

How fast is the surface moving? Well, the earth’s day is 86,400 seconds long,

(def earth-day "Length of an earth day, in seconds." 86400)

which means a given point on the equator has to go 2 * pi * equatorial radius meters in earth-day seconds:

(def earth-equatorial-speed "How fast points on the equator move, relative to the center of the earth, in meters/sec." (/ (* 2 Math/PI earth-equatorial-radius) earth-day))

So our space center is on the equator (z=0), at y=0 by choice, which means x is the equatorial radius. Since the earth is spinning, the space center is moving at earth-equatorial-speed in the y direction–and not changing at all in x or z.

(def initial-space-center
  "The initial position and velocity of the launch facility"
  {:time     0
   :position {:x earth-equatorial-radius
              :y 0
              :z 0}
   :velocity {:x 0
              :y earth-equatorial-speed
              :z 0}})

:position and :velocity are both vectors, in the sense that they describe a position, or a direction, in terms of x, y, and z components. This is a different kind of vector than a Clojure vector, like [1 2 3]. We’re actually representing these logical vectors as Clojure maps, with :x, :y, and :z keys, corresponding to the distance along the x, y, and z directions, from the center of the earth. Throughout this chapter, I’ll mainly use the term coordinates to talk about these structures, to avoid confusion with Clojure vectors.

Now let’s create a function which positions our craft on the launchpad at time 0. We’ll just merge the spacecraft’s map with the initial space center, overwriting the craft’s time and space coordinates.

(defn prepare "Prepares a craft for launch from an equatorial space center." [craft] (merge craft initial-space-center))

Forces

Gravity continually pulls the spacecraft towards the center of the Earth, accelerating it by 9.8 meters/second every second. To figure out what direction is towards the Earth, we’ll need the angles of a spherical coordinate system. We’ll use the trigonometric functions from java.lang.Math.

(defn magnitude
  "What's the radius of a given set of cartesian coordinates?"
  [c]
  ; By the Pythagorean theorem...
  (Math/sqrt (+ (Math/pow (:x c) 2)
                (Math/pow (:y c) 2)
                (Math/pow (:z c) 2))))

(defn cartesian->spherical
  "Converts a map of Cartesian coordinates :x, :y, and :z to spherical
  coordinates :r, :theta, and :phi."
  [c]
  (let [r (magnitude c)]
    {:r     r
     :theta (Math/acos (/ (:z c) r))
     :phi   (Math/atan (/ (:y c) (:x c)))}))

(defn spherical->cartesian
  "Converts spherical to Cartesian coordinates."
  [c]
  {:x (* (:r c) (Math/sin (:theta c)) (Math/cos (:phi c)))
   :y (* (:r c) (Math/sin (:theta c)) (Math/sin (:phi c)))
   :z (* (:r c) (Math/cos (:phi c)))})

With those angles in mind, computing the gravitational acceleration is easy. We just take the spherical coordinates of the spacecraft, and replace the radius with the total force due to gravity. Then we can transform that spherical force back into Cartesian coordinates.

(def g "Acceleration of gravity in meters/s^2" -9.8)

(defn gravity-force
  "The force vector, each component in Newtons, due to gravity."
  [craft]
  ; Since force is mass times acceleration...
  (let [total-force (* g (mass craft))]
    (-> craft
        ; Now we'll take the craft's position
        :position
        ; in spherical coordinates,
        cartesian->spherical
        ; replace the radius with the gravitational force...
        (assoc :r total-force)
        ; and transform back to Cartesian-land
        spherical->cartesian)))

Rockets produce thrust by consuming fuel. Let’s say the fuel consumption is always the maximum, until we run out:

(defn fuel-rate "How fast is fuel, in kilograms/second, consumed by the craft?" [craft] (if (pos? (:fuel-mass craft)) (:max-fuel-rate craft) 0))

Now that we know how much fuel is being consumed, we can compute the force the rocket engine develops. That force is simply the mass consumed per second times the exhaust velocity–which is the specific impulse :isp. We’ll ignore atmospheric effects.

(defn thrust "How much force, in newtons, does the craft's rocket engines exert?" [craft] (* (fuel-rate craft) (:isp craft)))

Cool. What about the direction of thrust? Just for grins, let’s keep the rocket pointing entirely along the x axis.

(defn engine-force "The force vector, each component in Newtons, due to the rocket engine." [craft] (let [t (thrust craft)] {:x t :y 0 :z 0}))

The total force on a craft is just the sum of gravity and thrust. To sum these maps together, we’ll need a way to sum the x, y, and z components independently. Clojure’s merge-with function combines common fields in maps using a function, so this is surprisingly straightforward.
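
For instance, at the REPL:

user=> (merge-with + {:x 1 :y 2 :z 3} {:x 10 :y 20 :z 30})
{:x 11, :y 22, :z 33}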

(defn total-force "Total force on a craft." [craft] (merge-with + (engine-force craft) (gravity-force craft)))

The acceleration of a craft, by Newton’s second law, is force divided by mass. This one’s a little trickier: given {:x 1 :y 2 :z 4}, we want to apply a function–say, multiplication by a factor–to each number. Since maps are sequences of key/value pairs…

user=> (seq {:x 1 :y 2 :z 3}) ([:z 3] [:y 2] [:x 1])

… and we can build up new maps out of key/value pairs using into

user=> (into {} [[:x 4] [:y 5]]) {:x 4, :y 5}

… we can write a function map-values which works like map, but affects the values of a map data structure.

(defn map-values
  "Applies f to every value in the map m."
  [f m]
  (into {}
        (map (fn [pair]
               [(key pair) (f (val pair))])
             m)))

And that allows us to define a scale function which scales a set of coordinates by some factor:

(defn scale "Multiplies a map of x, y, and z coordinates by the given factor." [factor coordinates] (map-values (partial * factor) coordinates))

What’s that partial thing? It’s a function which takes a function, and some arguments, and returns a new function. What does the new function do? It calls the original function, with the arguments passed to partial, followed by any arguments passed to the new function. In short, (partial * factor) returns a function that takes any number, and multiplies it by factor.
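
For instance:

user=> (def triple (partial * 3))
#'user/triple
user=> (triple 5)
15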

So to divide each component of the force vector by the mass of the craft:

(defn acceleration "Total acceleration of a craft." [craft] (let [m (mass craft)] (scale (/ m) (total-force craft))))

Note that (/ m) returns 1/m. Our scale function can do double-duty as both multiplication and division.
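
For instance:

user=> (/ 4)
1/4
user=> (scale (/ 2.0) {:x 4 :y 6 :z 8})
{:x 2.0, :y 3.0, :z 4.0}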

With the acceleration and fuel consumption all figured out, we’re ready to apply those changes over time. We’ll write a function which takes the rocket at a particular time, and returns a version of it dt seconds later.

(defn step [craft dt]
  (assoc craft
    ; Time advances by dt seconds
    :t         (+ dt (:t craft))
    ; We burn some fuel
    :fuel-mass (- (:fuel-mass craft) (* dt (fuel-rate craft)))
    ; Our position changes based on our velocity
    :position  (merge-with + (:position craft)
                             (scale dt (:velocity craft)))
    ; And our velocity changes based on our acceleration
    :velocity  (merge-with + (:velocity craft)
                             (scale dt (acceleration craft)))))

OK. Let’s save the rocket.clj file, load that code into the REPL, and fire it up.

user=> (use 'scratch.rocket :reload)
nil

use is like a shorthand for (:require ... :refer :all). We’re passing :reload because we want the REPL to re-read the file. Notice that in ns declarations, the namespace name scratch.rocket is unquoted–but when we call use or require at the repl, we quote the namespace name.

user=> (atlas-v)
{:dry-mass 50050, :fuel-mass 284450, :time 0, :isp 3050, :max-fuel-rate 284450/253, :max-thrust 4152000.0}

Launch

Let’s prepare the rocket. We’ll use pprint to print it in a more readable form.

user=> (-> (atlas-v) prepare pprint)
{:velocity {:x 0, :y 463.8312116386399, :z 0},
 :position {:x 6378137, :y 0, :z 0},
 :dry-mass 50050,
 :fuel-mass 284450,
 :time 0,
 :isp 3050,
 :max-fuel-rate 284450/253,
 :max-thrust 4152000.0}

Great; there it is on the launchpad. Wow, even “standing still”, it’s moving at 463 meters/sec because of the earth’s rotation! That means you and I are flying through space at almost half a kilometer every second! Let’s step forward one second in time.

user=> (-> (atlas-v) prepare (step 1) pprint)
NullPointerException   clojure.lang.Numbers.ops (Numbers.java:942)

In evaluating this expression, Clojure reached a point where it could not continue, and aborted execution. We call this error an exception, and the process of aborting, throwing the exception. Clojure backs up to the function which called the function that threw, then to the function which called that function, and so on, all the way to the top-level expression. The REPL finally intercepts the exception, prints an error to the console, and stashes the exception object in a special variable *e.

In this case, we know that the exception in question was a NullPointerException, which occurs when a function receives nil unexpectedly. This one came from clojure.lang.Numbers.ops, which suggests some sort of math was involved. Let’s use pst to find out where it came from.

user=> (pst *e)
NullPointerException
	clojure.lang.Numbers.ops (Numbers.java:942)
	clojure.lang.Numbers.add (Numbers.java:126)
	scratch.rocket/step (rocket.clj:125)
	user/eval1478 (NO_SOURCE_FILE:1)
	clojure.lang.Compiler.eval (Compiler.java:6619)
	clojure.lang.Compiler.eval (Compiler.java:6582)
	clojure.core/eval (core.clj:2852)
	clojure.main/repl/read-eval-print--6588/fn--6591 (main.clj:259)
	clojure.main/repl/read-eval-print--6588 (main.clj:259)
	clojure.main/repl/fn--6597 (main.clj:277)
	clojure.main/repl (main.clj:277)
	clojure.tools.nrepl.middleware.interruptible-eval/evaluate/fn--589 (interruptible_eval.clj:56)

This is called a stack trace: the stack is the context of the program at each function call. It traces the path the computer took in evaluating the expression, from the bottom to the top. At the bottom is the REPL, and Clojure compiler. Our code begins at user/eval1478–that’s the compiler’s name for the expression we just typed. That function called scratch.rocket/step, which in turn called Numbers.add, and that called Numbers.ops. Let’s start by looking at the last function we wrote before calling into Clojure’s standard library: the step function, in rocket.clj, on line 125.

123   (assoc craft
124     ; Time advances by dt seconds
125     :t (+ dt (:t craft))

Ah; we named the time field :time earlier, not :t. Let’s replace :t with :time, save the file, and reload.

user=> (use 'scratch.rocket :reload)
nil
user=> (-> (atlas-v) prepare (step 1) pprint)
{:velocity {:x 0.45154055666826215, :y 463.8312116386399, :z -9.8},
 :position {:x 6378137, :y 463.8312116386399, :z 0},
 :dry-mass 50050,
 :fuel-mass 71681400/253,
 :time 1,
 :isp 3050,
 :max-fuel-rate 284450/253,
 :max-thrust 4152000.0}

Look at that! Our position is unchanged (because our velocity was zero), but our velocity has shifted. We’re now moving… wait, -9.8 meters per second south? That can’t be right. Gravity points down, not sideways. Something must be wrong with our spherical coordinate system. Let’s write a test in test/scratch/rocket_test.clj to explore.

(ns scratch.rocket-test
  (:require [clojure.test :refer :all]
            [scratch.rocket :refer :all]))

(deftest spherical-coordinate-test
  (let [pos {:x 1 :y 2 :z 3}]
    (testing "roundtrip"
      (is (= pos (-> pos cartesian->spherical spherical->cartesian))))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

lein test scratch.rocket-test

lein test :only scratch.rocket-test/spherical-coordinate-test

FAIL in (spherical-coordinate-test) (rocket_test.clj:8)
roundtrip
expected: (= pos (-> pos cartesian->spherical spherical->cartesian))
  actual: (not (= {:z 3, :y 2, :x 1}
                  {:x 1.0, :y 1.9999999999999996, :z 1.6733200530681513}))

Ran 2 tests containing 4 assertions.
1 failures, 0 errors.
Tests failed.

Definitely wrong. Looks like something to do with the z coordinate, since x and y look OK. Let’s try testing a point on the north pole:

(deftest spherical-coordinate-test
  (testing "spherical->cartesian"
    (is (= (spherical->cartesian {:r 2 :phi 0 :theta 0})
           {:x 0.0 :y 0.0 :z 2.0})))

  (testing "roundtrip"
    (let [pos {:x 1.0 :y 2.0 :z 3.0}]
      (is (= pos (-> pos cartesian->spherical spherical->cartesian))))))

That checks out OK. Let’s try some values in the repl.

user=> (cartesian->spherical {:x 0.00001 :y 0.00001 :z 2.0})
{:r 2.00000000005, :theta 7.071068104411588E-6, :phi 0.7853981633974483}
user=> (cartesian->spherical {:x 1 :y 2 :z 3})
{:r 3.7416573867739413, :theta 0.6405223126794245, :phi 1.1071487177940904}
user=> (spherical->cartesian (cartesian->spherical {:x 1 :y 2 :z 3}))
{:x 1.0, :y 1.9999999999999996, :z 1.6733200530681513}
user=> (cartesian->spherical {:x 1 :y 2 :z 0})
{:r 2.23606797749979, :theta 1.5707963267948966, :phi 1.1071487177940904}
user=> (cartesian->spherical {:x 1 :y 1 :z 0})
{:r 1.4142135623730951, :theta 1.5707963267948966, :phi 0.7853981633974483}

Oh, wait, that looks odd. {:x 1 :y 1 :z 0} is on the equator: phi–the angle from the pole–should be pi/2 or ~1.57, and theta–the angle around the equator–should be pi/4 or 0.78. Those coordinates are reversed! Double-checking our formulas with Wolfram MathWorld shows that we mixed up phi and theta. Let’s redefine both conversion functions with the angles the right way around.

(defn cartesian->spherical
  "Converts a map of Cartesian coordinates :x, :y, and :z to spherical
  coordinates :r, :theta, and :phi."
  [c]
  (let [r (Math/sqrt (+ (Math/pow (:x c) 2)
                        (Math/pow (:y c) 2)
                        (Math/pow (:z c) 2)))]
    {:r     r
     :phi   (Math/acos (/ (:z c) r))
     :theta (Math/atan (/ (:y c) (:x c)))}))

(defn spherical->cartesian
  "Converts spherical to Cartesian coordinates."
  [c]
  {:x (* (:r c) (Math/cos (:theta c)) (Math/sin (:phi c)))
   :y (* (:r c) (Math/sin (:theta c)) (Math/sin (:phi c)))
   :z (* (:r c) (Math/cos (:phi c)))})

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

lein test scratch.rocket-test

Ran 2 tests containing 5 assertions.
0 failures, 0 errors.

Great. Now let’s check the rocket trajectory again.

user=> (-> (atlas-v) prepare (step 1) pprint)
{:velocity {:x 0.45154055666826204,
            :y 463.8312116386399,
            :z -6.000769315822031E-16},
 :position {:x 6378137, :y 463.8312116386399, :z 0},
 :dry-mass 50050,
 :fuel-mass 71681400/253,
 :time 1,
 :isp 3050,
 :max-fuel-rate 284450/253,
 :max-thrust 4152000.0}

This time, our velocity is increasing in the +x direction, at half a meter per second. We have liftoff!

Flight

We have a function that can move the rocket forward by one small step of time, but we’d like to understand the rocket’s trajectory as a whole; to see all positions it will take. We’ll use iterate to construct a lazy, infinite sequence of rocket states, each one constructed by stepping forward from the last.

(defn trajectory
  "Returns all future states of the craft, at dt-second intervals."
  [dt craft]
  (iterate #(step % dt) craft))

user=> (->> (atlas-v) prepare (trajectory 1) (take 3) pprint)
({:velocity {:x 0, :y 463.8312116386399, :z 0},
  :position {:x 6378137, :y 0, :z 0},
  :dry-mass 50050,
  :fuel-mass 284450,
  :time 0,
  :isp 3050,
  :max-fuel-rate 284450/253,
  :max-thrust 4152000.0}
 {:velocity {:x 0.45154055666826204,
             :y 463.8312116386399,
             :z -6.000769315822031E-16},
  :position {:x 6378137, :y 463.8312116386399, :z 0},
  :dry-mass 50050,
  :fuel-mass 71681400/253,
  :time 1,
  :isp 3050,
  :max-fuel-rate 284450/253,
  :max-thrust 4152000.0}
 {:velocity {:x 0.9376544222659078,
             :y 463.83049896253056,
             :z -1.200153863164406E-15},
  :position {:x 6378137.451540557,
             :y 927.6624232772798,
             :z -6.000769315822031E-16},
  :dry-mass 50050,
  :fuel-mass 71396950/253,
  :time 2,
  :isp 3050,
  :max-fuel-rate 284450/253,
  :max-thrust 4152000.0})

Notice that each map is like a frame of a movie, playing at one frame per second. We can make the simulation more or less accurate by raising or lowering the framerate–adjusting the parameter fed to trajectory. For now, though, we’ll stick with one-second intervals.

How high above the surface is the rocket?

(defn altitude
  "The height above the surface of the equator, in meters."
  [craft]
  (-> craft
      :position
      cartesian->spherical
      :r
      (- earth-equatorial-radius)))

Now we can explore the rocket’s path as a series of altitudes over time:

user=> (->> (atlas-v) prepare (trajectory 1) (map altitude) (take 10) pprint)
(0.0
 0.016865378245711327
 0.519002066925168
 1.540983198210597
 3.117615718394518
 5.283942770212889
 8.075246102176607
 11.52704851794988
 15.675116359256208
 20.555462017655373)

The million dollar question, though, is whether the rocket breaks orbit.

(defn above-ground?
  "Is the craft at or above the surface?"
  [craft]
  (<= 0 (altitude craft)))

(defn flight
  "The above-ground portion of a trajectory."
  [trajectory]
  (take-while above-ground? trajectory))

(defn crashed?
  "Does this trajectory crash into the surface before 100 hours are up?"
  [trajectory]
  (let [time-limit (* 100 3600)] ; 100 hours
    (not (every? above-ground?
                 (take-while #(<= (:time %) time-limit) trajectory)))))

(defn crash-time
  "Given a trajectory, returns the time the rocket impacted the ground."
  [trajectory]
  (:time (last (flight trajectory))))

(defn apoapsis
  "The highest altitude achieved during a trajectory."
  [trajectory]
  (apply max (map altitude (flight trajectory))))

(defn apoapsis-time
  "The time of apoapsis"
  [trajectory]
  (:time (apply max-key altitude (flight trajectory))))

If the rocket goes below ground, we know it crashed. If the rocket stays in orbit, the trajectory will never end. That makes it a bit tricky to tell whether the rocket is in a stable orbit or not, because we can’t ask about every element, or the last element, of an infinite sequence: it’ll take infinite time to evaluate. Instead, we’ll assume that the rocket should crash within the first, say, 100 hours; if it makes it past that point, we’ll assume it made orbit successfully. With these functions in hand, we’ll write a test in test/scratch/rocket_test.clj to see whether or not the launch is successful:

(deftest makes-orbit
  (let [trajectory (->> (atlas-v)
                        prepare
                        (trajectory 1))]
    (when (crashed? trajectory)
      (println "Crashed at" (crash-time trajectory) "seconds")
      (println "Maximum altitude" (apoapsis trajectory)
               "meters at" (apoapsis-time trajectory) "seconds"))

    ; Assert that the rocket eventually made it to orbit.
    (is (not (crashed? trajectory)))))

aphyr@waterhouse:~/scratch$ lein test scratch.rocket-test

lein test scratch.rocket-test
Crashed at 982 seconds
Maximum altitude 753838.039645385 meters at 532 seconds

lein test :only scratch.rocket-test/makes-orbit

FAIL in (makes-orbit) (rocket_test.clj:26)
expected: (not (crashed? trajectory))
  actual: (not (not true))

Ran 2 tests containing 3 assertions.
1 failures, 0 errors.
Tests failed.

We made it to an altitude of 750 kilometers, and crashed 982 seconds after launch. We’re gonna need a bigger boat.

Stage II

The Atlas V isn’t big enough to make it into orbit on its own. It carries a second stage, the Centaur, which is much smaller and uses more efficient engines.

(defn centaur
  "The upper rocket stage.
  http://en.wikipedia.org/wiki/Centaur_(rocket_stage)
  http://www.astronautix.com/stages/cenaurde.htm"
  []
  {:dry-mass      2361
   :fuel-mass     13897
   :isp           4354
   :max-fuel-rate (/ 13897 470)})

The Centaur lives inside the Atlas V main stage. We’ll re-write atlas-v to take an argument: its next stage.

(defn atlas-v
  "The full launch vehicle.
  http://en.wikipedia.org/wiki/Atlas_V"
  [next-stage]
  {:dry-mass      50050
   :fuel-mass     284450
   :isp           3050
   :max-fuel-rate (/ 284450 253)
   :next-stage    next-stage})

Now, in our tests, we’ll construct the rocket like so:

(let [trajectory (->> (atlas-v (centaur)) prepare (trajectory 1))]

When we exhaust the fuel reserves of the primary stage, we’ll de-couple the main booster from the Centaur. In terms of our simulation, the Atlas V will be replaced by its next stage, the Centaur. We’ll write a function stage which separates the vehicles when ready:

(defn stage
  "When fuel reserves are exhausted, separate stages. Otherwise, return craft
  unchanged."
  [craft]
  (cond
    ; Still fuel left
    (pos? (:fuel-mass craft))
    craft

    ; No remaining stages
    (nil? (:next-stage craft))
    craft

    ; Stage!
    :else
    (merge (:next-stage craft)
           (select-keys craft [:time :position :velocity]))))

We’re using cond to handle three distinct cases: where there’s fuel remaining in the craft, where there is no stage to separate, and when we’re ready for stage separation. Separation is easy: we simply return the next stage of the current craft, with the current craft’s time, position, and velocity merged in.

Finally, we’ll have to update our step function to take into account the possibility of stage separation.

(defn step [craft dt]
  (let [craft (stage craft)]
    (assoc craft
      ; Time advances by dt seconds
      :time      (+ dt (:time craft))
      ; We burn some fuel
      :fuel-mass (- (:fuel-mass craft) (* dt (fuel-rate craft)))
      ; Our position changes based on our velocity
      :position  (merge-with + (:position craft)
                               (scale dt (:velocity craft)))
      ; And our velocity changes based on our acceleration
      :velocity  (merge-with + (:velocity craft)
                               (scale dt (acceleration craft))))))

Same as before, only now we call stage prior to the physics simulation. Let’s try a launch.

aphyr@waterhouse:~/scratch$ lein test scratch.rocket-test

lein test scratch.rocket-test
Crashed at 2415 seconds
Maximum altitude 4598444.289945109 meters at 1446 seconds

lein test :only scratch.rocket-test/makes-orbit

FAIL in (makes-orbit) (rocket_test.clj:27)
expected: (not (crashed? trajectory))
  actual: (not (not true))

Ran 2 tests containing 3 assertions.
1 failures, 0 errors.
Tests failed.

Still crashed–but we increased our apoapsis from 750 kilometers to 4,598 kilometers. That’s plenty high, but we’re still not making orbit. Why? Because we’re going straight up, then straight back down. To orbit, we need to move sideways, around the earth.

Orbital insertion

Our spacecraft is shooting upwards, but in order to remain in orbit around the earth, it has to execute a second burn: an orbital injection maneuver. That injection maneuver is also called a circularization burn because it turns the orbit from an ascending parabola into a circle (or something roughly like it). We don’t need to be precise about circularization–any trajectory that doesn’t hit the planet will suffice. All we have to do is burn towards the horizon, once we get high enough.

To do that, we’ll need to enhance the rocket’s control software. We assumed that the rocket would always thrust in the +x direction; but now we’ll need to thrust in multiple directions. We’ll break up the engine force into two parts: thrust (how hard the rocket motor pushes) and orientation (which determines the direction the rocket is pointing.)

(defn unit-vector
  "Scales coordinates to magnitude 1."
  [coordinates]
  (scale (/ (magnitude coordinates)) coordinates))

(defn engine-force
  "The force vector, each component in Newtons, due to the rocket engine."
  [craft]
  (scale (thrust craft)
         (unit-vector (orientation craft))))

We’re taking the orientation of the craft–some coordinates–and scaling it to be of length one with unit-vector. Then we’re scaling the orientation vector by the thrust, returning a thrust vector.

As we go back and redefine parts of the program, you might see an error like

Exception in thread "main" java.lang.RuntimeException: Unable to resolve symbol: unit-vector in this context, compiling:(scratch/rocket.clj:69:11)
	at clojure.lang.Compiler.analyze(Compiler.java:6380)
	at clojure.lang.Compiler.analyze(Compiler.java:6322)

This is a stack trace from the Clojure compiler. It indicates that in scratch/rocket.clj, on line 69, column 11, we used the symbol unit-vector–but it didn’t have a meaning at that point in the program. Perhaps unit-vector is defined below that line. There are two ways to solve this.

  1. Organize your functions so that the simple ones come first, and those that depend on them come later. Read this way, namespaces tell a story, progressing from smaller to bigger, more complex problems.

  2. Sometimes, ordering functions this way is impossible, or would put related ideas too far apart. In this case you can (declare unit-vector) near the top of the namespace. This tells Clojure that unit-vector isn’t defined yet, but it’ll come later.
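
For example, here's a sketch of the second approach, using functions from this chapter:

; Near the top of src/scratch/rocket.clj: promise the compiler that
; unit-vector will be defined later.
(declare unit-vector)

(defn engine-force
  "Uses unit-vector before its definition--legal, thanks to declare."
  [craft]
  (scale (thrust craft) (unit-vector (orientation craft))))

; ... and further down the file, the real definition:
(defn unit-vector
  "Scales coordinates to magnitude 1."
  [coordinates]
  (scale (/ (magnitude coordinates)) coordinates))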

Now that we’ve broken up engine-force into thrust and orientation, we have to control the thrust properly for our two burns. We’ll start by defining the times for the initial ascent and circularization burn, expressed as vectors of start and end times, in seconds.

(def ascent
  "The start and end times for the ascent burn."
  [0 300])

(def circularization
  "The start and end times for the circularization burn."
  [400 1000])

Now we’ll change the thrust by adjusting the rate of fuel consumption. Instead of burning at maximum until running out of fuel, we’ll execute two distinct burns.

(defn fuel-rate
  "How fast is fuel, in kilograms/second, consumed by the craft?"
  [craft]
  (cond
    ; Out of fuel
    (<= (:fuel-mass craft) 0)
    0

    ; Ascent burn
    (<= (first ascent) (:time craft) (last ascent))
    (:max-fuel-rate craft)

    ; Circularization burn
    (<= (first circularization) (:time craft) (last circularization))
    (:max-fuel-rate craft)

    ; Shut down engines otherwise
    :else 0))

We’re using cond to express four distinct possibilities: that we’ve run out of fuel, executing either of the two burns, or resting with the engines shut down. Because the comparison function <= takes any number of arguments and asserts that they occur in order, expressing intervals like “the time is between the first and last times in the ascent” is easy.
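
For instance:

user=> (<= 0 150 300)
true
user=> (<= 0 450 300)
false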

Finally, we need to determine the direction to burn in. This one’s gonna require some tricky linear algebra. You don’t need to worry about the specifics here–the goal is to find out what direction the rocket should burn to fly towards the horizon, in a circle around the planet. We’re doing that by taking the rocket’s velocity vector, and flattening out the velocity towards or away from the planet. All that’s left is the direction the rocket is flying around the earth.

(defn dot-product
  "Finds the inner product of two x, y, z coordinate maps. See
  http://en.wikipedia.org/wiki/Dot_product."
  [c1 c2]
  (+ (* (:x c1) (:x c2))
     (* (:y c1) (:y c2))
     (* (:z c1) (:z c2))))

(defn projection
  "The component of coordinate map a in the direction of coordinate map b.
  See http://en.wikipedia.org/wiki/Vector_projection."
  [a b]
  (let [b (unit-vector b)]
    (scale (dot-product a b) b)))

(defn rejection
  "The component of coordinate map a *not* in the direction of coordinate map b."
  [a b]
  (let [a' (projection a b)]
    {:x (- (:x a) (:x a'))
     :y (- (:y a) (:y a'))
     :z (- (:z a) (:z a'))}))

With the mathematical underpinnings ready, we’ll define the orientation control software:

(defn orientation
  "What direction is the craft pointing?"
  [craft]
  (cond
    ; Initially, point along the *position* vector of the craft--that is
    ; to say, straight up, away from the earth.
    (<= (first ascent) (:time craft) (last ascent))
    (:position craft)

    ; During the circularization burn, we want to burn *sideways*, in the
    ; direction of the orbit. We'll find the component of our velocity
    ; which is aligned with our position vector (that is to say, the vertical
    ; velocity), and subtract the vertical component. All that's left is the
    ; *horizontal* part of our velocity.
    (<= (first circularization) (:time craft) (last circularization))
    (rejection (:velocity craft) (:position craft))

    ; Otherwise, just point straight ahead.
    :else
    (:velocity craft)))

For the ascent burn, we’ll push straight away from the planet. For circularization, we use the rejection function to find the part of the velocity around the planet, and thrust in that direction. By default, we’ll leave the rocket pointing in the direction of travel.

With these changes made, the rocket should execute two burns. Let’s re-run the tests and see.

aphyr@waterhouse:~/scratch$ lein test scratch.rocket-test

lein test scratch.rocket-test

Ran 2 tests containing 3 assertions.
0 failures, 0 errors.

We finally did it! We’re rocket scientists!

Review

(ns scratch.rocket)

;; Linear algebra for {:x 1 :y 2 :z 3} coordinate vectors.

(defn map-values
  "Applies f to every value in the map m."
  [f m]
  (into {}
        (map (fn [pair]
               [(key pair) (f (val pair))])
             m)))

(defn magnitude
  "What's the radius of a given set of cartesian coordinates?"
  [c]
  ; By the Pythagorean theorem...
  (Math/sqrt (+ (Math/pow (:x c) 2)
                (Math/pow (:y c) 2)
                (Math/pow (:z c) 2))))

(defn scale
  "Multiplies a map of x, y, and z coordinates by the given factor."
  [factor coordinates]
  (map-values (partial * factor) coordinates))

(defn unit-vector
  "Scales coordinates to magnitude 1."
  [coordinates]
  (scale (/ (magnitude coordinates)) coordinates))

(defn dot-product
  "Finds the inner product of two x, y, z coordinate maps. See
  http://en.wikipedia.org/wiki/Dot_product"
  [c1 c2]
  (+ (* (:x c1) (:x c2))
     (* (:y c1) (:y c2))
     (* (:z c1) (:z c2))))

(defn projection
  "The component of coordinate map a in the direction of coordinate map b.
  See http://en.wikipedia.org/wiki/Vector_projection."
  [a b]
  (let [b (unit-vector b)]
    (scale (dot-product a b) b)))

(defn rejection
  "The component of coordinate map a *not* in the direction of coordinate map b."
  [a b]
  (let [a' (projection a b)]
    {:x (- (:x a) (:x a'))
     :y (- (:y a) (:y a'))
     :z (- (:z a) (:z a'))}))

;; Coordinate conversion

(defn cartesian->spherical
  "Converts a map of Cartesian coordinates :x, :y, and :z to spherical
  coordinates :r, :theta, and :phi."
  [c]
  (let [r (magnitude c)]
    {:r     r
     :phi   (Math/acos (/ (:z c) r))
     :theta (Math/atan (/ (:y c) (:x c)))}))

(defn spherical->cartesian
  "Converts spherical to Cartesian coordinates."
  [c]
  {:x (* (:r c) (Math/cos (:theta c)) (Math/sin (:phi c)))
   :y (* (:r c) (Math/sin (:theta c)) (Math/sin (:phi c)))
   :z (* (:r c) (Math/cos (:phi c)))})

;; The earth

(def earth-equatorial-radius
  "Radius of the earth, in meters"
  6378137)

(def earth-day
  "Length of an earth day, in seconds."
  86400)

(def earth-equatorial-speed
  "How fast points on the equator move, relative to the center of the earth,
  in meters/sec."
  (/ (* 2 Math/PI earth-equatorial-radius)
     earth-day))

(def g "Acceleration of gravity in meters/s^2" -9.8)

(def initial-space-center
  "The initial position and velocity of the launch facility"
  {:time     0
   :position {:x earth-equatorial-radius
              :y 0
              :z 0}
   :velocity {:x 0
              :y earth-equatorial-speed
              :z 0}})

;; Craft

(defn centaur
  "The upper rocket stage.
  http://en.wikipedia.org/wiki/Centaur_(rocket_stage)
  http://www.astronautix.com/stages/cenaurde.htm"
  []
  {:dry-mass      2361
   :fuel-mass     13897
   :isp           4354
   :max-fuel-rate (/ 13897 470)})

(defn atlas-v
  "The full launch vehicle.
  http://en.wikipedia.org/wiki/Atlas_V"
  [next-stage]
  {:dry-mass      50050
   :fuel-mass     284450
   :isp           3050
   :max-fuel-rate (/ 284450 253)
   :next-stage    next-stage})

;; Flight control

(def ascent
  "The start and end times for the ascent burn."
  [0 300])

(def circularization
  "The start and end times for the circularization burn."
  [400 1000])

(defn orientation
  "What direction is the craft pointing?"
  [craft]
  (cond
    ; Initially, point along the *position* vector of the craft--that is
    ; to say, straight up, away from the earth.
    (<= (first ascent) (:time craft) (last ascent))
    (:position craft)

    ; During the circularization burn, we want to burn *sideways*, in the
    ; direction of the orbit. We'll find the component of our velocity
    ; which is aligned with our position vector (that is to say, the vertical
    ; velocity), and subtract the vertical component. All that's left is the
    ; *horizontal* part of our velocity.
    (<= (first circularization) (:time craft) (last circularization))
    (rejection (:velocity craft) (:position craft))

    ; Otherwise, just point straight ahead.
    :else
    (:velocity craft)))

(defn fuel-rate
  "How fast is fuel, in kilograms/second, consumed by the craft?"
  [craft]
  (cond
    ; Out of fuel
    (<= (:fuel-mass craft) 0)
    0

    ; Ascent burn
    (<= (first ascent) (:time craft) (last ascent))
    (:max-fuel-rate craft)

    ; Circularization burn
    (<= (first circularization) (:time craft) (last circularization))
    (:max-fuel-rate craft)

    ; Shut down engines otherwise
    :else 0))

(defn stage
  "When fuel reserves are exhausted, separate stages. Otherwise, return craft
  unchanged."
  [craft]
  (cond
    ; Still fuel left
    (pos? (:fuel-mass craft))
    craft

    ; No remaining stages
    (nil? (:next-stage craft))
    craft

    ; Stage!
    :else
    (merge (:next-stage craft)
           (select-keys craft [:time :position :velocity]))))

;; Dynamics

(defn thrust
  "How much force, in newtons, does the craft's rocket engines exert?"
  [craft]
  (* (fuel-rate craft) (:isp craft)))

(defn mass
  "The total mass of a craft."
  [craft]
  (+ (:dry-mass craft) (:fuel-mass craft)))

(defn gravity-force
  "The force vector, each component in Newtons, due to gravity."
  [craft]
  ; Since force is mass times acceleration...
  (let [total-force (* g (mass craft))]
    (-> craft
        ; Now we'll take the craft's position
        :position
        ; in spherical coordinates,
        cartesian->spherical
        ; replace the radius with the gravitational force...
        (assoc :r total-force)
        ; and transform back to Cartesian-land
        spherical->cartesian)))

(declare altitude)

(defn engine-force
  "The force vector, each component in Newtons, due to the rocket engine."
  [craft]
  ; Debugging; useful for working through trajectories in detail.
  ; (println craft)
  ; (println "t " (:time craft) "alt" (altitude craft) "thrust" (thrust craft))
  ; (println "fuel" (:fuel-mass craft))
  ; (println "vel " (:velocity craft))
  ; (println "ori " (unit-vector (orientation craft)))
  (scale (thrust craft)
         (unit-vector (orientation craft))))

(defn total-force
  "Total force on a craft."
  [craft]
  (merge-with + (engine-force craft)
                (gravity-force craft)))

(defn acceleration
  "Total acceleration of a craft."
  [craft]
  (let [m (mass craft)]
    (scale (/ m) (total-force craft))))

(defn step [craft dt]
  (let [craft (stage craft)]
    (assoc craft
      ; Time advances by dt seconds
      :time      (+ dt (:time craft))
      ; We burn some fuel
      :fuel-mass (- (:fuel-mass craft) (* dt (fuel-rate craft)))
      ; Our position changes based on our velocity
      :position  (merge-with + (:position craft)
                               (scale dt (:velocity craft)))
      ; And our velocity changes based on our acceleration
      :velocity  (merge-with + (:velocity craft)
                               (scale dt (acceleration craft))))))

;; Launch and flight

(defn prepare
  "Prepares a craft for launch from an equatorial space center."
  [craft]
  (merge craft initial-space-center))

(defn trajectory
  "Returns all future states of the craft, at dt-second intervals."
  [dt craft]
  (iterate #(step % dt) craft))

;; Analyzing trajectories

(defn altitude
  "The height above the surface of the equator, in meters."
  [craft]
  (-> craft
      :position
      cartesian->spherical
      :r
      (- earth-equatorial-radius)))

(defn above-ground?
  "Is the craft at or above the surface?"
  [craft]
  (<= 0 (altitude craft)))

(defn flight
  "The above-ground portion of a trajectory."
  [trajectory]
  (take-while above-ground? trajectory))

(defn crashed?
  "Does this trajectory crash into the surface before 10 hours are up?"
  [trajectory]
  (let [time-limit (* 10 3600)] ; 10 hours
    (not (every? above-ground?
                 (take-while #(<= (:time %) time-limit) trajectory)))))

(defn crash-time
  "Given a trajectory, returns the time the rocket impacted the ground."
  [trajectory]
  (:time (last (flight trajectory))))

(defn apoapsis
  "The highest altitude achieved during a trajectory."
  [trajectory]
  (apply max (map altitude (flight trajectory))))

(defn apoapsis-time
  "The time of apoapsis"
  [trajectory]
  (:time (apply max-key altitude (flight trajectory))))

As written here, our first non-trivial program tells a story–though a different one than the process of exploration and refinement that brought the rocket to orbit. It builds from small, abstract ideas: linear algebra and coordinates; physical constants describing the universe for the simulation; and the basic outline of the spacecraft. Then we define the software controlling the rocket; the times for the burns, how much to thrust, in what direction, and when to separate stages. Using those control functions, we build a physics engine including gravity and thrust forces, and use Newton’s second law to build a basic Euler Method solver. Finally, we analyze the trajectories the solver produces to answer key questions: how high, how long, and did it explode?

We used Clojure’s immutable data structures–mostly maps–to represent the state of the universe, and defined pure functions to interpret those states and construct new ones. Using iterate, we projected a single state forward into an infinite timeline of the future–evaluated as demanded by the analysis functions. Though we pay a performance penalty, immutable data structures, pure functions, and lazy evaluation make simulating complex systems easier to reason about.

Had we written this simulation in a different language, different techniques might have come into play. In Java, C++, or Ruby, we would have defined a hierarchy of datatypes called classes, each one representing a small piece of state. We might define a Craft type, and create subtypes Atlas and Centaur. We’d create a Coordinate type, subdivided into Cartesian and Spherical, and so on. The types add complexity and rigidity, but also prevent mis-spellings, and can prevent us from interpreting, say, coordinates as craft or vice-versa.

To move the system forward in a language emphasizing mutable data structures, we would have updated the time and coordinates of a single craft in-place. This introduces additional complexity, because many of the changes we made depended on the current values of the craft. To ensure the correct ordering of calculations, we’d scatter temporary variables and explicit copies throughout the code, ensuring that functions didn’t see inconsistent pictures of the craft state. The mutable approach would likely be faster, but would still demand some copying of data, and sacrifice clarity.

More imperative languages place less emphasis on laziness, and make it harder to express ideas like map and take. We might have simulated the trajectory for some fixed time, constructing a history of all the intermediate results we needed, then analyzed it by moving explicitly from slot to slot in that history, checking if the craft had crashed, and so on.

Across all these languages, though, some ideas remain the same. We solve big problems by breaking them up into smaller ones. We use data structures to represent the state of the system, and functions to alter that state. Comments and docstrings clarify the story of the code, making it readable to others. Tests ensure the software is correct, and allow us to work piecewise towards a solution.

Exercises

  1. We know the spacecraft reached orbit, but we have no idea what that orbit looks like. Since the trajectory is infinite in length, we can’t ask about the entire history using max–but we know that all orbits have a high and low point. At the highest point, the difference between successive altitudes changes from increasing to decreasing, and at the lowest point, the difference between successive altitudes changes from decreasing to increasing. Using this technique, refine the apoapsis function to find the highest point using that inflection in altitudes–and write a corresponding periapsis function that finds the lowest point in the orbit. Display both periapsis and apoapsis in the test suite.

  2. We assumed the force of gravity resulted in a constant 9.8 meter/second/second acceleration towards the earth, but in the real world, gravity falls off with the inverse square law. Using the mass of the earth, mass of the spacecraft, and Newton’s constant, refine the gravitational force used in this simulation to take Newton’s law into account. How does this affect the apoapsis?

  3. We ignored the atmosphere, which exerts drag on the craft as it moves through the air. Write a basic air-density function which falls off with altitude. Make some educated guesses as to how much drag a real rocket experiences, and assume that the drag force is proportional to the square of the rocket’s velocity. Can your rocket still reach orbit?

  4. Notice that the periapsis and apoapsis of the rocket are different. By executing the circularization burn carefully, can you make them the same–achieving a perfectly circular orbit? One way to do this is to pick an orbital altitude and velocity of a known satellite–say, the International Space Station–and write the control software to match that velocity at that altitude.

Previously, we covered state and mutability.

Up until now, we’ve been programming primarily at the REPL. However, the REPL is a limited tool. While it lets us explore a problem interactively, that interactivity comes at a cost: changing an expression requires retyping the entire thing, editing multi-line expressions is awkward, and our work vanishes when we restart the REPL–so we can’t share our programs with others, or run them again later. Moreover, programs in the REPL are hard to organize. To solve large problems, we need a way of writing programs durably–so they can be read and evaluated later.

In addition to the code itself, we often want to store ancillary information. Tests verify the correctness of the program. Resources like precomputed databases, lookup tables, images, and text files provide other data the program needs to run. There may be documentation: instructions for how to use and understand the software. A program may also depend on code from other programs, which we call libraries, packages, or dependencies. In Clojure, we have a standardized way to bind together all these parts into a single directory, called a project.

Project structure

We created a project at the start of this book by using Leiningen, the Clojure project tool.

$ lein new scratch

scratch is the name of the project, and also the name of the directory where the project’s files live. Inside the project are a few files.

$ cd scratch; ls
doc  project.clj  README.md  resources  src  target  test

project.clj defines the project: its name, its version, dependencies, and so on. Notice the name of the project (scratch) comes first, followed by the version (0.1.0-SNAPSHOT). -SNAPSHOT versions are for development; you can change them at any time, and any projects which depend on the snapshot will pick up the most recent changes. A version which does not end in -SNAPSHOT is fixed: once published, it always points to the same version of the project. This allows projects to specify precisely which projects they depend on. For example, scratch’s project.clj says scratch depends on org.clojure/clojure version 1.5.1.

(defproject scratch "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]])

README.md is the first file most people open when they look at a new project, and Lein generates a generic readme for you to fill in later. We call this kind of scaffolding or example a “stub”; it’s just there to remind you what sort of things to write yourself. You’ll notice the readme includes the name of the project, some notes on what it does and how to use it, a copyright notice where your name should go, and a license, which sets the legal terms for the use of the project. By default, Leiningen suggests the Eclipse Public License, which allows everyone to use and modify the software, so long as they preserve the license information.

The doc directory is for documentation; sometimes hand-written, sometimes automatically generated from the source code. resources is for additional files, like images. src is where Clojure code lives, and test contains the corresponding tests. Finally, target is where Leiningen stores compiled code, built packages, and so on.

Namespaces

Every lein project starts out with a stub namespace containing a simple function. Let’s take a look at that namespace now–it lives in src/scratch/core.clj:

(ns scratch.core)

(defn foo
  "I don't do a whole lot."
  [x]
  (println x "Hello, World!"))

The first part of this file defines the namespace we’ll be working in. The ns macro lets the Clojure compiler know that all following code belongs in the scratch.core namespace. Remember, scratch is the name of our project. scratch.core is for the core functions and definitions of the scratch project. As projects expand, we might add new namespaces to separate our work into smaller, more understandable pieces. For instance, Clojure’s primary functions live in clojure.core, but there are auxiliary functions for string processing in clojure.string, functions for interoperating with Java’s input-output system in clojure.java.io, for printing values in clojure.pprint, and so on.

def, defn, and peers always work in the scope of a particular namespace. The function foo in scratch.core is different from the function foo in scratch.pad.

scratch.foo=> (ns scratch.core)
nil
scratch.core=> (def foo "I'm in core")
#'scratch.core/foo
scratch.core=> (ns scratch.pad)
nil
scratch.pad=> (def foo "I'm in pad!")
#'scratch.pad/foo

Notice the full names of these vars are different: scratch.core/foo vs scratch.pad/foo. You can always refer to a var by its fully qualified name: the namespace, followed by a slash /, followed by the short name.
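For instance, from scratch.pad we can still reach the var in scratch.core by its fully qualified name, continuing the session above:

scratch.pad=> scratch.core/foo
"I'm in core"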

Inside a namespace, symbols resolve to variables which are defined in that namespace. So in scratch.pad, foo refers to scratch.pad/foo.

scratch.pad=> foo
"I'm in pad!"

Namespaces include clojure.core by default, which is where all the standard functions, macros, and special forms come from. let, defn, filter, vector, and so on all live in clojure.core, and are automatically included in new namespaces so we can refer to them by their short names.

Notice that the values for foo we defined in scratch.pad and scratch.core aren’t available in other namespaces, like user.

scratch.pad=> (ns user)
nil
user=> foo
CompilerException java.lang.RuntimeException: Unable to resolve symbol: foo in this context, compiling:(NO_SOURCE_PATH:1:602)

To access things from other namespaces, we have to require them in the namespace definition.

user=> (ns user (:require [scratch.core]))
nil
user=> scratch.core/foo
"I'm in core"

The :require part of the ns declaration told the compiler that the user namespace needed access to scratch.core. We could then refer to the fully qualified name scratch.core/foo.

Often, writing out the full namespace is cumbersome–so you can give a short alias for a namespace like so:

user=> (ns user (:require [scratch.core :as c]))
nil
user=> c/foo
"I'm in core"

The :as directive indicates that anywhere we write c/something, the compiler should expand that to scratch.core/something. If you plan on using a var from another namespace often, you can refer it to the local namespace–which means you may omit the namespace qualifier entirely.

user=> (ns user (:require [scratch.core :refer [foo]]))
nil
user=> foo
"I'm in core"

You can refer functions into the current namespace by listing them: [foo bar ...]. Alternatively, you can suck in every function from another namespace by saying :refer :all:

user=> (ns user (:require [scratch.core :refer :all]))
nil
user=> foo
"I'm in core"

Namespaces control complexity by isolating code into more understandable, related pieces. They make it easier to read code by keeping similar things together, and unrelated things apart. By making dependencies between namespaces explicit, they make it clear how groups of functions relate to one another.

If you’ve worked with Erlang, Modula-2, Haskell, Perl, or ML, you’ll find namespaces analogous to modules or packages. Namespaces are often large, encompassing hundreds of functions; and most projects use only a handful of namespaces.

By contrast, object-oriented programming languages like Java, Scala, Ruby, and Objective C organize code in classes, which combine names and state in a single construct. Because all functions in a class operate on the same state, object-oriented languages tend to have many classes with fewer functions in each. It’s not uncommon for a typical Java project to define hundreds or thousands of classes containing only one or two functions each. If you come from an object-oriented language, it can feel a bit unusual to combine so many functions in a single scope–but because functional programs isolate state differently, this is normal. If, on the other hand, you move to an object-oriented language after Clojure, remember that OO languages compose differently. Objects with hundreds of functions are usually considered unwieldy and should be split into smaller pieces.

Code and tests

It’s perfectly fine to test small programs in the REPL. We’ve written and refined hundreds of functions that way: by calling the function and seeing what happens. However, as programs grow in scope and complexity, testing them by hand becomes harder and harder. If you change the behavior of a function which ten other functions rely on, you may have to re-test all ten by hand. In real programs, a small change can alter thousands of distinct behaviors, all of which should be verified.

Wherever possible, we want to automate software tests–making the test itself another program. If the test suite runs in a matter of seconds, we can make changes freely–re-running the tests continuously to verify that everything still works.

As a simple example, let’s write and test a single function in src/scratch/core.clj. How about exponentiation–raising a number to the given power?

(ns scratch.core)

(defn pow
  "Raises base to the given power. For instance, (pow 3 2) returns three
  squared, or nine."
  [base power]
  (apply * (repeat base power)))

So we repeat the base power times, then call * with that sequence of bases to multiply them all together. Seems straightforward enough. Now we need to test it.

By default, all lein projects come with a simple test stub. Let’s see it in action by running lein test.

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

lein test :only scratch.core-test/a-test

FAIL in (a-test) (core_test.clj:7)
FIXME, I fail.
expected: (= 0 1)
  actual: (not (= 0 1))

Ran 1 tests containing 1 assertions.
1 failures, 0 errors.
Tests failed.

A failure is when a test returns the wrong value. An error is when a test throws an exception. In this case, the test named a-test, in the file core_test.clj, on line 7, failed. That test expected zero to be equal to one–but found that 0 and 1 were (in point of fact) not equal. Let’s take a look at that test, in test/scratch/core_test.clj.

(ns scratch.core-test
  (:require [clojure.test :refer :all]
            [scratch.core :refer :all]))

(deftest a-test
  (testing "FIXME, I fail."
    (is (= 0 1))))

These tests live in a namespace too! Notice that namespaces and file names match up: scratch.core lives in src/scratch/core.clj, and scratch.core-test lives in test/scratch/core_test.clj. Dashes (-) in namespaces correspond to underscores (_) in filenames, and dots (.) correspond to directory separators (/).
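For example, here’s how a few namespace names map to files (scratch.data.io is a hypothetical namespace, purely for illustration):

; Namespace            File
; scratch.core      -> src/scratch/core.clj
; scratch.core-test -> test/scratch/core_test.clj
; scratch.data.io   -> src/scratch/data/io.clj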

The scratch.core-test namespace is responsible for testing things in scratch.core. Notice that it requires two namespaces: clojure.test, which provides testing functions and macros, and scratch.core, which is the namespace we want to test.

Then we define a test using deftest. deftest takes a symbol as a test name, and then any number of expressions to evaluate. We can use testing to split up tests into smaller pieces. If a test fails, lein test will print out the enclosing deftest and testing names, to make it easier to figure out what went wrong.

Let’s change this test so that it passes. 0 should equal 0.

(deftest a-test
  (testing "Numbers are equal to themselves, right?"
    (is (= 0 0))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

Wonderful! Now let’s test the pow function. I like to start with a really basic case and work my way up to more complicated ones. 1^1 is 1, so:

(deftest pow-test
  (testing "unity"
    (is (= 1 (pow 1 1)))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

Excellent. How about something harder?

(deftest pow-test
  (testing "unity"
    (is (= 1 (pow 1 1))))
  (testing "square integers"
    (is (= 9 (pow 3 2)))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

lein test :only scratch.core-test/pow-test

FAIL in (pow-test) (core_test.clj:10)
square integers
expected: (= 9 (pow 3 2))
  actual: (not (= 9 8))

Ran 1 tests containing 2 assertions.
1 failures, 0 errors.
Tests failed.

That’s odd. 3^2 should be 9, not 8. Let’s double-check our code in the REPL. base was 3, and power was 2, so…

user=> (repeat 3 2)
(2 2 2)
user=> (* 2 2 2)
8

Ah, there’s the problem. We’re mis-using repeat. Instead of repeating 3 twice, we repeated 2 thrice.

user=> (doc repeat)
-------------------------
clojure.core/repeat
([x] [n x])
  Returns a lazy (infinite!, or length n if supplied) sequence of xs.

Let’s redefine pow with the correct arguments to repeat:

(defn pow
  "Raises base to the given power. For instance, (pow 3 2) returns three
  squared, or nine."
  [base power]
  (apply * (repeat power base)))
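A quick REPL check of the fixed version–after reloading the namespace–should agree with our arithmetic:

user=> (pow 3 2)
9
user=> (pow 2 10)
1024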

How about 0^0? By convention, mathematicians define 0^0 as 1.

(deftest pow-test
  (testing "unity"
    (is (= 1 (pow 1 1))))
  (testing "square integers"
    (is (= 9 (pow 3 2))))
  (testing "0^0"
    (is (= 1 (pow 0 0)))))

aphyr@waterhouse:~/scratch$ lein test

lein test scratch.core-test

Ran 1 tests containing 3 assertions.
0 failures, 0 errors.

Hey, what do you know? It works! But why?

user=> (repeat 0 0)
()

What happens when we call * with an empty list of arguments?

user=> (*)
1

Remember when we talked about how the zero-argument forms of + and * made some definitions simpler? This is one of those times. We didn’t have to define a special exception for zero powers because (*) returns the multiplicative identity 1, by convention.
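The same convention holds for addition: with no arguments, + returns its identity, 0.

user=> (+)
0
user=> (*)
1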

Exploring data

The last bit of logistics we need to talk about is working with other people’s code. Clojure projects, like most modern programming environments, are built to work together. We can use libraries to parse data, solve mathematical problems, render graphics, perform simulations, talk to robots, or predict the weather. As a quick example, I’d like to imagine that you and I are public-health researchers trying to identify the best location for an ad campaign to reduce drunk driving.

The FBI’s Uniform Crime Reporting database tracks the annual tally of different types of arrests, broken down by county–but the data files provided by the FBI are a mess to work with. Helpfully, Matt Aliabadi has organized the UCR’s somewhat complex format into nice, normalized files in a data format called JSON, and made them available on Github. Let’s download the most recent year’s normalized data, and save it in the scratch directory.

What’s in this file, anyway? Let’s take a look at the first few lines using head:

aphyr@waterhouse:~/scratch$ head 2008.json
[
  {
    "icpsr_study_number": null,
    "icpsr_edition_number": 1,
    "icpsr_part_number": 1,
    "icpsr_sequential_case_id_number": 1,
    "fips_state_code": "01",
    "fips_county_code": "001",
    "county_population": 52417,
    "number_of_agencies_in_county": 3,

This is a data format called JSON, and it looks a lot like Clojure’s data structures. That’s the start of a vector on the first line, and the second line starts a map. Then we’ve got string keys like "icpsr_study_number", and values which look like null (nil), numbers, or strings. But in order to work with this file, we’ll need to parse it into Clojure data structures. For that, we can use a JSON parsing library, like Cheshire.

Cheshire, like most Clojure libraries, is published on an internet repository called Clojars. To include it in our scratch project, all we have to do is open project.clj in a text editor and add a line specifying that we want to use a particular version of Cheshire.

(defproject scratch "0.1.0-SNAPSHOT"
  :description "Just playing around"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cheshire "5.3.1"]])

Now we’ll exit the REPL with Control+D (^D), and restart it with lein repl. Leiningen, the Clojure package manager, will automatically download Cheshire from Clojars and make it available in the new REPL session.

Now let’s figure out how to parse the JSON file. Looking at Cheshire’s README shows an example that looks helpful:

;; parse some json and get keywords back
(parse-string "{\"foo\":\"bar\"}" true)
;; => {:foo "bar"}

So Cheshire includes a parse-string function which can take a string and return a data structure. How can we get a string out of a file? Using slurp:

user=> (use 'cheshire.core)
nil
user=> (parse-string (slurp "2008.json"))
...

Woooow, that’s a lot of data! Let’s chop it down to something more manageable. How about the first entry?

user=> (first (parse-string (slurp "2008.json")))
{"syntheticdrug_salemanufacture" 1,
 "all_other_offenses_except_traffic" 900,
 "arson" 3,
 ...}
user=> (-> "2008.json" slurp parse-string first)

It’d be nicer if this data used keywords instead of strings for its keys. Let’s use the second argument to Cheshire’s parse-string to convert all the keys in maps to keywords.

user=> (first (parse-string (slurp "2008.json") true))
{:other_assaults 288,
 :gambling_all_other 0,
 :arson 3,
 ...
 :drunkenness 108}

Since we’re going to be working with this dataset over and over again, let’s bind it to a variable for easy re-use.

user=> (def data (parse-string (slurp "2008.json") true))
#'user/data

Now we’ve got a big long vector of counties, each represented by a map–but we’re just interested in the DUIs of each one. What does that look like? Let’s map each county to its :driving_under_influence.

user=> (->> data (map :driving_under_influence))
(198 1095 114 98 135 4 122 587 204 53 177 ...

What’s the most any county has ever reported?

user=> (->> data (map :driving_under_influence) (apply max))
45056

45056 counts in one year? Wow! What about the second-worst county? The easiest way to find the top n counties is to sort the list, then look at the final elements.

user=> (->> data (map :driving_under_influence) sort (take-last 10))
(8589 10432 10443 10814 11439 13983 17572 18562 26235 45056)

So the top 10 counties range from 8589 counts to 45056 counts. What’s the most common count? Clojure comes with a built-in function called frequencies which takes a sequence of elements, and returns a map from each element to how many times it appeared in the sequence.
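To see frequencies in miniature, try it on a small vector first:

user=> (frequencies [:a :b :a :a])
{:a 3, :b 1}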

user=> (->> data (map :driving_under_influence) frequencies)
{0 227, 1024 1, 45056 1, 32 15, 2080 1, 64 12 ...

Now let’s take those [drunk-driving, frequency] pairs and sort them by key to produce a histogram. sort-by takes a function to apply to each element in the collection–in this case, a key-value pair–and returns something that can be sorted, like a number. We’ll choose the key function to extract the key from each key-value pair, effectively sorting the counties by number of reported incidents.
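Here’s sort-by key on a tiny map, just to see the shape of the output: a sequence of key-value pairs, ordered by key.

user=> (sort-by key {3 :c, 1 :a, 2 :b})
([1 :a] [2 :b] [3 :c])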

user=> (->> data (map :driving_under_influence) frequencies (sort-by key) pprint)
([0 227]
 [1 24]
 [2 17]
 [3 20]
 [4 17]
 [5 24]
 [6 23]
 [7 23]
 [8 17]
 [9 19]
 [10 29]
 [11 20]
 [12 18]
 [13 21]
 [14 25]
 [15 13]
 [16 18]
 [17 16]
 [18 17]
 [19 11]
 [20 8]
 ...

So a ton of counties (227 out of 3172 total) report no drunk driving; a few hundred have one incident, a moderate number have 10-20, and it falls off from there. This is a common sort of shape in statistics; often a hallmark of an exponential distribution.

How about the 10 worst counties, all the way out on the end of the curve?

user=> (->> data (map :driving_under_influence) frequencies (sort-by key) (take-last 10) pprint)
([8589 1]
 [10432 1]
 [10443 1]
 [10814 1]
 [11439 1]
 [13983 1]
 [17572 1]
 [18562 1]
 [26235 1]
 [45056 1])

So it looks like 45056 is high, but there are 8 other counties with tens of thousands of reports too. Let’s back up to the original dataset, and sort it by DUIs:

user=> (->> data (sort-by :driving_under_influence) (take-last 10) pprint)
({:other_assaults 3096,
  :gambling_all_other 3,
  :arson 106,
  :have_stolen_property 698,
  :syntheticdrug_salemanufacture 0,
  :icpsr_sequential_case_id_number 220,
  :drug_abuse_salemanufacture 1761,
  ...

What we’re looking for is the county names, but it’s a little hard to read these enormous maps. Let’s take a look at just the keys that define each county, and see which ones might be useful. We’ll take this list of counties, map each one to a list of its keys, and concatenate those lists together into one big long list. mapcat maps and concatenates in a single step. We expect the same keys to show up over and over again, so we’ll remove duplicates by merging all those keys into a sorted-set.
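Again, the pieces are easier to see on toy data:

user=> (mapcat keys [{:a 1} {:a 2, :b 3}])
(:a :a :b)
user=> (into (sorted-set) [:b :a :a])
#{:a :b}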

user=> (->> data
            (sort-by :driving_under_influence)
            (take-last 10)
            (mapcat keys)
            (into (sorted-set))
            pprint)
#{:aggravated_assaults :all_other_offenses_except_traffic :arson
  :auto_thefts :bookmaking_horsesport :burglary :county_population
  :coverage_indicator :curfew_loitering_laws :disorderly_conduct
  :driving_under_influence :drug_abuse_salemanufacture
  :drug_abuse_violationstotal :drug_possession_other
  :drug_possession_subtotal :drunkenness :embezzlement :fips_county_code
  :fips_state_code :forgerycounterfeiting :fraud :gambling_all_other
  :gambling_total :grand_total :have_stolen_property :icpsr_edition_number
  :icpsr_part_number :icpsr_sequential_case_id_number :icpsr_study_number
  :larceny :liquor_law_violations :marijuana_possession
  :marijuanasalemanufacture :multicounty_jurisdiction_flag :murder
  :number_of_agencies_in_county :numbers_lottery
  :offenses_against_family_child :opiumcocaine_possession
  :opiumcocainesalemanufacture :other_assaults :otherdang_nonnarcotics
  :part_1_total :property_crimes :prostitutioncomm_vice :rape :robbery
  :runaways :sex_offenses :suspicion :synthetic_narcoticspossession
  :syntheticdrug_salemanufacture :vagrancy :vandalism :violent_crimes
  :weapons_violations}

Ah, :fips_county_code and :fips_state_code look promising. Let’s compact the dataset to just drunk driving and those codes, using select-keys.
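select-keys does just what it says: given a map, it returns a new map containing only the listed keys.

user=> (select-keys {:a 1, :b 2, :c 3} [:a :c])
{:a 1, :c 3}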

user=> (->> data
            (sort-by :driving_under_influence)
            (take-last 10)
            (map #(select-keys % [:driving_under_influence
                                  :fips_county_code
                                  :fips_state_code]))
            pprint)
({:fips_state_code "06", :fips_county_code "067", :driving_under_influence 8589}
 {:fips_state_code "48", :fips_county_code "201", :driving_under_influence 10432}
 {:fips_state_code "32", :fips_county_code "003", :driving_under_influence 10443}
 {:fips_state_code "06", :fips_county_code "065", :driving_under_influence 10814}
 {:fips_state_code "53", :fips_county_code "033", :driving_under_influence 11439}
 {:fips_state_code "06", :fips_county_code "071", :driving_under_influence 13983}
 {:fips_state_code "06", :fips_county_code "059", :driving_under_influence 17572}
 {:fips_state_code "06", :fips_county_code "073", :driving_under_influence 18562}
 {:fips_state_code "04", :fips_county_code "013", :driving_under_influence 26235}
 {:fips_state_code "06", :fips_county_code "037", :driving_under_influence 45056})

Now we’re getting somewhere–but we need a way to interpret these state and county codes. Googling for “FIPS” led me to Wikipedia’s account of the FIPS county code system, and the NOAA’s ERDDAP service, which has a table mapping FIPS codes to state and county names. Let’s save that file as fips.json.

Now we’ll slurp that file into the REPL and parse it, just like we did with the crime dataset.

user=> (def fips (parse-string (slurp "fips.json") true))

Let’s take a quick look at the structure of this data:

user=> (keys fips)
(:table)
user=> (keys (:table fips))
(:columnNames :columnTypes :rows)
user=> (->> fips :table :columnNames)
["FIPS" "Name"]

Great, so we expect the rows to be a list of FIPS code and Name.

user=> (->> fips :table :rows (take 3) pprint)
(["02000" "AK"]
 ["02013" "AK, Aleutians East"]
 ["02016" "AK, Aleutians West"])

Perfect. Now that we’ve done some exploratory work in the REPL, let’s shift back to an editor. Create a new file in src/scratch/crime.clj:

(ns scratch.crime
  (:require [cheshire.core :as json]))

(def fips
  "A map of FIPS codes to their county names."
  (->> (json/parse-string (slurp "fips.json") true)
       :table
       :rows
       (into {})))

We’re just taking a snippet we wrote in the REPL–parsing the FIPS dataset–and writing it down for use as a part of a bigger program. We use (into {}) to convert the sequence of [fips, name] pairs into a map, just like we used into (sorted-set) to merge a list of keywords into a set earlier. into works just like conj, repeated over and over again, and is an incredibly useful tool for building up collections of things.
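You can see the equivalence on a small example:

user=> (into {} [[:a 1] [:b 2]])
{:a 1, :b 2}
user=> (conj (conj {} [:a 1]) [:b 2])
{:a 1, :b 2}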

Back in the REPL, let’s check if that worked:

user=> (use 'scratch.crime :reload)
nil
user=> (fips "10001")
"DE, Kent"

Remember, maps act like functions in Clojure, so we can use the fips map to look up the names of counties efficiently.

We also have to load the UCR crime file–so let’s split that load-and-parse code into its own function:

(defn load-json
  "Given a filename, reads a JSON file and returns it, parsed, with
  keywords."
  [file]
  (json/parse-string (slurp file) true))

(def fips
  "A map of FIPS codes to their county names."
  (->> "fips.json"
       load-json
       :table
       :rows
       (into {})))

Now we can re-use load-json to load the UCR crime file.

(defn most-duis
  "Given a JSON filename of UCR crime data for a particular year, finds
  the counties with the most DUIs."
  [file]
  (->> file
       load-json
       (sort-by :driving_under_influence)
       (take-last 10)
       (map #(select-keys % [:driving_under_influence
                             :fips_county_code
                             :fips_state_code]))))

user=> (use 'scratch.crime :reload)
nil
user=> (pprint (most-duis "2008.json"))
({:fips_state_code "06", :fips_county_code "067", :driving_under_influence 8589}
 {:fips_state_code "48", :fips_county_code "201", :driving_under_influence 10432}
 {:fips_state_code "32", :fips_county_code "003", :driving_under_influence 10443}
 {:fips_state_code "06", :fips_county_code "065", :driving_under_influence 10814}
 {:fips_state_code "53", :fips_county_code "033", :driving_under_influence 11439}
 {:fips_state_code "06", :fips_county_code "071", :driving_under_influence 13983}
 {:fips_state_code "06", :fips_county_code "059", :driving_under_influence 17572}
 {:fips_state_code "06", :fips_county_code "073", :driving_under_influence 18562}
 {:fips_state_code "04", :fips_county_code "013", :driving_under_influence 26235}
 {:fips_state_code "06", :fips_county_code "037", :driving_under_influence 45056})

Almost there. We need to join together the state and county FIPS codes into a single string, because that’s how fips represents the county code:

(defn fips-code
  "Given a county (a map with :fips_state_code and :fips_county_code
  keys), returns the five-digit FIPS code for the county, as a string."
  [county]
  (str (:fips_state_code county) (:fips_county_code county)))

Let’s write a quick test in test/scratch/crime_test.clj to verify that function works correctly:

(ns scratch.crime-test
  (:require [clojure.test :refer :all]
            [scratch.crime :refer :all]))

(deftest fips-code-test
  (is (= "12345" (fips-code {:fips_state_code "12"
                             :fips_county_code "345"}))))

aphyr@waterhouse:~/scratch$ lein test scratch.crime-test

lein test scratch.crime-test

Ran 1 tests containing 1 assertions.
0 failures, 0 errors.

Great. Note that lein test some-namespace runs only the tests in that particular namespace. For the last step, let’s take the most-duis function and, using fips and fips-code, construct a map of county names to DUI reports.

(defn most-duis
  "Given a JSON filename of UCR crime data for a particular year, finds
  the counties with the most DUIs."
  [file]
  (->> file
       load-json
       (sort-by :driving_under_influence)
       (take-last 10)
       (map (fn [county]
              [(fips (fips-code county))
               (:driving_under_influence county)]))
       (into {})))

user=> (use 'scratch.crime :reload)
nil
user=> (pprint (most-duis "2008.json"))
{"CA, Orange" 17572,
 "CA, San Bernardino" 13983,
 "CA, Los Angeles" 45056,
 "CA, Riverside" 10814,
 "NV, Clark" 10443,
 "WA, King" 11439,
 "AZ, Maricopa" 26235,
 "CA, San Diego" 18562,
 "TX, Harris" 10432,
 "CA, Sacramento" 8589}

Our question is, at least in part, answered: Los Angeles and Maricopa counties, in California and Arizona, have the most reports of drunk driving out of any counties in the 2008 Uniform Crime Reporting database. These might be good counties for a PSA campaign. California has either lots of drunk drivers, or aggressive enforcement, or both! Remember, this only tells us about reports of crimes; not the crimes themselves. Numbers vary based on how the state enforces the laws!

(ns scratch.crime
  (:require [cheshire.core :as json]))

(defn load-json
  "Given a filename, reads a JSON file and returns it, parsed, with
  keywords."
  [file]
  (json/parse-string (slurp file) true))

(def fips
  "A map of FIPS codes to their county names."
  (->> "fips.json"
       load-json
       :table
       :rows
       (into {})))

(defn fips-code
  "Given a county (a map with :fips_state_code and :fips_county_code
  keys), returns the five-digit FIPS code for the county, as a string."
  [county]
  (str (:fips_state_code county) (:fips_county_code county)))

(defn most-duis
  "Given a JSON filename of UCR crime data for a particular year, finds
  the counties with the most DUIs."
  [file]
  (->> file
       load-json
       (sort-by :driving_under_influence)
       (take-last 10)
       (map (fn [county]
              [(fips (fips-code county))
               (:driving_under_influence county)]))
       (into {})))

Recap

In this chapter, we expanded beyond transient programs written in the REPL. We learned how projects combine static resources, code, and tests into a single package, and how projects can relate to one another through dependencies. We learned the basics of Clojure’s namespace system, which isolates distinct chunks of code from one another, and how to include definitions from one namespace in another via require and use. We learned how to write and run tests to verify our code’s correctness, and how to move seamlessly between the REPL and code in .clj files. We made use of Cheshire, a Clojure library published on Clojars, to parse JSON–a common data format. Finally, we brought together our knowledge of Clojure’s basic grammar, immutable data structures, core functions, sequences, threading macros, and vars to explore a real-world problem.

Exercises

  1. most-duis tells us about the raw number of reports, but doesn’t account for differences in county population. One would naturally expect counties with more people to have more crime! Divide the :driving_under_influence of each county by its :county_population to find a prevalence of DUIs, and take the top ten counties based on prevalence. How should you handle counties with a population of zero?

  2. How do the prevalence counties compare to the original counties? Expand most-duis to return vectors of [county-name, prevalence, report-count, population]. What are the populations of the high-prevalence counties? Why do you suppose the data looks this way? If you were leading a public-health campaign to reduce drunk driving, would you target your intervention based on report count or prevalence? Why?

  3. We can generalize the most-duis function to handle any type of crime. Write a function most-prevalent which takes a file and a field name, like :arson, and finds the counties where that field is most often reported, per capita.

  4. Write a test to verify that most-prevalent is correct.

A few weeks ago I criticized a proposal by Antirez for a hypothetical linearizable system built on top of Redis WAIT and a strong coordinator. I showed that the coordinator he suggested was physically impossible to build, and that anybody who tried to actually implement that design would run into serious problems. I demonstrated those problems (and additional implementation-specific issues) in an experiment on Redis' unstable branch.

Antirez' principal objections, as I understand them, are:

  1. Some readers mistakenly assumed that the system I discussed was a proposal for Redis Cluster.
  2. I showed that the proposal was physically impossible, but didn’t address its safety if it were possible.
  3. The impossible parts of the proposed system could be implemented in a real asynchronous network by layering in additional constraints on the leader election process.

I did not assert that this was a design for Redis Cluster, and the term “Redis Cluster” appeared nowhere in the post. To be absolutely clear: at no point in these posts have I discussed Redis Cluster. Antirez acknowledges that Cluster makes essentially no safety guarantees, so I haven’t felt the need to write about it.

I did, however, provide ample reference to multiple points in the mailing list thread where Antirez made strong claims about the consistency of hypothetical systems built with Redis WAIT and strong failover coordinators, and cited the gist in question as the canonical example thereof. I also thought it was clear that the system Antirez proposed was physically impossible, and that in addition to those flaws I analyzed weaker, practically achievable designs. However, comments on the post, on Twitter, and on Hacker News suggest a clarification is in order.

If Aphyr was interested in a real discussion, I could just agree about the obvious, that if you can totally control the orchestration of the system then synchronous replication is obviously a building block that can be part of a strong consistent system. Apparently he as just interested in spreading FUD. Congratulations.

Allow me to phrase this unambiguously: not only is this system impossible to build, but even if it were possible, it would not be linearizable.

There are obvious flaws in Antirez’s proposal, but I’m not convinced that simply explaining those flaws will do enough good. This is unlikely to be the last of Antirez'–or anyone else’s–consistency schemes, and I can’t possibly QA all of them! Instead, I’d like to raise the level of discussion around linearizability by showing how to find problems in concurrent algorithms–even if you don’t know where those problems lie.

So here, have a repo.

Knossos

Named after the ruins where Linear B was discovered, Knossos identifies whether or not a history of events from a concurrent system is linearizable. We’re going to be working through knossos.core and knossos.redis in this post. I’ll elide some code in this post for clarity, but it’s all there in the repo.

In Knossos, we analyze histories. Histories are a sequence of operations, each of which is a map:

[{:process :c2, :type :invoke, :f :write, :value 850919}
 {:process :c1, :type :ok, :f :write, :value 850914}
 {:process :c1, :type :invoke, :f :read, :value 850919}
 {:process :c1, :type :ok, :f :read, :value 850919}]

:process is the logical thread performing the operation. :invoke marks the start of an operation, and :ok marks its completion. :f is the kind of operation being invoked, and :value is an arbitrary argument for that operation, e.g. the value of a read or write. The interpretation of :f and :value depends on the datatype we’re modeling: for a set, we might support :add, :remove, and :contains.
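For example, a history for a hypothetical set model might look like this–the particular :f and :value conventions here are just for illustration:

[{:process :c1, :type :invoke, :f :add,      :value 5}
 {:process :c1, :type :ok,     :f :add,      :value 5}
 {:process :c2, :type :invoke, :f :contains, :value 5}
 {:process :c2, :type :ok,     :f :contains, :value true}]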

To verify the history, we need a model which verifies that a sequence of operations applied in a particular order is valid. For instance, if we’re describing a register (e.g. a variable in a programming language–a mutable reference that points to a single value), we would like to enforce that every read sees the most recently written value. If we write 1, then write 2, then read, the read should see 2.

We represent a model with the function knossos.core/step which takes a model’s state, and an operation, and returns a new model. If the operation applied to that state would be invalid, step throws.

(defprotocol Model
  (step [model op]))

(defrecord Register [value]
  Model
  (step [r op]
    (condp = (:f op)
      :write (Register. (:value op))
      :read  (if (or (nil? (:value op))      ; We don't know what the read was
                     (= value (:value op)))  ; Read was a specific value
               r
               (throw (RuntimeException.
                        (str "read " (pr-str (:value op))
                             " from register " value)))))))

A Register implements the Model protocol and defines two functions: :write, which returns a modified copy of the register with the new value, and :read, which returns the register itself–if the read corresponds to the current value.

In a real experiment (as opposed to a mathematical model), we may not know what the read’s value will be until it returns. We allow any read with an unknown (nil?) value to succeed; when the read comes back we can re-evaluate the model with the value in mind.

This definition of step lets us reduce a sequence of operations over the model to produce a final state:

user=> (reduce step (Register. 0)
               [{:process :c1, :type :ok, :f :write, :value 4}
                {:process :c2, :type :ok, :f :read, :value 4}])
#knossos.core.Register{:value 4}

user=> (reduce step (Register. 0)
               [{:process :c1, :type :ok, :f :write, :value 4}
                {:process :c2, :type :ok, :f :read, :value 7}])
RuntimeException read 7 from register 4  knossos.core.Register (core.clj:43)

Now our problem consists of taking a history with pairs of (invoke, ok) operations, and finding an equivalent history of single operations which is consistent with the model. This equivalent single-threaded history is called a linearization; a system is linearizable if at least one such history exists. The actual definition is a bit more complicated, accounting for unmatched invoke/ok pairs, but this is a workable lay definition.

Finding linearizations

The space of possible histories is really big. If we invoke 10 operations and none of them return OK, any subset of those operations might have taken place. So first we have to take the power set of the incomplete operations: that’s 2^n. Then for each of those subsets we have to compute every possible interleaving of operations. If every operation’s invocation and completion overlap, we construct a full permutation. That’s m!.

“Did you just tell me to go fuck myself?”

“I believe I did, Bob.”

My initial approach was to construct a radix tree of all possible histories (or, equivalently, a transition graph), and try to exploit degeneracy to prune the state space. Much of the literature on linearizability generates the full set of sequential histories and tests each one separately. Microsoft Research’s PARAGLIDER, in the absence of known linearization points, relies on this brute-force approach using the SPIN model checker.

A straightforward way to automatically check whether a concurrent history has a corresponding linearization is to simply try all possible permutations of the concurrent history until we either find a linearization and stop, or fail to find one and report that the concurrent history is not linearizable. We refer to this approach as Automatic Linearization… Despite its inherent complexity costs, we do use this method for checking concurrent histories of small length (e.g. less than 20). In practice, the space used for concurrent algorithms is typically small because incorrect algorithms often exhibit an incorrect concurrent history which is almost sequential.

In my experiments, enumerating all interleavings and testing each one started to break down around 12 to 16-operation histories.

Burckhardt, Dern, Musuvathi, and Tan wrote Line-Up, which verifies working C# algorithms by enumerating all thread interleavings through the CHESS model checker. This limits Line-Up to verifying only algorithms with a small state space–though this tradeoff allows them to do some very cool things around blocking and data races.

Two papers I know of attempt to reduce the search space itself. Golab, Li, and Shah developed a wicked-smart online checker using Gibbons and Korach’s algorithm and dynamic programming, but GK applies only to registers; I’d like to be able to test sets, queues, and other, more complex datatypes. Yang Liu, et al use both state space symmetry and commutative operations to drastically prune the search space for their linearizability analysis, using the PAT model checker.

I haven’t built symmetry reduction yet, but I do have a different trick: pruning the search space incrementally, as we move through the history, by using the model itself. This is a lot more complex than simply enumerating all possible interleavings–but if we can reject a branch early in the history, it saves huge amounts of work later in the search. The goal is to keep the number of possible worlds bounded by the concurrency of the history, not the length of the history.

So let’s do something paradoxical. Let’s make the problem even harder by multiplying the state space by N. Given a history of four invocations

[a b c d]

Let’s consider the N histories

[]
[a]
[a b]
[a b c]
[a b c d]

[] is trivially linearizable; nothing happens. [a] has two possible states: in one timeline, a completes. In another, a does not complete–remember, calls can fail. Assuming [a] passes the model, both are valid linearizations.

For the history [a b] we have five options. Neither a nor b can occur, one or the other could occur, or both could occur, in either order.

[]
[a]
[b]
[a b]
[b a]

Let’s say the register is initially nil, a is “write 5”, and b is “read 5”. [b] can’t take place on its own because we can’t read nil and get 5, and [b a] is invalid for the same reason. So we test five possibilities and find three linearizations. Not too bad, but we’re starting to see a hint of that n! explosion. By the third juncture we’ll have 16 sequential histories to test:

user=> (mapcat permutations (subsets ['a 'b 'c]))
([] (a) (b) (c)
 (a b) (b a) (a c) (c a) (b c) (c b)
 (a b c) (a c b) (b a c) (b c a) (c a b) (c b a))

And by the time we have to work with 10 operations concurrently, we’ll be facing 9864101 possible histories to test; it’ll take several minutes to test that many. But here’s the key: only some of those histories are even reachable, and we already have a clue as to which.
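As a sanity check on that figure, here’s a quick back-of-the-envelope calculation–these helper functions are mine, not part of Knossos. We sum, over every subset size k, the number of ways to choose and order k of the n operations.

(defn factorial [n] (reduce *' 1 (range 1 (inc n))))

(defn choose
  "The binomial coefficient: ways to pick k of n operations."
  [n k]
  (/ (factorial n) (*' (factorial k) (factorial (- n k)))))

(defn candidate-histories
  "How many sequential histories exist for n wholly concurrent,
  incomplete operations: sum over k of (choose n k) * k! orderings."
  [n]
  (reduce +' (map #(*' (choose n %) (factorial %)) (range (inc n)))))

user=> (candidate-histories 10)
9864101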

The 3-operation histories will include some histories which we already tested. [b a c], for instance, begins with [b a]; so if we already tested [b a] and found it impossible, we don’t even have to test [b a c] at all. The same goes for [b c a]–and every history, of any length, which begins with b.

So instead of testing all six 3-operation histories, we only have to test four. If the model rejects some of those, we can use those prefixes to reject longer histories, and so on. This dramatically cuts the state space, allowing us to test much longer histories in reasonable time.
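To make prefix pruning concrete, here’s a toy predicate in terms of the Register model from earlier–a sketch, not Knossos’ actual search:

(defn valid-prefix?
  "True if applying ops to model, in order, never throws."
  [model ops]
  (try (reduce step model ops)
       true
       (catch RuntimeException _ false)))

; If a two-operation prefix fails against (Register. nil), every longer
; ordering beginning with that prefix can be skipped without testing.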

Knossos uses Clojure’s immutable, structurally shared data structures to implement this search efficiently. We reduce over the history in order, maintaining a set of possible worlds. Every invocation bifurcates the set of worlds into those in which the operation happens immediately, and those in which it is deferred for later. Every completion prunes the set of worlds to only those in which the given operation completed. We can then ask Knossos to produce a set of worlds–linearized histories and the resulting states of their models–which are consistent with a given concurrent history.

Testing a model

Now let’s apply that linearizability checker to a particular system. We could measure a system experimentally, like I’ve done with Jepsen, or we could generate histories based on a formal model of a system. As an example, let’s test the model suggested by Antirez, describing a linearizable system built on top of Redis, WAIT, and a magical coordinator. As I described earlier, this model is physically impossible; it cannot be built because the coordinator would need to violate the laws of physics. But let’s pretend we live on Planet Terah, and see whether the system is actually sound.

Antirez writes:

There are five nodes, using master-slave replication. When we start A is the master.

The nodes are capable of synchronous replication, when a client writes, it gets as relpy the number or replicas that received the write. A client can consider a write accepted only when “3” or more is returned, otherwise the result is not determined (false negatives are possbile).

Every node has a serial number, called the replication offset. It is always incremented as the replication stream is processed by a replica. Replicas are capable of telling an external entity, called “the controller”, what is the replication offset processed so far.

At some point, the controller, dictates that the current master is down, and decides that a failover is needed, so the master is switched to another one, in this way:

  1. The controller completely partition away the current master.
  2. The controller selects, out of a majority of replicas that are still available, the one with the higher replication offset.
  3. The controller tells all the reachable slaves what is the new master: the slaves start to get new data from the new master.
  4. The controller finally reconfigure all the clients to write to the new master.

So everything starts again. We assume that a re-appearing master, or other slaves that are again available after partitions heal, are capable of understand what the new master is. However both the old master and the slaves can’t accept writes. Slaves are read-only, while the re-apprearing master will not be able to return the majority on writes, so the client will not be able to consider the writes accepted.

In this model, it is possible to reach linearizability? I believe, yes, because we removed all the hard part, for which the strong protocols like Raft use epochs.

If you’ve spotted some of the problems in this approach, good work! But let’s say there were no obvious problems, and we weren’t sure how to find some. To do this, we’ll need a description of the system which is unambiguous and complete. Something a computer can understand.

First off, let’s describe a node:

(defn node
  "A node consists of a register, a primary it replicates from, whether
  it is isolated from all other nodes, a local replication offset, and a
  map of node names to known replication offsets."
  [name]
  {:name     name
   :register nil
   :primary  nil
   :isolated false
   :offset   0
   :offsets  {}})

Seems straightforward enough. This is a really simple model of a Redis server–one which only has a single register to read and write. We could extend it with more complex types, like lists and sets, but we’re trying to keep things simple. Notice how things like “Every node has a serial number, called the replication offset” have been translated into a field in a structure. We’ve also encoded things which were implicit in the proposal, like the fact that the WAIT command relies on the node knowing the replication offsets of its peers.

Remember, in proofs we try to deal as much as possible with immutable, pure systems; Clojure, Erlang, ML, and Haskell all lend themselves naturally to this approach. If you’re writing your checker in something like Ruby or Java, try to write immutable code anyway. It may be a bit unnatural, but it’ll really simplify things later.

(defn client
  "A client is a singlethreaded process which can, at any time, have at
  most one request in-flight to the cluster. It has a node that it uses
  for reads and writes, and an in-flight request. Clients can be waiting
  for a response, in which case :waiting will be the replication offset
  from the primary they're awaiting. :writing is the value they're
  waiting for, if conducting a write."
  [name]
  {:name    name
   :node    nil
   :writing nil
   :waiting nil})

We’ll also need a coordinator. This one’s simple:

(defn coordinator
  "A controller is an FSM which manages the election process for nodes.
  It comprises a state (the phase of the election cycle it's in), and
  the current primary."
  [primary]
  {:state   :normal
   :primary primary})

Next we’ll put all these pieces together into a full system. Phrases like “When we start A is the master,” are translated into code which picks the first node as the primary, and code which ensures that primary state is reflected by the other nodes and the coordinator.

(defn system
  "A system is comprised of a collection of nodes, a collection of
  clients, and a coordinator; plus a *history*, which is the set of
  operations we're verifying is linearizable."
  []
  (let [node-names [:n1 :n2 :n3]
        nodes      (->> node-names
                        (map node)
                        ; Fill in offset maps
                        (map (fn [node]
                               (->> node-names
                                    (remove #{(:name node)})
                                    (reduce #(assoc %1 %2 0) {})
                                    (assoc node :offsets)))))
        ; Initial primary/secondary state
        [primary & secondaries] nodes
        nodes      (cons primary
                         (map #(assoc % :primary (:name primary))
                              secondaries))
        ; Construct a map of node names to nodes
        nodes      (->> nodes
                        (map (juxt :name identity))
                        (into {}))
        ; Construct clients
        clients    (->> [:c1 :c2]
                        (map client)
                        (map #(assoc % :node (:name primary)))
                        (map (juxt :name identity))
                        (into {}))]
    {:coordinator (coordinator (:name primary))
     :clients     clients
     :nodes       nodes
     :history     []}))

Note that we’ve introduced, for any given state of the system, the history of operations which brought us to this point. This is the same history that we’ll be evaluating using our linearizability checker.

This formally describes the state of the model. Now we need to enumerate the state transitions which bring the system from one state to another.

State transitions

First, we need a model of Redis reads and writes. Writes have two phases: an invocation and a response to the client–implemented with WAIT.

(def write-state (atom 0))

(defn client-write
  "A client can send a write operation to a node."
  [system]
  (->> system
       clients
       (filter free-client?)
       (filter (partial valid-client? system))
       (map (fn [client]
              (let [; Pick a value to write
                    value  (swap! write-state inc)
                    ; Find the node name for this client
                    node   (:node client)
                    ; And the new offset.
                    offset (inc (get-in system [:nodes node :offset]))]
                (-> system
                    (assoc-in [:nodes node :register] value)
                    (assoc-in [:nodes node :offset] offset)
                    (assoc-in [:clients (:name client) :waiting] offset)
                    (assoc-in [:clients (:name client) :writing] value)
                    (log (invoke-op (:name client) :write value))))))))

client-write is a function which takes a system and returns a sequence of possible systems, each of which corresponds to one client initiating a write to its primary. We encode multiple constraints here:

  1. Clients can only initiate a write when they are not waiting for another response–i.e. clients are singlethreaded.
  2. Clients must be connected to a node which is not isolated and thinks that it is a primary. Note that this assumes a false linearization point: in the real world, these checks are not guaranteed to be instantaneous. We are being overly generous to simplify the model.

For each of these clients, we generate a unique number to write using (swap! write-state inc), set the primary’s register to that value, increment the primary’s offset, and update the client–it is now waiting for that particular replication offset to be acknowledged by a majority of nodes. We also keep track of the value we wrote, just so we can fill it into the history later.

Finally, we update the history of the system, adding an invocation of a write, from this client, for the particular value.

When a client’s primary determines that a majority of nodes have acknowledged the offset that the client is waiting for, we can complete the write operation.

(defn client-write-complete
  "A reachable primary node can inform a client that its desired
  replication offset has been reached."
  [system]
  (->> system
       clients
       (remove free-client?)
       (filter (partial valid-client? system))
       (keep (fn [client]
               (let [offset (-> system
                                :nodes
                                (get (:node client))
                                majority-acked-offset)]
                 (when (<= (:waiting client) offset)
                   (-> system
                       (assoc-in [:clients (:name client) :waiting] nil)
                       (assoc-in [:clients (:name client) :writing] nil)
                       (log (ok-op (:name client)
                                   :write
                                   (:writing client))))))))))

Again, note the constraints: we can’t always complete writes. Only when the client is waiting, and the client is connected to a valid non-isolated primary, and the replication offset is acked by a majority of nodes, can these transitions take place.

keep is a Clojure function analogous to map and filter combined: only non-nil results appear in the output sequence. We use keep here to compactly express that only clients which have satisfied the majority offset acknowledgement constraint are eligible for completion.
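In miniature:

user=> (keep #(when (odd? %) (* % %)) [1 2 3 4 5])
(1 9 25)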

Reads are similar to writes, but we make another generous allowance: reads are assumed to be a linearization point of the model, and therefore take place instantaneously. We add both invocation and completion operations to the log in one step.

(defn client-read
  "A client can read a value from its node, if primary and reachable.
  Reads are instantaneous."
  [system]
  (->> system
       clients
       (filter free-client?)
       (filter (partial valid-client? system))
       (map (fn [client]
              (let [node  (:node client)
                    value (get-in system [:nodes node :register])]
                (-> system
                    (log (invoke-op (:name client) :read nil))
                    (log (ok-op (:name client) :read value))))))))

Replication

Redis replication is asynchronous: in one phase the replica copies data from the primary, and in a second phase, updates the primary with its replication offset. We assume that each phase takes place instantaneously. Is replication actually a linearization point in Redis? I don’t know–but we’ll be generous again.

(defn replicate-from-primary
  "A node can copy the state of its current primary, if the primary is
  reachable."
  [system]
  (->> system
       nodes
       (remove :isolated)
       (keep (fn [node]
               (when-let [primary (get-node system (:primary node))]
                 (when-not (:isolated primary)
                   (-> system
                       (assoc-in [:nodes (:name node) :register]
                                 (:register primary))
                       (assoc-in [:nodes (:name node) :offset]
                                 (:offset primary))
                       (log (op (:name node)
                                :info
                                :replicate-from-primary
                                (:primary node))))))))))

Pretty straightforward: each node can, if it has a primary and neither is isolated, copy its register state and offset. We’ll be generous and assume the primary’s total oplog is applied instantly and atomically.

The acknowledgement process is basically the reverse: we update the offset cache in the primary, so long as nodes are connected.

(defn ack-offset-to-primary
  "A node can inform its current primary of its offset, if the primary
  is reachable."
  [system]
  (->> system
       nodes
       (remove :isolated)
       (keep (fn [node]
               (when-let [primary (get-node system (:primary node))]
                 (when-not (:isolated primary)
                   (-> system
                       (assoc-in [:nodes (:primary node)
                                  :offsets (:name node)]
                                 (:offset node))
                       (log (op (:name node)
                                :info
                                :ack-offset-to-primary
                                (:primary node))))))))))

Failover

Four functions, corresponding to each of the four steps in the algorithm. We ensure they happen in order by ensuring that a transition can only take place if the coordinator just completed the previous step.

1) The controller completely partition away the current master.

(defn failover-1-isolate
  "If the coordinator is in normal mode, initiates failover by isolating
  the current primary."
  [system]
  (let [coord (:coordinator system)]
    (when (= :normal (:state coord))
      (-> system
          (assoc-in [:coordinator :state] :isolated)
          (assoc-in [:coordinator :primary] nil)
          (assoc-in [:nodes (:primary coord) :isolated] true)
          (log (op :coord :info :failover-1-isolate (:primary coord)))))))

Notice how we formalized the English statement by encoding properties of the network throughout the model: each state transition checks the partitioned state of the nodes involved. This is an oversimplification of the real system, because this part of the algorithm is impossible in an asynchronous network: it modifies the current primary’s state directly instead of sending it a message. We’re being generous by assuming the network propagates messages instantly; a more thorough model would explicitly model the loss and delay of messages.

2) The controller selects, out of a majority of replicas that are still available, the one with the higher replication offset.

Again, translation is straightforward; in the model we can freely violate the laws of physics.

(defn failover-2-select
  "If the coordinator has isolated the old primary, selects a new
  primary by choosing the reachable node with the highest offset."
  [system]
  (let [coord (:coordinator system)]
    (when (= :isolated (:state coord))
      (let [candidates (->> system nodes (remove :isolated))]
        ; Gotta reach a majority
        (when (<= (inc (Math/floor (/ (count (nodes system)) 2)))
                  (count candidates))
          (let [primary (:name (apply max-key :offset candidates))]
            (-> system
                (assoc-in [:coordinator :state] :selected)
                (assoc-in [:coordinator :primary] primary)
                (log (op :coord :info :failover-2-select primary)))))))))

3) The controller tells all the reachable slaves what is the new master: the slaves start to get new data from the new master.

You know the drill. We create a false point of linearization and assume this broadcast is atomic.

(defn failover-3-inform-nodes
  "If the coordinator has selected a new primary, broadcasts that
  primary to all reachable nodes."
  [system]
  (let [coord   (:coordinator system)
        primary (:primary coord)]
    (when (= :selected (:state coord))
      (-> system
          (assoc-in [:coordinator :state] :informed-nodes)
          (assoc :nodes
                 (->> system
                      :nodes
                      (map (fn [[name node]]
                             [name
                              (cond
                                ; If the node is isolated, state is
                                ; unchanged.
                                (:isolated node)
                                node

                                ; If this is the new primary node, make
                                ; it a primary.
                                (= primary name)
                                (assoc node :primary nil)

                                ; Otherwise, set the primary.
                                :else
                                (assoc node :primary primary))]))
                      (into {})))
          (log (op :coord :info :failover-3-inform-nodes primary))))))

4) The controller finally reconfigure all the clients to write to the new master.

Here too!

(defn failover-4-inform-clients
  "If the coordinator has informed all nodes of the new primary, update
  all client primaries."
  [system]
  (let [coord   (:coordinator system)
        primary (:primary coord)]
    (when (= :informed-nodes (:state coord))
      (-> system
          (assoc-in [:coordinator :state] :normal)
          (assoc :clients
                 (->> system
                      :clients
                      (map (fn [[name client]]
                             [name (assoc client :node primary)]))
                      (into {})))
          (log (op :coord :info :failover-4-inform-clients primary))))))

At each step there is exactly one failover transition that can happen–since the coordinator is magically sequential and never fails.

(defn failover
  "All four failover stages combined."
  [system]
  (when-let [system' (or (failover-1-isolate        system)
                         (failover-2-select         system)
                         (failover-3-inform-nodes   system)
                         (failover-4-inform-clients system))]
    (list system')))

Putting it all together

We assume that a re-appearing master, or other slaves that are again available after partitions heal, are capable of understand what the new master is.

I struggled with this, and I actually don’t know how to interpret this part of the proposal. Erring on the side of safety, let’s omit any resurrection of isolated nodes. Once a node fails, it stays dead forever. If you let them come back, things get much more dangerous.

Only one last part remains: we need to express, in a single function, every allowable state transition.

(defn step
  "All systems reachable in a single step from a given system."
  [system]
  (concat (client-write           system)
          (client-write-complete  system)
          (client-read            system)
          (replicate-from-primary system)
          (ack-offset-to-primary  system)
          (failover               system)))

OK. So now we can evolve any particular state of the system in various directions. Let’s take a look at a basic system:

user=> (use 'knossos.redis)
nil
user=> (-> (system) pprint)
{:coordinator {:state :normal, :primary :n1},
 :clients
 {:c1 {:name :c1, :node :n1, :writing nil, :waiting nil},
  :c2 {:name :c2, :node :n1, :writing nil, :waiting nil}},
 :nodes
 {:n1
  {:name :n1,
   :register nil,
   :primary nil,
   :isolated false,
   :offset 0,
   :offsets {:n3 0, :n2 0}},
  :n2
  {:name :n2,
   :register nil,
   :primary :n1,
   :isolated false,
   :offset 0,
   :offsets {:n3 0, :n1 0}},
  :n3
  {:name :n3,
   :register nil,
   :primary :n1,
   :isolated false,
   :offset 0,
   :offsets {:n2 0, :n1 0}}},
 :history []}

What happens if we do a write?

user=> (-> (system) client-write rand-nth pprint)
{:coordinator {:state :normal, :primary :n1},
 :clients
 {:c1 {:name :c1, :node :n1, :writing nil, :waiting nil},
  :c2 {:name :c2, :node :n1, :writing 10, :waiting 1}},
 :nodes
 {:n1
  {:name :n1,
   :register 10,
   :primary nil,
   :isolated false,
   :offset 1,
   :offsets {:n3 0, :n2 0}},
  :n2
  {:name :n2,
   :register nil,
   :primary :n1,
   :isolated false,
   :offset 0,
   :offsets {:n3 0, :n1 0}},
  :n3
  {:name :n3,
   :register nil,
   :primary :n1,
   :isolated false,
   :offset 0,
   :offsets {:n2 0, :n1 0}}},
 :history [{:process :c2, :type :invoke, :f :write, :value 10}]}

Notice that client-write returns two systems: one in which :c1 writes, and one in which :c2 writes. We pick a random possibility using rand-nth. In this case, :c2 wrote the number 10 to :n1, and is waiting for replication offset 1 to be acknowledged. :n1, but not :n2 or :n3, has received the write. Note the history of this system reflects the invocation, but not the completion, of this write.

Let’s try to complete the write:

user=> (-> (system) client-write rand-nth client-write-complete pprint) ()

There are no possible worlds where the write can complete at this point. Why? Because the primary’s replication offset hasn’t been acknowledged by any secondary yet. This is the whole point of Redis WAIT: we can’t consider a write complete until it’s been acknowledged.

user=> (-> (system) client-write rand-nth replicate-from-primary first
           ack-offset-to-primary first client-write-complete pprint)
({:coordinator {:state :normal, :primary :n1},
  :clients {:c1 {:name :c1, :node :n1, :writing nil, :waiting nil},
            :c2 {:name :c2, :node :n1, :writing nil, :waiting nil}},
  :nodes {:n1 {:name :n1, :register 15, :primary nil, :isolated false,
               :offset 1, :offsets {:n3 0, :n2 1}},
          :n2 {:name :n2, :register 15, :primary :n1, :isolated false,
               :offset 1, :offsets {:n3 0, :n1 0}},
          :n3 {:name :n3, :register nil, :primary :n1, :isolated false,
               :offset 0, :offsets {:n2 0, :n1 0}}},
  :history [{:process :c1, :type :invoke, :f :write, :value 15}
            {:process :n2, :type :info, :f :replicate-from-primary, :value :n1}
            {:process :n2, :type :info, :f :ack-offset-to-primary, :value :n1}
            {:process :c1, :type :ok, :f :write, :value 15}]})

A successful write! The value 15 has been replicated to both :n1 and :n2, and with the offset map on :n1 updated, the WAIT request for the client can complete. The history reflects the invocation and completion of :c1’s write request.

Having written down the state of the system, and encoded all possible state transitions in the step function, we can find random trajectories through the system by interleaving calls to step and rand-nth. Because we don’t allow the resurrection of nodes, this system can simply halt, unable to make progress. In that case, we simply return the terminal state.

(defn trajectory
  "Returns a system from a randomized trajectory, `depth` steps away from the given system."
  [system depth]
  (if (zero? depth)
    system
    (let [possibilities (step system)]
      (if (empty? possibilities)
        ; Dead end
        system
        ; Descend
        (recur (rand-nth possibilities) (dec depth))))))

Because our trajectory evolution is randomized, the histories it generates will often contain extraneous garbage–repeated sequences of identical reads, for instance, or replicating the same state over and over again. We could go back and re-explore the state space, omitting certain transitions in search of a simpler trajectory–but for now, we’ll take the random trajectories.

Model checking

We’ve built a simple model of a single-threaded linearizable register, a concurrent model of a hypothetical Redis system, and a verifier which tests that a history is linearizable with respect to a single-threaded model. Now let’s combine these three elements.

First, a way to show the system that we wound up in, and the history that led us there. We’ll use linearizable-prefix to find the longest prefix of the history that was still linearizable–that’ll help show where, exactly, we ran out of options.

(defn print-system
  [system history]
  (let [linearizable (linearizable-prefix (->Register nil) history)]
    (locking *out*
      (println "\n\n### No linearizable history for system ###\n")
      (pprint (dissoc system :history))
      (println "\nHistory:\n")
      (pprint linearizable)
      (println "\nUnable to linearize past this point!\n")
      (pprint (drop (count linearizable) history)))))

Then we’ll generate a bunch of trajectories of, say, 15 steps apiece, and show any which have nonlinearizable histories.

(deftest redis-test
  (dothreads [i 4] ; hi haters
    (dotimes [i 10000]
      (let [system (trajectory (system) 15)]
        ; Is this system linearizable?
        (let [history (complete (:history system))
              linears (linearizations (->Register nil) history)]
          (when (empty? linears)
            (print-system system history))
          (is (not (empty? linears))))))))

And we’re ready to go. Is the model Antirez proposed linearizable?

$ lein test knossos.redis-test

### No linearizable history for system ###

{:coordinator {:state :normal, :primary :n2},
 :clients {:c1 {:name :c1, :node :n2, :writing nil, :waiting nil},
           :c2 {:name :c2, :node :n2, :writing 9, :waiting 2}},
 :nodes {:n1 {:name :n1, :register 9, :primary nil, :isolated true,
              :offset 2, :offsets {:n3 0, :n2 1}},
         :n2 {:name :n2, :register 5, :primary nil, :isolated false,
              :offset 1, :offsets {:n3 0, :n1 0}},
         :n3 {:name :n3, :register nil, :primary :n2, :isolated false,
              :offset 0, :offsets {:n2 0, :n1 0}}}}

History:

[{:process :c2, :type :invoke, :f :write, :value 5}
 {:process :n2, :type :info, :f :replicate-from-primary, :value :n1}
 {:process :n2, :type :info, :f :ack-offset-to-primary, :value :n1}
 {:process :c2, :type :ok, :f :write, :value 5}
 {:process :n2, :type :info, :f :replicate-from-primary, :value :n1}
 {:process :c2, :type :invoke, :f :write, :value 9}
 {:process :n3, :type :info, :f :ack-offset-to-primary, :value :n1}
 {:process :c1, :type :invoke, :f :read, :value 9}
 {:process :c1, :type :ok, :f :read, :value 9}
 {:process :coord, :type :info, :f :failover-1-isolate, :value :n1}
 {:process :coord, :type :info, :f :failover-2-select, :value :n2}
 {:process :coord, :type :info, :f :failover-3-inform-nodes, :value :n2}
 {:process :coord, :type :info, :f :failover-4-inform-clients, :value :n2}
 {:process :n3, :type :info, :f :ack-offset-to-primary, :value :n2}
 {:process :n3, :type :info, :f :ack-offset-to-primary, :value :n2}
 {:process :c1, :type :invoke, :f :read, :value 5}]

Unable to linearize past this point!

({:process :c1, :type :ok, :f :read, :value 5})

lein test :only knossos.redis-test/redis-test

FAIL in (redis-test) (redis_test.clj:44)
expected: (not (empty? linears))
  actual: (not (not true))

Ran 1 tests containing 38340 assertions.
6 failures, 0 errors.

No, it isn’t.

What happened here?

Knossos generated a state of the system which it believes is possible, under the rules of the Redis model we constructed, but not linearizable with respect to the register model. Up until the final read, Knossos could still construct a world where things made sense, but that last read was inconsistent with every possible interpretation. Why?

Well, in the final state, n1 is isolated with value 9 at offset 2, n2 is a primary with value 5 at offset 1, and n3 thinks the value is nil. n3’s offset is 0; it never participated in this history, so we can ignore it.

  1. First, c2 writes 5 to n1. n2 replicates and acknowledges the write of 5, and c2’s write completes.
  2. n2 initiates a (noop) replication from n1.
  3. c2 initiates a write of 9 to n1. c1 concurrently initiates a read from n1, which will see 9.
  4. n2 completes its replication; state is unchanged.
  5. Then, a failover occurs. The coordinator performs all four steps atomically, so no concurrency questions there. n2 is selected as the new primary and n1 is isolated.
  6. n3 acknowledges its offset of 0 to n2 twice, both of which are noops since n2 already thinks n3’s offset is 0.
  7. Finally, c1 invokes a read from n2 and sees 5. This is the read which proves the system is inconsistent. Up until this point the history has been linearizable–we could have assumed, for instance, that the write of 9 failed and the register has always been 5, but that assumption was invalidated by the successful read of 9 by c1 earlier. We also could have assumed that the final read of 5 failed–but when it succeeded, Knossos ran out of options.

This case demonstrates that reads are a critical aspect of linearizability. Redis WAIT is not transactional. It allows clients to read unreplicated state from the primary node, which is just as invalid as reading stale data from secondaries.

I hope this illustrates beyond any shred of doubt: not only is Antirez’s proposal physically impossible, but even wildly optimistic formal interpretations of his proposal are trivially non-linearizable.

Yeah, this is Fear, Uncertainty, and Doubt. You should be uncertain about algorithms without proofs. You should doubt a distributed system without a formal model. You should be fearful that said system will not live up to its claims.

Now you’ve got another tool to validate that uncertainty for yourself.

Math is hard; let’s go model checking

Proving linearizability is hard. Much harder than proving a system is not linearizable, when you get down to it. All I had to do here was find a single counterexample; but proving linearizability requires showing that every history is valid. Traditionally one does this by identifying all the linearization points of an algorithm–the points where things take place atomically–which is a subtle and complex process, especially where the linearization point depends on runtime behavior or lies outside the code itself.

Moreover, I am not that great at proofs–and I don’t want to exclude readers who don’t have the benefit of formal training. I want to equip ordinary programmers with the motivation and tools to reason about their systems–and for that, model checking is a terrific compromise.

There are many tools available for formal modeling of concurrency. Leslie Lamport’s TLA+ is the canonical tool for concurrency proofs, but its learning curve is steep to say the least and I have a lot of trouble trying to compose its models. Bell Labs' Spin is more accessible for programmers, encoding its models in a language called Promela. Spin has excellent tooling–it can even extract models from C code with assistance. There’s also Erigone, a reimplementation of Spin, and the aforementioned Line-Up for C#.

Knossos is a dead-simple verification system I hacked out in a week; it takes advantage of Clojure’s concise data-structure literals, immutable shared-state data structures, and concise syntax to make designing models and checking their linearizability easier. Knossos probably has some bugs, so be sure to check the failure cases by hand!

No matter which model checker you use, all of these systems let you formalize your algorithms by writing them down in a concise, unambiguous form–either a modeling language or a full programming language–and then verify that those models conform to certain invariants by exploring the state space. Paired with a proof assistant, some of these tools can go further, proving that the invariants always hold rather than merely showing when they fail.

We verified a toy system in this blog post, but all the key elements are there: state, transition functions, invariants, and a model to verify against. We used hierarchical data structures and functions to break the model into smaller, more manageable pieces, and generated counterexamples from probabilistic trajectories through the model.

Real models look just like this. Take a look at the model and proof sketch of the RAFT consensus algorithm, and see if you can spot the definitions of state, transitions, and invariants. Note that this isn’t a full proof–more like a sketch–and it relies on some propositions like type safety which are not mechanically verified, but this paper illustrates both formal and English proof techniques nicely.

This is the kind of argument you need to make, as a database engineer, before asserting a given system is linearizable. Formal verification will catch both obvious and subtle bugs before you, or your readers, try to implement the design.

Most programs encompass change. People grow up, leave town, fall in love, and take new names. Engines burn through fuel while their parts wear out, and new ones are swapped in. Forests burn down and their logs become nurseries for new trees. Despite these changes, we say “She’s still Nguyen”, “That’s my motorcycle”, “The same woods I hiked through as a child.”

Identity is a skein we lay across the world of immutable facts; a single entity which encompasses change. In programming, identities unify different values over time. Identity types are mutable references to immutable values.

In this chapter, we’ll move from immutable references to complex concurrent transactions. In the process we’ll get a taste of concurrency and parallelism, which will motivate the use of more sophisticated identity types. These are not easy concepts, so don’t get discouraged. You don’t have to understand this chapter fully to be a productive programmer, but I do want to hint at why things work this way. As you work with state more, these concepts will solidify.

Immutability

The references we’ve used in let bindings and function arguments are immutable: they never change.

user=> (let [x 1] (prn (inc x)) (prn (inc x))) 2 2

The expression (inc x) did not alter x: x remained 1. The same applies to strings, lists, vectors, maps, sets, and most everything else in Clojure:

user=> (let [x [1 2]] (prn (conj x :a)) (prn (conj x :b))) [1 2 :a] [1 2 :b]

Immutability also extends to let bindings, function arguments, and other symbols. Functions remember the values of those symbols at the time the function was constructed.

(defn present [gift] (fn [] gift)) user=> (def green-box (present "clockwork beetle")) #'user/green-box user=> (def red-box (present "plush tiger")) #'user/red-box user=> (red-box) "plush tiger" user=> (green-box) "clockwork beetle"

The present function creates a new function. That function takes no arguments, and always returns the gift. Which gift? Because gift is not an argument to the inner function, it refers to the value from the outer function body. When we packaged up the red and green boxes, the functions we created carried with them a memory of the gift symbol’s value.

This is called closing over the gift variable; the inner function is sometimes called a closure. In Clojure, new functions close over all variables except their arguments–the arguments, of course, will be provided when the function is invoked.

Delays

Because function bodies aren’t evaluated until the function is invoked, functions can be used to defer evaluation of expressions. That’s how we introduced functions originally–like let expressions, but with a number (maybe zero!) of symbols missing, to be filled in at a later time.

user=> (do (prn "Adding") (+ 1 2)) "Adding" 3 user=> (def later (fn [] (prn "Adding") (+ 1 2))) #'user/later user=> (later) "Adding" 3

Evaluating (def later ...) did not evaluate the expressions in the function body. Only when we invoked the function later did Clojure print "Adding" to the screen, and return 3. This is the basis of concurrency: evaluating expressions outside their normal, sequential order.

This pattern of deferring evaluation is so common that there’s a standard macro for it, called delay:

user=> (def later (delay (prn "Adding") (+ 1 2))) #'user/later user=> later #<Delay@2dd31aac: :pending> user=> (deref later) "Adding" 3

Instead of a function, delay creates a special type of Delay object: an identity which refers to expressions which should be evaluated later. We extract, or dereference, the value of that identity with deref. Delays follow the same rules as functions, closing over lexical scope–because delay actually macroexpands into an anonymous function.

user=> (source delay)
(defmacro delay
  "Takes a body of expressions and yields a Delay object that will
  invoke the body only the first time it is forced (with force or
  deref/@), and will cache the result and return it on all subsequent
  force calls. See also - realized?"
  {:added "1.0"}
  [& body]
  (list 'new 'clojure.lang.Delay (list* `^{:once true} fn* [] body)))

Why the Delay object instead of a plain old function? Because unlike function invocation, delays only evaluate their expressions once. They remember their value, after the first evaluation, and return it for every successive deref.

user=> (deref later) 3 user=> (deref later) 3

By the way, there’s a shortcut for (deref something): the wormhole operator @:

user=> @later ; Interpreted as (deref later)
3

Remember how map returned a sequence immediately, but didn’t actually perform any computation until we asked for elements? That’s called lazy evaluation. Because delays are lazy, we can avoid doing expensive operations until they’re really needed. Like an IOU, we use delays when we aren’t ready to do something just yet, but when someone calls in the favor, we’ll make sure it happens.

Futures

What if we wanted to opportunistically defer computation? Modern computers have multiple cores, and operating systems let us share a core between two tasks. It would be great if we could use that multitasking ability to say, “I don’t need the result of evaluating these expressions yet, but I’d like it later. Could you start working on it in the meantime?”

Enter the future: a delay which is evaluated in parallel. Like delays, futures return immediately, and give us an identity which will point to the value of the last expression in the future–in this case, the value of (+ 1 2).

user=> (def x (future (prn "hi") (+ 1 2))) "hi" #'user/x user=> (deref x) 3

Notice how the future printed “hi” right away. That’s because futures are evaluated in a new thread. On multicore computers, two threads can run in parallel, on different cores at the same time. When there are more threads than cores, the cores trade off running different threads. Both parallel and non-parallel evaluation of threads are concurrent because expressions from different threads can be evaluated out of order.

user=> (dotimes [i 5] (future (prn i))) 14 3 0 2 nil

Five threads running at once. Notice that the thread printing 1 didn’t even get to move to a new line before 4 showed up–then both threads wrote new lines at the same time. There are techniques to control this concurrent execution so that things happen in some well-defined sequence, like agents and locks, but we’ll discuss those later.

Just like delays, we can deref a future as many times as we want, and the expressions are only evaluated once.

user=> (def x (future (prn "hi") (+ 1 2))) #'user/x"hi" user=> @x 3 user=> @x 3

Futures are the most generic parallel construct in Clojure. You can use futures to do CPU-intensive computation faster, to wait for multiple network requests to complete at once, or to run housekeeping code periodically.
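
For instance, here’s a minimal sketch of that fan-out pattern, assuming slurp stands in for whatever blocking network call you’d really make; fetch-all is a made-up helper, not part of Clojure. Because mapv is eager, every future starts immediately; only the final deref blocks.

(defn fetch-all
  "Fetches every URL in parallel; returns a vector of response bodies."
  [urls]
  (let [responses (mapv (fn [url] (future (slurp url))) urls)]
    ; Each future is already running; deref just waits for stragglers.
    (mapv deref responses)))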

Promises

Delays defer evaluation, and futures parallelize it. What if we wanted to defer something we don’t even have yet? To hand someone an empty box and, later, before they open it, sneak in and replace its contents with an actual gift? Surely I’m not the only one who does birthday presents this way.

user=> (def box (promise)) #'user/box user=> box #<core$promise$reify__6310@1d7762e: :pending>

This box is pending a value. Like futures and delays, if we try to open it, we’ll get stuck and have to wait for something to appear inside:

user=> (deref box)

But unlike futures and delays, this box won’t be filled automatically. Hold the Control key and hit c to give up on trying to open that package. Nobody else is in this REPL, so we’ll have to buy our own presents.

user=> (deliver box :live-scorpions!) #<core$promise$reify__6310@1d7762e: :live-scorpions!> user=> (deref box) :live-scorpions!

Wow, that’s a terrible gift. But at least there’s something there: when we dereference the box, it opens immediately and live scorpions skitter out. Can we get a do-over? Let’s try a nicer gift.

user=> (deliver box :puppy) nil user=> (deref box) :live-scorpions!

Like delays and futures, there’s no going back on our promises. Once delivered, a promise always refers to the same value. This is a simple identity type: we can set it to a value once, and read it as many times as we want. promise is also a concurrency primitive: it guarantees that any attempt to read the value will wait until the value has been written. We can use promises to synchronize a program which is being evaluated concurrently–for instance, this simple card game:

user=> (def card (promise)) #'user/card user=> (def dealer (future (Thread/sleep 5000) (deliver card [(inc (rand-int 13)) (rand-nth [:clubs :spades :hearts :diamonds])]))) #'user/dealer user=> (deref card) [5 :diamonds]

In this program, we set up a dealer thread which waits for five seconds (5000 milliseconds), then delivers a random card. While the dealer is sleeping, we try to deref our card–and have to wait until the five seconds are up. Synchronization and identity in one package.

Where delays are lazy, and futures are parallel, promises are concurrent without specifying how the evaluation occurs. We control exactly when and how the value is delivered. You can think of both delays and futures as being built atop promises, in a way.
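
One more trick worth knowing: for blocking identity types like promises and futures, deref also takes an optional timeout and default value, so a reader can give up gracefully instead of waiting forever. With an undelivered promise:

user=> (def box (promise))
#'user/box
user=> (deref box 1000 :empty-handed) ; wait up to 1000 ms, then give up
:empty-handed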

Vars

So far the identities we’ve discussed have referred (eventually) to a single value, but the real world needs names that refer to different values at different points in time. For this, we use vars.

We’ve touched on vars before–they’re transparent mutable references. Each var has a value associated with it, and that value can change over time. When a var is evaluated, it is replaced by its present value transparently–everywhere in the program.

user=> (def x :mouse) #'user/x user=> (def box (fn [] x)) #'user/box user=> (box) :mouse user=> (def x :cat) #'user/x user=> (box) :cat

The box function closed over x–but calling (box) returned different results depending on the current value of x. Even though the var x remained unchanged throughout this example, the value associated with that var did change!

Using mutable vars allows us to write programs which we can redefine as we go along.

user=> (defn decouple [glider]
  #_=>   (prn "bolts released"))
#'user/decouple
user=> (defn launch [glider]
  #_=>   (decouple glider)
  #_=>   (prn glider "away!"))
#'user/launch
user=> (launch "albatross")
"bolts released"
"albatross" "away!"
nil
user=> (defn decouple [glider]
  #_=>   (prn "tether released"))
#'user/decouple
user=> (launch "albatross")
"tether released"
"albatross" "away!"

A reference which is the same everywhere is called a global variable, or simply a global. But vars have an additional trick up their sleeve: with a dynamic var, we can override their value only within the scope of a particular function call, and nowhere else.

user=> (def ^:dynamic *board* :maple) #'user/*board*

^:dynamic tells Clojure that this var can be overridden in one particular scope. By convention, dynamic variables are named with asterisks around them–this reminds us, as programmers, that they are likely to change. Next, we define a function that uses that dynamic var:

user=> (defn cut [] (prn "sawing through" *board*)) #'user/cut

Note that cut closes over the var *board*, but not the value :maple. Every time the function is invoked, it looks up the current value of *board*.

user=> (cut) "sawing through" :maple nil user=> (binding [*board* :cedar] (cut)) "sawing through" :cedar nil user=> (cut) "sawing through" :maple

Like let, the binding macro assigns a value to a name–but where fn and let create immutable lexical scope, binding creates dynamic scope. The difference? Lexical scope is constrained to the literal text of the fn or let expression–but dynamic scope propagates through function calls.

Within the binding expression, and in every function called from that expression, and every function called from those functions, and so on, *board* has the value :cedar. Outside the binding expression, the value is still :maple. This safety property holds even when the program is executed in multiple threads: only the thread which evaluated the binding expression uses that value. Other threads are unaffected.
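
To see that propagation in action, wrap cut in another function. The binding still reaches it, one call deeper; carpenter here is just an illustrative name.

user=> (defn carpenter [] (cut))
#'user/carpenter
user=> (binding [*board* :cedar] (carpenter))
"sawing through" :cedar
nil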

While we use def all the time in the REPL, in real programs you should mutate vars sparingly. They’re intended for naming functions, important bits of global data, and for tracking the environment of a program–like where to print messages with prn, which database to talk to, and so on. Using vars for mutable program state is a recipe for disaster, as we’re about to see.

Atoms

Vars can be read, set, and dynamically bound–but they aren’t easy to evolve. Imagine building up a set of integers:

user=> (def xs #{}) #'user/xs user=> (dotimes [i 10] (def xs (conj xs i))) user=> xs #{0 1 2 3 4 5 6 7 8 9}

For each number from 0 to 9, we take the current set of numbers xs, add a particular number i to that set, and redefine xs as the result. This is a common idiom in imperative languages like C, Ruby, Javascript, or Java–all variables are mutable by default.

ImmutableSet xs = new ImmutableSet();
for (int i = 0; i < 10; i++) {
  xs = xs.add(i);
}

It seems straightforward enough, but there are serious problems lurking here. Specifically, this program is not thread safe.

user=> (def xs #{}) user=> (dotimes [i 10] (future (def xs (conj xs i)))) #'user/xs nil user=> xs #{1 4 5 7}

This program runs 10 threads in parallel, and each reads the current value of xs, adds its particular number, and defines xs to be that new set of numbers. This read-modify-update process assumed that all updates would be consecutive–not concurrent. When we allowed the program to do two read-modify-updates at the same time, updates were lost.

  1. Thread 2 read #{0 1}
  2. Thread 3 read #{0 1}
  3. Thread 2 wrote #{0 1 2}
  4. Thread 3 wrote #{0 1 3}

This interleaving of operations allowed the number 2 to slip through the cracks. We need something stronger–an identity which supports safe transformation from one state to another. Enter atoms.

user=> (def xs (atom #{})) #'user/xs user=> xs #<Atom@30bb8cc9: #{}>

The initial value of this atom is #{}. Unlike vars, atoms are not transparent. When evaluated, they don’t return their underlying values–but notice that when printed, the current value is hiding inside. To get the current value out of an atom, we have to use deref or @.

user=> (deref xs) #{} user=> @xs #{}

Like vars, atoms can be set to a particular value–but instead of def, we use reset!. The exclamation point (sometimes called a bang) is there to remind us that this function modifies the state of its arguments–in this case, changing the value of the atom.

user=> (reset! xs :foo) :foo user=> xs #<Atom@30bb8cc9: :foo>

Unlike vars, atoms can be safely updated using swap!. swap! uses a pure function which takes the current value of the atom and returns a new value. Under the hood, Clojure does some tricks to ensure that these updates are linearizable, which means:

  1. All updates with swap! complete in what appears to be a single consecutive order.
  2. The effect of a swap! never takes place before calling swap!.
  3. The effect of a swap! is visible to everyone once swap! returns.
user=> (def x (atom 0)) #'user/x user=> (swap! x inc) 1 user=> (swap! x inc) 2

The first swap! reads the value 0, calls (inc 0) to obtain 1, and writes 1 back to the atom. Each call to swap! returns the value that was just written.

We can pass additional arguments to the function swap! calls. For instance, (swap! x + 5 6) will call (+ current 5 6), where current is the atom’s value at the time of the swap, to find the new value. Now we have the tools to correct our parallel program from earlier:

user=> (def xs (atom #{})) #'user/xs user=> (dotimes [i 10] (future (swap! xs conj i))) nil user=> @xs #{0 1 2 3 4 5 6 7 8 9}

Note that the function we use to update an atom must be pure–must not mutate any state–because when resolving conflicts between multiple threads, Clojure might need to call the update function more than once. Clojure’s reliance on immutable datatypes, immutable variables, and pure functions enables this approach to linearizable mutability. Languages which emphasize mutable datatypes need to use other constructs.
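
To see why purity matters, sketch an update function that logs as it goes. Single-threaded, this prints once; under contention, swap! may retry, and the message could appear several times for a single logical update.

user=> (def counter (atom 0))
#'user/counter
user=> (swap! counter (fn [n] (prn "adding!") (inc n)))
"adding!"
1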

Atoms are the workhorse of Clojure state. They’re lightweight, safe, fast, and flexible. You can use atoms with any immutable datatype–for instance, a map to track complex state. Reach for an atom whenever you want to update a single thing over time.
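
For instance, here’s a sketch of an atom wrapping a map of counters. The stat names are invented, but update-in is standard: it applies a function to a value nested at the given path.

user=> (def stats (atom {:requests 0, :errors 0}))
#'user/stats
user=> (swap! stats update-in [:requests] inc)
{:requests 1, :errors 0}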

Refs

Atoms are a great way to represent state, but they are only linearizable individually. Updates to an atom aren’t well-ordered with respect to other atoms, so if we try to update more than one atom at once, we could see the same kinds of bugs that we did with vars.

For multi-identity updates, we need a stronger safety property than single-atom linearizability. We want serializability: a global order. For this, Clojure has an identity type called a Ref.

user=> (def x (ref 0)) #'user/x user=> x #<Ref@1835d850: 0>

Like all identity types, refs are dereferencable:

user=> @x 0

But where atoms are updated individually with swap!, refs are updated in groups using dosync transactions. Just as we reset! an atom, we can set refs to new values using ref-set–but unlike atoms, we can change more than one ref at once.

user=> (def x (ref 0)) user=> (def y (ref 0)) user=> (dosync (ref-set x 1) (ref-set y 2)) 2 user=> [@x @y] [1 2]

The equivalent of swap!, for a ref, is alter:

user=> (def x (ref 1)) user=> (def y (ref 2)) user=> (dosync (alter x + 2) (alter y inc)) 3 user=> [@x @y] [3 3]

All alter operations within a dosync take place atomically–their effects are never interleaved with other transactions. If it’s OK for an operation to take place out of order, you can use commute instead of alter for a performance boost:

user=> (dosync (commute x + 2) (commute y inc))

These updates are not guaranteed to take place in the same order–but if all our transactions are equivalent, we can relax the ordering constraints. x + 2 + 3 is equal to x + 3 + 2, so we can do the additions in either order. That’s what commutative means: the same result from all orders. It’s a weaker, but faster kind of safety property.

Finally, if you want to read a value from one ref and use it to update another, use ensure instead of deref to perform a strongly consistent read–one which is guaranteed to take place in the same logical order as the dosync transaction itself. To add y’s current value to x, use:

user=> (dosync (alter x + (ensure y)))

Refs are a powerful construct, and make it easier to write complex transactional logic safely. However, that safety comes at a cost: refs are typically an order of magnitude slower to update than atoms.

Use refs only where you need to update multiple pieces of state independently–specifically, where different transactions need to work with distinct but partly overlapping pieces of state. If there’s no overlap between updates, use distinct atoms. If all operations update the same identities, use a single atom to hold a map of the system’s state. If a system requires complex interlocking state spread throughout the program–that’s when to reach for refs.
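
The classic example of partly overlapping state is transferring money between accounts: every transfer touches two refs, and concurrent transfers may share one account but not the other. Here’s a minimal sketch, with invented account names.

(def checking (ref 100))
(def savings  (ref 0))

(defn transfer!
  "Atomically moves amount from one account ref to another."
  [amount from to]
  (dosync
    (alter from - amount)
    (alter to   + amount)))

user=> (transfer! 50 checking savings)
50
user=> [@checking @savings]
[50 50]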

Summary

We moved beyond immutable programs into the world of changing state–and discovered the challenges of concurrency and parallelism. Where symbols provide immutable and transparent names for values, Vars provide mutable transparent names. We also saw a host of anonymous identity types for different purposes: delays for lazy evaluation, futures for parallel evaluation, and promises for arbitrary handoff of a value. Updates to vars are unsafe, so atoms and refs provide linearizable and serializable identities where transformations are safe.

Where reading a symbol or var is transparent–they evaluate directly to their current values–reading these new identity types requires the use of deref. Delays, futures, and promises block: deref must wait until the value is ready. This allows synchronization of concurrent threads. Atoms and refs, by contrast, can be read immediately at any time–but updating their values should occur within a swap! or dosync transaction, respectively.

Type    | Mutability | Reads       | Updates      | Evaluation | Scope
--------|------------|-------------|--------------|------------|---------------
Symbol  | Immutable  | Transparent |              |            | Lexical
Var     | Mutable    | Transparent | Unrestricted |            | Global/Dynamic
Delay   | Mutable    | Blocking    | Once only    | Lazy       |
Future  | Mutable    | Blocking    | Once only    | Parallel   |
Promise | Mutable    | Blocking    | Once only    |            |
Atom    | Mutable    | Nonblocking | Linearizable |            |
Ref     | Mutable    | Nonblocking | Serializable |            |

State is undoubtedly the hardest part of programming, and this chapter probably felt overwhelming! On the other hand, we’re now equipped to solve serious problems. We’ll take a break to apply what we’ve learned through practical examples, in Chapter Seven: Logistics.

Exercises

Finding the sum of the first 10000000 numbers takes about 1 second on my machine:

user=> (defn sum [start end] (reduce + (range start end))) user=> (time (sum 0 1e7)) "Elapsed time: 1001.295323 msecs" 49999995000000
  1. Use delay to compute this sum lazily; show that it takes no time to return the delay, but roughly 1 second to deref.

  2. We can do the computation in a new thread directly, using (.start (Thread. (fn [] (sum 0 1e7))))–but this simply runs the (sum) function and discards the results. Use a promise to hand the result back out of the thread. Use this technique to write your own version of the future macro.

  3. If your computer has two cores, you can do this expensive computation twice as fast by splitting it into two parts: (sum 0 (/ 1e7 2)), and (sum (/ 1e7 2) 1e7), then adding those parts together. Use future to do both parts at once, and show that this strategy gets the same answer as the single-threaded version, but takes roughly half the time.

  4. Instead of using reduce, store the sum in an atom and use two futures to add each number from the lower and upper range to that atom. Wait for both futures to complete using deref, then check that the atom contains the right number. Is this technique faster or slower than reduce? Why do you think that might be?

  5. Instead of using a lazy list, imagine two threads are removing tasks from a pile of work. Our work pile will be the list of all integers from 0 to 100000:

    user=> (def work (ref (apply list (range 1e5)))) user=> (take 10 @work) (0 1 2 3 4 5 6 7 8 9)

    And the sum will be a ref as well:

    user=> (def sum (ref 0))

    Write a function which, in a dosync transaction, removes the first number in work and adds it to sum.
    Then, in two futures, call that function over and over again until there’s no work left. Verify that @sum is 4999950000. Experiment with different combinations of alter and commute–if both are correct, is one faster? Does using deref instead of ensure change the result?

In Chapter 1, I asserted that the grammar of Lisp is uniform: every expression is a list, beginning with a verb, and followed by some arguments. Evaluation proceeds from left to right, and every element of the list must be evaluated before evaluating the list itself. Yet we just saw, at the end of Sequences, an expression which seemed to violate these rules.

Clearly, this is not the whole story.

Macroexpansion

There is another phase to evaluating an expression; one which takes place before the rules we’ve followed so far. That process is called macro-expansion. During macro-expansion, the code itself is restructured according to some set of rules–rules which you, the programmer, can define.

(defmacro ignore "Cancels the evaluation of an expression, returning nil instead." [expr] nil) user=> (ignore (+ 1 2)) nil

defmacro looks a lot like defn: it has a name, an optional documentation string, an argument vector, and a body–in this case, just nil. At first glance, it looks like it simply ignored the expr (+ 1 2) and returned nil–but it’s actually deeper than that. (+ 1 2) was never evaluated at all.

user=> (def x 1) #'user/x user=> x 1 user=> (ignore (def x 2)) nil user=> x 1

def should have defined x to be 2 no matter what–but that never happened. At macroexpansion time, the expression (ignore (+ 1 2)) was replaced by the expression nil, which was then evaluated to nil. Where functions rewrite values, macros rewrite code.

To see these different layers in play, let’s try a macro which reverses the order of arguments to a function.

(defmacro rev [fun & args] (cons fun (reverse args)))

This macro, named rev, takes one mandatory argument: a function. Then it takes any number of arguments, which are collected in the list args. It constructs a new list, starting with the function, and followed by the arguments, in reverse order.

First, we macro-expand:

user=> (macroexpand '(rev str "hi" (+ 1 2))) (str (+ 1 2) "hi")

So the rev macro took str as the function, and "hi" and (+ 1 2) as the arguments; then constructed a new list with the same function, but the arguments reversed. When we evaluate that expression, we get:

user=> (eval (macroexpand '(rev str "hi" (+ 1 2)))) "3hi"

macroexpand takes an expression and returns that expression with all macros expanded. eval takes an expression and evaluates it. When you type an unquoted expression into the REPL, Clojure macroexpands, then evaluates. Two stages–the first transforming code, the second transforming values.
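
You can poke at each stage in isolation, too. macroexpand leaves forms that aren’t macro calls untouched, and eval will happily evaluate any quoted expression:

user=> (macroexpand '(+ 1 2)) ; not a macro call: passes through unchanged
(+ 1 2)
user=> (eval '(+ 1 2))
3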

Across languages

Some languages have a metalanguage: a language for extending the language itself. In C, for example, macros are implemented by the C preprocessor, which has its own syntax for defining expressions, matching patterns in the source code’s text, and replacing that text with other text. But that preprocessor is not C–it is a separate language entirely, with special limitations. In Clojure, the metalanguage is Clojure itself–the full power of the language is available to restructure programs. This is called a procedural macro system. Some Lisps, like Scheme, use a macro system based on templating expressions, and still others use more powerful models like f-expressions–but that’s a discussion for a later time.

There is another key difference between Lisp macros and many other macro systems: in Lisp, the macros operate on expressions: the data structure of the code itself. Because Lisp code is written explicitly as a data structure, a tree made out of lists, this transformation is natural. You can see the structure of the code, which makes it easy to reason about its transformation. In the C preprocessor, macros operate only on text: there is no understanding of the underlying syntax. Even in languages like Scala which have syntactic macros, the fact that the code looks nothing like the syntax tree makes it cumbersome to truly restructure expressions.

When people say that Lisp’s syntax is “more elegant”, or “more beautiful”, or “simpler”, this is part of what they mean. By choosing to represent the program directly as a data structure, we make it much easier to define complex transformations of code itself.

Defining new syntax

What kind of transformations are best expressed with macros?

Most languages encode special syntactic forms–things like “define a function”, “call a function”, “define a local variable”, “if this, then that”, and so on. In Clojure, these are called special forms. if is a special form, for instance. Its definition is built into the language core itself; it cannot be reduced into smaller parts.

(if (< 3 x) "big" "small")

Or in Javascript:

if (3 < x) { return "big"; } else { return "small"; }

In Javascript, Ruby, and many other languages, these special forms are fixed. You cannot define your own syntax. For instance, one cannot define or in a language like JS or Ruby: it must be defined for you by the language author.

In Clojure, or is just a macro.

user=> (source or)
(defmacro or
  "Evaluates exprs one at a time, from left to right. If a form
  returns a logical true value, or returns that value and doesn't
  evaluate any of the other expressions, otherwise it returns the
  value of the last expression. (or) returns nil."
  {:added "1.0"}
  ([] nil)
  ([x] x)
  ([x & next]
   `(let [or# ~x]
      (if or# or# (or ~@next)))))
nil

That ` operator–that’s called syntax-quote. It works just like regular quote–preventing evaluation of the following list–but with a twist: we can escape the quoting rule and substitute in regularly evaluated expressions using unquote (~), and unquote-splice (~@). Think of a syntax-quoted expression like a template for code, with some parts filled in by evaluated forms.

user=> (let [x 2] `(inc x)) (clojure.core/inc user/x) user=> (let [x 2] `(inc ~x)) (clojure.core/inc 2)

See the difference? ~x substitutes the value of x, instead of using x as an unevaluated symbol. This code is essentially just shorthand for something like

user=> (let [x 2] (list 'clojure.core/inc x))
(clojure.core/inc 2)

… where we explicitly constructed a new list with the quoted symbol 'inc and the current value of x. Syntax quote just makes it easier to read the code, since the quoted and expanded expressions have similar shapes.

The ~@ unquote splice works just like ~, except it explodes a list into multiple expressions in the resulting form:

user=> `(foo ~[1 2 3]) (user/foo [1 2 3]) user=> `(foo ~@[1 2 3]) (user/foo 1 2 3)

~@ is particularly useful when a function or macro takes an arbitrary number of arguments. In the definition of or, it’s used to expand (or a b c) recursively.

user=> (pprint (macroexpand '(or a b c d)))
(let* [or__3943__auto__ a]
  (if or__3943__auto__
    or__3943__auto__
    (clojure.core/or b c d)))

We’re using pprint (for “pretty print”) to make this expression easier to read. (or a b c d) is defined in terms of if: if the first element is truthy we return it; otherwise we evaluate (or b c d) instead, and so on.

The final piece of the puzzle here is that weirdly named symbol: or__3943__auto__. That variable was automatically generated by Clojure, to prevent conflicts with an existing variable name. Because macros rewrite code, they have to be careful not to interfere with local variables, or it could get very confusing. Whenever we need a new variable in a macro, we use gensym to generate a new symbol.

user=> (gensym "hi") hi326 user=> (gensym "hi") hi329 user=> (gensym "hi") hi332

Each symbol is different! If we tack on a # to the end of a symbol in a syntax-quoted expression, it’ll be expanded to a particular gensym:

user=> `(let [x# 2] x#) (clojure.core/let [x__339__auto__ 2] x__339__auto__)

Note that you can always escape this safety feature if you want to override local variables. That’s called symbol capture, or an anaphoric or unhygienic macro. To override local symbols, just use ~'foo instead of foo#.
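
Here’s a sketch of deliberate capture: a made-up with-it macro which binds the symbol it in the caller’s code, anaphora-style. The ~'it is what bypasses the gensym machinery.

(defmacro with-it
  "Evaluates expr, binds the result to the symbol it, then evaluates body."
  [expr & body]
  `(let [~'it ~expr] ~@body))

user=> (with-it (+ 1 2) (* it 10))
30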

With all the pieces on the board, let’s compare the or macro and its expansion:

(defmacro or
  "Evaluates exprs one at a time, from left to right. If a form
  returns a logical true value, or returns that value and doesn't
  evaluate any of the other expressions, otherwise it returns the
  value of the last expression. (or) returns nil."
  {:added "1.0"}
  ([] nil)
  ([x] x)
  ([x & next]
   `(let [or# ~x]
      (if or# or# (or ~@next)))))

user=> (pprint (clojure.walk/macroexpand-all
                 '(or (mossy? stone) (cool? stone) (wet? stone))))
(let* [or__3943__auto__ (mossy? stone)]
  (if or__3943__auto__
    or__3943__auto__
    (let* [or__3943__auto__ (cool? stone)]
      (if or__3943__auto__
        or__3943__auto__
        (wet? stone)))))

See how the macro’s syntax-quoted (let ... has the same shape as the resulting code? or# is expanded to a variable named or__3943__auto__, which is bound to the expression (mossy? stone). If that variable is truthy, we return it. Otherwise, we (and here’s the recursive part) rebind or__3943__auto__ to (cool? stone) and try again. If that fails, we fall back to evaluating (wet? stone)–thanks to the base case, the single-argument form of the or macro.

Control flow

We’ve seen that or is a macro written in terms of the special form if–and because of the way the macro is structured, it does not obey the normal execution order. In (or a b c), only a is evaluated first–then, only if it is false or nil, do we evaluate b. This is called short-circuiting, and it works for and as well.

Changing the order of evaluation in a language is called control flow, and lets programs make decisions based on varying circumstances. We’ve already seen if:

user=> (if (= 2 2) :a :b) :a

if takes a predicate and two expressions, and only evaluates one of them, depending on whether the predicate evaluates to a truthy or falsey value. Sometimes you want to evaluate more than one expression in order. For this, we have do.

user=> (if (pos? -5) (prn "-5 is positive") (do (prn "-5 is negative") (prn "Who would have thought?"))) "-5 is negative" "Who would have thought?" nil

prn is a function which has a side effect: it prints a message to the screen, and returns nil. We wanted to print two messages, but if only takes a single expression per branch–so in our false branch, we used do to wrap up two prns into a single expression, and evaluate them in order. do returns the value of the final expression, which happens to be nil here.

When you only want to take one branch of an if, you can use when:

user=> (when false (prn :hi) (prn :there)) nil user=> (when true (prn :hi) (prn :there)) :hi :there nil

Because there is only one path to take, when takes any number of expressions, and evaluates them only when the predicate is truthy. If the predicate evaluates to nil or false, when does not evaluate its body, and returns nil.

Both when and if have complementary forms, when-not and if-not, which simply invert the sense of their predicate.

user=> (when-not (number? "a string") :here) :here user=> (if-not (vector? (list 1 2 3)) :a :b) :a

Often, you want to perform some operation, and if it’s truthy, re-use that value without recomputing it. For this, we have when-let and if-let, which work just like when and if, combined with let.

user=> (when-let [x (+ 1 2 3 4)] (str x)) "10" user=> (when-let [x (first [])] (str x)) nil
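
if-let works the same way, but takes an else branch for when the test is nil or false:

user=> (if-let [x (first [1 2 3])] (inc x) :empty)
2
user=> (if-let [x (first [])] (inc x) :empty)
:empty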

while evaluates an expression so long as its predicate is truthy. This is generally useful only for side effects, like prn or def; things that change the state of the world.

user=> (def x 0)
#'user/x
user=> (while (< x 5)
  #_=>   (prn x)
  #_=>   (def x (inc x)))
0
1
2
3
4
nil

cond (for “conditional”) is like a multiheaded if: it takes any number of test/expression pairs, and tries each test in turn. The first test which evaluates truthy causes the following expression to be evaluated; then cond returns that expression’s value.

user=> (cond
  #_=>   (= 2 5) :nope
  #_=>   (= 3 3) :yep
  #_=>   (= 5 5) :cant-get-here
  #_=>   :else   :a-default-value)
:yep

If you find yourself making several similar decisions based on a value, try condp, for “cond with predicate”. For instance, we might categorize a number based on some ranges:

(defn category
  "Determines the Saffir-Simpson category of a hurricane, by wind speed in meters/sec"
  [wind-speed]
  (condp <= wind-speed
    70 :F5
    58 :F4
    49 :F3
    42 :F2
    :F1)) ; Default value

user=> (category 10)
:F1
user=> (category 50)
:F3
user=> (category 100)
:F5

condp generates code which combines the predicate <= with each number, and the value of wind-speed, like so:

(if (<= 70 wind-speed)
  :F5
  (if (<= 58 wind-speed)
    :F4
    (if (<= 49 wind-speed)
      :F3
      (if (<= 42 wind-speed)
        :F2
        :F1))))

Specialized macros like condp are less commonly used than if or when, but they still play an important role in simplifying repeated code. They clarify the meaning of complex expressions, making them easier to read and maintain.

Finally, there’s case, which works a little bit like a map of keys to values–only the values are code, to be evaluated. You can think of case like (condp = ...), trying to match an expression to a particular branch for which it is equal.

(defn with-tax
  "Computes the total cost, with tax, of a purchase in the given state."
  [state subtotal]
  (case state
    :WA (* 1.065 subtotal)
    :OR subtotal
    :CA (* 1.075 subtotal)
    ; ... 48 other states ...
    subtotal)) ; a default case

Unlike cond and condp, case does not evaluate its tests in order. It jumps immediately to the matching expression. This makes case much faster when there are many branches to take–at the cost of reduced generality.

Recursion

Previously, we defined recursive functions by having those functions call themselves explicitly.

(defn sum [numbers] (if-let [n (first numbers)] (+ n (sum (rest numbers))) 0)) user=> (sum (range 10)) 45

But this approach breaks down when we have the function call itself deeply, over and over again.

user=> (sum (range 100000)) StackOverflowError clojure.core/range/fn--4269 (core.clj:2664)

Every time you call a function, the arguments for that function are stored in memory, in a region called the stack. They remain there for as long as the function is being called–including any deeper function calls.

(+ n (sum (rest numbers)))

In order to add n and (sum (rest numbers)), we have to call sum first–while holding onto the memory for n and numbers. We can’t re-use that memory until every single recursive call has completed. Clojure complains, after tens of thousands of stack frames are in use, that it has run out of space in the stack and can allocate no more.

But consider this variation on sum:

(defn sum
  ([numbers]
   (sum 0 numbers))
  ([subtotal numbers]
   (if-let [n (first numbers)]
     (recur (+ subtotal n) (rest numbers))
     subtotal)))

user=> (sum (range 100000))
4999950000

We’ve added an additional parameter to the function. In its two-argument form, sum now takes an accumulator, subtotal, which represents the sum so far. In addition, recur has taken the place of sum. Notice, however, that the final expression to be evaluated is not +, but sum (viz recur) itself. We don’t need to hang on to any of the variables in this function any more, because the final return value won’t depend on them. recur hints to the Clojure compiler that we don’t need to hold on to the stack, and can re-use that space for other things. This is called a tail-recursive function, and it requires only a single stack frame no matter how deep the recursive calls go.

Use recur wherever possible. It requires much less memory and is much faster than the explicit recursion.

You can also use recur within the context of the loop macro, where it acts just like an unnamed recursive function with initial values provided. Think of it, perhaps, like a recursive let.

user=> (loop [i 0 nums []] (if (< 10 i) nums (recur (inc i) (conj nums i)))) [0 1 2 3 4 5 6 7 8 9 10]

Laziness

In chapter 4 we mentioned that most of the sequences in Clojure–like those produced by map, filter, iterate, repeatedly, and so on–were lazy: they did not evaluate any of their elements until required. This too is provided by a macro, called lazy-seq.

(defn integers [x] (lazy-seq (cons x (integers (inc x))))) user=> (def xs (integers 0)) #'user/xs

This sequence does not terminate; it is infinitely recursive. Yet it returned instantaneously. lazy-seq interrupted that recursion and restructured it into a sequence which constructs elements only when they are requested.

user=> (take 10 xs) (0 1 2 3 4 5 6 7 8 9)

When using lazy-seq and its partner lazy-cat, you don’t have to use recur–or even be tail-recursive. The macros interrupt each level of recursion, preventing stack overflows.
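
For instance, here’s a classic, slightly mind-bending use of lazy-cat: the Fibonacci sequence defined in terms of itself, with no recur in sight.

user=> (def fibs (lazy-cat [0 1] (map + fibs (rest fibs))))
#'user/fibs
user=> (take 10 fibs)
(0 1 1 2 3 5 8 13 21 34)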

You can also delay evaluation of some expressions until later, using delay and deref.

user=> (def x (delay (prn "computing a really big number!")
  #_=>          (last (take 10000000 (iterate inc 0)))))
#'user/x                          ; Did nothing, returned immediately
user=> (deref x)
"computing a really big number!"  ; Now we have to wait!
9999999

List comprehensions

Combining recursion and laziness is the list comprehension macro, for. In its simplest form, for works like map:

user=> (for [x (range 10)] (- x)) (0 -1 -2 -3 -4 -5 -6 -7 -8 -9)

Like let, for takes a vector of bindings. Unlike let, however, for binds its variables to each possible combination of elements in their corresponding sequences.

user=> (for [x [1 2 3] y [:a :b]] [x y]) ([1 :a] [1 :b] [2 :a] [2 :b] [3 :a] [3 :b])

“For each x in the sequence [1 2 3], and for each y in the sequence [:a :b], find all [x y] pairs.” Note that the rightmost variable y iterates the fastest.

Like most sequence functions, the for macro yields lazy sequences. You can filter them with take, filter, et al like any other sequence. Or you can use :while to tell for when to stop, or :when to filter out combinations of elements.

(for [x (range 5) y (range 5) :when (and (even? x) (odd? y))] [x y]) ([0 1] [0 3] [2 1] [2 3] [4 1] [4 3])
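
:while is especially handy with infinite sequences: it cuts the comprehension off at the first element which fails the test.

user=> (for [x (range) :while (< x 5)] (* x x))
(0 1 4 9 16)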

Clojure includes a rich smörgåsbord of control-flow constructs; we’ll meet new ones throughout the book.

The threading macros

Sometimes you want to thread a computation through several expressions, like a chain. Object-oriented languages like Ruby or Java are well-suited to this style:

1.9.3p385 :004 > (0..10).select(&:odd?).reduce(&:+) 25

Start with the range 0 to 10, then call select on that range, with the function odd?. Finally, take that sequence of numbers, and reduce it with the + function.

The Clojure threading macros do the same by restructuring a sequence of expressions, inserting each expression as the first (or final) argument in the next expression.

user=> (pprint (clojure.walk/macroexpand-all '(->> (range 10) (filter odd?) (reduce +)))) (reduce + (filter odd? (range 10))) user=> (->> (range 10) (filter odd?) (reduce +)) 25

->> took (range 10) and inserted it at the end of (filter odd?), forming (filter odd? (range 10)). Then it took that expression and inserted it at the end of (reduce +). In essence, ->> flattens and reverses a nested chain of operations.

->, by contrast, inserts each form in as the first argument in the following expression.

user=> (pprint (clojure.walk/macroexpand-all '(-> {:proton :fermion} (assoc :photon :boson) (assoc :neutrino :fermion)))) (assoc (assoc {:proton :fermion} :photon :boson) :neutrino :fermion) user=> (-> {:proton :fermion} (assoc :photon :boson) (assoc :neutrino :fermion)) {:neutrino :fermion, :photon :boson, :proton :fermion}

Clojure isn’t just function-oriented in its syntax; it can be object-oriented, and stack-oriented, and array-oriented, and so on–and mix all of these styles freely, in a controlled way. If you don’t like the way the language fits a certain problem, you can write a macro which defines a new language, specifically for that subproblem.

cond, condp and case, for example, express a language for branching based on predicates. ->, ->>, and doto express object-oriented and other expression-chaining languages.

  • core.match is a set of macros which express powerful pattern-matching and substitution languages.
  • core.logic expresses syntax for logic programming, for finding values which satisfy complex constraints.
  • core.async restructures Clojure code into asynchronous forms so they can do many things at once.
  • For those with a twisted sense of humor, Swiss Arrows extends the threading macros into evil–but delightfully concise!–forms.

We’ll see a plethora of macros, from simple to complex, through the course of this book. Each one shares the common pattern of simplifying code; reducing tangled or verbose expressions into something more concise, more meaningful, better suited to the problem at hand.

When to use macros

While it’s important to be aware of the purpose and behavior of the macro system, you don’t need to write your own macros to be productive with Clojure. For now, you’ll be just fine writing code which uses the existing macros in the language. When you do need to delve deeper, come back to this guide and experiment. It’ll take some time to sink in.

First, know that writing macros is tricky, even for experts. It requires you to think at two levels simultaneously, and to be mindful of the distinction between expression and underlying evaluation. Writing a macro is essentially extending the language, the compiler, the syntax and evaluation model of Clojure, by restructuring arbitrary expressions into ones the evaluation system understands. This is hard, and it’ll take practice to get used to.

In addition, Clojure macros come with some important restrictions. Because they’re expanded prior to evaluation, macros are invisible to functions. They can’t be composed functionally–you can’t (map or ...), for instance.

So in general, if you can solve a problem without writing a macro, don’t write one. It’ll be easier to debug, easier to understand, and easier to compose later. Only reach for macros when you need new syntax, or when performance demands the code be transformed at compile time.

When you do write a macro, consider its scope carefully. Keep the transformation simple, and do as much in normal functions as possible. Provide an escape hatch where possible, by doing most of the work in a function, and writing a small wrapper macro which calls that function. Finally, remember the distinction between code and what that code evaluates to. Use let whenever a value is to be re-used, to prevent it being evaluated twice by accident.
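
As a sketch of that escape-hatch pattern, consider a made-up timed macro. All the real work lives in an ordinary function, timed*, which you can call, test, and compose like any other function; the macro contributes only the syntax of deferring its body.

(defn timed*
  "Calls f with no arguments, printing roughly how long it took."
  [f]
  (let [start  (System/nanoTime)
        result (f)]
    (prn (str "took " (/ (- (System/nanoTime) start) 1e6) " ms"))
    result))

(defmacro timed
  "Evaluates body, printing roughly how long it took."
  [& body]
  `(timed* (fn [] ~@body)))

user=> (timed (reduce + (range 1e6)))
"took 21.3 ms" ; your timing will vary
499999500000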

For a deeper exploration of Clojure macros in a real-world application, try Language Power.

Review

In Chapter 4, deeply nested expressions led to the desire for a simpler, more direct expression of a chain of sequence operations. We learned that the Clojure compiler first expands expressions before evaluating them, using macros–special functions which take code and return other code. We used macros to define the short-circuiting or operator, and followed that with a tour of basic control flow, recursion, laziness, list comprehensions, and chained expressions. Finally, we learned a bit about when and how to write our own macros.

Throughout this chapter we’ve brushed against the idea of side effects: things which change the outside world. We might change a var with def, or print a message to the screen with prn. Real languages must model a continually shifting universe, which leads us to Chapter 6: Side effects and state.

Problems

  1. Using the control flow constructs we’ve learned, write a schedule function which, given an hour of the day, returns what you’ll be doing at that time. (schedule 18), for me, returns :dinner.

  2. Using the threading macros, find how many numbers from 0 to 9999 are palindromes: identical when written forwards and backwards. 121 is a palindrome, as are 7447 and 5, but not 12 or 953.

  3. Write a macro id which takes a function and a list of args: (id f a b c), and returns an expression which calls that function with the given args: (f a b c).

  4. Write a macro log which uses a var, logging-enabled, to determine whether or not to print an expression to the console at compile time. If logging-enabled is false, (log :hi) should macroexpand to nil. If logging-enabled is true, (log :hi) should macroexpand to (prn :hi). Why would you want to do this check during compilation, instead of when running the program? What might you lose?

  5. (Advanced) Using the rationalize function, write a macro exact which rewrites any use of +, -, *, or / to force the use of ratios instead of floating-point numbers. (* 2452.45 100) returns 245244.99999999997, but (exact (* 2452.45 100)) should return 245245N.

In Chapter 3, we discovered functions as a way to abstract expressions; to rephrase a particular computation with some parts missing. We used functions to transform a single value. But what if we want to apply a function to more than one value at once? What about sequences?

For example, we know that (inc 2) increments the number 2. What if we wanted to increment every number in the vector [1 2 3], producing [2 3 4]?

user=> (inc [1 2 3])
ClassCastException clojure.lang.PersistentVector cannot be cast to java.lang.Number  clojure.lang.Numbers.inc (Numbers.java:110)

Clearly inc can only work on numbers, not on vectors. We need a different kind of tool.

A direct approach

Let’s think about the problem in concrete terms. We want to increment each of three elements: the first, second, and third. We know how to get an element from a sequence by using nth, so let’s start with the first number, at index 0:

user=> (def numbers [1 2 3])
#'user/numbers
user=> (nth numbers 0)
1
user=> (inc (nth numbers 0))
2

So there’s the first element incremented. Now we can do the second:

user=> (inc (nth numbers 1))
3
user=> (inc (nth numbers 2))
4

And it should be straightforward to combine these into a vector…

user=> [(inc (nth numbers 0)) (inc (nth numbers 1)) (inc (nth numbers 2))] [2 3 4]

Success! We’ve incremented each of the numbers in the list! How about a list with only two elements?

user=> (def numbers [1 2])
#'user/numbers
user=> [(inc (nth numbers 0)) (inc (nth numbers 1)) (inc (nth numbers 2))]
IndexOutOfBoundsException clojure.lang.PersistentVector.arrayFor (PersistentVector.java:107)

Shoot. We tried to get the element at index 2, but couldn’t, because numbers only has indices 0 and 1. Clojure calls that “index out of bounds”.

We could just leave off the third expression in the vector; taking only elements 0 and 1. But the problem actually gets much worse, because we’d need to make this change every time we wanted to use a different sized vector. And what of a vector with 1000 elements? We’d need 1000 (inc (nth numbers ...)) expressions! Down this path lies madness.

Let’s back up a bit, and try a slightly smaller problem.

Recursion

What if we just incremented the first number in the vector? How would that work? We know that first finds the first element in a sequence, and rest finds all the remaining ones.

user=> (first [1 2 3])
1
user=> (rest [1 2 3])
(2 3)

So there’s the pieces we’d need. To glue them back together, we can use a function called cons, which says “make a list beginning with the first argument, followed by all the elements in the second argument”.

user=> (cons 1 [2])
(1 2)
user=> (cons 1 [2 3])
(1 2 3)
user=> (cons 1 [2 3 4])
(1 2 3 4)

OK so we can split up a sequence, increment the first part, and join them back together. Not so hard, right?

(defn inc-first [nums]
  (cons (inc (first nums))
        (rest nums)))

user=> (inc-first [1 2 3 4])
(2 2 3 4)

Hey, there we go! First element changed. Will it work with any length list?

user=> (inc-first [5])
(6)
user=> (inc-first [])
NullPointerException clojure.lang.Numbers.ops (Numbers.java:942)

Shoot. We can’t increment the first element of this empty vector, because it doesn’t have a first element.

user=> (first [])
nil
user=> (inc nil)
NullPointerException clojure.lang.Numbers.ops (Numbers.java:942)

So there are really two cases for this function. If there is a first element in nums, we’ll increment it as normal. If there’s no such element, we’ll return an empty list. To express this kind of conditional behavior, we’ll use a Clojure special form called if:

user=> (doc if)
-------------------------
if
  (if test then else?)
Special Form
  Evaluates test. If not the singular values nil or false, evaluates and yields then, otherwise, evaluates and yields else. If else is not supplied it defaults to nil.

  Please see http://clojure.org/special_forms#if

To confirm our intuition:

user=> (if true :a :b)
:a
user=> (if false :a :b)
:b

Seems straightforward enough.

(defn inc-first [nums]
  (if (first nums)
    ; If there's a first number, build a new list with cons
    (cons (inc (first nums))
          (rest nums))
    ; If there's no first number, just return an empty list
    (list)))

user=> (inc-first [])
()
user=> (inc-first [1 2 3])
(2 2 3)

Success! Now we can handle both cases: empty sequences, and sequences with things in them. Now how about incrementing that second number? Let’s stare at that code for a bit.

(rest nums)

Hang on. That list–(rest nums)–that’s a list of numbers too. What if we… used our inc-first function on that list, to increment its first number? Then we’d have incremented both the first and the second element.

(defn inc-more [nums]
  (if (first nums)
    (cons (inc (first nums))
          (inc-more (rest nums)))
    (list)))

user=> (inc-more [1 2 3 4])
(2 3 4 5)

Odd. That didn’t just increment the first two numbers. It incremented all the numbers. We fell into the complete solution entirely by accident. What happened here?

Well first we… yes, we got the number one, and incremented it. Then we stuck that onto (inc-more [2 3 4]), which got two, and incremented it. Then we stuck that two onto (inc-more [3 4]), which got three, and then we did the same for four. Only that time around, at the very end of the list, (rest [4]) would have been empty. So when we went to get the first number of the empty list, we took the second branch of the if, and returned the empty list.

Having reached the bottom of the function calls, so to speak, we zip back up the chain. We can imagine this function turning into a long string of cons calls, like so:

(cons 2 (cons 3 (cons 4 (cons 5 '()))))
(cons 2 (cons 3 (cons 4 '(5))))
(cons 2 (cons 3 '(4 5)))
(cons 2 '(3 4 5))
'(2 3 4 5)

This technique is called recursion, and it is a fundamental principle in working with collections, sequences, trees, or graphs… any problem which has small parts linked together. There are two key elements in a recursive program:

  1. Some part of the problem which has a known solution
  2. A relationship which connects one part of the problem to the next

Incrementing the elements of an empty list returns the empty list. This is our base case: the ground to build on. Our inductive case, also called the recurrence relation, is how we broke the problem up into incrementing the first number in the sequence, and incrementing all the numbers in the rest of the sequence. The if expression bound these two cases together into a single function; a function defined in terms of itself.

Once the initial step has been taken, every step can be taken.

user=> (inc-more [1 2 3 4 5 6 7 8 9 10 11 12]) (2 3 4 5 6 7 8 9 10 11 12 13)

This is the beauty of a recursive function; folding an unbounded stream of computation over and over, onto itself, until only a single step remains.

Generalizing from inc

We set out to increment every number in a vector, but nothing in our solution actually depended on inc. It just as well could have been dec, or str, or keyword. Let’s parameterize our inc-more function to use any transformation of its elements:

(defn transform-all [f xs]
  (if (first xs)
    (cons (f (first xs))
          (transform-all f (rest xs)))
    (list)))

Because we could be talking about any kind of sequence, not just numbers, we’ve named the sequence xs, and its first element x. We also take a function f as an argument, and that function will be applied to each x in turn. So not only can we increment numbers…

user=> (transform-all inc [1 2 3 4]) (2 3 4 5)

…but we can turn strings to keywords…

user=> (transform-all keyword ["bell" "hooks"]) (:bell :hooks)

…or wrap every element in a list:

user=> (transform-all list [:codex :book :manuscript]) ((:codex) (:book) (:manuscript))

In short, this function expresses a sequence in which each element is some function applied to the corresponding element in the underlying sequence. This idea is so important that it has its own name, in mathematics, Clojure, and other languages. We call it map.

user=> (map inc [1 2 3 4]) (2 3 4 5)

You might remember maps as a datatype in Clojure, too–they’re dictionaries that relate keys to values.

{:year 1969 :event "moon landing"}

The function map relates one sequence to another. The type map relates keys to values. There is a deep symmetry between the two: maps are usually sparse, and the relationships between keys and values may be arbitrarily complex. The map function, on the other hand, usually expresses the same type of relationship, applied to a series of elements in fixed order.

Building sequences

Recursion can do more than just map. We can use it to expand a single value into a sequence of values, each related by some function. For instance:

(defn expand [f x count]
  (if (pos? count)
    (cons x (expand f (f x) (dec count)))))

Our base case is x itself, followed by the sequence beginning with (f x). That sequence in turn expands to (f (f x)), and then (f (f (f x))), and so on. Each time we call expand, we count down by one using dec. Once the count is zero, the if returns nil, and evaluation stops. If we start with the number 0 and use inc as our function:

user=> (expand inc 0 10)
(0 1 2 3 4 5 6 7 8 9)

Clojure has a more general form of this function, called iterate.

user=> (take 10 (iterate inc 0)) (0 1 2 3 4 5 6 7 8 9)

Since this sequence is infinitely long, we’re using take to select only the first 10 elements. We can construct more complex sequences by using more complex functions:

user=> (take 10 (iterate (fn [x] (if (odd? x) (+ 1 x) (/ x 2))) 10)) (10 5 6 3 4 2 1 2 1 2)

Or build up strings:

user=> (take 5 (iterate (fn [x] (str x "o")) "y")) ("y" "yo" "yoo" "yooo" "yoooo")

iterate is extremely handy for working with infinite sequences, and has some partners in crime. repeat, for instance, constructs a sequence where every element is the same.

user=> (take 10 (repeat :hi))
(:hi :hi :hi :hi :hi :hi :hi :hi :hi :hi)
user=> (repeat 3 :echo)
(:echo :echo :echo)

And its close relative repeatedly simply calls a function (f) to generate an infinite sequence of values, over and over again, without any relationship between elements. For an infinite sequence of random numbers:

user=> (rand)
0.9002678382322784
user=> (rand)
0.12375594203332863
user=> (take 3 (repeatedly rand))
(0.44442397843046755 0.33668691162169784 0.18244875487846746)

Notice that calling (rand) returns a different number each time. We say that rand is an impure function, because it cannot simply be replaced by the same value every time. It does something different each time it’s called.

There’s another very handy sequence function specifically for numbers: range, which generates a sequence of numbers between two points. (range n) gives n successive integers starting at 0. (range n m) returns integers from n to m-1. (range n m step) returns integers from n up to (but not including) m, separated by step.

user=> (range 5)
(0 1 2 3 4)
user=> (range 2 10)
(2 3 4 5 6 7 8 9)
user=> (range 0 100 5)
(0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95)

To extend a sequence by repeating it forever, use cycle:

user=> (take 10 (cycle [1 2 3])) (1 2 3 1 2 3 1 2 3 1)

Transforming sequences

Given a sequence, we often want to find a related sequence. map, for instance, applies a function to each element–but has a few more tricks up its sleeve.

user=> (map (fn [n vehicle] (str "I've got " n " " vehicle "s")) [0 200 9] ["car" "train" "kiteboard"]) ("I've got 0 cars" "I've got 200 trains" "I've got 9 kiteboards")

If given multiple sequences, map calls its function with one element from each sequence in turn. So the first value will be (f 0 "car"), the second (f 200 "train"), and so on. Like a zipper, map folds together corresponding elements from multiple collections. To sum three vectors, column-wise:

user=> (map + [1 2 3] [4 5 6] [7 8 9]) (12 15 18)

If one sequence is bigger than another, map stops at the end of the smaller one. We can exploit this to combine finite and infinite sequences. For example, to number the elements in a vector:

user=> (map (fn [index element] (str index ". " element)) (iterate inc 0) ["erlang" "ruby" "haskell"]) ("0. erlang" "1. ruby" "2. haskell")

Transforming elements together with their indices is so common that Clojure has a special function for it: map-indexed:

user=> (map-indexed (fn [index element] (str index ". " element)) ["erlang" "ruby" "haskell"]) ("0. erlang" "1. ruby" "2. haskell")

You can also tack one sequence onto the end of another, like so:

user=> (concat [1 2 3] [:a :b :c] [4 5 6]) (1 2 3 :a :b :c 4 5 6)

Another way to combine two sequences is to riffle them together, using interleave.

user=> (interleave [:a :b :c] [1 2 3]) (:a 1 :b 2 :c 3)

And if you want to insert a specific element between each successive pair in a sequence, try interpose:

user=> (interpose :and [1 2 3 4]) (1 :and 2 :and 3 :and 4)

To reverse a sequence, use reverse.

user=> (reverse [1 2 3])
(3 2 1)
user=> (reverse "woolf")
(\f \l \o \o \w)

Strings are sequences too! Each element of a string is a character, written \f. You can rejoin those characters into a string with apply str:

user=> (apply str (reverse "woolf")) "floow"

…and break strings up into sequences of chars with seq.

user=> (seq "sato") (\s \a \t \o)

To randomize the order of a sequence, use shuffle.

user=> (shuffle [1 2 3 4])
[3 1 2 4]
user=> (apply str (shuffle (seq "abracadabra")))
"acaadabrrab"

Subsequences

We’ve already seen take, which selects the first n elements. There’s also drop, which removes the first n elements.

user=> (range 10)
(0 1 2 3 4 5 6 7 8 9)
user=> (take 3 (range 10))
(0 1 2)
user=> (drop 3 (range 10))
(3 4 5 6 7 8 9)

And for slicing apart the other end of the sequence, we have take-last and drop-last:

user=> (take-last 3 (range 10))
(7 8 9)
user=> (drop-last 3 (range 10))
(0 1 2 3 4 5 6)

take-while and drop-while work just like take and drop, but use a function to decide when to cut.

user=> (take-while pos? [3 2 1 0 -1 -2 10]) (3 2 1)

In general, one can cut a sequence in twain by using split-at, and giving it a particular index. There’s also split-with, which uses a function to decide when to cut.

user=> (split-at 4 (range 10))
[(0 1 2 3) (4 5 6 7 8 9)]
user=> (split-with number? [1 2 3 :mark 4 5 6 :mark 7])
[(1 2 3) (:mark 4 5 6 :mark 7)]

Notice that because indexes start at zero, sequence functions tend to have predictable numbers of elements. (split-at 4) yields four elements in the first collection, and ensures the second collection begins at index four. (range 10) has ten elements, corresponding to the first ten indices in a sequence. (range 3 5) has two (since 5 - 3 is two) elements. These choices simplify the definition of recursive functions as well.

We can select particular elements from a sequence by applying a function. To find all positive numbers in a list, use filter:

user=> (filter pos? [1 5 -4 -7 3 0]) (1 5 3)

filter looks at each element in turn, and includes it in the resulting sequence only if (f element) returns a truthy value. Its complement is remove, which only includes those elements where (f element) is false or nil.

user=> (remove string? [1 "turing" :apple]) (1 :apple)

Finally, one can group a sequence into chunks using partition, partition-all, or partition-by. For instance, one might group alternating values into pairs:

user=> (partition 2 [:cats 5 :bats 27 :crocodiles 0]) ((:cats 5) (:bats 27) (:crocodiles 0))

Or separate a series of numbers into negative and positive runs:

user=> (partition-by neg? [1 2 3 2 1 -1 -2 -3 -2 -1 1 2])
((1 2 3 2 1) (-1 -2 -3 -2 -1) (1 2))

Collapsing sequences

After transforming a sequence, we often want to collapse it in some way; to derive some smaller value. For instance, we might want the number of times each element appears in a sequence:

user=> (frequencies [:meow :mrrrow :meow :meow]) {:meow 3, :mrrrow 1}

Or to group elements by some function:

user=> (pprint (group-by :first
                         [{:first "Li"    :last "Zhou"}
                          {:first "Sarah" :last "Lee"}
                          {:first "Sarah" :last "Dunn"}
                          {:first "Li"    :last "O'Toole"}]))
{"Li"    [{:last "Zhou", :first "Li"} {:last "O'Toole", :first "Li"}],
 "Sarah" [{:last "Lee", :first "Sarah"} {:last "Dunn", :first "Sarah"}]}

Here we’ve taken a sequence of people with first and last names, and used the :first keyword (which can act as a function!) to look up those first names. group-by used that function to produce a map of first names to lists of people–kind of like an index.

In general, we want to combine elements together in some way, using a function. Where map treated each element independently, reducing a sequence requires that we bring some information along. The most general way to collapse a sequence is reduce.

user=> (doc reduce)
-------------------------
clojure.core/reduce
([f coll] [f val coll])
  f should be a function of 2 arguments. If val is not supplied, returns the result of applying f to the first 2 items in coll, then applying f to that result and the 3rd item, etc. If coll contains no items, f must accept no arguments as well, and reduce returns the result of calling f with no arguments. If coll has only 1 item, it is returned and f is not called. If val is supplied, returns the result of applying f to val and the first item in coll, then applying f to that result and the 2nd item, etc. If coll contains no items, returns val and f is not called.

That’s a little complicated, so we’ll start small. We need a function, f, which combines successive elements of the sequence. (f state element) will return the state for the next invocation of f. As f moves along the sequence, it carries some changing state with it. The final state is the return value of reduce.

user=> (reduce + [1 2 3 4]) 10

reduce begins by calling (+ 1 2), which yields the state 3. Then it calls (+ 3 3), which yields 6. Then (+ 6 4), which returns 10. We’ve taken a function over two elements, and used it to combine all the elements. Mathematically, we could write:

1 + 2 + 3 + 4
3 + 3 + 4
6 + 4
10

So another way to look at reduce is like sticking a function between each pair of elements. To see the reducing process in action, we can use reductions, which returns a sequence of all the intermediate states.

user=> (reductions + [1 2 3 4]) (1 3 6 10)

Oftentimes we include a default state to start with. For instance, we could start with an empty set, and add each element to it as we go along:

user=> (reduce conj #{} [:a :b :b :b :a :a]) #{:a :b}

Reducing elements into a collection has its own name: into. We can conj [key value] vectors into a map, for instance, or build up a list:

user=> (into {} [[:a 2] [:b 3]])
{:a 2, :b 3}
user=> (into (list) [1 2 3 4])
(4 3 2 1)

Because elements added to a list appear at the beginning, not the end, this expression reverses the sequence. Vectors conj onto the end, so to emit the elements in order, using reduce, we might try:

user=> (reduce conj [] [1 2 3 4 5])
[1 2 3 4 5]

Which brings up an interesting thought: this looks an awful lot like map. All that’s missing is some kind of transformation applied to each element.

(defn my-map [f coll]
  (reduce (fn [output element]
            (conj output (f element)))
          []
          coll))

user=> (my-map inc [1 2 3 4])
[2 3 4 5]

Huh. map is just a special kind of reduce. What about, say, take-while?

(defn my-take-while [f coll]
  (reduce (fn [out elem]
            (if (f elem)
              (conj out elem)
              (reduced out)))
          []
          coll))

We’re using a special function here, reduced, to indicate that we’ve completed our reduction early and can skip the rest of the sequence.

user=> (my-take-while pos? [2 1 0 -1 0 1 2]) [2 1]
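This recipe generalizes. As another sketch, here's a hypothetical my-frequencies, mirroring the frequencies function from earlier in this chapter: the state we carry along is a map from elements to counts.

(defn my-frequencies [coll]
  (reduce (fn [counts x]
            ; Bump the count for x, treating a missing key as zero
            (assoc counts x (inc (get counts x 0))))
          {}
          coll))

user=> (my-frequencies [:meow :mrrrow :meow :meow])
{:meow 3, :mrrrow 1}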

reduce really is the uberfunction over sequences. Almost any operation on a sequence can be expressed in terms of a reduce–though for various reasons, many of the Clojure sequence functions are not written this way. For instance, take-while is actually defined like so:

user=> (source take-while)
(defn take-while
  "Returns a lazy sequence of successive items from coll while
  (pred item) returns true. pred must be free of side-effects."
  {:added "1.0"
   :static true}
  [pred coll]
  (lazy-seq
   (when-let [s (seq coll)]
     (when (pred (first s))
       (cons (first s) (take-while pred (rest s)))))))

There’s a few new pieces here, but the structure is essentially the same as our initial attempt at writing map. When the predicate matches the first element, cons the first element onto take-while, applied to the rest of the sequence. That lazy-seq construct allows Clojure to compute this sequence as required, instead of right away. It defers execution to a later time.
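We can give our own my-map the same treatment. This isn't the real clojure.core/map–just a sketch of the pattern: wrap the recursive cons in lazy-seq, and each element is computed only when someone asks for it, which means it works on infinite sequences too.

(defn lazy-my-map [f coll]
  (lazy-seq
    (when-let [s (seq coll)]
      ; Compute one element now; defer the rest until needed
      (cons (f (first s))
            (lazy-my-map f (rest s))))))

user=> (take 5 (lazy-my-map inc (iterate inc 0)))
(1 2 3 4 5)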

Most of Clojure’s sequence functions are lazy. They don’t do anything until needed. For instance, we can increment every number from zero to infinity:

user=> (def infseq (map inc (iterate inc 0)))
#'user/infseq
user=> (realized? infseq)
false

That function returned immediately. Because it hasn’t done any work yet, we say the sequence is unrealized. It doesn’t increment any numbers at all until we ask for them:

user=> (take 10 infseq)
(1 2 3 4 5 6 7 8 9 10)
user=> (realized? infseq)
true

Lazy sequences also remember their contents, once evaluated, for faster access.
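You can watch that caching happen by building a lazy sequence whose elements print as they're computed. (A sketch: Clojure may realize lazy sequences in chunks, so several elements can be computed at once.)

user=> (def cached (map (fn [x] (prn :computing x) x) [1 2 3]))
#'user/cached
user=> (first cached)
:computing 1
:computing 2
:computing 3
1
user=> (first cached)
1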

Putting it all together

We’ve seen how recursion generalizes a function over one thing into a function over many things, and discovered a rich landscape of recursive functions over sequences. Now let’s use our knowledge of sequences to solve a more complex problem: find the sum of the products of consecutive pairs of the first 1000 odd integers.

First, we’ll need the integers. We can start with 0, and work our way up to infinity. To save time printing an infinite number of integers, we’ll start with just the first 10.

user=> (take 10 (iterate inc 0)) (0 1 2 3 4 5 6 7 8 9)

Now we need to find only the ones which are odd. Remember, filter pares down a sequence to only those elements which pass a test.

user=> (take 10 (filter odd? (iterate inc 0))) (1 3 5 7 9 11 13 15 17 19)

For consecutive pairs, we want to take [1 3 5 7 ...] and find a sequence like ([1 3] [3 5] [5 7] ...). That sounds like a job for partition:

user=> (take 3 (partition 2 (filter odd? (iterate inc 0)))) ((1 3) (5 7) (9 11))

Not quite right–this gave us non-overlapping pairs, but we wanted overlapping ones too. A quick check of (doc partition) reveals the step parameter:

user=> (take 3 (partition 2 1 (filter odd? (iterate inc 0)))) ((1 3) (3 5) (5 7))

Now we need to find the product for each pair. Given a pair, multiply the two pieces together… yes, that sounds like map:

user=> (take 3 (map (fn [pair] (* (first pair) (second pair))) (partition 2 1 (filter odd? (iterate inc 0))))) (3 15 35)

Getting a bit unwieldy, isn’t it? Only one final step: sum all those products. We’ll adjust the take to include the first 1000, not the first 3, elements.

user=> (reduce + (take 1000 (map (fn [pair] (* (first pair) (second pair))) (partition 2 1 (filter odd? (iterate inc 0))))))
1335333000

The sum of the first thousand products of consecutive pairs of the odd integers starting at 0. See how each part leads to the next? This expression looks a lot like the way we phrased the problem in English–but both English and Lisp expressions are sort of backwards, in a way. The part that happens first appears deepest, last, in the expression. In a chain of reasoning like this, it’d be nicer to write it in order.

user=> (->> 0
            (iterate inc)
            (filter odd?)
            (partition 2 1)
            (map (fn [pair]
                   (* (first pair) (second pair))))
            (take 1000)
            (reduce +))
1335333000

Much easier to read: now everything flows in order, from top to bottom, and we’ve flattened out the deeply nested expressions into a single level. This is how object-oriented languages structure their expressions: as a chain of function invocations, each acting on the previous value.

But how is this possible? Which expression gets evaluated first? (take 1000) isn’t even a valid call–where’s its second argument? How are any of these forms evaluated?

What kind of arcane function is ->>?

All these mysteries, and more, in Chapter 5: Macros.

Problems

  1. Write a function to find out if a string is a palindrome–that is, if it looks the same forwards and backwards.
  2. Find the number of ‘c’s in “abracadabra”.
  3. Write your own version of filter.
  4. Find the first 100 prime numbers: 2, 3, 5, 7, 11, 13, 17, ….

We left off last chapter with a question: what are verbs, anyway? When you evaluate (type :mary-poppins), what really happens?

user=> (type :mary-poppins) clojure.lang.Keyword

To understand how type works, we’ll need several new ideas. First, we’ll expand on the notion of symbols as references to other values. Then we’ll learn about functions: Clojure’s verbs. Finally, we’ll use the Var system to explore and change the definitions of those functions.

Let bindings

We know that symbols are names for things, and that when evaluated, Clojure replaces those symbols with their corresponding values. +, for instance, is a symbol which points to the verb #<core$_PLUS_ clojure.core$_PLUS_@12992c>.

user=> + #<core$_PLUS_ clojure.core$_PLUS_@12992c>

When you try to use a symbol which has no defined meaning, Clojure refuses:

user=> cats
CompilerException java.lang.RuntimeException: Unable to resolve symbol: cats in this context, compiling:(NO_SOURCE_PATH:0:0)

But we can define a meaning for a symbol within a specific expression, using let.

user=> (let [cats 5] (str "I have " cats " cats.")) "I have 5 cats."

The let expression first takes a vector of bindings: alternating symbols and values that those symbols are bound to, within the remainder of the expression. “Let the symbol cats be 5, and construct a string composed of "I have ", cats, and " cats.".”

Let bindings apply only within the let expression itself. They also override any existing definitions for symbols at that point in the program. For instance, we can redefine addition to mean subtraction, for the duration of a let:

user=> (let [+ -] (+ 2 3)) -1

But that definition doesn’t apply outside the let:

user=> (+ 2 3) 5

We can also provide multiple bindings. Since Clojure doesn’t care about spacing, alignment, or newlines, I’ll write this on multiple lines for clarity.

user=> (let [person   "joseph"
             num-cats 186]
         (str person " has " num-cats " cats!"))
"joseph has 186 cats!"

When multiple bindings are given, they are evaluated in order. Later bindings can use previous bindings.

user=> (let [cats 3
             legs (* 4 cats)]
         (str legs " legs all together"))
"12 legs all together"

So fundamentally, let defines the meaning of symbols within an expression. When Clojure evaluates a let, it replaces all occurrences of those symbols in the rest of the let expression with their corresponding values, then evaluates the rest of the expression.

Functions

We saw in chapter one that Clojure evaluates lists by substituting some other value in their place:

user=> (inc 1) 2

inc takes any number, and is replaced by that number plus one. That sounds an awful lot like a let:

user=> (let [x 1] (+ x 1)) 2

If we bound x to 5 instead of 1, this expression would evaluate to 6. We can think about inc like a let expression, but without particular values provided for the symbols.

(let [x] (+ x 1))

We can’t actually evaluate this program, because there’s no value for x yet. It could be 1, or 4, or 1453. We say that x is unbound, because it has no binding to a particular value. This is the nature of the function: an expression with unbound symbols.

user=> (fn [x] (+ x 1)) #<user$eval293$fn__294 user$eval293$fn__294@663fc37>

Does the name of that function remind you of anything?

user=> inc #<core$inc clojure.core$inc@16bc0b3c>

Almost all verbs in Clojure are functions. Functions represent unrealized computation: expressions which are not yet evaluated, or incomplete. This particular function works just like inc: it’s an expression which has a single unbound symbol, x. When we invoke the function with a particular value, the expressions in the function are evaluated with x bound to that value.

user=> (inc 2)
3
user=> ((fn [x] (+ x 1)) 2)
3

We say that x is this function’s argument, or parameter. When Clojure evaluates (inc 2), we say that inc is called with 2, or that 2 is passed to inc. The result of that function invocation is the function’s return value. We say that (inc 2) returns 3.

Fundamentally, functions describe the relationship between arguments and return values: given 1, return 2. Given 2, return 3, and so on. Let bindings describe a similar relationship, but with a specific set of values for those arguments. let is evaluated immediately, whereas fn is evaluated later, when bindings are provided.

There’s a shorthand for writing functions, too: #(+ % 1) is equivalent to (fn [x] (+ x 1)). % takes the place of the first argument to the function. You’ll sometimes see %1, %2, etc. used for the first argument, second argument, and so on.

user=> (let [burrito #(list "beans" % "cheese")] (burrito "carnitas")) ("beans" "carnitas" "cheese")
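The numbered forms work the same way. For instance, a two-argument function:

user=> (#(+ %1 %2) 1 2)
3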

Since functions exist to defer evaluation, there’s no sense in creating and invoking them in the same expression as we’ve done here. What we want is to give names to our functions, so they can be recombined in different ways.

user=> (let [twice (fn [x] (* 2 x))] (+ (twice 1) (twice 3))) 8

Compare that expression to an equivalent, expanded form:

user=> (+ (* 2 1) (* 2 3))

The name twice is gone, and in its place is the same sort of computation–(* 2 something)–written twice. While we could represent our programs as a single massive expression, it’d be impossible to reason about. Instead, we use functions to compact redundant expressions, by isolating common patterns of computation. Symbols help us re-use those functions (and other values) in more than one place. By giving the symbols meaningful names, we make it easier to reason about the structure of the program as a whole; breaking it up into smaller, understandable parts.

This is the core pursuit of software engineering: organizing expressions. Almost every programming language is in search of the right tools to break apart, name, and recombine expressions to solve large problems. In Clojure we’ll see one particular set of tools for composing programs, but the underlying ideas will transfer to many other languages.

Vars

We’ve used let to define a symbol within an expression, but what about the default meanings of +, conj, and type? Are they also let bindings? Is the whole universe one giant let?

Well, not exactly. That’s one way to think about default bindings, but it’s brittle. We’d need to wrap our whole program in a new let expression every time we wanted to change the meaning of a symbol. And moreover, once a let is defined, there’s no way to change it. If we want to redefine symbols for everyone–even code that we didn’t write–we need a new construct: a mutable variable.

user=> (def cats 5)
#'user/cats
user=> (type #'user/cats)
clojure.lang.Var

def defines a type of value we haven’t seen before: a var. Vars, like symbols, are references to other values. When evaluated, a symbol pointing to a var is replaced by the var’s corresponding value:

user=> user/cats 5

def also binds the symbol cats (and its globally qualified equivalent user/cats) to that var.

user=> user/cats
5
user=> cats
5

When we said in chapter one that inc, list, and friends were symbols that pointed to functions, that wasn’t the whole story. The symbol inc points to the var #'inc, which in turn points to the function #<core$inc clojure.core$inc@16bc0b3c>. We can see the intermediate var with resolve:

user=> 'inc
inc                                   ; the symbol
user=> (resolve 'inc)
#'clojure.core/inc                    ; the var
user=> (eval 'inc)
#<core$inc clojure.core$inc@16bc0b3c> ; the value

Why two layers of indirection? Because unlike the symbol, we can change the meaning of a Var for everyone, globally, at any time.

user=> (def astronauts [])
#'user/astronauts
user=> (count astronauts)
0
user=> (def astronauts ["Sally Ride" "Guy Bluford"])
#'user/astronauts
user=> (count astronauts)
2

Notice that astronauts had two distinct meanings, depending on when we evaluated it. After the first def, astronauts was an empty vector. After the second def, it had two entries.

If this seems dangerous, you’re a smart cookie. Redefining names in this way changes the meaning of expressions everywhere in a program, without warning. Expressions which relied on the value of a Var could suddenly take on new, possibly incorrect, meanings. It’s a powerful tool for experimenting at the REPL, and for updating a running program, but it can have unexpected consequences. Good Clojurists use def to set up a program initially, and only change those definitions with careful thought.

Totally redefining a Var isn’t the only option. There are safer, controlled ways to change the meaning of a Var within a particular part of a program, which we’ll explore later.
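As a small preview of one such mechanism–we'll treat it properly later–a dynamic Var can be re-bound within a scope using binding, and reverts when that scope ends:

user=> (def ^:dynamic *greeting* "hello")
#'user/*greeting*
user=> (binding [*greeting* "howdy"] (str *greeting* ", world"))
"howdy, world"
user=> (str *greeting* ", world")
"hello, world"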

Defining functions

Armed with def, we’re ready to create our own named functions in Clojure.

user=> (def half (fn [number] (/ number 2)))
#'user/half
user=> (half 6)
3

Creating a function and binding it to a var is so common that it has its own form: defn, short for def fn.

user=> (defn half [number] (/ number 2)) #'user/half

Functions don’t have to take an argument. We’ve seen functions which take zero arguments, like (+).

user=> (defn half [] 1/2)
#'user/half
user=> (half)
1/2

But if we try to use our earlier form with one argument, Clojure complains that the arity–the number of arguments to the function–is incorrect.

user=> (half 10)
ArityException Wrong number of args (1) passed to: user$half  clojure.lang.AFn.throwArity (AFn.java:437)

To handle multiple arities, functions have an alternate form. Instead of an argument vector and a body, one provides a series of lists, each of which starts with an argument vector, followed by the body.

user=> (defn half
         ([]  1/2)
         ([x] (/ x 2)))
#'user/half
user=> (half)
1/2
user=> (half 10)
5

Multiple arguments work just like you expect. Just specify an argument vector of two, or three, or however many arguments the function takes.

user=> (defn add [x y] (+ x y))
#'user/add
user=> (add 1 2)
3

Some functions can take any number of arguments. For that, Clojure provides &, which slurps up all remaining arguments as a list:

user=> (defn vargs [x y & more-args]
         {:x    x
          :y    y
          :more more-args})
#'user/vargs
user=> (vargs 1)
ArityException Wrong number of args (1) passed to: user$vargs  clojure.lang.AFn.throwArity (AFn.java:437)
user=> (vargs 1 2)
{:x 1, :y 2, :more nil}
user=> (vargs 1 2 3 4 5)
{:x 1, :y 2, :more (3 4 5)}

Note that x and y are mandatory, though there don’t have to be any remaining arguments.

To keep track of what arguments a function takes, why the function exists, and what it does, we usually include a docstring. Docstrings help fill in the missing context around functions, to explain their assumptions and purpose to the world.

(defn launch "Launches a spacecraft into the given orbit by initiating a controlled on-axis burn. Does not automatically stage, but does vector thrust, if the craft supports it." [craft target-orbit] "OK, we don't know how to control spacecraft yet.")

Docstrings are used to automatically generate documentation for Clojure programs, but you can also access them from the REPL.

user=> (doc launch)
-------------------------
user/launch
([craft target-orbit])
  Launches a spacecraft into the given orbit by initiating a
  controlled on-axis burn. Does not automatically stage, but does
  vector thrust, if the craft supports it.
nil

doc tells us the full name of the function, the arguments it accepts, and its docstring. This information comes from the #'launch var’s metadata, and is saved there by defn. We can inspect metadata directly with the meta function:

(meta #'launch)
{:arglists ([craft target-orbit]),
 :ns #<Namespace user>,
 :name launch,
 :column 1,
 :doc "Launches a spacecraft into the given orbit.",
 :line 1,
 :file "NO_SOURCE_PATH"}

There’s some other juicy information in there, like the file the function was defined in and which line and column it started at, but that’s not particularly useful since we’re in the REPL, not a file. However, this does hint at a way to answer our motivating question: how does the type function work?

How does type work?

We know that type returns the type of an object:

user=> (type 2)
java.lang.Long

And that type, like all functions, is a kind of object with its own unique type:

user=> type
#<core$type clojure.core$type@39bda9b9>
user=> (type type)
clojure.core$type

This tells us that type is a particular instance, at memory address 39bda9b9, of the type clojure.core$type. clojure.core is a namespace which defines the fundamentals of the Clojure language, and $type tells us that it’s named type in that namespace. None of this is particularly helpful, though. Maybe we can find out more about the clojure.core$type by asking what its supertypes are:

user=> (supers (type type)) #{clojure.lang.AFunction clojure.lang.IMeta java.util.concurrent.Callable clojure.lang.Fn clojure.lang.AFn java.util.Comparator java.lang.Object clojure.lang.RestFn clojure.lang.IObj java.lang.Runnable java.io.Serializable clojure.lang.IFn}

This is a set of all the types that include type. We say that type is an instance of clojure.lang.AFunction, or that it implements or extends java.util.concurrent.Callable, and so on. Since it’s a member of clojure.lang.IMeta it has metadata, and since it’s a member of clojure.lang.AFn, it’s a function. Just to double check, let’s confirm that type is indeed a function:

user=> (fn? type) true

What about its documentation?

user=> (doc type)
-------------------------
clojure.core/type
([x])
  Returns the :type metadata of x, or its Class if none
nil

Ah, that’s helpful. type can take a single argument, which it calls x. If it has :type metadata, that’s what it returns. Otherwise, it returns the class of x. Let’s take a deeper look at type’s metadata for more clues.

user=> (meta #'type)
{:ns #<Namespace clojure.core>,
 :name type,
 :arglists ([x]),
 :column 1,
 :added "1.0",
 :static true,
 :doc "Returns the :type metadata of x, or its Class if none",
 :line 3109,
 :file "clojure/core.clj"}

Look at that! This function was first added to Clojure in version 1.0, and is defined in the file clojure/core.clj, on line 3109. We could go dig up the Clojure source code and read its definition there–or we could ask Clojure to do it for us:

user=> (source type)
(defn type
  "Returns the :type metadata of x, or its Class if none"
  {:added "1.0"
   :static true}
  [x]
  (or (get (meta x) :type) (class x)))
nil

Aha! Here, at last, is how type works. It’s a function which takes a single argument x, and returns either :type from its metadata, or (class x).

We can delve into any function in Clojure using these tools:

user=> (source +)
(defn +
  "Returns the sum of nums. (+) returns 0. Does not auto-promote
  longs, will throw on overflow. See also: +'"
  {:inline (nary-inline 'add 'unchecked_add)
   :inline-arities >1?
   :added "1.2"}
  ([] 0)
  ([x] (cast Number x))
  ([x y] (. clojure.lang.Numbers (add x y)))
  ([x y & more]
   (reduce1 + (+ x y) more)))
nil

Almost every function in a programming language is made up of other, simpler functions. +, for instance, is defined in terms of cast, add, and reduce1. Sometimes functions are defined in terms of themselves. + uses itself twice in this definition; a technique called recursion.

At the bottom, though, are certain fundamental constructs below which you can go no further. Core axioms of the language. Lisp calls these “special forms”. def and let are special forms (well–almost: let is a thin wrapper around let*, which is a special form) in Clojure. These forms are defined by the core implementation of the language, and are not reducible to other Clojure expressions.

user=> (source def) Source not found

Some Lisps are written entirely in terms of a few special forms, but Clojure is much less pure. Many functions bottom out in Java functions and types, or, for CLJS, in terms of Javascript. Any time you see an expression like (. clojure.lang.Numbers (add x y)), there’s Java code underneath. Below Java lies the JVM, which might be written in C or C++, depending on which one you use. And underneath C and C++ lie more libraries, the operating system, assembler, microcode, registers, and ultimately, electrons flowing through silicon.

A well-designed language isolates you from details you don’t need to worry about, like which logic gates or registers to use, and lets you focus on the task at hand. Good languages also need to allow escape hatches for performance or access to dangerous functionality, as we saw with Vars. You can write entire programs purely in terms of Clojure, but sometimes, for performance or to use tools from other languages, you’ll rely on Java. The Clojure code is easy to explore with doc and source, but Java can be more opaque–I usually rely on the Java source files and online documentation.

Review

We’ve seen how let associates names with values in a particular expression, and how Vars allow for mutable bindings which apply universally, and whose definitions can change over time. We learned that Clojure verbs are functions, which express the general shape of an expression but with certain values unbound. Invoking a function binds those variables to specific values, allowing evaluation of the function to proceed.

Functions decompose programs into simpler pieces, expressed in terms of one another. Short, meaningful names help us understand what those functions (and other values) mean.

Finally, we learned how to introspect Clojure functions with doc and source, and saw the definition of some basic Clojure functions. The Clojure cheatsheet gives a comprehensive list of the core functions in the language, and is a great starting point when you have to solve a problem but don’t know what functions to use.

We’ll see a broad swath of those functions in Chapter 4: Sequences.

My thanks to Zach Tellman, Kelly Sommers, and Michael R Bernstein for reviewing drafts of this chapter.

We’ve learned the basics of Clojure’s syntax and evaluation model. Now we’ll take a tour of the basic nouns in the language.

Types

We’ve seen a few different values already–for instance, nil, true, false, 1, 2.34, and "meow". Clearly all these things are different values, but some of them seem more alike than others.

For instance, 1 and 2 are very similar numbers; both can be added, divided, multiplied, and subtracted. 2.34 is also a number, and acts very much like 1 and 2, but it’s not quite the same. It’s got decimal points. It’s not an integer. And clearly true is not very much like a number. What is true plus one? Or false divided by 5.3? These questions are poorly defined.

We say that a type is a group of values which work in the same way. It’s a property that some values share, which allows us to organize the world into sets of similar things. 1 + 1 and 1 + 2 use the same addition, which adds together integers. Types also help us verify that a program makes sense: that you can only add together numbers, instead of adding numbers to porcupines.

Types can overlap and intersect each other. Cats are animals, and cats are fuzzy too. You could say that a cat is a member (or sometimes “instance”), of the fuzzy and animal types. But there are fuzzy things like moss which aren’t animals, and animals like alligators that aren’t fuzzy in the slightest.

Other types completely subsume one another. All tabbies are housecats, and all housecats are felidae, and all felidae are animals. Everything which is true of an animal is automatically true of a housecat. Hierarchical types make it easier to write programs which don’t need to know all the specifics of every value; and conversely, to create new types in terms of others. But they can also get in the way of the programmer, because not every useful classification (like “fuzziness”) is purely hierarchical. Expressing overlapping types in a hierarchy can be tricky.

Every language has a type system; a particular way of organizing nouns into types, figuring out which verbs make sense on which types, and relating types to one another. Some languages are strict, and others more relaxed. Some emphasize hierarchy, and others a more ad-hoc view of the world. We call Clojure’s type system strong in that operations on improper types are simply not allowed: the program will explode if asked to subtract a dandelion. We also say that Clojure’s types are dynamic because they are enforced when the program is run, instead of when the program is first read by the computer.
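For instance, subtracting from a string fails not when the program is read, but at the instant the bad operation runs. The error looks something like this (the exact trace depends on your Clojure version):

user=> (- "dandelion" 2)
ClassCastException java.lang.String cannot be cast to java.lang.Number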

We’ll learn more about the formal relationships between types later, but for now, keep this in the back of your head. It’ll start to hook in to other concepts later.

Integers

Let’s find the type of the number 3:

user=> (type 3) java.lang.Long

So 3 is a java.lang.Long, or a “Long”, for short. Because Clojure is built on top of Java, many of its types are plain old Java types.

Longs, internally, are represented as a group of sixty-four binary digits (ones and zeroes), written down in a particular pattern called signed two’s complement representation. You don’t need to worry about the specifics–there are only two things to remember about longs. First, longs use one bit to store the sign: whether the number is positive or negative. Second, the other 63 bits represent the size of the number. That means the biggest number you can represent with a long is 2^63 - 1 (the minus one is because of the number 0), and the smallest long is -2^63.

How big is 2^63 - 1?

user=> Long/MAX_VALUE 9223372036854775807

That’s a reasonably big number. Most of the time, you won’t need anything bigger, but… what if you did? What happens if you add one to the biggest Long?

user=> (inc Long/MAX_VALUE)
ArithmeticException integer overflow  clojure.lang.Numbers.throwIntOverflow (Numbers.java:1388)

An error occurs! This is Clojure telling us that something went wrong. The type of error was an ArithmeticException, and its message was “integer overflow”, meaning “this type of number can’t hold a number that big”. The error came from a specific place in the source code of the program: Numbers.java, on line 1388. That’s a part of the Clojure source code. Later, we’ll learn more about how to unravel error messages and find out what went wrong.

The important thing is that Clojure’s type system protected us from doing something dangerous; instead of returning a corrupt value, it aborted evaluation and returned an error.

If you do need to talk about really big numbers, you can use a BigInt: an arbitrary-precision integer. Let’s convert the biggest Long into a BigInt, then increment it:

user=> (inc (bigint Long/MAX_VALUE)) 9223372036854775808N

Notice the N at the end? That’s how Clojure writes arbitrary-precision integers.

user=> (type 5N) clojure.lang.BigInt

There are also smaller numbers.

user=> (type (int 0))
java.lang.Integer
user=> (type (short 0))
java.lang.Short
user=> (type (byte 0))
java.lang.Byte

Integers are half the size of Longs; they store values in 32 bits. Shorts are 16 bits, and Bytes are 8. That means their biggest values are 2^31-1, 2^15-1, and 2^7-1, respectively.

user=> Integer/MAX_VALUE
2147483647
user=> Short/MAX_VALUE
32767
user=> Byte/MAX_VALUE
127

Fractional numbers

To represent numbers between integers, we often use floating-point numbers, which can represent small numbers with fine precision, and large numbers with coarse precision. Floats use 32 bits, and Doubles use 64. Doubles are the default in Clojure.

user=> (type 1.23)
java.lang.Double
user=> (type (float 1.23))
java.lang.Float

Floating point math is complicated, and we won’t get bogged down in the details just yet. The important thing to know is floats and doubles are approximations. There are limits to their correctness:

user=> 0.99999999999999999
1.0

To represent fractions exactly, we can use the ratio type:

user=> (type 1/3) clojure.lang.Ratio
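Ratio arithmetic is exact; results stay as fractions in lowest terms:

user=> (+ 1/3 1/6)
1/2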

Mathematical operations

The exact behavior of mathematical operations in Clojure depends on their types. In general, though, Clojure aims to preserve information. Adding two longs returns a long; adding a double and a long returns a double.

user=> (+ 1 2)
3
user=> (+ 1 2.0)
3.0

3 and 3.0 are not the same number; one is a long, and the other a double. But for most purposes, they’re equivalent, and Clojure will tell you so:

user=> (= 3 3.0)
false
user=> (== 3 3.0)
true

= asks whether all the things that follow are equal. Since floats are approximations, = considers them different from integers. == also compares things, but a little more loosely: it considers integers equivalent to their floating-point representations.

We can also subtract with -, multiply with *, and divide with /.

user=> (- 3 1)
2
user=> (* 1.5 3)
4.5
user=> (/ 1 2)
1/2

Putting the verb first in each list allows us to add or multiply more than one number in the same step:

user=> (+ 1 2 3)
6
user=> (* 2 3 1/5)
6/5

Subtraction with more than 2 numbers subtracts all later numbers from the first. Division divides the first number by all the rest.

user=> (- 5 1 1 1)
2
user=> (/ 24 2 3)
4

By extension, we can define useful interpretations for numeric operations with just a single number:

user=> (+ 2)
2
user=> (- 2)
-2
user=> (* 4)
4
user=> (/ 4)
1/4

We can also add or multiply a list of no numbers at all, obtaining the additive and multiplicative identities, respectively. This might seem odd, especially coming from other languages, but we’ll see later that these generalizations make it easier to reason about higher-level numeric operations.

user=> (+)
0
user=> (*)
1
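This pays off as soon as operations meet possibly-empty collections. reduce–a function we'll meet in the chapter on sequences–falls back on exactly these identities when given nothing to combine:

user=> (reduce + [])
0
user=> (reduce * [])
1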

Often, we want to ask which number is bigger, or if one number falls between two others. <= means “less than or equal to”, and asserts that all following values are in order from smallest to biggest.

user=> (<= 1 2 3)
true
user=> (<= 1 3 2)
false

< means “strictly less than”, and works just like <=, except that no two values may be equal.

user=> (<= 1 1 2)
true
user=> (< 1 1 2)
false

Their friends > and >= mean “greater than” and “greater than or equal to”, respectively, and assert that numbers are in descending order.

user=> (> 3 2 1)
true
user=> (> 1 2 3)
false

Also commonly used are inc and dec, which increment and decrement a number by one, respectively:

user=> (inc 5)
6
user=> (dec 5)
4

One final note: equality tests can take more than 2 numbers as well.

user=> (= 2 2 2)
true
user=> (= 2 2 3)
false

Strings

We saw that strings are text, surrounded by double quotes, like "foo". Strings in Clojure are, like Longs, Doubles, and company, backed by a Java type:

user=> (type "cat") java.lang.String

We can make almost anything into a string with str. Strings, symbols, numbers, booleans; every value in Clojure has a string representation. Note that nil’s string representation is ""; an empty string.

user=> (str "cat") "cat" user=> (str 'cat) "cat" user=> (str 1) "1" user=> (str true) "true" user=> (str '(1 2 3)) "(1 2 3)" user=> (str nil) ""

str can also combine things together into a single string, which we call “concatenation”.

user=> (str "meow " 3 " times") "meow 3 times"

To look for patterns in text, we can use a regular expression, which is a tiny language for describing particular arrangements of text. re-find and re-matches look for occurrences of a regular expression in a string. To find a cat:

user=> (re-find #"cat" "mystic cat mouse")
"cat"
user=> (re-find #"cat" "only dogs here")
nil

That #"..." is Clojure’s way of writing a regular expression.

With re-matches, you can extract particular parts of a string which match an expression. Here we find two strings, separated by a :. The parentheses mean that the regular expression should capture that part of the match. We get back a list containing the part of the string that matched the first parentheses, followed by the part that matched the second parentheses.

user=> (rest (re-matches #"(.+):(.+)" "mouse:treat")) ("mouse" "treat")
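Note one important difference between the two: re-find looks for a match anywhere in the string, but re-matches only succeeds when the pattern matches the entire string.

user=> (re-matches #"cat" "mystic cat mouse")
nil
user=> (re-matches #"cat" "cat")
"cat"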

Regular expressions are a powerful tool for searching and matching text, especially when working with data files. Since regexes work the same in most languages, you can use any guide online to learn more. It’s not something you have to master right away; just learn specific tricks as you find you need them. For a deeper guide, try Fitzgerald’s Introducing Regular Expressions.

Booleans and logic

Everything in Clojure has a sort of charge, a truth value, sometimes called “truthiness”. true is positive and false is negative. nil is negative, too.

user=> (boolean true)
true
user=> (boolean false)
false
user=> (boolean nil)
false

Every other value in Clojure is positive.

user=> (boolean 0)
true
user=> (boolean 1)
true
user=> (boolean "hi there")
true
user=> (boolean str)
true

If you’re coming from a C-inspired language, where 0 is considered false, this might be a bit surprising. Likewise, in much of POSIX, 0 is considered success and nonzero values are failures. Lisp allows no such confusion: the only negative values are false and nil.

We can reason about truth values using and, or, and not. and returns the first negative value, or the last value if all are truthy.

user=> (and true false true)
false
user=> (and true true true)
true
user=> (and 1 2 3)
3

Similarly, or returns the first positive value–or, if every value is negative, the last one.

user=> (or false 2 3)
2
user=> (or false nil)
nil

And not inverts the logical sense of a value:

user=> (not 2)
false
user=> (not nil)
true

We’ll learn more about Boolean logic when we start talking about control flow; the way we alter evaluation of a program and express ideas like “if I’m a cat, then meow incessantly”.

Symbols

We saw symbols in the previous chapter; they’re bare strings of characters, like foo or +.

user=> (class 'str) clojure.lang.Symbol

Symbols can have either short or full names. The short name is used to refer to things locally. The fully qualified name is used to refer unambiguously to a symbol from anywhere. If I were a symbol, my name would be “Kyle”, and my full name “Kyle Kingsbury.”

Symbol names are separated with a /. For instance, the symbol str is also present in a family called clojure.core; the corresponding full name is clojure.core/str.

user=> (= str clojure.core/str)
true
user=> (name 'clojure.core/str)
"str"

When we talked about the maximum size of an integer, that was a fully-qualified symbol, too.

user=> (type 'Integer/MAX_VALUE)
clojure.lang.Symbol

The job of symbols is to refer to things, to point to other values. When evaluating a program, symbols are looked up and replaced by their corresponding values. That’s not the only use of symbols, but it’s the most common.

Keywords

Closely related to symbols and strings are keywords, which begin with a :. Keywords are like strings in that they’re made up of text, but are specifically intended for use as labels or identifiers. These aren’t labels in the sense of symbols: keywords aren’t replaced by any other value. They’re just names, by themselves.

user=> (type :cat)
clojure.lang.Keyword
user=> (str :cat)
":cat"
user=> (name :cat)
"cat"

As labels, keywords are most useful when paired with other values in a collection, like a map. Keywords can also be used as verbs to look up specific values in other data types. We’ll learn more about keywords shortly.
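For a taste of that: a keyword, used as a verb, looks itself up in a map.

user=> (:name {:name "Sophie", :species "cat"})
"Sophie"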

Lists

A collection is a group of values. It’s a container which provides some structure, some framework, for the things that it holds. We say that a collection contains elements, or members. We saw one kind of collection–a list–in the previous chapter.

user=> '(1 2 3)
(1 2 3)
user=> (type '(1 2 3))
clojure.lang.PersistentList

Remember, we quote lists with a ' to prevent them from being evaluated. You can also construct a list using list:

user=> (list 1 2 3)
(1 2 3)

Lists are comparable just like every other value:

user=> (= (list 1 2) (list 1 2))
true

You can modify a list by conjoining an element onto it:

user=> (conj '(1 2 3) 4)
(4 1 2 3)

We added 4 to the list–but it appeared at the front. Why? Internally, lists are stored as a chain of values: each link in the chain is a tiny box which holds the value and a connection to the next link. This data structure, called a linked list, offers immediate access to the first element.

user=> (first (list 1 2 3))
1

But getting to the second element requires an extra hop down the chain

user=> (second (list 1 2 3))
2

and the third element a hop after that, and so on.

user=> (nth (list 1 2 3) 2)
3

nth gets the element of an ordered collection at a particular index. The first element is index 0, the second is index 1, and so on.

This means that lists are well-suited for small collections, or collections which are read in linear order, but are slow when you want to get arbitrary elements from later in the list. For fast access to every element, we use a vector.

Vectors

Vectors are surrounded by square brackets, just like lists are surrounded by parentheses. Because vectors aren’t evaluated like lists are, there’s no need to quote them:

user=> [1 2 3]
[1 2 3]
user=> (type [1 2 3])
clojure.lang.PersistentVector

You can also create vectors with vector, or change other structures into vectors with vec:

user=> (vector 1 2 3)
[1 2 3]
user=> (vec (list 1 2 3))
[1 2 3]

conj on a vector adds to the end, not the start:

user=> (conj [1 2 3] 4)
[1 2 3 4]

Our friends first, second, and nth work here too; but unlike lists, nth is fast on vectors. That’s because internally, vectors are represented as a very broad tree of elements, where each part of the tree branches into 32 smaller trees. Even very large vectors are only a few layers deep, which means getting to elements only takes a few hops.

In addition to first, you’ll often want to get the remaining elements in a collection. There are two ways to do this:

user=> (rest [1 2 3])
(2 3)
user=> (next [1 2 3])
(2 3)

rest and next both return “everything but the first element”. They differ only by what happens when there are no remaining elements:

user=> (rest [1])
()
user=> (next [1])
nil

With nothing left, rest returns the empty list–logical true–while next returns nil–logical false. Each has its uses, but in almost every case they're equivalent–I interchange them freely.

We can get the final element of any collection with last:

user=> (last [1 2 3])
3

And figure out how big the vector is with count:

user=> (count [1 2 3])
3

Because vectors are intended for looking up elements by index, we can also use them directly as verbs:

user=> ([:a :b :c] 1)
:b

So we took the vector containing three keywords, and asked “What’s the element at index 1?” Lisp, like most (but not all!) modern languages, counts up from zero, not one. Index 0 is the first element, index 1 is the second element, and so on. In this vector, finding the element at index 1 evaluates to :b.

Finally, note that vectors and lists containing the same elements are considered equal in Clojure:

user=> (= '(1 2 3) [1 2 3])
true

In almost all contexts, you can consider vectors, lists, and other sequences as interchangeable. They only differ in their performance characteristics, and in a few data-structure-specific operations.

Sets

Sometimes you want an unordered collection of values; especially when you plan to ask questions like “does the collection have the number 3 in it?” Clojure, like most languages, calls these collections sets.

user=> #{:a :b :c}
#{:a :c :b}

Sets are surrounded by #{...}. Notice that though we gave the elements :a, :b, and :c, they came out in a different order. In general, the order of sets can shift at any time. If you want a particular order, you can ask for it as a list or vector:

user=> (vec #{:a :b :c})
[:a :c :b]

Or ask for the elements in sorted order:

user=> (sort #{:a :b :c})
(:a :b :c)

conj on a set adds an element:

user=> (conj #{:a :b :c} :d)
#{:a :c :b :d}
user=> (conj #{:a :b :c} :a)
#{:a :c :b}

Sets never contain an element more than once, so conjing an element which is already present does nothing. Conversely, one removes elements with disj:

user=> (disj #{"hornet" "hummingbird"} "hummingbird")
#{"hornet"}

The most common operation with a set is to check whether something is inside it. For this we use contains?.

user=> (contains? #{1 2 3} 3)
true
user=> (contains? #{1 2 3} 5)
false

Like vectors, you can use the set itself as a verb. Unlike contains?, this expression returns the element itself (if it was present), or nil.

user=> (#{1 2 3} 3)
3
user=> (#{1 2 3} 4)
nil

You can make a set out of any other collection with set.

user=> (set [:a :b :c])
#{:a :c :b}

Maps

The last collection on our tour is the map: a data structure which associates keys with values. In a dictionary, the keys are words and the definitions are the values. In a library, keys are call signs, and the books are values. Maps are indexes for looking things up, and for representing different pieces of named information together. Here’s a cat:

user=> {:name "mittens" :weight 9 :color "black"}
{:weight 9, :name "mittens", :color "black"}

Maps are surrounded by braces {...}, filled by alternating keys and values. In this map, the three keys are :name, :color, and :weight, and their values are "mittens", "black", and 9, respectively. We can look up the corresponding value for a key with get:

user=> (get {"cat" "meow" "dog" "woof"} "cat")
"meow"
user=> (get {:a 1 :b 2} :c)
nil

get can also take a default value to return instead of nil, if the key doesn’t exist in that map.

user=> (get {:glinda :good} :wicked :not-here)
:not-here

Since lookups are so important for maps, we can use a map as a verb directly:

user=> ({"amlodipine" 12 "ibuprofen" 50} "ibuprofen")
50

And conversely, keywords can also be used as verbs, which look themselves up in maps:

user=> (:raccoon {:weasel "queen" :raccoon "king"})
"king"

You can add a value for a given key to a map with assoc.

user=> (assoc {:bolts 1088} :camshafts 3)
{:camshafts 3 :bolts 1088}
user=> (assoc {:camshafts 3} :camshafts 2)
{:camshafts 2}

Assoc adds keys if they aren’t present, and replaces values if they’re already there. If you associate a value onto nil, it creates a new map.

user=> (assoc nil 5 2)
{5 2}

You can combine maps together using merge, which yields a map containing all the elements of all given maps, preferring the values from later ones.

user=> (merge {:a 1 :b 2} {:b 3 :c 4})
{:c 4, :a 1, :b 3}

Finally, to remove a value, use dissoc.

user=> (dissoc {:potatoes 5 :mushrooms 2} :mushrooms)
{:potatoes 5}

Putting it all together

All these collections and types can be combined freely. As software engineers, we model the world by creating a particular representation of the problem in the program. Having a rich set of values at our disposal allows us to talk about complex problems. We might describe a person:

{:name "Amelia Earhart"
 :birth 1897
 :death 1939
 :awards {"US"    #{"Distinguished Flying Cross"
                    "National Women's Hall of Fame"}
          "World" #{"Altitude record for Autogyro"
                    "First to cross Atlantic twice"}}}

Or a recipe:

{:title "Chocolate chip cookies"
 :ingredients {"flour"           [(+ 2 1/4) :cup]
               "baking soda"     [1 :teaspoon]
               "salt"            [1 :teaspoon]
               "butter"          [1 :cup]
               "sugar"           [3/4 :cup]
               "brown sugar"     [3/4 :cup]
               "vanilla"         [1 :teaspoon]
               "eggs"            2
               "chocolate chips" [12 :ounce]}}

Or the Gini coefficients of nations, as measured over time:

{"Afghanistan" {2008 27.8}
 "Indonesia"   {2008 34.1 2010 35.6 2011 38.1}
 "Uruguay"     {2008 46.3 2009 46.3 2010 45.3}}

In Clojure, we compose data structures to form more complex values; to talk about bigger ideas. We use operations like first, nth, get, and contains? to extract specific information from these structures, and modify them using conj, disj, assoc, dissoc, and so on.
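For instance, get-in reaches into nested maps by following a sequence of keys–handy for structures like the Gini data above:

user=> (get-in {"Indonesia" {2008 34.1 2010 35.6 2011 38.1}} ["Indonesia" 2010])
35.6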

We started this chapter with a discussion of types: groups of similar objects which obey the same rules. We learned that bigints, longs, ints, shorts, and bytes are all integers, that doubles and floats are approximations to decimal numbers, and that ratios represent fractions exactly. We learned the differences between strings for text, symbols as references, and keywords as short labels. Finally, we learned how to compose, alter, and inspect collections of elements. Armed with the basic nouns of Clojure, we’re ready to write a broad array of programs.

I’d like to conclude this tour with one last type of value. We’ve inspected dozens of types so far–but what happens when you turn the camera on itself?

user=> (type type)
clojure.core$type

What is this type thing, exactly? What are these verbs we’ve been learning, and where do they come from? This is the central question of chapter three: functions.

This guide aims to introduce newcomers and experienced programmers alike to the beauty of functional programming, starting with the simplest building blocks of software. You’ll need a computer, basic proficiency in the command line, a text editor, and an internet connection. By the end of this series, you’ll have a thorough command of the Clojure programming language.

Who is this guide for?

Science, technology, engineering, and mathematics are deeply rewarding fields, yet few women enter STEM as a career path. Still more are discouraged by a culture which repeatedly asserts that women lack the analytic aptitude for writing software, that they are not driven enough to be successful scientists, that it’s not cool to pursue a passion for structural engineering. Those few with the talent, encouragement, and persistence to break in to science and tech are discouraged by persistent sexism in practice: the old boy’s club of tenure, being passed over for promotions, isolation from peers, and flat-out assault. This landscape sucks. I want to help change it.

Women Who Code, PyLadies, Black Girls Code, RailsBridge, Girls Who Code, Girl Develop It, and Lambda Ladies are just a few of the fantastic groups helping women enter and thrive in software. I wholeheartedly support these efforts.

In addition, I want to help in my little corner of the technical community–functional programming and distributed systems–by making high-quality educational resources available for free. The Jepsen series has been, in part, an effort to share my enthusiasm for distributed systems with beginners of all stripes–but especially for women, LGBT folks, and people of color.

As technical authors, we often assume that our readers are white, that our readers are straight, that our readers are traditionally male. This is the invisible default in US culture, and it’s especially true in tech. People continue to assume on the basis of my software and writing that I’m straight, because well hey, it’s a statistically reasonable assumption.

But I’m not straight. I get called faggot, cocksucker, and sinner. People say they’ll pray for me. When I walk hand-in-hand with my boyfriend, people roll down their car windows and stare. They threaten to beat me up or kill me. Every day I’m aware that I’m the only gay person some people know, and that I can show that not all gay people are effeminate, or hypermasculine, or ditzy, or obsessed with image. That you can be a manicurist or a mathematician or both. Being different, being a stranger in your culture, comes with all kinds of challenges. I can’t speak to everyone’s experience, but I can take a pretty good guess.

At the same time, in the technical community I’ve found overwhelming warmth and support, from people of all stripes. My peers stand up for me every day, and I’m so thankful–especially you straight dudes–for understanding a bit of what it’s like to be different. I want to extend that same understanding, that same empathy, to people unlike myself. Moreover, I want to reassure everyone that though they may feel different, they do have a place in this community.

So before we begin, I want to reinforce that you can program, that you can do math, that you can design car suspensions and fire suppression systems and spacecraft control software and distributed databases, regardless of what your classmates and media and even fellow engineers think. You don’t have to be white, you don’t have to be straight, you don’t have to be a man. You can grow up never having touched a computer and still become a skilled programmer. Yeah, it’s harder–and yeah, people will give you shit, but that’s not your fault and has nothing to do with your ability or your right to do what you love. All it takes to be a good engineer, scientist, or mathematician is your curiosity, your passion, the right teaching material, and putting in the hours.

There’s nothing in this guide that’s just for lesbian grandmas or just for mixed-race kids; bros, you’re welcome here too. There’s nothing dumbed down. We’re gonna go as deep into the ideas of programming as I know how to go, and we’re gonna do it with everyone on board.

No matter who you are or who people think you are, this guide is for you.

Why Clojure?

This book is about how to program. We’ll be learning in Clojure, which is a modern dialect of a very old family of computer languages, called Lisp. You’ll find that many of this book’s ideas will translate readily to other languages; though they may be expressed in different ways.

We’re going to explore the nature of syntax, metalanguages, values, references, mutation, control flow, and concurrency. Many languages leave these ideas implicit in the language construction, or don’t have a concept of metalanguages or concurrency at all. Clojure makes these ideas explicit, first-class language constructs.

At the same time, we’re going to defer or omit any serious discussion of static type analysis, hardware, and performance. This is not to say that these ideas aren’t important; just that they don’t fit well within this particular narrative arc. For a deep exploration of type theory I recommend a study in Haskell, and for a better understanding of underlying hardware, learning C and an assembly language will undoubtedly help.

In more general terms, Clojure is a well-rounded language. It offers broad library support and runs on multiple operating systems. Clojure performance is not terrific, but is orders of magnitude faster than Ruby, Python, or Javascript. Unlike some faster languages, Clojure emphasizes safety in its type system and approach to parallelism, making it easier to write correct multithreaded programs. Clojure is concise, requiring very little code to express complex operations. It offers a REPL and dynamic type system: ideal for beginners to experiment with, and well-suited for manipulating complex data structures. A consistently designed standard library and full-featured set of core datatypes rounds out the Clojure toolbox.

Finally, there are some drawbacks. As a compiled language, Clojure is much slower to start than a scripting language; this makes it unsuitable for writing small scripts for interactive use. Clojure is also not well-suited for high-performance numeric operations. Though it is possible, you have to jump through hoops to achieve performance comparable with Java. I’ll do my best to call out these constraints and shortcomings as we proceed through the text.

With that context out of the way, let’s get started by installing Clojure!

Getting set up

First, you’ll need a Java Virtual Machine, or JVM, and its associated development tools, called the JDK. This is the software which runs a Clojure program. If you’re on Windows, install Oracle JDK 1.7. If you’re on OS X or Linux, you may already have a JDK installed. In a terminal, try:

which javac

If you see something like

/usr/bin/javac

Then you’re good to go. If you don’t see any output from that command, install the appropriate Oracle JDK 1.7 for your operating system, or whatever JDK your package manager has available.

When you have a JDK, you’ll need Leiningen, the Clojure build tool. If you’re on a Linux or OS X computer, the instructions below should get you going right away. If you’re on Windows, see the Leiningen page for an installer. If you get stuck, you might want to start with a primer on command line basics.

mkdir -p ~/bin
cd ~/bin
curl -O https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein
chmod a+x lein

Leiningen automatically handles installing Clojure, finding libraries from the internet, and building and running your programs. We’ll create a new Leiningen project to play around in:

cd
lein new scratch

This creates a new directory in your homedir, called scratch. If you see command not found instead, it means the directory ~/bin isn’t registered with your terminal as a place to search for programs. To fix this, add the line

export PATH="$PATH":~/bin

to the file .bash_profile in your home directory, then run source ~/.bash_profile. Re-running lein new scratch should work.

Let’s enter that directory, and start using Clojure itself:

cd scratch
lein repl

The structure of programs

When you type lein repl at the terminal, you’ll see something like this:

aphyr@waterhouse:~/scratch$ lein repl
nREPL server started on port 45413
REPL-y 0.2.0
Clojure 1.5.1
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
 Javadoc: (javadoc java-object-or-class-here)
    Exit: Control+D or (exit) or (quit)
user=>

This is an interactive Clojure environment called a REPL, for “Read, Evaluate, Print Loop”. It’s going to read a program we enter, run that program, and print the results. REPLs give you quick feedback, so they’re a great way to explore a program interactively, run tests, and prototype new ideas.

Let’s write a simple program. The simplest, in fact. Type “nil”, and hit enter.

user=> nil
nil

nil is the most basic value in Clojure. It represents emptiness, nothing-doing, not-a-thing. The absence of information.

user=> true
true
user=> false
false

true and false are a pair of special values called Booleans. They mean exactly what you think: whether a statement is true or false. true, false, and nil form the three poles of the Lisp logical system.

user=> 0
0

This is the number zero. Its numeric friends are 1, -47, 1.2e-4, 1/3, and so on. We might also talk about strings, which are chunks of text surrounded by double quotes:

user=> "hi there!"
"hi there!"

nil, true, 0, and "hi there!" are all different types of values; the nouns of programming. Just as one could say “House.” in English, we can write a program like "hello, world" and it evaluates to itself: the string "hello, world". But most sentences aren’t just about stating the existence of a thing; they involve action. We need verbs.

user=> inc
#<core$inc clojure.core$inc@6f7ef41c>

This is a verb called inc–short for “increment”. Specifically, inc is a symbol which points to a verb: #<core$inc clojure.core$inc@6f7ef41c>– just like the word “run” is a name for the concept of running.

There’s a key distinction here–that a signifier, a reference, a label, is not the same as the signified, the referent, the concept itself. If you write the word “run” on paper, the ink means nothing by itself. It’s just a symbol. But in the mind of a reader, that symbol takes on meaning; the idea of running.

Unlike the number 0, or the string “hi”, symbols are references to other values. When Clojure evaluates a symbol, it looks up that symbol’s meaning. Look up inc, and you get #<core$inc clojure.core$inc@6f7ef41c>.

Can we refer to the symbol itself, without looking up its meaning?

user=> 'inc
inc

Yes. The single quote ' escapes a sentence. In programming languages, we call sentences expressions or statements. A quote says “Rather than evaluating this expression’s text, simply return the text itself, unchanged.” Quote a symbol, get a symbol. Quote a number, get a number. Quote anything, and get it back exactly as it came in.

user=> '123
123
user=> '"foo"
"foo"
user=> '(1 2 3)
(1 2 3)

A new kind of value, surrounded by parentheses: the list. LISP originally stood for LISt Processing, and lists are still at the core of the language. In fact, they form the most basic way to compose expressions, or sentences. A list is a single expression which has multiple parts. For instance, this list contains three elements: the numbers 1, 2, and 3. Lists can contain anything: numbers, strings, even other lists:

user=> '(nil "hi")
(nil "hi")

A list containing two elements: the number 1, and a second list. That list contains two elements: the number 2, and another list. That list contains two elements: 3, and an empty list.

user=> '(1 (2 (3 ())))
(1 (2 (3 ())))

You could think of this structure as a tree–which is a provocative idea, because languages are like trees too: sentences are comprised of clauses, which can be nested, and each clause may have subjects modified by adjectives, and verbs modified by adverbs, and so on. “Lindsay, my best friend, took the dog which we found together at the pound on fourth street, for a walk with her mother Michelle.”

[Image: parse tree of the sentence above]

But let’s try something simpler. Something we know how to talk about. “Increment the number zero.” As a tree:

[Image: "increment the number zero" as a tree]

We have a symbol for incrementing, and we know how to write the number zero. Let’s combine them in a list:

user=> '(inc 0)
(inc 0)

A basic sentence. Remember, since it’s quoted, we’re talking about the tree, the text, the expression, by itself. Absent interpretation. If we remove the single-quote, Clojure will interpret the expression:

user=> (inc 0)
1

Incrementing zero yields one. And if we wanted to increment that value?

[Image: (inc (inc 0)) as a tree]

user=> (inc (inc 0))
2

A sentence in Lisp is a list. It starts with a verb, and is followed by zero or more objects for that verb to act on. Each part of the list can itself be another list, in which case that nested list is evaluated first, just like a nested clause in a sentence. When we type

(inc (inc 0))

Clojure first looks up the meanings for the symbols in the code:

(#<core$inc clojure.core$inc@6f7ef41c> (#<core$inc clojure.core$inc@6f7ef41c> 0))

Then evaluates the innermost list (inc 0), which becomes the number 1:

(#<core$inc clojure.core$inc@6f7ef41c> 1)

Finally, it evaluates the outer list, incrementing the number 1:

2

Every list starts with a verb. Parts of a list are evaluated from left to right. Innermost lists are evaluated before outer lists.

(+ 1 (- 5 2) (+ 3 4))
(+ 1 3       (+ 3 4))
(+ 1 3       7)
11

That’s it.

The entire grammar of Lisp: the structure for every expression in the language. We transform expressions by substituting meanings for symbols, and obtain some result. This is the core of the Lambda Calculus, and it is the theoretical basis for almost all computer languages. Ruby, Javascript, C, Haskell; all languages express the text of their programs in different ways, but internally all construct a tree of expressions. Lisp simply makes it explicit.

Review

We started by learning a few basic nouns: numbers like 5, strings like "cat", and symbols like inc and +. We saw how quoting makes the difference between an expression itself and the thing it evaluates to. We discovered symbols as names for other values, just like how words represent concepts in any other language. Finally, we combined lists to make trees, and used those trees to represent a program.

With these basic elements of syntax in place, it’s time to expand our vocabulary with new verbs and nouns; learning to represent more complex values and transform them in different ways.

Riemann 0.2.0 is ready. There's so much left that I want to build, but this release includes a ton of changes that should improve usability for everyone, and I'm excited to announce its release.

Version 0.2.0 is a fairly major improvement in Riemann's performance and capabilities. Many things have been solidified, expanded, or tuned, and there are a few completely new ideas as well. There are a few minor API changes, mostly to internal structure–but a few streams are involved as well. Most functions will continue to work normally, but log a deprecation notice when used.

I dedicated the past six months to working on Riemann full-time. I was fortunate to receive individual donations as well as formal contracts with Blue Mountain Capital, SevenScale, and Iovation during that time. That money gave me months of runway to help make these improvements–but even more valuable was the feedback I received from production users, big and small. I've used your complaints, frustrations, and ideas to plan Riemann's roadmap, and I hope this release reflects that.

This release includes contributions from a broad cohort of open-source developers, and I want to recognize everyone who volunteered their time and energy to make Riemann better. In particular, I'd like to call out Pierre-Yves Ritschard, lwf, Ben Black, Thomas Omans, Dave Cottlehuber, and, well, the list goes on and on. You rock.

These months have seen not only improvements to Riemann itself, but to the dashboard, clients, and integration packages. While I'm spending most of my time working on the core Riemann server, it's really this peripheral software that makes Riemann useful for instrumenting production systems. There's no way I could hope to understand, let alone write and test, the code to integrate with all these technologies–which makes your work particularly valuable.

This week I started my new job at Factual. I won't be able to work 10 hours each day on Riemann any more, but I'm really happy with what we've built together, and I'll definitely keep working on the next release.

To all Riemann's users and contributors, thank you. Here's to 0.2.0.

New features

  • Arbitrary key-value (string) pairs on events
  • Hot config reloading
  • Integrated nrepl server
  • streams/sdo: bind together multiple streams as one
  • streams/split: like (cond), dispatch an event to the first matching stream
  • streams/splitp: like split, but on the basis of a specific predicate
  • config/delete-from-index: explicitly remove (similar) events from the index
  • streams/top: streaming top-k
  • streams/tag: add tags to events
  • RPM packaging
  • Init scripts, proper log dirs, and users for debian and RPM packages. Yeah, this means you can /etc/init.d/riemann reload, and Stuff Just Works ™.
  • folds/difference, product, and quotient.
  • Folds come in sloppy and strict variants which should “Do What I Mean” in most contexts.
  • Executor Services for asynchronous queued processing of events.
  • streams/exception-stream: captures exceptions and converts them to events.

Improvements

  • http://riemann.io site
  • Lots more documentation and examples
  • Config file syntax errors are detected early
  • Cleaned up server logging
  • Helpful messages (line numbers! filenames!) for configuration errors
  • Silence closed channel exceptions
  • Cores can preserve services like pubsub, the index, etc through reloads
  • Massive speedups in TCP and UDP server throughput
  • streams/rate works in real-time: no need for fill-in any more
  • Graphite client is faster, more complete
  • Config files can include other files by relative path
  • streams/coalesce passes on expired events
  • riemann.email/mailer can take custom :subject and :body functions
  • riemann.config includes some common time/scheduling functions
  • streams/where returns whether it matched an event, which means (where) is now re-usable as a predicate in lots of different contexts.
  • streams/tagged-any and tagged-all return whether they matched
  • streams/counter is resettable to a particular metric, and supports expiry
  • Bring back “hyperspace core online”
  • Update to netty 3.6.1
  • Reduced the number of threadpools used by the servers
  • Massive speedup in Netty performance by re-organizing execution handlers
  • core/reaper takes a :keep-keys option to specify which fields on an event are preserved
  • streams/smap ignores nil values for better use with folds
  • Update to aleph 0.3.0-beta15
  • Config files ship with emacs modelines, too

Bugfixes

  • Fixed a bug in part-time-fast causing undercounting under high contention
  • Catch exceptions while processing expired events
  • Fix a bug escaping metric names for librato
  • riemann.email/mailer can talk to SMTP relays again
  • graphite-path-percentiles will convert decimals of three or more places to percentile strings
  • streams/rollup is much more efficient; doesn't leak tasks
  • streams/rollup aggregates and forwards expired events instead of stopping
  • Fixed a threadpool leak from Netty
  • streams/coalesce: fixed a bug involving lazy persistence of transients
  • streams/ddt: fixed a few edge cases

Internals

  • Cleaned up the test suite's logging
  • Pluggable transports for netty servers
  • Cores are immutable
  • Service protocol: provides lifecycle management for internal components
  • Tests for riemann.config
  • riemann.periodic is gone; replaced by riemann.time
  • Tried to clean up some duplicated functions between core, config, and streams
  • riemann.common/deprecated
  • Cleaned up riemann.streams, removing unused commented-out code
  • Lots of anonymous functions have names now, to help with profiling
  • Composing netty pipeline factories is much simpler
  • Clojure 1.5

Known bugs

  • Passing :host to websocket-server does nothing: it binds to * regardless.
  • Folds/mean throws when it receives empty lists
  • graphite-server has no tests
  • Riemann will happily overload browsers via websockets
  • streams/rate doesn't stop its internal poller correctly when self-expiring
  • When Netty runs out of filehandles, it'll hang new connections

The Netty redesign of riemann-java-client made it possible to expose an end-to-end asynchronous API for writes, which dramatically improves throughput for messages with a small number of events. By introducing a small queue of pipelined write promises, riemann-clojure-client can now push 65K events per second, as individual messages, over a single TCP socket. That works out to about 120 Mbps of sustained traffic.

[Image: single-events.png]

I'm really happy about the bulk throughput too: three threads using a single socket, sending messages of 100 events each, can push around 185-200K events/sec, at over 200 Mbps. That throughput took 10 sockets and hundreds of threads to achieve in earlier tests.

[Image: bulk.png]

This isn't a particularly useful feature as far as clients go; it's unlikely most users will want to push this much from a single client. It is critical, however, for optimizing Riemann's server performance. The server, running the bulk test, consumes about 115% CPU on my 2.5 GHz Q8300. I believe this puts a million events/sec within reach for production hardware, though at that throughput CAS contention in the streams may become a limiting factor. If I can find a box (and network) powerful enough to test, I'd love to give it a shot!

This is the last major improvement for Riemann 0.2.0. I'll be focusing on packaging and documentation tomorrow. :)

In the previous post, I described an approximation of Heroku's Bamboo routing stack, based on their blog posts. Hacker News, as usual, is outraged that the difficulty of building fast, reliable distributed systems could prevent Heroku from building a magically optimal architecture. Coda Hale quips:

Really enjoying @RapGenius’s latest mix tape, “I Have No Idea How Distributed Systems Work”.

Coda understands the implications of the CAP theorem. This job is too big for one computer–any routing system we design must be distributed. Distribution increases the probability of a failure, both in nodes and in the network itself. These failures are usually partial, and often take the form of degradation rather than the system failing as a whole. Two nodes may be unable to communicate with each other, though a client can see both. Nodes can lie to each other. Time can flow backwards.

CAP tells us that under these constraints, we can pick two of three properties (and I'm going to butcher them in an attempt to be concise):

  1. Consistency: nodes agree on the system's state.
  2. Availability: the system accepts requests.
  3. Partition tolerance: the system runs even when the network delays or drops some messages.

In the real world, partitions are common, and failing to operate during a partition is essentially a failure of availability. We must choose CP or AP, or some probabilistic blend of the two.

There's a different way to talk about the properties of a distributed system–and I think Peter Bailis explains it well. Liveness means that at every point, there exists a sequence of operations that allows the “right thing” to happen–e.g. “threads are never deadlocked” or “you never get stuck in an infinite loop”. Safety means the system never does anything bad. Together, safety and liveness ensure the system does good things on time.

With this in mind, what kind of constraints apply to HTTP request routing?

  1. The system must be partition tolerant.
  2. The system must be available–as much as possible, anyway. Serving web pages slower is preferable to not serving them at all. In the language of CAP, our system must be AP.
  3. But we can't wait too long, because requests which take more than a minute to complete are essentially useless. We have a liveness constraint.
  4. Requests must complete correctly, or not at all. We can't route an HTTP POST to multiple servers at once, or drop pieces of requests on the floor. We have a safety constraint.

It's impossible to do this perfectly. If all of our data centers are nuked, there's no way we can remain available. If the network lies to us, it can be impractical to guarantee correct responses. And we can let latencies rise to accommodate failure: the liveness constraint is flexible.

Finally, we're real engineers. We're going to make mistakes. We have limited time and money, limited ability to think, and must work with existing systems which were never designed for the task at hand. Complex algorithms are extraordinarily difficult to prove–let alone predict–at scale, or under the weird failure modes of distributed systems. This means it's often better to choose a dumb but predictable algorithm over an optimal but complex one.

What I want to make clear is that Heroku is full of smart engineers–and if they're anything like the engineers I know, they're trying their hardest to adapt to a rapidly changing problem, fighting fires and designing new systems at the same time. Their problems don't look anything like yours or mine. Their engineering decisions are driven by complex and shifting internal constraints which we can't really analyze or predict. When I talk about “improved routing models” or “possible alternatives”, please understand that those models may be too complex, incompatible, or unpredictable to build in a given environment.

Dealing with unreliability

Returning to our Bamboo stack simulation, I'd like to start by introducing failure dynamics.

Real nodes fail. We'll make our dynos unreliable with the faulty function, which simulates a component which stays online for an exponentially-distributed time before crashing, then returns error responses instead of allowing requests to pass through. After another exponentially-distributed outage time, it recovers, and the process continues. You can interpret this as a physical piece of hardware, or a virtual machine, or a hot-spare scenario where another node spins up to take the downed one's place, etc. This is a fail-fast model–the node returns failure immediately instead of swallowing messages indefinitely. Since the simulations we're running are short-lived, I'm going to choose relatively short failure times so we can see what happens under changing dynamics.

(defn faulty-dyno []
  (cable 2
    ; Mean time before failure of 20 seconds, and
    ; mean time before resolution of one second.
    (faulty 20000 1000
      (queue-exclusive
        (delay-fixed 20
          (delay-exponential 100
            (server :rails)))))))

Again, we're using a pool of 250 dynos and a Poisson-distributed load function. Let's compare a min-conns load balancer over a pool of ideal dynos vs a pool of faulty ones:

(test-node "Reliable min-conn -> pool of faulty dynos."
  (lb-min-conn
    (pool pool-size (faulty-dyno))))

                  Ideal dynos        95% available dynos
Total reqs:       100000             100000
Selected reqs:    50000              50000
Successful frac:  1.0                0.62632
Request rate:     678.2972 reqs/s    679.6156 reqs/s
Response rate:    673.90894 reqs/s   676.74567 reqs/s
Latency distribution:
Min:              24.0               4.0
Median:           93.0               46.5
95th %:           323.0              272.0
99th %:           488.0              438.0
Max:              1044.0             914.0

Well that was unexpected. Even though our pool is 95% available, over a third of all requests fail. Because our faulty nodes fail immediately, they have smaller queues on average–and the min-conns load balancer routes more requests to them. Real load balancers like HAProxy keep track of which nodes fail and avoid routing requests to them. HAProxy uses active health checks, but for simplicity I'll introduce a passive scheme: when a request fails, don't decrement that host's connection counter immediately. Instead, wait for a while–say 1 second, the mean time to resolution for a given dyno. We can still return the error response immediately, so this doesn't stop the load balancer from failing fast, but it will reduce the probability of assigning requests to broken nodes.

(lb-min-conn :lb {:error-hold-time 1000}
  (pool pool-size (faulty-dyno)))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.98846
Request rate:     678.72076 reqs/s
Response rate:    671.3302 reqs/s
Latency distribution:
Min:              4.0
Median:           92.0
95th %:           323.0
99th %:           486.0
Max:              1157.0

Throughput is slightly lower than the ideal, perfect pool of dynos, but we've achieved 98.8% reliability over a pool of nodes which is only 95% available, and done it without any significant impact on latencies. This system is more than the sum of its parts.

This system has an upper bound on its reliability: some requests must fail in order to determine which dynos are available. Can we do better? Let's wrap the load balancer with a system that retries requests on error, up to three requests total:

(test-node "Retry -> min-conn -> faulty pool"
  (retry 3
    (lb-min-conn :lb {:error-hold-time 1000}
      (pool pool-size (faulty-dyno)))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99996
Request rate:     676.8098 reqs/s
Response rate:    670.16046 reqs/s
Latency distribution:
Min:              12.0
Median:           94.0
95th %:           320.0
99th %:           484.0
Max:              944.0

The combination of retries, least-conns balancing, and diverting requests away from failing nodes allows us to achieve 99.996% availability with minimal latency impact. This is a great building block to work with. Now let's find a way to compose it into a large-scale distributed system.

Multilayer routing

Minimum-connections and round-robin load balancers require coordinated state. If the machines which comprise our load balancer are faulty, we might try to distribute the load balancer itself in a highly available fashion. That would require state coordination with low latency bounds–and the CAP theorem tells us this is impossible to do. We'd need to make probabilistic tradeoffs under partitions, like allowing multiple requests to flow to the same backend.

What if we punt on AP min-conns load balancers? What if we make them single machines, or CP clusters? As soon as the load balancer encountered a problem, it would become completely unavailable.

(defn faulty-lb [pool]
  (faulty 20000 1000
    (retry 3
      (lb-min-conn :lb {:error-hold-time 1000}
        pool))))

Let's model the Bamboo architecture again: a stateless, random routing layer on top, which allocates requests to a pool of 10 faulty min-conns load balancers, all of which route over a single pool of faulty dynos:

(test-node "Random -> 10 faulty lbs -> One pool"
  (let [dynos (dynos pool-size)]
    (lb-random
      (pool 10
        (cable 5
          (faulty-lb dynos))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.9473
Request rate:     671.94366 reqs/s
Response rate:    657.87744 reqs/s
Latency distribution:
Min:              10.0
Median:           947.0
95th %:           1620.0
99th %:           1916.0
Max:              3056.0

Notice that our availability dropped to 95% in the two-layer distributed model. This is a consequence of state isolation: because the individual least-conns routers don't share any state, they can't communicate about which nodes are down. That increases the probability that we'll allocate requests to broken dynos. A load balancer which performed active health checks wouldn't have this problem; but we can work around it by adding a second layer of retries on top of the stateless random routing layer:

(let [dynos (pool pool-size (faulty-dyno))]
  (retry 3
    (lb-random
      (pool 10
        (cable 5
          (faulty-lb dynos))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99952
Request rate:     686.97363 reqs/s
Response rate:    668.2616 reqs/s
Latency distribution:
Min:              30.0
Median:           982.0
95th %:           1639.0
99th %:           1952.010000000002
Max:              2878.0

This doesn't help our latency problem, but it does provide three nines availability! Not bad for a stateless routing layer on top of a 95% available pool. However, we can do better.

[Image: homogenous.jpg]

Isolating the least-conns routers from each other is essential to preserve liveness and availability. On the other hand, it means that they can't share state about how to efficiently allocate requests over the same dynos–so they'll encounter more failures, and queue multiple requests on the same dyno independently. One way to resolve this problem is to ensure that each least-conns router has a complete picture of its backends' state. We isolate the dynos from one another:

[Image: distinct.jpg]

This has real tradeoffs! For one, an imbalance in the random routing topology means that some min-conns routers will have more load than their neighbors–and they can't re-route requests to dynos outside their pool. And since our min-conns routers are CP systems in this architecture, when they fail, an entire block of dynos is unroutable. We have to strike a balance between more dynos per block (efficient least-conns routing) and more min-conn blocks (reduced impact of a router failure).

Let's try 10 blocks of 25 dynos each:

(test-node "Retry -> Random -> 10 faulty lbs -> 10 pools"
  (retry 3
    (lb-random
      (pool 10
        (cable 5
          (faulty-lb
            (pool (/ pool-size 10)
              (faulty-dyno))))))))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  0.99952
Request rate:     681.8213 reqs/s
Response rate:    677.8099 reqs/s
Latency distribution:
Min:              30.0
Median:           104.0
95th %:           335.0
99th %:           491.0
Max:              1043.0

Whoah! We're still 99.9% available, even with a stateless random routing layer on top of ten 95%-available routers. Throughput is slightly down, but our median latency is nine times lower than in the homogenous dyno pool.

[Image: single-distinct.png]

I think system composition is important in distributed design. Every one of these components is complex. It helps to approach each task as an isolated system, and enforce easy-to-understand guarantees about that component's behavior. Then you can compose different systems together to make something bigger and more useful. In these articles, we composed an efficient (but nonscalable) CP system with an inefficient (but scalable) AP system to provide a hybrid of the two.

If you have awareness of your network topology and are designing for singlethreaded, queuing backends, this kind of routing system makes sense. However, it's only going to be efficient if you can situate your dynos close to their least-conns load balancer. One obvious design is to put one load balancer in each rack, and hook it directly to the rack's switch. If blocks are going to fail as a group, you want to keep those blocks within the smallest network area possible. If you're working in EC2, you may not have clear network boundaries to take advantage of, and correlated failures across blocks could be a real problem.

This architecture also doesn't make sense for concurrent servers–and that's a growing fraction of Heroku's hosted applications. I've also ignored the problem of dynamic pools, where dynos are spinning up and exiting pools constantly. Sadly, I'm out of time to work on this project, but perhaps a reader will chime in with a model for distributed routing over concurrent servers–maybe with a nonlinear load model for server latencies?

Thanks for exploring networks with me!

For more on Timelike and routing simulation, check out part 2 of this article: everything fails all the time. There's also more discussion on Reddit.

RapGenius is upset about Heroku's routing infrastructure. RapGenius, like many web sites, uses Rails, and Rails is notoriously difficult to operate in a multithreaded environment. Heroku operates at large scale, and made engineering tradeoffs which gave rise to high latencies–latencies with adverse effects on customers. I'd like to explore why Heroku's Bamboo architecture behaves this way, and help readers reason about their own network infrastructure.

To start off with, here's a Rails server. Since we're going to be discussing complex chains of network software, I'll write it down as an s-expression:

(server :rails)

Let's pretend that server has some constant request-parsing overhead–perhaps 20 milliseconds–and an exponentially-distributed processing time with a mean of 100 milliseconds.

(delay-fixed 20 (delay-exponential 100 (server :rails)))

Heroku runs a Rails application in a virtual machine called a Dyno, on EC2. Since the Rails server can only do one thing at a time, the dyno keeps a queue of HTTP requests, and applies them sequentially to the rails application. We'll talk to the dyno over a 2-millisecond-long network cable.

(defn dyno []
  (cable 2
    (queue-exclusive
      (delay-fixed 20
        (delay-exponential 100
          (server :rails))))))

This node can process an infinite queue of requests at the average rate of 1 every 124 milliseconds (2 + 20 + 100 + 2). But some requests take longer than others. What happens if your request lands behind a different, longer request? How long do you, the user, have to wait?

Introducing Timelike

Surprise! This way of describing network systems is also executable code. Welcome to Timelike.

(cable 2 ...) returns a function which accepts a request, sleeps for 2 milliseconds, then passes the request to a child function–in this case, a queuing function returned by queue-exclusive. Then cable sleeps for 2 more milliseconds to simulate the return trip, and returns the response from queue-exclusive. The request (and response) are just a list of events, each one timestamped. The return value of each function, or “node”, is the entire history of a request as it passes through the pipeline.
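As a rough sketch of that shape–illustrative only, not Timelike's actual source, and ignoring the request-history bookkeeping–a node like cable is just a function wrapping another node, assuming a sleep driven by the virtual clock:

(defn cable [ms child]
  (fn [req]
    ; Simulate the outbound trip down the wire...
    (sleep ms)
    (let [response (child req)]
      ; ...and the return trip home.
      (sleep ms)
      response)))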

Network node composition is function composition–and since they're functions, we can run them.

(let [responses (future*
                  ; In a new thread, generate poisson-distributed
                  ; requests. We want 10,000 total, spaced roughly
                  ; 150 milliseconds apart. Apply them to a single
                  ; dyno.
                  (load-poisson 10000 150 req (dyno)))]
  (prn (first @responses))
  (pstats @responses))

Timelike doesn't actually sleep for 150 milliseconds between requests. The OpenJDK and Oracle schedulers are too unreliable for that–and we don't actually need to wait that long to compute the value of this function. We just virtualize time for every thread in the network (in this case, a thread per request). All operations complete “immediately” according to the virtual clock, and the clock only advances when threads explicitly sleep. We can still exploit parallelism whenever two threads wake up at the same time, and advance the clock whenever there's no more work to be done at a given time. The scheduler will even detect deadlocks and allow the clock to advance when active threads are blocked waiting to acquire a mutex held by a thread which won't release it until the future… though that's a little slow. ;-)

The upside of all this ridiculous lisp is that you can simulate concurrent systems where the results are independent of wall-clock time, which makes it easier to compare parallel systems at different scales. You can simulate one machine or a network of thousands, and the dynamics are the same.

Here's an example request, and some response statistics. We discard the first and last parts of the request logs to avoid measuring the warm-up or cool-down period of the dyno queue.

[{:time 0} {:node :rails, :time 66}]

Total reqs:       10000
Selected reqs:    5000
Successful frac:  1.0
Request rate:     6.6635394 reqs/s
Response rate:    6.653865 reqs/s
Latency distribution:
Min:              22.0
Median:           387.0
95th %:           1728.0
99th %:           2894.1100000000024
Max:              3706.0

Since the request and response rates are close, we know the dyno was stable during this time–it wasn't overloaded or draining its queue. But look at that latency distribution! Our median request took 3 times the mean, and some requests blocked for multiple seconds. Requests which stack up behind each other have to wait, even if they could complete quickly. We need a way to handle more than one request at a time.

How do you do that with a singlethreaded Rails? You run more server processes at once. In Heroku, you add more dynos. Each runs in parallel, so with n dynos you can (optimally) process n requests at a time.

(defn dynos "A pool of n dynos" [n] (pool n (dyno)))

There's those funny macros again.

Now you have a new problem: how do you get requests to the right dynos? Remember, whatever routing system we design needs to be distributed–multiple load balancers have to coordinate about the environment.

Random routing

Random load balancers are simple. When you get a new request, you pick a random dyno and send the request over there. In the infinite limit this is fine; a uniformly even distribution will distribute an infinite number of requests evenly across the cluster. But our systems aren't infinite. A random LB will sometimes send two, or even a hundred requests to the same dyno even when its neighbors go unused. That dyno's queue will back up, and everyone in that queue has to wait for all the requests ahead of them.
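An illustrative sketch of the idea–not Timelike's actual implementation, and assuming the pool behaves as a sequence of node functions:

(defn lb-random [backends]
  (fn [req]
    ; Pick any backend with uniform probability, and hand it the request.
    ((rand-nth backends) req)))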

(lb-random (dynos 250))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  1.0
Request rate:     1039.7172 reqs/s
Response rate:    1012.6787 reqs/s
Latency distribution:
Min:              22.0
Median:           162.0
95th %:           631.0
99th %:           970.0
Max:              1995.0

A cool thing about random LBs is that they require little coordinated state. You don't have to agree with your peers about where to route a request. They also compose freely: a layer of random load balancers over another layer of random load balancers has exactly the same characteristics as a single random load balancer, assuming perfect concurrency. On the other hand, leaving nodes unused while piling up requests on a struggling dyno is silly. We can do better.

Round-Robin routing

Round-robin load balancers write down all their backends in a circular list (also termed a “ring”). The first request goes to the first backend in the ring; the second request to the second backend, and so forth, around and around. This has the advantage of evenly distributing requests, and it's relatively simple to manage the state involved: you only need to know a single number, telling you which element in the list to point to.
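Sketched in the same illustrative style, the only state we need is a counter; an atom makes advancing the ring pointer atomic:

(defn lb-rr [backends]
  (let [i (atom -1)]
    (fn [req]
      ; Advance the ring pointer, then route to that backend.
      ((nth backends (mod (swap! i inc) (count backends))) req))))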

(lb-rr (dynos 250))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  1.0
Request rate:     1043.9939 reqs/s
Response rate:    1029.6116 reqs/s
Latency distribution:
Min:              22.0
Median:           105.0
95th %:           375.0
99th %:           560.0
Max:              1173.0

We halved our 95th percentile latencies, and cut median request time by roughly a third. RR balancers have a drawback though. Most real-world requests–like the one in our model–take a variable amount of time. When that variability is large enough (relative to pool saturation), round robin balancers can put two long-running requests on the same dyno. Queues back up again.

Least-connections routing

A min-conn LB algorithm keeps track of the number of connections which it has opened on each particular backend. When a new connection arrives, you find the backend with the least number of current connections. For singlethreaded servers, this also corresponds to the server with the shortest queue (in terms of request count, not time).
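Here's an illustrative sketch of that bookkeeping (again, not Timelike's source; note the benign race between reading the counts and incrementing one):

(defn lb-min-conn [backends]
  (let [conns (atom (zipmap backends (repeat 0)))]
    (fn [req]
      ; Choose the backend with the fewest open connections.
      (let [backend (key (apply min-key val @conns))]
        (swap! conns update-in [backend] inc)
        (try (backend req)
             (finally
               ; Release the slot once the response arrives.
               (swap! conns update-in [backend] dec)))))))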

(lb-min-conn (dynos 250))

Total reqs:       100000
Selected reqs:    50000
Successful frac:  1.0
Request rate:     1049.7806 reqs/s
Response rate:    1041.1244 reqs/s
Latency distribution:
Min:              22.0
Median:           92.0
95th %:           322.0
99th %:           483.0
Max:              974.0

Our 95th percentile latency has gone from 631 ms, to 375 ms, to 322 ms. This algorithm is significantly more efficient over our simulated dynos than random or round-robin balancing–though it's still not optimal. An optimal algorithm would predict the future and figure out how long a request will take before allocating it–so it could avoid stacking two long-running requests in the same queue.

Least-conns also means keeping track of lots of state: a number for every dyno, at least. All that state has to be shared between the load balancers in a given cluster, which can be expensive. On the other hand, we could afford up to a 200-millisecond delay on each connection, and still be more efficient than a random balancer. That's a fair bit of headroom.

Meanwhile, in the real world

Heroku can't use round-robin or min-conns load balancers for their whole infrastructure–it's just too big a problem to coordinate. Moreover, some of the load balancers are far apart from each other so they can't communicate quickly or reliably. Instead, Heroku uses several independent least-conns load balancers for their Bamboo stack. This has a drawback: with two least-conns routers, you can load the same dyno with requests from both routers at once–which increases the queue depth variability.

Let's hook up a random router to a set of min-conns routers, all backed by the same pool of 250 dynos. We'll separate the random routing layer from the min-conns layer by a 5-millisecond-long network cable.

(defn bamboo-test [n]
  (test-node (str "Bamboo with " n " routers")
    (let [dynos (dynos pool-size)]
      (lb-random
        (pool n
          (cable 5
            (lb-min-conn dynos)))))))

(deftest ^:bamboo bamboo-2  (bamboo-test 2))
(deftest ^:bamboo bamboo-4  (bamboo-test 4))
(deftest ^:bamboo bamboo-8  (bamboo-test 8))
(deftest ^:bamboo bamboo-16 (bamboo-test 16))

This plot sums up, in a nutshell, why RapGenius saw terrible response times. Latencies in this model–especially those killer 95th and 99th percentile times–rise linearly with additional least-conns routers (asymptotically bounded by the performance of a random router). As Heroku's Bamboo cluster grew, so did the variability of dyno queue depths.

[Image: bamboo.png]

This is not the only routing topology available. In part 2, I explore some other options for distributed load balancing. If you want to experiment with Timelike for yourself, check out the github project.

tl;dr Riemann is a monitoring system, so it emphasizes liveness over safety.

Riemann is aimed at high-throughput (millions of events/sec/node), partial-harvest event processing, where it is acceptable to trade completeness for throughput at low latencies. For instance, it's probably fine to drop half of your request latency events on the floor, if you're calculating a lossy histogram with sampling anyway. It's also typically acceptable to have nondeterministic behavior with respect to time windows: if one node's clock is skewed, it's better to process it “soonish” rather than waiting an unbounded amount of time for it to check in.

There is no synchronization or relationship between events. Events are immutable and have a total order, even though a given server or client may only have a fraction of the relevant events for a system. The events are, in a sense, the transaction log–except that the semantics of those transactions depend on the stream configuration.

Riemann is only trivially distributed: clients send events to servers. Servers can act as clients themselves. The protocol provides synchronous acknowledgement of each received event… which could mean “your write is durably stored on disk” or “I threw your write on a queue, good luck have fun”, or any mixture in between, like “I queued your write for use by a windowing stream, I queued it for submission to Librato metrics, and reacted to the failure condition by sending an email which has been acked by the mail system.”

All of these guarantees are present only for a single server. At some point Riemann will need to be available during partitions.

The “Fuck it, no coordination” model, which I have now, allows for degraded harvest and low latencies for data which it's OK to lose some of. A simple strategy is to carpetbomb every Riemann server in the cluster with your events, with a tunable write-replica threshold. Each server might have a slightly different view of the world, depending on where it was partitioned and for how long.

Stronger consistency

Some events (which happen infrequently) need strong coordination. We need to guarantee, for example, that of three Riemann servers responsible for this datacenter, exactly one sends the “hey, the web server's broken” email. These events require bounded guarantees of both liveness: “Someone must send an email in five seconds” and safety: “I don't care who but one of you better do it”.

I'm pretty sure these constraints on side effects essentially violate CAP, in the face of arbitrary partitions. If a node decides “I'll send it”, sends the email, then explodes just before telling the others “I sent it!”, the remaining nodes have no choice but to send a duplicate message.

In the event of these failure modes (like a total partition), duplicates are preferable to doing nothing. Waaay better to page someone twice than to risk not paging them at all.

However, there are some failure modes where I can provide delivered-once guarantees of side effects. For example, up to floor((n-1)/2) node failures, or a partition which leaves a fully-connected quorum. In these circumstances, 2PC or Paxos can give me strong consistency guarantees, and I can detect (in many cases, I think) the failure modes which would result in sacrificing consistency and requiring a duplicate write. A Riemann server can call someone and say,

“Hey, I just paged you, and this is crazy, but I've got split brain, I'll call twice maybe.”

Since events are values, I can serialize and compare them. That means you might actually be able to write, in the streams config, an expression which means “attempt to ensure these events are processed on exactly one host in the cluster.”

(streams
  (where (state "critical")
    ; This is unsynchronized and proceeds on all nodes concurrently
    #(prn "Uh oh, this thing's broken!" %)

    (master
      ; Any events inside master are executed on exactly one node if
      ; quorum is preserved, or maybe multiple hosts if a node fails before
      ; acking.
      (email "aphyr@aphyr.com"))))

…which is most useful when clients can reach a majority of servers (and allows clients to guarantee whether or not their event was accepted.) I can also provide a weaker guarantee along the lines of “Try to prevent all connected peers from sending this event within this time window,” which is useful for scenarios where you want to know about errors which occurred in minority partitions and it's likely that clients will be partitioned with their servers; e.g. one Riemann per agg switch or DC.

This doesn't guarantee all nodes have the same picture of the world which led up to that failure. I think doing that would require full coordination between all nodes about the event stream (and its ordering), which would impose nontrivial synchronization costs. Explicit causal consistency could improve this, but we'd need a way to express and compute those causal relationships between arbitrary stream functions somehow.

Realistically, this may not be a problem. When Riemann sees a quorum loss it can wake someone up, and when the partition is resolved nodes will converge rapidly on “hey, that service still isn't checking in.”

A third path

What I don't know yet is whether there's a role for events which don't need the insane overhead of 2PC or Paxos for every… single… event… but do need some kind of distributed consistency. HAT (Highly Available Transactions) is interesting because it provides reasonably strong consistency guarantees for an AP system, but at the cost of liveness. Is that liveness tradeoff suitable for Riemann, where responding Right Now is critical? Probably not. But it might be useful for historical stores, or expressing distributed multi-event transactions–which currently don't exist. I don't even know what this would mean in an event-oriented context.

Why? Riemann's event model treats events as values. Well-behaved clients provide a total order and identity over events based on their host, service, and timestamps. This means reconstructing any linear subset of the event stream can be done in an eventually consistent way. If Riemann were to become a historical store, reconciling divergent histories would simply be the set union of all received events.

Except for derived events. What happens when a partition separates two Riemann servers measuring request throughput? Each receives half of the events it used to, and their rate streams start emitting events with a metric half as big as they used to. If both Riemann servers are logging these events to a historical store, the store will show only half the throughput it used to.

One option is to log only raw events and reconstruct derived events by replaying the merged event log. What was the rate at noon? Apply all the events from 11:55 to 12:00 to the rate stream and see.
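To make that concrete, here's a minimal sketch. It assumes events are maps with a :time field, and rate-stream and initial-rate are hypothetical stand-ins for whatever fold actually computes the derived metric–none of this is real Riemann API:

(require '[clojure.set :as set])

;; Hypothetical: the fold which computes the derived metric, and its
;; starting state. Not part of Riemann's actual API.
(declare rate-stream initial-rate)

(defn rate-at
  "Reconstructs a derived rate at time t by merging several divergent event
  logs (set union works because events are values) and replaying the window
  preceding t."
  [logs t window]
  (->> (reduce set/union #{} logs)         ; merge divergent histories
       (filter #(<= (- t window) (:time %) t))
       (sort-by :time)
       (reduce rate-stream initial-rate)))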

Another option might be for rate streams themselves to be transactional in nature, but I'm not sure how to do that in a way which preserves liveness guarantees.

I've been doing a lot of performance tuning in Riemann recently, especially in the clients–but I'd like to share a particularly spectacular improvement from yesterday.

The Riemann protocol

Riemann's TCP protocol is really simple. Send a Msg to the server, receive a response Msg. Messages might include some new events for the server, or a query; and a response might include a boolean acknowledgement or a list of events matching the query. The protocol is ordered; messages on a connection are processed in-order and responses sent in-order. Each Message is serialized using Protocol Buffers. To figure out how large each message is, you read a four-byte length header, then read length bytes, and parse that as a Msg.

time --->
send: [length1][msg1] [length2][msg2]
recv:      [length1][msg1] [length2][msg2]
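For concreteness, here's a blocking-IO sketch of the read side of that framing, assuming a DataInputStream wrapped around the socket and Riemann's generated Proto$Msg protobuf class; the real server does this inside Netty's pipeline rather than with blocking reads:

(import '[java.io DataInputStream]
        '[com.aphyr.riemann Proto$Msg])

(defn read-msg
  "Reads one framed message: a four-byte big-endian length header, then
  exactly that many bytes, parsed as a protobuf Msg."
  [^DataInputStream in]
  (let [length (.readInt in)          ; 4-byte length header
        buf    (byte-array length)]
    (.readFully in buf)               ; block until the whole Msg arrives
    (Proto$Msg/parseFrom buf)))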

The optimization I discussed last time–pipelining requests–allows a client to send multiple messages before receiving their acknowledgements. There are many queues in between a client saying “send a message” and that message actually being parsed in Riemann: Java IO buffers, the kernel TCP stack, the network card, various pieces of networking hardware, the wires themselves… all act like queues. This means throughput is often limited by latency, so by writing messages asynchronously we can achieve higher throughput with only minor latency costs.

The other optimization I've been working on is batching. For various reasons, this kind of protocol performs better when messages are larger. If you can pack 100 events into a message, the server can buffer and parse it in one go, resulting in much higher throughputs at the cost of significantly higher latencies–especially if your event needs to sit in a buffer for a while, waiting for other events to show up so they can be sent in a Msg.

Netty's threadpools

For any given connection, Netty (as used in Riemann) has two threadpools handling incoming bytes: the IO worker pool, and a handler pool which actually handles Riemann events. The IO worker pool is busy shuttling bytes back and forth from the TCP connection buffers through the pipeline–but if an IO worker spends too much time on a single channel, it won't be able to handle other channels and latencies will rise. An ExecutionHandler takes over at some point in the pipeline, which uses the handler pool to do long-running work like handling a Msg.

Earlier versions of Riemann put the ExecutionHandler very close to the end of the pipeline, because all the early operations in the pipeline are really fast. The common advice goes, “Wrap long-running tasks in an execution handler, so they don't block”. OK, makes sense.

(channel-pipeline-factory
  int32-frame-decoder          (int32-frame-decoder)  ; Read off 32-bit length headers
  ^:shared int32-frame-encoder (int32-frame-encoder)  ; Add length header on the way out
  ^:shared protobuf-decoder    (protobuf-decoder)     ; Decode bytes to a Msg
  ^:shared protobuf-encoder    (protobuf-encoder)     ; Encode a Msg to bytes
  ^:shared msg-decoder         (msg-decoder)          ; Convert Msg to a record
  ^:shared msg-encoder         (msg-encoder)          ; Convert a record to a Msg
  ^:shared executor            (execution-handler)    ; Switch to handler threadpool
  ^:shared handler             (gen-tcp-handler       ; Actually process the Msg
                                 core channel-group tcp-handler))

Now… a motivated or prescient reader might ask, “How, exactly, does the execution handler get data from an IO thread over to a handler thread?”

It puts it on a queue. Like every good queue it's bounded–but not by number of items, since some items could be way bigger than others. It's bounded by memory.

(defn execution-handler
  "Creates a new netty execution handler."
  []
  (ExecutionHandler.
    (OrderedMemoryAwareThreadPoolExecutor.
      16        ; Core pool size
      1048576   ; 1MB per channel queued
      10485760  ; 10MB total queued
      )))

How does the Executor know how much memory is in a given item? It uses a DefaultObjectSizeEstimator, which knows all about Bytes and Channels and Buffers… but absolutely nothing about the decoded Protobuf objects which it's being asked to enqueue. So the estimator goes and digs into the item's fields using reflection:

int answer = 8; // Basic overhead.
for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
    Field[] fields = c.getDeclaredFields();
    for (Field f : fields) {
        if ((f.getModifiers() & Modifier.STATIC) != 0) {
            // Ignore static fields.
            continue;
        }
        answer += estimateSize(f.getType(), visitedClasses);

Of course, I didn't know this at the time. Netty is pretty big, and despite extensive documentation it's not necessarily clear that an OrderedMemoryAwareThreadPoolExecutor is going to try and guess how much memory is in a given object, recursively.

So I'm staring at Yourkit, completely ignorant of everything I've just explained, and wondering why the devil DefaultObjectSizeEstimator is taking 38% of Riemann's CPU time. It takes me ~15 hours of digging through Javadoc and source and blogs and StackOverflow to realize that all I have to do is…

  1. Build my own ObjectSizeEstimator, or
  2. Enqueue things I already know the size of.
(channel-pipeline-factory
  int32-frame-decoder          (int32-frame-decoder)
  ^:shared int32-frame-encoder (int32-frame-encoder)
  ^:shared executor            (execution-handler)  ; <--+
  ^:shared protobuf-decoder    (protobuf-decoder)   ;    |
  ^:shared protobuf-encoder    (protobuf-encoder)   ;    |
  ^:shared msg-decoder         (msg-decoder)        ;    |
  ^:shared msg-encoder         (msg-encoder)        ; ___|
  ^:shared handler             (gen-tcp-handler
                                 core channel-group tcp-handler))

Just move one line. Now I enqueue buffers with known sizes, instead of complex Protobuf objects. DefaultObjectSizeEstimator runs in constant time. Throughput doubles. Minimum latency drops by a factor of two.
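For the record, option one might have looked something like this sketch, using Netty's ObjectSizeEstimator extension point–protobuf messages already know their own serialized size, so reflection is unnecessary for them. I never shipped this, so treat it as illustrative rather than tested:

(import '[org.jboss.netty.util DefaultObjectSizeEstimator]
        '[com.aphyr.riemann Proto$Msg])

(defn msg-aware-size-estimator
  "A size estimator which asks protobuf Msgs for their serialized size, and
  falls back to Netty's reflective estimate for everything else."
  []
  (proxy [DefaultObjectSizeEstimator] []
    (estimateSize [o]
      (if (instance? Proto$Msg o)
        (.getSerializedSize ^Proto$Msg o)
        (proxy-super estimateSize o)))))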

[Figure: drop tcp event batch throughput.png – throughput before and after moving the execution handler]

[Figure: drop tcp event batch latency.png – latency before and after moving the execution handler]

Throughput here is measured in messages, each containing 100 events, so master is processing 200,000–215,000 events/sec. Latency is for synchronous calls to client.sendEvents(anEvent). The dropoff at the tail end of the time series is the pipelining client draining its message queue. Client and server are running on the same quad-core Q8300, pushing about 20 megabytes/sec of traffic over loopback. Here's what the riemann-bench session looks like, if you're curious.

Why didn't you figure this out sooner?

I wrote most of this code, and what code I didn't write, I reviewed and tested. Why did it take me so long to figure out what was going on?

When I started working on this problem, the code looked nothing like the pipeline I showed you earlier.

The Netty pipeline evolved piecemeal, by trial-and-error, and went through several refactorings. The UDP server, TCP server, and Graphite server share much of the same code, but do very different things. I made several changes to improve performance. In making these changes I tried to minimize API disruption–to keep function interfaces the same–which gradually pulled the pipeline into several interacting pieces. Since Netty's API is well-written, flexible Java code, it comes with literally hundreds of names to keep track of. Keeping function and variable names distinct became a challenge.

By the time I started digging into the problem, I was hard pressed to figure out what a channel pipeline factory was, let alone how it was constructed.

In order to solve the bug I had to understand the code, which meant inventing a new language to talk about pipelines. Once I'd expressed the pipeline clearly, it was obvious how the pieces interacted. Experimenting with new pipelines took a half hour, and I was able to almost double throughput with a single-line change.

I've had two observations floating around in my head, looking for a way to connect with each other.

Many “architecture patterns” are scar tissue around the absence of higher-level language features.

and a criterion for choosing languages and designing APIs

Write down the simplest syntactically valid expression of what you want to do. That expression should be a program.

First, let me clarify that there are all sorts of wonderful patterns in software–things like “functions”, “iteration”, “monads”, “concurrent execution”, “laziness”, “memoization”, and “parametric polymorphism”. Sometimes, though, we write the same combination of symbols over and over again, in a nontrivial way. Maybe it takes ten or twenty lines to encapsulate an idea, and you have to type those lines every time you want to use the idea, because the language cannot express it directly. It's not that the underlying concept is wrong–it's that the expression of it in a particular domain is unwieldy, and has taken on a life of its own. Things like Builders and, in this post, Factories.

Every language emphasizes some of these ideas. Erlang, for instance, emphasizes concurrency, and makes it easy to write concurrent code by introducing special syntax for actors and sending messages. Ruby considers lexical closures important, and so it has special syntax for writing blocks concisely. However, languages must balance the expressiveness of special syntax against the complexity of managing that syntax. Scala, for instance, includes special syntactic rules for a broad variety of constructs (XML literals, lexical closures, keyword arguments, implicit scope, variable declaration, types)—and often several syntaxes for the same construct (method invocation, arguments, code blocks). When there are many syntax rules, understanding how those rules interact with each other can be difficult.

I argue that defining new syntax should be a language feature: one of Lisp's strengths is that its syntax is both highly regular and semantically fluid. Variable definition, iteration, concurrency, and even evaluation rules themselves can be defined as libraries—in a controlled, predictable way. In this article, I'd like to give some pragmatic examples of why I think this way.

Netty

There's a Java library called Netty, which helps you write network servers. In Netty each connection is called a channel, and bytes which come from the network flow through a pipeline of handlers. Each handler transforms incoming messages in some way, and typically forwards a different kind of message to the next handler down the pipeline.

Now, some handlers are safe to re-use across different channels–perhaps because they don't store any mutable state. For instance, it's OK to use a ProtobufDecoder to decode several Protocol Buffer messages at the same time. It's not safe, however, to use a LengthFieldBasedFrameDecoder to decode two channels at once, because this kind of decoder reads a length header, then saves that state and uses it to figure out how many more bytes it needs to accept from that channel. We need a new LengthFieldBasedFrameDecoder every time we accept a new connection.

In languages which have first-class functions, the easiest way to get a new, say, Pipeline is to write down a function which makes a new Pipeline, and then call it whenever you need one. Here's one for Riemann.

(fn []
  (doto (Channels/pipeline)
    (.addLast "integer-header-decoder"
              (LengthFieldBasedFrameDecoder. Integer/MAX_VALUE 0 4 0 4))
    (.addLast "protobuf-decoder"
              (ProtobufDecoder. (Proto$Msg/getDefaultInstance)))))

Doto is an example of redefinable syntax. It's a macro—a function which rewrites code at compile time. Doto transforms code like (doto obj (function1 arg1) (function2)) into (let [x obj] (function1 x arg1) (function2 x) x), where x is a unique variable which will not conflict with the surrounding scope. In short, it simplifies a common pattern: performing a series of operations on the same object, but eliminates the need to explicitly name the object with a variable, or to write the variable in each expression.
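You can watch this happen at the REPL; the expansion looks roughly like this (the G__42 gensym will vary):

(macroexpand-1 '(doto (java.util.HashMap.)
                  (.put :a 1)
                  (.put :b 2)))
; => (let* [G__42 (java.util.HashMap.)]
;      (.put G__42 :a 1)
;      (.put G__42 :b 2)
;      G__42)

Back to our pipeline function.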

Every time you call this function, it creates a new pipeline (with Channels.pipeline()), and adds a new LengthFieldBasedFrameDecoder to it, then adds a new protobuf decoder to it, then returns the pipeline.

Java doesn't have first-class functions. It has something called Callable, which is a generic interface for zero-arity functions, but since there are no arguments you're stuck writing a new class and explicitly closing over the variables you need every time you want a function. Java works around these gaps by creating a new class for every function it might need, and giving that class a single method. These classes are called “Factories”. Netty has a factory specifically for generating pipelines, so to build new Pipelines, you have to write a new class.

public class RiemannTcpChannelPipelineFactory implements ChannelPipelineFactory {
  public ChannelPipeline getPipeline() throws Exception {
    ChannelPipeline p = Channels.pipeline();
    p.addLast("integer-header-decoder",
              new LengthFieldBasedFrameDecoder(Integer.MAX_VALUE, 0, 4, 0, 4));
    p.addLast("protobuf-decoder",
              new ProtobufDecoder(Proto.Msg.getDefaultInstance()));
    return p;
  }
}

new RiemannTcpChannelPipelineFactory()

The class (and the interface it implements) are basically irrelevant–this class only has one method, and its type is inferrable. This is a first-class function, in Java. We can shorten it a bit by writing an anonymous class:

new ChannelPipelineFactory() {
  public ChannelPipeline getPipeline() throws Exception {
    ...

… which saves us from having to name our factory, but we still have to talk about ChannelPipelineFactory, remember its method signature and constructor, etc–and the implementer still needs to write a class or interface.

Since Netty expects a ChannelPipelineFactory, we can't just feed it a Clojure function. Instead, we can use (reify) to create a new instance of a dynamically compiled class which implements any number of interfaces, and has final local variables closed over from the local environment. So if we wanted to reuse the same protobuf decoder in every pipeline…

(let [pb (ProtobufDecoder. (Proto$Msg/getDefaultInstance))]
  (reify ChannelPipelineFactory
    (getPipeline [this]
      (doto (Channels/pipeline)
        (.addLast "integer-header-decoder"
                  (LengthFieldBasedFrameDecoder. Integer/MAX_VALUE 0 4 0 4))
        (.addLast "protobuf-decoder" pb)))))

In Java, you'd create a new class variable, like so. Note that if you wanted to change pb you'd have to write some plumbing functions–getters, setters, constructors, or whatever, or use an anonymous class and close over a reference object.

public class RiemannTcpChannelPipelineFactory {
  final ProtobufDecoder pb =
    new ProtobufDecoder(Proto.Msg.getDefaultInstance());
  ...

Now… these two create basically identical objects. Same logical flow. But notice what's missing in the Clojure code.

There's no name for the factory. We don't need one because it's a meaningless object–its sole purpose is to act like a partially applied function. It disappears into the bowels of Netty and we never think of it again. This is an entire object we didn't have to think up a name for, ensure that its name and constructor are consistent with the rest of the codebase, create a new file to put it in, and add that file to source control. The architecture pattern of “Factory”, and its associated single-serving packets of one verb each, has disappeared.

(let [adder (partial + 1 2)]
  (adder 3 4)) ; => 1 + 2 + 3 + 4 = 10

public class AdderFactory {
  public final int addend1;
  public final int addend2;
  ...
  public AdderFactory(final int addend1) {
    this.addend1 = addend1;
  }
  public AdderFactory(final int addend1, final int addend2) {
    this.addend1 = addend1;
    this.addend2 = addend2;
  }
  ...
  public int add(final int anotherAddend1, final int anotherAddend2) {
    return addend1 + addend2 + anotherAddend1 + anotherAddend2;
  }
}

AdderFactory adder = new AdderFactory(1, 2);
adder.add(3, 4);

Factories are just awkward ways to express partially applied functions.

Back to Netty.

So far we've talked about a single ChannelPipelineFactory. What happens if you want to make more than one? Riemann has at least three–and I don't want to write down three classes for three almost-identical pipelines. I just want to write down their names, and the handlers themselves, and have a function take care of the rest of the plumbing.

Enter our sinister friend, the macro, stage left:

(defmacro channel-pipeline-factory
  "Constructs an instance of a Netty ChannelPipelineFactory from a list of
  names and expressions which evaluate to handlers. Names with metadata
  :shared are evaluated once and re-used in every invocation of
  getPipeline(), other handlers will be evaluated each time.

  (channel-pipeline-factory
    frame-decoder (make-an-int32-frame-decoder)
    ^:shared protobuf-decoder (ProtobufDecoder.
                                (Proto$Msg/getDefaultInstance))
    ^:shared msg-decoder msg-decoder)"
  [& names-and-exprs]
  (assert (even? (count names-and-exprs)))
  (let [handlers (partition 2 names-and-exprs)
        shared   (filter (comp :shared meta first) handlers)
        forms    (map (fn [[h-name h-expr]]
                        `(.addLast ~(str h-name)
                                   ~(if (:shared (meta h-name))
                                      h-name
                                      h-expr)))
                      handlers)]
    `(let [~@(apply concat shared)]
       (reify ChannelPipelineFactory
         (getPipeline [this]
           (doto (org.jboss.netty.channel.Channels/pipeline)
             ~@forms))))))

What the hell is this thing?

Well first, it's a macro. That means it's Clojure code which runs at compile time. It's going to receive Clojure source code as its arguments, and return other code to replace itself. Since Clojure is homoiconic, its source code looks like the data structure that it is. We can use the same language to manipulate data and code. Macros define new syntax.

First comes the docstring. If we say (doc channel-pipeline-factory) at a REPL, it'll show us the documentation written here, including an example of how to use the function. ^:shared foo is metadata–the symbol foo will have a special key called :shared set on its metadata map. We use that to discriminate between handlers that can be shared safely, and those which can't.

[& names-and-exprs]

These are the arguments: a list like [name1 handler1 name2 handler2].

(assert (even? (count names-and-exprs)))

This check runs at compile time, and verifies that we passed an even number of arguments to the function. This is a simple way to validate the new syntax we're inventing.

(let [handlers (partition 2 names-and-exprs)
      shared   (filter (comp :shared meta first) handlers)

Now we assign a new variable: handlers. (partition 2 names-and-exprs) splits the list of handlers into [name handler] pairs, to make them easier to work with. Then we find all the handlers which are sharable between pipelines. (comp :shared meta first) composes three functions into one: take the first part of the handler pair (the name), get its metadata, and tell us whether it's :shared.

(let [handlers (partition 2 names-and-exprs)
      shared   (filter (comp :shared meta first) handlers)
      forms    (map (fn [[h-name h-expr]]
                      `(.addLast ~(str h-name)
                                 ~(if (:shared (meta h-name))
                                    h-name
                                    h-expr)))
                    handlers)]

Now we turn pairs like [pb-decoder (ProtobufDecoder...)] into code like (.addLast "pb-decoder" pb-decoder) if the handler is shared, and (.addLast "pb-decoder" (ProtobufDecoder...)) otherwise. Where does the variable pb-decoder come from?

`(let [~@(apply concat shared)]

Ah, there it is. We take all the shared name/handler pairs and bind their names to their values as local variables. But wait–what's that backtick just before let? That's a special symbol for writing macros, and it means “Don't run this code–just construct it”. ~@ means “evaluate this expression now, and splice the resulting sequence into place”. So the first part of the code we return will be the (let) expression binding shared names to handlers.
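A tiny example of those two operators working together:

(let [xs [1 2 3]]
  `(max ~@xs))
; => (clojure.core/max 1 2 3)

With that in hand, back to the code the macro returns: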

(reify ChannelPipelineFactory
  (getPipeline [this]
    (doto (org.jboss.netty.channel.Channels/pipeline)
      ~@forms))))))

And there's the pipeline factory itself. We construct a new pipeline, and… insert new code–the forms we generated before.

Macros give us control of syntax, and allow us to solve problems at compilation time. You don't have access to the values behind the code, but you can manipulate the symbols of the code itself absent meaning. Syntax without semantics. At compile time, Clojure invokes our macro and generates this bulky code we had before…

(let [protobuf-decoder (ProtobufDecoder. (Proto$Msg/getDefaultInstance))]
  (reify ChannelPipelineFactory
    (getPipeline [this]
      (doto (Channels/pipeline)
        (.addLast "integer-header-decoder"
                  (LengthFieldBasedFrameDecoder. Integer/MAX_VALUE 0 4 0 4))
        (.addLast "protobuf-decoder" protobuf-decoder)))))

… from a much simpler expression:

(channel-pipeline-factory
  integer-header-decoder    (LengthFieldBasedFrameDecoder.
                              Integer/MAX_VALUE 0 4 0 4)
  ^:shared protobuf-decoder (ProtobufDecoder.
                              (Proto$Msg/getDefaultInstance)))

Notice what's missing. We don't need to think about the pipeline class, or the name of its method. We don't have to name and manipulate variables. .addLast disappeared entirely. The protobuf handler is reused, and the length decoder is created anew every time–but they're expressed exactly the same way. We've fundamentally altered the syntax of the language–its execution order–in a controlled way. This expression is symmetric, compact, reusable, and efficient.

We've reduced the problem to a simple, minimal expression–and made that into code.

Tradeoffs

I didn't start out with this macro. Originally, Riemann used plain functions to compose pipelines. As the pipelines evolved and split into related variants, the code did too. When it came time to debug performance problems, I had a difficult time understanding what the pipelines actually looked like—composing a pipeline involved three to four layers of indirect functions across three namespaces. In order to understand the problem—and develop a solution—I needed a clear way to express pipelines themselves.

(channel-pipeline-factory
  int32-frame-decoder          (int32-frame-decoder)
  ^:shared int32-frame-encoder (int32-frame-encoder)
  ^:shared executor            shared-execution-handler
  ^:shared protobuf-decoder    (protobuf-decoder)
  ^:shared protobuf-encoder    (protobuf-encoder)
  ^:shared msg-decoder         (msg-decoder)
  ^:shared msg-encoder         (msg-encoder)
  ^:shared handler             (gen-tcp-handler
                                 core channel-group tcp-handler))

In this code, the relationships between handlers are easy to understand, and making changes is simple. However, this isn't the only way to express the problem. We could provide exactly the same semantics with a plain old function taking other functions. Note that #(foo bar) is Clojure shorthand for (fn [] (foo bar)).

(channel-pipeline-factory
  :unshared :int32-frame-decoder #(int32-frame-decoder)
  :shared   :int32-frame-encoder (int32-frame-encoder)
  :shared   :executor            shared-execution-handler
  :shared   :protobuf-decoder    (protobuf-decoder)
  :shared   :protobuf-encoder    (protobuf-encoder)
  :shared   :msg-decoder         (msg-decoder)
  :shared   :msg-encoder         (msg-encoder)
  :shared   :handler             (gen-tcp-handler
                                   core channel-group tcp-handler))

In this code we've replaced bare symbols for handler names with :keywords, since symbols in normal code are resolved in the current scope. Keywords can't take metadata, so we've introduced a :shared keyword to indicate that a handler is sharable. Non-shared handlers, like int32-frame-decoder, are written as functions which are invoked every time we generate a new pipeline. And to parse the list into distinct handlers, we could either wrap each handler in a list or vector, or (as shown here) introduce a mandatory :unshared keyword so that every handler has three parts.

This is still a clean way to express a pipeline factory—and it has distinct tradeoffs. First, the macro runs at compile time. That means you can do an expensive operation once at compile time, and generate code which is quick to execute at runtime. The naive function version, by contrast, has to iterate over the handler forms every time it's invoked, identify whether it's shared or unshared, and may invoke additional functions to generate unshared handlers. If this code is performance-critical, the iteration and function invocation may not be in a form the JIT can efficiently optimize.

Macros can simplify expressing the same terms over and over again, and many library authors use them to provide domain-specific languages. For example, Riemann has a compact query syntax built on macros, which cuts out much of the boilerplate required in filtering events with functions. This expressiveness comes at a cost; macros can make it hard to reason about when code is evaluated, and break the substitution rule that a variable is equivalent to its value. This means that macros are typically more suitable for end users than for library code—and you should typically provide function equivalents to macro expressions where possible.

As a consequence of violating the substitution rule (and evaluation order in general), macros sacrifice runtime composition. Since macros operate on expressions, and not the runtime-evaluated value of those expressions, they're difficult to use whenever you want to bind a form to a variable, or pass a value at runtime. For instance, (map future ['(+ 1 2) (+ 3 4)]) will throw a CompilerException, informing you that the compiler can't take the value of a macro. This gives rise to macro contagion: anywhere you want to invoke a macro without literal code, the calling expression must also be a macro. The power afforded by the macro system comes with a real cost: we can no longer enjoy the freedom of dynamic evaluation.
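The standard escape hatch is to pass functions instead of bare expressions. For futures, Clojure happens to ship the ordinary function future-call alongside the macro:

; Pass thunks (zero-arg functions), not expressions:
(map future-call [(fn [] (+ 1 2))
                  (fn [] (+ 3 4))])
; => a sequence of futures, which deref to 3 and 7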

In Riemann's particular case, the performance characteristics of the (channel-pipeline-factory) macro outweigh the reusability costs—but I don't recommend making this choice lightly. Wherever possible, use a function.

Further examples

In general, any control flow can be expressed as a function which takes stateful first-class functions as arguments. Javascript, for instance, uses explicit callback functions to express futures:

var a = 1;
var f = future(function() { return a + 2; });
f.await(); // returns 3

And equivalently, in Clojure one might write:

(let [a 1
      f (future-call (fn [] (+ a 2)))] ; Or alternatively, #(+ a 2)
  (deref f)) ; returns 3

But we can erase the need for an anonymous function entirely by using a macro—like the one built in to Clojure for futures:

(let [a 1
      f (future (+ a 2))]
  (deref f)) ; returns 3

The Clojure standard library uses macros extensively for control flow. Short-circuiting (and) and (or) are macros, as are the more complex conditionals (cond) and (condp). Java's special syntax for synchronized { … } is written as the (locking) macro—and the concurrency expressions (dosync) for STM transactions, (future) for futures, (delay) for laziness, and (lazy-seq) for sequence construction are macros as well. You can write your own try/catch by using the macro system, as Slingshot does to great effect. In short, language features which would be a part of the compiler in other languages can be written and used by anyone.
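Writing such a feature yourself takes only a couple of lines. As a sketch, here's a minimal inverted conditional:

(defmacro unless
  "Like if, but inverted: evaluates then-branch when test is falsey."
  [test then-branch & [else-branch]]
  `(if ~test ~else-branch ~then-branch))

(unless (zero? 2) :nonzero :zero) ; => :nonzero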

Summary

Macros are a powerful tool to express complex ideas in very little code; and where used judiciously, help us reason about difficult problems in a clear way. But—just as language designers do—we must balance the expressiveness of new syntax with the complexity of its interactions. In general, I recommend you:

  • Write simple macros which are as easy to reason about as possible.
  • Use macros to express purely syntactic transformations, like control flow.
  • Choose a macro to simplify writing efficient, but awkward, code which the runtime cannot optimize for you.
  • In most other cases, prefer normal functions.

I've been putting more work into riemann-java-client recently, since it's definitely the bottleneck in performance testing Riemann itself. The existing RiemannTcpClient and RiemannRetryingTcpClient were threadsafe, but almost fully mutexed; using one essentially serialized all threads behind the client itself. For write-heavy workloads, I wanted to do better.

There are two logical optimizations I can make, in addition to choosing careful data structures, mucking with socket options, etc. The first is to bundle multiple events into a single Message, which the API supports. However, your code may not be structured in a way to efficiently bundle events, so where higher latencies are OK, the client can maintain a buffer of outbound events and flush it regularly.
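As a sketch of what that buffering might look like (in Clojure for brevity, though the client itself is Java; send-msg and events->msg are hypothetical stand-ins, not the actual client API), a bounded queue plus a dedicated flusher thread keeps batching off the caller's critical path:

(import '[java.util.concurrent LinkedBlockingQueue])

;; Hypothetical helpers: write one Msg to the wire, and pack a collection
;; of events into a protobuf Msg.
(declare send-msg events->msg)

(def pending (LinkedBlockingQueue. 10000)) ; bounded: callers block when full

(defn enqueue-event!
  "Queues an event for batched delivery."
  [event]
  (.put pending event))

(defn flusher
  "Loops forever, draining up to 100 events at a time and sending each
  batch as a single Msg."
  [client]
  (loop []
    (let [batch (java.util.ArrayList.)]
      (.add batch (.take pending))   ; wait for at least one event
      (.drainTo pending batch 99)    ; then grab whatever else is ready
      (send-msg client (events->msg batch)))
    (recur)))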

The second optimization is to take advantage of request pipelining. Riemann's protocol is simple and synchronous: you send a Message over a TCP connection, and receive exactly one TCP message in response. The existing clients, however, forced you to wait n milliseconds for the message to cross the network, be processed by Riemann, and receive an acknowledgement. We can do better by pipelining requests: sending new requests before waiting for the previous responses, and matching up received messages with their corresponding requests later.

ThreadedClient does exactly that. All threads enqueue Messages into a lockfree queue, and receive Promise objects to be fulfilled when their response is available. The standard synchronous API is still available, and allows N threads to pipeline their requests together. Meanwhile, a writer thread sucks messages out of the write queue and sends them to Riemann, enqueuing written messages onto an in-flight queue. A reader thread pulls responses out of the socket and matches them to enqueued messages. Bounded queues provide backpressure, which limits the number of requests that can be in-flight at any time. This allows for reasonable bounds on event loss in the event of failure.
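The core of that design fits in a few lines. Here's a simplified Clojure sketch–not the real client; send-msg and read-msg are hypothetical socket helpers, and the bound on the in-flight queue is what provides backpressure:

(import '[java.util.concurrent LinkedBlockingQueue])

;; Hypothetical helpers: write one Msg to the socket, read one Msg back.
(declare send-msg read-msg)

(def in-flight (LinkedBlockingQueue. 1024))

(defn send-async
  "Writes msg to the connection and returns a promise of its response."
  [conn msg]
  (let [p (promise)]
    (locking conn            ; keep enqueue + write atomic so ordering holds
      (.put in-flight p)     ; blocks when too many requests are outstanding
      (send-msg conn msg))
    p))

(defn reader-loop
  "The protocol is ordered, so each response belongs to the oldest
  outstanding request--no request IDs needed."
  [conn]
  (loop []
    (deliver (.take in-flight) (read-msg conn))
    (recur)))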

Here's what the naive client (wait for round-trip requests) looks like on loopback:

[Figure: throughput-tcp.png – throughput of the naive synchronous client on loopback]

And here's the same test with a RiemannThreadedClient:

[Figure: throughput-threaded.png – throughput of RiemannThreadedClient on loopback]

I've done no tuning or optimization to this algorithm, and error handling is rough at best. It should perform best across real-world networks where latency is nontrivial. Even on loopback, though, I'm seeing roughly double the throughput at the cost of roughly double per-event latency.

Schadenfreude is a benchmarking tool I'm using to improve Riemann. Here's a profile generated by the new riemann-bench, comparing a few recent releases in their single-threaded TCP server throughput. These results are dominated by loopback read latency–maxing out at about 8-9 kiloevents/sec. I'll be using schadenfreude to improve client performance in high-volume and multicore scenarios.

[Figure: throughput.png – single-threaded TCP server throughput across recent Riemann releases]

I needed a tool to evaluate internal and network benchmarks of Riemann, to ask questions like

  • Is parser function A or B more efficient?
  • How many threads should I allocate to the worker threadpool?
  • How did commit 2556 impact the latency distribution?

In dealing with “realtime” systems it's often a lot more important to understand the latency distribution rather than a single throughput figure, and for GC reasons you often want to see a time dependence. Basho Bench does this well, but it's in Erlang which rules out microbenchmarking of Riemann functions (e.g. at the repl). So I've hacked together this little thing I'm calling Schadenfreude (from German; “happiness at the misfortune of others”). Sums up how I feel about benchmarks in general.

; A run is a benchmark specification. :f is the function we're going to
; measure--in this case, counting using
;
; 1. an atomic reference
; 2. unordered (commute) transactions
; 3. ordered (alter) transactions.
;
; :before and :after are callbacks to set up and tear down for the test run.
(let [runs [(let [a (atom 0)]
              {:name   "atoms"
               :before #(reset! a 0)
               :f      #(swap! a inc)})
            (let [r (ref 0)]
              {:name   "commute"
               :before #(dosync (ref-set r 0))
               :f      #(dosync (commute r inc))})
            (let [r (ref 0)]
              {:name   "alter"
               :before #(dosync (ref-set r 0))
               :f      #(dosync (alter r inc))})]

      ; For these benchmarks, we'll prime the JVM by doing the test twice and
      ; discarding the first one's results. We'll run each benchmark 10K times.
      runs (map #(merge % {:prime true :n 10000}) runs)

      ; And we'll try each one with 1 and 2 threads
      runs (mapcat (fn [run]
                     (map (fn [threads]
                            (merge run {:threads threads
                                        :name    (str (:name run) " " threads)}))
                          [1 2]))
                   runs)

      ; Actually run the function and collect data
      runs (map record runs)

      ; And plot the results together
      plot (latency-plot runs)]

  ; For this one we'll use a log plot.
  (.setRangeAxis (.getPlot plot)
                 (org.jfree.chart.axis.LogarithmicAxis. "Latency (s)"))
  (view plot))

[Figure: latency.png – latency distributions for atoms, commute, and alter at 1 and 2 threads, log scale]

When I have something usable outside a REPL I'll publish it to clojars and github. Right now I think the time alignment looks pretty dodgy so I'd like to normalize it correctly, and figure out what exactly “throughput” means. Oh, and the actual timing code is completely naive: no OS cache drop, no forced GC/finalizers, etc. I'm gonna look into tapping Criterium's code for that.

In response to Results of the 2012 State of Clojure Survey:

The idea of having a primary language honestly comes off to me as a sign that the developer hasn’t spent much time programming yet: the real world has so many languages in it, and many times the practical choice is constrained by that of the platform or existing code to interoperate with.

I've been writing code for ~18 years, ~10 professionally. I've programmed in (chronological order here) Modula-2, C, Basic, the HTML constellation, Perl, XSLT, Ruby, PHP, Java, Mathematica, Prolog, C++, Python, ML, Erlang, Haskell, Clojure, and Scala. I can state unambiguously that Clojure is my primary language: it is the most powerful, the most fun, and has the fewest tradeoffs.

Like Haskell, I view Clojure as an apex language: the best confluence of software ideas towards a unified goal. Where Haskell excels at lazy, pure, strongly typed problems, Clojure is my first choice for dynamic, high-level, general-purpose programming. I wish it were faster, that it had a smarter compiler, that it had CPAN's breadth, that its error messages were less malevolent, that it had a strong type system for some problems. But for all this, you gain a fantastically expressive, concise, rich language built out of strikingly few ideas which lock together beautifully. It gives you a modern build system, a REPL, hot code reloading, hierarchies, parametric polymorphism, protocols, namespaces, immediate, lazy, logical, object-oriented, and functional modes, rich primitives, expressive syntax, immutable and mutable containers, many kinds of concurrency, thoughtful Java integration, hygienic and anaphoric macros, and homoiconicity.

Were Clojure to cease, I would immediately endeavor to replicate its strengths in another language. That's a primary language to me. ;-)

More from Hacker News. I figure this might be of interest to folks working on parallel systems. I'll let KirinDave kick us off with:

Go scales quite well across multiple cores iff you decompose the problem in a way that’s amenable to Go’s strategy. Same with Erlang. No one is making “excuses”. It’s important to understand these problems. Not understanding concurrency, parallelism, their relationship, and Amdahl’s Law is what has Node.js in such trouble right now.

Ryah responds:

Trouble? Node.js has linear speedup over multiple cores for web servers. See http://nodejs.org/docs/v0.8.4/api/cluster.html for more info.

It's parallel in the same sense that any POSIX program is: Node pays a higher cost than real parallel VMs in serialization across IPC boundaries, not being able to take advantage of atomic CPU operations on shared data structures, etc. At least it did last time I looked. Maybe they're doing some shm-style magic/semaphore stuff now. Still going to pay the context switch cost.

this is the sanest and most pragmatic way to serve a web server from multiple threads

Threads and processes both require a context switch, but on posix systems the thread switch is considerably less expensive. Why? Mainly because the process switch involves changing the VM address space, which means all that hard-earned cache has to be fetched from DRAM again. You also pay a higher cost in synchronization: every message shared between processes requires crossing the kernel boundary. So not only do you have a higher memory use for shared structures and higher CPU costs for serialization, but more cache churn and context switching.

it’s all serialization - but that’s not a bottleneck for most web servers.

I disagree, especially for a format like JSON. In fact, every web app server I've dug into spends a significant amount of time on parsing and unparsing responses. You certainly aren't going to be doing computationally expensive tasks in Node, so messaging performance is paramount.

i’d love to hear your context-switching free multicore solution.

I claimed no such thing: only that multiprocess IPC is more expensive. Modulo syscalls, I think your best bet is gonna be n-1 threads with processor affinities taking advantage of cas/memory fence capabilities on modern hardware.

A Node.js example

Here are two programs, one in Node.js, and one in Clojure, which demonstrate message passing and (for Clojure) an atomic compare-and-set operation.

Node.js: https://gist.github.com/3200829

Clojure: https://gist.github.com/3200862

Note that I picked really small messages–integers–to give Node the best possible serialization advantage.

$ time node cluster.js
Finished with 10000000

real    3m30.652s
user    3m17.180s
sys     1m16.113s

Note the high sys time: that's IPC. Node also uses only 75% of each core. Why?

$ pidstat -w | grep node
12:13:24 PM       PID   cswch/s nvcswch/s  Command
11:47:47 AM     25258     48.22      2.11  node
11:47:47 AM     25260     48.34      1.99  node

100 context switches per second.

$ strace -cf node cluster.js
Finished with 1000000

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.03    5.219237          31    168670           nanosleep
  1.63    0.087698           0    347937     61288 futex
  1.01    0.054567           0   1000007         1 epoll_wait
  0.20    0.010581           0   1000006           write
  0.11    0.005863           0   1000005           recvmsg

OK, so every send requires a call to write(), and every read takes a call to epoll_wait() and recvmsg(). It takes 3.5 syscalls to send a message. We're spending a lot of time in nanosleep, and roughly 34% of messages involved futex–which I'm hoping means the Node authors did their IPC properly instead of polling streams.

[Edit: Thanks @joedamato, I was forgetting -f]

The JVM

Now let's take a look at that Clojure program, which uses 2 threads passing messages over a pair of LinkedTransferQueues. It uses 97% of each core easily. Note that the times here include ~1 second of jvm startup.

$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar queue 10000000
"Elapsed time: 53116.427613 msecs"

real    0m54.213s
user    1m16.401s
sys     0m6.028s

Why is this version over 3 times faster? Well mostly because it's not serializing and isn't javascript–but on top of that, it causes only 11 context switches per second:

$ pidstat -tw -p 26537
Linux 3.2.0-3-amd64 (azimuth)   07/29/2012      _x86_64_        (2 CPU)

11:52:03 AM      TGID       TID   cswch/s nvcswch/s  Command
11:52:03 AM     26537         -      0.00      0.00  java
11:52:03 AM         -     26540      0.01      0.00  |__java
11:52:03 AM         -     26541      0.01      0.00  |__java
11:52:03 AM         -     26544      0.01      0.00  |__java
11:52:03 AM         -     26549      0.01      0.00  |__java
11:52:03 AM         -     26551      0.01      0.00  |__java
11:52:03 AM         -     26552      2.16      4.26  |__java
11:52:03 AM         -     26553      2.10      4.33  |__java

And queues are WAY slower than compare-and-set, which involves basically no context switching:

$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar atom 10000000
"Elapsed time: 999.805116 msecs"

real    0m2.092s
user    0m2.700s
sys     0m0.176s

$ pidstat -tw -p 26717
Linux 3.2.0-3-amd64 (azimuth)   07/29/2012      _x86_64_        (2 CPU)

11:54:49 AM      TGID       TID   cswch/s nvcswch/s  Command
11:54:49 AM     26717         -      0.00      0.00  java
11:54:49 AM         -     26720      0.00      0.01  |__java
11:54:49 AM         -     26728      0.01      0.00  |__java
11:54:49 AM         -     26731      0.00      0.02  |__java
11:54:49 AM         -     26732      0.00      0.01  |__java

It's harder to interpret strace here because the JVM startup involves a fair number of syscalls. Subtracting the cost to run the program with 0 iterations, we can obtain the marginal cost of each message: roughly 1 futex per 24,000 ops. I suspect the futex calls here are related to the fact that the main thread and most of the clojure future pool are hanging around doing nothing. The work itself is basically free of kernel overhead.

TL;DR: node.js IPC is not a replacement for a real parallel VM. It allows you to solve a particular class of parallel problems (namely, those which require relatively infrequent communication) on multiple cores, but shared state is basically impossible and message passing is slow. It's a suitable tool for problems which are largely independent and where you can defer the problem of shared state to some other component, e.g. a database. Node is great for stateless web heads, but is in no way a high-performance parallel environment.

As KirinDave notes, different languages afford different types of concurrency strategies–and some offer a more powerful selection than others. Pick the language and libraries which match your problem best.

Most applications have configuration: how to open a connection to the database, what file to log to, the locations of key data files, etc.

Configuration is hard to express correctly. It’s dynamic because you don’t know the configuration at compile time–instead it comes from a file, the network, command arguments, etc. Config is almost always implicit, because it affects your functions without being passed in as an explicit parameter. Most languages address this in two ways:

Globals

Global variables are accessible in every scope, so they make great implicit parameters for functions.

module App
  API_SERVER = "api3"
end

def save(record)
  http_put(App::API_SERVER, record)
end

Classes are often global, so you can also attach config to that class’s eigenclass, singleton object, or what have you:

require 'ostruct'

class App
  def self.config
    @config ||= OpenStruct.new
  end
end

App.config.api_server = "api3"
App.config.api_server # => "api3"

Erlang apps often handle config with a globally-named module:

{ok, Server} = app_config:get(api_server),

The global variable model is concise and simple; it’s what you should reach for right away. Every thread sees the same values. In fact, all code everywhere sees the same values. Yet there are shortcomings: what if you’re writing a library? What about tests, where you might call the same function with several different configurations? What if you’re running more than one copy of your application concurrently?

Object graph traversal

An advanced OOP programmer may solve the global problem by putting configuration into instances. The application sets up a graph of instances, each with the configuration it needs to do its job.

class App
  def initialize(config)
    @api_client = App::APIClient.new config[:api_server]
    @logger     = Logger.new config[:logger]
  end
end

… and so forth. What if the APIClient needs to use the logger? You could keep a pointer to the application around:

class APIClient
  def initialize(app, config)
    @app    = app
    @server = config[:server]
  end

  def get
    @app.logger.log "getting"
  end
end

And traverse the graph of objects in your application. This basically amounts to passing a configuration parameter into every constructor, but has the added benefit of letting you look up other objects in the Application: maybe other local services you might need. It’s a good way to let different components work together cleanly without making their dependencies explicit: the Application doesn’t need to know exactly what services an APIClient needs. Hoorah, encapsulation! It’s also thread-safe: you can create as many applications concurrently as you like, and they won’t step on each other.

On the other hand, you do a lot of traversing, and since these are instance variables, there’s no way to refer to them within other functions, like class methods. It’s also more difficult to test, since you have to stand up all the dependencies (mocked or otherwise) in order to create an object.

At this point, someone else reading this article is screaming “dependency injection frameworks” and pulling out XML. But before we pull out DI, let’s back up and think.

Backing up for a second

What we really want from configuration is to take functions like this:

f(config, x) = g(config, x * 2)
g(config, y) = h(config, y + 1)
h(config, z) = config + z

… and express them like this:

f(x) = g(x*2)
g(y) = h(y+1)
h(z) = config + z

We want the config variable to become implicit so that f and g are simplified. f and g do depend on config–but config may be irrelevant to their internal definition, and explicitly tracking every parameter dependency in the system can be exhausting. These implicit variables are known as dynamic scope in programming languages: variables which are bound in every function in a call stack, but are not explicit in their signatures. More particularly, we want two properties:

  1. The variable is bound only within and below the binding expression. When control returns from the binding expression, the variable reverts to its previous value.

  2. The variable is bound only for the thread that created it, and threads created from the bound scope; that is to say, two parallel invocations of f() can have different values of config. This lets us run, say, two copies of an application at the same time.

In Scala, one kind of implicit scope is provided by implicit parameters, which allow enclosing scope to carry down (at least) one level, to functions which have arguments of the same name and type, and which are tagged as “implicit”. (Well, at least, I think that’s what they do; A Tour of Scala: Implicit Parameters is beyond my mortal comprehension). Implicit parameters don’t carry across threads, which makes it a little tough to defer operations using, say, futures.

In Java, one might consider an InheritableThreadLocal for the task. That gives us the thread isolation property, provided that one remembers to clean up the thread local appropriately at the end of the binding context. Many Java libraries use this to provide, say, request context in a web app. Scala neatly wraps this construct with DynamicVariable, a mutable, thread-local, thread-inherited object which is bound only while a given closure is running. Since Scala doesn’t actually have dynamic scope, we still need to access the DynamicVariable object statically. No problem: we can bind it to a singleton object, just like the Ruby examples earlier:

class App {
  def start() {
    App.config.withValue(someConfigStructure) {
      httpServer.run()
    }
  }
}

object App {
  // DynamicVariable requires an initial (root) value.
  val config = new DynamicVariable[MyConfig](null)
}

class HttpServer {
  def run() {
    listen(App.config.value.httpPort)
  }
}

There’s a bit of a wart in that we need to call config.value() in order to get the currently bound value, but the semantics are sound, the code is readable, and there’s no extraneous bookkeeping.

Dynamic scope

In languages that support dynamic scope (Most Lisps, Perl, Haskell (sort of)), we can express this directly:

(ns app.config)
(def ^:dynamic config nil)

(ns app.core)
(defn start []
  (binding [app.config/config some-config-structure]
    (http-server/run)))

(ns app.http-server
  (:use app.config))
(defn run []
  (listen (:http-port config)))

One of the arguments against dynamic scope is that it can lead to name capture: a dynamic binding for “config” could break a function deep in someone else’s code that used that variable name. Clojure uses namespaces to separate vars, neatly allowing us to write either “app.config/config”, or, having included app.config, use the short name “config”. Other code remains unaffected.

Dynamic var bindings in Clojure have a root value (shared between all threads), and an overrideable thread-local value. However, not all Clojure closures close over dynamic vars! New threads do not inherit the dynamic frames of their parents by default: only future, bound-fn, and friends capture their dynamic scope. (Thread. (fn [] …)) will run with fresh (root) dynamic bindings. Use (bound-fn) where you want to preserve the current dynamic bindings between threads, and (fn) where you wish to reset them.
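A quick demonstration of the difference (a REPL sketch; *config* is just an example var):

(def ^:dynamic *config* :root)

(binding [*config* :bound]
  (prn @(future *config*))                          ; :bound--future conveys bindings
  (.start (Thread. (bound-fn [] (prn *config*))))   ; prints :bound
  (.start (Thread. (fn [] (prn *config*)))))        ; prints :root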

Thread-inheritable dynamic vars in Clojure

Alternatively, we could adopt Scala’s approach: define a new kind of reference, backed by an InheritableThreadLocal:

(defn thread-inheritable
  "Creates a dynamic, thread-local, thread-inheritable object, with initial
  value 'value'. Set with (.set x value), read with (deref x)."
  [value]
  (doto (proxy [InheritableThreadLocal IDeref] []
          (deref [] (.get this)))
    (.set value)))

That proxy expression creates a new InheritableThreadLocal which also implements IDeref, Clojure’s interface for dereferenceable things like vars, refs, atoms, agents, etc. Now we just need a macro to set the local within some scope.

(defn- set-dynamic-thread-vars!
  "Takes a map of vars to values, and assigns each."
  [bindings-map]
  (doseq [[v value] bindings-map]
    (.set v value)))

(defmacro inheritable-binding
  "Creates new bindings for the (already-existing) dynamic thread-inherited
  vars, with the supplied initial values. Executes exprs in an implicit do,
  then re-establishes the bindings that existed before. Bindings are made
  sequentially, like let."
  [bindings & body]
  `(let [inner-bindings# (hash-map ~@bindings)
         outer-bindings# (into {} (for [[k# v#] inner-bindings#]
                                    [k# (deref k#)]))]
     (try
       (set-dynamic-thread-vars! inner-bindings#)
       ~@body
       (finally
         (set-dynamic-thread-vars! outer-bindings#)))))

Now we can define a new var–say config, and rebind it dynamically.

(def config (thread-inheritable :default))

(prn "Initially" @config)

(inheritable-binding [config :inside]
  ; In any functions we call, (deref config) will be :inside.
  (prn "Inside" @config)

  ; We can safely evaluate multiple bindings in parallel. It's the
  ; many-worlds hypothesis in action!
  (inheritable-binding [config :future]
    (future (prn "Future" @config)))

  ; Unlike regular ^:dynamic vars, bindings are inherited in child threads.
  (inheritable-binding [config :thread]
    (.start (Thread. (fn [] (prn "In unbound thread" @config))))))

More realistically, one might write:

(defmacro with-config [m & body]
  `(inheritable-binding [config ~m]
     ~@body))

(defn start-server []
  (listen (:port @config)))

(with-config {:port 2}
  (start-server))

Voilà! Mutable, thread-safe, thread-inherited, implicit variables.

It’s worth noting that these variables are not a part of the dynamic binding, so they won’t be captured by (bound-fn). If you want to pass closures between existing threads, use ^:dynamic and (bound-fn). If you want your bindings to follow thread inheritance, use this bind-dynamic approach.

Closing thoughts

With all this in mind, remember LOGO? That little language has more in common with Lisp than you might think, though that discussion is, shall we say… out of this article’s scope.

TO RUNHTTPSERVER
  LISTEN :PORT
END

TO STARTAPP
  MAKE "PORT 8080
  RUNHTTPSERVER
END