Microsoft released this little gem today, fixing a bug which allowed remote code execution on all Windows Vista, 7, and Server 2008 versions.

...allow remote code execution if an attacker sends a continuous flow of specially crafted UDP packets to a closed port on a target system.

Meanwhile, in an aging supervillain’s cavernous lair…

Continue reading (63 words)

Major thanks to John Muellerleile (@jrecursive) for his help in crafting this.

Actually, don’t expose pretty much any database directly to untrusted connections. You’re begging for denial-of-service issues; even if the operations are semantically valid, they’re running on a physical substrate with real limits.

Riak, for instance, exposes mapreduce over its HTTP API. Mapreduce is code; code which can have side effects; code which is executed on your cluster. This is an attacker’s dream.
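To make that concrete: Riak’s /mapred endpoint accepts a JSON body whose map phases carry arbitrary JavaScript source, which the cluster then runs. A request of roughly this shape (bucket name hypothetical) would happily spin the CPU of every node holding that bucket’s data:

```json
{
  "inputs": "some-bucket",
  "query": [
    {"map": {
      "language": "javascript",
      "source": "function(value) { while (true) {} }"
    }}
  ]
}
```

No credentials, no sandboxing of consequence — just POST it and watch the cluster burn.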

Continue reading (619 words)

As a part of the exciting series of events (long story…) around our riak cluster this week, we switched over to riak-pipe mapreduce. Usually, when a node is down, mapreduce times shoot through the roof, which causes slow behavior and even timeouts on the API. Riak-pipe changes that: our API latency for mapreduce-heavy requests like feeds and comments fell from 3-7 seconds to a stable 600ms. Still high, but at least tolerable.
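The switch itself is just a knob in riak_kv’s section of app.config — if memory serves, something like the following selects the pipe-based implementation in Riak 1.0, but double-check the setting name against your release:

```erlang
{riak_kv, [
  %% Use riak_pipe for mapreduce instead of the legacy subsystem
  {mapred_system, pipe}
]}
```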

[Image: mapred.png]

[Update] I should also mention that riak-pipe MR throws about a thousand apparently random, recoverable errors per day: things like map_reduce_error with no explanation in the logs, or {"lineno":466,"message":"SyntaxError: syntax error","source":"()"} when the source is definitely not "()". Still haven’t figured out why, but it seems vaguely node-dependent.

Continue reading (121 words)

One of the hard-won lessons of the last few weeks has been that inexplicable periodic latency jumps in network services should be met with an investigation into named.

[Image: dns_latency.png]

API latency has been wonky for the last couple of weeks: for a few hours it rises to roughly 5 to 10x normal, then drops again. Nothing in syslog, no connection-table issues, IP stats didn’t reveal any TCP/IP-layer difficulties, the network was solid, and there was no CPU, memory, or disk contention and no obviously correlated load on other hosts. It turned out to be BIND getting overwhelmed (we have, er, nontrivial DNS demands), which slowed down local domain resolution. For now I’m just pushing everything out in /etc/hosts, but will probably drop a local bind9 on every host as a cache.
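The /etc/hosts stopgap works because the resolver chain consults the hosts file before ever touching the DNS. A quick sketch with Ruby’s stdlib Resolv showing the same ordering:

```ruby
require 'resolv'

# Consult /etc/hosts first, and only fall back to the DNS on a miss --
# the reason hosts-file entries sidestep a struggling named entirely.
resolver = Resolv.new([Resolv::Hosts.new, Resolv::DNS.new])

# Lookups for names present in /etc/hosts never leave the box.
addr = resolver.getaddress("localhost")
```

Any name you push into /etc/hosts resolves at file-read speed, no matter how backed up named is.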

Continue reading (141 words)

AWS::S3 is not threadsafe. Hell, it’s not even reusable; most methods go through class-level state. To use it in threaded code, it’s necessary to isolate S3 operations in their own address space. Fork to the rescue!

require 'timeout'
require 'aws/s3'

def s3(key, data, bucket, opts)
  begin
    fork_to do
      # The subprocess gets its own copy of AWS::S3's class-level state.
      AWS::S3::Base.establish_connection!(
        :access_key_id => KEY,
        :secret_access_key => SECRET
      )
      AWS::S3::S3Object.store key, data, bucket, opts
    end
  rescue Timeout::Error
    raise SubprocessTimedOut
  end
end

def fork_to(timeout = 4)
  r, w, pid = nil, nil, nil
  begin
    # Open pipe
    r, w = IO.pipe

    # Start subprocess
    pid = fork do
      # Child
      begin
        r.close

        val = begin
          Timeout.timeout(timeout) do
            # Run block
            yield
          end
        rescue Exception => e
          e
        end

        w.write Marshal.dump val
        w.close
      ensure
        # YOU SHALL NOT PASS
        # Skip at_exit handlers.
        exit!
      end
    end

    # Parent
    w.close

    Timeout.timeout(timeout) do
      # Read value from pipe
      begin
        val = Marshal.load r.read
      rescue ArgumentError => e
        # Marshal data too short
        # Subprocess likely exited without writing.
        raise Timeout::Error
      end

      # Return or raise value from subprocess.
      case val
      when Exception
        raise val
      else
        return val
      end
    end
  ensure
    if pid
      Process.kill "TERM", pid rescue nil
      Process.kill "KILL", pid rescue nil
      Process.waitpid pid rescue nil
    end
    r.close rescue nil
    w.close rescue nil
  end
end

There’s a lot of bookkeeping here. In a nutshell, we run the given block in a forked subprocess and hand its result back to the parent over a pipe. The rest is timeouts and process accounting: subprocesses have a tendency to get tied up, leaving dangling pipes or zombies floating around. I know there are weak points and race conditions here, but with robust retry code this approach is suitable for production.
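Stripped of the parent-side timeout and kill/reap cleanup, the core round-trip is small. A self-contained sketch of the same pattern (fork_eval is my name for it, not part of the code above):

```ruby
require 'timeout'

# Run a block in a forked child; marshal its return value (or the
# exception it raised) back to the parent over a pipe.
def fork_eval(timeout = 4)
  r, w = IO.pipe
  pid = fork do
    r.close
    val = begin
      Timeout.timeout(timeout) { yield }
    rescue Exception => e
      e
    end
    w.write Marshal.dump(val)
    w.close
    exit! # skip at_exit handlers
  end
  w.close
  val = Marshal.load(r.read) # blocks until the child closes its end
  Process.waitpid(pid)
  r.close
  raise val if val.is_a?(Exception)
  val
end

fork_eval { 6 * 7 } # => 42
```

Return values and exceptions both cross the process boundary transparently, as long as they’re marshalable — an exception raised in the child is re-raised in the parent.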

Continue reading (312 words)

In distributed systems, one frequently needs a set of n nodes to come to a consensus on a particular coordinating or master node, referred to as the leader. Leader election protocols are used to establish this. Sure, you could do the Swedish or the Silverback, but there’s a whole world of consensus algorithms out there. For instance:

The Agent Smith

Each node injects its neighbors with a total copy of its own state and identity, taking over operations on that node. Convergence is reached when all nodes are identical.

Continue reading (710 words)