One of the hard-won lessons of the last few weeks has been that inexplicable periodic latency jumps in network services should be met with an investigation into named.

[Image: dns_latency.png]

API latency has been wonky for the last couple of weeks: for a few hours it will rise to roughly 5–10x normal, then drop again. Nothing in syslog, no connection-table issues, IP stats didn't reveal any TCP/IP-layer difficulties, the network was solid, there was no CPU, memory, or disk contention, and no obviously correlated load on other hosts. It turned out to be BIND getting overwhelmed (we have, er, nontrivial DNS demands), which slowed down local domain resolution. For now I'm just pushing everything out via /etc/hosts, but I'll probably drop a local bind9 on every host as a cache.
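One way to confirm the resolver is the culprit, rather than the application, is to time name resolution directly from the service's runtime. A minimal sketch (the host name and threshold here are placeholders, not our actual setup):

```python
import socket
import time

def resolve_latency(host, attempts=5):
    """Time socket.getaddrinfo for `host` and return the worst observed
    latency in milliseconds. Consistently high or spiky numbers for an
    internal name point at the resolver, not the app."""
    worst = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        socket.getaddrinfo(host, None)
        worst = max(worst, (time.monotonic() - start) * 1000)
    return worst

# e.g. resolve_latency("localhost") -- anything consistently above a few
# milliseconds for a name in /etc/hosts or a nearby cache is suspicious.
```

Running this periodically alongside the API latency graph makes the correlation (or its absence) obvious.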

If anyone has experience with production DNS resolver caching, I'd appreciate your input.
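For reference, the per-host caching resolver I have in mind would look something like this minimal named.conf.options sketch (the forwarder addresses are placeholders, not our real nameservers):

```
options {
    directory "/var/cache/bind";

    // Resolve recursively, but only for this host.
    recursion yes;
    listen-on { 127.0.0.1; };
    allow-query { 127.0.0.1; };

    // Forward cache misses to the central nameservers
    // (placeholder addresses).
    forwarders {
        10.0.0.2;
        10.0.0.3;
    };
};
```

With `nameserver 127.0.0.1` in /etc/resolv.conf, repeated lookups are served from the local cache and only misses hit the shared servers.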

John Mullerleile, Phil Kulak, and I gave a talk tonight, entitled "Scaling at Showyou."

[Image: stack.png]

I gave an overview of the Showyou architecture, including our use of Riak, Solr, and Redis; strategies for robust systems; and our comprehensive monitoring system. You may want to check out:

Phil talked a little bit about the importer, including our use of Node.js and some nice stats.

John dropped lots of juicy details regarding his exciting projects, including a new Riak backend which binds together Solr, LevelDB, and a distributed processing system we're calling Fabric. Fast parallelized key listing, range queries, full-text search, geospatial queries, etc. In Riak. Yes, you heard that right.

Oh, and as a part of Fabric we've got a distributed queue with replicated failover and transactions, built on top of Hazelcast and exposed over protocol buffers. We've got some polishing to do before that gets released, but when it does, it should be worthy of another talk.

Copyright © 2015 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.