Context switches and serialization in Node
More from Hacker News. I figure this might be of interest to folks working on parallel systems. I'll let KirinDave kick us off with:
Go scales quite well across multiple cores iff you decompose the problem in a way that’s amenable to Go’s strategy. Same with Erlang. No one is making “excuses”. It’s important to understand these problems. Not understanding concurrency, parallelism, their relationship, and Amdahl’s Law is what has Node.js in such trouble right now.
Trouble? Node.js has linear speedup over multiple cores for web servers. See http://nodejs.org/docs/v0.8.4/api/cluster.html for more info.
It's parallel in the same sense that any POSIX program is: Node pays a higher cost than real parallel VMs in serialization across IPC boundaries, not being able to take advantage of atomic CPU operations on shared data structures, etc. At least it did last time I looked. Maybe they're doing some shm-style magic/semaphore stuff now. Still going to pay the context switch cost.
this is the sanest and most pragmatic way to serve a web server from multiple threads
Threads and processes both require a context switch, but on posix systems the thread switch is considerably less expensive. Why? Mainly because the process switch involves changing the VM address space, which means all that hard-earned cache has to be fetched from DRAM again. You also pay a higher cost in synchronization: every message shared between processes requires crossing the kernel boundary. So not only do you have a higher memory use for shared structures and higher CPU costs for serialization, but more cache churn and context switching.
it’s all serialization - but that’s not a bottleneck for most web servers.
I disagree, especially for a format like JSON. In fact, every web app server I've dug into spends a significant amount of time on parsing and unparsing responses. You certainly aren't going to be doing computationally expensive tasks in Node, so messaging performance is paramount.
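To get a feel for that serialization overhead, here's a hypothetical micro-benchmark (not from the post): it times the JSON stringify/parse round-trip that each small integer message pays when it crosses a process boundary. The constant N and the timing scaffolding are my own.

```javascript
// Hypothetical micro-benchmark: isolate the JSON round-trip each
// cross-process message pays. N is illustrative, not from the post.
const N = 100000;
let x = 0;
const start = process.hrtime.bigint();
for (let i = 0; i < N; i++) {
  // Serializing on one side and parsing on the other is roughly what
  // Node's IPC channel does to every message.
  x = JSON.parse(JSON.stringify(i));
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`${N} JSON round-trips in ${elapsedMs.toFixed(1)} ms`);
```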
i’d love to hear your context-switching free multicore solution.
I claimed no such thing: only that multiprocess IPC is more expensive. Modulo syscalls, I think your best bet is gonna be n-1 threads with processor affinities taking advantage of cas/memory fence capabilities on modern hardware.
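For reference, the compare-and-set retry loop KirinDave alludes to has this shape. This is an illustrative sketch using the Atomics API of modern JavaScript runtimes (which 2012-era Node did not have), run single-threaded here purely to show the operation, not a claim about Node's capabilities:

```javascript
// Sketch of a lock-free increment via compare-and-swap (CAS).
// Shown single-threaded just to illustrate the retry-loop shape.
const counter = new Int32Array(new SharedArrayBuffer(4));

function incr() {
  let old;
  do {
    old = Atomics.load(counter, 0);
    // compareExchange stores old+1 only if the slot still holds old;
    // if another thread won the race, we reload and retry.
  } while (Atomics.compareExchange(counter, 0, old, old + 1) !== old);
  return old + 1;
}

for (let i = 0; i < 1000; i++) incr();
console.log(Atomics.load(counter, 0)); // 1000
```

No kernel involvement: the contended case spins in userspace on a single cache line, which is why this approach avoids the context-switch costs discussed above.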
A Node.js example
Here are two programs, one in Node.js, and one in Clojure, which demonstrate message passing and (for Clojure) an atomic compare-and-set operation.
Note that I picked really small messages–integers–to give Node the best possible serialization advantage.
$ time node cluster.js
Finished with 10000000

real	3m30.652s
user	3m17.180s
sys	1m16.113s
Note the high sys time: that's IPC. Node also uses only 75% of each core. Why?
$ pidstat -w | grep node
12:13:24 PM       PID   cswch/s nvcswch/s  Command
11:47:47 AM     25258     48.22      2.11  node
11:47:47 AM     25260     48.34      1.99  node
Roughly 100 context switches per second between the two processes.
$ strace -cf node cluster.js
Finished with 1000000
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.03    5.219237          31    168670           nanosleep
  1.63    0.087698           0    347937     61288 futex
  1.01    0.054567           0   1000007         1 epoll_wait
  0.20    0.010581           0   1000006           write
  0.11    0.005863           0   1000005           recvmsg
OK, so every send requires a call to write(), and every read takes a call to epoll_wait() and recvmsg(). Counting the futexes, that's about 3.5 syscalls per message. We're also spending a lot of time in nanosleep, and roughly 34% of messages involved a futex–which I'm hoping means the Node authors did their IPC properly instead of polling streams.
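The syscalls-per-message arithmetic, using the call counts from the strace table:

```javascript
// Syscalls per message, from the strace call counts above.
const messages = 1000000;
const calls = {
  epoll_wait: 1000007, // one per read
  write: 1000006,      // one per send
  recvmsg: 1000005,    // one per read
  futex: 347937,       // wakeups on ~34% of messages
};
const total = Object.values(calls).reduce((a, b) => a + b, 0);
console.log((total / messages).toFixed(2)); // "3.35" -- roughly 3.5 syscalls/message
```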
[Edit: Thanks @joedamato, I was forgetting -f]
Now let's take a look at that Clojure program, which uses 2 threads passing messages over a pair of LinkedTransferQueues. It uses 97% of each core easily. Note that the times here include ~1 second of jvm startup.
$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar queue 10000000
"Elapsed time: 53116.427613 msecs"

real	0m54.213s
user	1m16.401s
sys	0m6.028s
$ pidstat -tw -p 26537
Linux 3.2.0-3-amd64 (azimuth) 	07/29/2012 	_x86_64_	(2 CPU)

11:52:03 AM      TGID       TID   cswch/s nvcswch/s  Command
11:52:03 AM     26537         -      0.00      0.00  java
11:52:03 AM         -     26540      0.01      0.00  |__java
11:52:03 AM         -     26541      0.01      0.00  |__java
11:52:03 AM         -     26544      0.01      0.00  |__java
11:52:03 AM         -     26549      0.01      0.00  |__java
11:52:03 AM         -     26551      0.01      0.00  |__java
11:52:03 AM         -     26552      2.16      4.26  |__java
11:52:03 AM         -     26553      2.10      4.33  |__java
And queues are WAY slower than compare-and-set, which involves basically no context switching:
$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar atom 10000000
"Elapsed time: 999.805116 msecs"

real	0m2.092s
user	0m2.700s
sys	0m0.176s

$ pidstat -tw -p 26717
Linux 3.2.0-3-amd64 (azimuth) 	07/29/2012 	_x86_64_	(2 CPU)

11:54:49 AM      TGID       TID   cswch/s nvcswch/s  Command
11:54:49 AM     26717         -      0.00      0.00  java
11:54:49 AM         -     26720      0.00      0.01  |__java
11:54:49 AM         -     26728      0.01      0.00  |__java
11:54:49 AM         -     26731      0.00      0.02  |__java
11:54:49 AM         -     26732      0.00      0.01  |__java
It's harder to interpret strace here because the JVM startup involves a fair number of syscalls. Subtracting the cost to run the program with 0 iterations, we can obtain the marginal cost of each message: roughly 1 futex per 24,000 ops. I suspect the futex calls here are related to the fact that the main thread and most of the clojure future pool are hanging around doing nothing. The work itself is basically free of kernel overhead.
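The subtraction works like this. The futex totals below are hypothetical stand-ins–the post reports only the result, about 1 futex per 24,000 ops–chosen purely to show the arithmetic:

```javascript
// Marginal-cost arithmetic: subtract the 0-iteration baseline from the
// full run. Both futex totals are HYPOTHETICAL stand-ins; the post only
// reports the quotient, ~1 futex per 24,000 ops.
const ops = 10000000;
const futexFullRun = 5417;  // hypothetical: futex calls with 10M iterations
const futexBaseline = 5000; // hypothetical: futex calls with 0 iterations
const opsPerFutex = ops / (futexFullRun - futexBaseline);
console.log(Math.round(opsPerFutex)); // ~24,000 ops per futex
```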
TL;DR: node.js IPC is not a replacement for a real parallel VM. It allows you to solve a particular class of parallel problems (namely, those which require relatively infrequent communication) on multiple cores, but shared state is basically impossible and message passing is slow. It's a suitable tool for problems which are largely independent and where you can defer the problem of shared state to some other component, e.g. a database. Node is great for stateless web heads, but is in no way a high-performance parallel environment.
As KirinDave notes, different languages afford different types of concurrency strategies–and some offer a more powerful selection than others. Pick the language and libraries which match your problem best.