Context switches and serialization in Node
More from Hacker News. I figure this might be of interest to folks working on parallel systems. I'll let KirinDave kick us off with:
Go scales quite well across multiple cores iff you decompose the problem in a way that’s amenable to Go’s strategy. Same with Erlang. No one is making “excuses”. It’s important to understand these problems. Not understanding concurrency, parallelism, their relationship, and Amdahl’s Law is what has Node.js in such trouble right now.
Trouble? Node.js has linear speedup over multiple cores for web servers. See http://nodejs.org/docs/v0.8.4/api/cluster.html for more info.
It's parallel in the same sense that any POSIX program is: Node pays a higher cost than real parallel VMs in serialization across IPC boundaries, not being able to take advantage of atomic CPU operations on shared data structures, etc. At least it did last time I looked. Maybe they're doing some shm-style magic/semaphore stuff now. Still going to pay the context switch cost.
this is the sanest and most pragmatic way to serve a web server from multiple threads
Threads and processes both require a context switch, but on posix systems the thread switch is considerably less expensive. Why? Mainly because the process switch involves changing the VM address space, which means all that hard-earned cache has to be fetched from DRAM again. You also pay a higher cost in synchronization: every message shared between processes requires crossing the kernel boundary. So not only do you have a higher memory use for shared structures and higher CPU costs for serialization, but more cache churn and context switching.
it’s all serialization - but that’s not a bottleneck for most web servers.
I disagree, especially for a format like JSON. In fact, every web app server I've dug into spends a significant amount of time on parsing and unparsing responses. You certainly aren't going to be doing computationally expensive tasks in Node, so messaging performance is paramount.
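To get a feel for that serialization overhead, here's a hypothetical micro-benchmark (not from the post): it times the JSON stringify/parse round-trip that each small integer message pays when it crosses a process boundary. The constant N and the timing scaffolding are my own.

```javascript
// Hypothetical micro-benchmark: isolate the JSON round-trip each
// cross-process message pays. N is illustrative, not from the post.
const N = 100000;
let x = 0;
const start = process.hrtime.bigint();
for (let i = 0; i < N; i++) {
  // Serializing on one side and parsing on the other is roughly what
  // Node's IPC channel does to every message.
  x = JSON.parse(JSON.stringify(i));
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`${N} JSON round-trips in ${elapsedMs.toFixed(1)} ms`);
```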
i’d love to hear your context-switching free multicore solution.
I claimed no such thing: only that multiprocess IPC is more expensive. Modulo syscalls, I think your best bet is gonna be n-1 threads with processor affinities taking advantage of cas/memory fence capabilities on modern hardware.
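For reference, the compare-and-set retry loop KirinDave alludes to has this shape. This is an illustrative sketch using the Atomics API of modern JavaScript runtimes (which 2012-era Node did not have), run single-threaded here purely to show the operation, not a claim about Node's capabilities:

```javascript
// Sketch of a lock-free increment via compare-and-swap (CAS).
// Shown single-threaded just to illustrate the retry-loop shape.
const counter = new Int32Array(new SharedArrayBuffer(4));

function incr() {
  let old;
  do {
    old = Atomics.load(counter, 0);
    // compareExchange stores old+1 only if the slot still holds old;
    // if another thread won the race, we reload and retry.
  } while (Atomics.compareExchange(counter, 0, old, old + 1) !== old);
  return old + 1;
}

for (let i = 0; i < 1000; i++) incr();
console.log(Atomics.load(counter, 0)); // 1000
```

No kernel involvement: the contended case spins in userspace on a single cache line, which is why this approach avoids the context-switch costs discussed above.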
A Node.js example
Here are two programs, one in Node.js, and one in Clojure, which demonstrate message passing and (for Clojure) an atomic compare-and-set operation.
Note that I picked really small messages–integers–to give Node the best possible serialization advantage.
$ time node cluster.js
Finished with 10000000

real	3m30.652s
user	3m17.180s
sys	1m16.113s
Note the high sys time: that's IPC. Node also uses only 75% of each core. Why?
$ pidstat -w | grep node
12:13:24 PM       PID   cswch/s nvcswch/s  Command
11:47:47 AM     25258     48.22      2.11  node
11:47:47 AM     25260     48.34      1.99  node
Roughly 100 context switches per second between the two processes.
$ strace -cf node cluster.js
Finished with 1000000
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.03    5.219237          31    168670           nanosleep
  1.63    0.087698           0    347937     61288 futex
  1.01    0.054567           0   1000007         1 epoll_wait
  0.20    0.010581           0   1000006           write
  0.11    0.005863           0   1000005           recvmsg
OK, so every send requires a call to write(), and every read takes a call to epoll_wait() and recvmsg(). Counting the futexes, that's about 3.5 syscalls per message. We're also spending a lot of time in nanosleep, and roughly 34% of messages involved a futex–which I'm hoping means the Node authors did their IPC properly instead of polling streams.
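The syscalls-per-message arithmetic, using the call counts from the strace table:

```javascript
// Syscalls per message, from the strace call counts above.
const messages = 1000000;
const calls = {
  epoll_wait: 1000007, // one per read
  write: 1000006,      // one per send
  recvmsg: 1000005,    // one per read
  futex: 347937,       // wakeups on ~34% of messages
};
const total = Object.values(calls).reduce((a, b) => a + b, 0);
console.log((total / messages).toFixed(2)); // "3.35" -- roughly 3.5 syscalls/message
```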
[Edit: Thanks @joedamato, I was forgetting -f]
Now let's take a look at that Clojure program, which uses 2 threads passing messages over a pair of LinkedTransferQueues. It uses 97% of each core easily. Note that the times here include ~1 second of jvm startup.
$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar queue 10000000
"Elapsed time: 53116.427613 msecs"

real	0m54.213s
user	1m16.401s
sys	0m6.028s
$ pidstat -tw -p 26537
Linux 3.2.0-3-amd64 (azimuth) 	07/29/2012 	_x86_64_	(2 CPU)

11:52:03 AM      TGID       TID   cswch/s nvcswch/s  Command
11:52:03 AM     26537         -      0.00      0.00  java
11:52:03 AM         -     26540      0.01      0.00  |__java
11:52:03 AM         -     26541      0.01      0.00  |__java
11:52:03 AM         -     26544      0.01      0.00  |__java
11:52:03 AM         -     26549      0.01      0.00  |__java
11:52:03 AM         -     26551      0.01      0.00  |__java
11:52:03 AM         -     26552      2.16      4.26  |__java
11:52:03 AM         -     26553      2.10      4.33  |__java
And queues are WAY slower than compare-and-set, which involves basically no context switching:
$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar atom 10000000
"Elapsed time: 999.805116 msecs"

real	0m2.092s
user	0m2.700s
sys	0m0.176s

$ pidstat -tw -p 26717
Linux 3.2.0-3-amd64 (azimuth) 	07/29/2012 	_x86_64_	(2 CPU)

11:54:49 AM      TGID       TID   cswch/s nvcswch/s  Command
11:54:49 AM     26717         -      0.00      0.00  java
11:54:49 AM         -     26720      0.00      0.01  |__java
11:54:49 AM         -     26728      0.01      0.00  |__java
11:54:49 AM         -     26731      0.00      0.02  |__java
11:54:49 AM         -     26732      0.00      0.01  |__java
It's harder to interpret strace here because the JVM startup involves a fair number of syscalls. Subtracting the cost to run the program with 0 iterations, we can obtain the marginal cost of each message: roughly 1 futex per 24,000 ops. I suspect the futex calls here are related to the fact that the main thread and most of the clojure future pool are hanging around doing nothing. The work itself is basically free of kernel overhead.
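The subtraction works like this. The futex totals below are hypothetical stand-ins–the post reports only the result, about 1 futex per 24,000 ops–chosen purely to show the arithmetic:

```javascript
// Marginal-cost arithmetic: subtract the 0-iteration baseline from the
// full run. Both futex totals are HYPOTHETICAL stand-ins; the post only
// reports the quotient, ~1 futex per 24,000 ops.
const ops = 10000000;
const futexFullRun = 5417;  // hypothetical: futex calls with 10M iterations
const futexBaseline = 5000; // hypothetical: futex calls with 0 iterations
const opsPerFutex = ops / (futexFullRun - futexBaseline);
console.log(Math.round(opsPerFutex)); // ~24,000 ops per futex
```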
TL;DR: node.js IPC is not a replacement for a real parallel VM. It allows you to solve a particular class of parallel problems (namely, those which require relatively infrequent communication) on multiple cores, but shared state is basically impossible and message passing is slow. It's a suitable tool for problems which are largely independent and where you can defer the problem of shared state to some other component, e.g. a database. Node is great for stateless web heads, but is in no way a high-performance parallel environment.
As KirinDave notes, different languages afford different types of concurrency strategies–and some offer a more powerful selection than others. Pick the language and libraries which match your problem best.