Context switches and serialization in Node

More from Hacker News. I figure this might be of interest to folks working on parallel systems. I'll let KirinDave kick us off with:

Go scales quite well across multiple cores iff you decompose the problem in a way that’s amenable to Go’s strategy. Same with Erlang. No one is making “excuses”. It’s important to understand these problems. Not understanding concurrency, parallelism, their relationship, and Amdahl’s Law is what has Node.js in such trouble right now.

Ryah responds:

Trouble? Node.js has linear speedup over multiple cores for web servers. See http://nodejs.org/docs/v0.8.4/api/cluster.html for more info.

It's parallel in the same sense that any POSIX program is: Node pays a higher cost than real parallel VMs in serialization across IPC boundaries, not being able to take advantage of atomic CPU operations on shared data structures, etc. At least it did last time I looked. Maybe they're doing some shm-style magic/semaphore stuff now. Still going to pay the context switch cost.

this is the sanest and most pragmatic way to serve a web server from multiple threads

Threads and processes both require a context switch, but on posix systems the thread switch is considerably less expensive. Why? Mainly because the process switch involves changing the VM address space, which means all that hard-earned cache has to be fetched from DRAM again. You also pay a higher cost in synchronization: every message shared between processes requires crossing the kernel boundary. So not only do you have a higher memory use for shared structures and higher CPU costs for serialization, but more cache churn and context switching.

it’s all serialization - but that’s not a bottleneck for most web servers.

I disagree, especially for a format like JSON. In fact, every web app server I've dug into spends a significant amount of time on parsing and unparsing responses. You certainly aren't going to be doing computationally expensive tasks in Node, so messaging performance is paramount.
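To get a feel for that cost, here's a rough micro-benchmark (my own sketch, not from the original post) of round-tripping a small message through JSON, which is the price every cross-process send in Node pays on top of the syscalls:

```javascript
// Rough sketch of per-message serialization cost; numbers vary by machine.
// The message is tiny on purpose, mirroring the integer messages below.
const msg = { n: 42 };
const N = 100000;

const start = process.hrtime();
for (let i = 0; i < N; i++) {
  // Every IPC send serializes; every receive parses.
  JSON.parse(JSON.stringify(msg));
}
const [s, ns] = process.hrtime(start);
console.log(N + ' JSON round trips in ' + (s * 1e3 + ns / 1e6).toFixed(1) + ' ms');
```

Even for a one-field object this is real work per message, and it grows with message size.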

i’d love to hear your context-switching free multicore solution.

I claimed no such thing: only that multiprocess IPC is more expensive. Modulo syscalls, I think your best bet is gonna be n-1 threads with processor affinities taking advantage of cas/memory fence capabilities on modern hardware.

A Node.js example

Here are two programs, one in Node.js, and one in Clojure, which demonstrate message passing and (for Clojure) an atomic compare-and-set operation.

Node.js: https://gist.github.com/3200829

Clojure: https://gist.github.com/3200862

Note that I picked really small messages–integers–to give Node the best possible serialization advantage.

$ time node cluster.js
Finished with 10000000

real    3m30.652s
user    3m17.180s
sys     1m16.113s

Note the high sys time: that's IPC. Node also uses only 75% of each core. Why?

$ pidstat -w | grep node
12:13:24 PM       PID   cswch/s nvcswch/s  Command
11:47:47 AM     25258     48.22      2.11  node
11:47:47 AM     25260     48.34      1.99  node

100 context switches per second.

$ strace -cf node cluster.js
Finished with 1000000
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.03    5.219237          31    168670           nanosleep
  1.63    0.087698           0    347937     61288 futex
  1.01    0.054567           0   1000007         1 epoll_wait
  0.20    0.010581           0   1000006           write
  0.11    0.005863           0   1000005           recvmsg

OK, so every send requires a call to write(), and every read takes a call to epoll_wait() and recvmsg(): three syscalls per message, plus a futex call for roughly a third of them (347,937 futex calls for 1,000,000 messages), or about 3.5 syscalls per message. We're also spending a lot of time in nanosleep. The futex calls, I'm hoping, mean the Node authors did their IPC properly instead of polling streams.

[Edit: Thanks @joedamato, I was forgetting -f]

The JVM

Now let's take a look at that Clojure program, which uses two threads passing messages over a pair of LinkedTransferQueues. It easily keeps each core at 97%. Note that the times here include roughly one second of JVM startup.

$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar queue 10000000
"Elapsed time: 53116.427613 msecs"

real    0m54.213s
user    1m16.401s
sys     0m6.028s

Why is this version nearly four times faster? Mostly because it isn't serializing and isn't JavaScript–but on top of that, it causes only about a dozen context switches per second:

$ pidstat -tw -p 26537
Linux 3.2.0-3-amd64 (azimuth)   07/29/2012      _x86_64_        (2 CPU)

11:52:03 AM      TGID       TID   cswch/s nvcswch/s  Command
11:52:03 AM     26537         -      0.00      0.00  java
11:52:03 AM         -     26540      0.01      0.00  |__java
11:52:03 AM         -     26541      0.01      0.00  |__java
11:52:03 AM         -     26544      0.01      0.00  |__java
11:52:03 AM         -     26549      0.01      0.00  |__java
11:52:03 AM         -     26551      0.01      0.00  |__java
11:52:03 AM         -     26552      2.16      4.26  |__java
11:52:03 AM         -     26553      2.10      4.33  |__java

And queues are WAY slower than compare-and-set, which involves basically no context switching:

$ time java -jar target/messagepassing-0.1.0-SNAPSHOT-standalone.jar atom 10000000
"Elapsed time: 999.805116 msecs"

real    0m2.092s
user    0m2.700s
sys     0m0.176s

$ pidstat -tw -p 26717
Linux 3.2.0-3-amd64 (azimuth)   07/29/2012      _x86_64_        (2 CPU)

11:54:49 AM      TGID       TID   cswch/s nvcswch/s  Command
11:54:49 AM     26717         -      0.00      0.00  java
11:54:49 AM         -     26720      0.00      0.01  |__java
11:54:49 AM         -     26728      0.01      0.00  |__java
11:54:49 AM         -     26731      0.00      0.02  |__java
11:54:49 AM         -     26732      0.00      0.01  |__java

It's harder to interpret strace here because the JVM startup involves a fair number of syscalls. Subtracting the cost of running the program with zero iterations, we can obtain the marginal cost of each message: roughly one futex call per 24,000 ops. I suspect those futex calls come from the main thread and most of the Clojure future pool hanging around doing nothing. The work itself is basically free of kernel overhead.

TL;DR: node.js IPC is not a replacement for a real parallel VM. It allows you to solve a particular class of parallel problems (namely, those which require relatively infrequent communication) on multiple cores, but shared state is basically impossible and message passing is slow. It's a suitable tool for problems which are largely independent and where you can defer the problem of shared state to some other component, e.g. a database. Node is great for stateless web heads, but is in no way a high-performance parallel environment.

As KirinDave notes, different languages afford different types of concurrency strategies–and some offer a more powerful selection than others. Pick the language and libraries which match your problem best.

rektide, on

I’d love an in-depth review of where the attempted multi-threading branch in Node–isolates–went wrong, and where the worst complexities were. http://groups.google.com/group/nodejs/msg/6b8b8a487d2ab817

Hopefully that work can live on again in the future.

Jesper Louis Andersen, on

You were also pondering whether or not a context switch invokes a TLB flush. The answer is “it depends” :) It is architecture-specific and implementation-specific. On older x86 hardware, TLB flushes do happen. On newer hardware they can be avoided if you implement things the right way. If you are running virtualization, then all this changes, naturally.

TLB shootdowns happen when one core decides it needs to change address mappings for something which is shared between cores. For instance to lock it for exclusivity. So if the memory space is shared among several processes, this becomes a problem.

As for the discussion itself, stateless servers are boring. You want problems that require keeping state between requests - and also having that state invoke operations on itself based on events or timers. And you want it on several physical machines next to each other for redundancy. These are way harder to solve, and I will hypothesize that Node.js is a bad fit for those.

Aphyr, on

Regarding node isolates, I think that it would probably be less difficult to port node.js to a better VM than to modify the VM to support real or faked parallelism. Node does a few things right: smart packaging, a rich library of single-purpose tools at hand, a simple-to-understand concurrency model, etc. It also does some important things spectacularly wrong: being single-threaded, inheriting Javascript’s laughable type system, offering syntactically heavy closures as the end-all-be-all of nonlinear programming, etc. Since JS and Dart are single-threaded by specification, I have serious doubts about the semantics of parallel extensions to the core language.

Re: TLB shootdowns, thanks! This is honestly out of my depth, and I’ve read conflicting explanations as to what exactly triggers them. If I understand correctly, modern Linux kernels only need to reach an interprocessor barrier for TLB flushes when memory maps change. I assume that v8 makes relatively few allocations, and watching /proc/interrupts appears to confirm this: there are basically no TLB shootdowns for any of my message-passing examples above.

I agree that stateless servers are boring, but that’s a) good, because boring things are easy to understand, write, and analyze; and b) enough for many purposes, since many things can be accomplished with stateless components. Because shared state is complicated and expensive, I try to isolate the stateful parts of my systems from the stateless ones. The stateless parts may well be good candidates for Node.js. That said, you’re correct to observe that stateless servers invariably offload their state-sharing to another system: a database, queue, internal services, third-party APIs, and so on.

TJ, on

You might want to check this out too: https://github.com/visionmedia/super-sockets – you’ll get better throughput than Node’s built-in JSON stuff.

Aphyr, on

TJ, it’s great that folks are working on high-performance socket message passing–but I suspect you may not have read the post.

Tony Arcieri, on

Oracle will be shipping an InvokeDynamic-based implementation of JavaScript with JDK8. It shouldn’t be too hard to reimplement the Node I/O and HTTP APIs using Java NIO and run Node.js applications on the JVM once that’s complete.

Aphyr, on

Bascule, I think this is the best possible outcome. It’ll be really nice for folks who have invested time in learning Node to be able to take advantage of parallelism and keep using the same patterns and libraries. I imagine the work that’s being done on isolates will come into play as well.

Ashwin Jayaprakash, on

No RSS feed?

Aphyr, on

I’ve been kinda nuked by health issues this year, only got halfway to finishing aphyr.com. :-( Lots of folks have asked for a feed; it’s a high priority for me.

Pooria, on

That’s a very good analysis. Thank you.

But there’s a problem: The Clojure code looks “literally” like gibberish to me. Now I happen to know Node, but even if I didn’t, I would very easily understand what the first gist is doing. I read the first two chapters of some Clojure book (or maybe online tutorial) very recently (and created like 10 simple programs with it) and yet the second gist makes absolutely no sense to me. I can understand fragments of it, but…

I’m not saying the syntax is everything. But the barrier to entry for Node-land is much lower than Erlang’s or Clojure’s.

And good luck with those “health issues”.

Aphyr, on

Yeah, folks do tend to best understand the languages they speak. I find Chinese pretty hard to follow, too!

That said, this is nontrivial Clojure code. It uses several parts of the stdlib including the STM and futures, has no comments at all, uses the same functions to do multiple things at once, and the variable names are extremely short. Why? Because this gist is a transcript of my REPL session–I wrote this code live and never intended it to be read. Production Clojure tends to look a little different. ;-)

Maybe some narration will help.

https://gist.github.com/3200862

The code lives in a namespace, which I’ve called “messagepassing.core”. It’s going to use the standard Java library’s LinkedTransferQueue. We also set a constant ‘m’: the number of messages to exchange.

The main function can do two things: test the use of queues, or the use of atoms. It just looks at the first command-line argument to decide what to do.

queue-test defines a function “bounce”, which takes a number from one queue, increments it, and puts it back into the other queue. It keeps doing this until the number is bigger than m. In the node.js code this function is also called bounce, but it’s not recursive because Node is calling the function as a part of its IO loop. In fact you can’t express any other type of flow control in Node.

Then queue-test creates a pair of queues (q1 and q2) and a pair of workers (w1 and w2). The workers are just independent threads (futures) which bounce a number between queues. @w2 expands to (deref w2), and dereferencing a future means “wait for the future’s return value”. So in the main thread, @w2 waits until the test is complete. Then we clean up the other worker thread.

atom-test uses an atom, so there’s no real Node analogy here. It defines a function called bump, which increments an atomic reference ‘a’ until it reaches m. swap! is a Clojure core function which changes the value of an atomic reference using a function of its current value. Since swap! also returns the value it set, we can use (when) to determine whether the threshold was reached–without race conditions.

The test creates a single atomic reference x to the number one, then spawns a pair of worker threads w1 and w2 which bump x until it reaches the threshold m. We resolve both futures, which blocks until each has determined that the atom’s value is now equal to m.
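For JavaScript readers, an editor’s aside: Node in 2012 had no such primitive, but modern JavaScript eventually grew a real compare-and-set in Atomics. This sketch (my addition, not part of the original comment) shows the same retry loop that swap! performs on an atom:

```javascript
// Editor's sketch: compare-and-set in modern JavaScript, long after this
// post was written. This is the retry loop swap! runs on a Clojure atom.
const counter = new Int32Array(new SharedArrayBuffer(4));

function bump(arr) {
  for (;;) {
    const old = Atomics.load(arr, 0);
    // Install old+1 only if nobody beat us to it; otherwise retry.
    if (Atomics.compareExchange(arr, 0, old, old + 1) === old) {
      return old + 1;
    }
  }
}

console.log(bump(counter)); // 1
console.log(bump(counter)); // 2
```

Like swap!, compareExchange returns the value it saw, which is what lets you detect crossing a threshold without races.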

Hope that helps. :)

Ryan, on

Soooo…Node.js is cancer. I feel bad for the naive idiots who followed the hype.


Copyright © 2014 Kyle Kingsbury.
Non-commercial re-use with attribution encouraged; all other rights reserved.
Comments are the property of respective posters.