
I've got a long-running node.js process that sometimes jumps to 100% CPU and stops responding to requests. The most recent time it did this I attached strace to the process, and here's what I saw:

Process 19523 attached - interrupt to quit
gettimeofday({1394558574, 234192}, NULL) = 0
gettimeofday({1394558574, 235381}, NULL) = 0
gettimeofday({1394558574, 306460}, NULL) = 0
mmap(0x3edab823a000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3edab823a000
munmap(0x3edab823a000, 811008)          = 0
munmap(0x3edab8400000, 237568)          = 0
mmap(0x3edab8300000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3edab8300000
gettimeofday({1394558574, 316435}, NULL) = 0
gettimeofday({1394558574, 316576}, NULL) = 0
gettimeofday({1394558574, 316677}, NULL) = 0
gettimeofday({1394558574, 316775}, NULL) = 0
gettimeofday({1394558574, 316879}, NULL) = 0
gettimeofday({1394558574, 316978}, NULL) = 0
gettimeofday({1394558574, 317439}, NULL) = 0
gettimeofday({1394558574, 317537}, NULL) = 0
gettimeofday({1394558574, 318139}, NULL) = 0
gettimeofday({1394558574, 318234}, NULL) = 0
gettimeofday({1394558574, 318343}, NULL) = 0
gettimeofday({1394558574, 318437}, NULL) = 0
gettimeofday({1394558574, 318530}, NULL) = 0
gettimeofday({1394558574, 318624}, NULL) = 0
gettimeofday({1394558574, 319135}, NULL) = 0
gettimeofday({1394558574, 319648}, NULL) = 0
gettimeofday({1394558574, 319769}, NULL) = 0
gettimeofday({1394558574, 319975}, NULL) = 0
futex(0x7f5b380008c8, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1394558574, 322266}, NULL) = 0
gettimeofday({1394558574, 322426}, NULL) = 0
gettimeofday({1394558574, 322520}, NULL) = 0
gettimeofday({1394558574, 322759}, NULL) = 0
gettimeofday({1394558574, 322853}, NULL) = 0
gettimeofday({1394558574, 322995}, NULL) = 0
futex(0x7f5b380008c8, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1394558574, 417614}, NULL) = 0
gettimeofday({1394558575, 386566}, NULL) = 0
gettimeofday({1394558575, 387704}, NULL) = 0
gettimeofday({1394558575, 463410}, NULL) = 0
mmap(0x24cfd260f000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x24cfd260f000
munmap(0x24cfd260f000, 987136)          = 0
munmap(0x24cfd2800000, 61440)           = 0
mmap(0x24cfd2700000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x24cfd2700000

That's a lot of gettimeofday calls, and not much else! What might cause node.js to get stuck like this?

UPDATE: I upgraded from an older version of node.js to 0.10.29 (I think) and this went away. I've just upgraded to 0.10.33 and the problem has come back. I've made a little more progress on debugging it this time around. First, strace with the -c summary flag:

$ sudo strace -p 11812 -c
Process 11812 attached - interrupt to quit
Process 11812 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000192           0      2780           gettimeofday
  0.00    0.000000           0      1390           getrusage
  0.00    0.000000           0        31           futex
  0.00    0.000000           0      1390           clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00    0.000192                  5591           total

From "Node.js: How to attach to a running process and to debug the server with a console?" I found out about attaching the node debugger to a running process, and here's where I got with that:

$ sudo kill -s SIGUSR1 11812
$ sudo node debug -p 11812
connecting... ok
break in timers.js:79
  77 }
  78
  79 function listOnTimeout() {
  80   var msecs = this.msecs;
  81   var list = this;
debug> bt
#0 timers.js:79:23

So it looks like it is definitely related to a timer, but I don't know how to move further up the stack to find out where in my code the problem is getting triggered.
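
One approach to surface the culprit (a sketch I haven't verified; the timer-trace.js name and the 100ms threshold are my own guesses): temporarily wrap the global timer functions at startup so every callback carries the stack trace of the code that scheduled it:

// timer-trace.js -- hypothetical debugging shim; require() it at the very
// top of the app's entry script. Wraps setTimeout/setInterval so that any
// callback that runs suspiciously long logs the stack captured when the
// timer was created. The 100ms threshold is an arbitrary guess.
var origSetTimeout = global.setTimeout;
var origSetInterval = global.setInterval;

function wrap(orig, name) {
    return function (fn, ms) {
        var stack = new Error(name + ' scheduled here').stack;
        var rest = Array.prototype.slice.call(arguments, 2);
        return orig.apply(global, [function () {
            var start = Date.now();
            var result = fn.apply(this, arguments);
            var elapsed = Date.now() - start;
            if (elapsed > 100) {
                console.error('slow ' + name + ' callback (' + elapsed + 'ms)\n' + stack);
            }
            return result;
        }, ms].concat(rest));
    };
}

global.setTimeout = wrap(origSetTimeout, 'setTimeout');
global.setInterval = wrap(origSetInterval, 'setInterval');

Next time the process spins, the logged stacks should point at whichever code created the runaway timer.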

Ben Dowling
  • Are you using "time condition variables"? https://www.npmjs.org/package/condvar – Armel Larcier Mar 27 '14 at 14:57
  • could you add the code that you're using to the question? – Alexej Magura Mar 27 '14 at 15:27
  • @ArmelLarcier I'm not using that package, no. – Ben Dowling Mar 27 '14 at 16:12
  • @AlexejMagura it's a large application, so I can't post it all, and I've no idea where to start tracking this back to the responsible code. – Ben Dowling Mar 27 '14 at 16:12
  • Have you tried using dtrace as well? An example is located here: http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/ – Appleman1234 Mar 28 '14 at 13:27
  • I'd recommend using a profiler to find out where the leak is coming from instead of manually searching by changing code. Take a look at: github.com/felixge/node-memory-leak-tutorial and/or github.com/node-inspector/node-inspector – Timothy Strimple Mar 28 '14 at 17:14
  • @TimothyStrimple what makes you think this is related to a memory leak? – Ben Dowling Mar 28 '14 at 20:47
  • Have you tried running all your code through JSHint or some other analysis tool? I suspect it is a never-ending loop. – Badri Narayan Manicka Mar 29 '14 at 03:16
  • @BadriManicka no, I'll try that, but it isn't an infinite loop because it eventually goes back to normal – Ben Dowling Mar 29 '14 at 06:42
  • Just thinking about this in general terms from what has been said so far: do you have any sort of a queue which is getting backed up with 'retry' sort of events which do eventually get processed? The fact that you say it eventually returns to normal just makes me think that somewhere (in your code or a lib) may be a bottleneck of things which are each contending for/checking on the availability of some sort of resource. Might be insignificant in normal operation, but if something backs up... Are there any clues around when it crops up in terms of usage volumes, etc? – barry-johnson Mar 30 '14 at 17:35
  • I saw this too, in both testing and production, same strace. The server was not exactly non-responsive, but greatly slowed, so it _looked_ non-responsive. – Andras Dec 01 '14 at 23:52

6 Answers


My guess is that someone manually implemented a blocking "setTimeout". This could happen if someone doesn't want to release control of the main JS thread in an effort to prevent a potential race condition.

For obvious reasons this is a bad practice for production code, but I've used it on occasion in debugging to force asynchronous processes to execute in a specific order.

You could look for something silly like this:

var mockAsyncFunction = function (cb) {
    setTimeout(function () {
        cb(null, 'dummy_result');
    }, 5000);
};

var myResult = null;
mockAsyncFunction(function (err, result) {
    myResult = result;
});

var timeOut = 10000; // give up after 10 sec.
var timeStart = new Date().getTime();

// Busy-wait until the timeout elapses. This pegs one core at 100% and
// blocks the event loop, so the setTimeout callback above can never fire
// while the loop is spinning.
while (1) {
    if (new Date().getTime() - timeStart > timeOut) {
        break;
    }
}
console.log('DONE');

or something more nefarious with nextTick recursion like:

var timeStart = new Date().getTime();

// Re-schedules itself via process.nextTick until 10 seconds have passed.
// The nextTick queue is drained before the event loop resumes, so this
// starves I/O while spinning the CPU.
var recurseUntilDone = function () {
    if (new Date().getTime() - timeStart < 10000) {
        process.nextTick(recurseUntilDone);
    } else {
        console.log('Done recursing');
    }
};

recurseUntilDone();
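
For contrast, the non-blocking version of the first example just lets the timer hand the result back; a minimal sketch (same hypothetical mockAsyncFunction as above):

var mockAsyncFunction = function (cb) {
    setTimeout(function () {
        cb(null, 'dummy_result');
    }, 5000);
};

// Non-blocking: return to the event loop and let the timer callback
// resume the work. The process sits idle instead of spinning the CPU.
mockAsyncFunction(function (err, result) {
    console.log('DONE', result);
});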
mjhm

Use node-inspector to pause your code when the CPU hits 100% - my bet is also on some badly implemented loop that checks state until a certain time has passed, but those can be hard to find.

Start node with the --debug flag (i.e. node --debug index.js) and run node-inspector in a separate window. Use Chrome to connect to your debug session (the URL is printed by the node-inspector command), wait until the problem occurs, then pause execution and you should be able to find it.

Christian Landgren

We saw this too, in both production and development, and we're also looking for an answer. We started investigating the node runtime, but the problem is so infrequent that it isn't getting any priority. Because it's completely CPU bound and makes no system calls, it's hard to capture with strace.

It won't be a timed loop checking state, because a new Date().getTime() loop does not make any calls to gettimeofday (in node v0.10.29 it just does a long series of nanosleeps in one pid, and just futex calls in the other. Clever, actually). Same for Date.now().
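
For anyone who wants to check this on their own build, here's the kind of minimal busy loop to run under strace (a sketch; the syscall mix will vary with kernel and node version):

// busy.js -- spin on Date.now() for ten seconds, then exit.
// Run as: strace -c node busy.js
// and check whether gettimeofday shows up in the syscall summary. On many
// Linux systems it won't, because gettimeofday is serviced by the vDSO
// without trapping into the kernel.
var start = Date.now();
while (Date.now() - start < 10000) {
    // busy-wait
}
console.log('done after', Date.now() - start, 'ms');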

Andras
  • Interesting. I've just confirmed that both setInterval and setTimeout call clock_gettime rather than gettimeofday. – Ben Dowling Dec 02 '14 at 00:21
  • There aren't that many references to it in the node.js code: https://github.com/joyent/node/search?utf8=%E2%9C%93&q=gettimeofday&type=Code – Ben Dowling Dec 02 '14 at 00:23
  • yeah, I grepped too, LinuxSemaphore::Wait looked possible but I got bogged down trying to trace up from there, too many places to check (I have the sources git cloned locally) – Andras Dec 02 '14 at 00:29
  • I am facing the same issue. @Andras, can you please share any leads if you solved it? – Sagar Feb 17 '16 at 13:47

Don't know if it's related, but here's a simple way to crush a node process: use a large array as a queue (push onto the back, shift off the front) while it still has data in it. The symptoms are a very, very slow processing rate and 100% CPU use.

//var a = [];
var a = new Array();

// fill the queue past the threshold where shift() turns slow
for (var i = 0; i < 87370; i++) a.push({});

// then repeatedly consume one item and append a replacement
for (i = 0; i < 1000000; i++) {
    a.shift();
    a.push({});
    if (i % 1000 === 0) {
        process.stdout.write(".");
        //global.gc();
    }
}

// node v0.10.29: 10k:   0.15 sec
//                80k:   0.17 sec
//                87369: 0.17 sec
//                87370: instant 41k, then .9k per second
//                87400 (Array): instant 41k, then .9k per second
//                87400 ([]): instant 28k, then .9k per second
//                87400 (Array, pushing 1 or 'x'): .9k per second from the start
// node v0.11.13: 10k:   1.94 sec
//                16683: 3.87 sec
//                16884: uniform 237.16 sec

I ran perf record -F 997 -g node-v0.11.13 --perf-basic-prof script.js, but pretty much all the time is ascribed to Builtin_ArrayShift (5-6 different hex locations, called from ~).

Note that the behavior has a toggle point, below which it's fast, above it's slow. The 0.10 and 0.11 behavior is different; 0.11 seems to have two toggle points (3 different speeds of dots). The toggle points don't seem to stay the same between groups of runs.

Does not matter if global.gc() is invoked or not (with the --expose_gc switch). Does not matter if the push/shift is in a continuable or in a blocking loop like above.

Edit: what's weird is that this is both data and loop count dependent. Shorter loops run faster (even though they're chunked into 1k batches).

Also, when this condition exists, shift/push to other (empty) lists is also slowed to the same .9k ops/sec (node v0.10.29), but curiously they can be overlapped: 10 concurrent shift/append loops will each insert .9k items/sec, ie a total of 9k/sec. (So there are resources for 10x more throughput -- an internal throttle then?)

Edit: not an internal throttle; the slow thread is using a solid chunk of 0.8 - 1.2 ms of cpu, and all other threads are blocked until it finishes. Since concurrent threads must yield periodically (I used setImmediate), they keep getting blocked behind the cpu hog and make correspondingly slow progress.
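
If Array#shift is the culprit in your own code, one common workaround (a sketch, not something I've verified against this particular bug) is an index-based queue that never shifts: keep a read pointer and compact occasionally, so dequeue stays O(1) amortized:

// Index-based queue: never calls Array#shift. A read pointer advances
// through the backing array, which is compacted once the dead prefix
// dominates, so memory is eventually released.
function Queue() {
    this.items = [];
    this.head = 0;
}
Queue.prototype.push = function (item) {
    this.items.push(item);
};
Queue.prototype.shift = function () {
    if (this.head >= this.items.length) return undefined;
    var item = this.items[this.head++];
    if (this.head > 1000 && this.head * 2 > this.items.length) {
        this.items = this.items.slice(this.head);
        this.head = 0;
    }
    return item;
};

Swapping this in for the raw array in the loop above should keep throughput flat regardless of queue length.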

Andras

Explicit delete from larger hashes or arrays, e.g. cleanup of an internal data container, could also cause symptoms like this. The underlying node operation can be abysmally slow:

var h = {};
for (var i = 0; i < 200000; i++) h[i] = i;   // 25,000,000 / sec
for (i = 0; i < 1000; i++) delete h[i];      // 11,000 / sec
for (i = 0; i < 200000; i++) delete h[i];    // 7,700 / sec

var a = new Array();
for (i = 0; i < 200000; i++) a[i] = i;       // 50,000,000 / sec
for (i = 0; i < 1000; i++) delete a[i];      // 10,000 / sec
for (i = 0; i < 200000; i++) delete a[i];    // 8,000 / sec
// and out of curiosity...
for (i = 0; i < 200000; i++) a[i];           // 250,000,000 / sec
for (i = 1; i < 200000; i++) a[i-1] = a[i];  // 180,000,000 / sec

Deleting from back to front is 2x faster, but that's still very slow. (Note: assigning into a new Array() is 50m/s, but into [] it's only 25m/s.)

The slowdown is O(n^2): doubling the array size to 400k quadruples the runtime. Each delete operation is O(n) in the size of the hash/array.

The strace of a 400k run (100 seconds) shows no system calls to speak of, and does not match the gettimeofday() pattern seen above.

...
14:08:36.327886 munmap(0x44200000, 802816) = 0
14:08:36.328145 munmap(0xf7100000, 1183744) = 0
14:08:36.328483 clock_gettime(CLOCK_REALTIME, {1418584116, 328495018}) = 0
14:08:36.328540 clock_gettime(CLOCK_REALTIME, {1418584116, 328551680}) = 0
14:08:36.328581 clock_gettime(CLOCK_REALTIME, {1418584116, 328592628}) = 0
14:08:36.328633 futex(0x8e79a7c, FUTEX_WAKE_PRIVATE, 1) = 1
14:09:03.130336 clock_gettime(CLOCK_REALTIME, {1418584143, 130406483}) = 0
14:09:03.130479 clock_gettime(CLOCK_REALTIME, {1418584143, 130503357}) = 0
14:09:03.130560 futex(0x8e79a7c, FUTEX_WAKE_PRIVATE, 1) = 1
14:09:37.090127 clock_gettime(CLOCK_REALTIME, {1418584177, 90195108}) = 0
14:09:37.090271 clock_gettime(CLOCK_REALTIME, {1418584177, 90296760}) = 0
14:09:37.090355 futex(0x8e79a7c, FUTEX_WAKE_PRIVATE, 1) = 1
14:10:17.588552 mmap2(0x3f302000, 4231168, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3f302000
14:10:17.588694 munmap(0x3f302000, 1040384) = 0
14:10:17.588768 munmap(0x3f709000, 8192) = 0
...
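
If explicit delete turns out to be the bottleneck in your own code, a common workaround (a sketch, not verified against this exact case) is to avoid delete on hot containers: overwrite the slots with undefined, or rebuild the container and let the GC reclaim the old one:

// Cleaning up a hash h without using delete.
var h = {};
for (var i = 0; i < 200000; i++) h[i] = i;

// 1. Overwrite instead of delete: the slot stays allocated, but the
//    object avoids the slow path that delete can trigger in V8.
for (i = 0; i < 200000; i++) h[i] = undefined;

// 2. Or rebuild: copy the survivors into a fresh object and drop the
//    old one; here everything is undefined, so nothing survives.
var survivors = {};
for (var key in h) {
    if (h[key] !== undefined) survivors[key] = h[key];
}
h = survivors;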
Andras

If anyone is still hitting this issue, I ran into it as well (so badly that my "repeat" node started dropping instructions every second).

IN MY CASE this was caused by the "serial" node - you can check it by launching htop to monitor CPU usage, deleting the "serial input" node, and then redeploying all nodes.

CPU usage should immediately go back down.