
[Edit: I added some profiling logs and process exit traces to the code, and pasted the results at the end of the post]

I can't get the Node.js cluster module to properly make use of the server's available CPUs while using port sharing. The problem is observed on two different servers (a personal computer and a remote VPS), as indicated below, and it has been reproduced by a commenter (see the comments from jfriend00).

Node Version: v10.13.0; Windows 7 64 bits; 8 cores

Node Version: v8.1.3; CentOS 6.9; 4 cores

The bench is as follows: a worker is created for each available CPU; each worker listens on port 8000 (using Express) and replies after a simulated processing time of 2 seconds (the processing is entirely synchronous for the purpose of the bench, with no async operation).

cluster_bench.js

var cluster = require('cluster');
var express = require('express');

if(cluster.isMaster) {
  var numWorkers = require('os').cpus().length; 
  console.log('[' + new Date().toISOString() + '] Master cluster setting up ' + numWorkers + ' workers...');

  for(var i = 0; i < numWorkers; i++) {
    cluster.fork();
  }

  cluster.on('online', function(worker) {
    console.log('[' + new Date().toISOString() + '] Worker ' + (''+worker.process.pid).padStart(6, ' ') + ' is online');
  });

  cluster.on('exit', function(worker, code, signal) {
    console.log('[' + new Date().toISOString() + '] Worker ' + (''+worker.process.pid).padStart(6, ' ') + ' died with code: ' + code + ', and signal: ' + signal);
    console.log('[' + new Date().toISOString() + '] Starting a new worker');
    cluster.fork();
  });
} else {
  var app = express();
  const pid = (''+process.pid).padStart(6, ' ');
  app.all('/*', function(req, res) {
    const requestNum = req.query.num;
    console.log('[' + new Date().toISOString() + '] Process ' + pid + ' handles query #' + requestNum);
    // simulates a 2s-long synchronous process
    const now = new Date().getTime();
    let waittime = 2000; // 2 seconds
    while (new Date().getTime() < now + waittime) { /* simulated synced processing */ };
    console.log('[' + new Date().toISOString() + '] Process ' + pid + ' sends answer to query #' + requestNum);
    res.send('Process ' + pid + ' says hello!');
  });
  var port = 8000;
  var server = app.listen(port, function() {
    console.log('[' + new Date().toISOString() + '] Process ' + pid + ' is listening on port ' + port);
  });
}

Now I bench this script with 32 concurrent requests:

cluster_client.js

const http = require('http');
// Send 32 concurrent calls
for(var i=0; i<32; i++) {
  let id = (''+i).padStart(2, ' ');
  console.log('[' + new Date().toISOString() + '] Request ' + id + ' launched.');
  http.get('http://localhost:8000?num=' + id, (resp) => {
    let data = '';
    resp.on('data', chunk => data += chunk );
    resp.on('end', () => console.log('[' + new Date().toISOString() + '] Request ' + id + ' - answer: ' + data) );
  });
}
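
Following jfriend00's suggestion in the comments, one way to make sure nothing on the client side is queuing the requests is to pass an explicit http.Agent with no socket cap and no keep-alive. This is only a sketch of that variant (the file name and the agent options are my addition, not part of the original bench):

cluster_client_agent.js

const http = require('http');

// Explicit agent: no keep-alive and no socket cap, so the client cannot queue requests internally.
const agent = new http.Agent({ keepAlive: false, maxSockets: Infinity });

// Send 32 concurrent calls, each on its own connection.
for (let i = 0; i < 32; i++) {
  const id = ('' + i).padStart(2, ' ');
  console.log('[' + new Date().toISOString() + '] Request ' + id + ' launched.');
  http.get({ host: 'localhost', port: 8000, path: '/?num=' + encodeURIComponent(id), agent: agent }, (resp) => {
    let data = '';
    resp.on('data', chunk => data += chunk);
    resp.on('end', () => console.log('[' + new Date().toISOString() + '] Request ' + id + ' - answer: ' + data));
  }).on('error', err => console.log('[' + new Date().toISOString() + '] Request ' + id + ' - error: ' + err.message));
}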

I would expect my calls to be processed in batches of n, where n is the number of CPUs.

Yet, I observe this behavior:

  • the first batch actually makes use of all n CPUs
  • later batches make use of fewer and fewer CPUs
  • the last batches make use of only one CPU, so the final requests are answered one after the other, taking a very long time overall

For example, on an 8-core computer: the first 8 requests are processed at once, then another 8 at once, then 4 at once, 4 at once, 2 at once, 2 at once, 2 at once, 1 at once, 1 at once.
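
With 8 workers and a 2-second synchronous handler, the ideal schedule would therefore be 4 batches of 8 requests, i.e. roughly 8 seconds in total, whereas the run traced below takes about 18 seconds. A tiny sketch of that arithmetic (illustration only, not part of the bench):

// Ideal round-robin schedule: each batch of <workers> requests completes after one service time.
const workers = 8;          // one worker per core
const requests = 32;        // concurrent requests sent by cluster_client.js
const serviceTimeMs = 2000; // synchronous busy-wait per request

const batches = Math.ceil(requests / workers);   // 4
const expectedTotalMs = batches * serviceTimeMs; // 8000 ms
console.log(batches + ' batches of ' + workers + ' => ~' + expectedTotalMs + ' ms expected');
// Observed in the traces below: first query handled at 20:57:49.570,
// last answer sent at 20:58:07.904, i.e. roughly 18.3 seconds.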

On Windows, if I launch my client script a second time without restarting the server, only 2 of the existing workers handle the 32 requests. On CentOS, the behavior is the same as on the first launch (all CPUs are used at first, then fewer and fewer).

No 'exit' event is produced by any worker. I also placed hooks with logs on the child process events (all the events described at https://nodejs.org/api/process.html#process_process_events) and observed that no event of any kind was emitted for any of the child processes. Also, jfriend00 checked with Wireshark that the client actually sends all the requests (see the comments); they are all sent correctly.
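
For reference, the hooks I added in the worker branch look roughly like this (reconstructed from memory; the event names are taken from the process documentation linked above):

// In the worker branch, log every process-level event of interest.
// Note: registering an 'uncaughtException' listener also disables the default
// crash-on-exception behavior, which is acceptable here since this is diagnostics only.
['exit', 'beforeExit', 'disconnect', 'uncaughtException', 'unhandledRejection', 'warning']
  .forEach(function(evt) {
    process.on(evt, function() {
      console.log('[' + new Date().toISOString() + '] Process ' + pid + ' emitted "' + evt + '"');
    });
  });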

How can this behavior be explained?


Here are the traces I get from the server:

[2019-08-30T20:57:42.078Z] Master cluster setting up 8 workers...
[2019-08-30T20:57:42.781Z] Worker   1732 is online
[2019-08-30T20:57:42.798Z] Worker    208 is online
[2019-08-30T20:57:42.866Z] Worker  10868 is online
[2019-08-30T20:57:43.032Z] Worker  10232 is online
[2019-08-30T20:57:43.044Z] Worker  12356 is online
[2019-08-30T20:57:43.113Z] Worker  12500 is online
[2019-08-30T20:57:43.208Z] Worker   1196 is online
[2019-08-30T20:57:43.315Z] Worker  10948 is online
[2019-08-30T20:57:43.738Z] Process   1732 is listening on port 8000
[2019-08-30T20:57:43.779Z] Process    208 is listening on port 8000
[2019-08-30T20:57:43.891Z] Process  10868 is listening on port 8000
[2019-08-30T20:57:44.018Z] Process  10232 is listening on port 8000
[2019-08-30T20:57:44.018Z] Process  12356 is listening on port 8000
[2019-08-30T20:57:44.039Z] Process  12500 is listening on port 8000
[2019-08-30T20:57:44.097Z] Process   1196 is listening on port 8000
[2019-08-30T20:57:44.159Z] Process  10948 is listening on port 8000
[2019-08-30T20:57:49.570Z] Process  10232 handles query # 5
[2019-08-30T20:57:49.571Z] Process  10948 handles query # 0
[2019-08-30T20:57:49.596Z] Process   1732 handles query #10
[2019-08-30T20:57:49.607Z] Process  10868 handles query # 7
[2019-08-30T20:57:49.610Z] Process   1196 handles query # 2
[2019-08-30T20:57:49.611Z] Process    208 handles query # 8
[2019-08-30T20:57:49.613Z] Process  12356 handles query # 6
[2019-08-30T20:57:49.628Z] Process  12500 handles query # 4
[2019-08-30T20:57:51.586Z] Process  10232 sends answer to query # 5
[2019-08-30T20:57:51.613Z] Process  10948 sends answer to query # 0
[2019-08-30T20:57:51.620Z] Process   1732 sends answer to query #10
[2019-08-30T20:57:51.655Z] Process  10868 sends answer to query # 7
[2019-08-30T20:57:51.656Z] Process   1196 sends answer to query # 2
[2019-08-30T20:57:51.666Z] Process  10232 handles query #14
[2019-08-30T20:57:51.684Z] Process    208 sends answer to query # 8
[2019-08-30T20:57:51.693Z] Process   1732 handles query #31
[2019-08-30T20:57:51.695Z] Process  10948 handles query # 1
[2019-08-30T20:57:51.734Z] Process  12356 sends answer to query # 6
[2019-08-30T20:57:51.755Z] Process  12500 sends answer to query # 4
[2019-08-30T20:57:51.770Z] Process   1196 handles query # 3
[2019-08-30T20:57:51.774Z] Process  10868 handles query #12
[2019-08-30T20:57:51.815Z] Process    208 handles query #17
[2019-08-30T20:57:51.841Z] Process  12356 handles query # 9
[2019-08-30T20:57:51.885Z] Process  12500 handles query #13
[2019-08-30T20:57:53.671Z] Process  10232 sends answer to query #14
[2019-08-30T20:57:53.697Z] Process   1732 sends answer to query #31
[2019-08-30T20:57:53.699Z] Process  10948 sends answer to query # 1
[2019-08-30T20:57:53.709Z] Process  10948 handles query #11
[2019-08-30T20:57:53.771Z] Process   1196 sends answer to query # 3
[2019-08-30T20:57:53.781Z] Process  10868 sends answer to query #12
[2019-08-30T20:57:53.780Z] Process   1196 handles query #15
[2019-08-30T20:57:53.816Z] Process    208 sends answer to query #17
[2019-08-30T20:57:53.848Z] Process  12356 sends answer to query # 9
[2019-08-30T20:57:53.856Z] Process  12356 handles query #16
[2019-08-30T20:57:53.886Z] Process  12500 sends answer to query #13
[2019-08-30T20:57:53.895Z] Process  12500 handles query #19
[2019-08-30T20:57:55.712Z] Process  10948 sends answer to query #11
[2019-08-30T20:57:55.718Z] Process  10948 handles query #18
[2019-08-30T20:57:55.782Z] Process   1196 sends answer to query #15
[2019-08-30T20:57:55.788Z] Process   1196 handles query #21
[2019-08-30T20:57:55.856Z] Process  12356 sends answer to query #16
[2019-08-30T20:57:55.862Z] Process  12356 handles query #20
[2019-08-30T20:57:55.896Z] Process  12500 sends answer to query #19
[2019-08-30T20:57:55.901Z] Process  12500 handles query #23
[2019-08-30T20:57:57.719Z] Process  10948 sends answer to query #18
[2019-08-30T20:57:57.789Z] Process   1196 sends answer to query #21
[2019-08-30T20:57:57.863Z] Process  12356 sends answer to query #20
[2019-08-30T20:57:57.889Z] Process  12356 handles query #22
[2019-08-30T20:57:57.901Z] Process  12500 sends answer to query #23
[2019-08-30T20:57:57.919Z] Process  12500 handles query #25
[2019-08-30T20:57:59.889Z] Process  12356 sends answer to query #22
[2019-08-30T20:57:59.891Z] Process  12356 handles query #24
[2019-08-30T20:57:59.920Z] Process  12500 sends answer to query #25
[2019-08-30T20:57:59.922Z] Process  12500 handles query #27
[2019-08-30T20:58:01.892Z] Process  12356 sends answer to query #24
[2019-08-30T20:58:01.894Z] Process  12356 handles query #26
[2019-08-30T20:58:01.923Z] Process  12500 sends answer to query #27
[2019-08-30T20:58:01.926Z] Process  12500 handles query #29
[2019-08-30T20:58:03.895Z] Process  12356 sends answer to query #26
[2019-08-30T20:58:03.899Z] Process  12356 handles query #28
[2019-08-30T20:58:03.926Z] Process  12500 sends answer to query #29
[2019-08-30T20:58:05.901Z] Process  12356 sends answer to query #28
[2019-08-30T20:58:05.903Z] Process  12356 handles query #30
[2019-08-30T20:58:07.904Z] Process  12356 sends answer to query #30

and from the client side:

[2019-08-30T20:57:49.293Z] Request  0 launched.
[2019-08-30T20:57:49.376Z] Request  1 launched.
[2019-08-30T20:57:49.380Z] Request  2 launched.
[2019-08-30T20:57:49.381Z] Request  3 launched.
[2019-08-30T20:57:49.384Z] Request  4 launched.
[2019-08-30T20:57:49.387Z] Request  5 launched.
[2019-08-30T20:57:49.390Z] Request  6 launched.
[2019-08-30T20:57:49.393Z] Request  7 launched.
[2019-08-30T20:57:49.396Z] Request  8 launched.
[2019-08-30T20:57:49.404Z] Request  9 launched.
[2019-08-30T20:57:49.407Z] Request 10 launched.
[2019-08-30T20:57:49.410Z] Request 11 launched.
[2019-08-30T20:57:49.413Z] Request 12 launched.
[2019-08-30T20:57:49.421Z] Request 13 launched.
[2019-08-30T20:57:49.425Z] Request 14 launched.
[2019-08-30T20:57:49.427Z] Request 15 launched.
[2019-08-30T20:57:49.435Z] Request 16 launched.
[2019-08-30T20:57:49.438Z] Request 17 launched.
[2019-08-30T20:57:49.441Z] Request 18 launched.
[2019-08-30T20:57:49.444Z] Request 19 launched.
[2019-08-30T20:57:49.452Z] Request 20 launched.
[2019-08-30T20:57:49.455Z] Request 21 launched.
[2019-08-30T20:57:49.458Z] Request 22 launched.
[2019-08-30T20:57:49.461Z] Request 23 launched.
[2019-08-30T20:57:49.469Z] Request 24 launched.
[2019-08-30T20:57:49.472Z] Request 25 launched.
[2019-08-30T20:57:49.475Z] Request 26 launched.
[2019-08-30T20:57:49.479Z] Request 27 launched.
[2019-08-30T20:57:49.487Z] Request 28 launched.
[2019-08-30T20:57:49.489Z] Request 29 launched.
[2019-08-30T20:57:49.493Z] Request 30 launched.
[2019-08-30T20:57:49.496Z] Request 31 launched.
[2019-08-30T20:57:51.637Z] Request  5 - answer: Process  10232 says hello!
[2019-08-30T20:57:51.651Z] Request  0 - answer: Process  10948 says hello!
[2019-08-30T20:57:51.654Z] Request 10 - answer: Process   1732 says hello!
[2019-08-30T20:57:51.704Z] Request  2 - answer: Process   1196 says hello!
[2019-08-30T20:57:51.727Z] Request  7 - answer: Process  10868 says hello!
[2019-08-30T20:57:51.763Z] Request  8 - answer: Process    208 says hello!
[2019-08-30T20:57:51.786Z] Request  6 - answer: Process  12356 says hello!
[2019-08-30T20:57:51.801Z] Request  4 - answer: Process  12500 says hello!
[2019-08-30T20:57:53.677Z] Request 14 - answer: Process  10232 says hello!
[2019-08-30T20:57:53.702Z] Request 31 - answer: Process   1732 says hello!
[2019-08-30T20:57:53.705Z] Request  1 - answer: Process  10948 says hello!
[2019-08-30T20:57:53.777Z] Request  3 - answer: Process   1196 says hello!
[2019-08-30T20:57:53.786Z] Request 12 - answer: Process  10868 says hello!
[2019-08-30T20:57:53.821Z] Request 17 - answer: Process    208 says hello!
[2019-08-30T20:57:53.853Z] Request  9 - answer: Process  12356 says hello!
[2019-08-30T20:57:53.891Z] Request 13 - answer: Process  12500 says hello!
[2019-08-30T20:57:55.715Z] Request 11 - answer: Process  10948 says hello!
[2019-08-30T20:57:55.786Z] Request 15 - answer: Process   1196 says hello!
[2019-08-30T20:57:55.860Z] Request 16 - answer: Process  12356 says hello!
[2019-08-30T20:57:55.898Z] Request 19 - answer: Process  12500 says hello!
[2019-08-30T20:57:57.721Z] Request 18 - answer: Process  10948 says hello!
[2019-08-30T20:57:57.792Z] Request 21 - answer: Process   1196 says hello!
[2019-08-30T20:57:57.866Z] Request 20 - answer: Process  12356 says hello!
[2019-08-30T20:57:57.903Z] Request 23 - answer: Process  12500 says hello!
[2019-08-30T20:57:59.891Z] Request 22 - answer: Process  12356 says hello!
[2019-08-30T20:57:59.922Z] Request 25 - answer: Process  12500 says hello!
[2019-08-30T20:58:01.894Z] Request 24 - answer: Process  12356 says hello!
[2019-08-30T20:58:01.925Z] Request 27 - answer: Process  12500 says hello!
[2019-08-30T20:58:03.897Z] Request 26 - answer: Process  12356 says hello!
[2019-08-30T20:58:03.928Z] Request 29 - answer: Process  12500 says hello!
[2019-08-30T20:58:05.903Z] Request 28 - answer: Process  12356 says hello!
[2019-08-30T20:58:07.906Z] Request 30 - answer: Process  12356 says hello!

As indicated, fewer and fewer queries are handled at once as time passes. In the end, only process 12356 is still doing anything. The traces show no process exiting.
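
One thing I still intend to check is the cluster scheduling policy, since the way incoming connections are distributed to the workers depends on it (Node defaults to round-robin in the master on most platforms, but lets the OS distribute connections on Windows). The snippet below is a diagnostic sketch only, not part of the bench above:

// Diagnostic only: print the scheduling policy the master will use.
// To have any effect, schedulingPolicy must be assigned before the first cluster.fork().
var cluster = require('cluster');

console.log('Scheduling policy:',
  cluster.schedulingPolicy === cluster.SCHED_RR
    ? 'SCHED_RR (master distributes connections round-robin)'
    : 'SCHED_NONE (the OS distributes connections)');

// To force round-robin explicitly (untested here):
// cluster.schedulingPolicy = cluster.SCHED_RR;
// or: NODE_CLUSTER_SCHED_POLICY=rr node cluster_bench.js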

Comments:
  • I can't reproduce it here. But it could be a timeout issue, due to the 2 seconds of blocking. But without any proper error handling and logging, it is not possible to tell. Are the workers still running? Does every `Process ' + process.pid + ' answers...` have its corresponding output log in `cluster_client.js`? Did every `http.get` get data back, or did one get an error? Does `app.all('/*', function(req, res) {` reach the end of the `while`? ... – t.niese Aug 30 '19 at 18:59
  • What version of node.js? Probably want to check that the cluster_client is actually sending all the requests and not queuing them internal to the http library. Might want to put unique query parameter on each request so you can track each one better on the server: `http.get('http://localhost:8000?num=' + i , (resp) => {`. – jfriend00 Aug 30 '19 at 19:25
  • Thanks for your remarks. @t.niese I added handling of the processes exiting and profiling logs, so one can see the number of requests handled at a time. I'm very concerned, for I can reproduce this error on 2 extremely different environments. With the added traces, can you see the requests actually handled in constant numbers per batch? Any suggestion on how to diagnose the health of the workers? – Bob Aug 30 '19 at 21:10
  • @jfriend00 I reproduce this on Node v8.1.3 and v10.13.0. I completed the logs as you mention, and pasted them in the post. – Bob Aug 30 '19 at 21:16
  • Probably not relevant to this issue, but you don't call `.end()` after `res.send()` (it does the `.end()` for you), so change `res.send(...).end()` to just `res.send(...)`. – jfriend00 Aug 30 '19 at 21:35
  • @jfriend00 Noted, thank you. – Bob Aug 30 '19 at 21:54
  • I must add that launching cluster_client.js a second time (and further times) results in only 2 workers replying to the 32 requests (instead of 8 to begin with)! – Bob Aug 30 '19 at 21:54
  • I added to each child process a hook on each process event (https://nodejs.org/api/process.html#process_process_events), with corresponding logs. No process event is emitted. – Bob Aug 30 '19 at 22:08
  • In my testing on an 8-core CPU (no hyperthreads), I see 8 requests get processed at once, then 8 at once, then 6 at once, then 4 at once, then 2, then 2, then 2. Very weird. FYI, I found it a lot easier to track if the time was logged as ms since activity started as that made it a lot easier to see which bursts of activity are in the same time window. I do not know what is causing it. – jfriend00 Aug 30 '19 at 22:08
  • I have verified with WireShark packet snooping that all 32 requests are sent immediately (no limits on http connections to the same target are hit) so the issue is apparently on the server side only. – jfriend00 Aug 31 '19 at 00:07
  • Interesting, it still doesn't happen to me at that scale, not for `v10.13.0` or `v8.1.3` (on Arch or macOS), only at a scale I would expect from the round-robin approach for distributing the requests to the workers. I get that kind of distribution `{ '14947': 4, '14953': 4, '14954': 4, '14955': 4, '14961': 5, '14963': 3, '14974': 4, '14982': 4 }` – t.niese Aug 31 '19 at 07:28
  • @t.niese - I'm reproducing it easily on Windows 10. – jfriend00 Aug 31 '19 at 07:59
  • I can verify the problem under Win 10. It's strange that it happens for Win 10 and CentOS but not for Arch and macOS. The processes themselves are still running and their event loops are not stuck (e.g. intervals are still running), they just don't get further connections from the master. Afaik the socket sharing is handled entirely by libuv, so maybe the problem is down there. – t.niese Sep 01 '19 at 08:38
