5

I'm stress testing nginx with Node.js backends and found a delay when keepalive is enabled. I removed nginx from the test and the problem remains.

I'm using:

  • ApacheBench, Version 2.3
  • Node v0.8.14.
  • Ubuntu 12.04.1 LTS
  • Express 3.0.3

The source code is:

var express = require('express');
var cluster = require('cluster');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
    // Fork one worker per CPU.
    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }

    cluster.on('exit', function (worker, code, signal) {
        console.log('worker ' + worker.process.pid + ' died');
    });
} else {
    // 1 MiB buffer of 'a'; each response is a slice of it.
    var buffer = new Buffer(1048576);
    buffer.fill("a");

    var app = express();
    app.get('/test', function (req, res) {
        // Respond with ?size= bytes after ?delay= milliseconds.
        setTimeout(function () {
            res.set('Content-Type', 'text/html');
            res.send(buffer.slice(0, req.query.size));
        }, req.query.delay);
    });
    app.listen(8080);
}

tcpdump excerpt without keepalive (`ab -c 1 -n 10 -r "172.16.76.253:8080/test?size=1024&delay=100"`):

10:58:59.403876 IP 172.16.180.47.57380 > 172.16.76.253.http-alt: Flags [P.], seq 1:122, ack 1, win 15, options [nop,nop,TS val 65479762 ecr 362284218], length 121
10:58:59.403961 IP 172.16.76.253.http-alt > 172.16.180.47.57380: Flags [.], ack 122, win 29, options [nop,nop,TS val 362284218 ecr 65479762], length 0
10:58:59.504631 IP 172.16.76.253.http-alt > 172.16.180.47.57380: Flags [P.], seq 1:146, ack 122, win 29, options [nop,nop,TS val 362284243 ecr 65479762], length 145
10:58:59.504890 IP 172.16.76.253.http-alt > 172.16.180.47.57380: Flags [FP.], seq 146:1170, ack 122, win 29, options [nop,nop,TS val 362284243 ecr 65479762], length 1024
10:58:59.505727 IP 172.16.180.47.57380 > 172.16.76.253.http-alt: Flags [.], ack 146, win 17, options [nop,nop,TS val 65479787 ecr 362284243], length 0
10:58:59.505741 IP 172.16.180.47.57380 > 172.16.76.253.http-alt: Flags [.], ack 1171, win 21, options [nop,nop,TS val 65479787 ecr 362284243], length 0

The ab results:

Server Hostname:        172.16.76.253
Server Port:            8080

Document Path:          /test?size=1024&delay=100
Document Length:        1024 bytes

Concurrency Level:      1
Time taken for tests:   1.025 seconds
Complete requests:      10
Failed requests:        0
Write errors:           0
Total transferred:      11690 bytes
HTML transferred:       10240 bytes
Requests per second:    9.75 [#/sec] (mean)
Time per request:       102.530 [ms] (mean)
Time per request:       102.530 [ms] (mean, across all concurrent requests)
Transfer rate:          11.13 [Kbytes/sec] received

Connection Times (ms)
          min  mean[+/-sd] median   max
Connect:        1    1   0.1      1       1
Processing:   101  102   0.5    102     102
Waiting:      101  102   0.5    102     102
Total:        102  102   0.6    103     103
WARNING: The median and mean for the total time are not within a normal deviation
    These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%    103
  66%    103
  75%    103
  80%    103
  90%    103
  95%    103
  98%    103
  99%    103
 100%    103 (longest request)

tcpdump excerpt with keepalive (`ab -c 1 -n 10 -k -r "172.16.76.253:8080/test?size=1024&delay=100"`):

11:00:12.567741 IP 172.16.180.47.57385 > 172.16.76.253.http-alt: Flags [P.], seq 1306:1451, ack 10567, win 26, options [nop,nop,TS val 65498053 ecr 362302509], length 145
11:00:12.567761 IP 172.16.76.253.http-alt > 172.16.180.47.57385: Flags [.], ack 1451, win 50, options [nop,nop,TS val 362302509 ecr 65498053], length 0
11:00:12.668837 IP 172.16.76.253.http-alt > 172.16.180.47.57385: Flags [P.], seq 10567:10717, ack 1451, win 50, options [nop,nop,TS val 362302534 ecr 65498053], length 150
11:00:12.706745 IP 172.16.180.47.57385 > 172.16.76.253.http-alt: Flags [.], ack 10717, win 26, options [nop,nop,TS val 65498088 ecr 362302534], length 0
11:00:12.706765 IP 172.16.76.253.http-alt > 172.16.180.47.57385: Flags [P.], seq 10717:11741, ack 1451, win 50, options [nop,nop,TS val 362302544 ecr 65498088], length 1024
11:00:12.707901 IP 172.16.180.47.57385 > 172.16.76.253.http-alt: Flags [F.], seq 1451, ack 11741, win 26, options [nop,nop,TS val 65498088 ecr 362302544], length 0
11:00:12.708141 IP 172.16.76.253.http-alt > 172.16.180.47.57385: Flags [F.], seq 11741, ack 1452, win 50, options [nop,nop,TS val 362302544 ecr 65498088], length 0

The ab results:

Server Hostname:        172.16.76.253
Server Port:            8080

Document Path:          /test?size=1024&delay=100
Document Length:        1024 bytes

Concurrency Level:      1
Time taken for tests:   1.361 seconds
Complete requests:      10
Failed requests:        0
Write errors:           0
Keep-Alive requests:    10
Total transferred:      11740 bytes
HTML transferred:       10240 bytes
Requests per second:    7.35 [#/sec] (mean)
Time per request:       136.073 [ms] (mean)
Time per request:       136.073 [ms] (mean, across all concurrent requests)
Transfer rate:          8.43 [Kbytes/sec] received

Connection Times (ms)
          min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       1
Processing:   103  136  11.6    140     140
Waiting:      100  101   0.7    101     102
Total:        104  136  11.3    140     140

Percentage of the requests served within a certain time (ms)
  50%    140
  66%    140
  75%    140
  80%    140
  90%    140
  95%    140
  98%    140
  99%    140
 100%    140 (longest request)

Node.js sends the response in two packets, one with the headers and one with the data, and waits for an ACK after the first packet. The extra ~40 ms is consistent with Linux's minimum delayed-ACK timer (40 ms).

I tried setting `sysctl net.ipv4.tcp_slow_start_after_idle=0 && sysctl net.ipv4.route.flush=1`, but it had no effect.

When using keepalive there is an extra 40 ms delay. The question is: where does the extra 40 ms come from? Or am I doing something wrong?

mrege
  • With these tiny sample sizes your statistics are basically meaningless. You need hundreds if not thousands of data points. It's not clear what your actual question is. – user207421 Dec 14 '12 at 20:32
  • Why are you hardcoding the number of CPUs? Probably unrelated to the issue, but you should potentially stick to the real-world scenario: var numCPUs = require('os').cpus().length; – Kevin Reilly Dec 14 '12 at 21:45
  • @EJP The tests with 8000 requests per second give the same results. When using keepalive there is an extra 40 ms delay. The question is: where does the extra 40 ms come from? – mrege Dec 14 '12 at 21:52
  • @KevinReilly It's hardcoded because it is just a test, and because I want to leave one CPU free for other processes. – mrege Dec 14 '12 at 22:02
  • Most likely, the extra 40 ms is the time it takes to tear down the connection. With keepalive, the client winds up initiating the connection teardown when it receives the response. Without it, the server initiates teardown when it sends the response. The response is sent before it's received, so there's additional time in the keepalive case. Keep-alive hurts you a bit if you only send one request per connection. – David Schwartz Dec 14 '12 at 22:47
  • It seems to be a bug in Node.js: [link](https://github.com/joyent/node/issues/4415#issuecomment-11415837): "Okay, I know what's going on. node.js tries to pack the response headers and the (first chunk of) the response body into a single packet - but only if the body is a string. When it's a buffer, it falls back to the slow, two packet path". If this solves the problem, I'll share the answer. Thanks. – mrege Dec 17 '12 at 15:38
  • I changed the line `res.send(buffer.slice(0, req.query.size))` to `res.send(buffer.slice(0, req.query.size).toString())` and the delay disappeared. – mrege Dec 17 '12 at 15:45

1 Answer

5

It was a bug in Node.js, and it has been fixed upstream.

The issue was joyent/node#4415.

I changed `res.send(buffer.slice(0, req.query.size))` to `res.send(buffer.slice(0, req.query.size).toString())` as a workaround for this test.

mrege