CherryPy 60x as slow in benchmark with 8 requesting threads compared to 7

Question

I'm curious why, when benchmarking the Python web server CherryPy using ab with -c 7 (7 concurrent threads), it can serve about 1500 requests/s (about what I expect), but when I change to -c 8 it drops way down to about 25 requests/s. I'm running CherryPy with numthreads=10 (but it doesn't make a difference if I use numthreads=8 or 20) on a 64-bit Windows machine with four cores running Python 2.6.

I'm half-suspecting the Python GIL is part of the issue, but I don't know why it only happens when I get up to 8 concurrently-requesting threads. On a four core machine I'd expect it might change at -c 4, but this is not the case.

I'm using the one-file CherryPy web server that comes with web.py, and here's the WSGI app that I'm testing against:

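from web.wsgiserver import CherryPyWSGIServer

# Minimal WSGI app: every request gets the same plain-text body.
def application(environ, start_response):
    start_response("200 OK", [("Content-type", "text/plain")])
    return ["Hello World!",]

# Listen on all interfaces, port 80, with a pool of 10 worker threads.
server = CherryPyWSGIServer(('0.0.0.0', 80), application, numthreads=10)
try:
    server.start()  # blocks until interrupted
except KeyboardInterrupt:
    server.stop()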

The ab output for 7 and 8 concurrent threads is:

C:\> ab -n 1000 -c 7 http://localhost/
...
Concurrency Level:      7
Time taken for tests:   0.670 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      130000 bytes
HTML transferred:       12000 bytes
Requests per second:    1492.39 [#/sec] (mean)
Time per request:       4.690 [ms] (mean)
Time per request:       0.670 [ms] (mean, across all concurrent requests)
Transfer rate:          189.46 [Kbytes/sec] received

C:\> ab -n 1000 -c 8 http://localhost/
...
Concurrency Level:      8
Time taken for tests:   7.169 seconds
Complete requests:      158
Failed requests:        0
Write errors:           0
Total transferred:      20540 bytes
HTML transferred:       1896 bytes
Requests per second:    22.04 [#/sec] (mean)
Time per request:       362.973 [ms] (mean)
Time per request:       45.372 [ms] (mean, across all concurrent requests)
Transfer rate:          2.80 [Kbytes/sec] received
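
As a sanity check on the two reports, the summary figures can be recomputed from the raw counts. This is a small sketch in Python 3 syntax (the question itself uses Python 2.6); `ab_metrics` is a hypothetical helper name, and the formulas are the ones the numbers above are consistent with, not something taken from the ab source.

```python
def ab_metrics(complete, seconds, concurrency):
    """Recompute ab-style summary figures from raw counts.

    Returns (requests per second,
             mean time per request in ms,
             mean time per request across all concurrent requests in ms).
    """
    rps = complete / seconds
    per_request_ms = concurrency * seconds / complete * 1000
    across_all_ms = seconds / complete * 1000
    return rps, per_request_ms, across_all_ms


if __name__ == "__main__":
    # Figures copied from the two ab runs in the question.
    for label, args in [("-c 7", (1000, 0.670, 7)), ("-c 8", (158, 7.169, 8))]:
        rps, per_req, across = ab_metrics(*args)
        print("%s: %.2f req/s, %.3f ms, %.3f ms" % (label, rps, per_req, across))
```

The results match the reports to within rounding of the displayed time. Note that the -c 8 report shows only 158 completed requests in 7.169 seconds, so its mean is computed over a much smaller sample than the -c 7 run.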

Tested on a Linux box, there seems to be a smaller, random performance degradation from -c 7 to -c 8 (from about 1800 requests per second to 600~1200 per second). The issue seems to happen in the last requests, and disappears when -n is raised to 10000. Did you try that? — Romuald Brunet, Feb 14 '11 at 15:07
Thanks, Romuald. Yes, I get exactly the same average with `-n 10000`, and there doesn't seem to be any difference in the last requests for me. How many cores does your machine have? — Ben Hoyt, Feb 14 '11 at 15:16

score 3 · Answer 1 · answered Feb 14 '11 at 20:59

On my Linux box, it's due to the retransmission of a TCP packet from ab, although I'm not exactly sure why:

No.     Time        Source                Destination           Protocol Info                                                            Delta
  10682 21.218156   127.0.0.1             127.0.0.1             TCP      http-alt > 57246 [SYN, ACK] Seq=0 Ack=0 Win=32768 Len=0 MSS=16396 TSV=17307504 TSER=17306704 WS=6 21.218156
  10683 21.218205   127.0.0.1             127.0.0.1             TCP      57246 > http-alt [ACK] Seq=82 Ack=1 Win=513 Len=0 TSV=17307504 TSER=17307504 SLE=0 SRE=1 0.000049
  10701 29.306438   127.0.0.1             127.0.0.1             HTTP     [TCP Retransmission] GET / HTTP/1.0                             8.088233
  10703 29.306536   127.0.0.1             127.0.0.1             TCP      http-alt > 57246 [ACK] Seq=1 Ack=82 Win=512 Len=0 TSV=17309526 TSER=17309526 0.000098
  10704 29.308555   127.0.0.1             127.0.0.1             TCP      [TCP segment of a reassembled PDU]                              0.002019
  10705 29.308628   127.0.0.1             127.0.0.1             TCP      57246 > http-alt [ACK] Seq=82 Ack=107 Win=513 Len=0 TSV=17309526 TSER=17309526 0.000073
  10707 29.309718   127.0.0.1             127.0.0.1             TCP      [TCP segment of a reassembled PDU]                              0.001090
  10708 29.309754   127.0.0.1             127.0.0.1             TCP      57246 > http-alt [ACK] Seq=82 Ack=119 Win=513 Len=0 TSV=17309526 TSER=17309526 0.000036
  10710 29.309992   127.0.0.1             127.0.0.1             HTTP     HTTP/1.1 200 OK  (text/plain)                                   0.000238
  10711 29.310572   127.0.0.1             127.0.0.1             TCP      57246 > http-alt [FIN, ACK] Seq=82 Ack=120 Win=513 Len=0 TSV=17309527 TSER=17309526 0.000580
  10712 29.310661   127.0.0.1             127.0.0.1             TCP      http-alt > 57246 [ACK] Seq=120 Ack=83 Win=512 Len=0 TSV=17309527 TSER=17309527 0.000089

The original "GET" packet wasn't picked up by Wireshark either. For some reason, ab tries to send a request and fails, even though the TCP connection was double-ACK'd just fine. Then the client's TCP stack waits for a few seconds for a packet that was never sent to be ACK'd, and when it sees no ACK, retries and succeeds.
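
If a handful of requests stall for seconds like this, the mean hides it; timing each request separately makes the outliers visible. The following is only a sketch, written in Python 3 rather than the question's Python 2.6, and the names `timed_get`, `run_load`, `summarize` and the 1-second stall threshold are all assumptions for illustration. It opens a fresh connection per request (no keep-alive), but `http.client` speaks HTTP/1.1 where ab sends HTTP/1.0, so it is not an exact stand-in for ab.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPConnection


def timed_get(host, port, path="/", timeout=30.0):
    """Issue one GET on a fresh connection and return elapsed seconds."""
    start = time.perf_counter()
    conn = HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        conn.getresponse().read()
    finally:
        conn.close()
    return time.perf_counter() - start


def run_load(host, port, total=1000, concurrency=8):
    """Run `total` requests with up to `concurrency` in flight; return latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed_get(host, port), range(total)))


def summarize(latencies, stall_threshold=1.0):
    """Report count, median, max, and how many requests took >= stall_threshold s."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "median_ms": ordered[len(ordered) // 2] * 1000,
        "max_ms": ordered[-1] * 1000,
        "stalled": sum(1 for t in ordered if t >= stall_threshold),
    }


if __name__ == "__main__":
    # Host and port match the server in the question; adjust as needed.
    print(summarize(run_load("localhost", 80, total=1000, concurrency=8)))
```

A low median combined with a multi-second maximum and a nonzero `stalled` count would point to a few stuck connections rather than a uniformly slow server.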

Personally, I wouldn't worry about it. If there's a problem, it's not one with CherryPy. It could be related to the internals of ab, the use of HTTP/1.0 instead of 1.1, the lack of keepalive, the use of localhost instead of a real socket (which simulates some realities of network traffic and ignores others), the use of Windows (wink), other traffic on the same interface, load on the CPU...the list goes on and on.
