I am a data scientist / machine learning developer. Sometimes I have to expose my models via an HTTP endpoint. I usually do this with Flask and gunicorn:
exampleproject.py:
import random

from flask import Flask

app = Flask(__name__)
random.seed(0)


@app.route("/")
def hello():
    x = random.randint(1, 100)
    y = random.randint(1, 100)
    return str(x * y)


if __name__ == "__main__":
    app.run(host='0.0.0.0')
wsgi.py:
from exampleproject import app

if __name__ == "__main__":
    app.run()
Run with:
$ gunicorn --bind 0.0.0.0:5000 wsgi:app
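By default gunicorn starts a single synchronous worker, so only one request is handled at a time and the other concurrent connections queue up. A sketch of a multi-worker invocation (the worker count of 4 is a placeholder, not a recommendation for your machine):

```shell
# Start gunicorn with several worker processes so requests are served in parallel.
# A common rule of thumb from the gunicorn docs is (2 * CPU cores) + 1 workers.
gunicorn --workers 4 --bind 0.0.0.0:5000 wsgi:app
```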
When I benchmark this simple script, I get:
$ ab -s 30 -c 200 -n 25000 -v 1 http://localhost:5000/
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 2500 requests
Completed 5000 requests
Completed 7500 requests
Completed 10000 requests
Completed 12500 requests
Completed 15000 requests
Completed 17500 requests
Completed 20000 requests
Completed 22500 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 24941 requests completed
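One way to tell client-side connection churn apart from server slowness is to re-run the same benchmark with HTTP keep-alive, so ab reuses connections instead of opening a fresh one per request (a diagnostic sketch, assuming the same local setup):

```shell
# -k enables ApacheBench's keep-alive mode; if the timeout disappears,
# the bottleneck is connection setup/teardown, not request handling.
ab -k -s 30 -c 200 -n 25000 -v 1 http://localhost:5000/
```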
With fewer total requests, it looks fine:
$ ab -l -s 30 -c 200 -n 200 -v 1 http://localhost:5000/
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests
Server Software: gunicorn/19.9.0
Server Hostname: localhost
Server Port: 5000
Document Path: /
Document Length: Variable
Concurrency Level: 200
Time taken for tests: 0.084 seconds
Complete requests: 200
Failed requests: 0
Total transferred: 32513 bytes
HTML transferred: 713 bytes
Requests per second: 2380.19 [#/sec] (mean)
Time per request: 84.027 [ms] (mean)
Time per request: 0.420 [ms] (mean, across all concurrent requests)
Transfer rate: 377.87 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 1.2 2 3
Processing: 1 36 16.8 41 52
Waiting: 1 36 16.8 41 52
Total: 4 37 15.8 43 54
Percentage of the requests served within a certain time (ms)
50% 43
66% 51
75% 51
80% 52
90% 52
95% 52
98% 53
99% 53
100% 54 (longest request)
Is there something I can change to improve the configuration for my kind of workload?
When I execute a single call of my real model, I get an answer in about 0.5s; an execution time of up to 1.0s would be acceptable. Every call is stateless, so each request is independent of all others.
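Since every call is stateless and takes up to ~1s, scaling out synchronous workers is the usual first lever. A sketch of a gunicorn config file (the filename and worker formula are conventions, not requirements):

```python
# gunicorn.conf.py -- a sketch; start with: gunicorn -c gunicorn.conf.py wsgi:app
# The worker formula is gunicorn's documented rule of thumb, not a hard rule.
import multiprocessing

bind = "0.0.0.0:5000"
workers = multiprocessing.cpu_count() * 2 + 1
timeout = 30  # seconds; model calls of up to ~1s leave plenty of headroom
```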
When I tried to analyze this problem, I saw a lot of connections in the TIME_WAIT state:
$ netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n
1 established)
1 Foreign
2 CLOSE_WAIT
4 LISTEN
10 SYN_SENT
60 SYN_RECV
359 ESTABLISHED
13916 TIME_WAIT
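Without keep-alive, ab opens a new TCP connection per request, and each closed connection lingers in TIME_WAIT on the client before its port can be reused. A back-of-the-envelope check, assuming common Linux defaults (verify yours with `sysctl net.ipv4.ip_local_port_range`):

```python
# Rough estimate of how many fresh connections per second the client can
# sustain before exhausting ephemeral ports. Both numbers below are assumed
# Linux defaults, not measured values from this machine.
ephemeral_ports = 60999 - 32768 + 1   # default net.ipv4.ip_local_port_range
time_wait_seconds = 60                # typical TIME_WAIT duration on Linux
sustainable_rate = ephemeral_ports / time_wait_seconds
print(f"~{sustainable_rate:.0f} new connections/second before port exhaustion")
```

At roughly 2380 requests per second, the benchmark churns through ports several times faster than TIME_WAIT releases them, which is consistent with the 13916 TIME_WAIT sockets above and would explain why only the longer run hits the apr_pollset_poll timeout.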
How can I confirm / falsify that this is the problem? Is this in any way related to Flask / gunicorn? How does nginx relate to gunicorn?