We have an application that receives 20-30 requests per second. Waitress seems to buckle under the load despite our tuning its performance settings. It doesn't crash or log any errors. Instead, it seems (we assume) to send an ECONNRESET to Nginx, which forwards requests to it. This hypothesis comes from the Waitress documentation, which notes that when the backlog is past its limit, the requesting party may receive ECONNRESETs. Further, Nginx returns 504s to us when Waitress is under load. The Python application itself continues to run, seemingly fine.
We raised the thread count (to 50) and the connection limit (to 1000). We also lowered channel_timeout and cleanup_interval (to 10 and 15 seconds respectively). This still showed no improvement in performance under load. Lastly, we even increased the backlog to 2048. None of this has had any significant impact.
On some level I even wonder whether the new limits we prescribed are being respected, as running netcat shows long-running connections that are not terminated for well over 60 seconds. We're under the impression that Waitress should be able to handle this load, yet it does not. Note that we have scaled this up to 6 concurrent instances behind a load balancer and are still getting these errors.
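To put a number on how long an idle connection actually survives, a small stdlib-only probe along these lines could be pointed at one instance. This is a minimal sketch: the function name is hypothetical, and it assumes the server can be reached directly (bypassing Nginx) on the given host and port. If we read the Waitress docs correctly, an idle connection should be dropped some time after channel_timeout, at the next cleanup pass, but that reading is an assumption on our part.

```python
import socket
import time


def seconds_until_server_closes(host, port, max_wait=120.0):
    """Open an idle TCP connection and time how long the server keeps it open.

    Returns the elapsed seconds once the server closes the connection, or
    None if a read is still pending after max_wait seconds.
    (Hypothetical helper; not part of Waitress.)
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=max_wait) as sock:
        try:
            # Send nothing. recv() returns b"" once the peer closes the socket.
            while sock.recv(4096):
                pass
        except socket.timeout:
            return None  # still open after max_wait seconds
        except ConnectionResetError:
            pass  # a reset also means the server dropped the connection
    return time.monotonic() - start


if __name__ == "__main__":
    # Host and port are placeholders for one instance of the app.
    print(seconds_until_server_closes("127.0.0.1", 3000))
```

A result far above the configured timeouts would suggest the settings are not taking effect; a result close to them would point the investigation elsewhere.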
Any feedback or performance tips would be appreciated. We are running these on pretty beefy AWS instances layered upon Kubernetes. They use negligible CPU and RAM. When it does work, it returns in milliseconds, so I cannot see any bottlenecks in the code that may be contributing, only the fact that somehow the connections and backlog are being overwhelmed.
See below for the Waitress config we use to start the app.
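import os

import waitress

# `app` is our application module, imported elsewhere in this file.
# Tuning values come from the environment and are cast to int explicitly.
waitress.serve(app.app,
               host=os.getenv('HOST', '0.0.0.0'),
               port=int(os.getenv('PORT', '3000')),
               expose_tracebacks=True,
               connection_limit=int(os.getenv('CONNECTION_LIMIT', '1000')),
               threads=int(os.getenv('THREADS', '50')),
               channel_timeout=int(os.getenv('CHANNEL_TIMEOUT', '10')),
               cleanup_interval=int(os.getenv('CLEANUP_INTERVAL', '30')),
               backlog=int(os.getenv('BACKLOG', '2048')))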
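To rule out the possibility that our overrides are silently ignored or mistyped in the pod environment, something like the following could parse and print the values at startup. This is only a sketch: the helper name load_waitress_settings is hypothetical, and the defaults simply mirror the serve() call above.

```python
import os

# Defaults mirror the serve() call above.
DEFAULTS = {
    "CONNECTION_LIMIT": 1000,
    "THREADS": 50,
    "CHANNEL_TIMEOUT": 10,
    "CLEANUP_INTERVAL": 30,
    "BACKLOG": 2048,
}


def load_waitress_settings(environ=os.environ):
    """Return the tuning values as ints, keyed by Waitress keyword name.

    Hypothetical helper: fails loudly if an environment value is not an integer.
    """
    settings = {}
    for env_name, default in DEFAULTS.items():
        raw = environ.get(env_name, default)
        try:
            value = int(raw)
        except (TypeError, ValueError):
            raise ValueError(f"{env_name} must be an integer, got {raw!r}")
        settings[env_name.lower()] = value
    return settings


if __name__ == "__main__":
    # Shows exactly what would be passed on as keyword arguments.
    print(load_waitress_settings())
```

The resulting dict could then be unpacked into the call, e.g. waitress.serve(app.app, host=..., port=..., **settings), so the logged values and the values in effect cannot drift apart.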