TL;DR
How can I safely await a function execution (it takes a `str` and an `int` as arguments and doesn't require any other context) in a separate process?
Long story
I have an `aiohttp.web` API that uses a Boost.Python wrapper for a C++ extension. It runs under Gunicorn (and I plan to deploy it on Heroku) and is load-tested with Locust.
About the extension: it has just one function that performs a blocking operation. It takes one string (and one integer for timeout management), does some calculations with it, and returns a new string. For every input string there is exactly one possible output (except on timeout, in which case a C++ exception must be raised and translated by Boost.Python into a Python-compatible one).
In short, a handler for a specific URL executes the code below:
```python
res = await loop.run_in_executor(executor, func, *args)
```
where `executor` is a `ProcessPoolExecutor` instance and `func` is the function from the C++ extension module. (In the real project this code lives in a coroutine method of a class, and `func` is a `classmethod` that only executes the C++ function and returns the result.)
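For context, here is a minimal, self-contained sketch of that pattern (the endpoint path, the POST field `data`, the timeout value, and the stand-in `func` are all assumptions, not the real code):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

from aiohttp import web

# Hypothetical stand-in for the Boost.Python extension function:
# takes a string plus a timeout and returns a new string.
def func(data: str, timeout: int) -> str:
    return data.upper()

executor = ProcessPoolExecutor(max_workers=4)

async def handle(request: web.Request) -> web.Response:
    post = await request.post()
    data = str(post.get("data", ""))
    loop = asyncio.get_running_loop()
    # Run the blocking extension call in a separate process so that
    # it cannot stall the event loop.
    res = await loop.run_in_executor(executor, func, data, 5)
    return web.Response(text=res)

app = web.Application()
app.router.add_post("/call", handle)

if __name__ == "__main__":
    web.run_app(app)
```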
Error catching
When a new request arrives, I extract its POST data with `request.post()` and store that data in an instance of a custom class named `Call` (because I have no idea what else to name it). The resulting `call` object contains all the input data (the string), the time the request was received, and the unique id that comes with the request.
The `call` then proceeds to a class named `Handler` (not the aiohttp request handler), which passes its input to another class's method with `loop.run_in_executor` inside. `Handler` also has a logging system that works like a middleware: it reads the id and receiving time of every incoming `call` object and logs a message telling you whether the call is just starting to execute, has executed successfully, or has run into trouble. In addition, `Handler` wraps the execution in `try/except` and stores any error inside the `call` object, so the logging middleware knows which error occurred, or which output the extension returned.
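Schematically, the `Call`/`Handler` pair looks something like the sketch below (field names, the fixed timeout of 5, and the exact log messages are illustrative, not the real code):

```python
import logging
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

log = logging.getLogger("handler")

@dataclass
class Call:
    data: str                       # input string from the POST body
    received_at: float = field(default_factory=time.time)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    result: Optional[str] = None    # output from the extension
    error: Optional[Exception] = None

class Handler:
    def __init__(self, executor, func):
        self.executor = executor
        self.func = func            # wrapper around the C++ extension

    async def handle(self, call: Call, loop) -> Call:
        log.info("call %s (received %s): starting", call.id, call.received_at)
        try:
            call.result = await loop.run_in_executor(
                self.executor, self.func, call.data, 5
            )
            log.info("call %s: succeeded with output %r", call.id, call.result)
        except Exception as exc:
            # Store the error on the call so the logging "middleware"
            # can report what went wrong.
            call.error = exc
            log.error("call %s: failed with %r", call.id, exc)
        return call
```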
Testing
I have a unit test that simply creates 256 coroutines running this code against an executor with 256 workers, and it works well.
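The test is essentially equivalent to this sketch (with the same stand-in `func` as above):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def func(data: str, timeout: int) -> str:  # stand-in for the extension
    return data.upper()

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=256) as executor:
        # 256 concurrent calls against 256 workers.
        tasks = [
            loop.run_in_executor(executor, func, f"input-{i}", 5)
            for i in range(256)
        ]
        results = await asyncio.gather(*tasks)
    # Every input has exactly one correct output, so checking is trivial.
    assert results == [f"INPUT-{i}" for i in range(256)]

asyncio.run(main())
```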
But a problem shows up when testing with Locust. For this kind of testing I use 4 Gunicorn workers and 4 executor workers. At some point the application simply starts returning wrong output.

My Locust `TaskSet` is configured to log every faulty response with all the available information: the output string, the error string, the input string (which the application echoes back), and the id. All simulated requests are identical, but the id is unique for each one.
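Schematically, the Locust side looks like this (assuming a `/call` endpoint and a JSON response carrying `output`, `error`, and the echoed input; the real response format may differ):

```python
import uuid

from locust import HttpUser, TaskSet, between, task

EXPECTED = "..."  # the single correct output for the fixed input

class ApiTasks(TaskSet):
    @task
    def call(self):
        call_id = uuid.uuid4().hex
        payload = {"data": "always the same input", "id": call_id}
        with self.client.post("/call", data=payload, catch_response=True) as resp:
            body = resp.json()
            if body.get("output") != EXPECTED:
                # Log every faulty response with everything we know about it.
                print(f"FAULT id={call_id} input={payload['data']!r} "
                      f"output={body.get('output')!r} error={body.get('error')!r}")
                resp.failure("wrong output")
            else:
                resp.success()

class ApiUser(HttpUser):
    tasks = [ApiTasks]
    wait_time = between(0.1, 0.5)
```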
The situation improves when Gunicorn's `max_requests` option is set to `100`, but failures still occur.
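For reference, the launch command is along these lines (the module path `app:app` and option values are assumptions):

```
gunicorn app:app --workers 4 --worker-class aiohttp.GunicornWebWorker --max-requests 100
```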
Interestingly, I can sometimes trigger a "wrong output" period simply by stopping and restarting the Locust test.
I need a 100% guarantee that my web API works as I expect.
UPDATE & solution
I asked a teammate to review the C++ code, and the problem turned out to be global variables. Somehow this wasn't a problem for 256 parallel coroutines, but under Gunicorn it was.
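A likely reason: in the 256-coroutine unit test each of the 256 executor workers probably handled at most one call, so the globals were always fresh, while the long-running Gunicorn setup reuses worker processes, letting state left in globals leak into later calls. A minimal Python illustration of the same hazard (not the actual C++ code; `transform` is made up):

```python
from concurrent.futures import ProcessPoolExecutor

_buffer = ""  # module-level state, like a global in the C++ extension

def transform(data: str) -> str:
    global _buffer
    _buffer += data          # stale state survives between calls
    return _buffer.upper()   # output now depends on call history

if __name__ == "__main__":
    # One worker reused for several calls: outputs drift.
    with ProcessPoolExecutor(max_workers=1) as ex:
        print(list(ex.map(transform, ["a", "a", "a"])))  # ['A', 'AA', 'AAA']
    # A fresh worker per call hides the bug (like the 256/256 unit test).
    for _ in range(3):
        with ProcessPoolExecutor(max_workers=1) as ex:
            print(list(ex.map(transform, ["a"])))  # ['A'] every time
```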