0

I have a very simple FASTAPI app using a websocket route hooked to a busy loop on redis using BRPOP (blocking list pop) from redis.

await ws.accept()
while True:
    queue,message = await aioredis_client.brpop(queue_name)
    ...# some stuff
    ws.send_json(response_dict)

Using latest versions of gunicorn, uvicorn, asyncio, uvloop, websockets, fastapi w/ python 3.8.5.

I've tried many many combinations of settings with gunivorn and uvicorn (wsproto vs webosckets, uvloop vs asyncio).

60% of the time it works fine. Then, for no apparent reason messages are sent out vis send_json, but browser never receives them. Restarting gunicorn does nothing, same with switching to uvicorn.

I can fix the issue by blasting the websocket with a ton of messages, then they'll all start flowing again. Help.

I've thrown a sleep in the busy loop, does nothing.

EDIT: I upgraded python to 3.10 and I get a stack now:

  File "/home/amalizzio/venv/lib/python3.10/site-packages/starlette/websockets.py", line 163, in send_text
    await self.send({"type": "websocket.send", "text": data})
  File "/home/amalizzio/venv/lib/python3.10/site-packages/starlette/websockets.py", line 85, in send
    await self._send(message)
  File "/home/amalizzio/venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in sender
    await send(message)
  File "/home/amalizzio/venv/lib/python3.10/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 327, in asgi_send
    await self.send(data)  # type: ignore[arg-type]
  File "/home/amalizzio/venv/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 635, in send
    await self.ensure_open()
  File "/home/amalizzio/venv/lib/python3.10/site-packages/websockets/legacy/protocol.py", line 935, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: sent 1012 (service restart); no close frame received

EDIT: I switched from websockets to wsproto and I get a similar stack, pasting:

    await ws.send_text(j)
  File "/home/amalizzio/venv/lib/python3.10/site-packages/starlette/websockets.py", line 163, in send_text
    await self.send({"type": "websocket.send", "text": data})
  File "/home/amalizzio/venv/lib/python3.10/site-packages/starlette/websockets.py", line 85, in send
    await self._send(message)
  File "/home/amalizzio/venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in sender
    await send(message)
  File "/home/amalizzio/venv/lib/python3.10/site-packages/uvicorn/protocols/websockets/wsproto_impl.py", line 322, in send
    output = self.conn.send(
  File "/home/amalizzio/venv/lib/python3.10/site-packages/wsproto/__init__.py", line 64, in send
    data += self.connection.send(event)
  File "/home/amalizzio/venv/lib/python3.10/site-packages/wsproto/connection.py", line 107, in send
    raise LocalProtocolError(
wsproto.utilities.LocalProtocolError: Event Message(data='...lots of json...', frame_finished=True, message_finished=True) cannot be sent in state ConnectionState.LOCAL_CLOSING.

EDIT I switched from websockets to long polling and I have the same issue I can catch http closing in uvicorn TRACE.

The busy loop on blocking redis is not compatible with starlette or something when i have multiple tabs making connections to the same webserver (plenty of workers). It looks like a thread loses connection to the browser and new threads are spawned, however the list pops and sends the result into the ether! I can see this just be printing before/after popping and magically it's happens 1, then 2, then 3, then 4...

I took off the busy loop and handle all messages straight-through off the pop. I still miss the first message sometimes so I put a blocking wait of 25ms which hasn't broken yet. As a last resort should the queue blocking wait hit the timeout I scan a backup of the queue to see if any messages weren't delivered, so it's definitely reliable now.

I will try to find time to file a bug report. Note: I did switch to async/wait with aioredis and it made absolutely no difference at all.

Mike
  • 23
  • 4
  • 1
    Have you tried using an alternative websocket-client running on a server or different OS to see if it can be related to your browser code, to a firewall timing out NAT-ed connections or something similar? – MatsLindh May 05 '23 at 22:50
  • @MatsLindh I used python websockets library (localhost->localhost on WSL/Ubuntu) and the behavior is identical to in the browser, connectin opens, either no messages at all, or every-other-message gets through, but if you hit it a bunch of times, they start to come in regularly, then it will stop...randomly, all before connection closes. Behavior is same with my colleagues on their machines. Same when deployed via docker/kube/GCP. – Mike May 06 '23 at 20:25
  • 1
    I'm spitballing ideas here: Seeing as `brpop` is used and you're not awaiting it - can it be that your resources gets starved? i.e. you're stuck waiting without the coroutine giving up time to other calls? (seems like you've already attached a debugger since you're saying that you're getting to `send_json` - have you tried going further down into Starlette to see what happens there? I'd also try to see if it behaves differently with another Python-version (but you might already have done that in docker/kube/etc.) as asyncio has received a lot more attention since 3.8.x. – MatsLindh May 06 '23 at 21:45
  • I switched to aioredis to do an await brpop, unfortunately, had no impact. I'm trying to avoid going deep into starlette if you or someone else has seen this before and had any magic settings. I will try with newer python before rolling up the sleeves. I really appreciate your suggestions. – Mike May 07 '23 at 19:49
  • 1
    If you have a small, self-contained example that would be useful; it's hard to say anything more without actual code for me at least :/ Dropping the call to redis (and anything else you're doing in the function) could also be a useful debugging strategy – MatsLindh May 07 '23 at 19:54
  • I upgraded to the 3.10 per your suggestion and I get an exception now (3.10 overall has been more stable than 3.8, it's getting harder to reproduce). Stack is pasted in body. – Mike May 07 '23 at 20:31
  • 1
    I have a workaround for now, catch the exception and push back onto the queue., it gets picked up by the socket in open state. – Mike May 09 '23 at 01:22
  • @MatsLindh workaround barely worked, i went back to long polling, same issue -- faux-client-disconnect, i switched to async/await and no difference. found it though, somehow the busy loop on the blocking pop is causing starlette to have multiple ghost connections, what's happening is the pop is successful, and the response is returned to a thread that is no longer connected to the browser. Anyway I really don't have time to really get into it, I ended up dropping the busy loop and returning every pop + a queue backup to check if block times out, this should work fine for now. thx. – Mike May 20 '23 at 00:45
  • Thanks for the update and digging further! – MatsLindh May 20 '23 at 18:32

0 Answers0