
I have some high-volume data streams coming in on different websockets (sensor data, several TB per month), and I want to guarantee that all data is stored, even during high load.

So I want to dispatch the data to my database and to a real-time processing module (e.g. GUI, ML predictions, etc.) in a way that buffers the data streams in case the processing in those modules is too slow, so that they can 'catch up' when the load decreases.

What I tried: Python threads with queues (from the `queue` / `threading` modules), but if the queue operations block I can't ensure the data isn't congested, and if they are non-blocking (e.g. `asyncio.Queue`) I get race conditions and things blow up.

So maybe I should use some kind of callback mechanism, but I don't know what to look for. I hope the question is not too vague. If anybody has a pointer to what I could try, ideally using Python only, that would really help me a lot, even if it's just an idea.
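For reference, this is roughly the shape of the setup I mean (a minimal sketch with placeholder names and toy data, not my actual code): one thread reads from the stream into a bounded queue, and a slower consumer drains it.

```python
import queue
import threading

SENTINEL = object()
buffer = queue.Queue(maxsize=1000)  # bounded buffer between reader and consumer

def reader():
    # stand-in for the websocket loop receiving sensor messages
    for seq in range(10_000):
        buffer.put({"seq": seq})  # blocks when the buffer is full, stalling the reader
    buffer.put(SENTINEL)

def consumer():
    # stand-in for the database write / real-time processing side
    while True:
        msg = buffer.get()
        if msg is SENTINEL:
            break
        # ... store msg, update the GUI, run predictions ...

t1 = threading.Thread(target=reader)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```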

  • [Apache Kafka](https://kafka.apache.org/) is a good streaming tool, and you can use it from Python via the [faust](https://faust.readthedocs.io/en/latest/) library. – ahmadgh74 Feb 02 '21 at 06:26
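For what it's worth, a minimal sketch of what a faust consumer could look like, assuming a local Kafka broker on `localhost:9092` and a topic named `sensor-readings` (both names are assumptions, not from the question):

```python
import faust

# assumes a Kafka broker at localhost:9092 and a 'sensor-readings' topic
app = faust.App("sensor-pipeline", broker="kafka://localhost:9092")

class SensorReading(faust.Record):
    sensor_id: str
    value: float

readings = app.topic("sensor-readings", value_type=SensorReading)

@app.agent(readings)
async def process(stream):
    async for reading in stream:
        # Kafka durably buffers the stream, so this consumer can catch up after load spikes
        ...  # write to the database / feed the real-time modules

if __name__ == "__main__":
    app.main()
```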

1 Answer


So maybe I should use some kind of callback mechanism, but I don't know what to look for.

Looks like you need a Future.
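A minimal sketch of that idea with `concurrent.futures` (the worker function and callback are placeholders, not from the question):

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def store_in_db(msg):
    # placeholder for the actual database write
    return msg["seq"]

def on_stored(future):
    # callback fires once the write has finished (or raised)
    print("stored:", future.result())

def handle_message(msg):
    future = executor.submit(store_in_db, msg)  # returns immediately with a Future
    future.add_done_callback(on_stored)         # the receive path never blocks

handle_message({"seq": 1})
executor.shutdown(wait=True)
```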

What I tried: Python threads with queues (from the `queue` / `threading` modules), but if the queue operations block I can't ensure the data isn't congested, and if they are non-blocking (e.g. `asyncio.Queue`) I get race conditions and things blow up.

You can try to use the non-blocking methods of `Queue`, `put_nowait()` and `get_nowait()`, and go to the database when a `queue.Full` or `queue.Empty` exception is caught.
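A minimal sketch of that pattern (the overflow list stands in for the database, as an assumption):

```python
import queue

buffer = queue.Queue(maxsize=1000)
overflow = []  # placeholder for the database used as an overflow / backlog store

def on_message(msg):
    # producer side: never block the websocket reader
    try:
        buffer.put_nowait(msg)
    except queue.Full:
        overflow.append(msg)  # spill straight to persistent storage instead

def drain_one():
    # consumer side: prefer the in-memory buffer, catch up from storage when it is empty
    try:
        return buffer.get_nowait()
    except queue.Empty:
        return overflow.pop(0) if overflow else None

# toy usage
for i in range(5):
    on_message({"seq": i})
while (item := drain_one()) is not None:
    print("processing", item)
```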

madbird