
Problem:

I am downloading financial data from a server and processing it afterwards. I gather data for multiple stocks at the same time.

I need the data downloader and the data processor to run in parallel (the data processor itself will be comprised of several processes).

I absolutely need the data for each stock to be processed in a serialized fashion, but when I have more than one stock I must process the stocks in parallel.

My understanding of the problem:

From what I gather, I need a way to transfer data from a single source to parallel processes, identifying beforehand which data (according to stock id) goes to which process (each stock has its own process). A sketch of this routing idea is shown below.
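
For illustration, a minimal sketch of what I mean (assuming each downloaded item arrives as a `(stock_id, payload)` tuple; `process_tick()` is just a placeholder for the per-stock work):

```python
import multiprocessing as mp

def stock_worker(stock_id, queue):
    # One process per stock: items for this stock are handled serially,
    # in arrival order, while other stocks run in their own processes.
    while True:
        payload = queue.get()
        if payload is None:              # sentinel: stop this worker
            break
        process_tick(stock_id, payload)  # placeholder for the real processing

def dispatcher(source_queue, stock_queues):
    # Single consumer of the downloader's output; routes items by stock id.
    while True:
        item = source_queue.get()
        if item is None:
            break
        stock_id, payload = item
        stock_queues[stock_id].put(payload)
```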

I have tried a few different approaches without success so far; I just need to get past this error:

RuntimeError: Queue objects should only be shared between processes through inheritance
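
As far as I can tell, this happens when a multiprocessing.Queue() is pickled and sent to another process indirectly (for example by putting it into another queue or into a Manager().dict()), instead of being passed to the Process when it is created. A minimal sketch of the difference:

```python
import multiprocessing as mp

def worker(q):
    print(q.get())

if __name__ == "__main__":
    q = mp.Queue()

    # Sending the queue through another queue (or a Manager().dict()) pickles
    # it and raises the RuntimeError above:
    # some_other_queue.put(q)   # -> RuntimeError: Queue objects should only ...

    # Passing it as an argument when the Process is created works, because
    # the child inherits it at start():
    p = mp.Process(target=worker, args=(q,))
    p.start()
    q.put("hello")
    p.join()
```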

Possible Solution

The next thing I will try is a multiprocessing.Manager().dict() with collections.deque, multiprocessing.Queue(), or list() as elements, plus a dict of the mp.Process() instances (one per stock).

It is important that those data structures can be allocated dynamically, as I might change stocks at runtime. A sketch of this idea follows.
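
Roughly what I have in mind (a sketch, not working production code), using manager.Queue() proxies, which unlike multiprocessing.Queue() can be stored in a Manager().dict() and shared freely; the stock id and the worker body are made up:

```python
import multiprocessing as mp

def stock_worker(stock_id, queue):
    while True:
        payload = queue.get()
        if payload is None:          # sentinel: stop this worker
            break
        # ... process payload for this stock ...

if __name__ == "__main__":
    manager = mp.Manager()
    stock_queues = manager.dict()    # shared mapping: stock_id -> queue proxy
    workers = {}

    def add_stock(stock_id):
        # manager.Queue() returns a proxy that can live in a Manager().dict()
        q = manager.Queue()
        stock_queues[stock_id] = q
        p = mp.Process(target=stock_worker, args=(stock_id, q))
        p.start()
        workers[stock_id] = p

    add_stock("AAPL")                # stocks can be added at runtime
    stock_queues["AAPL"].put({"price": 170.0})
    stock_queues["AAPL"].put(None)
    workers["AAPL"].join()
```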

Question

What is a performant way to approach this problem?

Intuitively it seems that there is a better way than using multiprocessing.Manager().dict() for this task, but I haven't found it yet. Is there such a thing?

A few solutions could help your use case. [Dask](http://dask.pydata.org/en/latest/): a flexible parallel computing library for analytic computing. Or [Activeeon](https://try.activeeon.com/): a similar thing, more graphical, with more languages. You can also find cloud solutions on AWS, Azure, or GCP, such as serverless functions. – XYZ123 Dec 05 '17 at 19:54
