I am considering using Ray for a simple parallel data-processing pipeline:
- massive amounts of data items need to be processed; they become available through a stream/iterator, and each item is of significant size
- a function should be run on each item, producing a result that is also of significant size
- the processed data should be passed on as a stream, or stored in some kind of sink that can only accept a certain amount of data per period of time
I want to find out if this is something that can be done in Ray.
Currently I have the following simple implementation based on Python's multiprocessing library:
- one process reads the stream and puts items onto an input queue that blocks once it holds k items (so the memory needed for the queue cannot exceed some limit)
- several worker processes read from the input queue and process the items; the results are put onto a result queue, which is also of limited size
- another process reads from the result queue and passes the items on to the sink
With this, as soon as the workers cannot keep up, the input queue blocks and no more work is handed to them. If the sink process cannot store more items, the result queue blocks, which in turn blocks the workers, which in turn blocks the input queue, until the writer process can write results again. A minimal sketch of this setup follows.
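For reference, here is a stripped-down version of what I have; read_stream, process_item, and write_result are placeholders standing in for the real stream, processing function, and rate-limited sink:

```python
import multiprocessing as mp
import time

K = 16          # queue capacity: bounds the memory used for buffered items
N_WORKERS = 4

def read_stream():
    yield from range(100)       # placeholder for the real item stream

def process_item(item):
    return item * 2             # placeholder for the real processing function

def write_result(result):
    time.sleep(0.01)            # placeholder for the rate-limited sink

def reader(in_q):
    for item in read_stream():
        in_q.put(item)          # blocks once K items are buffered
    for _ in range(N_WORKERS):
        in_q.put(None)          # one end-of-stream sentinel per worker

def worker(in_q, out_q):
    while (item := in_q.get()) is not None:
        out_q.put(process_item(item))   # blocks while the result queue is full
    out_q.put(None)             # tell the writer this worker is done

def writer(out_q):
    finished = 0
    while finished < N_WORKERS:
        result = out_q.get()
        if result is None:
            finished += 1
        else:
            write_result(result)    # slow sink; backpressure propagates upstream

if __name__ == "__main__":
    in_q = mp.Queue(maxsize=K)
    out_q = mp.Queue(maxsize=K)
    procs = [mp.Process(target=reader, args=(in_q,))]
    procs += [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(N_WORKERS)]
    procs += [mp.Process(target=writer, args=(out_q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```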
So does Ray have abstractions to do something like this? How would I make sure that only a limited amount of work is in flight with the workers at any time, and how can I have something like the single-process output function while making sure that the workers cannot flood it with so many results that memory/storage is exhausted?
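From skimming the docs, ray.util.queue.Queue looks like it might be the relevant abstraction: it takes a maxsize, and if I read it correctly its put()/get() block like those of multiprocessing.Queue. Is something like the following sketch (with the same placeholder functions as above) the intended way to do this, or is there a more idiomatic pattern?

```python
import ray
from ray.util.queue import Queue

K = 16
N_WORKERS = 4

def read_stream():
    yield from range(100)       # placeholder, as above

def process_item(item):
    return item * 2             # placeholder, as above

def write_result(result):
    pass                        # placeholder for the rate-limited sink

@ray.remote
def worker(in_q, out_q):
    while (item := in_q.get()) is not None:   # get() blocks while the queue is empty
        out_q.put(process_item(item))          # put() blocks while the result queue is full
    out_q.put(None)                            # signal that this worker is done

@ray.remote
def writer(out_q):
    finished = 0
    while finished < N_WORKERS:
        result = out_q.get()
        if result is None:
            finished += 1
        else:
            write_result(result)

if __name__ == "__main__":
    ray.init()
    in_q = Queue(maxsize=K)     # bounded, actor-backed queue
    out_q = Queue(maxsize=K)
    workers = [worker.remote(in_q, out_q) for _ in range(N_WORKERS)]
    sink = writer.remote(out_q)
    for item in read_stream():
        in_q.put(item)          # driver blocks here when the workers fall behind
    for _ in range(N_WORKERS):
        in_q.put(None)          # one sentinel per worker
    ray.get(sink)               # wait until the writer has drained everything
```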