I am considering using Ray for a simple implementation of parallel processing of data:

  • there is a massive number of data items to be processed, which become available through a stream/iterator; each item is of significant size
  • a function should be run on each of the items and will produce a result of significant size
  • the processed data should get passed on in a stream or get stored in some kind of sink that can only accept a certain amount of data within some period of time

I want to find out if this is something that can be done in Ray.

Currently I have the following simple implementation based on Python's multiprocessing library:

  • one process reads the stream and passes items on to a queue, which blocks once it holds k items (so that the memory needed for the queue does not exceed some limit)
  • there are several worker processes which will read from the input queue and process the items. The processed items are passed on to a result queue, which is again of limited size
  • another process reads the result queue to pass on the items

With this, as soon as the workers cannot keep up, the input queue blocks and no more work is passed to the workers. If the sink process cannot store more items, the result queue blocks, which in turn blocks the workers, which in turn blocks the input queue, until the writer process can write results again.
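
In condensed form, it looks roughly like this; `input_stream()` and `process_item()` here are placeholder stand-ins for the real stream and the real processing function:

```python
import multiprocessing as mp

K = 8           # maximum number of items buffered in each queue
N_WORKERS = 4

def input_stream():
    # placeholder for the real data stream
    for i in range(100):
        yield i

def process_item(item):
    # placeholder for the real processing function
    return item * 2

def reader(in_q):
    for item in input_stream():
        in_q.put(item)              # blocks once the queue holds K items
    for _ in range(N_WORKERS):
        in_q.put(None)              # one stop sentinel per worker

def worker(in_q, out_q):
    while (item := in_q.get()) is not None:
        out_q.put(process_item(item))   # blocks while the writer is behind
    out_q.put(None)                     # tell the writer this worker is done

def writer(out_q):
    finished = 0
    while finished < N_WORKERS:
        result = out_q.get()
        if result is None:
            finished += 1
        else:
            print(result)           # hand the result to the rate-limited sink

if __name__ == "__main__":
    in_q = mp.Queue(maxsize=K)      # bounded input queue
    out_q = mp.Queue(maxsize=K)     # bounded result queue
    procs = [mp.Process(target=reader, args=(in_q,))]
    procs += [mp.Process(target=worker, args=(in_q, out_q))
              for _ in range(N_WORKERS)]
    procs += [mp.Process(target=writer, args=(out_q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```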

So does Ray have abstractions to do something like this? How would I make sure that only a limited amount of work is passed to the workers at a time, and how can I have something like the single-process output stage and make sure that the workers cannot flood it with so many results that memory/storage is exhausted?

jpp1

2 Answers


There is an experimental streaming API for Ray, which you might find useful: https://github.com/ray-project/ray/tree/master/python/ray/experimental/streaming

It provides basic constructs for streaming data sources, custom operators, and sinks. You can also set a maximum memory footprint for your application by bounding the queue sizes.
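
The exact interface of that experimental API has shifted between releases, so rather than guess at it, here is a sketch of the same bounded-buffer idea using only core Ray primitives (`ray.remote` and `ray.wait`); `input_stream()`, `process_item()`, and `sink()` are hypothetical stand-ins for your actual stream, processing function, and output stage:

```python
import ray

ray.init()

MAX_IN_FLIGHT = 8   # at most this many items are queued or being processed

@ray.remote
def process_item(item):
    # hypothetical stand-in for the real processing function
    return item * 2

def input_stream():
    # hypothetical stand-in for the real data stream
    for i in range(100):
        yield i

def sink(result):
    # hypothetical stand-in for the rate-limited output stage
    print(result)

in_flight = []
for item in input_stream():
    if len(in_flight) >= MAX_IN_FLIGHT:
        # Block until at least one task finishes before submitting more,
        # which bounds the memory held by pending inputs and results.
        done, in_flight = ray.wait(in_flight, num_returns=1)
        sink(ray.get(done[0]))
    in_flight.append(process_item.remote(item))

# Drain the remaining results once the stream is exhausted.
while in_flight:
    done, in_flight = ray.wait(in_flight, num_returns=1)
    sink(ray.get(done[0]))
```

Because the loop refuses to submit a new task while MAX_IN_FLIGHT are outstanding, and the sink is called from this single driver loop, backpressure propagates from the sink all the way back to the stream, much like the blocking queues in your multiprocessing version.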

Can you maybe share some additional information about your application?

What type of data are we talking about? How big is a single data item in bytes?

jliagos

For this use case I recommend Ray's parallel iterators. First, you would wrap your streaming generator in a parallel iterator (see ray.util.iter.from_iterators()) and chain operations on its items (see .for_each()). Crucially, the intermediate objects (which can themselves be large) are evicted from memory as soon as they are consumed by the next function in the chain, preventing you from running out of memory.

Finally, you can control how quickly results are pulled off the pipeline, consuming items only when your data sink is ready, with the .take() method.
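
A minimal sketch of that pipeline, assuming the single input stream can be split round-robin across shards; `input_stream()` and `process_item()` are hypothetical stand-ins for your real stream and processing function:

```python
import itertools

import ray
from ray.util.iter import from_iterators

ray.init()

NUM_SHARDS = 4  # one worker actor per shard

def input_stream():
    # hypothetical stand-in for the real data stream
    for i in range(100):
        yield i

def process_item(item):
    # hypothetical stand-in for the real processing function
    return item * 2

def make_shard(i):
    # Round-robin slice of the stream, so each shard sees disjoint items.
    return lambda: itertools.islice(input_stream(), i, None, NUM_SHARDS)

pipeline = from_iterators(
    [make_shard(i) for i in range(NUM_SHARDS)]
).for_each(process_item)

# Items are produced lazily as they are pulled, so consuming the gathered
# iterator no faster than the sink allows gives natural backpressure;
# pipeline.take(n) similarly pulls only the first n results.
for result in pipeline.gather_sync():
    print(result)  # hand off to the rate-limited sink
```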

crypdick