I have dask arrays that represent frames of a video and want to create multiple video files. I'm using the imageio library, which allows me to "append" frames to an ffmpeg subprocess. So I may have something like this:
```python
my_frames = [[arr1f1, arr1f2, arr1f3], [arr2f1, arr2f2, arr2f3], ...]
```
So each internal list represents the frames for one video (or product). I'm looking for the best way to send/submit frames to be computed while also writing frames to imageio as they complete (in order). To make it more complicated, the internal lists above are actually generators and can be hundreds or thousands of frames long. Also keep in mind that, because of how imageio works, I think the writing needs to happen in one single process. Here is a simplified version of what I have working so far:
```python
from dask.distributed import as_completed

for frame_arrays in frames_to_write:
    # 'frame_arrays' is [arr1f1, arr2f1, arr3f1, ...]
    future_list = _client.compute(frame_arrays)
    # key -> future
    future_dict = dict(zip(frame_keys, future_list))
    # future -> key
    rev_future_dict = {v: k for k, v in future_dict.items()}

    # write the current frame for each video as its future completes
    result_iter = as_completed(future_dict.values(), with_results=True)
    for future, result in result_iter:
        frame_key = rev_future_dict[future]
        # get the writer for this specific video and append the new frame
        w = writers[frame_key]
        w.append_data(result)
```
This works, and my actual code is reorganized from the above so that it submits the next frame while writing the current one, so I think there is some benefit. I'm thinking of a solution where the user says "I want to process X frames at a time", so I send 50 frames, write 50 frames, send 50 more frames, write 50 frames, and so on (see the sketch below).
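As a concrete illustration, here is a minimal sketch of that batching idea. The names `frame_gens` (mapping each video's key to its generator of dask arrays) and `writers` (mapping keys to imageio writers) are hypothetical; the sketch overlaps computing one batch with writing the previous one, and writes in submission order so frames for each video stay in order:

```python
import itertools

def write_in_batches(client, frame_gens, writers, batch_size=50):
    # frame_gens: {frame_key: generator of dask arrays}  (hypothetical)
    # writers:    {frame_key: imageio writer}            (hypothetical)
    gens = dict(frame_gens)
    pending = None  # (keys, futures) for the batch currently computing
    while gens or pending:
        # pull up to batch_size frames from each still-active generator
        batch = []
        for key, gen in list(gens.items()):
            chunk = list(itertools.islice(gen, batch_size))
            if not chunk:
                del gens[key]  # this video's generator is exhausted
            batch.extend((key, arr) for arr in chunk)
        # submit the next batch before blocking on the previous one
        futures = client.compute([arr for _, arr in batch]) if batch else []
        if pending:
            # write the previous batch in submission order so frames for
            # each video stay in order; .result() pulls one frame at a time
            for key, fut in zip(*pending):
                writers[key].append_data(fut.result())
        pending = ([key for key, _ in batch], futures) if batch else None
```

A nice property of this shape is that `.result()` only pulls one frame's data into local memory at a time, while the next batch keeps computing in the background.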
My questions after working on this for a while:
- When does `result`'s data live in local memory? When it is returned by the iterator, or as soon as the future completes?
- Is it possible to do something like this with the dask-core threaded scheduler, so a user doesn't have to have distributed installed?
- Is it possible to adapt how many frames are sent based on the number of workers? (A rough sketch of what I mean follows this list.)
- Is there a way to send a dictionary of dask arrays and/or use `as_completed` with the "frame_key" being included?
- If I load the entire series of frames and submit them to the client/cluster, I would probably kill the scheduler, right?
- Is using `get_client()` followed by `Client()` on `ValueError` the preferred way of getting the client (if not provided by the user)? The exact pattern I mean is sketched after this list.
- Is it possible to give dask/distributed one or more iterators that it pulls from as workers become available?
- Am I being dumb? Overcomplicating this?
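For the worker-count question, this is roughly the kind of policy I have in mind; `frames_per_worker` is a hypothetical knob, and the worker count comes from `Client.scheduler_info()`:

```python
def pick_batch_size(client, frames_per_worker=10):
    # scheduler_info() describes the cluster; its "workers" entry maps
    # each worker address to that worker's details
    n_workers = len(client.scheduler_info()["workers"])
    # hypothetical policy: keep ~frames_per_worker frames in flight per worker
    return max(1, n_workers * frames_per_worker)
```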
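And for the `get_client()` question, this is the pattern I'm currently using; `get_client()` raises `ValueError` when no client is active:

```python
from distributed import Client, get_client

def get_or_create_client():
    try:
        # reuse the active client if the user already created one
        return get_client()
    except ValueError:
        # otherwise fall back to starting a local cluster
        return Client()
```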
Note: This is kind of an extension to this issue that I made a while ago, but is slightly different.