I have dask arrays that represent frames of a video and want to create multiple video files. I'm using the imageio library, which allows me to "append" frames to an ffmpeg subprocess. So I may have something like this:
```python
my_frames = [[arr1f1, arr1f2, arr1f3], [arr2f1, arr2f2, arr2f3], ...]
```
So each internal list represents the frames for one video (or product). I'm looking for the best way to send/submit frames to be computed while also writing frames to imageio as they complete (in order). To make it more complicated, the internal lists above are actually generators and can be hundreds or thousands of frames long. Also keep in mind that, because of how imageio works, I think the writing needs to happen in one single process. Here is a simplified version of what I have working so far:
```python
from dask.distributed import as_completed

for frame_arrays in frames_to_write:
    # 'frame_arrays' is [arr1f1, arr2f1, arr3f1, ...]
    future_list = _client.compute(frame_arrays)
    # key -> future
    future_dict = dict(zip(frame_keys, future_list))
    # future -> key
    rev_future_dict = {v: k for k, v in future_dict.items()}

    # write the current frame for each video as its future completes
    result_iter = as_completed(future_dict.values(), with_results=True)
    for future, result in result_iter:
        frame_key = rev_future_dict[future]
        # get the writer for this specific video and append the new frame
        w = writers[frame_key]
        w.append_data(result)
```
This works, and my actual code is reorganized from the above so that it submits the next frame while writing the current one, so I think there is some benefit. I'm thinking of a solution where the user says "I want to process X frames at a time", so I send 50 frames, write 50 frames, send 50 more frames, write 50 frames, and so on (see the sketch below).
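As a concrete illustration, here is a minimal sketch of that batching idea. The names `frame_gens` (mapping each video's key to its generator of dask arrays) and `writers` (mapping keys to imageio writers) are hypothetical; the sketch overlaps computing one batch with writing the previous one, and writes in submission order so frames for each video stay in order:

```python
import itertools

def write_in_batches(client, frame_gens, writers, batch_size=50):
    # frame_gens: {frame_key: generator of dask arrays}  (hypothetical)
    # writers:    {frame_key: imageio writer}            (hypothetical)
    gens = dict(frame_gens)
    pending = None  # (keys, futures) for the batch currently computing
    while gens or pending:
        # pull up to batch_size frames from each still-active generator
        batch = []
        for key, gen in list(gens.items()):
            chunk = list(itertools.islice(gen, batch_size))
            if not chunk:
                del gens[key]  # this video's generator is exhausted
            batch.extend((key, arr) for arr in chunk)
        # submit the next batch before blocking on the previous one
        futures = client.compute([arr for _, arr in batch]) if batch else []
        if pending:
            # write the previous batch in submission order so frames for
            # each video stay in order; .result() pulls one frame at a time
            for key, fut in zip(*pending):
                writers[key].append_data(fut.result())
        pending = ([key for key, _ in batch], futures) if batch else None
```

A nice property of this shape is that `.result()` only pulls one frame's data into local memory at a time, while the next batch keeps computing in the background.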
My questions after working on this for a while:
- When does `result`'s data live in local memory? When it is returned by the iterator, or as soon as the future completes?
- Is it possible to do something like this with the dask-core threaded scheduler, so a user doesn't have to have distributed installed?
- Is it possible to adapt how many frames are sent based on the number of workers? (A rough sketch of what I mean follows this list.)
- Is there a way to send a dictionary of dask arrays and/or use `as_completed` with the "frame_key" being included?
- If I load the entire series of frames and submit them to the client/cluster, I would probably kill the scheduler, right?
- Is using `get_client()` followed by `Client()` on `ValueError` the preferred way of getting the client (if not provided by the user)? The exact pattern I mean is sketched after this list.
- Is it possible to give dask/distributed one or more iterators that it pulls from as workers become available?
- Am I being dumb? Overcomplicating this?
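For the worker-count question, this is roughly the kind of policy I have in mind; `frames_per_worker` is a hypothetical knob, and the worker count comes from `Client.scheduler_info()`:

```python
def pick_batch_size(client, frames_per_worker=10):
    # scheduler_info() describes the cluster; its "workers" entry maps
    # each worker address to that worker's details
    n_workers = len(client.scheduler_info()["workers"])
    # hypothetical policy: keep ~frames_per_worker frames in flight per worker
    return max(1, n_workers * frames_per_worker)
```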
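And for the `get_client()` question, this is the pattern I'm currently using; `get_client()` raises `ValueError` when no client is active:

```python
from distributed import Client, get_client

def get_or_create_client():
    try:
        # reuse the active client if the user already created one
        return get_client()
    except ValueError:
        # otherwise fall back to starting a local cluster
        return Client()
```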
Note: This is kind of an extension to this issue that I made a while ago, but is slightly different.