I cannot tell whether what I want to do in Dask is possible.
Currently, I have a long list of large files, and I use the multiprocessing library to process every entry in the list. My function opens an entry, operates on it, saves the result to disk as a binary file, and returns None. Everything works fine. I set it up this way mainly to reduce RAM usage.
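To make the current setup concrete, here is a minimal sketch of it, not my actual code. The names `func`, `process_entry`, and `process_all` are placeholders, the "computation" is a trivial byte reversal, and I write plain bytes here instead of calling `.tofile()` so the sketch needs nothing beyond the standard library.

```python
import multiprocessing as mp


def func(data: bytes) -> bytes:
    # Placeholder for the real, expensive computation.
    return data[::-1]


def process_entry(job):
    # Open one entry, operate on it, write the result to disk.
    in_path, out_path = job
    with open(in_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as f:
        f.write(func(data))
    # Implicitly returns None, so no result is sent back to the parent
    # and the loaded data can be freed as soon as the call ends.


def process_all(jobs, n_workers):
    # Each worker process handles one entry at a time, so at most
    # n_workers entries are loaded at once.
    with mp.Pool(processes=n_workers) as pool:
        for _ in pool.imap_unordered(process_entry, jobs):
            pass
```

Here `jobs` is a list of `(input_path, output_path)` pairs, and `process_all(jobs, n_workers=4)` would be called from under an `if __name__ == "__main__":` guard.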
I would like to do "the same" in Dask, but I cannot figure out how to save binary data in parallel. In my mind, it should be something like:
    for i, element in enumerate(elements):
        new_value = func(element)
        # one output file per element, so results do not overwrite each other
        new_value.tofile(f'output_{i}.binary')
where at most N elements are loaded at any time, N being the number of workers, and each element is processed and then discarded at the end of its iteration.
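In Dask terms, I imagine something along these lines. This is only a rough sketch, assuming `dask.delayed` and `dask.compute` are the right tools, which I am not sure about. `func`, `process_and_save`, and `run` are placeholder names, and plain bytes stand in for my real arrays.

```python
import dask


def func(data: bytes) -> bytes:
    # Placeholder for the real computation.
    return data[::-1]


@dask.delayed
def process_and_save(in_path, out_path):
    # Load, transform, and write one element inside a single task,
    # so only None travels back to the caller.
    with open(in_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as f:
        f.write(func(data))


def run(jobs, n_workers, scheduler="processes"):
    # One lazy task per element; nothing is loaded until compute() runs.
    tasks = [process_and_save(in_path, out_path) for in_path, out_path in jobs]
    # Assumption: on the local schedulers, num_workers caps how many tasks
    # run concurrently, and therefore how many elements are loaded at once.
    dask.compute(*tasks, scheduler=scheduler, num_workers=n_workers)
```

What I do not know is whether this really keeps only N elements in memory, or whether Dask holds on to anything between tasks.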
Is it possible?
Thanks a lot for any suggestion!