
I can't work out whether what I want to do is possible in Dask...

Currently, I have a long list of heavy files. I am using the multiprocessing library to process every entry of the list. My function opens an entry, operates on it, saves the result to a binary file on disk, and returns None. Everything works fine; I did this essentially to reduce RAM usage.
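For context, the current approach looks roughly like this (a simplified sketch: the body of func, the file names, and the number of processes are placeholders, not the real code):

from multiprocessing import Pool
import numpy as np

def func(path):
    # placeholder for the real per-file computation
    return np.fromfile(path, dtype=np.float64) * 2

def process_entry(args):
    index, path = args
    result = func(path)
    result.tofile(f'result_{index}.binary')  # save to disk, return nothing
    return None

if __name__ == '__main__':
    file_list = ['a.bin', 'b.bin', 'c.bin']  # stand-in for the real file list
    with Pool(processes=4) as pool:
        pool.map(process_entry, enumerate(file_list))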

I would like to do "the same" in Dask, but I cannot figure out how to save binary data in parallel. In my mind, it should be something like:

for i, element in enumerate(file_list):
    new_value = func(element)
    new_value.tofile(f'result_{i}.binary')  # one output file per element

where at most N elements are loaded at once, N being the number of workers, and each element is used and then released at the end of its iteration.

Is it possible?

Thanks a lot for any suggestion!

nick
1 Answer


That does sound like a feasible task:

from dask import delayed, compute

@delayed
def myfunc(index, element):
    new_value = func(element)
    # write each result to its own destination
    new_value.tofile(f'result_{index}.binary')

delayeds = [myfunc(i, e) for i, e in enumerate(file_list)]
results = compute(*delayeds)

If you want fine control over tasks, you might want to explicitly specify the number of workers by starting a LocalCluster:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=3)
client = Client(cluster)
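
Putting the pieces together, a minimal end-to-end sketch could look like the following (the body of func, the file names, and threads_per_worker=1 are illustrative assumptions; the last one makes each worker process one element at a time, so at most n_workers elements are in memory at once):

import numpy as np
from dask import delayed, compute
from dask.distributed import Client, LocalCluster

def func(element):
    # placeholder for the real per-element computation
    return np.arange(10, dtype=np.float64) * element

@delayed
def myfunc(index, element):
    new_value = func(element)
    new_value.tofile(f'result_{index}.binary')  # one output file per element

if __name__ == '__main__':
    cluster = LocalCluster(n_workers=3, threads_per_worker=1)
    client = Client(cluster)  # becomes the default scheduler for compute()

    file_list = list(range(12))  # stand-in for the real inputs
    delayeds = [myfunc(i, e) for i, e in enumerate(file_list)]
    compute(*delayeds)  # blocks until all files have been written

    client.close()
    cluster.close()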

There is a lot more that can be done to customize the settings/workflow, but perhaps the above will work for your use case.

SultanOrazbayev