I have a process running on my Kubernetes cluster with Dask that consists of two map-reduce phases, but in both map phases each worker downloads a potentially large number of large files. To avoid having two different machines process the same subset of files in the two different map steps, is it possible to deterministically select which workers get which arguments for the same jobs? Conceptually, what I want might be something like:
from typing import List

workers: List = client.get_workers()  # <-- does an API like this exist?
filenames: List[str] = get_filenames()  # input data to process

# map each file to a specific worker
file_to_worker = {filename: workers[hash(filename) % len(workers)]
                  for filename in filenames}

# submit each file, specifying which worker should be assigned the task
futures = [client.submit(my_func, filename, worker=file_to_worker[filename])
           for filename in filenames]  # <-- is there a submit kwarg like this?
Something like this would let me direct the different steps of computation for the same files to the same nodes, eliminating any need to download and cache the files a second time.
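For what it's worth, here is a minimal sketch of the concrete version I'm imagining, under the assumption that client.scheduler_info() and the workers=/allow_other_workers= arguments of client.submit can be used this way. The scheduler address, the pick_worker helper, and the crc32-based hashing are my own placeholders, and get_filenames/my_func are the same stand-ins as above:

from typing import Dict, List
from zlib import crc32

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# worker addresses as the scheduler currently sees them
workers: List[str] = list(client.scheduler_info()["workers"].keys())

filenames: List[str] = get_filenames()  # my own helper, as above

def pick_worker(filename: str) -> str:
    # stable hash so the same file always maps to the same worker address
    return workers[crc32(filename.encode()) % len(workers)]

file_to_worker: Dict[str, str] = {f: pick_worker(f) for f in filenames}

# pin each task to its chosen worker; allow_other_workers=False keeps the pin strict
futures = [
    client.submit(my_func, f, workers=[file_to_worker[f]], allow_other_workers=False)
    for f in filenames
]
results = client.gather(futures)

The second map phase would reuse the same file_to_worker mapping, so every file should land on the node that already downloaded it; crc32 is used instead of the built-in hash() only so the assignment stays stable across client restarts.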