I have a process running on my Kubernetes cluster with Dask that consists of two map-reduce phases, but in both map phases each worker downloads a potentially large number of large files. To avoid having two different machines process the same subset of files in the two different map steps, is it possible to deterministically select which workers get which arguments for the same jobs? Conceptually, what I want might be something like:
from typing import List

workers: List = client.get_workers()  # <-- does an API like this exist?
filenames: List[str] = get_filenames()  # input data to process

# map each file to a specific worker
file_to_worker = {filename: workers[hash(filename) % len(workers)]
                  for filename in filenames}

# submit each file, specifying which worker should be assigned the task
futures = [client.submit(my_func, filename, worker=file_to_worker[filename])
           for filename in filenames]  # <-- is there a submit kwarg like this?
Something like this would let me direct the different steps of computation for the same files to the same nodes, eliminating any need to download and cache the files a second time.
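For what it's worth, here is a minimal sketch of the concrete version I'm imagining, under the assumption that client.scheduler_info() and the workers=/allow_other_workers= arguments of client.submit can be used this way. The scheduler address, the pick_worker helper, and the crc32-based hashing are my own placeholders, and get_filenames/my_func are the same stand-ins as above:

from typing import Dict, List
from zlib import crc32

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# worker addresses as the scheduler currently sees them
workers: List[str] = list(client.scheduler_info()["workers"].keys())

filenames: List[str] = get_filenames()  # my own helper, as above

def pick_worker(filename: str) -> str:
    # stable hash so the same file always maps to the same worker address
    return workers[crc32(filename.encode()) % len(workers)]

file_to_worker: Dict[str, str] = {f: pick_worker(f) for f in filenames}

# pin each task to its chosen worker; allow_other_workers=False keeps the pin strict
futures = [
    client.submit(my_func, f, workers=[file_to_worker[f]], allow_other_workers=False)
    for f in filenames
]
results = client.gather(futures)

The second map phase would reuse the same file_to_worker mapping, so every file should land on the node that already downloaded it; crc32 is used instead of the built-in hash() only so the assignment stays stable across client restarts.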