I am running Dask on an eight-node Kubernetes cluster with my manifest specifying one scheduler replica and eight worker replicas. My code is processing 80 files of about equal size, and I wanted to see how performance scales from one worker to eight. I'm doing something roughly like this:
from typing import Dict, List

from dask.distributed import Client, get_client

client: Client = get_client()
workers = client.scheduler_info()['workers']
# Worker addresses as reported by the scheduler, e.g. 'tcp://10.0.0.5:40231'
worker_ips: List[str] = list(workers.keys())

my_files: List[str] = ["list", "of", "files", "to", "be", "processed", "..."]

# This dictionary maps each worker address to a uniform subset of my_files
files_per_worker: Dict[str, List[str]] = {
    "worker_ip1": ["list", "to", "..."],      # files for worker1 only
    "worker_ip2": ["of", "be"],               # files for worker2 only
    "worker_ip3": ["files", "processed"],     # files for worker3 only
}

# Send each worker its subset of the work, pinned to that worker's address
futures = [client.submit(do_work, subset_of_files, workers=[ip])
           for (ip, subset_of_files) in files_per_worker.items()]

# Block until completion and reduce the partial results into the final result
result = finalize_partial_results([f.result() for f in futures])
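(The hard-coded dictionary above is just an illustration; the actual mapping is a uniform split of my_files across the selected worker addresses, built with something along the lines of the sketch below, where split_evenly is an illustrative helper of mine rather than anything from Dask.)

# Illustrative round-robin split: file i goes to worker i % n_workers,
# so every worker ends up with a roughly equal share of my_files.
def split_evenly(files: List[str], addresses: List[str]) -> Dict[str, List[str]]:
    split: Dict[str, List[str]] = {addr: [] for addr in addresses}
    for i, f in enumerate(files):
        split[addresses[i % len(addresses)]].append(f)
    return split

files_per_worker = split_evenly(my_files, worker_ips)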
A simplified summary of the results:
- One node is the slowest (not surprising)
- Five nodes is the fastest (taking about 25% as long as one node)
- At six nodes the runtime spikes: it takes about 80% longer than with five nodes, and is barely better than half the single-node time.
- Seven and eight nodes are pretty flat: there is little incremental gain, but the runtime also never gets back down to the five-node level.
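For context, each data point above comes from the same code run with a different number of workers pinned. The sweep looks roughly like this (a simplified sketch reusing the illustrative split_evenly helper; in particular, which n addresses get selected is illustrative, and only wall-clock time is recorded):

import time

timings = {}
for n in range(1, len(worker_ips) + 1):
    selected = worker_ips[:n]                  # which n addresses are chosen is illustrative
    split = split_evenly(my_files, selected)   # uniform split across the selected workers
    start = time.perf_counter()
    futs = [client.submit(do_work, files, workers=[addr])
            for (addr, files) in split.items()]
    finalize_partial_results([f.result() for f in futs])
    timings[n] = time.perf_counter() - start   # wall-clock seconds with n workers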
I would have expected eight - one worker per physical node - to be optimal, but that isn't the case. I also tested with input datasets of varying sizes; five nodes is always the fastest, and the big jump always appears at six nodes.
What might be causing this, and how can I avoid the performance degradation? As far as I can tell, each entry in worker_ips corresponds to a distinct physical node, so the work should be shared evenly across whichever subset of workers is selected.
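A quick way to verify that last claim is to group the worker addresses reported by the scheduler by host IP; with one worker per node, every host should appear exactly once. A minimal sketch, assuming the default tcp:// worker addresses:

from collections import defaultdict
from urllib.parse import urlparse

# Group worker addresses (e.g. 'tcp://10.0.0.5:40231') by host IP; one worker
# per physical node means each host should map to exactly one address.
workers_by_host = defaultdict(list)
for addr in client.scheduler_info()['workers']:
    workers_by_host[urlparse(addr).hostname].append(addr)

for host, addrs in sorted(workers_by_host.items()):
    print(host, len(addrs), addrs)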