This is a follow-up to an earlier question of mine.

I'm having trouble persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers, each on its own machine, connected by 40-gigabit Ethernet and backed by a Lustre filesystem.

Problem 1:

import dask.array

# `client` is an existing dask.distributed.Client connected to the scheduler
ds = DataSlicer(dataset)  # custom array-like wrapper over a ~600 GB dataset
dask_array = dask.array.from_array(ds, chunks=(13507, -1, -1), name=False)  # 28 chunks, ~22 GB each
dask_array = client.persist(dask_array)
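
For context, DataSlicer is a custom wrapper around the on-disk data. dask.array.from_array only needs an object exposing .shape, .dtype, .ndim, and numpy-style __getitem__; a minimal, hypothetical sketch (with a memmap standing in for our actual Lustre reads) looks like this:

import numpy as np

class DataSlicer:
    """Hypothetical stand-in: an array-like view over an on-disk dataset."""

    def __init__(self, path, shape, dtype=np.float32):
        self._mm = np.memmap(path, mode="r", dtype=dtype, shape=shape)
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.ndim = len(shape)

    def __getitem__(self, idx):
        # from_array issues one __getitem__ call per chunk when a task runs
        return np.asarray(self._mm[idx])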

When I watch the Dask status dashboard, all 28 tasks are assigned to and processed by a single worker while the other workers do nothing. Worse, once every task has finished and is in the "In memory" state, only 22 GB of RAM (i.e. the first chunk of the dataset) is actually held on the cluster. Access to indices within the first chunk is fast, but any other index forces a new round of reading and loading the data before the result returns. This contradicts my understanding that .persist() should pin the complete dataset across the workers' memory once it finishes executing. On top of that, when I increase the chunk size, one worker often runs out of memory and restarts because it has been assigned multiple huge chunks of data.
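
To see the skew concretely, I check placement right after persisting. A quick inspection like the following (Client.has_what maps each worker address to the keys held in its memory, and wait blocks until the persisted tasks finish) confirms that every key lands on one worker:

from dask.distributed import wait

dask_array = client.persist(dask_array)
wait(dask_array)  # block until all chunk tasks have finished

# worker address -> keys resident in that worker's memory
for worker, keys in client.has_what().items():
    print(worker, len(keys), "keys")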

Is there a way to manually assign chunks to workers instead of the scheduler piling all of the tasks on one process? Or is this abnormal scheduler behavior? Is there a way to ensure that the entire dataset is loaded into RAM?
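
For reference, the kind of manual assignment I have in mind is sketched below. I am assuming that persist's workers= keyword (which restricts where a collection's tasks may run) and Client.rebalance() are the relevant knobs; I have not confirmed that either actually solves the problem:

from dask.distributed import wait

workers = list(client.scheduler_info()["workers"])  # current worker addresses

# Option 1: pin each ~22 GB chunk to a specific worker, round-robin
pieces = [dask_array[i * 13507:(i + 1) * 13507] for i in range(28)]
pieces = [client.persist(piece, workers=[workers[i % len(workers)]])
          for i, piece in enumerate(pieces)]

# Option 2: persist anywhere, wait for completion, then spread the data evenly
dask_array = client.persist(dask_array)
wait(dask_array)
client.rebalance()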

Problem 2:

I found a temporary workaround by treating each chunk of the dataset as its own separate dask array and persisting each one individually.

# one single-chunk dask array per slice of the dataset
dask_arrays = [da.from_delayed(lazy_slice, shape, dtype, name=False)
               for lazy_slice, shape in zip(lazy_slices, shapes)]

# persist each array individually
dask_arrays = [client.persist(arr) for arr in dask_arrays]
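
As an aside, Client.persist also accepts a sequence of collections, so the loop collapses into a single call. I don't know whether handing the scheduler all 28 graphs at once changes placement, but it avoids the per-array round trips:

# submit every chunk graph in a single scheduler round trip
dask_arrays = client.persist(dask_arrays)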

I tested the bandwidth from persisted and published dask arrays to several parallel readers by calling .compute() on different chunks of the dataset in parallel. I could never achieve more than 2 GB/s aggregate bandwidth from the dask cluster, far below what our 40-gigabit links (roughly 5 GB/s each) should sustain.
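
For concreteness, each parallel reader does something like the following (the scheduler address and dataset name are placeholders; the arrays were published beforehand with client.publish_dataset):

import time
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address
arr = client.get_dataset("my_dataset")   # retrieves a published dask array

start = time.perf_counter()
chunk = arr[0:13507].compute()           # each reader pulls a different slice
elapsed = time.perf_counter() - start
print(chunk.nbytes / elapsed / 1e9, "GB/s")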

Is the scheduler the bottleneck in this situation, i.e. is all data being funneled through the scheduler to my readers? If so, is there a way to fetch in-memory data directly from each worker? If not, what other areas of dask should I investigate?
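
For the second question: I noticed that Client.gather takes a direct= flag, and Client itself a direct_to_workers= option, both of which are supposed to fetch results straight from the workers holding them rather than routing through the scheduler. Is a sketch like this the intended way to use them?

from dask.distributed import Client, futures_of

client = Client("tcp://scheduler:8786", direct_to_workers=True)  # placeholder address

# gather persisted chunks straight from the workers that hold them
futures = futures_of(dask_arrays)
chunks = client.gather(futures, direct=True)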
