dask.compute(...) is expected to be a blocking call. However, when I have a nested dask.compute, and the inner one does I/O (like dask.dataframe.read_parquet), the inner dask.compute does not seem to block. Here's a pseudo-code example:
import dask
import dask.dataframe as dd
import distributed

def outer_func(name):
    files = find_files_for_name(name)          # placeholder: returns a list of parquet paths
    df = inner_func(files).compute()           # expected to block until the inner graph finishes
    # do work with df
    return result

def inner_func(files):
    dfs = [dd.read_parquet(f) for f in files]  # lazy dask dataframes
    return dd.concat(dfs)

client = distributed.Client(scheduler_file=...)
results = dask.compute([dask.delayed(outer_func)(name) for name in names])
If I started 2 workers with 8 processes each, like:
dask-worker --scheduler-file $sched_file --nprocs 8 --nthreads 1
then I would expect at most 2 x 8 = 16 concurrent inner_func calls, because inner_func(files).compute() should block. However, what I observed was that within a single worker process, as soon as the read_parquet step started, another inner_func(files).compute() could start running. So in the end there could be multiple inner_func(files).compute() calls running concurrently, and sometimes this caused out-of-memory errors.
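For what it's worth, this is roughly how I observed the overlap: I added logging of the process and thread id around the inner compute (a sketch only; the logger name is arbitrary and find_files_for_name is the same placeholder as in the pseudo-code above):

import os
import threading
import logging

log = logging.getLogger("outer_func")

def outer_func(name):
    files = find_files_for_name(name)   # placeholder, as above
    log.info("starting inner compute, pid=%s thread=%s", os.getpid(), threading.get_ident())
    df = inner_func(files).compute()
    log.info("finished inner compute, pid=%s", os.getpid())
    # do work with df
    return result

With something like this, two "starting" lines with the same pid and no intervening "finished" line would indicate that a second inner compute began in the same worker process while the first was still blocked.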
Is this expected behavior? If so, is there any way to enforce one inner_func(files).compute() per worker process?
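One workaround I've been wondering about (not sure whether it actually applies here) is to use worker resources to cap concurrency: start each worker process with a made-up resource such as "slot=1" and request one slot per outer_func task, roughly:

# hypothetical: give every worker process a single "slot"
#   dask-worker --scheduler-file $sched_file --nprocs 8 --nthreads 1 --resources "slot=1"
# then request one slot per outer_func task, so at most one should run per process
client = distributed.Client(scheduler_file=...)
tasks = [dask.delayed(outer_func)(name) for name in names]
futures = client.compute(tasks, resources={"slot": 1})
results = client.gather(futures)

But I don't know whether the slot stays held while the task is blocked on the nested compute, or gets released when the worker decides to start another task, so guidance on the right approach would be appreciated.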