I am constructing a very large DAG in dask to submit to the distributed scheduler, where nodes operate on dataframes which themselves can be quite large. One pattern is that I have about 50-60 functions that load data and construct pandas dataframes that are several hundred MB each (and logically represent partitions of a single table). I would like to concatenate these into a single dask dataframe for downstream nodes in the graph, while minimizing the data movement. I link the tasks like this:
dfs = [dask.delayed(load_pandas)(i) for i in disjoint_set_of_dfs]
dfs = [dask.delayed(pandas_to_dask)(df) for df in dfs]
return dask.delayed(concat_all)(dfs)
where
def pandas_to_dask(df):
    # one partition per loaded frame; from_pandas needs an explicit npartitions (or chunksize)
    return dask.dataframe.from_pandas(df, npartitions=1).to_delayed()
and I have tried various concat_all implementations, but this seems reasonable:
def concat_all(dfs):
    dfs = [dask.dataframe.from_delayed(df) for df in dfs]
    return dask.dataframe.multi.concat(dfs, axis='index', join='inner')
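For reference, here is a stripped-down, self-contained sketch of the pattern I'm describing, with synthetic data and a single partition per loaded frame (the real load_pandas functions each pull several hundred MB; the from_delayed wrapping happens directly here rather than inside a delayed call, so this is a simplification):

import pandas as pd
import dask
import dask.dataframe as dd

def load_pandas(i):
    # stand-in for the real loaders; each frame's index range is sorted and
    # disjoint from every other frame's
    return pd.DataFrame({'x': range(5)}, index=range(5 * i, 5 * i + 5))

# lazily load each partition, wrap it as a one-partition dask dataframe,
# then concatenate along the index -- the same three steps as above
delayed_frames = [dask.delayed(load_pandas)(i) for i in range(4)]
meta = pd.DataFrame({'x': pd.Series(dtype='int64')})
ddfs = [dd.from_delayed([d], meta=meta) for d in delayed_frames]
ddf = dd.concat(ddfs, axis=0, join='inner')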
All the pandas dataframes are disjoint on their index and sorted / monotonic.
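Concretely, the invariant I'm relying on, shown with toy frames, is that each frame is internally sorted and the index ranges of consecutive frames never overlap:

import pandas as pd

frames = [
    pd.DataFrame({'x': [1.0, 2.0]}, index=[0, 1]),
    pd.DataFrame({'x': [3.0, 4.0]}, index=[2, 3]),
]
# every frame is monotonic on its index...
assert all(f.index.is_monotonic_increasing for f in frames)
# ...and the index ranges of consecutive frames are disjoint
bounds = [(f.index[0], f.index[-1]) for f in frames]
assert all(prev_hi < next_lo for (_, prev_hi), (next_lo, _) in zip(bounds, bounds[1:]))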
However, I'm seeing killed workers: they die on this concat_all function (the cluster manager kills them for exceeding their memory budgets), even though the memory budget on each worker is reasonably large and I wouldn't expect this step to be moving data around. I'm reasonably certain that I always slice down to a reasonable subset of the data before calling compute() within graph nodes that use the dask dataframe.
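To be explicit about what I mean by slicing before computing, a downstream node does something roughly like this (standalone illustration; the index bounds and partition count are made up):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'x': range(100)}, index=range(100))
ddf = dd.from_pandas(pdf, npartitions=4)  # known, sorted divisions
subset = ddf.loc[25:40]                   # slice down to the rows the node needs
result = subset.compute()                 # only partitions overlapping that range are materialized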
I have been experimenting with --memory-limit, without success so far. Am I at least approaching the problem correctly? Are there considerations I'm missing?
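For what it's worth, my understanding is that --memory-limit is the per-worker limit passed to dask-worker; a minimal local sketch of the equivalent setup (the numbers are placeholders, not my actual deployment) looks like:

from dask.distributed import Client, LocalCluster

# illustrative only: an explicit per-worker memory limit on a local cluster,
# mirroring the dask-worker --memory-limit flag; the real workers are launched
# by the cluster manager with different values
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='8GB')
client = Client(cluster)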