
I am constructing a very large DAG in dask to submit to the distributed scheduler, where nodes operate on dataframes which can themselves be quite large. One pattern: I have about 50-60 functions that each load data and construct a pandas dataframe of several hundred MB (together they logically represent partitions of a single table). I would like to concatenate these into a single dask dataframe for downstream nodes in the graph, while minimizing data movement. I link the tasks like this:

dfs = [dask.delayed(load_pandas)(i) for i in disjoint_set_of_dfs]
dfs = [dask.delayed(pandas_to_dask)(df) for df in dfs]
return dask.delayed(concat_all)(dfs)

where

def pandas_to_dask(df):
    # from_pandas requires a partition count (or chunksize)
    return dask.dataframe.from_pandas(df, npartitions=1).to_delayed()

and I have tried various concat_all implementations, but this seems reasonable:

def concat_all(dfs):
    dfs = [dask.dataframe.from_delayed(df) for df in dfs]
    return dask.dataframe.multi.concat(dfs, axis='index', join='inner')

All the pandas dataframes are disjoint on their index and sorted / monotonic.

However, workers are being killed on this concat_all function (the cluster manager kills them for exceeding their memory budgets), even though the memory budget on each worker is reasonably large and I wouldn't expect this step to move data around. I'm reasonably certain that I always slice down to a reasonable subset of the data before calling compute() within graph nodes that use the dask dataframe.

I am playing with --memory-limit without success so far. Am I approaching the problem correctly at least? Are there considerations I'm missing?

Adam Klein

1 Answer


Given your list of delayed values that compute to pandas dataframes

>>> dfs = [dask.delayed(load_pandas)(i) for i in disjoint_set_of_dfs]
>>> type(dfs[0].compute())  # just checking that this is true
pandas.DataFrame

Pass them to the dask.dataframe.from_delayed function

>>> ddf = dd.from_delayed(dfs)

By default this will run the first computation in order to determine metadata (column names, dtypes, etc. that are important for dask.dataframe). You can avoid this by constructing an example dataframe and passing it to the meta= keyword.

>>> meta = pd.DataFrame({'value': [1.0], 'name': ['foo'], 'id': [0]})
>>> ddf = dd.from_delayed(dfs, meta=meta)

This example notebook may also be helpful.

Generally you will never need to call dask functions from within other dask functions (as you were doing by delaying the from_pandas call). Dask.dataframe functions are themselves already lazy and don't need to be delayed further.

MRocklin
  • Thanks for your quick response. I observe that dd.from_delayed(dfs) immediately evaluates `dfs[0]` in order to extract metadata. For some reason, this is causing problems for me. Is there another way to defer this evaluation until the graph is fully constructed? I will try to put together a repro. – Adam Klein Jun 07 '17 at 03:34
  • You can provide an example dataframe to the `meta=` keyword. I'll add an example in the answer. – MRocklin Jun 07 '17 at 12:25