
I have 33 multi-partition dataframes, each with its own metadata, all written with fastparquet. The structure looks something like:

- 20190101.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet
- 20190102.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet
- 20190103.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet

I would like to join these all together.

I currently have:

import dask.dataframe as dd

dfs = []
for date in dates:
    df = dd.read_parquet(f'{date}.parquet', engine='fastparquet')
    dfs.append(df)
df = dd.concat(dfs)

This returns a dask dataframe (named "concat" in the graph) with 129,294 tasks.

I am then trying to write this out:

df.to_parquet('out.parquet', engine='fastparquet')

This last call never starts work. That is:

- my notebook cell is running
- the dask system page shows a growing number of file descriptors, which then flattens out
- the dask system page shows memory increasing, then still increasing but more slowly
- but tasks do not appear in the task stream

I have waited for up to 1 hour.

(Running on dask 2.3.0)

birdsarah
  • If I pin dask and distributed to 2.1.0 then tasks appear in "Progress" after 52s and start running and appear in "Task Stream" after another ~2min. – birdsarah Aug 20 '19 at 00:12

2 Answers


I sincerely hope that all of these have a sorted index column along which you are joining them. Otherwise this is likely to be very expensive.

If they do have such a column, you might want to call it out explicitly.
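
For example, a minimal sketch of calling the index out explicitly (assuming a shared, sorted index column, hypothetically named 'timestamp' here, that every dataset was written with):

import dask.dataframe as dd

# 'timestamp' is a hypothetical column name; use whatever sorted column
# the datasets actually share as their index
dfs = [
    dd.read_parquet(f'{date}.parquet', engine='fastparquet', index='timestamp')
    for date in dates
]
df = dd.concat(dfs)  # concatenating along a known, sorted index avoids an expensive shuffle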

MRocklin
  • They do not. However, this problem is https://github.com/dask/dask/issues/5321 - the graph never arrives at the workers. – birdsarah Aug 31 '19 at 01:29

You can just pass a list of filenames to fastparquet and it will read them as one dataset, which you can then load into a dask or pandas dataframe.

This is how I read a directory of parquet files and scatter the result onto a dask cluster:

output = ["some list of files..."]
# read all files as one dataframe, pull it into pandas, then scatter it to the workers
df = client.scatter(dd.read_parquet(output, engine="fastparquet").reset_index().compute())
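
A self-contained version of the same idea (the client address and the file list are placeholders, not part of the original snippet) might look like:

from dask.distributed import Client
import dask.dataframe as dd

client = Client()  # or Client("scheduler-address:8786") to attach to an existing cluster
output = ["some list of files..."]  # placeholder: the parquet part files to read

# read every file as one dataframe, materialize it in pandas,
# then scatter the result to the workers
pdf = dd.read_parquet(output, engine="fastparquet").reset_index().compute()
future = client.scatter(pdf)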