
I have 33 multi-partition dataframes, each with its own metadata, all written with fastparquet. The structure looks something like:

- 20190101.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet
- 20190102.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet
- 20190103.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet

I would like to join these all together.

I currently have:

import dask.dataframe as dd

dfs = []
for date in dates:
    df = dd.read_parquet(f'{date}.parquet', engine='fastparquet')
    dfs.append(df)
df = dd.concat(dfs)

This returns a dask dataframe (named "concat" in the graph) with 129,294 tasks.

I am then trying to write this out:

df.to_parquet('out.parquet', engine='fastparquet')

This last call never starts work. That is:

- my notebook cell is running
- the dask system page shows a growing number of file descriptors, which then flattens out
- the dask system page shows memory increasing, then still increasing but more slowly
- but tasks do not appear in the task stream

I have waited for up to 1 hour.

(Running on dask 2.3.0)

birdsarah
  • If I pin dask and distributed to 2.1.0 then tasks appear in "Progress" after 52s and start running and appear in "Task Stream" after another ~2min. – birdsarah Aug 20 '19 at 00:12

2 Answers


I sincerely hope that all of these have a sorted index column along which you are joining them. Otherwise this is likely to be very expensive.

If they do have such a column, you might want to call it out explicitly.
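
For example, a minimal sketch of calling the index out explicitly (assuming a shared, sorted index column, hypothetically named 'timestamp' here, that every dataset was written with):

import dask.dataframe as dd

# 'timestamp' is a hypothetical column name; use whatever sorted column
# the datasets actually share as their index
dfs = [
    dd.read_parquet(f'{date}.parquet', engine='fastparquet', index='timestamp')
    for date in dates
]
df = dd.concat(dfs)  # concatenating along a known, sorted index avoids an expensive shuffle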

MRocklin
  • They do not. However, this problem is https://github.com/dask/dask/issues/5321 - the graph never arrives at the workers. – birdsarah Aug 31 '19 at 01:29

You can just pass a list of filenames to fastparquet and it will read them as one dataset, which you can then load into a dask or pandas dataframe.

This is how I read a directory of parquet files and scatter the result onto a dask cluster:

output = ["some list of files..."]
# read all files as one dataframe, pull it into pandas, then scatter it to the workers
df = client.scatter(dd.read_parquet(output, engine="fastparquet").reset_index().compute())
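
A self-contained version of the same idea (the client address and the file list are placeholders, not part of the original snippet) might look like:

from dask.distributed import Client
import dask.dataframe as dd

client = Client()  # or Client("scheduler-address:8786") to attach to an existing cluster
output = ["some list of files..."]  # placeholder: the parquet part files to read

# read every file as one dataframe, materialize it in pandas,
# then scatter the result to the workers
pdf = dd.read_parquet(output, engine="fastparquet").reset_index().compute()
future = client.scatter(pdf)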