I have 33 multi-partition dataframes, each with its own metadata, all written with fastparquet. The on-disk structure looks something like:
- 20190101.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet
- 20190102.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet
- 20190103.parquet
  - _common_metadata
  - _metadata
  - part.0.parquet
  - ....
  - part.n.parquet
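For context, each dated directory is just the output of a fastparquet write of a multi-partition dataframe, along these lines (illustrative only, with made-up data and partition count):

```python
import pandas as pd
import dask.dataframe as dd

# Illustrative only: writing one day's multi-partition dataframe with the
# fastparquet engine produces the _common_metadata / _metadata /
# part.*.parquet layout shown above.
day = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
day.to_parquet('20190101.parquet', engine='fastparquet')
```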
I would like to concatenate all of these into a single dataframe.
I currently have:
```python
import dask.dataframe as dd

# dates is the list of 33 date strings (e.g. '20190101')
dfs = []
for date in dates:
    df = dd.read_parquet(f'{date}.parquet', engine='fastparquet')
    dfs.append(df)
df = dd.concat(dfs)
```
This returns a Dask dataframe whose Dask name is "concat", with 129,294 tasks.
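For reference, that task count is just the size of the dataframe's graph; a quick way to inspect it (the exact numbers will of course depend on the data):

```python
# Inspect the concatenated dataframe's graph
print(df.npartitions)            # number of output partitions
print(len(df.__dask_graph__()))  # total number of tasks (~129,294 in my case)
```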
I then try to write this out:

```python
df.to_parquet('out.parquet', engine='fastparquet')
```
This last call never starts any work. That is:

* my notebook cell is running
* the dask system page shows a growing number of file descriptors, which then flattens out
* the dask system page shows memory increasing, then still increasing but more slowly
* but no tasks appear in the task stream
I have waited for up to 1 hour.
(Running on dask 2.3.0)
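For what it's worth, the write step can also be expressed as graph construction followed by an explicit compute, using to_parquet's compute keyword; a sketch (same output path as above, not something I have timed separately):

```python
# Build the write graph without executing it...
writes = df.to_parquet('out.parquet', engine='fastparquet', compute=False)
# ...then trigger execution explicitly.
writes.compute()
```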