When I save a dask dataframe with valid divisions, the divisions are not present when I read it back:
df.divisions # ['a', 'b', 'c', ...]
df.to_parquet('frame.pq', engine=engine, write_index=True, compute=True)
df2 = dask.dataframe.read_parquet('frame.pq', engine='pyarrow')
df2.divisions # [None, None, ...]
How can I get the divisions to be preserved in df2?
Do I need to
- save df differently?
- read df2 differently?
- somehow recover the divisions after reading df2?
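To make the last option concrete, here is a minimal sketch, in plain Python rather than dask, of what I understand "recovering the divisions" to mean: the minimum index value of each partition, followed by the maximum of the last one. The function name `recover_divisions` is hypothetical (it is not a dask API), and the sketch assumes every partition is non-empty and treats touching or overlapping partitions as having no valid divisions, which is a simplification.

```python
def recover_divisions(partition_indexes):
    """Return dask-style divisions for a list of per-partition index values.

    Hypothetical helper, not part of dask. Assumes every partition is
    non-empty. Returns None when the partitions are not strictly ordered,
    i.e. when no valid divisions can be established.
    """
    mins = [min(idx) for idx in partition_indexes]
    maxes = [max(idx) for idx in partition_indexes]

    # Each partition must end strictly before the next one begins
    # (a conservative assumption made for this sketch).
    for prev_max, next_min in zip(maxes, mins[1:]):
        if prev_max >= next_min:
            return None

    # npartitions + 1 boundaries: every partition's minimum, then the
    # last partition's maximum.
    return tuple(mins) + (maxes[-1],)


partitions = [["a", "b"], ["c", "d"], ["e", "f"]]
divisions = recover_divisions(partitions)
```

With real data this would mean computing the index min and max of every partition after reading, which I would rather avoid if the information can be kept in the file in the first place.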
Following @mdurant's suggestion in the comments, I have found that divisions are preserved when using engine='fastparquet'.
Unfortunately fastparquet is having trouble serializing my data.
Why would pyarrow lose the division information when fastparquet does not?