When I save a dask dataframe with valid divisions, the divisions are not present when I read it back:

import dask.dataframe

engine = 'pyarrow'  # the divisions are lost with this engine
df.divisions  # ['a', 'b', 'c', ...]
df.to_parquet('frame.pq', engine=engine, write_index=True, compute=True)
df2 = dask.dataframe.read_parquet('frame.pq', engine='pyarrow')
df2.divisions  # [None, None, ...]

How can I get the divisions to be preserved in df2?

Do I need to

  • save df differently?
  • read df2 differently?
  • somehow recover the divisions after reading df2?

Following @mdurant's suggestion in the comments, I have found that the divisions are preserved when using engine='fastparquet'. Unfortunately, fastparquet has trouble serializing my data.

Why would pyarrow lose the division information when fastparquet does not?

Daniel Mahler
  • Have you also tried with `engine='fastparquet'`? – mdurant Jan 05 '18 at 14:04
  • @mdurant saving and reading with fastparquet does preserve the division information. Do you have any idea why pyarrow loses it? – Daniel Mahler Jan 06 '18 at 03:45
  • I only wrote the fastparquet side of this; certainly the information on the max/mins of the index columns will be available in the parquet metadata. Also, fastparquet can read *some* list-like things, depending on the nesting of the schema. For writing, J/BSON encoding is easiest, but not well supported in other parquet frameworks. – mdurant Jan 06 '18 at 16:55
  • @mdurant I have a column that contains lists of 300 floats, i.e. all lists are the same length with no nesting. It still makes fastparquet unhappy (the reason I switched to pyarrow). – Daniel Mahler Jan 08 '18 at 08:11
  • By "nesting", I meant the structure of the schema, which will depend on the tool you used to write the data. – mdurant Jan 08 '18 at 13:55
  • Our experiments indicate that fastparquet retrieves the first index value of the partition as division value, and not the division as it was set on the dask dataframe. Is this correct by design? @mdurant – Paul-Armand Verhaegen Feb 11 '19 at 12:44
  • Only the max/min of the values within the row-groups (partitions) gets stored, not the original division values - parquet files do not hold any dask-specific information. – mdurant Feb 11 '19 at 14:27
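
To make the last two comments concrete, here is a minimal sketch in plain Python of how divisions could be rebuilt from per-row-group min/max index statistics. It is not the actual dask or fastparquet code. The function name `divisions_from_stats` and the `(min, max)` tuple layout are assumptions made up for illustration. The idea is that each partition contributes its minimum index value as a division, and the last partition also contributes its maximum. If the statistics are missing or the ranges overlap, the divisions stay unknown.

```python
from typing import Any, List, Optional, Tuple


def divisions_from_stats(
    stats: List[Tuple[Optional[Any], Optional[Any]]]
) -> Tuple[Optional[Any], ...]:
    """Rebuild divisions from (min, max) index statistics, one pair per partition.

    Hypothetical helper for illustration only. Returns a tuple of
    len(stats) + 1 values, all None when the divisions cannot be inferred.
    """
    unknown = (None,) * (len(stats) + 1)
    if not stats:
        return unknown

    # Every partition needs usable statistics.
    for lo, hi in stats:
        if lo is None or hi is None or lo > hi:
            return unknown

    # Partitions must be sorted and must not overlap.
    for (_, prev_hi), (next_lo, _) in zip(stats, stats[1:]):
        if prev_hi > next_lo:
            return unknown

    # Lower bound of each partition, plus the upper bound of the last one.
    return tuple(lo for lo, _ in stats) + (stats[-1][1],)


# Three partitions whose index values happen to start at 'a', 'e' and 'g'.
recovered = divisions_from_stats([("a", "c"), ("e", "f"), ("g", "k")])
print(recovered)

# Overlapping ranges cannot be turned into divisions.
print(divisions_from_stats([("a", "e"), ("d", "f")]))
```

Under these assumptions, a frame whose divisions were set to ('a', 'd', 'g', 'm') but whose second partition actually starts at 'e' would come back with 'e' rather than 'd', which matches the behaviour described by Paul-Armand Verhaegen above.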

0 Answers