
When generating parquet output from the same CSV file, Dask produced many small files (over 200 files of about 3 MB each), while R's Sergeant produced 2 .parquet files (520 MB and 280 MB).
We tried fastparquet.write with the row_group_offsets keyword but had no success. Using partition_on in Dask added a set of partitions, but within each partition there are many sub .parquet files (hundreds or even thousands).
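
Roughly what we tried looks like this (the file names, partition column, and row-group size below are placeholders, not our real values):

```python
import pandas as pd
import fastparquet
import dask.dataframe as dd

# fastparquet: try to control row-group size directly
pdf = pd.read_csv('data.csv')
fastparquet.write('out.parquet', pdf, row_group_offsets=5000000)

# Dask: write partitioned output -- each partition directory still ends up
# containing many small .parquet files
ddf = dd.read_csv('data.csv')
ddf.to_parquet('out_dir', partition_on=['some_column'])
```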

How can we control the size of the parquet files in Python and in R?

skibee

1 Answer


fastparquet, the default parquet writer for dask, will make at least one parquet file per chunk of input data, or more if you use partition_on or row_group_offsets - these also act on the input data chunks one at a time. By default, the number of chunks equals the number of CSV files.
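
As a minimal illustration (the CSV paths are placeholders), reading many small CSVs gives one partition per file, and writing straight away then gives at least one parquet file per partition:

```python
import dask.dataframe as dd

# one partition per CSV file (for files smaller than the read blocksize)
df = dd.read_csv('data/*.csv')
print(df.npartitions)    # e.g. ~200 for ~200 small CSVs

# at least one parquet file is written per partition
df.to_parquet('out/')
```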

In order to decrease the number of chunks, you must reshuffle your data (this can be expensive, and so is only done when explicitly required), e.g.,

df = df.repartition(npartitions=10)

before writing. If necessary, you can also try the above with force=True.
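
Putting it together, a minimal sketch (the paths and target partition count are placeholders):

```python
import dask.dataframe as dd

df = dd.read_csv('data/*.csv')       # many small input chunks

# shuffle the data down to a handful of larger partitions
df = df.repartition(npartitions=10)
# if that is refused, try: df = df.repartition(npartitions=10, force=True)

df.to_parquet('out/')                # roughly one parquet file per partition
```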

Note that in many cases it makes sense to do this repartition/rechunk operation in combination with setting an index (set_index) and semi-sorting the data, which can give better performance for later queries.
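
For example (the index column name is just a placeholder), a combined sort-and-coarsen before writing might look like:

```python
import dask.dataframe as dd

df = dd.read_csv('data/*.csv')

# sort/partition by a column that later queries will filter on,
# then coarsen to fewer, larger partitions before writing
df = df.set_index('timestamp')
df = df.repartition(npartitions=10)

df.to_parquet('out/')
```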

mdurant
  • Thank you for your reply - I managed to repartition the files. Now I'm trying to understand the considerations for choosing the partition size. The [documentation in fastparquet](https://fastparquet.readthedocs.io/en/latest/details.html#partitions-and-row-groups) does not explain what to do in case the file does not have a high-cardinality column. – skibee Aug 06 '17 at 06:33
  • Such things often depend on the use case. Smaller partitions on an index that is useful for selection, or on a categorical where you are likely to want only some values, mean you don't have to read all the data; but bigger partitions are always more efficient to read. Make sure the partition size *in memory* is always much smaller than working RAM, especially for parallel processing. The HDFS block size, typically 128 MB, aims at this rule of thumb. – mdurant Aug 06 '17 at 16:44
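
A quick way to check the in-memory size per partition mentioned in the comment above (a sketch, assuming the data has already been loaded into a dask dataframe from placeholder paths):

```python
import dask.dataframe as dd

df = dd.read_csv('data/*.csv')

# approximate in-memory size of each partition, in bytes
sizes = df.map_partitions(lambda part: part.memory_usage(deep=True).sum()).compute()
print(sizes)   # keep each partition well below the RAM available per worker
```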