When generating a parquet file from the same CSV file, Dask produced many small files (over 200 files of roughly 3 MB each), while the R Sergeant approach produced 2 .parquet files (520 MB and 280 MB).
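For context, the Dask side looked roughly like this (a minimal sketch; the file paths are placeholders). As far as we understand, Dask writes one part file per DataFrame partition, which is why the output directory ends up with so many small files:

```python
import dask.dataframe as dd

# Placeholder paths; dask writes one .parquet part file per partition,
# so a CSV that is split into ~200 partitions yields ~200 small files.
df = dd.read_csv("data.csv")
df.to_parquet("out_parquet", engine="fastparquet")
```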
We tried to use fastparquet.write with the row_group_offset keyword, but had no success.
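This is roughly what we tried with fastparquet (a minimal sketch; the paths and the row count are placeholders, and the keyword appears as row_group_offsets in the fastparquet documentation):

```python
import pandas as pd
import fastparquet

# Placeholder path; row_group_offsets controls row-group boundaries:
# an int is an approximate number of rows per row group, a list gives
# the explicit row indices at which new row groups start.
pdf = pd.read_csv("data.csv")
fastparquet.write(
    "out.parquet",
    pdf,
    row_group_offsets=500_000,   # assumed value: ~500k rows per row group
    compression="SNAPPY",
)
```

Note that row groups only change the internal layout of a single file; they did not give us control over the number or size of the files themselves.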
Using partition_on in Dask added a set of partitions, but within each partition directory there are still many sub-.parquet files (hundreds or even thousands).
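A sketch of that attempt (the partition column name and the target size are placeholders); repartitioning before the write is one way we thought of to get fewer, larger part files per directory, but we are not sure it is the intended approach:

```python
import dask.dataframe as dd

# Placeholder path and column name; partition_on creates one directory per
# distinct value of "year", but each directory still receives one part file
# per Dask partition. Repartitioning first reduces the number of part files.
df = dd.read_csv("data.csv")
df = df.repartition(partition_size="256MB")   # fewer, larger partitions
df.to_parquet(
    "out_partitioned",
    engine="fastparquet",
    partition_on=["year"],
)
```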
How can we control the size of the parquet files in Python and in R?