I am reading CSV files and writing them out as Parquet. Is there a way to save 128 MB Parquet blocks?
My current code is:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

filtredDf
  .repartition(96, col("mypart"))
  .write
  .option("compression", "snappy")
  .option("parquet.block.size", 32 * 1024 * 1024)
  .mode(SaveMode.Append)
  .partitionBy("mypart")
  .parquet(targetDirectory)
parquet.block.size doesn't seem to have any effect: each run creates a single Parquet file. As I understand it, I should play with .repartition and .coalesce to control the number of files created, but that requires me to know the size of the data I am writing...
What is the good practice here? The rough workaround I had in mind is sketched below.
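For reference, this is the kind of thing I was considering (a rough, untested sketch; the spark session variable, the csvPath input location and the 128 MB target are assumptions on my side): estimate the input size on disk and derive a partition count from it before writing.

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.functions.col

// Desired output file size: roughly 128 MB per Parquet file (assumption).
val targetFileSizeBytes = 128L * 1024 * 1024

// Sum the on-disk size of the source CSV directory via the Hadoop FileSystem API.
val fs = org.apache.hadoop.fs.FileSystem.get(spark.sparkContext.hadoopConfiguration)
val inputSizeBytes = fs.getContentSummary(new Path(csvPath)).getLength

// At least one partition; since raw CSV is usually larger than snappy-compressed
// Parquet, this tends to overestimate the file count rather than produce huge files.
val numPartitions = math.max(1, (inputSizeBytes / targetFileSizeBytes).toInt)

filtredDf
  .repartition(numPartitions, col("mypart"))
  .write
  .option("compression", "snappy")
  .mode(SaveMode.Append)
  .partitionBy("mypart")
  .parquet(targetDirectory)

It seems fragile, though, because the ratio between the raw CSV size and the compressed Parquet size varies from one dataset to another.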