
I am reading CSV files and writing them out as Parquet. Is there a way to save the data in 128 MB Parquet blocks?

My current code is:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

filtredDf
    .repartition(96, col("mypart"))
    .write
    .option("compression", "snappy")
    .option("parquet.block.size", 32 * 1024 * 1024)
    .mode(SaveMode.Append)
    .partitionBy("mypart")
    .parquet(targetDirectory)

parquet.block.size doesn't seem to have any effect: each run creates a single Parquet file. As I understand it, I should play with .repartition and .coalesce to control the number of created files, but that requires me to know the size of the data I am writing...

What is the good practice here?

Rolintocour
    Have a look here (`parquet.block.size`): https://stackoverflow.com/questions/27194333/how-to-split-parquet-files-into-many-partitions-in-spark – Aydin K. Sep 20 '18 at 14:11

1 Answer


If you are targeting a specific block size for better concurrency and/or data locality, then parquet.block.size is indeed the right setting. It does not limit the file size, but it does limit the row group size inside the Parquet files. Each of these row groups can be processed independently of the others, and if the file is stored on HDFS, data locality can also be taken advantage of.
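
As an illustration, here is a minimal sketch of a write with a 128 MB row group target, reusing the names from the question (filtredDf, mypart, targetDirectory) and assuming a SparkSession named spark. Setting the property on the Hadoop configuration is shown as an alternative to the per-write option, and the partition count of 96 is purely illustrative:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Target 128 MB row groups; parquet.block.size is read by the Parquet writer
// from the Hadoop configuration of the write job.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

filtredDf
    .repartition(96, col("mypart"))
    .write
    .option("compression", "snappy")
    .mode(SaveMode.Append)
    .partitionBy("mypart")
    .parquet(targetDirectory)

Note that the number of output files is still governed by the number of partitions per mypart value; parquet.block.size only affects how each file is split into row groups internally.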

To inspect the inner structure of a Parquet file, you can use the parquet-tools meta command.
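
For example (the path is illustrative), `parquet-tools meta /path/to/part-00000.snappy.parquet` prints each row group together with its column chunks and sizes, which lets you verify that the configured block size took effect.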

Zoltan