
I need a way to control the output file size when saving txt/json to S3 using Java/Scala.

For example, I would like a rolling file size of 10 MB. How can I control this with DataFrame code?

  1. I have experimented with spark.sql.files.maxPartitionBytes. This does not give accurate control: e.g. if I set spark.sql.files.maxPartitionBytes=32MB, the output files come out at about 33 MB (see the config snippet after this list).

  2. The other option is to use repartition: df.rdd.repartition(n) will create n files, where n = input file size / roll file size, rounded up. E.g. with an input file size of 200 MB and a roll size of 32 MB, n = ceil(200/32) = 7, which creates 6 files of ~32 MB and one 8 MB file (see the sizing sketch below).
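
On point 1: spark.sql.files.maxPartitionBytes caps how many bytes Spark packs into a single partition when *reading* files, so it only indirectly shapes output file sizes, which may explain the ~33 MB results. For reference, it can be set per session like this:

```scala
// Caps bytes per read partition; output files only loosely track it.
spark.conf.set("spark.sql.files.maxPartitionBytes", "33554432") // 32 MB
```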
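
On point 2, here is a minimal sketch of that sizing logic in Scala. The helper name `writeWithTargetSize` and the 32 MB default are illustrative, not a Spark API; the input size comes from the Hadoop FileSystem API, and actual on-disk sizes will still drift with serialization and compression:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

object SizedWriter {
  // Hypothetical helper: estimate the input size, then repartition so each
  // output file lands near targetFileBytes.
  def writeWithTargetSize(spark: SparkSession,
                          df: DataFrame,
                          inputPath: String,
                          outputPath: String,
                          targetFileBytes: Long = 32L * 1024 * 1024): Unit = {
    val conf = spark.sparkContext.hadoopConfiguration
    val fs = new Path(inputPath).getFileSystem(conf)
    val inputBytes = fs.getContentSummary(new Path(inputPath)).getLength

    // Round up so no file should exceed the target
    // (e.g. 200 MB / 32 MB -> 7 files).
    val numFiles = math.max(1, math.ceil(inputBytes.toDouble / targetFileBytes).toInt)

    df.repartition(numFiles)
      .write
      .json(outputPath) // or .text(outputPath) for plain text
  }
}
```

Note that repartition distributes rows roughly evenly, so the files tend to come out near inputBytes / numFiles each rather than filling up to the target and leaving one small remainder.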

I'd appreciate any thoughts on controlling the output file size.

Thanks

vindev

0 Answers