I need a way to control the output file size when saving txt/json to S3 using Java/Scala.
For example, I would like a rolling file size of 10 MB. How can I control this from DataFrame code?
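For reference, this is roughly how the write looks today; the bucket and paths are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("s3-output-size")
  .getOrCreate()

// Read the input and write it back out as JSON; Spark produces one file per partition.
val df: DataFrame = spark.read.json("s3a://my-bucket/input/")
df.write
  .mode("overwrite")
  .json("s3a://my-bucket/output/")
```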
I have experimented with spark.sql.files.maxPartitionBytes, but it does not give accurate control. For example, if I set spark.sql.files.maxPartitionBytes=32MB, the output files come out at about 33 MB.
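A minimal sketch of how I set that option, reusing the spark session from above; the 32 MB value and the paths are just for illustration:

```scala
// Cap the size of the input splits so each task writes a smaller output file.
// Set the conf before reading the input.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)  // 32 MB

spark.read.json("s3a://my-bucket/input/")
  .write
  .mode("overwrite")
  .json("s3a://my-bucket/output-32mb/")
```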
The other option is to use repartition: df.rdd.repartition(n) will create n output files, where n = input file size / roll file size. For example, with an input file of 200 MB and a roll size of 32 MB, n = ceil(200/32) = 7, which creates 6 files of about 32 MB and one 8 MB file.
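This is roughly that workaround, using the DataFrame-level repartition rather than df.rdd.repartition since the result still has to go through df.write; the sizes and paths are illustrative:

```scala
// Derive n from the input size and the target roll size.
// The input size is hard-coded here for illustration; in practice it would come from
// the S3 object metadata (or Hadoop's FileSystem API).
val rollSizeBytes  = 32L * 1024 * 1024            // target ~32 MB per output file
val inputSizeBytes = 200L * 1024 * 1024           // e.g. a 200 MB input file
val n = math.ceil(inputSizeBytes.toDouble / rollSizeBytes).toInt  // 200/32 -> 7

df.repartition(n)
  .write
  .mode("overwrite")
  .json("s3a://my-bucket/output-rolled/")
```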
Appreciate any thoughts about controlling the output file size.
Thanks