I need a way to control the output file size when saving txt/json to S3 using Java/Scala.
For example, I would like a rolling file size of 10 MB. How can I control this from DataFrame code?
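For reference, this is roughly how the write looks today; the bucket and paths are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("s3-output-size")
  .getOrCreate()

// Read the input and write it back out as JSON; Spark produces one file per partition.
val df: DataFrame = spark.read.json("s3a://my-bucket/input/")
df.write
  .mode("overwrite")
  .json("s3a://my-bucket/output/")
```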
I have experimented with spark.sql.files.maxPartitionBytes, but it does not give accurate control. For example, if I set spark.sql.files.maxPartitionBytes=32MB, the output files come out at about 33 MB.
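A minimal sketch of how I set that option, reusing the spark session from above; the 32 MB value and the paths are just for illustration:

```scala
// Cap the size of the input splits so each task writes a smaller output file.
// Set the conf before reading the input.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)  // 32 MB

spark.read.json("s3a://my-bucket/input/")
  .write
  .mode("overwrite")
  .json("s3a://my-bucket/output-32mb/")
```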
The other option is to use repartition: df.rdd.repartition(n) will create n output files, where n = input file size / roll file size. For example, with an input file of 200 MB and a roll size of 32 MB, n = ceil(200/32) = 7, which creates 6 files of about 32 MB and one 8 MB file.
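This is roughly that workaround, using the DataFrame-level repartition rather than df.rdd.repartition since the result still has to go through df.write; the sizes and paths are illustrative:

```scala
// Derive n from the input size and the target roll size.
// The input size is hard-coded here for illustration; in practice it would come from
// the S3 object metadata (or Hadoop's FileSystem API).
val rollSizeBytes  = 32L * 1024 * 1024            // target ~32 MB per output file
val inputSizeBytes = 200L * 1024 * 1024           // e.g. a 200 MB input file
val n = math.ceil(inputSizeBytes.toDouble / rollSizeBytes).toInt  // 200/32 -> 7

df.repartition(n)
  .write
  .mode("overwrite")
  .json("s3a://my-bucket/output-rolled/")
```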
Appreciate any thoughts about controlling the output file size.
Thanks