5

I am using Spark Streaming to read JSON data from a Kafka topic.
I process the data with a DataFrame and later want to save the output to HDFS files. The problem is that using:

df.write.mode("append").format("text").save("/path/to/output")

yields many files; some are large and some are even 0 bytes.

Is there a way to control the number of output files? And, to avoid the "opposite" problem, is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?

DigitalFailure
  • There's `coalesce`/`repartition` for the first part and nothing clean and easy for the second. You should probably use the bash command `split` for that. – philantrovert Jun 05 '18 at 13:28
  • @philantrovert that's not true, since Spark 2.2 you can use `maxRecordsPerFile`, e.g. `df.write.option("maxRecordsPerFile", 10000)`, see e.g. http://www.gatorsmile.io/anticipated-feature-in-spark-2-2-max-records-written-per-file/ – Raphael Roth Jun 05 '18 at 17:43
  • @RaphaelRoth Thanks a lot for this! I didn't know about this at all. This is extremely useful. – philantrovert Jun 05 '18 at 17:51
  • @RaphaelRoth That's exactly what I was looking for! Thanks. – DigitalFailure Jun 06 '18 at 05:40

2 Answers

4

The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

  • For Datasets with no wide dependencies you can control the input partitioning using reader-specific parameters
  • For Datasets with wide dependencies you can control the number of partitions with the `spark.sql.shuffle.partitions` parameter.
  • Independently of the lineage, you can `coalesce` or `repartition` (a minimal sketch follows this list).
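
A minimal sketch of the last option, assuming a local SparkSession, a placeholder DataFrame and a placeholder output path (none of these are from the original post); one output file is written per partition:

import org.apache.spark.sql.SparkSession

object OutputFilesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("control-output-files")
      .master("local[*]")              // placeholder master for a local run
      .getOrCreate()
    import spark.implicits._

    // spark.sql.shuffle.partitions controls the partition count produced by
    // wide dependencies (joins, aggregations, etc.)
    spark.conf.set("spark.sql.shuffle.partitions", "50")

    val df = (1 to 1000).map(i => (i, s"record-$i")).toDF("id", "value") // placeholder data
    val outputPath = "/tmp/output-sketch"                                // placeholder path

    // repartition (full shuffle) or coalesce (narrow, only reduces the count)
    // right before the write to control the number of output files
    df.repartition(10)
      .write
      .mode("append")
      .json(outputPath)   // yields 10 part files (plus _SUCCESS)

    spark.stop()
  }
}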

is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?

No. With built-in writers it is a strictly 1:1 relationship between partitions and files.

  • Since Spark 2.2 you can use `maxRecordsPerFile`, e.g. `df.write.option("maxRecordsPerFile", 10000)`, see http://www.gatorsmile.io/anticipated-feature-in-spark-2-2-max-records-written-per-file/ – Raphael Roth Jun 05 '18 at 17:44
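
A minimal sketch of the option from the comment above, assuming Spark 2.2+ and the same placeholder `df`/`outputPath` as in the earlier sketch; `maxRecordsPerFile` caps the number of rows written to each output file, rolling over to a new file within a partition once the limit is reached:

// df and outputPath are placeholders, not from the original post
df.write
  .option("maxRecordsPerFile", 10000)  // at most 10,000 rows per output file
  .mode("append")
  .json(outputPath)
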
4

You can use the size estimator:

import org.apache.spark.util.SizeEstimator

// rough estimate of the in-memory size of the DataFrame, in bytes
val size = SizeEstimator.estimate(df)

Next, you can adapt the number of output files according to the size of the DataFrame with `repartition` or `coalesce`, as sketched below.
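
A minimal sketch of that idea, with a hypothetical helper name, a placeholder output path, a placeholder JSON output format and an assumed target of 128 MB per file (all illustrative, not from the answer):

import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// hypothetical helper: derive a partition count from the estimated size
// and a target bytes-per-file, then repartition before writing
def writeWithTargetFileSize(df: DataFrame,
                            outputPath: String,
                            targetBytesPerFile: Long = 128L * 1024 * 1024): Unit = {
  val estimatedBytes = SizeEstimator.estimate(df)  // rough in-memory estimate
  val numFiles = math.max(1, (estimatedBytes / targetBytesPerFile).toInt)

  df.repartition(numFiles)  // one output file per partition
    .write
    .mode("append")
    .json(outputPath)
}

// e.g. writeWithTargetFileSize(df, "/tmp/output-sketch")  // df = the DataFrame estimated above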