5

I am using Spark Streaming to read JSON data from a Kafka topic.
I process the data with a DataFrame and later want to save the output to HDFS files. The problem is that using:

df.write.mode("append").format("text").save("/path/to/output")

yields many files; some are large and some are even 0 bytes.

Is there a way to control the number of output files? And, to avoid the "opposite" problem, is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?

DigitalFailure
  • There's `coalesce`/`repartition` for the first part and nothing clean and easy for the second. You should probably use the bash command `split` for that. – philantrovert Jun 05 '18 at 13:28
  • @philantrovert that's not true, since Spark 2.2 you can use `maxRecordsPerFile`, e.g. `df.write.option("maxRecordsPerFile", 10000)`, see e.g. http://www.gatorsmile.io/anticipated-feature-in-spark-2-2-max-records-written-per-file/ – Raphael Roth Jun 05 '18 at 17:43
  • @RaphaelRoth Thanks a lot for this! I didn't know about this at all. This is extremely useful. – philantrovert Jun 05 '18 at 17:51
  • @RaphaelRoth That's exactly what I was looking for! Thanks. – DigitalFailure Jun 06 '18 at 05:40

2 Answers

4

The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

  • For Datasets with no wide dependencies you can control the input partitioning using reader-specific parameters
  • For Datasets with wide dependencies you can control the number of partitions with the `spark.sql.shuffle.partitions` parameter.
  • Independently of the lineage, you can `coalesce` or `repartition` (a minimal sketch follows this list).
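
A minimal sketch of the last option, assuming a local SparkSession, a placeholder DataFrame and a placeholder output path (none of these are from the original post); one output file is written per partition:

import org.apache.spark.sql.SparkSession

object OutputFilesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("control-output-files")
      .master("local[*]")              // placeholder master for a local run
      .getOrCreate()
    import spark.implicits._

    // spark.sql.shuffle.partitions controls the partition count produced by
    // wide dependencies (joins, aggregations, etc.)
    spark.conf.set("spark.sql.shuffle.partitions", "50")

    val df = (1 to 1000).map(i => (i, s"record-$i")).toDF("id", "value") // placeholder data
    val outputPath = "/tmp/output-sketch"                                // placeholder path

    // repartition (full shuffle) or coalesce (narrow, only reduces the count)
    // right before the write to control the number of output files
    df.repartition(10)
      .write
      .mode("append")
      .json(outputPath)   // yields 10 part files (plus _SUCCESS)

    spark.stop()
  }
}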

is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?

No. With built-in writers it is a strictly 1:1 relationship between partitions and files.

  • Since Spark 2.2 you can use `maxRecordsPerFile`, e.g. `df.write.option("maxRecordsPerFile", 10000)`, see http://www.gatorsmile.io/anticipated-feature-in-spark-2-2-max-records-written-per-file/ – Raphael Roth Jun 05 '18 at 17:44
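
A minimal sketch of the option from the comment above, assuming Spark 2.2+ and the same placeholder `df`/`outputPath` as in the earlier sketch; `maxRecordsPerFile` caps the number of rows written to each output file, rolling over to a new file within a partition once the limit is reached:

// df and outputPath are placeholders, not from the original post
df.write
  .option("maxRecordsPerFile", 10000)  // at most 10,000 rows per output file
  .mode("append")
  .json(outputPath)
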
4

You can use the size estimator:

import org.apache.spark.util.SizeEstimator

// rough estimate of the in-memory size of the DataFrame, in bytes
val size = SizeEstimator.estimate(df)

Next, you can adapt the number of output files according to the size of the DataFrame with `repartition` or `coalesce`, as sketched below.
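
A minimal sketch of that idea, with a hypothetical helper name, a placeholder output path, a placeholder JSON output format and an assumed target of 128 MB per file (all illustrative, not from the answer):

import org.apache.spark.sql.DataFrame
import org.apache.spark.util.SizeEstimator

// hypothetical helper: derive a partition count from the estimated size
// and a target bytes-per-file, then repartition before writing
def writeWithTargetFileSize(df: DataFrame,
                            outputPath: String,
                            targetBytesPerFile: Long = 128L * 1024 * 1024): Unit = {
  val estimatedBytes = SizeEstimator.estimate(df)  // rough in-memory estimate
  val numFiles = math.max(1, (estimatedBytes / targetBytesPerFile).toInt)

  df.repartition(numFiles)  // one output file per partition
    .write
    .mode("append")
    .json(outputPath)
}

// e.g. writeWithTargetFileSize(df, "/tmp/output-sketch")  // df = the DataFrame estimated above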