
I'm trying to output records using the following code:

spark.createDataFrame(asRow, struct)
      .write
      .partitionBy("foo", "bar")
      .format("text")
      .save("/some/output-path")

I don't have a problem when the data is small. However, when I'm processing ~600 GB of input, I end up writing around 290k files, including many small files per partition. Is there a way to control the number of output files per partition? Right now I am writing a lot of small files, and that's not good.

minyo
  • Do you use HDFS as the file system? If so you can merge text like this: https://stackoverflow.com/questions/42433869/merge-csv-files-in-one-file. – wind May 02 '18 at 06:53
  • You can take a look here (it's for parquet but should be the same for all formats): https://stackoverflow.com/questions/34789604/dataframe-partitionby-to-a-single-parquet-file-per-partition – Shaido May 02 '18 at 06:58
  • Thank you very much @wind and Shaido for the answers. I'm doing a lot of transformation on my input, which is why I need to write it out from Spark. The problem I am facing right now is having a lot of small files per partition. The block size of our HDFS cluster is 128MB, so it's better if each file per partition is near or above the block size. But at the moment I don't know if there's a DataFrame function available to do that. – minyo May 02 '18 at 07:29
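The target file count the last comment is aiming for can be estimated up front. A rough sketch in Python; the ~600 GB input size and 128 MB block size come from this thread, and the helper name is just illustrative:

```python
import math

def target_partition_count(total_bytes: int, block_size_bytes: int) -> int:
    """Estimate how many output partitions are needed so that each
    written file lands near (or above) the HDFS block size."""
    return max(1, math.ceil(total_bytes / block_size_bytes))

# Figures from the thread: ~600 GB of input, 128 MB HDFS blocks.
total = 600 * 1024**3
block = 128 * 1024**2
print(target_partition_count(total, block))  # 4800 files instead of ~290k
```

A number in this ballpark could then be passed to Spark's repartitioning before the write, instead of letting the default shuffle partitioning decide the file count.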

2 Answers


Having lots of files is the expected behavior: each Spark partition (resulting from whatever computation you had before the write) writes its own files into each of the output partitions you requested.

If you wish to avoid that you need to repartition before the write:

spark.createDataFrame(asRow, struct)
      .repartition("foo", "bar")
      .write
      .partitionBy("foo", "bar")
      .format("text")
      .save("/some/output-path")
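If you also want to control how many shuffle partitions back the write (rather than the default spark.sql.shuffle.partitions), repartition has an overload that takes an explicit count alongside the columns. A sketch in the question's own Scala API; the count 4800 is illustrative (roughly 600 GB / 128 MB blocks), and note that each (foo, bar) value still hashes to a single partition, so one very large value still produces one file:

```scala
import org.apache.spark.sql.functions.col

// Sketch: bound the number of shuffle partitions while still clustering
// rows by the output partition columns before the partitioned write.
spark.createDataFrame(asRow, struct)
      .repartition(4800, col("foo"), col("bar"))
      .write
      .partitionBy("foo", "bar")
      .format("text")
      .save("/some/output-path")
```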
Arnon Rotem-Gal-Oz

You have multiple files per partition because each node writes its output to its own file. That means the only way to get a single file per partition is to repartition the data before writing. Please note that this will be quite expensive, because repartitioning shuffles your data across the cluster.

Vladislav Varslavans