I'm trying to output records using the following code:
spark.createDataFrame(asRow, struct)
  .write
  .partitionBy("foo", "bar")
  .format("text")
  .save("/some/output-path")
This works fine when the data is small. However, when I process roughly 600 GB of input, the job writes around 290k output files, many of them small files within each partition directory. Is there a way to control the number of output files per partition? Right now I'm producing a large number of small files, which is a problem.
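For reference, the only idea I've come across so far is to repartition by the partition columns before the write, so that all rows for a given (foo, bar) combination end up in a single task and therefore a single file per partition directory. A sketch of what that might look like (the repartition call is my guess at a fix, not something from my current job, and I haven't verified it at the 600 GB scale):

import org.apache.spark.sql.functions.col

spark.createDataFrame(asRow, struct)
  .repartition(col("foo"), col("bar")) // route all rows with the same (foo, bar) to one task, aiming for one file per output partition
  .write
  .partitionBy("foo", "bar")
  .format("text")
  .save("/some/output-path")

I've also seen the spark.sql.files.maxRecordsPerFile setting mentioned, but as I understand it that only splits files that would otherwise be too large; it wouldn't merge small ones, so I'm not sure it helps here.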