Note: terminology gets confusing with this topic, since the term "partition" can be used to refer to both a shuffle partition (the data processed by a single spark task), and a hive partition (the segregated output files which allow for fast filter pushdowns). I've tried in this answer to always use the term "partition" to refer to hive partitions, and avoid talking about shuffle partitions in favor of talking about spark tasks, which are interchangeable for the purposes of this discussion.
How data is written with hive partitioning
When writing data with hive partitioning, each task in the final job stage will write out one file per partition present in that task's data; no hashing is involved at this point.
Assuming (as mentioned) the data is exactly equally distributed across all combinations of partitionBy columns, variations in file size can therefore only be caused by individual partitions being split across multiple tasks.
e.g. if I have two partitions A and B, each with 128MB of data, but the data for both is randomly distributed between two final-stage tasks (suppose you ran df.repartition() before writing to get a random distribution), then you would end up with four files of 64MB each: two for each partition.
Note: this can get much worse! If you had 200 final-stage tasks, you would end up with 400 tiny files in this example!
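To make this concrete, here is a minimal sketch of the scenario above in plain PySpark (the column name and output path are made up for illustration; in Foundry you would go through write_dataframe instead):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two equally sized hive partitions, "A" and "B".
df = spark.range(1_000_000).withColumn(
    "part", F.when(F.col("id") % 2 == 0, "A").otherwise("B")
)

# A plain repartition(2) shuffles rows round-robin, so both final-stage tasks
# hold rows from both "A" and "B"...
df.repartition(2).write.mode("overwrite").partitionBy("part").parquet("/tmp/hive_partition_demo")
# ...and each task writes one file per partition it sees: 2 tasks x 2 partitions = 4 files.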
How to avoid problems
To avoid these file size and file count problems, you should always ensure the data is suitably arranged across the output tasks by manually shuffling on the partitionBy columns before writing:
PARTITION_COLS = ["col1", "col2", "col3"]
# Shuffle by the partition columns so each hive partition is localized in one task.
out.write_dataframe(df.repartition(*PARTITION_COLS), partitionBy=PARTITION_COLS)
This will ensure that all the data for each partition is localized in a single task, and you end up with exactly one file per partition. Assuming, as we have, that the partitions are all equally sized, you will therefore have uniformly sized output files.
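Applied to the earlier two-partition sketch (same hypothetical names as before), the fix looks like this and produces exactly two files, one per hive partition:

# Shuffling by the partition column localizes all of "A" and all of "B" in a single task each,
# so the write produces exactly one file per hive partition.
df.repartition("part").write.mode("overwrite").partitionBy("part").parquet("/tmp/hive_partition_demo_fixed")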
Note: you CAN get hash collisions in the repartition() here. That can mean that some tasks process multiple partitions, and some none. But it will not affect output file sizes; at worst, some tasks will take longer to compute than others.
You may wonder why Foundry does not perform this repartition for you automatically. This is because in certain advanced cases (in particular when working with extremely large data) it may be desirable to have finer control over exactly how the repartition is performed, for example tweaking the number of output tasks, or adding a random salt to split up the hive partitions, as sketched below.
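For illustration only (out, df and the salt granularity of 8 are placeholders, mirroring the snippet above), those two advanced variants might look roughly like this:

from pyspark.sql import functions as F

# Variant 1: fix the number of shuffle tasks explicitly (200 here is arbitrary)
# instead of relying on the spark.sql.shuffle.partitions default.
out.write_dataframe(df.repartition(200, *PARTITION_COLS), partitionBy=PARTITION_COLS)

# Variant 2: add a random salt so one huge hive partition is spread over up to
# 8 tasks, giving several moderately sized files instead of one enormous one.
salted = df.withColumn("salt", (F.rand() * 8).cast("int"))
out.write_dataframe(
    salted.repartition(*PARTITION_COLS, "salt").drop("salt"),
    partitionBy=PARTITION_COLS,
)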
A note on repartitionByRange
repartitionByRange() is another tool you can use to control the layout of the shuffle partitions going into the final stage of your job.
It is unlikely to help you in this case: while it will ensure you get uniformly sized data in each output task, the range boundaries will not line up nicely with the boundaries of your hive partitions. This means that each hive partition will likely end up split into at least two unequal pieces, so calling repartitionByRange() before writing may actually make your problem worse.
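As a hypothetical illustration (column names invented): say you hive-partition by a date column but range-partition by a finer-grained event timestamp. The sampled range boundaries will usually fall somewhere inside a date, so that date's rows get split across two tasks and therefore written as two unevenly sized files:

# Boundaries are chosen by sampling "event_ts", not by looking at "date",
# so they rarely coincide with the hive partition boundaries.
df.repartitionByRange(10, "event_ts").write.partitionBy("date").parquet("/tmp/range_partition_demo")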