Note: terminology gets confusing with this topic, since the term "partition" can be used to refer to both a shuffle partition (the data processed by a single spark task), and a hive partition (the segregated output files which allow for fast filter pushdowns). I've tried in this answer to always use the term "partition" to refer to hive partitions, and avoid talking about shuffle partitions in favor of talking about spark tasks, which are interchangeable for the purposes of this discussion.
How data is written with hive partitioning
When writing data with hive partitioning, each task in the final job stage will write out one file per partition present in that task's data; no hashing is involved at this point.
Assuming (as mentioned) the data is exactly equally distributed across all combinations of partitionBy columns, variations in file size can therefore only be caused by individual partitions being split across multiple tasks.
e.g. if I have two partitions A and B, each with 128MB of data, but the data for both is randomly distributed between two final-stage tasks (suppose you ran df.repartition() before writing to get a random distribution), then you would end up with four files of 64MB each: two for each partition.
Note: this can get much worse! If you had 200 final-stage tasks, you would end up with 400 tiny files in this example!
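To make this concrete, here is a minimal sketch of the scenario above in plain PySpark (the column name and output path are made up for illustration; in Foundry you would go through write_dataframe instead):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two equally sized hive partitions, "A" and "B".
df = spark.range(1_000_000).withColumn(
    "part", F.when(F.col("id") % 2 == 0, "A").otherwise("B")
)

# A plain repartition(2) shuffles rows round-robin, so both final-stage tasks
# hold rows from both "A" and "B"...
df.repartition(2).write.mode("overwrite").partitionBy("part").parquet("/tmp/hive_partition_demo")
# ...and each task writes one file per partition it sees: 2 tasks x 2 partitions = 4 files.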
How to avoid problems
To avoid these file size and file count problems, you should always ensure the data is suitably arranged across the output tasks by manually shuffling on the partitionBy columns before writing:
PARTITION_COLS = ["col1", "col2", "col3"]
# Shuffle by the partition columns so each hive partition is localized in one task.
out.write_dataframe(df.repartition(*PARTITION_COLS), partitionBy=PARTITION_COLS)
This will ensure that all the data for each partition is localized in a single task, and you end up with exactly one file per partition. Assuming, as we have, that the partitions are all equally sized, you will therefore have uniformly sized output files.
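Applied to the earlier two-partition sketch (same hypothetical names as before), the fix looks like this and produces exactly two files, one per hive partition:

# Shuffling by the partition column localizes all of "A" and all of "B" in a single task each,
# so the write produces exactly one file per hive partition.
df.repartition("part").write.mode("overwrite").partitionBy("part").parquet("/tmp/hive_partition_demo_fixed")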
Note: you CAN get hash collisions in the repartition() here. That can mean that some tasks process multiple partitions, and some none. But it will not affect output file sizes; at worst, some tasks will take longer to compute than others.
You may wonder why Foundry does not perform this repartition for you automatically. This is because in certain advanced cases (in particular when working with extremely large data) it may be desirable to have finer control over exactly how the repartition is performed, for example tweaking the number of output tasks, or adding a random salt to split up the hive partitions, as sketched below.
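For illustration only (out, df and the salt granularity of 8 are placeholders, mirroring the snippet above), those two advanced variants might look roughly like this:

from pyspark.sql import functions as F

# Variant 1: fix the number of shuffle tasks explicitly (200 here is arbitrary)
# instead of relying on the spark.sql.shuffle.partitions default.
out.write_dataframe(df.repartition(200, *PARTITION_COLS), partitionBy=PARTITION_COLS)

# Variant 2: add a random salt so one huge hive partition is spread over up to
# 8 tasks, giving several moderately sized files instead of one enormous one.
salted = df.withColumn("salt", (F.rand() * 8).cast("int"))
out.write_dataframe(
    salted.repartition(*PARTITION_COLS, "salt").drop("salt"),
    partitionBy=PARTITION_COLS,
)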
A note on repartitionByRange
repartitionByRange() is another tool you can use to control the layout of the shuffle partitions going into the final stage of your job.
It is unlikely to help you in this case: while it will ensure you get uniformly sized data in each output task, the range boundaries will not line up nicely with the boundaries of your hive partitions. This means that each hive partition will likely end up split into at least two unequal pieces, so calling repartitionByRange() before writing may actually make your problem worse.
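As a hypothetical illustration (column names invented): say you hive-partition by a date column but range-partition by a finer-grained event timestamp. The sampled range boundaries will usually fall somewhere inside a date, so that date's rows get split across two tasks and therefore written as two unevenly sized files:

# Boundaries are chosen by sampling "event_ts", not by looking at "date",
# so they rarely coincide with the hive partition boundaries.
df.repartitionByRange(10, "event_ts").write.partitionBy("date").parquet("/tmp/range_partition_demo")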