I have a 2-column dataframe such as:

col1 | col2
-----|-----
a1   | b1
a2   | b1
a3   | b2
a1   | b2
a1   | b3

I partition this dataframe using a randomly generated partition key:

from pyspark.sql.functions import rand

df = df.withColumn("part", (rand() * num_partitions).cast("int"))
df.write.partitionBy("part").mode("overwrite").parquet("/address/")

However, with this partitioning, there is no guarantee that all rows where col1=a1 end up in the same partition. Is there any way to get that guarantee while partitioning the dataframe?

A.M.
  • Why not just `partitionBy("col1")`? – Kombajn zbożowy Sep 30 '22 at 20:42
  • @Kombajnzbożowy because the distribution of values in `col1` is very skewed. For some values we have only a few rows; for others we could have up to a million. Hence, the partitions would not be equally sized. – A.M. Sep 30 '22 at 20:49
  • Well, in that case allocating all rows with col1=a1 into one partition is something to avoid, no? – Kombajn zbożowy Sep 30 '22 at 21:06
  • That should be fine as long as the sizes of all partitions are roughly the same. We don't want one partition of size 10^6 and another of size 10. However, if all partitions have size 10^6, we should be fine. – A.M. Sep 30 '22 at 22:19
  • I guess you would need to count `col1` values and manually assign partition numbers if you want both grouping and equal sizes (see the sketch after these comments). – bzu Oct 01 '22 at 09:58
  • And what is the total count of the dataset and number of distinct col1 values? – Kombajn zbożowy Oct 01 '22 at 11:05
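
Building on bzu's comment, here is a minimal sketch of counting `col1` values and assigning partition numbers by hand. It assumes a SparkSession named `spark`, the question's `df`, and a hypothetical `num_partitions`; the greedy binning is just one possible balancing policy:

import heapq
from pyspark.sql.functions import broadcast

num_partitions = 8  # assumed target; tune to your data

# 1. Count rows per distinct col1 value (assumes the number of distinct
#    values is small enough to collect to the driver).
counts = df.groupBy("col1").count().collect()

# 2. Greedily place the next-largest key into the currently smallest
#    bucket, so bucket sizes stay roughly equal.
heap = [(0, p) for p in range(num_partitions)]  # (bucket_row_count, bucket_id)
heapq.heapify(heap)
assignment = []
for row in sorted(counts, key=lambda r: r["count"], reverse=True):
    size, bucket = heapq.heappop(heap)
    assignment.append((row["col1"], bucket))
    heapq.heappush(heap, (size + row["count"], bucket))

# 3. Attach the bucket id via a broadcast join; every row sharing a col1
#    value now gets the same "part".
mapping = spark.createDataFrame(assignment, ["col1", "part"])
df = df.join(broadcast(mapping), on="col1", how="left")
df.write.partitionBy("part").mode("overwrite").parquet("/address/")

The broadcast join keeps the lookup cheap, since the mapping has only one row per distinct `col1` value.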

1 Answer


You can repartition the dataset on `part`, e.g. `repartition(num_partitions, "part")`; this will reduce the skew you would get from partitioning on `col1` directly. Then, while writing, specify `col1` in the `partitionBy` expression.

df.write.partitionBy("col1").mode("overwrite").parquet("/address/")
Nithish