I have a 2-column dataframe such as:

col1 | col2
-----|-----
a1   | b1
a2   | b1
a3   | b2
a1   | b2
a1   | b3

I partition this dataframe using a randomly generated partition key:

from pyspark.sql.functions import rand

df = df.withColumn("part", (rand() * num_partitions).cast("int"))
df.write.partitionBy("part").mode("overwrite").parquet("/address/")

However, with this partitioning, there is no guarantee that all rows where col1=a1 end up in the same partition. Is there any way to get that guarantee while partitioning the dataframe?

A.M.
  • Why not just `partitionBy("col1")`? – Kombajn zbożowy Sep 30 '22 at 20:42
  • @Kombajnzbożowy because the distribution of values in `col1` is very skewed. For some values we have only a few rows; for others we could have up to a million. Hence, the partitions would not be equally sized. – A.M. Sep 30 '22 at 20:49
  • Well, in that case allocating all rows with col1=a1 into one partition is something to avoid, no? – Kombajn zbożowy Sep 30 '22 at 21:06
  • That should be fine as long as the sizes of all partitions are roughly the same. We don't want one partition of size 10^6 and another of size 10. However, if all partitions have size 10^6, we should be fine. – A.M. Sep 30 '22 at 22:19
  • I guess you would need to count `col1` values and manually assign partition numbers if you want both grouping and equal sizes (see the sketch after these comments). – bzu Oct 01 '22 at 09:58
  • And what is the total count of the dataset and number of distinct col1 values? – Kombajn zbożowy Oct 01 '22 at 11:05
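
Building on bzu's comment, here is a minimal sketch of counting `col1` values and assigning partition numbers by hand. It assumes a SparkSession named `spark`, the question's `df`, and a hypothetical `num_partitions`; the greedy binning is just one possible balancing policy:

import heapq
from pyspark.sql.functions import broadcast

num_partitions = 8  # assumed target; tune to your data

# 1. Count rows per distinct col1 value (assumes the number of distinct
#    values is small enough to collect to the driver).
counts = df.groupBy("col1").count().collect()

# 2. Greedily place the next-largest key into the currently smallest
#    bucket, so bucket sizes stay roughly equal.
heap = [(0, p) for p in range(num_partitions)]  # (bucket_row_count, bucket_id)
heapq.heapify(heap)
assignment = []
for row in sorted(counts, key=lambda r: r["count"], reverse=True):
    size, bucket = heapq.heappop(heap)
    assignment.append((row["col1"], bucket))
    heapq.heappush(heap, (size + row["count"], bucket))

# 3. Attach the bucket id via a broadcast join; every row sharing a col1
#    value now gets the same "part".
mapping = spark.createDataFrame(assignment, ["col1", "part"])
df = df.join(broadcast(mapping), on="col1", how="left")
df.write.partitionBy("part").mode("overwrite").parquet("/address/")

The broadcast join keeps the lookup cheap, since the mapping has only one row per distinct `col1` value.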

1 Answer


You can repartition the dataset on `part`, e.g. `repartition(num_partitions, "part")`; this will reduce the skew you would get from partitioning on `col1` directly. Then, while writing, specify `col1` in the `partitionBy` expression.

df.write.partitionBy("col1").mode("overwrite").parquet("/address/")
Nithish