I have a 2-column dataframe such as:
col1 | col2
-----------
a1   | b1
a2   | b1
a3   | b2
a1   | b2
a1   | b3
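I partition this dataframe by assigning each row a randomly generated partition number:
from pyspark.sql.functions import rand
# Assign each row a random integer in [0, num_partitions), then write one directory per value
df = df.withColumn("part", (rand() * num_partitions).cast("int"))
df.write.partitionBy("part").mode("overwrite").parquet("/address/")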
However, this partitioning does not guarantee that all rows where col1=a1
end up in the same partition. Is there a way to partition the dataframe so that all rows sharing a col1 value are guaranteed to land in the same partition?
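
To make the requirement concrete, here is a minimal plain-Python sketch of the property I am after: the partition number is derived deterministically from col1 instead of from a random draw, so equal col1 values always map to the same partition. The helper name assign_partition, the use of zlib.crc32, and num_partitions = 4 are assumptions for illustration only. In PySpark, the analogous idea would presumably be to replace rand() with something like pmod(hash("col1"), lit(num_partitions)) from pyspark.sql.functions, though I have not verified that this is the recommended approach.

```python
import zlib


def assign_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition id deterministically (same key -> same id)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Sample rows from the table above: (col1, col2)
rows = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a1", "b2"), ("a1", "b3")]
num_partitions = 4  # hypothetical value, not taken from the real job

# Group rows by the partition id computed from col1.
partitions = {}
for col1, col2 in rows:
    part = assign_partition(col1, num_partitions)
    partitions.setdefault(part, []).append((col1, col2))

# Record which partition ids each col1 value was assigned to.
parts_per_key = {}
for part, members in partitions.items():
    for col1, _ in members:
        parts_per_key.setdefault(col1, set()).add(part)

# Each col1 value should appear in exactly one partition.
for key, parts in sorted(parts_per_key.items()):
    print(key, sorted(parts))
```

Note that several distinct col1 values may still share a partition; the only guarantee is that a single col1 value is never split across partitions.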