
I created a synthetic dataset and I am trying to experiment with repartitioning based on one column. The objective is to end up with a balanced (equal-size) set of partitions, but I cannot achieve this. Is there a way it can be done, preferably without resorting to RDDs and without saving the dataframe?

Example code:

import random

import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.appName('learn').getOrCreate()
nr = 500
data = {'id': [random.randint(0,5) for _ in range(nr)], 'id2': [random.randint(0,5) for _ in range(nr)]}
data = pd.DataFrame(data)
df = spark.createDataFrame(data)
# df.show()
df = df.repartition(3, 'id')
# see the different partitions
for ipart in range(3):
    print(f'partition {ipart}')
    def fpart(partition_idx, iterator, target_partition_idx=ipart):
        # keep only the rows belonging to the target partition
        if partition_idx == target_partition_idx:
            return iterator
        else:
            return iter(())
    res = df.rdd.mapPartitionsWithIndex(fpart)
    res = res.toDF(schema=df.schema)
    # res.show(n=5, truncate=False)
    print(f"number of rows {res.count()}, unique ids {res.select('id').drop_duplicates().toPandas()['id'].tolist()}")

It produces:

partition 0
number of rows 79, unique ids [3]
partition 1
number of rows 82, unique ids [0]
partition 2
number of rows 339, unique ids [5, 1, 2, 4]

so the partitions are clearly not balanced.

I saw in How to guarantee repartitioning in Spark Dataframe that this is expected, because rows are assigned to partitions based on the hash of column id modulo 3 (the number of partitions):

df.select('id', f.expr("hash(id)"), f.expr("pmod(hash(id), 3)")).drop_duplicates().show()

which produces

+---+-----------+-----------------+
| id|   hash(id)|pmod(hash(id), 3)|
+---+-----------+-----------------+
|  3|  519220707|                0|
|  0|-1670924195|                1|
|  1|-1712319331|                2|
|  5| 1607884268|                2|
|  4| 1344313940|                2|
|  2| -797927272|                2|
+---+-----------+-----------------+

I find this strange, though. The point of specifying the column in the repartition function is to split the values of id across different partitions. If the column id had more than the 6 unique values in this example the result would be better, but still not balanced.
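To illustrate why a small number of distinct keys leads to skew, here is a small pure-Python simulation (using random 32-bit values as a stand-in for Spark's actual Murmur3 hash, so the specific bin assignments differ from Spark's): each distinct key goes entirely to partition hash(key) % 3, so every partition's size is a whole multiple of a single key's row count, and with only six keys there is no room for the sizes to average out.

```python
import random
from collections import Counter

random.seed(0)

def partition_sizes(n_keys, n_parts=3, rows_per_key=100):
    # mimic hash partitioning: every row of a distinct key lands in the
    # same partition, hash(key) % n_parts; a random 32-bit value stands
    # in for Spark's Murmur3 hash (an approximation, not Spark's values)
    key_hash = {k: random.getrandbits(32) for k in range(n_keys)}
    sizes = Counter(key_hash[k] % n_parts
                    for k in range(n_keys)
                    for _ in range(rows_per_key))
    return sorted(sizes.values())

# few distinct keys: each partition size is a multiple of 100, often skewed
print(partition_sizes(6))
# many distinct keys: sizes average out and get much closer to equal
print(partition_sizes(600))
```

With only 6 keys the best possible split is 200/200/200 and it is only reached when the hashes happen to spread 2/2/2; with 600 keys the multiples are small relative to the totals, so the partitions come out nearly even.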

Is there a way to achieve this?

