
I'm trying to understand the repartition() behavior in the SQL context. I have a dataframe with 178 rows. One of the columns is a unique Id related to the data. In my dataframe, I know that I have 2 rows for each unique Id.

I want to be able to control the number of records in each partition. In my case I want 89 partitions with 2 records each.

Following the documentation (http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html?highlight=repartition#pyspark.sql.DataFrame.repartition), I'm doing the following:

df = spark.read \
        .parquet("my_data_path") \
        .repartition(89, "Id") \
        .withColumn('result_col', some_udf("data"))

df.persist()

df.write.format("org.elasticsearch.spark.sql").mode('append').save()

But back in the Spark UI, while the job is running, I can see that the repartitioning is bad: the records are not spread evenly across the partitions.

[Screenshot of the tasks summary in the Spark UI: https://postimg.cc/3d948XPV]
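
For reference, the same distribution can be checked from code. This is a minimal sketch, assuming the `df` built above (the 89 partitions come from the repartition call):

from pyspark.sql.functions import spark_partition_id

# Tag each row with the partition it actually landed in,
# then count the rows per partition
df.withColumn("partition_id", spark_partition_id()) \
  .groupBy("partition_id") \
  .count() \
  .orderBy("partition_id") \
  .show(89, truncate=False)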

So there is something I'm misunderstanding about repartition() with a column. I tried to add some salt to my Id column but nothing changed at all. My question is: how can I control the number of records per partition, and can I do it using repartition()?

Thanks everyone

Lenjyco
  • Mmhh, do you mean that what I'm running into is this case: `Hash partitioner is neither injective nor surjective. Multiple keys can be assigned to a single partition and some partitions can remain empty.`? (see the sketch after these comments) – Lenjyco Apr 11 '19 at 12:47
  • You don't need to repartition by any key, just give a number and the rest will be handled for you. – howie Apr 11 '19 at 20:32
  • Already tried this and as a result I got a bad distribution; my aim is really to control the number of records in each partition. – Lenjyco Apr 12 '19 at 07:55
  • Hash partitioner will not do that. You can achieve that with a custom partitioner where you control the partition each key gets assigned to and that is available only for RDDs afaik. – Traian Apr 12 '19 at 11:00
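
To illustrate the hash-partitioner behaviour described in these comments, here is a minimal sketch. My understanding is that `repartition(89, "Id")` assigns each row to a partition with something equivalent to `pmod(hash(Id), 89)`, so several Ids can collide on the same partition while other partitions stay empty; `df` and the `Id` column are the ones from the question:

from pyspark.sql.functions import expr

# Approximate the partition each Id would get from repartition(89, "Id"):
# hash() is Spark's Murmur3-based SQL hash, pmod() keeps the result non-negative
df.select("Id", expr("pmod(hash(Id), 89)").alias("expected_partition")) \
  .groupBy("expected_partition") \
  .count() \
  .orderBy("expected_partition") \
  .show(89, truncate=False)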

1 Answer


Found the solution, sharing it for people who are looking for it.

The solution was to leave the DataFrame API and use the RDD partitionBy() function:


from pyspark.sql import Window
from pyspark.sql.functions import row_number

df = spark.read \
        .parquet("my_data_path")

# We create a window in order to add an index to our rows
w = Window.orderBy("random_field_sort")

# Add the index: row_number() starts at 1, the modulo keeps it in [0, my_repartition_value)
df = df.withColumn("index", row_number().over(w) % my_repartition_value)

schema = df.schema

# Use the index as key in order to create an RDD of (key, value) pairs
df = df.rdd.map(lambda x: (x["index"], x))

# The main point: the repartition with partitionBy,
# then revert back to the original structure of the data
rdd = df.partitionBy(my_repartition_value).map(lambda x: x[1])

# Good to go
df = spark.createDataFrame(rdd, schema)

df = df.withColumn('result_col', some_udf("data"))

df.persist()

df.write.format("org.elasticsearch.spark.sql").mode('append').save()
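
To verify the result, here is a minimal sketch that counts the records per partition (it assumes the `df` built above, before the write):

# glom() gathers each partition into a list, so the length of each list
# is the number of records in that partition
partition_sizes = df.rdd.glom().map(len).collect()

print(len(partition_sizes))   # expected: my_repartition_value (89 in the question)
print(set(partition_sizes))   # expected: {2} records per partition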

Lenjyco