I'm trying to understand the repartition()
behavior in a SQL context. I have a dataframe with 178 rows. One of the columns is a unique id related to the data, and I know that my dataframe contains exactly 2 rows for each unique Id.
I want to be able to control the number of records in each partition. In my case I want 89 partitions with 2 records each.
Following the documentation (http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html?highlight=repartition#pyspark.sql.DataFrame.repartition), I'm doing the following:
df = spark.read \
    .parquet("my_data_path") \
    .repartition(89, "Id") \
    .withColumn('result_col', some_udf("data"))
df.persist()
df.write.format("org.elasticsearch.spark.sql").mode('append').save()
But back in the Spark UI while the job is running, I can see that the repartitioning is not what I expect: the partitions do not each hold 2 records.
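To double-check what the UI shows, here is a small sketch of how I'm counting records per partition (spark_partition_id comes from pyspark.sql.functions; this check is just for verifying the distribution, not part of my actual job):

from pyspark.sql.functions import spark_partition_id

# Count how many records end up in each partition after the repartition
df.groupBy(spark_partition_id().alias("partition_id")) \
  .count() \
  .orderBy("partition_id") \
  .show(89, truncate=False)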
So there is something I'm misunderstanding about repartition() with a column. I also tried adding some salt to my Id column (roughly the sketch below), but nothing changed at all. My question is: how can I control the number of records per partition, and can I do it using repartition()?
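For reference, my salting attempt looked roughly like this (the salted_id column name and the exact salt expression are just an illustration of what I tried, written from memory):

from pyspark.sql import functions as F

# Append a random suffix to Id before repartitioning, hoping to spread rows out
df = spark.read \
    .parquet("my_data_path") \
    .withColumn("salted_id",
                F.concat(F.col("Id"), F.lit("_"),
                         (F.rand() * 89).cast("int").cast("string"))) \
    .repartition(89, "salted_id") \
    .withColumn('result_col', some_udf("data"))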
Thanks everyone