I have a dataframe which has 500 partitions and is shuffled. I want to repartition it based on one column, say 'city', but the city column is extremely skewed as it has only three possible values. So when I repartition based on the city column, even if I specify 500 partitions, only three of them get data. Because of this I am running into performance issues. I searched on the internet but could not find any suitable solution.

Is there a way to repartition the dataframe uniformly across partitions based on the city column? What I need is: city1 goes to, say, the first 5 partitions, city2 goes to the next 490 partitions, and city3 goes to the remaining 5 partitions.
- What is the purpose of this operation? – zero323 Jun 15 '17 at 12:03
- When this dataframe is written to Hive, we get 3 big files because only 3 partitions have data. We need to distribute the data so that manageable-size files are generated when the dataframe is written to Hive. – dipeshk Jun 15 '17 at 12:09
- I think @zero323 already answered this one [here](https://stackoverflow.com/questions/37795488/partitioning-by-multiple-columns-in-spark-sql). You should probably use more than one column in your partitioning. – tbone Jun 15 '17 at 19:42
3 Answers
When we've encountered data with known skew, we've used a partitioner that applies controlled randomization for the skewed values. I outline how this can be done in this answer.
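For illustration, here is a minimal PySpark sketch of one way such controlled randomization (salting) can look; the toy data, column names, and salt range are assumptions, not taken from the linked answer:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe with a heavily skewed 'city' column (3 distinct values).
rows = [(i, "city1" if i < 900 else ("city2" if i < 950 else "city3")) for i in range(1000)]
df = spark.createDataFrame(rows, ["id", "city"])

# Add a bounded random "salt" so each city value maps to many hash keys,
# then repartition on (city, salt) instead of city alone.
salt_buckets = 167  # assumption: roughly 500 partitions / 3 cities
salted = df.withColumn("salt", (F.rand() * salt_buckets).cast("int"))
df_uniform = salted.repartition(500, "city", "salt").drop("salt")

print(df_uniform.rdd.getNumPartitions())  # 500
```

If the resulting partition sizes still differ too much, the salt range can be scaled per value (a larger range for the dominant city), which is essentially the 5 / 490 / 5 split the question asks for.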

Sim
- If your partitioning scheme is based on randomness instead of purely on the data, I'm not sure how Spark (or Hive) would effectively handle things such as partition-wise joins (or partition pruning), either now or in future releases (perhaps Spark or Hive doesn't truly have this ability yet, as [Oracle](https://docs.oracle.com/cd/B28359_01/server.111/b32024/part_avail.htm) does). It does eliminate the need to know how your data is skewed. I'd be interested in your thoughts on this, as I could be missing something. – tbone Jun 19 '17 at 13:23
- Actually I am doing this partitioning scheme at the end. The joins happen earlier, based on other columns. In the end I just needed to partition the data near-uniformly based on this skewed column; afterwards no other operations are performed, hence this method worked. – dipeshk Jun 19 '17 at 18:12
You can repartition into 500 partitions by specifying 1 or more columns (2 in this case). For example (pyspark):
from datetime import datetime

# 'sq' is an existing SQLContext / SparkSession
file_x = "hdfs://mycluster/user/data/x"
df_x = sq.read.format("com.databricks.spark.avro").load(file_x)
print(str(datetime.now()) + ": FileX partitions: " + str(df_x.rdd.getNumPartitions()))

# repartition into 500 partitions based on 2 columns
df_y = df_x.repartition(500, "CITY", "ADDRESS")
print(str(datetime.now()) + ": FileY partitions: " + str(df_y.rdd.getNumPartitions()))
See docs for more
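One way to sanity-check the result (not part of the original answer; it assumes the `df_y` from the snippet above) is to count rows per partition with `spark_partition_id()`; if the two columns don't have enough combined variability, the skew shows up immediately:

```python
from pyspark.sql.functions import spark_partition_id

# Rows per partition after the repartition; heavy skew shows up as a few
# partitions with much larger counts than the rest.
(df_y.groupBy(spark_partition_id().alias("partition_id"))
     .count()
     .orderBy("count", ascending=False)
     .show(10))
```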

tbone
- This will not necessarily solve the skew problem. It depends on how much variability there is in the other columns. – Sim Jun 19 '17 at 00:56
- @Sim yes, you need to know your own data. For example, partitioning by City and Country would probably not be a good choice. – tbone Jun 19 '17 at 10:32
Use a DISTRIBUTE BY clause on the dataframe.

As per your requirement, to deal with the skew you can repartition your data using DISTRIBUTE BY. For the expression to partition by, choose something that you know will evenly distribute the data.

df.repartition(30, $"<expression>")  // DataFrame API equivalent of SQL's DISTRIBUTE BY

In the expression, you randomize the result using some expression like city.toString().length > Random.nextInt(<avg-city-length>)
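As a sketch of repartitioning on a randomized expression in PySpark (the `df` variable, the 166 bucket count, and the temp-view name below are assumptions, not from the original answer):

```python
from pyspark.sql import functions as F

# Spread each city over many random buckets and repartition on (city, bucket).
# 166 buckets is an assumption (~500 partitions / 3 cities).
df_out = df.repartition(500, F.col("city"), (F.rand() * 166).cast("int"))

# Rough Spark SQL equivalent, assuming a temp view named "t"
# (the partition count then comes from spark.sql.shuffle.partitions):
#   SELECT * FROM t DISTRIBUTE BY city, CAST(rand() * 166 AS INT)
```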

Raktotpal Bordoloi
- Using a boolean expression will result in only one of two possible values, which will make matters worse. You want an expression that will result in many different values for the same input. – Sim Jun 19 '17 at 00:59
- Yup... the expression should output a random value so that the data gets evenly distributed among partitions. You can give any expression; that's why I used `RANDOM`. – Raktotpal Bordoloi Jun 19 '17 at 03:28