
I am writing an application using the Spark Dataset API on a Databricks notebook.

I have two tables. One has 1.5 billion rows and the other 2.5 million. Both tables contain telecommunication data, and the join is done on country code and the first 5 digits of a number. The output has 55 billion rows. The problem is that I have skewed data (long-running tasks). No matter how I repartition the dataset, I get long-running tasks because of the uneven distribution of the hashed keys.
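Roughly, the join and the repartition attempt look like the sketch below (the table and column names, and the partition count, are illustrative, not my real code):

import org.apache.spark.sql.functions.substring

// bigDF has ~1.5 billion rows, smallDF ~2.5 million (illustrative names)
val bigKeyed   = bigDF.withColumn("prefix", substring($"number", 1, 5))
val smallKeyed = smallDF.withColumn("prefix", substring($"number", 1, 5))

// join on country code + first 5 digits of the number
val joined = bigKeyed.join(smallKeyed, Seq("country_code", "prefix"))

// repartitioning on the join keys first doesn't help: a few partitions
// still receive most of the rows, so their tasks run much longer than the rest
val repartitioned = bigKeyed.repartition(2000, $"country_code", $"prefix")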

I tried using broadcast joins, tried persisting the big table's partitions in memory, etc.

What are my options here?

Jacek Laskowski
datahack
  • A sample dataset / code snippet of what you tried unsuccessfully is needed – Ram Ghadiyaram Oct 21 '17 at 08:27
  • You said you tried "broadcast joins". If the small table exceeds the auto-broadcast threshold, it won't be a broadcast hash join; internally Spark selects some other join (maybe sort-merge or shuffle join). Don't do `smallDF.join(largeDF)`; it should be `largeDF.join(smallDF)`, and this smallDF should fit into memory so that it gets broadcast to all the executors (a sketch of this hint follows these comments). – Ram Ghadiyaram Oct 21 '17 at 08:36
  • @RamGhadiyaram I tried using an explicit broadcast join. The problem is skewed data. I cannot create a custom partitioner on a Dataset, so the only available way is repartition(num_part, join_key1, join_key2), where join_key1 is the country code and join_key2 is the 5-digit prefix. But this does not work, as the data becomes unevenly distributed... – datahack Oct 21 '17 at 11:46
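For reference, the explicit broadcast hint mentioned in the comments looks roughly like this (a sketch; largeDF, smallDF and the join columns are illustrative names):

import org.apache.spark.sql.functions.broadcast

// hint Spark to broadcast the small side; it must fit in executor memory,
// otherwise Spark falls back to another join strategy (e.g. sort-merge)
val joined = largeDF.join(broadcast(smallDF), Seq("country_code", "prefix"))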

1 Answer

Spark will repartition the data based on the join key, so repartitioning before the join won't change the skew (it only adds an unnecessary shuffle).

If you know the key that is causing the skew (usually it is something like null, 0 or ""), split your data into two parts: one dataset with the skewed key, and another with the rest. Then do the join on each pair of sub-datasets and union the results.

For example:

// df1 and df2 are the two datasets to join; "key" is the (skewed) join column
val df1 = ...
val df2 = ...

// the value causing the skew (usually something like null, 0 or "")
// for a null skew key, filter with isNull / isNotNull instead of === / =!=
val skewKey = ""

val df1Skew = df1.where($"key" === skewKey)
val df2Skew = df2.where($"key" === skewKey)

val df1NonSkew = df1.where($"key" =!= skewKey)
val df2NonSkew = df2.where($"key" =!= skewKey)

// every remaining row on both sides has the same key, so this is effectively
// a cross join; joining on "key" keeps the schema identical to dfNonSkew,
// so the union below lines up by column
val dfSkew    = df1Skew.join(df2Skew, "key")
val dfNonSkew = df1NonSkew.join(df2NonSkew, "key")

val res = dfSkew.union(dfNonSkew)
lev
  • What if I don't know which key is creating the skew? Can you please have a look at this question: https://stackoverflow.com/questions/49490720/how-to-handle-data-skew-in-the-spark-data-frame-for-outer-join?noredirect=1&lq=1 – Atharv Thakur Mar 27 '18 at 09:49
  • 1
    @lev This statement is important you make imo on join: spark will repartition the data based on the join key, so repartitioning before the join won't change the skew (only add an unnecessary shuffle). Is this still the case? I was doing some join experimenting thinking that more partitions would help. Is point about repartitioning in relation to column or number of partitions. I think you mean the former. Just mentioning it for clarity for others or if I am misinformed. – thebluephantom Aug 23 '18 at 07:35
  • @lev Like you mentioned, Spark will repartition data based on the join key, so how will separating the highly skewed values into a separate dataframe help? Won't joining the skewed DFs output the same number of output records for the skewed values as the original join? – ssj_100 Mar 12 '21 at 08:35