
I want to perform a self join in order to generate candidate matching pairs. Currently, this is not working because the operation is way too slow. Unfortunately, I cannot broadcast the data frames as they are too large.

First I aggregate the number of tuples to reduce the data:

val aggregated = df.groupBy("discrete_foo", "discrete_bar", "baz").agg(sum("value").as("value"))
aggregated.repartition(7, col("discrete_foo"), col("discrete_bar"))
  .sortWithinPartitions("discrete_foo", "discrete_bar", "baz")
  .write.option("compression", "gzip").mode("overwrite").parquet("path/to/file/aggregated_stuff")

This works just fine and is fast. Then, I want to perform a self join to generate candidates. I already observed that I need to generate more parallelism:

--conf spark.sql.shuffle.partitions=4000 \
--conf spark.default.parallelism=4000 \ 

So both the default and the shuffle parallelism are increased. Additionally, I tried to coarsen both discrete values (i.e. increase the number of items which fall into each discrete block) and thereby reduce the number of tuples. Still no luck. So I additionally tried to force a larger number of tasks by repartitioning:

val materializedAggregated = spark.read.parquet("path/to/file/aggregated_stuff")
  .repartition(4000)

val selfB = materializedAggregated
  .withColumnRenamed("baz", "other_baz")
  .withColumnRenamed("value", "other_value")

val candidates = materializedAggregated
  .join(selfB, Seq("discrete_foo", "discrete_bar"), "inner")
  .filter(col("baz") =!= col("other_baz"))

However, this also does not work and is way too slow. What else can I do to make this query compute faster? Is there something I am missing?

Below are various failed attempts at increasing the parallelism when reading the data for the self join.

I even set:

--conf spark.sql.files.maxPartitionBytes=16777216 \

to 1/8 of the default, i.e. 16 MB instead of 128 MB, yet the number of tasks which is generated is still way too small, i.e. only 250.
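
As a sanity check (a minimal sketch; spark is the existing SparkSession and the path is the aggregated output from above), the number of partitions the scan actually produces can be inspected like this:

// Sketch: check how many partitions (and hence scan tasks) the read yields.
val readBack = spark.read.parquet("path/to/file/aggregated_stuff")
println(readBack.rdd.getNumPartitions)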

some details

The execution plan: (screenshot)

Even without this manual repartition it is way too slow, and I fear not enough partitions are created: (screenshot)

Even fewer tasks are processed, which most likely will make it slower: (screenshot)

How can I make sure that this initial step has higher parallelism? Could bucketing help? Since the shuffled data is only read once, it would not really yield a speedup, right? What about the repartition step when writing the aggregated files? Should I set a higher number there? So far, even when omitting it (and basically recomputing the aggregation twice), the task count does not increase beyond 260.
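
For reference, a bucketed write of the aggregated data would look roughly like this (a minimal sketch only: bucketBy requires saveAsTable, and the table name aggregated_bucketed is just a placeholder):

// Sketch: pre-bucket the aggregated output by the join keys so that a later
// self join can reuse the bucketing instead of shuffling again.
// "aggregated_bucketed" is a placeholder table name.
aggregated.write
  .format("parquet")
  .bucketBy(4000, "discrete_foo", "discrete_bar")
  .sortBy("discrete_foo", "discrete_bar")
  .mode("overwrite")
  .saveAsTable("aggregated_bucketed")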

environment

I use Spark 2.3.x on HDP 3.1.

Georg Heiler
  • Is it possible this is due to skew in the generation of candidate comparisons (in your case, the discrete values)? For instance, if you were blocking on first name, the self join on John vs. John comparisons might generate a very large number of comparisons. The SortMergeJoin means that all of these will be computed on a single node. (Interesting to see you've already tried most of the things I mention here, from when I faced the same problem: https://github.com/moj-analytical-services/splink_demos/blob/master/large_jobs_and_breaking_lineage.ipynb) – RobinL Apr 07 '20 at 07:13
  • Certainly, this is possible, but I think this is only a small, secondary problem. My primary issue is that I somehow cannot tell Spark to generate more tasks, i.e. I cannot throw more CPUs at it to make it faster. But you are absolutely right, I will need to find a solution to the problem you are describing. – Georg Heiler Apr 07 '20 at 07:26
  • Meanwhile I have restarted the job with a single CPU per executor, each with 30 GB of RAM; this at least (for now) removes any need to spill to disk. But that is not the bottleneck for now, only the number of tasks is. – Georg Heiler Apr 07 '20 at 07:29
  • Ah, I think I might understand now. The two things that you've done make sense, but only if you've got enough discrete values. You've done --conf spark.sql.shuffle.partitions=4000, --conf spark.default.parallelism=4000 and the repartition(4000). However, this will only work if there are at least 4000 discrete bins, i.e. more granular bins should result in more tasks and coarser bins should result in fewer tasks. – RobinL Apr 07 '20 at 07:30
  • But that also results in way more tuples. You are right though, I will try to move in a different direction, i.e. smaller but more numerous discrete bins. – Georg Heiler Apr 07 '20 at 07:36
  • @RobinL I can confirm that this is correct. I still need to find the right parameter values, but this suggestion definitely is giving me more than 250 tasks! – Georg Heiler Apr 07 '20 at 08:01
  • I observe a large amount of shuffle spill, but this most likely is due to the (still many) candidates being generated. – Georg Heiler Apr 07 '20 at 08:44
  • @RobinL do you want to write an answer so I can accept it? – Georg Heiler Apr 07 '20 at 08:51

1 Answer


The maximum number of tasks from your inner join will be equal to the number of distinct join keys (i.e. their cardinality), irrespective of the settings for spark.sql.shuffle.partitions and spark.default.parallelism.

This is because in the SortMergeJoin, data is shuffled using the hash of the join key. All data for each distinct join key goes to a single partition and is therefore processed by a single task.

The problem therefore is that you do not have enough bins: they are too coarse. The maximum number of tasks you will see will be equal to the number of bins.

If you bin your data with more granularity, you should see the number of tasks increase.
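
A rough way to verify this (a sketch; aggregated stands for your aggregated DataFrame) is to compute the cardinality of the join key yourself, since that number is the upper bound on the tasks the join can use:

// Sketch: the number of distinct (discrete_foo, discrete_bar) pairs bounds the
// number of tasks the SortMergeJoin can effectively use.
val keyCardinality = aggregated
  .select("discrete_foo", "discrete_bar")
  .distinct()
  .count()
println(s"distinct join keys: $keyCardinality")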

RobinL