How to optimise pyspark approxSimilarityJoin for 2 very large data frames

Question

I want to join two huge data frames based on their similarity. I tried using approxSimilarityJoin. However, the job gets stuck after some time and eventually fails.

score 0 · Answer 1 · answered Jul 19 '19 at 13:44

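There are several ways to approach this:

Increase the cluster size.
Use a broadcast join if one of the datasets is much smaller than the other.
Use blocking techniques, i.e. compare only records that share a coarse key instead of comparing every record with every other record.
Use Delta Lake if that option is available.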

answered Jul 19 '19 at 13:44

Rob

Thanks Rob. I am still confused, since I am using Spark for the first time. What do you mean by blocking techniques? Also, I cannot use a broadcast join, as both data frames are huge. – Dan Jul 22 '19 at 09:16
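To make "blocking" concrete without any Spark machinery, here is a small plain-Python sketch of the idea. The sample names, the first-letter blocking rule, and the helper names are hypothetical and chosen only for illustration. Records are grouped by a cheap key, and the more expensive similarity check runs only inside each group.

```python
from collections import defaultdict


def tokens(name):
    """Split a record into a set of lowercase word tokens."""
    return set(name.lower().split())


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two token sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def block_key(name):
    """Hypothetical blocking rule: the first character of the record."""
    return name[0].lower()


def blocked_matches(left, right, threshold):
    """Compare only records that share a block key; return matches and comparison count."""
    buckets = defaultdict(list)
    for rec in right:
        buckets[block_key(rec)].append(rec)

    matches, compared = [], 0
    for l_rec in left:
        for r_rec in buckets.get(block_key(l_rec), []):
            compared += 1
            if jaccard_distance(tokens(l_rec), tokens(r_rec)) < threshold:
                matches.append((l_rec, r_rec))
    return matches, compared


left = ["acme corp ltd", "bolt motors", "zeta labs"]
right = ["acme corporation ltd", "bolt motors inc", "apex foods", "zeta labs"]

matches, compared = blocked_matches(left, right, threshold=0.6)
full_cross_product = len(left) * len(right)
print(f"compared {compared} pairs instead of {full_cross_product}")
for pair in matches:
    print(pair)
```

The saving comes from skipping every pair whose keys differ. The cost is that two similar records with different keys (for example a typo in the first letter) are never compared, so the key should be chosen to be both cheap and stable.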
