I want to join two huge DataFrames based on their similarity. I tried using approxSimilarityJoin, but the task gets stuck after some time and eventually fails.
1 Answer
There are several ways to approach this:
- Increase the cluster size
- Use a broadcast join if one of the datasets is much smaller than the other
- Use blocking techniques
- Use Delta Lake if that option is available
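
To make "blocking" concrete: instead of comparing every row on one side with every row on the other, you first group rows by a cheap key and only score pairs that share that key. Below is a minimal sketch in plain Python, not Spark code. The `block_key` rule (first letter of the name), the Jaccard word-overlap score, the 0.5 threshold, and the sample names are all assumptions made up for illustration; in Spark the same idea would roughly correspond to deriving a block-key column on both DataFrames and doing an ordinary equi-join on it before computing similarity.

```python
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lower-cased word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def block_key(name: str) -> str:
    # Hypothetical blocking rule: first character of the lower-cased name.
    return name.strip().lower()[:1]


def blocked_similarity_join(left, right, threshold=0.5):
    """Return (left, right, score) for same-block pairs with score >= threshold."""
    # Index the right side by block key so each left row only sees its own block.
    blocks = defaultdict(list)
    for r in right:
        blocks[block_key(r)].append(r)

    matches = []
    for l in left:
        for r in blocks.get(block_key(l), []):
            score = jaccard(l, r)
            if score >= threshold:
                matches.append((l, r, score))
    return matches


# Illustrative data only.
left = ["Acme Corp", "Beta Industries", "Gamma LLC"]
right = ["acme corp ltd", "beta industry", "Gamma LLC", "Zeta Corp"]
pairs = blocked_similarity_join(left, right)
```

With this sample data only 3 pairs are scored instead of all 12. The trade-off is that rows in different blocks are never compared, so the key has to be chosen so that true matches almost always land in the same block.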

Rob
Thanks Rob. I am still confused, since I am using Spark for the first time. What do you mean by blocking techniques? Also, I cannot use a broadcast join, as both DataFrames are huge. – Dan Jul 22 '19 at 09:16