1

I want to join two huge data frames based on their similarity. I tried using approxSimilarityJoin. However, the task gets stuck after some time and eventually fails.

Dan

1 Answer

0

There are several ways to approach this:

  1. Increase the cluster size
  2. Use a broadcast join if one of the datasets is much smaller than the other
  3. Use blocking techniques, so that only rows sharing a blocking key are compared
  4. Use Delta Lake if that option is available
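
To illustrate what blocking means, here is a minimal sketch in plain Python rather than Spark. It assumes the records are short strings, uses a hypothetical blocking rule (the first character of the lowercased text), and scores pairs with Jaccard similarity over whitespace tokens. The function names and the threshold are made up for the example. In Spark the same idea would roughly correspond to deriving a key column on both data frames and joining on it before running the similarity comparison.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def blocking_key(text):
    # Hypothetical blocking rule: first character of the lowercased text.
    # A real rule depends on the data (e.g. a prefix, a zip code, a hash).
    return text.strip().lower()[:1]


def blocked_similarity_join(left, right, threshold=0.5):
    """Compare only records that share a blocking key."""
    # Group the right-hand records by their blocking key.
    blocks = defaultdict(list)
    for rec in right:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for l_rec in left:
        l_tokens = set(l_rec.lower().split())
        # Only candidates in the same block are scored.
        for r_rec in blocks.get(blocking_key(l_rec), []):
            score = jaccard(l_tokens, set(r_rec.lower().split()))
            if score >= threshold:
                matches.append((l_rec, r_rec, score))
    return matches


left = ["Acme Corp Ltd", "Beta Systems"]
right = ["acme corp", "Beta System Inc", "Zeta Corp Ltd"]
result = blocked_similarity_join(left, right, threshold=0.2)
for row in result:
    print(row)
```

Note the trade-off: "Zeta Corp Ltd" shares two tokens with "Acme Corp Ltd" but is never compared, because the two fall into different blocks. A coarser key compares more pairs and misses fewer matches; a finer key is cheaper but misses more.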
Rob
  • Thanks Rob. I am still confused, since I am using Spark for the first time. What do you mean by blocking techniques? Also, I cannot use a broadcast join, as both data frames are huge. – Dan Jul 22 '19 at 09:16