I have DF1 with ~50k records. DF2 has more than 5 billion records, read from Parquet files on S3. I need to do a left outer join on an MD5 hash column that exists in both DataFrames, but as expected it's slow and expensive.
I tried a broadcast join, but DF1 is quite big as well.
I was wondering what the best way to handle this would be. Should I first filter DF2 down to the rows matching those 50k MD5 values and then do the join with DF1?
Thanks.