3

I am currently working on updating a table based on its existence on another table:

Ex:

Dataset A (relatively small, 300k of rows): DepartmentId, EmployeeId, Salary, Error

Dataset B (relatively huge, millions of rows): DepartmentId, EmployeeId, Salary

The logic is: 1. If A's (DepartmentId, EmployeeId) pair exists in B, then update A's salary with B's salary 2. Otherwise, write a message to A's error field

The solution I have now is doing a left outer join on A with B. Is there any other better practices for this type of problem?

Thank you in advance!

Community
  • 1
  • 1
vincwng
  • 648
  • 1
  • 6
  • 12

1 Answers1

3

For better performance, you can use broadcast hash join as mention here by @Ram Ghadiyaram

The broadcasted dataframe will be distributed in all the partition which increases the performance in joining.

DataFrame join optimization - Broadcast Hash Join

Hope this helps!

koiralo
  • 22,594
  • 6
  • 51
  • 72