Join relatively small table with large table in Spark 2.1

Question

I am currently working on updating a table based on its existence on another table:

Ex:

Dataset A (relatively small, 300k of rows): DepartmentId, EmployeeId, Salary, Error

Dataset B (relatively huge, millions of rows): DepartmentId, EmployeeId, Salary

The logic is: 1. If A's (DepartmentId, EmployeeId) pair exists in B, then update A's salary with B's salary 2. Otherwise, write a message to A's error field

The solution I have now is doing a left outer join on A with B. Is there any other better practices for this type of problem?

Thank you in advance!

score 3 · Answer 1 · answered Jul 05 '17 at 15:04

3

For better performance, you can use broadcast hash join as mention here by @Ram Ghadiyaram

The broadcasted dataframe will be distributed in all the partition which increases the performance in joining.

DataFrame join optimization - Broadcast Hash Join

Hope this helps!

answered Jul 05 '17 at 15:04

koiralo

22,594
6
51
72

does broadcast hash join work for smallTable.join(bigTable) tho? – vincwng Jul 05 '17 at 15:18
1

Yes it works for broadcasr(smallTable).join(bigTable) as well – koiralo Jul 05 '17 at 15:30

Join relatively small table with large table in Spark 2.1

1 Answers1