I am joining a Spark DataFrame with 23 million records to a DataFrame with 0.5 million records. A broadcast join doesn't seem feasible, as the smaller table won't fit in memory when distributed to all workers. Whenever I run the join, Spark stalls at the shuffle stage and doesn't make progress. How should I proceed with the join?
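For reference, a minimal sketch of what I'm doing; the paths, the join key `id`, and the DataFrame names are placeholders, not my actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-join").getOrCreate()

# Hypothetical inputs standing in for my actual tables:
# large_df has ~23 million rows, small_df has ~0.5 million rows.
large_df = spark.read.parquet("/data/large")   # placeholder path
small_df = spark.read.parquet("/data/small")   # placeholder path

# Plain shuffle (sort-merge) join on a shared key column;
# this is the step that stalls during the shuffle.
joined = large_df.join(small_df, on="id", how="inner")
joined.write.parquet("/data/joined")           # placeholder path
```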
- How many partitions do you have for your data? How long did you wait (as in an hour or so)? Just want to make sure it really halts and it's not just really slow :) – Frank Nov 22 '18 at 10:56
- @Frank - The bigger DataFrame has 400 partitions and the smaller one has 40. I waited two hours or more, after which it threw a "Not able to write rows" error and a RemoteException; the data could not be written to the datanodes. – Anand Nautiyal Nov 22 '18 at 12:43
- @Frank - Can repartitioning help in this case (see the sketch below)? – Anand Nautiyal Nov 23 '18 at 04:34
- How many CPU cores are available where you run your Spark program? Also see https://stackoverflow.com/questions/35800795/number-of-partitions-in-rdd-and-performance-in-spark and https://spark.apache.org/docs/latest/tuning.html on this. – Frank Nov 23 '18 at 19:00
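A minimal sketch of the repartitioning idea discussed in the comments, continuing from the code in the question. The core count, the 3x partitions-per-core multiplier, and the join key `id` are assumptions for illustration, not values from this thread:

```python
# Assumed total cores across the cluster; 2-3 shuffle partitions per core
# is a common rule of thumb (see the Spark tuning guide linked above).
total_cores = 40  # hypothetical; set to your cluster's actual core count
spark.conf.set("spark.sql.shuffle.partitions", str(total_cores * 3))

# Repartition both sides on the join key so matching keys land in the
# same shuffle partition and individual partitions stay small enough
# to fit in executor memory.
large_repart = large_df.repartition(total_cores * 3, "id")
small_repart = small_df.repartition(total_cores * 3, "id")

joined = large_repart.join(small_repart, on="id", how="inner")
```

If the join key is heavily skewed, one partition can still end up far larger than the others and stall the shuffle regardless of the partition count, so checking the key distribution (e.g. `large_df.groupBy("id").count()`) before tuning partition numbers may be worthwhile.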