I am joining a Spark DataFrame with 23 million records to a DataFrame with 0.5 million records. A broadcast join doesn't seem feasible, as the smaller table won't fit in memory when distributed to all workers. Whenever I run the join, Spark stalls at the shuffle stage and doesn't make progress. How should I proceed with the join?
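For reference, a minimal sketch of what I'm doing; the paths, the join key `id`, and the DataFrame names are placeholders, not my actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-join").getOrCreate()

# Hypothetical inputs standing in for my actual tables:
# large_df has ~23 million rows, small_df has ~0.5 million rows.
large_df = spark.read.parquet("/data/large")   # placeholder path
small_df = spark.read.parquet("/data/small")   # placeholder path

# Plain shuffle (sort-merge) join on a shared key column;
# this is the step that stalls during the shuffle.
joined = large_df.join(small_df, on="id", how="inner")
joined.write.parquet("/data/joined")           # placeholder path
```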
- How many partitions do you have for your data? How long did you wait (as in an hour or so)? Just want to make sure it really halts and it's not just really slow :) – Frank Nov 22 '18 at 10:56
- @Frank - The bigger DataFrame has 400 partitions and the smaller one has 40. I waited two hours or more, after which it threw a "Not able to write rows" error and a RemoteException; the data could not be written to the datanodes. – Anand Nautiyal Nov 22 '18 at 12:43
- @Frank - Can repartitioning help in this case (see the sketch below)? – Anand Nautiyal Nov 23 '18 at 04:34
- How many CPU cores are available where you run your Spark program? Also see https://stackoverflow.com/questions/35800795/number-of-partitions-in-rdd-and-performance-in-spark and https://spark.apache.org/docs/latest/tuning.html on this. – Frank Nov 23 '18 at 19:00
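A minimal sketch of the repartitioning idea discussed in the comments, continuing from the code in the question. The core count, the 3x partitions-per-core multiplier, and the join key `id` are assumptions for illustration, not values from this thread:

```python
# Assumed total cores across the cluster; 2-3 shuffle partitions per core
# is a common rule of thumb (see the Spark tuning guide linked above).
total_cores = 40  # hypothetical; set to your cluster's actual core count
spark.conf.set("spark.sql.shuffle.partitions", str(total_cores * 3))

# Repartition both sides on the join key so matching keys land in the
# same shuffle partition and individual partitions stay small enough
# to fit in executor memory.
large_repart = large_df.repartition(total_cores * 3, "id")
small_repart = small_df.repartition(total_cores * 3, "id")

joined = large_repart.join(small_repart, on="id", how="inner")
```

If the join key is heavily skewed, one partition can still end up far larger than the others and stall the shuffle regardless of the partition count, so checking the key distribution (e.g. `large_df.groupBy("id").count()`) before tuning partition numbers may be worthwhile.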