
I am trying to perform an inner join on multiple PySpark dataframes that all have a "date" column:

df = (df_1.join(df_2, on=['date'], how='inner')
          .join(df_3, on=['date'], how='inner')
          .join(df_4, on=['date'], how='inner')
          .join(df_5, on=['date'], how='inner')
          .join(df_6, on=['date'], how='inner'))
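
The same chain can also be written as a fold over a list of the dataframes; this is just a more compact equivalent of the line above, assuming the six dataframes are collected in a list:

from functools import reduce

# the six dataframes shown further below
dfs = [df_1, df_2, df_3, df_4, df_5, df_6]

# fold the list into one dataframe, one inner join on 'date' per step
df = reduce(lambda left, right: left.join(right, on=['date'], how='inner'), dfs)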

But I am getting the following broadcast timeout error:

Py4JJavaError: An error occurred while calling o5018.showString.
: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:206)
    at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:515)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:188)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:184)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:116)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:210)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:100)
    at org.apache.spark.sql.execution.CodegenSupport.consume(WholeStageCodegenExec.scala:194)
    at org.apache.spark.sql.execution.CodegenSupport.consume$(WholeStageCodegenExec.scala:149)
    at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:496)
    at org.apache.spark.sql.execution.InputRDDCodegen.doProduce(WholeStageCodegenExec.scala:483)
    at org.apache.spark.sql.execution.InputRDDCodegen.doProduce$(WholeStageCodegenExec.scala:456)
    at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:496)
    at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:496)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:95)
    at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:39)
    at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:51)
    at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:41)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:95)
    at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:39)
    at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:51)
    at org.apache.spark.sql.execution.CodegenSupport.$anonfun$produce$1(WholeStageCodegenExec.scala:95)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.CodegenSupport.produce(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.CodegenSupport.produce$(WholeStageCodegenExec.scala:90)
    at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:41)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:632)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:692)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
    at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:316)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:434)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2697)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2697)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2904)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.util.concurrent.TimeoutException
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:204)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:195)
    ... 89 more

Even a join on a smaller number of dataframes fails. I also tried joining them separately:

df = df_1.join(df_2, on=['date'], how='inner')
final_df = df.join(df_3, on=['date'], how='inner').drop(df['date'])

I also tried increasing the default timeout from 300 s to 6000 s, but it did not help. The other option is to set spark.sql.autoBroadcastJoinThreshold to -1, but I would prefer to solve this some other way.
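
For reference, the two settings mentioned above can be changed roughly like this (a sketch using the standard spark.conf API; the SparkSession is assumed to already exist):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# raise the broadcast timeout (seconds); 300 is the default
spark.conf.set("spark.sql.broadcastTimeout", "6000")

# or disable automatic broadcast joins entirely, as the error message suggests
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")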

The dataframes are as follows:

    df_1
        +--------------+----------+
        |        mean_1|      date|
        +--------------+----------+
        |         96.24|2017-12-30|
        |         97.22|2018-01-01|
        +--------------+----------+
       
    df_2 
        +--------------+----------+
        |        mean_2|      date|
        +--------------+----------+
        |        116.66|2017-12-30|
        |        126.16|2018-01-01|
        +--------------+----------+
      
    df_3
        +--------------+----------+
        |        mean_3|      date|
        +--------------+----------+
        |         66.22|2017-12-30|
        |         98.26|2018-01-01|
        +--------------+----------+

    df_4
        +--------------+----------+
        |        mean_4|      date|
        +--------------+----------+
        |        196.45|2017-12-30|
        |        216.15|2018-01-01|
        +--------------+----------+
     
    df_5   
        +--------------+----------+
        |        mean_5|      date|
        +--------------+----------+
        |        106.29|2017-12-30|
        |         97.56|2018-01-01|
        +--------------+----------+
     
    df_6   
        +--------------+----------+
        |        mean_6|      date|
        +--------------+----------+
        |        136.66|2017-12-30|
        |        126.16|2018-01-01|
        +--------------+----------+

Expected output:

+----------+------+------+------+-------+-------+------+
|      date|mean_1|mean_2|mean_3| mean_4| mean_5|mean_6|
+----------+------+------+------+-------+-------+------+
|2017-12-30| 96.24|116.66| 66.22| 196.45| 106.29|136.66|
|2018-01-01| 97.22|126.16| 98.26| 216.15|  97.56|126.16|
+----------+------+------+------+-------+-------+------+

Please help.

Samiksha
  • Have you tried joining df_1 and df_2 into dfA, and df_3 and df_4 into dfB, then joining dfA and dfB, to see if that solves the issue? – HArdRe537 Jan 28 '21 at 07:14
  • how were your dataframes created? – mck Jan 28 '21 at 08:00
  • @HArdRese7 Yes, I tried that but it did not work! – Samiksha Jan 28 '21 at 08:48
  • @SameekshaSohal are you sure you want to do a join, not a union? Could you show what the dataframes look like, and your expected output? – mck Jan 28 '21 at 08:58
  • @mck I just added the dfs and the expected output. Please have a look. – Samiksha Jan 28 '21 at 09:17
  • try following this answer: https://stackoverflow.com/a/65166462/14165730 – mck Jan 28 '21 at 09:20
  • @mck thank you for the reference. I tried passing the number of partitions explicitly -> df = df_1.repartition(1,'date').join(df_2.repartition(1, 'date'),on=['date']) but it did not work (a fuller sketch of this attempt follows these comments) – Samiksha Jan 28 '21 at 10:31
  • Right now I am working with a very small sample where df_1, df_2, ...., df_6 each have 2 rows and 2 columns. Even at this small size the join does not work. I am wondering how I will use this code on my entire dataset, where the dataframes will have millions of rows. – Samiksha Jan 28 '21 at 10:41
  • @mck I am using dataframes created inside this for loop (https://stackoverflow.com/a/65921255/13713750). Likewise, I have more dataframes created inside the loop, and the aim is to join those dataframes. – Samiksha Jan 28 '21 at 11:12
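
Following up on the repartition attempt mentioned in the comments above, this is roughly the shape of that approach extended to all six dataframes (a sketch of what was tried, not a confirmed fix; the partition count of 1 only reflects the tiny sample data):

# repartition every dataframe on the join key, then chain the inner joins
dfs = [d.repartition(1, 'date') for d in [df_1, df_2, df_3, df_4, df_5, df_6]]

df = dfs[0]
for right in dfs[1:]:
    df = df.join(right, on=['date'], how='inner')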

0 Answers