0

I have scenario where I have a big dataset ( ds1) , which need to be joined with another dataset ds2( which is bit smaller than ds1 ), I am joining it by broadcast join something as below shown

Dataset<Row> result = ds1.join(broadcast(ds2))
                      .where(ds1.col_1 = ds2.col_2 and ds1.col_4 = ds2.col_6) //some join condition

Randomly it gives bellow error and job fails.

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition, true, [id=#51016]
+- *(180) LocalLimit 10001
   +- *(180) HashAggregate(keys=[benchmark_type_code#44300], functions=[], output=[benchmark_type_code#44300])
      +- Exchange hashpartitioning(benchmark_type_code#44300, 400), true, [id=#51011]
         +- *(179) HashAggregate(keys=[benchmark_type_code#44300], functions=[], output=[benchmark_type_code#44300])
            +- *(179) Project [benchmark_type_code#44300]
               +- *(179) BroadcastHashJoin [id#44359, country#45179], [id#45196, country_code#44515], Inner, BuildRight

Not able to understand what is causing error, so what is wrong here and how to fix here ? highly appreciate your help.

So failing with below Exception i.e. "because SparkContext was shut down"

 Caused by: org.apache.spark.SparkException: Job 67 cancelled because SparkContext was shut down
            at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:979)
            at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:977)
            at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
            at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:977)
            at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2257)
            at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
            at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2170)
            at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:1988)
            at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
            at org.apache.spark.SparkContext.stop(SparkContext.scala:1988)
            at org.apache.spark.SparkContext.$anonfun$new$35(SparkContext.scala:638)
            at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
            at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
            at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
            at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
            at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
            at scala.util.Try$.apply(Try.scala:213)
            at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
            at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
            at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:2135)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:2154)
            at org.apache.spark.SparkContext.runJob(SparkContext.scala:2179)
            at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
            at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
            at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
            at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
            at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
            at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
            at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:120)
            at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:182)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:750)
Shasu
  • 458
  • 5
  • 22
  • @DaRkMaN Hi sir, SparkContext shuts down while doing a join , how to debug and fix the issue, can help me – Shasu Sep 27 '22 at 11:05
  • @Shaido how are you doing , any clue what am i doing wrong here and how to debug it to fix ? – Shasu Sep 27 '22 at 11:10

1 Answers1

1

Caused by: org.apache.spark.SparkException: Job 67 cancelled because SparkContext was shut down

This error is very generic and may be caused by many things but i think that in stacktrace you can find something usefull

at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:120)

Imo it may be caused by oom due to size of ds2 and i think that You may check size of broadcasted dataset and review your memory settings for driver/executors

M_S
  • 2,863
  • 2
  • 2
  • 17
  • thanks a lot , how to check size of ds here ? any easy way ? – Shasu Oct 07 '22 at 11:26
  • 1
    I know one solution if you just want to check dataset size and you cannot do it on your source side. You can cache this dataset (remember to call some action to) and then get sizeInBytes from statistics which are generated during caching with something like this spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes after caching size should also be avilable in SparkUI in storage tab – M_S Oct 07 '22 at 11:35
  • any clue on this issue https://stackoverflow.com/questions/74035832/exception-occured-while-writing-delta-format-in-aws-s3 ? – Shasu Oct 12 '22 at 02:27