
Hi, I am trying to delete records from a Delta table. It causes a broadcast timeout error from time to time. Can someone please help with this?

spark.sql(s"""DELETE FROM stg.bl  WHERE concat(key,':',revision) in 
   (Select distinct concat(bl.key,':',bl.revision) from stg.bl bl left semi join
    tgt.bl tgt ON bl.key = tgt.key and bl.revision = tgt.revision)""")
org.apache.spark.SparkException: Could not execute broadcast in 300 secs. 
You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or 
disable broadcast join 
by setting spark.sql.autoBroadcastJoinThreshold to -1

  • Is this the only SQL statement executed at the time? What's Spark and Delta versions? Does this happen often or just happened once? Broadcast variables use network and disk heavily, and any issues with them may lead to this exception. – Jacek Laskowski Jun 02 '21 at 10:55
  • BTW What's the data source of the broadcast side of the join? – Jacek Laskowski Jun 02 '21 at 10:56
  • This occurs intermittently.. sometimes it works, sometimes this issue comes up. Maybe due to the network?? – mehere Jun 03 '21 at 14:37
  • The data source for both sides of the join is Azure Databricks Delta tables. – mehere Jun 03 '21 at 14:38
  • Yes, this is the only SQL statement executed at that time. The job has a dedicated linked service that creates an on-demand cluster with 2 worker nodes and cluster version 5.5.x-scala2.11 – mehere Jun 03 '21 at 14:40

1 Answer


It could be a bit late, but have you tried raising the broadcast timeout limit from the default 300 seconds to a larger value?

spark.conf.set("spark.sql.broadcastTimeout", "300")

spark.conf.set("spark.sql.broadcastTimeout", "3000")