In Apache Spark, there's a function called `broadcast`, which marks a DataFrame as small enough to be broadcast in a join. However, what if I want to do the opposite?

Even after adjusting the broadcast threshold, there are times when Spark tries to do a broadcast with DataFrames that are too large, leading to failed tasks. Is it possible to do the opposite of the broadcast function, and explicitly prevent Spark from broadcasting a specific DataFrame?

PiFace
  • Does this answer your question? [How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?](https://stackoverflow.com/questions/48145514/how-to-hint-for-sort-merge-join-or-shuffled-hash-join-and-skip-broadcast-hash-j) – mazaneicha Jan 10 '23 at 21:56
  • It's essentially the same question, but without an answer. I know how to disable Broadcast Hash Join globally, as the answers to that question suggest, but I'd like to disable it for a specific DataFrame, on a specific join, not in the whole application. – PiFace Jan 11 '23 at 14:44
  • You can disable it globally and use hints to selectively broadcast what you need, or keep it enabled and use other [hints](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html#join-hints) to force the desired join strategy. And contrary to what you said (_"Even after adjusting the broadcast threshold, there are times when Spark tries to do a broadcast..."_), that should NOT be the case, i.e. auto-broadcast should not happen if the threshold is set to 0 or a negative value. – mazaneicha Jan 11 '23 at 15:36
  • Sorry, I wasn't clear in that part. What I meant is: _when the threshold is adjusted to an appropriate positive value_, Spark sometimes tries to do a broadcast even if the real table size is above that threshold. – PiFace Jan 11 '23 at 15:51
  • I wasn't aware of the `DataFrame.hint` method. Why don't you post that suggestion as an answer so I can accept it? It's exactly what I was looking for. – PiFace Jan 12 '23 at 16:42
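
For anyone landing here, the per-join hint suggested in the comments could look roughly like the sketch below. It assumes Spark 3.0 or later, where the `merge`, `shuffle_hash` and `shuffle_replicate_nl` join hints are available (older versions only honor the broadcast hint). The helper name `join_without_broadcast` and the `demo` function are made up for illustration and are not part of Spark; only `DataFrame.hint` and `DataFrame.join` are real API calls.

```python
# Hypothetical helper (not a Spark API): apply a non-broadcast join hint
# to one side of a single join, leaving the global threshold untouched.
NON_BROADCAST_HINTS = {"merge", "shuffle_hash", "shuffle_replicate_nl"}


def join_without_broadcast(left, right, on, how="inner", strategy="merge"):
    """Join `left` with `right`, hinting Spark to use `strategy` for this join."""
    if strategy not in NON_BROADCAST_HINTS:
        raise ValueError(f"unsupported non-broadcast hint: {strategy!r}")
    # DataFrame.hint returns a new DataFrame carrying the hint;
    # the hint only affects joins that this DataFrame takes part in.
    return left.join(right.hint(strategy), on=on, how=how)


def demo():
    """Illustrative run; requires a local PySpark installation."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("no-broadcast-hint")
        .getOrCreate()
    )
    left = spark.range(1000).withColumnRenamed("id", "key")
    right = spark.range(1000).withColumnRenamed("id", "key")
    joined = join_without_broadcast(left, right, on="key")
    # Inspect the physical plan to check which join strategy was selected.
    joined.explain()
    spark.stop()
```

Calling `demo()` prints the physical plan, which is the easiest way to confirm that the hinted join was planned with the requested strategy rather than a broadcast. The same hints can be written in SQL as `/*+ MERGE(t) */`, per the join hints page linked above.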

0 Answers