
What will happen if I join a dataframe/RDD/dataset with itself, i.e. do a self-join, and do a broadcast of the same dataframe/RDD/dataset in the operation?

The broadcast and self-join can't work together optimally.

philipxy
  • Please ask 1 specific researched non-duplicate question. (Don't wonder, ask.) (Don't say you think/believe something, say it. Don't say you think/believe something when you don't & if you don't know & that's relevant, say you don't know & especially don't say you think/believe it because you don't.) Please avoid social & meta commentary in posts. Saying you searched is not helpful. Asking for off-site resources is off-topic. [ask] [Help] "Optimally" doesn't mean anything in particular. – philipxy Apr 30 '23 at 15:05
  • What's stopping you from trying? – philipxy Apr 30 '23 at 15:11
  • @philipxy, I am trying it out in the notebook on databricks, however I need more insights as to how spark treats it and all. As the answer here provides better understanding of the UI, this is what I already mentioned in the question that I'm still learning more about it and was curious about it. Hope it clarifies!! – Sanket Mehta May 01 '23 at 06:11
  • Please clarify via edits, not comments. Please act on the comments. How are you 1st stuck answering this? What has your research shown? How are you stuck understanding what presentation/documentation? [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) – philipxy May 01 '23 at 06:20

1 Answer

The easiest way to find out is to check the Spark UI.

Here is some sample code:

import datetime

import pyspark.sql.functions as F

# `spark` is the SparkSession that a Databricks notebook provides by default.
data = [
    {"company_id": 1, "value": 3004, "date": datetime.datetime(2020, 2, 1), "date_year": 2020},
    {"company_id": 1, "value": 3004, "date": datetime.datetime(2020, 5, 17), "date_year": 2020},
    {"company_id": 1, "value": 3004, "date": datetime.datetime(2020, 7, 27), "date_year": 2020}
]

df = spark.createDataFrame(data)

# Self-join with an explicit broadcast hint on the right-hand side.
df.join(F.broadcast(df), "value", "left").show()

In the Spark UI I can see this:

[Image: Spark UI screenshot]

Here we can see that in a self-join Spark actually reads/computes your dataset twice, once for each side of the join, so broadcast can still be used. In this case the right branch is the one that gets broadcast.

M_S
  • Hey thanks @M_S, for the answer. Can you please share where in databricks I could see this detailed step numbered UI of spark? Thanks! – Sanket Mehta May 01 '23 at 06:21
  • When you run a cell, you should find a list of jobs just above the output. Each job has a link called "view" which will bring you to the Spark UI; from there everything is quite standard. You can also go to the "Compute" tab in Databricks, choose your cluster and open the Spark UI tab – M_S May 01 '23 at 07:19