What will happen if I join a dataframe/RDD/dataset with itself, i.e. do a self-join, and do a broadcast of the same dataframe/RDD/dataset in the operation?
The broadcast and self-join can't work together optimally.
I think the easiest way to find out is to check the Spark UI.
Here is some sample code:
import datetime
import pyspark.sql.functions as F
data = [
{"company_id": 1, "value": 3004, "date": datetime.datetime(2020, 2, 1), "date_year": 2020},
{"company_id": 1, "value": 3004, "date": datetime.datetime(2020, 5, 17), "date_year": 2020},
{"company_id": 1, "value": 3004, "date": datetime.datetime(2020, 7, 27), "date_year": 2020}
]
df = spark.createDataFrame(data)
df.join(F.broadcast(df), "value", "left").show()
In the Spark UI I can see this:
Here we can see that when you do a self-join, Spark actually reads/computes your dataset twice, once for each side of the join. So broadcast can still be used: in that case the right branch is the one that gets broadcast.