I am learning Spark using a Databricks Community Edition notebook. I created sample data with a few rows; since the data is so small, the query plan should not need an exchange (shuffle) phase. I tried a broadcast too, but I still see an exchange phase. Do these configurations not work on a Databricks Community Edition notebook?
import org.apache.spark.sql.functions.col

// Author-to-book mapping (the larger side of the join)
val authorBook = sc.parallelize(Seq(("a1", "b1"), ("a1", "b2"), ("a2", "b3"), ("a3", "b4")))
val schemaColumn = Seq("author", "book")
val df = authorBook.toDF(schemaColumn: _*)

// Copies sold per book (the small side I expected to be broadcast)
val bookSold = sc.parallelize(Seq(("b1", 100), ("b2", 500), ("b3", 400), ("b4", 500)))
val bookSchema = Seq("book", "sold_copy")
val dfBook = bookSold.toDF(bookSchema: _*)

// Also tried explicit repartitioning on the join key:
// val totalBookSold = df.repartition(4, col("book")).join(dfBook.repartition(4, col("book")), "book")

// Attempt to broadcast the small table before joining
sc.broadcast(dfBook)
val totalBookSold = df.join(dfBook, "book")
totalBookSold.explain(true)
The query plan is the same with and without the broadcast:
== Physical Plan ==
*(3) Project [book#698, author#697, sold_copy#708]
+- *(3) SortMergeJoin [book#698], [book#707], Inner
:- Sort [book#698 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(book#698, 200), [id=#2071]
: +- *(1) Project [_1#694 AS author#697, _2#695 AS book#698]
: +- *(1) Filter isnotnull(_2#695)
: +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#694, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS _2#695]
: +- Scan[obj#693]
+- Sort [book#707 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(book#707, 200), [id=#2079]
+- *(2) Project [_1#704 AS book#707, _2#705 AS sold_copy#708]
+- *(2) Filter isnotnull(_1#704)
+- *(2) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#704, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#705]
+- Scan[obj#703]
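For context, the 200 in Exchange hashpartitioning(book, 200) is the default spark.sql.shuffle.partitions, and Spark only chooses a broadcast join on its own when it can estimate the small side to be under spark.sql.autoBroadcastJoinThreshold. A DataFrame built from sc.parallelize (note the Scan[obj] leaves) carries no size statistics, so the planner falls back to a sort-merge join. A minimal sketch for inspecting those settings, assuming spark is the notebook's SparkSession:

// Both settings shape the physical plan above.
println(spark.conf.get("spark.sql.shuffle.partitions"))         // 200 -> partition count in Exchange hashpartitioning
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) // 10485760 (10 MB) by default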
Edit: this question resolved my issue:
Broadcast not happening while joining dataframes in Spark 1.6
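The root cause, per that question: sc.broadcast(dfBook) only creates a broadcast variable for use in RDD closures; it never reaches the SQL optimizer. To request a broadcast join, wrap the small DataFrame in the broadcast function from org.apache.spark.sql.functions. A minimal sketch, reusing df and dfBook from above:

import org.apache.spark.sql.functions.broadcast

// Broadcast join hint: ships dfBook to every executor, so the planner
// chooses BroadcastHashJoin and the Exchange/Sort steps disappear.
val totalBookSoldHinted = df.join(broadcast(dfBook), "book")
totalBookSoldHinted.explain(true)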