Does Spark SQL include a table streaming optimization for joins and, if so, how does it decide which table to stream?
When doing joins, Hive assumes the last table is the largest one. As a join optimization, it will attempt to buffer the smaller join tables and stream the last one through. If the last table in the join list is not the largest one, Hive has the /*+ STREAMTABLE(tbl) */
hint which tells it the table that should be streamed. As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
This question has been asked for normal RDD processing, outside of Spark SQL, here. The answer does not apply to Spark SQL where the developer has no control of explicit cache operations.