Does Spark SQL include a table streaming optimization for joins?

Question

Does Spark SQL include a table streaming optimization for joins and, if so, how does it decide which table to stream?

When doing joins, Hive assumes the last table is the largest one. As a join optimization, it attempts to buffer the smaller join tables and stream the last one through. If the last table in the join list is not the largest one, Hive has the /*+ STREAMTABLE(tbl) */ hint, which tells it which table to stream. As of v1.4.1, Spark SQL does not support the STREAMTABLE hint.
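To make the buffer-and-stream idea concrete, here is a minimal plain-Python sketch. It is not Hive or Spark code; the names build_index and streamed_join and the sample tables are made up for illustration. The small tables are held in memory as hash indexes and the large table is read exactly once.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # (join key, payload); a hypothetical row shape


def build_index(rows: Iterable[Row]) -> Dict[int, List[str]]:
    """Buffer one small table in memory, grouped by join key."""
    index = defaultdict(list)
    for key, payload in rows:
        index[key].append(payload)
    return dict(index)


def streamed_join(buffered: List[Iterable[Row]],
                  streamed: Iterable[Row]) -> Iterator[tuple]:
    """Inner-join the streamed table against every buffered table on the key.

    The streamed table is iterated once and never materialized.
    """
    indexes = [build_index(table) for table in buffered]
    for key, payload in streamed:
        partial = [(key, payload)]
        for index in indexes:
            # Extend each partial result with every match from this table.
            partial = [row + (match,)
                       for row in partial
                       for match in index.get(key, [])]
            if not partial:  # no match in this table: drop the row
                break
        yield from partial


users = [(1, "alice"), (2, "bob")]
countries = [(1, "NZ"), (2, "PL")]
# An iterator stands in for the large table that should be streamed.
events = iter([(1, "click"), (2, "view"), (1, "buy"), (3, "noop")])

joined = list(streamed_join([users, countries], events))
```

Memory use here depends only on the buffered tables, which is why it matters that the largest table is the one being streamed.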

This question has been asked for normal RDD processing, outside of Spark SQL, here. That answer does not apply to Spark SQL, where the developer has no explicit control over cache operations.

score 3 · Accepted Answer · answered Aug 21 '15 at 07:50

I looked for an answer to this question some time ago, and all I could come up with was setting the spark.sql.autoBroadcastJoinThreshold parameter, which defaults to 10 MB. Spark will then attempt to automatically broadcast every table whose size is below the limit you set. Join order plays no role for this setting.
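As a rough illustration of why join order does not matter for this setting, here is a toy model in plain Python. It is not Spark's planner code: the function name broadcast_candidates, the table names, and the sizes are assumptions made up for the example, and Spark itself works from its own size estimates.

```python
from typing import Dict, List, Set

# 10 MB, the default mentioned above, expressed in bytes.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def broadcast_candidates(join_order: List[str],
                         size_bytes: Dict[str, int],
                         threshold: int = DEFAULT_THRESHOLD_BYTES) -> Set[str]:
    """Return the tables small enough to be considered for broadcasting.

    Only the estimated size is compared against the threshold; the position
    of a table in join_order is never consulted. A negative threshold
    selects nothing in this model.
    """
    return {table for table in join_order
            if 0 <= size_bytes[table] <= threshold}


# Hypothetical size estimates in bytes.
sizes = {
    "orders": 5 * 1024 ** 3,   # 5 GB
    "countries": 2 * 1024,     # 2 KB
    "currencies": 1024 ** 2,   # 1 MB
}

first = broadcast_candidates(["orders", "countries", "currencies"], sizes)
second = broadcast_candidates(["currencies", "countries", "orders"], sizes)
```

Both calls select the same two small tables, regardless of where the large table appears in the join list.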

If you are interested in further improving join performance, I highly recommend this presentation.

score 1 · Answer 2 · answered Jan 23 '18 at 13:57

This answer is based on the upcoming Spark 2.3 (RC2 is being voted on for the next release).

Quoting the question: "As of v1.4.1, Spark SQL does not support the STREAMTABLE hint."

Spark 2.3, the latest version (voted on and due for release soon), does not support it either.

There is no support for the STREAMTABLE hint, but given the recent change that built a hint framework (SPARK-20857, "Generic resolved hint node"), adding one should be fairly easy.

You'd have to write some Spark optimizations and possibly physical plan(s) that would support STREAMTABLE (which seems like a lot of work) but it's possible. The tools are there.

Regarding join optimizations, in the upcoming Spark 2.3 there are two main logical optimizations:

ReorderJoin
CostBasedJoinReorder (exclusively for cost-based optimization)
