The documentation for Spark Structured Streaming says that, as of Spark 2.3, all methods on the Spark context available for static DataFrames/Datasets are also available for use with structured streaming DataFrames/Datasets. However, I have yet to run across any examples of this.
Using fully formed SQL is more flexible, expressive, and productive for me than the DSL. In addition, for my use case those SQL statements are already developed and well tested for the static versions. There must be some rework, in particular to use joins in place of correlated subqueries. However, there is still much value in retaining the overall full-bodied SQL structure.
The format I am looking to use is like this hypothetical join:
val tabaDf = spark.readStream(..)
val tabbDf = spark.readStream(..)
val joinSql = """select a.*,
b.productName
from taba a
join tabb b
on a.productId = b.productId
where ..
group by ..
having ..
order by .."""
val joinedStreamingDf = spark.sql(joinSql)
There are a couple of items where it is not clear how to proceed:
Are tabaDf and tabbDf supposed to be defined via spark.readStream? This is my assumption.
How should taba and tabb be declared? Trying to use
tabaDf.createOrReplaceTempView("taba")
tabbDf.createOrReplaceTempView("tabb")
results in
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
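To make the intended pattern concrete, here is a minimal sketch of it, assuming Spark 2.3+ running locally. The built-in `rate` source stands in for the real inputs, and the derived columns (`productId`, `eventTime`, `productName`) are hypothetical. The `where`/`group by`/`having`/`order by` clauses are left out because the original elides them. As far as I can tell, the `global_temp` warning comes from the Hive metastore client and does not by itself mean the temp views failed to register, but that is an assumption worth verifying.

```scala
import org.apache.spark.sql.SparkSession

object StreamingSqlJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-sql-join-sketch")
      .master("local[2]")
      .getOrCreate()

    // Stand-in streaming sources (assumption): the real ones would be Kafka, files, etc.
    // The rate source emits two columns: `timestamp` and `value`.
    val tabaDf = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
      .selectExpr("value % 10 AS productId", "timestamp AS eventTime")
    val tabbDf = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
      .selectExpr(
        "value % 10 AS productId",
        "concat('product-', CAST(value % 10 AS STRING)) AS productName")

    // Register the streaming DataFrames under the names used in the SQL text.
    tabaDf.createOrReplaceTempView("taba")
    tabbDf.createOrReplaceTempView("tabb")

    val joinSql =
      """select a.*, b.productName
        |from taba a
        |join tabb b
        |  on a.productId = b.productId""".stripMargin

    // spark.sql over streaming views returns a DataFrame; isStreaming reports
    // whether it is still backed by streaming sources.
    val joinedStreamingDf = spark.sql(joinSql)
    println(s"isStreaming = ${joinedStreamingDf.isStreaming}")

    // Stream-stream joins run in append mode; print results to the console.
    val query = joinedStreamingDf.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination(15000) // run for roughly 15 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```

The only streaming-specific parts are `readStream` on the input side and `writeStream` on the output side; the SQL text itself is the same as it would be for static tables.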
All of the examples I could find use the DSL and/or selectExpr(), like the following from https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
df.selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value")
or using select:
sightingLoc
.groupBy("zip_code", window("start_time", "1 hour"))
.count()
.select(
to_json(struct("zip_code", "window")).alias("key"),
col("count").cast("string").alias("value"))
Are those truly the only options, meaning that the documentation's claim that all methods supported on static DataFrames/Datasets are also supported for streaming is not really accurate? Otherwise, any pointers on how to correct the above issue(s) and use straight-up SQL with streaming would be appreciated.
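For comparison, the windowed count shown above in DSL form can, I believe, be written as plain SQL against a temp view. This is only a sketch under stated assumptions: the `rate` source and the derived `zip_code`/`start_time` columns are hypothetical stand-ins for the real `sightingLoc` stream, and the view name `sighting_loc` and aliases `time_window`/`sightings` are mine, chosen to avoid reusing keyword-like names.

```scala
import org.apache.spark.sql.SparkSession

object StreamingSqlWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-sql-window-sketch")
      .master("local[2]")
      .getOrCreate()

    // Hypothetical stand-in for the sightingLoc stream from the blog example.
    val sightingLoc = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
      .selectExpr("CAST(value % 3 AS STRING) AS zip_code", "timestamp AS start_time")

    sightingLoc.createOrReplaceTempView("sighting_loc")

    // SQL counterpart of groupBy(zip_code, window(start_time, "1 hour")).count()
    // followed by the to_json / cast projection.
    val windowedSql =
      """select to_json(struct(zip_code, time_window)) as key,
        |       cast(sightings as string) as value
        |from (
        |  select zip_code,
        |         window(start_time, '1 hour') as time_window,
        |         count(*) as sightings
        |  from sighting_loc
        |  group by zip_code, window(start_time, '1 hour')
        |) counts""".stripMargin

    val keyedCounts = spark.sql(windowedSql)

    // Aggregations without a watermark cannot use append mode, so use update.
    val query = keyedCounts.writeStream
      .format("console")
      .option("truncate", "false")
      .outputMode("update")
      .start()

    query.awaitTermination(15000) // run for roughly 15 seconds, then shut down
    query.stop()
    spark.stop()
  }
}
```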