
The documentation for Spark Structured Streaming says that, as of Spark 2.3, all methods available for static DataFrames/Datasets are also available for streaming DataFrames/Datasets. However, I have yet to find any examples of this.

Using fully formed SQL is more flexible, expressive, and productive for me than the DSL. In addition, for my use case those SQL statements are already developed and well tested against the static versions. Some rework is required, in particular to use joins in place of correlated subqueries, but there is still much value in retaining the overall full-bodied SQL structure.

The format I am looking to use is like this hypothetical join:

 val tabaDf = spark.readStream(..)
 val tabbDf = spark.readStream(..)

 val joinSql = """select a.*,
                  b.productName
                  from taba a
                  join tabb b
                  on a.productId = b.productId
                  where ..
                  group by ..
                  having ..
                  order by .."""
 val joinedStreamingDf = spark.sql(joinSql)

There are a couple of items where it is not clear how to proceed:

  • Are tabaDf and tabbDf supposed to be defined via spark.readStream? This is my assumption.

  • How to declare taba and tabb. Trying to use

    tabaDf.createOrReplaceTempView("taba")
    tabbDf.createOrReplaceTempView("tabb")
    

    results in

    WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

All of the examples I could find use the DSL and/or selectExpr(), like the following from https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html

 df.selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value")

or using select :

sightingLoc
  .groupBy("zip_code", window("start_time", "1 hour"))
  .count()
  .select( 
    to_json(struct("zip_code", "window")).alias("key"),
    col("count").cast("string").alias("value")) 

Are those truly the only options, meaning the documentation's claim that all methods supported on static DataFrames/Datasets are available is not really accurate? Otherwise, any pointers on how to correct the above issue(s) and use straight-up SQL with streaming would be appreciated.

WestCoastProjects

1 Answer


The streams need to be registered as temporary views using createOrReplaceTempView. AFAIK createOrReplaceView is not part of the Spark API (perhaps you have something that provides an implicit conversion to a class with such a method).

spark.readStream(..).createOrReplaceTempView("taba")
spark.readStream(..).createOrReplaceTempView("tabb")

Now the views can be accessed using pure SQL. For example, to print the output to console:

spark
  .sql(joinSql)
  .writeStream
  .format("console")
  .start()
  .awaitTermination()
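To keep the whole round trip in SQL, the in-memory sink can also be used so that the streaming result itself becomes queryable by name. This is a sketch under assumptions: the query name "joined" is made up here, and joinSql stands for the join statement from the question.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val joinSql = "SELECT a.*, b.productName FROM taba a JOIN tabb b ON a.productId = b.productId"

// Sketch: write the joined stream to the memory sink; each micro-batch
// accumulates in an in-memory table named by queryName, which can then
// be queried with plain SQL like any other view.
val query = spark
  .sql(joinSql)
  .writeStream
  .format("memory")     // results land in an in-memory table (testing/debugging only)
  .queryName("joined")  // hypothetical name; registers the result table for spark.sql
  .outputMode("append")
  .start()

// The accumulated result is now reachable by name:
spark.sql("SELECT * FROM joined").show()
```

Note the memory sink keeps all results in driver memory, so it is only suitable for development and small result sets.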

Edit: after the question edit, I don't see anything wrong with your code. Here is a minimal working example, assuming a test file /tmp/foo/foo.csv:

"a",1
"b",2
import org.apache.spark.sql.types._
val schema = StructType(Array(StructField("s", StringType), StructField("i", IntegerType)))
spark.readStream
  .schema(schema)
  .csv("/tmp/foo")
  .createOrReplaceTempView("df1")
spark.readStream
  .schema(schema)
  .csv("/tmp/foo")
  .createOrReplaceTempView("df2")

spark.sql("SELECT * FROM df1 JOIN df2 USING (s)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()

outputs

-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+---+
|  s|  i|  i|
+---+---+---+
|  b|  2|  2|
|  a|  1|  1|
+---+---+---+
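The DSL examples quoted in the question can likewise be written as plain SQL once the stream is a temp view. A sketch under assumptions: the view name "sightings" and the registration step are made up here; the column names follow the question's snippet, and the query shape is untested against the original data.

```scala
// Sketch: the question's windowed groupBy/count expressed as SQL.
// Assumes the stream was registered first, e.g.
//   sightingLoc.createOrReplaceTempView("sightings")   // hypothetical view name
val windowedSql = """
  SELECT to_json(struct(zip_code, window)) AS key,
         CAST(count(*) AS STRING) AS value
  FROM sightings
  GROUP BY zip_code, window(start_time, '1 hour')
"""
val windowedDf = spark.sql(windowedSql)
```

Grouping by window(start_time, '1 hour') in SQL produces a window struct column that the SELECT list can reference directly, mirroring the DSL's groupBy(window(...)).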
ollik1
  • Hi Ollik - maybe you missed from the question : registering the views are included already `tabaDf.createOrReplaceView("taba")` and `tabbDf.createOrReplaceView("tabb")` . And then `spark.sql(joinSql)` . Your answer has not added any new code. Do you have something similar to this structure actually working - and if so a sscce would be great – WestCoastProjects Apr 14 '19 at 16:20
  • In particular - there is no mention here about how to fix the `WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException` – WestCoastProjects Apr 14 '19 at 16:27
  • @javadba So in your snippet `createOrReplaceView` was used instead of `createOrReplaceTempView`. Note the *Temp* part. Tried that the snippet works – ollik1 Apr 14 '19 at 17:14
  • That was a typo . My actual code used/uses `createOrReplaceTempView`. But are you saying you have this working? I have a commitment for next 2.5 hours. In the meantime i'll upvote and then get back to running the code again at that time. I'd really like to see this working and to award: will see at that time. – WestCoastProjects Apr 14 '19 at 17:43
  • Added a minimal example to the answer to demonstrate it works – ollik1 Apr 14 '19 at 18:32
  • Dude you're awesome. I wonder why it is that you're not encountering that `WARN`(/issue) that I am: `WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException`. The same code is successful in registration with your in-memory catalog whereas it does ***not*** work in my environment. If you have any thoughts on that .. I'm going to award in the meantime but any further insights on how/why you're getting better results than me would be much appreciated – WestCoastProjects Apr 14 '19 at 20:33
  • After `readStream` do you use any persistence mechanisms such as `load()` and/or `cache()` ? – WestCoastProjects Apr 14 '19 at 20:37
  • @javadba It seems the exception is related to Hive https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L727 so I assume you are using hive context? Maybe the issue is in the metastore configuration. Haven't been using that myself but here's a good tutorial for that https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hive-metastore.html – ollik1 Apr 15 '19 at 05:17
  • No, no _hive_ here. This is a real problem for me. – WestCoastProjects Apr 17 '19 at 07:31