Why does using cache on streaming Datasets fail with "AnalysisException: Queries with streaming sources must be executed with writeStream.start()"?

Question

SparkSession
  .builder
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "C:/tmp/spark")
  .config("spark.sql.streaming.checkpointLocation", "C:/tmp/spark/spark-checkpoint")
  .appName("my-test")
  .getOrCreate
  .readStream
  .schema(schema)
  .json("src/test/data")
  .cache
  .writeStream
  .start
  .awaitTermination

Running this sample on Spark 2.1.0 produces an error. Without .cache it works as intended, but with .cache I get:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[src/test/data]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65)
at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89)
at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479)
at org.apache.spark.sql.Dataset.cache(Dataset.scala:2489)
at org.me.App$.main(App.scala:23)
at org.me.App.main(App.scala)

Any idea?

Sorry, but I don't think that simply not using cache is the solution. — Martin Brisiak, May 28 '17 at 07:42
Martin, feel free to join the discussion in the comments on [SPARK-20927](https://issues.apache.org/jira/browse/SPARK-20927?focusedCommentId=16334363&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16334363) about the need for caching in streaming computations — mathieu, Jan 23 '18 at 16:28

score 21 · Accepted Answer · edited Jun 20 '20 at 09:12

Your (very interesting) case boils down to the following line (that you can execute in spark-shell):

scala> :type spark
org.apache.spark.sql.SparkSession

scala> spark.readStream.text("files").cache
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[files]
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
  at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
  at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
  at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
  at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
  at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
  at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
  at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
  at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
  ... 48 elided

The reason turns out to be quite simple to explain (no pun on Spark SQL's explain intended).

spark.readStream.text("files") creates a so-called streaming Dataset.

scala> val files = spark.readStream.text("files")
files: org.apache.spark.sql.DataFrame = [value: string]

scala> files.isStreaming
res2: Boolean = true

Streaming Datasets are the foundation of Spark SQL's Structured Streaming.
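Since isStreaming tells the two kinds of Dataset apart, shared code can use it to avoid calling cache on a streaming one. The following is only a minimal sketch: CacheGuard and cacheIfBatch are hypothetical names, not part of Spark, and it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.Dataset

object CacheGuard {
  // Hypothetical helper, not a Spark API: cache batch Datasets and
  // hand streaming Datasets back untouched, so cache is never called on them.
  def cacheIfBatch[T](ds: Dataset[T]): Dataset[T] =
    if (ds.isStreaming) ds else ds.cache()
}
```

With the files Dataset from above, CacheGuard.cacheIfBatch(files) would simply return files, because files.isStreaming is true.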

As you may have read in Structured Streaming's Quick Example:

And then start the streaming computation using start().

Quoting the scaladoc of DataStreamWriter's start:

start(): StreamingQuery Starts the execution of the streaming query, which will continually output results to the given path as new data arrives.

So, you have to use start (or foreach) to start the execution of the streaming query. You knew it already.

But...there are Unsupported Operations in Structured Streaming:

In addition, there are some Dataset methods that will not work on streaming Datasets. They are actions that will immediately run queries and return results, which does not make sense on a streaming Dataset.

If you try any of these operations, you will see an AnalysisException like "operation XYZ is not supported with streaming DataFrames/Datasets".

That looks familiar, doesn't it?

cache is not in the list of the unsupported operations, but that's because it has simply been overlooked (I reported SPARK-20927 to fix it).

cache should have been in the list as it does execute a query before the query gets registered in Spark SQL's CacheManager.

Let's go deeper into the depths of Spark SQL...hold your breath...

cache is just persist, and persist asks the current CacheManager to cache the query:

sparkSession.sharedState.cacheManager.cacheQuery(this)

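While caching a query, CacheManager requests its physical plan as if it were a batch query:

sparkSession.sessionState.executePlan(planToCache).executedPlan

which is not allowed for a streaming query. As the stack trace shows, executedPlan leads to assertSupported and checkForBatch, and only start (or foreach) may set a streaming query running.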
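The call chain in the stack trace (cache, cacheQuery, executedPlan, assertSupported, checkForBatch, foreachUp, throwError) can be mimicked with a toy model. Everything in this sketch (Plan, StreamingSource, BatchSource, Project, ToyAnalysisException, ToyChecker) is made up for illustration and is not Spark's code. It only shows why a batch-only check that walks the plan trips over a streaming source.

```scala
// Toy stand-ins for a logical plan tree; these are NOT Spark classes.
sealed trait Plan {
  def children: Seq[Plan]
  def isStreamingSource: Boolean = false
}
final case class StreamingSource(name: String) extends Plan {
  def children: Seq[Plan] = Nil
  override def isStreamingSource: Boolean = true
}
final case class BatchSource(name: String) extends Plan {
  def children: Seq[Plan] = Nil
}
final case class Project(child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
}

final class ToyAnalysisException(message: String) extends Exception(message)

object ToyChecker {
  // Visit children first, then the node itself (bottom-up traversal).
  def foreachUp(plan: Plan)(f: Plan => Unit): Unit = {
    plan.children.foreach(child => foreachUp(child)(f))
    f(plan)
  }

  // Batch-only check: any streaming source anywhere in the plan is an error.
  def checkForBatch(plan: Plan): Unit = foreachUp(plan) { node =>
    if (node.isStreamingSource)
      throw new ToyAnalysisException(
        "Queries with streaming sources must be executed with writeStream.start()")
  }

  // Stand-in for cache: the plan is checked as a batch query before anything else.
  def cache(plan: Plan): Plan = {
    checkForBatch(plan)
    plan
  }
}
```

In this model, ToyChecker.cache(Project(BatchSource("table"))) returns the plan, while ToyChecker.cache(Project(StreamingSource("files"))) throws, which is the same shape of failure as in the question.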

Problem solved!
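If the real goal is to reuse each micro-batch for several outputs, later Spark releases (2.4 and up, so not the 2.1.0 from the question) offer DataStreamWriter.foreachBatch, inside which each micro-batch is handed over as a regular, non-streaming DataFrame that can be persisted. The following is a hedged sketch rather than a recommendation: PerBatchCache, writeTwice and the three path parameters are hypothetical, and JSON output is an arbitrary choice.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

object PerBatchCache {
  // Hypothetical helper: writes every micro-batch to two caller-supplied paths.
  def writeTwice(
      stream: DataFrame,
      pathA: String,
      pathB: String,
      checkpointDir: String): StreamingQuery = {
    // Typed as a Scala function value so the Scala overload of foreachBatch is chosen.
    val handler: (DataFrame, Long) => Unit = (batch, batchId) => {
      batch.persist()                          // batch is a plain DataFrame here
      batch.write.mode("append").json(pathA)   // first use of the cached batch
      batch.write.mode("append").json(pathB)   // second use of the cached batch
      batch.unpersist()
    }
    stream.writeStream
      .option("checkpointLocation", checkpointDir)
      .foreachBatch(handler)
      .start()
  }
}
```

The streaming Dataset itself is still never cached. Only the per-batch DataFrame is, and the query is still started with start.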

I thought this was a bug, so I had already reported it earlier: https://issues.apache.org/jira/browse/SPARK-20865. I just needed to confirm my thoughts. Thanks. — Martin Brisiak, May 31 '17 at 05:26
A link to master is not really relevant because the target code can change. And I think that's what happened with your links. — crak, Aug 30 '17 at 10:38
@crak Correct. I should not have used master for the links. What do you think would be better? I've seen links to specific versions in the past, but can't figure out how to do it on GitHub today. Would you mind offering some help? I'd appreciate it. — Jacek Laskowski, Aug 30 '17 at 10:44
I was wondering too, but because your link has no value once the code changes, I would recommend targeting a specific commit. Your post was written at a particular point in time, so it may not be relevant in future Spark versions. I'm not shocked that your post is true only as of a specific date. — crak, Aug 30 '17 at 12:24
I don't really get the "problem solved"... cache() can make sense: caching intermediate mini-batches, for various reasons: large mini-batches, or complex computations reused in many subsequent queries. So cache() may be relevant, and adding it to UnsupportedOperations does not solve anything (except the clarity of the error message...) — mathieu, Jan 23 '18 at 16:08
@mathieu OK. You're right. It could be supported, but it's not, by design. Please note that Spark 2.3 is coming with another streaming engine (which may change things regarding caching). — Jacek Laskowski, Jan 23 '18 at 18:06
@Jacek Are you still interested in how to link to something other than master on GitHub? In case you are, here it is. Look at the top of the page that has the link to the 'master' version of the file. You will see the full path; to the right are two buttons that read 'Find File' and 'Copy Path'. To the left of the path (which is a series of clickable links to parent directories), you will see a button with the text 'master'. Click on that and you will see a drop-down that lets you pick a specific tag or branch to link to. Just choose one of those for a less 'volatile' link. — Chris Bedford, Jun 11 '19 at 02:09
