I am experimenting with checkpointing in Spark Structured Streaming for learning purposes, but I have limited visibility into how the internals work. I am reading from a socket:
val lines: DataFrame = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 12345)
.load()
and then run some stateful operations on it, which is why I need checkpointing.
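For concreteness, here is a sketch of the stateful query I have in mind; the word-count aggregation is just a placeholder for my actual logic:

import org.apache.spark.sql.functions._

// Placeholder stateful operation: a running word count over the socket stream.
// Aggregations like this keep state across micro-batches, hence the checkpoint.
val counts = lines
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()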
Q1. When I use a path on my local filesystem as the checkpoint location, the query cannot read the checkpoint back on restart and fails with:
This query does not support recovering from checkpoint location. Delete src/testC/offsets to start over.;
Every run of the query creates a fresh checkpoint instead of resuming from the old one. How can I use my local filesystem as the checkpoint location for testing/experimenting purposes?
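For reference, this is roughly the failing setup; the console sink and complete output mode are assumptions on my part, and src/testC is my checkpoint directory (as in the error above):

// Sketch: checkpointing to the local filesystem.
val query = counts.writeStream
  .format("console")
  .outputMode("complete")
  .option("checkpointLocation", "src/testC")
  .start()
query.awaitTermination()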
(So I moved on to HDFS.)
Q2. When I use HDFS for the checkpoint, the checkpoint is still created on my local filesystem instead of HDFS. How do I make it checkpoint to HDFS? (I did pass the HDFS config, by the way.)
df.writeStream
  .option("checkpointLocation", "/mycheckLoc")
  .option("hdfs_url", "hdfs://localhost:9000/hdoop")
  .option("web_hdfs_url", "webhdfs://localhost:9870/hdoop")
Q3. Do we need to provide a checkpoint location in every df.writeStream's options, or can we instead set it once with spark.sparkContext.setCheckpointDir(checkpointLocation)?
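To make the question concrete, these are the two mechanisms I am comparing (paths are placeholders; I also found the spark.sql.streaming.checkpointLocation conf, which looks related):

// Mechanism A: RDD-level checkpoint directory, set once on the SparkContext.
spark.sparkContext.setCheckpointDir("hdfs://localhost:9000/rddCheckpoint")

// Mechanism B: a default base directory for all streaming queries,
// instead of repeating checkpointLocation on every writeStream.
spark.conf.set("spark.sql.streaming.checkpointLocation",
  "hdfs://localhost:9000/streamCheckpointBase")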