I am experimenting with checkpointing in Spark Structured Streaming for learning purposes, but I have limited visibility into how the internals work. I am reading from a socket:
val lines: DataFrame = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 12345)
.load()
and then run some stateful operations on it, which is why I need checkpointing.
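For concreteness, here is a sketch of the stateful query I have in mind; the word-count aggregation is just a placeholder for my actual logic:

import org.apache.spark.sql.functions._

// Placeholder stateful operation: a running word count over the socket stream.
// Aggregations like this keep state across micro-batches, hence the checkpoint.
val counts = lines
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()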
Q1. When I use a path on my local filesystem as the checkpoint location, the query cannot read the checkpoint back on restart and fails with:
This query does not support recovering from checkpoint location. Delete src/testC/offsets to start over.;
Every run of the query creates a fresh checkpoint instead of resuming from the old one. How can I use my local filesystem as the checkpoint location for testing/experimenting purposes?
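For reference, this is roughly the failing setup; the console sink and complete output mode are assumptions on my part, and src/testC is my checkpoint directory (as in the error above):

// Sketch: checkpointing to the local filesystem.
val query = counts.writeStream
  .format("console")
  .outputMode("complete")
  .option("checkpointLocation", "src/testC")
  .start()
query.awaitTermination()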
(So I moved on to HDFS.)
Q2. When I use HDFS for the checkpoint, the checkpoint is still created on my local filesystem instead of HDFS. How do I make it checkpoint to HDFS? (I did pass the HDFS config, by the way.)
df.writeStream
  .option("checkpointLocation", "/mycheckLoc")
  .option("hdfs_url", "hdfs://localhost:9000/hdoop")
  .option("web_hdfs_url", "webhdfs://localhost:9870/hdoop")
Q3. Do we need to provide a checkpoint location in every df.writeStream's options, or can we instead set it once with spark.sparkContext.setCheckpointDir(checkpointLocation)?
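To make the question concrete, these are the two mechanisms I am comparing (paths are placeholders; I also found the spark.sql.streaming.checkpointLocation conf, which looks related):

// Mechanism A: RDD-level checkpoint directory, set once on the SparkContext.
spark.sparkContext.setCheckpointDir("hdfs://localhost:9000/rddCheckpoint")

// Mechanism B: a default base directory for all streaming queries,
// instead of repeating checkpointLocation on every writeStream.
spark.conf.set("spark.sql.streaming.checkpointLocation",
  "hdfs://localhost:9000/streamCheckpointBase")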