Is there a way to specify a starting offset for the Spark Structured Streaming file source?
I am trying to stream Parquet files from HDFS:
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.readStream
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()
As far as I can see, the first run processes all files detected in the path, then saves the offsets to the checkpoint location and from then on processes only new files, i.e. files whose age is accepted and which are not already in the seen-files map.
I'm looking for a way to specify a starting offset, a timestamp, or some option so that the first run does not process all of the already existing files.
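For context, these are the source options I am aware of that influence which files get picked up (a minimal sketch; the option names come from the file stream source documentation, the values and path are placeholders). As far as I can tell, none of them acts as a true starting offset:

// Sketch: file-source options that affect which files are read, not a real starting offset.
val input = spark.readStream
  .option("maxFileAge", "1h")          // per the docs, files older than this are ignored
  .option("latestFirst", "true")       // process the newest files first
  .option("maxFilesPerTrigger", "100") // cap the number of files per micro-batch
  .parquet("/tmp/streaming/")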
Is there a way to achieve this?