
Is there a way to specify a starting offset for the Spark Structured Streaming file stream source?

I am trying to stream Parquet files from HDFS:

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// enable schema inference for the streaming file source
spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.readStream
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()

As far as I can see, the first run processes all files available under the path and saves the offsets to the checkpoint location; subsequent runs process only new files, i.e. files that pass the age check and are not yet in the seen-files map.

I'm looking for a way to specify a starting offset, a timestamp, or some option so that the first run does not process all the files already available.

Is there such a way?

Mikhail Dubkov

2 Answers


Thanks @jayfah. As far as I found, we can simulate Kafka's 'latest' starting offset with the following trick:

  1. Run a warm-up stream with option("latestFirst", true) and option("maxFilesPerTrigger", "1"), with a checkpoint, a dummy sink, and a huge processing-time trigger. This way the warm-up stream saves the latest file timestamp to the checkpoint.

  2. Run the real stream with option("maxFileAge", "0") and the real sink, using the same checkpoint location. In this case the stream will process only newly available files (see the sketch below).
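
Roughly, the two runs could look like the following. This is only a sketch of the trick above, not tested code: the memory dummy sink, the query name, and the wait-then-stop loop are my own choices; the paths are taken from the question.

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// 1. Warm-up run: latestFirst + maxFilesPerTrigger=1 make the first
// micro-batch commit only the newest file to the checkpoint, and the huge
// trigger interval keeps a second micro-batch from ever starting.
val warmUp = spark.readStream
  .option("latestFirst", true)
  .option("maxFilesPerTrigger", "1")
  .parquet("/tmp/streaming/")
  .writeStream
  .queryName("warmup")
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("memory")                           // dummy sink, writes no files
  .trigger(Trigger.ProcessingTime(365.days))  // "huge processing time"
  .start()

// wait until the first micro-batch has committed, then stop the warm-up
while (warmUp.lastProgress == null) Thread.sleep(1000)
warmUp.stop()

// 2. Real run: same checkpoint location, maxFileAge=0, real sink. Only
// files newer than the one recorded by the warm-up run are processed.
spark.readStream
  .option("maxFileAge", "0")
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()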

Most probably this is not necessary for production, and there is a better way, e.g. reorganizing the data paths (sketched below), but at least this is the answer I found to my question.
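
For the path-reorganization alternative, one hypothetical layout is to land incoming files in per-day directories and point the stream only at the current day, so that older files are simply never visible to the source. The directory naming here is an assumption, and the query would need a restart when the date rolls over:

// Hypothetical layout: writers drop new files into per-day directories,
// e.g. /tmp/streaming/date=2018-12-20/part-0001.parquet
val today = java.time.LocalDate.now.toString  // e.g. "2018-12-20"

spark.readStream
  .parquet(s"/tmp/streaming/date=$today/")    // historical days are never seen
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()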

Mikhail Dubkov
  • I am facing a similar issue: whenever I add new files to the readStream directory, the previous file gets processed. Say I add file 1 first; then nothing gets processed and nothing is written to HDFS. The next time, when I add file 2, file 1 is picked up and reflected in HDFS. Any idea how to resolve this? – BigD Jan 11 '19 at 22:47
  • @Mikhail, thank you for your solution. It works for me. My challenge was having 4M files to process whenever the job started; this put stress on my driver memory, and I had to create the checkpoints manually (did not pursue that much). I created two programs (one with Trigger.Once() to load a batch over a range for historical data, and another for streaming). I was searching for a solution to set the starting offset, and your solution seems to have worked. Any reason why it was not marked as the accepted solution? Is there any other way? – venBigData Oct 17 '19 at 09:58

The FileStreamSource has no option to specify a starting offset.

But you could set the latestFirst option to true so that it processes the latest files first (this option is false by default).

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources

spark.readStream
  .option("latestFirst", true)  // process the most recent files first
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()
bp2010