
I have a folder on HDFS like below containing ORC files:

/path/to/my_folder

It contains partitions:

/path/to/my_folder/dt=20190101
/path/to/my_folder/dt=20190102
/path/to/my_folder/dt=20190103
...

Now I need to process this data using streaming. A spark.readStream.format("orc").load("/path/to/my_folder") works nicely.

However, I do not want to process the whole table, but rather start from a certain partition onwards, similar to specifying a starting Kafka offset.

How can this be implemented? I.e. how can I specify the initial state from which to read?

Spark Structured Streaming File Source Starting Offset claims that there is no such feature. Their suggestion to use latestFirst is not desirable for my use case, as I do not aim to build an always-on streaming application, but rather want to use Trigger.Once like a batch job, with the nice streaming semantics of duplicate reduction and handling of late-arriving data.
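For context, a minimal sketch of the intended Trigger.Once style job (the output and checkpoint paths are placeholders):

// in scala
import org.apache.spark.sql.streaming.Trigger

// run like a batch job, but keep the file source's checkpoint-based bookkeeping
val stream = spark.readStream
  .format("orc")
  .load("/path/to/my_folder")

val query = stream.writeStream
  .format("orc")
  .option("path", "/path/to/output")                     // placeholder
  .option("checkpointLocation", "/path/to/checkpoint")   // placeholder
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()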

If this is not available, what would be a suitable workaround?

edit

Run a warm-up stream with option("latestFirst", true) and option("maxFilesPerTrigger", "1"), with a checkpoint, a dummy sink and a huge processing time. This way, the warm-up stream will save the latest file's timestamp to the checkpoint.

Run the real stream with option("maxFileAge", "0") and the real sink, using the same checkpoint location. In this case the stream will process only newly available files. https://stackoverflow.com/a/51399134/2587904
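A hedged sketch of what that could look like (the memory sink as dummy sink, the trigger interval and the crude wait are my assumptions, not part of the linked answer):

// in scala
import org.apache.spark.sql.streaming.Trigger

// 1) warm-up stream: with latestFirst and maxFilesPerTrigger=1, the first micro-batch
//    commits only the newest file, so its timestamp ends up in the checkpoint
val warmUp = spark.readStream
  .format("orc")
  .option("latestFirst", "true")
  .option("maxFilesPerTrigger", "1")
  .load("/path/to/my_folder")
  .writeStream
  .format("memory")
  .queryName("warmup")                                   // dummy sink (assumption)
  .option("checkpointLocation", "/path/to/checkpoint")
  .trigger(Trigger.ProcessingTime("24 hours"))           // huge interval so only one batch runs
  .start()

warmUp.awaitTermination(30 * 1000)  // crude wait; check warmUp.lastProgress before stopping
warmUp.stop()

// 2) real stream: same checkpoint location, real sink; only newly arriving files are processed
val real = spark.readStream
  .format("orc")
  .option("maxFileAge", "0")
  .load("/path/to/my_folder")
  .writeStream
  .format("orc")
  .option("path", "/path/to/output")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()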

Building on this idea, let's look at an example:

# in bash
rm -rf data
mkdir -p data/dt=20190101
echo "1,1,1" >> data/dt=20190101/1.csv
echo "1,1,2" >> data/dt=20190101/2.csv
mkdir data/dt=20190102
echo "1,2,1" >> data/dt=20190102/1.csv
echo "1,2,2" >> data/dt=20190102/2.csv
mkdir data/dt=20190103
echo "1,3,1" >> data/dt=20190103/1.csv
echo "1,3,2" >> data/dt=20190103/2.csv
mkdir data/dt=20190104
echo "1,4,1" >> data/dt=20190104/1.csv
echo "1,4,2" >> data/dt=20190104/2.csv

spark-shell --conf spark.sql.streaming.schemaInference=true

// from now on in scala
val df = spark.readStream.csv("data")
df.printSchema
val query = df.writeStream.format("console").start
query.stop

// cleanup the data and start from scratch. 
// this time instead of outputting to the console, write to file
val query = df.writeStream.format("csv")
    .option("path", "output")
    .option("checkpointLocation", "checkpoint")
val started = query.start


# in bash
# generate new data
mkdir data/dt=20190105
echo "1,5,1" >> data/dt=20190105/1.csv
echo "1,5,2" >> data/dt=20190105/2.csv
echo "1,4,3" >> data/dt=20190104/3.csv

// in scala
started.stop
// cleanup the output, start later on with custom checkpoint
//bash: rm -rf output/*
val started = query.start

# in bash
echo "1,4,3" >> data/dt=20190104/4.csv

// in scala
started.stop

// *****************
//bash: rm -rf output/*

Everything works as intended: the query picks up where the checkpoint marks the last processed file. But how can a checkpoint be generated by hand, such that all files in dt=20190101 and dt=20190102 count as already processed, no late-arriving data is tolerated there anymore, and processing continues with all files from dt=20190103 onwards?

I see that spark generates:

  • commits
  • metadata
  • offsets
  • sources
  • _spark-metadata

files and folders. So far I only know that _spark-metadata can be ignored when setting up an initial state / checkpoint.

But I have not yet figured out which minimal values need to be present in the other files so that processing picks up from dt=20190103 onwards.

edit 2

By now I know that:

  • commits/0 needs to be present
  • metadata needs to be present
  • offsets needs to be present

but the offsets file can be very generic:

v1
{"batchWatermarkMs":0,"batchTimestampMs":0,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}

When I tried to remove one of the already processed and committed files from sources/0/0, the query still runs, but it no longer processes only data newer than the committed state: it processes any data, in particular the files I had just removed from the log.

How can I change this behavior to only process data more current than the initial state?
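Going the other direction, here is a sketch of pre-seeding sources/0/0 by hand with the files that should count as already processed. The "v1" header and the path/timestamp/batchId layout of the entries are assumptions inferred from a checkpoint written by a real run, so verify against one before relying on this:

// in scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// files from dt=20190101 and dt=20190102 that should count as already processed
val alreadyProcessed = Seq(
  "file:///path/to/data/dt=20190101/1.csv",
  "file:///path/to/data/dt=20190101/2.csv",
  "file:///path/to/data/dt=20190102/1.csv",
  "file:///path/to/data/dt=20190102/2.csv")

// one JSON entry per file; timestamp 0 is a placeholder and may interact badly with
// maxFileAge-based purging, so a realistic modification time is probably safer
val entries = alreadyProcessed.map(p => s"""{"path":"$p","timestamp":0,"batchId":0}""")

Files.createDirectories(Paths.get("checkpoint/sources/0"))
Files.write(
  Paths.get("checkpoint/sources/0/0"),
  ("v1" +: entries).asJava,
  StandardCharsets.UTF_8)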

edit 3

The docs (https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html), but also the javadocs ;), list getOffset:

The maximum offset (getOffset) is calculated by fetching all the files in path excluding files that start with _ (underscore).

That sounds interesting, but so far I have not figured out how to use it to solve my problem.

Is there a simpler way to achieve the desired functionality besides creating a custom copy of the FileStreamSource?

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L237

maxFileAge also sounds interesting.

I have started to work on a custom file stream source (https://gist.github.com/geoHeil/6c0c51e43469ace71550b426cfcce1c1), but I fail to properly instantiate it. When calling:

val df = spark.readStream.format("org.apache.spark.sql.execution.streaming.StatefulFileStreamSource")
    .option("partitionState", "/path/to/data/dt=20190101")
    .load("data")

The operation fails with:

InstantiationException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource
  at java.lang.Class.newInstance(Class.java:427)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:196)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
  at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
  at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
  at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
  ... 53 elided
Caused by: java.lang.NoSuchMethodException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource.<init>()
  at java.lang.Class.getConstructor0(Class.java:3082)
  at java.lang.Class.newInstance(Class.java:412)
  ... 59 more

Even though it is basically a copy of the original source, what is different? Why is the constructor not found from https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L196, while it works just fine for https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L42?
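My current working assumption (not verified): DataSource instantiates whatever class is passed to format(...) via its no-argument constructor and expects a StreamSourceProvider, whereas FileStreamSource's multi-argument constructor is only invoked internally for the built-in file formats. A sketch of a provider shim with the required no-arg constructor, delegating to the copied source (class and option names are hypothetical):

// in scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.StructType

// has a no-arg constructor, so format("...StatefulFileStreamSourceProvider") can instantiate it
class StatefulFileStreamSourceProvider extends StreamSourceProvider {

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    ("statefulFile", schema.getOrElse(
      throw new IllegalArgumentException("a schema must be supplied via .schema(...)")))

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = {
    // construct the copied source here, e.g. from parameters("partitionState") and metadataPath
    ???
  }
}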

Even:

# in bash
touch -t 201801181205.09 data/dt=20190101/1.csv
touch -t 201801181205.09 data/dt=20190101/2.csv

// in scala
val df = spark.readStream
  .option("maxFileAge", "2d")
  .csv("data")

returns the whole dataset and fails to filter down to the most recent k days.
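A possible stopgap, sketched here only and presumably without skipping the old partitions at the file-listing level, would be an explicit filter on the partition column (the checkpoint path is hypothetical; dt is assumed to be inferred as an integer partition column):

// in scala
import spark.implicits._

val recent = spark.readStream
  .csv("data")                   // with spark.sql.streaming.schemaInference=true as above
  .where($"dt" >= 20190103)      // keeps only rows from dt=20190103 onwards

recent.writeStream
  .format("console")
  .option("checkpointLocation", "checkpoint-filtered")  // hypothetical, separate checkpoint
  .start()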

  • Not sure if you have already seen the following post: How to create a custom streaming data source? (https://stackoverflow.com/questions/47604184/how-to-create-a-custom-streaming-data-source) – venBigData Oct 03 '19 at 11:52
  • Indeed, I do have read this. But as far as I understand `That's the file that links the short name in format to the implementation.`, this registration is only required to use short names. If instead the whole class name is specified, I should be able to use my custom streaming text file source. But instead there is an issue with the constructor. – Georg Heiler Oct 03 '19 at 11:55

0 Answers