I have a folder on HDFS like below containing ORC files:
/path/to/my_folder
It contains partitions:
/path/to/my_folder/dt=20190101
/path/to/my_folder/dt=20190103
/path/to/my_folder/dt=20190103
...
Now I need to process the data here using streaming.
A spark.readStream.format("orc").load("/path/to/my_folder")
nicely works.
However, I do not want to process the whole table, but rather only start from a certain partition onwards similar to a certain kafka offset.
How can this be implemented? I.e. how can I specify the initial state where to read from.
Spark Structured Streaming File Source Starting Offset claims that there is no such feature.
Their suggestion to use: latestFirst
is not desirable for my use-case, as I do not aim to build an always-on streaming application, but rather use Trigger.Once
like a batch job with the nice streaming semantics of duplicate reduction and handling of late-arriving data
If this is not available, what would be a suitable workaround?
edit
Run warn-up stream with option("latestFirst", true) and option("maxFilesPerTrigger", "1") with checkpoint, dummy sink and huge processing time. This way, warm-up stream will save latest file timestamp to checkpoint.
Run real stream with option("maxFileAge", "0"), real sink using the same checkpoint location. In this case stream will process only newly available files. https://stackoverflow.com/a/51399134/2587904
building from this idea, let's look at an example:
# in bash
rm -rf data
mkdir -p data/dt=20190101
echo "1,1,1" >> data/dt=20190101/1.csv
echo "1,1,2" >> data/dt=20190101/2.csv
mkdir data/dt=20190102
echo "1,2,1" >> data/dt=20190102/1.csv
echo "1,2,2" >> data/dt=20190102/2.csv
mkdir data/dt=20190103
echo "1,3,1" >> data/dt=20190103/1.csv
echo "1,3,2" >> data/dt=20190103/2.csv
mkdir data/dt=20190104
echo "1,4,1" >> data/dt=20190104/1.csv
echo "1,4,2" >> data/dt=20190104/2.csv
spark-shell --conf spark.sql.streaming.schemaInference=true
// from now on in scala
val df = spark.readStream.csv("data")
df.printSchema
val query = df.writeStream.format("console").start
query.stop
// cleanup the data and start from scratch.
// this time instead of outputting to the console, write to file
val query = df.writeStream.format("csv")
.option("path", "output")
.option("checkpointLocation", "checkpoint")
val started = query.start
// in bash
# generate new data
mkdir data/dt=20190105
echo "1,5,1" >> data/dt=20190105/1.csv
echo "1,5,2" >> data/dt=20190105/2.csv
echo "1,4,3" >> data/dt=20190104/3.csv
// in scala
started.stop
// cleanup the output, start later on with custom checkpoint
//bash: rm -rf output/*
val started = query.start
// bash
echo "1,4,3" >> data/dt=20190104/4.csv
started.stop
// *****************
//bash: rm -rf output/*
Everything works as intended. The operation picks up where the checkpoint marks the last processed file.
How can a checkpoint definition be generated by hands such as all files in dt=20190101
and dt=20190102
have been processed and no late-arriving data is tolerated there anymore and the processing shall continue with all the files from dt=20190103
onwards?
I see that spark generates:
- commits
- metadata
- offsets
- sources
- _spark-metadata
files and folders.
So far I only know that _spark-metadata
can be ignored to set an initial state / checkpoint.
But have not yet figured out (from the other files) which minimal values need to be present so processing picks up from dt=20190103
onwards.
edit 2
By now I know that:
- commits/0 needs to be present
- metadata needs to be present
- offsets needs to be present
but can be very generic:
v1
{"batchWatermarkMs":0,"batchTimestampMs":0,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
When I tried to remove one of the already processed and committed files from sources/0/0
, the query still runs but: not only the new data is processed larger than the existing committed data, but any data, in particular, the files I just removed from the log.
How can I change this behavior to only process data more current than the initial state?
edit 3
The docs (https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html) but also javadocs ;) list getOffset
The maximum offset (getOffset) is calculated by fetching all the files in path excluding files that start with _ (underscore).
That sounds interesting, but so far I have not figured out how to use it to solve my problem.
Is there a simpler way to achieve the desired functionality besides creating a custom (copy) of the FileSource?
maxFileAge
also sounds interesting.
I have started to work on a custom file stream source. However, fail to properly instanciate it. https://gist.github.com/geoHeil/6c0c51e43469ace71550b426cfcce1c1 When calling:
val df = spark.readStream.format("org.apache.spark.sql.execution.streaming.StatefulFileStreamSource")
.option("partitionState", "/path/to/data/dt=20190101")
.load("data")
The operation fails with:
InstantiationException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource
at java.lang.Class.newInstance(Class.java:427)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:196)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
... 53 elided
Caused by: java.lang.NoSuchMethodException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.newInstance(Class.java:412)
... 59 more
Even though it is basically a copy of the original source. What is different? Why is the constructor not found from https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L196? But it works just fine for https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L42
Even:
touch -t 201801181205.09 data/dt=20190101/1.csv
touch -t 201801181205.09 data/dt=20190101/2.csv
val df = spark.readStream
.option("maxFileAge", "2d")
.csv("data")
returns the whole dataset and fails to filter to the most k current days.