I want to know how the trigger time works for Streaming Datasets when performing simple inner joins.
As far as I understand, if no trigger (org.apache.spark.sql.streaming.Trigger) is defined when the query starts, the query triggers as soon as possible, so I should be seeing the results as soon as possible.
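For reference, this is what setting an explicit trigger would look like (a sketch only; Trigger is org.apache.spark.sql.streaming.Trigger, and the placeholders follow the same convention as the rest of my code):

```java
//Sketch: fire a micro-batch every 10 seconds instead of as soon as possible
<dataset>.writeStream()
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .option("checkpointLocation", <local directory checkpoint>)
    .format("parquet")
    .option("path", <local directory to store parquet files>)
    .outputMode("append")
    .start();
```

I am not calling trigger() anywhere, so the default as-soon-as-possible behavior should apply.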
I'm performing a join using 5 Datasets, 4 streaming and one static, all of them from Kafka topics. The topics' records are in Avro and JSON format (each topic has its own format; there are no mixed-format records).
The sink operation is writing the result on a local directory, using Parquet format.
For simplicity, I defined the Datasets to read from the beginning of the topic offsets, like this:
//For Streaming Datasets
sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", <local Kafka server>)
.option("subscribe", <the subscribing topic>)
.option("failOnDataLoss", false)
.option("startingOffsets", "earliest")
.load();
//For the Static Dataset
sparkSession.read()
.format("kafka")
.option("kafka.bootstrap.servers", <local Kafka server>)
.option("subscribe", <the subscribing topic>)
.option("failOnDataLoss", false)
.option("startingOffsets", "earliest")
.load();
Then, to parse the topic records, I use:
//For json records
<dataset>.select(col("value").cast("string"))
.map(buildKafkaJsonMessageParser(), Encoders.STRING())
.select(col("value").as("<alias>"))
Where buildKafkaJsonMessageParser just transforms the record into a string and ignores the magic bytes for the schema id.
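The magic-byte handling is equivalent to this plain Java helper (a sketch, assuming the records use the Confluent wire format of one 0x0 magic byte followed by a 4-byte schema id; the class and method names here are illustrative, not my actual buildKafkaJsonMessageParser):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KafkaJsonParser {

    /**
     * Strips the Confluent wire-format header (1 magic byte, 0x0, followed by
     * a 4-byte schema id) and decodes the remaining payload as a UTF-8 string.
     * Records without the header are decoded as-is.
     */
    public static String stripMagicBytes(byte[] value) {
        if (value.length > 5 && value[0] == 0x0) {
            return new String(Arrays.copyOfRange(value, 5, value.length),
                              StandardCharsets.UTF_8);
        }
        return new String(value, StandardCharsets.UTF_8);
    }
}
```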
//For Avro records
<dataset>.select(functions.from_avro(col("value"), joinedRecoStatusAvroSchema).as("<alias>"));
Where functions.from_avro is from the abris library.
Both parsers work; if I just print the topics' content, it is displayed correctly.
Here's my doubt about trigger time:
If I perform a join operation on two Datasets, I can see the Parquet files being written almost immediately:
<left dataset>.join(
<right dataset>,
<join statement>
)
.select(<left dataset>.col(<some col>), <right dataset>.col(<some col>))
.writeStream()
.option("checkpointLocation", <local directory checkpoint>)
.format("parquet")
.option("path", <local directory to store parquet files>)
.outputMode("append")
.start()
.awaitTermination();
I see the Parquet files being written in less than a minute.
But when I join all of my Datasets, the Parquet files are written only after more than 30 minutes:
<dataset A>.join(
<dataset B>,
<join statement>
)
.join(
<dataset C>,
<join statement>
)
.join(
<dataset D>,
<join statement>
)
.join(
<dataset E>,
<join statement>
)
.select(<references to cols>)
.writeStream()
.option("checkpointLocation", <local directory checkpoint>)
.format("parquet")
.option("path", <local directory to store parquet files>)
.outputMode("append")
.start()
.awaitTermination();
The Parquet files take at least 30 minutes to be written. Why do they take so long?