
I want to know how the trigger time works for Streaming Datasets joined with simple inner joins.
As far as I understand, when the query starts with no org.apache.spark.sql.streaming.Trigger defined, Spark triggers micro-batches as soon as possible, so I should be seeing results as soon as possible.
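For reference, if I wanted a fixed interval instead of the default, I believe I could set one explicitly like this (just a sketch; `df` and all the paths are placeholders, not my actual code):

```java
// Sketch, assuming `df` is a streaming Dataset<Row>: pin micro-batches to a
// fixed interval instead of the default as-soon-as-possible trigger.
import org.apache.spark.sql.streaming.Trigger;

df.writeStream()
        .trigger(Trigger.ProcessingTime("10 seconds")) // one micro-batch every 10s
        .format("parquet")
        .option("path", "<local directory>")
        .option("checkpointLocation", "<checkpoint directory>")
        .outputMode("append")
        .start();
```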
I'm performing a join over 5 Datasets, 4 streaming and one static, all of them read from Kafka topics.
The topics' records are in Avro and JSON format (each topic has its own format; there are no mixed-format records).
The sink writes the result to a local directory in Parquet format.
For simplicity, I defined the Datasets to read from the earliest topic offset, like this:

//For Streaming Datasets
sparkSession.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", <local Kafka server>)
                .option("subscribe", <the subscribing topic>)
                .option("failOnDataLoss", false)
                .option("startingOffsets", "earliest")
                .load();
//For the Static Dataset
sparkSession.read()
                .format("kafka")
                .option("kafka.bootstrap.servers", <local Kafka server>)
                .option("subscribe", <the subscribing topic>)
                .option("failOnDataLoss", false)
                .option("startingOffsets", "earliest")
                .load();

Then, to parse the topic records, I use:

//For JSON records
<dataset>.select(col("value").cast("string"))
                .map(buildKafkaJsonMessageParser(), Encoders.STRING())
                .select(col("value").as("<alias>"))

Where buildKafkaJsonMessageParser just turns the record value into a string, skipping the magic byte and schema-id prefix.
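To illustrate what the parser does, here's a simplified standalone sketch (the class and method names are made up; it assumes the standard Confluent wire format of 1 magic byte plus a 4-byte schema id before the payload):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class WireFormat {
    // Confluent wire format: magic byte (0x0) + 4-byte schema id, then payload.
    static final int HEADER_SIZE = 5;

    // Returns the JSON payload as a string, dropping the header when present.
    static String stripHeader(byte[] value) {
        if (value.length >= HEADER_SIZE && value[0] == 0x0) {
            return new String(value, HEADER_SIZE, value.length - HEADER_SIZE,
                    StandardCharsets.UTF_8);
        }
        return new String(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = "{\"id\":1}".getBytes(StandardCharsets.UTF_8);
        ByteBuffer framed = ByteBuffer.allocate(HEADER_SIZE + payload.length);
        framed.put((byte) 0x0).putInt(42).put(payload); // schema id 42
        System.out.println(stripHeader(framed.array())); // prints {"id":1}
    }
}
```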

//For Avro records
<dataset>.select(functions.from_avro(col("value"), joinedRecoStatusAvroSchema).as("<alias>"));

Where functions.from_avro comes from the ABRiS library.
Both parsers work; if I just print the topics' content, it is displayed correctly.
Here's my doubt about the trigger time:
if I perform a join on two Datasets, I can see the Parquet files being written almost immediately.

<left dataset>.join(
                        <right dataset>,
                        <join statement>    
                )
                .select(<left dataset>.col(<some col>), <right dataset>.col(<some col>))
                .writeStream()
                .option("checkpointLocation", <local directory checkpoint>)
                .format("parquet")
                .option("path", <local directory to store parquet files>)
                .outputMode("append")
                .start()
                .awaitTermination();

I see the Parquet files being written in less than a minute.
But when I join all of my Datasets, the Parquet files are only written after more than 30 minutes:

 <dataset A>.join(
                        <dataset B>,
                        <join statement>
                )
                .join(
                        <dataset C>,
                        <join statement>
                )
                .join(
                        <dataset D>,
                        <join statement>
                )
                .join(
                        <dataset E>,
                        <join statement>
                )
                .select(<references to cols>)
                .writeStream()
                .option("checkpointLocation", <local directory checkpoint>)
                .format("parquet")
                .option("path", <local directory to store parquet files>)
                .outputMode("append")
                .start()
                .awaitTermination();

The Parquet files are written only after at least 30 minutes.
Why do the Parquet files take so long to be written?

Koedlt

0 Answers