I'm running Spark in a local environment (.master("local")) with two Kafka topics, each containing a single message, and performing a join between them. Each job run takes more than 15 minutes.
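For context, firstTopic and secondTopic are plain Kafka source streams, read roughly like this (a sketch of my setup, assuming a SparkSession named spark; the broker address and topic name are placeholders, not my exact values):

Dataset<Row> firstTopic = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
        .option("subscribe", "first-topic")                   // placeholder topic name
        .option("startingOffsets", "earliest")
        .load(); // exposes key, value, timestamp, ... used below
// secondTopic is read the same way from the other topic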
The time is spent inside the trigger handler call, in Spark's ProcessingTimeExecutor (excerpt; my annotations in the comments):

override def execute(triggerHandler: () => Boolean): Unit = {
  while (true) {
    val triggerTimeMs = clock.getTimeMillis
    val nextTriggerTimeMs = nextBatchTime(triggerTimeMs)
    val terminated = !triggerHandler() // <- runs the job
    if (intervalMs > 0) {
      val batchElapsedTimeMs = clock.getTimeMillis - triggerTimeMs // <- more than 15 minutes
      if (batchElapsedTimeMs > intervalMs) {
        notifyBatchFallingBehind(batchElapsedTimeMs)
      }
      // ...
If I just print the content of one of the topics, the job takes less than a second.
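By "just printing" I mean something like the following (a sketch; the console sink reflects how I test it, details may differ):

firstTopic
        .select(col("value"))
        .writeStream()
        .format("console")
        .outputMode("append")
        .start();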
My join:
Dataset<Row> leftDataset = firstTopic
        .select(col("value").as("value"), col("timestamp").as("timestamp"))
        .withColumn("leftId", lit("1")) // constant join key, both sides always match
        .select(
                col("value").as("value"),
                col("timestamp").as("leftTimestamp"),
                col("leftId"))
        // watermark on the event-time column used in the join condition
        .withWatermark("leftTimestamp", "2 minutes");

Dataset<Row> rightDataset = secondTopic
        .select(col("value").as("value"), col("timestamp").as("timestamp"))
        .withColumn("rightId", lit("1"))
        .select(
                col("value").as("value"),
                col("timestamp").as("rightTimestamp"),
                col("rightId"))
        .withWatermark("rightTimestamp", "2 minutes");

return leftDataset.join(
        rightDataset,
        expr("leftId = rightId " +
                "AND leftTimestamp < rightTimestamp " +
                "AND rightTimestamp <= leftTimestamp + interval 1 minute"))
        .select(leftDataset.col("value"));
The rest of the configuration is standard.
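By "standard" I mean roughly the following sink and trigger setup (a sketch; joinedStream stands for the Dataset returned by the join above, and the checkpoint path and trigger interval are placeholders, not my exact values):

// requires org.apache.spark.sql.streaming.Trigger
joinedStream
        .writeStream()
        .format("console")
        .outputMode("append") // stream-stream inner joins run in append mode
        .option("checkpointLocation", "/tmp/join-checkpoint") // placeholder path
        .trigger(Trigger.ProcessingTime("1 minute"))           // placeholder interval
        .start()
        .awaitTermination();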
I assume it isn't an optimization issue, since there are only two topics with a single message each.
Why does the join job take more than 15 minutes when just printing a topic's content takes less than a second? Am I using the watermarks and the join time interval incorrectly?