I'm running Spark in a local environment (.master("local")) with two Kafka topics, each containing a single message, and performing a join between them. Each job run takes more than 15 minutes.
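For context, firstTopic and secondTopic are plain Kafka source streams, read roughly like this (a sketch of my setup, assuming a SparkSession named spark; the broker address and topic name are placeholders, not my exact values):

Dataset<Row> firstTopic = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
        .option("subscribe", "first-topic")                   // placeholder topic name
        .option("startingOffsets", "earliest")
        .load(); // exposes key, value, timestamp, ... used below
// secondTopic is read the same way from the other topic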
The time is spent inside the trigger handler call, in Spark's ProcessingTimeExecutor (excerpt; my annotations in the comments):

override def execute(triggerHandler: () => Boolean): Unit = {
  while (true) {
    val triggerTimeMs = clock.getTimeMillis
    val nextTriggerTimeMs = nextBatchTime(triggerTimeMs)
    val terminated = !triggerHandler() // <- runs the job
    if (intervalMs > 0) {
      val batchElapsedTimeMs = clock.getTimeMillis - triggerTimeMs // <- more than 15 minutes
      if (batchElapsedTimeMs > intervalMs) {
        notifyBatchFallingBehind(batchElapsedTimeMs)
      }
      // ...
If I just print the content of one of the topics, the job takes less than a second.
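By "just printing" I mean something like the following (a sketch; the console sink reflects how I test it, details may differ):

firstTopic
        .select(col("value"))
        .writeStream()
        .format("console")
        .outputMode("append")
        .start();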
My join:
Dataset<Row> leftDataset = firstTopic
        .select(col("value").as("value"), col("timestamp").as("timestamp"))
        .withColumn("leftId", lit("1")) // constant join key, both sides always match
        .select(
                col("value").as("value"),
                col("timestamp").as("leftTimestamp"),
                col("leftId"))
        // watermark on the event-time column used in the join condition
        .withWatermark("leftTimestamp", "2 minutes");

Dataset<Row> rightDataset = secondTopic
        .select(col("value").as("value"), col("timestamp").as("timestamp"))
        .withColumn("rightId", lit("1"))
        .select(
                col("value").as("value"),
                col("timestamp").as("rightTimestamp"),
                col("rightId"))
        .withWatermark("rightTimestamp", "2 minutes");

return leftDataset.join(
        rightDataset,
        expr("leftId = rightId " +
                "AND leftTimestamp < rightTimestamp " +
                "AND rightTimestamp <= leftTimestamp + interval 1 minute"))
        .select(leftDataset.col("value"));
The rest of the configuration is standard.
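By "standard" I mean roughly the following sink and trigger setup (a sketch; joinedStream stands for the Dataset returned by the join above, and the checkpoint path and trigger interval are placeholders, not my exact values):

// requires org.apache.spark.sql.streaming.Trigger
joinedStream
        .writeStream()
        .format("console")
        .outputMode("append") // stream-stream inner joins run in append mode
        .option("checkpointLocation", "/tmp/join-checkpoint") // placeholder path
        .trigger(Trigger.ProcessingTime("1 minute"))           // placeholder interval
        .start()
        .awaitTermination();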
I assume it isn't an optimization issue, since there are only two topics with a single message each.
Why does the join job take more than 15 minutes when just printing a topic's content takes less than a second? Am I using the watermarks and the join time interval incorrectly?