
I have multiple Spark Structured Streaming jobs, and the usual behaviour I see is that a new batch is triggered only when there are new offsets in the Kafka topic used as the source of the streaming query.

But when I run this example, which demonstrates arbitrary stateful operations using mapGroupsWithState, I see that a new batch is triggered even if there is no new data in the streaming source. Why is that, and can it be avoided?

Update-1: I modified the example code above and removed the state-related operations (updating/removing state). The function simply outputs zero. Yet a batch is still triggered every 10 seconds even with no new data on the netcat server.

import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming._

object Stateful {

  def main(args: Array[String]): Unit = {

    val host = "localhost"
    val port = "9999"

    val spark = SparkSession
      .builder
      .appName("StructuredSessionization")
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .option("includeTimestamp", true)
      .load()

    // Split the lines into words, treat words as sessionId of events
    val events = lines
      .as[(String, Timestamp)]
      .flatMap { case (line, timestamp) =>
        line.split(" ").map(word => Event(sessionId = word, timestamp))
      }

    val sessionUpdates = events
      .groupByKey(event => event.sessionId)
      .mapGroupsWithState[SessionInfo, Int](GroupStateTimeout.ProcessingTimeTimeout) {

        case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
          0
      }

    val query = sessionUpdates
      .writeStream
      .outputMode("update")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .format("console")
      .start()

    query.awaitTermination()
  }
}

case class Event(sessionId: String, timestamp: Timestamp)

case class SessionInfo(
                        numEvents: Int,
                        startTimestampMs: Long,
                        endTimestampMs: Long)
  • Whatever the case, batch 0 will always be triggered even if there is no data in Kafka. Any subsequent batches should not be triggered without new data in Kafka. – Gopal Tiwari Jul 08 '20 at 11:19
  • Yes, but in this example new batches are being triggered even if there is no new data in the streaming source. Not just batch zero but subsequent batches too. – conetfun Jul 08 '20 at 11:20
  • Strange, can you paste your code sample? – Gopal Tiwari Jul 08 '20 at 11:22
  • It is the same example that I am trying to run, which I mentioned in the description. I mentioned Kafka by mistake; the streaming source is a netcat server. But with Kafka as the source, the behaviour is the same. – conetfun Jul 08 '20 at 11:40
  • I think this is because of the Trigger. You can try without using it. – Gopal Tiwari Jul 09 '20 at 07:25

1 Answer


The reason the empty batches show up is the use of timeouts within the mapGroupsWithState call.

The book "Learning Spark 2.0" explains:

"The next micro-batch will call the function on this timed-out key even if there is no data for that key in that micro-batch. [...] Since the timeouts are processed during the micro-batches, the timing of their execution is imprecise and depends heavily on the trigger interval [...]."

Since you have set the timeout to GroupStateTimeout.ProcessingTimeTimeout, it aligns with your query's trigger interval of 10 seconds. An alternative would be to base the timeout on event time instead (i.e. GroupStateTimeout.EventTimeTimeout).
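If those timeout-driven batches are unwanted, another option (a sketch based on the question's own code, not a tested fix) would be to configure no state timeout at all, so Spark has no timed-out keys to process between arrivals of new data:

```scala
// Sketch: the same query as in the question, but with NoTimeout configured,
// so no micro-batch should fire purely to process timed-out keys.
val sessionUpdates = events
  .groupByKey(event => event.sessionId)
  .mapGroupsWithState[SessionInfo, Int](GroupStateTimeout.NoTimeout) {
    case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
      0
  }
```

The trade-off is that without a timeout, stale per-key state is never expired automatically and must be removed explicitly inside the function.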

The ScalaDocs on GroupState provide some more details:

When the timeout occurs for a group, the function is called for that group with no values, and GroupState.hasTimedOut() set to true.
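For illustration, the timed-out path could be handled inside the function roughly like this (a sketch assuming ProcessingTimeTimeout is configured as in the question; the 30-second duration is an arbitrary example value):

```scala
// Sketch of the state-update function: a timed-out key arrives with an
// empty iterator and state.hasTimedOut == true.
case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
  if (state.hasTimedOut) {
    state.remove()                          // drop the expired session state
    0
  } else {
    state.setTimeoutDuration("30 seconds")  // (re)arm the processing-time timeout
    events.length                           // e.g. number of events seen this batch
  }
```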

  • Mike, as I mentioned, this behaviour is only observed when using mapGroupsWithState. Without it, even with the trigger in place, the batch id doesn't increment. A batch is still triggered every 10s, but you will see the same batch id every 10s if there is no new data. – conetfun Oct 09 '20 at 12:35
  • Spark version is 2.4.6 – conetfun Oct 09 '20 at 12:44
  • Maybe I need to phrase the question better and be more specific in describing the situation. What I meant by triggering a "new" batch was the batch id getting incremented without new data. – conetfun Oct 09 '20 at 12:46
  • Thanks @mike. That makes sense. If you can update your answer, I can accept it as the solution for posterity. – conetfun Oct 09 '20 at 12:53