
In my scenario I have several datasets that arrive every now and then and that I need to ingest into our platform. The ingestion process involves several transformation steps, one of them being Spark. In particular, I use Spark Structured Streaming so far. The infrastructure also involves Kafka, from which Spark Structured Streaming reads the data.

I wonder if there is a way to detect when there has been nothing to consume from a topic for a while, in order to decide to stop the job. That is, I want to run it for the time it takes to consume that specific dataset and then stop it. For specific reasons we decided not to use the batch version of Spark.

Hence, is there any timeout or similar mechanism that can be used to detect that there is no more data coming in and that everything has been processed?

Thank you

MaatDeamon
  • The problem with Trigger.Once is that it will try to load all the data at once into the cluster before processing it, which basically amounts to using Spark's batch mode. We want results to be available as soon as micro-batches of data are processed – MaatDeamon Sep 25 '18 at 12:53
  • I'm not sure what kafkaConsumer.pollTimeoutMs does exactly? – MaatDeamon Sep 25 '18 at 12:54
  • Why do you want to stop the job? Do you want to stop the cluster to save money? – Michael West Sep 25 '18 at 13:16
  • (1) Yes, money. (2) Stats: management wants to keep statistics on how long each dataset takes to be fully ingested, identifying how long each step of the pipeline takes. (3) Orchestration: our pipeline is 3/4 streaming, 1/4 batch. Before launching the final batch process that closes the pipeline, we want to make sure that the whole streaming part is over. We could turn the late part into streaming, but that would require a lot of work that we don't want to tackle now – MaatDeamon Sep 25 '18 at 13:21
  • @MaatDeamon What did you end up doing in this scenario? – Dude0001 Jul 11 '21 at 12:12

3 Answers


Structured Streaming Monitoring Options

You can use query.lastProgress to get the timestamp and build logic around that. Don't forget to save your checkpoint to a durable, persistent, available store.
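For illustration, here is a minimal sketch of what that logic might look like (the helper name stopWhenIdle, the idle timeout, and the one-second poll interval are all assumptions, not part of the original answer): it polls query.lastProgress on the driver and stops the query once no rows have arrived for a configurable idle period.

import org.apache.spark.sql.streaming.StreamingQuery

// Sketch: stop the query after `idleTimeoutMs` ms without any input rows.
def stopWhenIdle(query: StreamingQuery, idleTimeoutMs: Long): Unit = {
  var lastDataSeen = System.currentTimeMillis()
  while (query.isActive) {
    val progress = query.lastProgress            // null until the first batch
    if (progress != null && progress.numInputRows > 0) {
      lastDataSeen = System.currentTimeMillis()
    }
    if (System.currentTimeMillis() - lastDataSeen > idleTimeoutMs) {
      query.stop()                               // idle long enough: shut down
    } else {
      Thread.sleep(1000)                         // poll once per second
    }
  }
}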

Michael West
  • Thanks for the input. Just for the record, the method query.lastProgress has to be called in a multi-threaded context, doesn't it? Meaning, if a streaming query is running, one will probably have called awaitTermination, so everything after that won't be executed unless the query terminates. Hence I always wondered how this method has to be called. Just want to confirm. – MaatDeamon Sep 25 '18 at 14:27
  • Not sure how to use query.lastProgress with query.awaitTermination. You may need to consider the asynchronous StreamingQueryListener object. – Michael West Sep 25 '18 at 16:40
  • Check out the end of this code from Spark: The Definitive Guide. It shows how to write the status of the stream to Kafka. Cool! [code](https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/code/Streaming-Chapter_23_Structured_Streaming_in_Production.scala) – Michael West Sep 25 '18 at 16:51
  • Which chapter is it? – MaatDeamon Sep 25 '18 at 18:00
  • Chapter 23 - Structured Streaming in Production – Michael West Sep 25 '18 at 18:34
  • Thanks will have a look – MaatDeamon Sep 25 '18 at 18:46
  • @MichaelWest do we really have to save the checkpoint somewhere? The Kafka offsets for the topic would already be saved in Kafka for the subscription, so why else would we need the Spark streaming checkpoint file? – rogue-one Jun 03 '19 at 14:52
  • @MaatDeamon I am stuck at the same point. I am not able to figure out how to use lastProgress or such monitoring options of Spark. Any idea how to do it? – Gagan Oct 22 '19 at 05:40
  • Unfortunately I could not figure out an easy way – MaatDeamon Oct 22 '19 at 17:13
  • In general the only way to do that is to have some command topic and/or a special record with a specific meaning. However, with Spark that might be convoluted. Depending on what you are doing, I would rather switch to Kafka Streams; it is more flexible. – MaatDeamon Oct 22 '19 at 17:15

Putting together a couple of pieces of advice:

  1. As @Michael West pointed out, there are listeners to track progress
  2. From what I gather, Structured Streaming doesn't yet support graceful shutdown

So one option is to periodically check for query activity, dynamically shutting down depending on a configurable state (when you determine no further progress can/should be made):

// where you configure your spark job...
spark.streams.addListener(shutdownListener(spark))

// your job code starts here by calling "start()" on the stream...

// periodically await termination, checking for your shutdown state
while(!spark.sparkContext.isStopped) {
  if (shutdown) {
    println(s"Shutting down since first batch has completed...")
    spark.streams.active.foreach(_.stop())
    spark.stop()
  } else {
    // wait 10 seconds before checking again if work is complete
    spark.streams.awaitAnyTermination(10000)
  }
}

Your listener can trigger a shutdown dynamically in a variety of ways. For instance, if you're only waiting on a single batch, then just shut down after the first update:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

var shutdown = false
def shutdownListener(spark: SparkSession) = new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = println("Query started: " + event.id)
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = println("Query terminated! " + event.id)
  override def onQueryProgress(event: QueryProgressEvent): Unit = shutdown = true
}

Or, if you need to shut down after more complicated state changes, you could inspect the fields (or parse the JSON body) of event.progress to determine whether or not to shut down each time onQueryProgress fires.
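As a sketch of that idea (not from the original answer; the listener name and the threshold of three consecutive empty micro-batches are arbitrary assumptions), the listener can count empty batches via event.progress.numInputRows and only flip the flag once the topic looks drained:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Variant of the listener above: only request shutdown after three
// consecutive empty micro-batches.
def idleShutdownListener(spark: SparkSession) = new StreamingQueryListener() {
  private var emptyBatches = 0
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // event.progress carries per-batch metrics such as numInputRows
    if (event.progress.numInputRows == 0) emptyBatches += 1
    else emptyBatches = 0
    if (emptyBatches >= 3) shutdown = true   // same `var shutdown` as above
  }
}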

ecoe
  • Could you elaborate on this a little further? For example, what if I only wanted to run my Structured Streaming job for 15 minutes total, on a 3-minute trigger, to process new data, and then dynamically shut down after n minutes each day. Is this possible? – thePurplePython Mar 06 '20 at 07:49
  • Above I have a simple boolean state only for example purposes, `var shutdown`, but that can be much more complex logic. For instance, you could add another state: `var startTime = System.currentTimeMillis()`. Then, since you only want absolute time limit of 15 minutes, you might not even need a listener checking when a query completes. Even more simple would be just using the `while` statement above checking if `System` time exceeds 15 min since `startTime`. Listeners are valuable *in this specific case* when you need **graceful** shutdown only *after* a query completes so no data is lost. – ecoe Mar 06 '20 at 15:36
  • If we have multiple streaming queries, can we stop a particular streaming query? I tried to stop one but the session got terminated. – Monu Mar 15 '22 at 16:36

You can probably use this:

import org.apache.spark.sql.streaming.StreamingQuery

def stopStreamQuery(query: StreamingQuery, awaitTerminationTimeMs: Long): Unit = {
  while (query.isActive) {
    try {
      // lastProgress is null until the first micro-batch completes
      if (query.lastProgress.numInputRows < 10) {
        // the last batch consumed (almost) nothing: stop the query
        query.stop()
      } else {
        query.awaitTermination(awaitTerminationTimeMs)
      }
    } catch {
      case _: NullPointerException => println("First batch not completed yet")
    }
    Thread.sleep(500)
  }
}

You can make the numInputRows threshold (hard-coded as 10 above) a configurable variable.
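For completeness, a hypothetical usage sketch; the Kafka servers, topic, output path, and checkpoint location below are placeholders:

// `spark` is an existing SparkSession; servers, topic, and paths are placeholders.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")
  .load()

val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/out")
  .option("checkpointLocation", "/tmp/checkpoints")
  .start()

// stop once a micro-batch consumes fewer than 10 rows
stopStreamQuery(query, awaitTerminationTimeMs = 1000)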