
I'm setting up an Apache Spark long-running streaming job to perform (non-parallelized) streaming using InputDStream.

What I'm trying to achieve is that when a batch on the queue takes too long (based on a user defined timeout), I want to be able to skip the batch and abandon it completely - and continue the rest of execution.

I wasn't able to find a solution to this problem in the Spark API or online. I looked into using StreamingContext's awaitTerminationOrTimeout, but this kills the entire StreamingContext on timeout, whereas all I want to do is skip/kill the current batch.

I also considered using mapWithState, but this doesn't seem to apply to this use case. Finally, I considered setting up a StreamingListener, starting a timer when the batch starts, and having the batch stopped/skipped/killed when it reaches a certain timeout threshold, but there still doesn't seem to be a way to kill the batch.
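For reference, here's a minimal sketch of the listener idea (plain Python; in a real job this class would extend pyspark.streaming.listener.StreamingListener and be registered via ssc.addStreamingListener). Note that the listener API only observes batches, so on timeout the best it can do is flag the batch, not cancel it:

```python
import time

class BatchTimeoutMonitor:
    """Tracks batch wall-clock duration against a user-defined timeout.

    A stand-in for a StreamingListener subclass: onBatchStarted/onBatchCompleted
    mirror the listener callbacks, but take a plain batch identifier here.
    """

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.started = {}  # batch id -> wall-clock start time

    def onBatchStarted(self, batch_id):
        self.started[batch_id] = time.time()

    def onBatchCompleted(self, batch_id):
        start = self.started.pop(batch_id, None)
        if start is None:
            return None  # completion seen without a matching start
        if time.time() - start > self.timeout_secs:
            # Listener callbacks cannot kill the batch; at most we can flag it
            # (e.g. to trigger sc.cancelJobGroup from a watchdog thread).
            return "timed_out"
        return "ok"
```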

Thanks!

  • Curious why mapWithState would not apply here. Like creating a session over the batch? Something like this? – user1452132 Jul 07 '17 at 18:52
  • Well, I'm not working with Pair DStreams. Theoretically if I was, I was also unclear about the API - if I do set a timeout on a key, would this do what I want (skip the job in the batch)? – Adam Taché Jul 07 '17 at 18:58
  • This might be difficult to achieve. The listener would give you the means to monitor the runtime of a job, but I think that canceling it will prove difficult. I looked into the [job scheduler](https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L47), and I can't see an API hook where to dismiss the results of a batch. If you *really* need this, I'm afraid you'll need to patch the code to implement such a deadline-cancellation policy. – maasg Jul 09 '17 at 16:31
  • ps: Interesting question btw. – maasg Jul 09 '17 at 16:31

1 Answer


I've seen some docs from Yelp, but I haven't done this myself.

Using updateStateByKey(update_func) or mapWithState(stateSpec):

  1. Attach timeout when events are first seen and state is initialized
  2. Drop the state if it expires

    def update_function(new_events, current_state):
        if current_state is None:
            current_state = init_state()
            attach_expire_datetime(current_state)
            ......
        if is_expired(current_state):
            return None  # returning None drops the state for this key
        if new_events:
            apply_business_logic(new_events, current_state)
        return current_state
    
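A self-contained version of the expiry logic above (plain Python, using a hypothetical dict-based state and a made-up STATE_TTL; in a real job this would be the function passed to updateStateByKey):

```python
from datetime import datetime, timedelta

STATE_TTL = timedelta(seconds=30)  # user-defined timeout (assumed value)

def update_function(new_events, current_state):
    """Passed to updateStateByKey; returning None drops the key's state."""
    now = datetime.utcnow()
    if current_state is None:
        # First time this key is seen: initialize state with a deadline.
        current_state = {"events": [], "expires_at": now + STATE_TTL}
    if now >= current_state["expires_at"]:
        return None  # deadline passed: drop the state, abandoning this key
    if new_events:
        current_state["events"].extend(new_events)  # stand-in business logic
    return current_state
```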

It also looks like the Structured Streaming watermark drops events when they time out, which might apply if what you need is dropping expired work rather than killing a running batch.

– chenfh5 (edited by Stephen Rauch)