
Two writeStreams to the same database sink are not executing in sequence in Spark Structured Streaming 2.2.1. Please suggest how to make them execute in sequence.

val deleteSink = ds1.writeStream
  .outputMode("update")
  .foreach(mydbsink)
  .start()

val UpsertSink = ds2.writeStream
  .outputMode("update")
  .foreach(mydbsink)
  .start()

deleteSink.awaitTermination()
UpsertSink.awaitTermination()

Using the above code, deleteSink is executed after UpsertSink.


1 Answer


If you want to have two streams running in parallel, you have to use

sparkSession.streams.awaitAnyTermination()

instead of

deleteSink.awaitTermination()
UpsertSink.awaitTermination()

In your case, UpsertSink.awaitTermination() will never be reached until deleteSink is stopped or an exception is thrown, as the Scaladoc says:

Waits for the termination of this query, either by query.stop() or by an exception. If the query has terminated with an exception, then the exception will be thrown. If the query has terminated, then all subsequent calls to this method will either return immediately (if the query was terminated by stop()), or throw the exception immediately (if the query has terminated with exception).
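
For reference, a minimal sketch of the parallel pattern (assuming a SparkSession named spark and a user-defined ForeachWriter called mydbsink, as in the question):

// Both queries are started up front; start() does not block.
val deleteSink = ds1.writeStream
  .outputMode("update")
  .foreach(mydbsink)
  .start()

val upsertSink = ds2.writeStream
  .outputMode("update")
  .foreach(mydbsink)
  .start()

// Blocks until any active query terminates (via stop() or an exception),
// so both queries keep running concurrently until then.
spark.streams.awaitAnyTermination()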

Shikkou
  • My problem is that two write streams to the same dbsink are not always sequential; sometimes I see the upsert sink getting executed before the delete sink. – Shiva Kumar M V Jun 22 '18 at 13:43
  • Well, in the context of streaming, it doesn't really make sense to want a sequence, because after a while you can't determine which query will be executed next. Maybe you should think about changing the way you approach this problem. For example, if the two datasets have the same schema, you could do ds1.union(ds2) and modify mydbsink to check a certain value (for example, add an Action column beforehand that can have two values: delete or upsert). Then you can group by, sort by key, and use the modified dbsink (see the union sketch after these comments). Maybe you have greater control like this. – Shikkou Jun 26 '18 at 10:32
  • Yes, I used a similar approach to fix the problem. – Shiva Kumar M V Jun 27 '18 at 12:53
  • Is there any way to make write streams always sequential if one write stream's output is the input to another write stream? – Shiva Kumar M V Jun 27 '18 at 13:07
  • If you know the upper limit of the processing time of a streaming micro-batch, you could set the trigger for each stream based on that. For example, if you know the maximum processing time of the first stream is about 1 minute, you can set both triggers to trigger(Trigger.ProcessingTime("2 minutes")), with the second stream starting about 1 minute after the first (see the trigger sketch after these comments). That way there is a 1:1 relationship between micro-batches, and if you get the timing right they should be "sequential". – Shikkou Jun 29 '18 at 08:26
  • But of course, this is not 100% guaranteed, as a lot can happen (no resources available, network connectivity issues, etc.). But it's probably as close as you can get. If any of this has helped, let me know so I can update my answer. – Shikkou Jun 29 '18 at 08:28
  • Hello. I see you accepted my answer. Thank you, but please specify what info solved your issue so I can update the answer, because I'm pretty sure it wasn't the original that did the trick. – Shikkou Jul 12 '18 at 08:09
  • Hi Alex, we used only one sink instead of two different sinks for delete and upsert. I repartitioned the records based on a unique key and used a mutable set inside Slick to check and delete records from the DB when the set did not contain the key, irrespective of the action indicator; based on the action indicator, I either upserted or deleted records from the PostgreSQL DB. – Shiva Kumar M V Dec 04 '18 at 12:59
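
A minimal sketch of the union approach suggested in the comments, assuming both datasets share the same schema; the "action" column name and the per-row handling inside mydbsink are illustrative, not part of the original code:

import org.apache.spark.sql.functions.lit

// Tag each stream with an action indicator, then union them so a single
// query (and a single sink) sees both kinds of records.
val deletes = ds1.withColumn("action", lit("delete"))
val upserts = ds2.withColumn("action", lit("upsert"))
val combined = deletes.union(upserts)

// mydbsink is assumed to inspect the "action" column of each row and decide
// whether to delete or upsert against the database.
val query = combined.writeStream
  .outputMode("update")
  .foreach(mydbsink)
  .start()

query.awaitTermination()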
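
And a sketch of the trigger-staggering idea, with hypothetical intervals; the offset only makes sequential micro-batches likely, not guaranteed:

import org.apache.spark.sql.streaming.Trigger

// Both queries fire every 2 minutes; the second is started roughly one
// minute later so their micro-batches are offset in time.
val deleteSink = ds1.writeStream
  .outputMode("update")
  .trigger(Trigger.ProcessingTime("2 minutes"))
  .foreach(mydbsink)
  .start()

Thread.sleep(60 * 1000) // hypothetical offset; tune to the real batch duration

val upsertSink = ds2.writeStream
  .outputMode("update")
  .trigger(Trigger.ProcessingTime("2 minutes"))
  .foreach(mydbsink)
  .start()

spark.streams.awaitAnyTermination()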