
I'm using Flink with a Kinesis source and event-time keyed windows. The application listens to a live stream of data, windows it by event time, and processes each keyed stream. I have another use case where I also need to support backfill of older data for certain key streams (these will be new key streams with event time < watermark).
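For context, here is a minimal sketch of the live pipeline. The Event shape, the 30-second bounded out-of-orderness, the 5-minute windows, and the sum aggregation are all hypothetical stand-ins; the Kinesis connector and watermark APIs are standard Flink.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class LivePipeline {

    // Hypothetical event shape, encoded as "keyId,epochMillis,value".
    public static class Event {
        public String keyId;
        public long timestamp;
        public double value;

        public static Event parse(String line) {
            String[] parts = line.split(",");
            Event e = new Event();
            e.keyId = parts[0];
            e.timestamp = Long.parseLong(parts[1]);
            e.value = Double.parseDouble(parts[2]);
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.put(ConsumerConfigConstants.AWS_REGION, "us-east-1");

        env.addSource(new FlinkKinesisConsumer<>(
                "live-stream", new SimpleStringSchema(), props))
            .map(Event::parse)
            // One application-wide watermark: this is what makes backfill
            // keys with older event times count as late.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((e, ts) -> e.timestamp))
            .keyBy(e -> e.keyId)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .reduce((a, b) -> { a.value += b.value; return a; })  // placeholder aggregation
            .print();

        env.execute("live pipeline");
    }
}
```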

Given that I'm using watermarks, this poses a problem, since Flink doesn't support per-key watermarks. Any keyed stream being backfilled will end up being ignored, since its event times will be < the application watermark maintained by the live stream.

I have gone through other similar questions but wasn't able to find a workable approach. Here are the approaches I'm considering, each with open questions.

Possible Approach - 1

(i) Maintain a copy of the application specifically for backfill purposes. The backfill job will happen rarely (~a few times a month). The stream of data sent to the application copy will carry indicator records marking the start and stop of a backfill. Using those, I plan to start / reset the watermark. Open question: is it possible to reset the watermark using an indicator from the stream? I understand this is not best practice, but I can't think of an alternative solution.

Follow-up to: Clear Flink watermark state in DataStream (no definitive solution provided).
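To make the open question concrete, here is a sketch of what a marker-driven watermark generator might look like (MarkedEvent and its startMarker flag are hypothetical; WatermarkGenerator, WatermarkOutput, and Watermark are the standard Flink interfaces). The caveat in the class comment is the crux: Flink's runtime treats watermarks as monotonically non-decreasing, so an operator that has already seen a higher watermark will ignore a lower one.

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

// Hypothetical event shape carrying an in-stream "start of backfill" flag.
class MarkedEvent {
    String keyId;
    long timestamp;
    boolean startMarker;
}

/**
 * Sketch: a marker-driven watermark generator. On a start marker it rewinds
 * its internal clock to the marker's timestamp. Caveat: Flink treats
 * watermarks as monotonically non-decreasing, so downstream operators that
 * have already seen a higher watermark will ignore the lower one; this only
 * behaves as intended in a fresh, dedicated backfill job.
 */
public class MarkerDrivenWatermarks implements WatermarkGenerator<MarkedEvent> {

    private long maxTimestamp = Long.MIN_VALUE;

    @Override
    public void onEvent(MarkedEvent event, long eventTimestamp, WatermarkOutput output) {
        if (event.startMarker) {
            maxTimestamp = event.timestamp;   // attempted "reset"
        } else {
            maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        }
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        if (maxTimestamp != Long.MIN_VALUE) {
            output.emitWatermark(new Watermark(maxTimestamp - 1));
        }
    }
}
```

If usable at all, it would be wired in with WatermarkStrategy.forGenerator(ctx -> new MarkerDrivenWatermarks()).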

Possible Approach - 2

Run parallel instances for each key, since it's possible to have a different watermark per task. -> Not going with this, since I'll have > 5k keyed streams.

Let me know if any other details are needed.

jt97
  • How frequently does the need to do backfill arise, and for how many distinct keys? Is this an occasional burst of late data for one key, or ... ? – David Anderson Nov 01 '21 at 20:58
  • The backfill process is on-demand. We are expecting to run it a bit more frequently in the early stages (~4-5 months); it will drop to about 1-2 runs a month later. We have about 10k keys. – jt97 Nov 02 '21 at 04:08
  • We have about 10k keys. Basically, the Flink application runs an algorithm per key. If we want to test the algorithm with different parameters, our plan is to change the algo params and backfill the data for the old key by passing a new version v2 (Flink is doing keyBy per keyId + version — see the sketch after these comments). The new algo will only come into effect for the new keys, while the data for old keys remains the same. Thus the backfill is an occasional burst of historical data for existing keys, but with a new version, so Flink will treat them as new keys. – jt97 Nov 02 '21 at 04:19
  • Then I suggest you run the backfill with the DataStream API in BATCH execution mode. In BATCH mode, watermarks and lateness aren't relevant. – David Anderson Nov 02 '21 at 08:32
  • The internal logic of the application, however, depends on watermarks, since it performs windowing operations based on event time. Hence I still need the input stream to have watermarks. – jt97 Nov 02 '21 at 09:03
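To make the versioned-key scheme from the comments concrete, a small sketch (VersionedEvent, the version tag, and the "#" separator are hypothetical):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class VersionedKeying {

    // Hypothetical event carrying a key plus an algorithm-version tag.
    public static class VersionedEvent {
        public String keyId;
        public String version;   // "v1" live, "v2" backfill, ...
        public long timestamp;
        public double value;
    }

    // Keying on keyId + version makes each backfilled version an
    // independent keyed stream, with its own windows and state.
    public static SingleOutputStreamOperator<VersionedEvent> windowPerKeyVersion(
            DataStream<VersionedEvent> events) {
        return events
            .keyBy(e -> e.keyId + "#" + e.version)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .reduce((a, b) -> { a.value += b.value; return a; });
    }
}
```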

1 Answer


You can address this by running the backfill jobs in BATCH execution mode. When the DataStream API operates in batch mode, the input is bounded (finite) and known in advance. This allows Flink to sort the input by key and by timestamp, and processing then proceeds correctly according to event time, with no concern for watermarks or late events.
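For illustration, here is a minimal sketch of a backfill job in BATCH mode. The Event shape mirrors the live job and is hypothetical, and the tiny fromElements input stands in for a real bounded read of the historical data; setRuntimeExecutionMode and RuntimeExecutionMode.BATCH are the standard Flink APIs.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class BackfillJob {

    // Hypothetical event shape, same as the live job's.
    public static class Event {
        public String keyId;
        public long timestamp;
        public double value;

        public Event() {}
        public Event(String keyId, long timestamp, double value) {
            this.keyId = keyId; this.timestamp = timestamp; this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // BATCH mode requires a bounded source. Flink then sorts the input
        // by key and processes each key's data in timestamp order, so
        // event-time windows fire correctly with no watermark concerns.
        env.setRuntimeExecutionMode(RuntimeExecutionMode.BATCH);

        // Bounded toy input standing in for the real historical data.
        env.fromElements(
                new Event("key-1#v2", 1_000L, 1.0),
                new Event("key-1#v2", 2_000L, 2.0),
                new Event("key-2#v2", 1_500L, 3.0))
            // Record timestamps are still needed for event-time windowing;
            // the watermarking itself is irrelevant in BATCH mode.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forMonotonousTimestamps()
                    .withTimestampAssigner((e, ts) -> e.timestamp))
            .keyBy(e -> e.keyId)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .reduce((a, b) -> { a.value += b.value; return a; })
            .print();

        env.execute("backfill in BATCH mode");
    }
}
```

Equivalently, the mode can be selected at submission time with flink run -Dexecution.runtime-mode=BATCH, so the same application code can serve both the live and backfill jobs.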

David Anderson
  • Thanks for the quick reply. Just checked the official documentation. This looks like the way forward. Will explore this, thanks. – jt97 Nov 02 '21 at 09:19
  • Ran into some issues when trying to implement this. Can you help out here: https://stackoverflow.com/questions/70137863/apache-flink-batch-mode-failing-for-datastream-apis-with-exception-illegalst – jt97 Nov 27 '21 at 19:02