I have two streams, stream A and stream B. Both contain the same type of event, which has an ID and a timestamp. For now, all I want the Flink job to do is join the events that have the same ID within a 1-minute window. A watermark is emitted on every event.

sourceA = initialSourceA.map(parseToEvent)
sourceB = initialSourceB.map(parseToEvent)

streamA = sourceA
                .assignTimestampsAndWatermarks(CustomWatermarkStrategy())
                .keyBy(Event.Key)

streamB = sourceB
                .assignTimestampsAndWatermarks(CustomWatermarkStrategy())
                .keyBy(Event.Key)


streamA
                .join(streamB)
                .where(Event.Key)
                .equalTo(Event.Key)
                .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.MINUTES)))
                .apply(giveMePairOfEvents)
                .print()

Inside my test I try to send the following:

sourceA.send(Event(ID_1, 0 seconds))
sourceB.send(Event(ID_1, 0 seconds))

//to increase the watermark
sourceA.send(Event(ID_1, 62 seconds))
sourceB.send(Event(ID_1, 62 seconds)) 

For parallelism = 1, I can see the events from time 0 getting joined together.

However, for parallelism = 2 the print does not show anything being joined. To narrow down the problem, I printed the events after the keyBy of each stream, and I can see that they all end up on the same instance. Placing the print right after the watermarking shows, as expected, that at that point the events are still on different instances.

This leads me to believe that I am doing something wrong with the watermarking, since for a parallelism higher than 1 the watermark does not seem to advance. So here are a couple of questions I asked myself:

  • Is it possible that each event has a separate watermark generator and I have to advance each of them specifically?
  • Should I run keyBy first and then assign watermarks, so that the events from each stream use the same watermark generator?

Sending another set of events as follows:

sourceA.send(Event(ID_1, 0 seconds))
sourceB.send(Event(ID_1, 0 seconds))

//to increase the watermark
sourceA.send(Event(ID_1, 62 seconds))
sourceB.send(Event(ID_1, 62 seconds)) 

sourceA.send(Event(ID_1, 122 seconds))
sourceB.send(Event(ID_1, 122 seconds))

This ended up emitting the joined first pair of events. Further inspection showed that the third set of events went through the same watermark generator as the first set, which the second set did not use. I am not clear on why this happens. How can I assign and advance watermarks correctly when using a join function in Flink?

EDIT 1:

The custom watermark generator:

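import org.apache.flink.api.common.eventtime.Watermark
import org.apache.flink.api.common.eventtime.WatermarkGenerator
import org.apache.flink.api.common.eventtime.WatermarkOutput

// Emits a watermark on every event, based on the highest timestamp seen so far.
class CustomWaterMarkGenerator(
        private val maxOutOfOrderness: Long,
        private var currentMaxTimeStamp: Long = 0,
)
    : WatermarkGenerator<Event> {
    override fun onEvent(event: Event, eventTimestamp: Long, output: WatermarkOutput) {
        currentMaxTimeStamp = currentMaxTimeStamp.coerceAtLeast(eventTimestamp)
        output.emitWatermark(Watermark(currentMaxTimeStamp - maxOutOfOrderness - 1))
    }

    // Intentionally empty: watermarks are emitted per event, not periodically.
    override fun onPeriodicEmit(output: WatermarkOutput) {
    }
}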

The watermark strategy:


class CustomWatermarkStrategy(
): WatermarkStrategy<Event> {
    override fun createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context?): WatermarkGenerator<Event> {
        return CustomWaterMarkGenerator(0)
    }

    override fun createTimestampAssigner(context: TimestampAssignerSupplier.Context?): TimestampAssigner<Event> {
        return TimestampAssigner{ event: Event, _: Long->
            event.timestamp
        }
    }

}

Custom source:

The source function is currently an RSocket connection to a mock stream, to which I can send events through mockStream.send(event). The first thing I do with the events is parse them with a map function (from string into my event type), and then I assign my watermarks etc.

Akula

1 Answer

  • Each parallel instance of the watermark generator will operate independently, based solely on the events it observes. Doing the watermarking immediately after the sources makes sense (although even better, in general, is to do watermarking directly in the sources).

  • An operator with multiple input channels (such as the keyed windowed join in your application) sets its current watermark to the minimum of the watermarks it has received from its active input channels. This has the effect that any idle source instances will cause the watermarks to stall in downstream tasks -- unless those sources explicitly mark themselves as idle. (And FLINK-18934 meant that prior to Flink 1.14 idleness propagation didn't work correctly with joins.) An idle source is a likely suspect in your situation.

  • One strategy for debugging this sort of problem is to bring up the Flink WebUI and observe the behavior of the current watermark in all of the tasks.
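
The minimum rule from the second point can be illustrated with a small plain-Kotlin model. This is only a sketch, not Flink's actual implementation: the class name `InputChannelWatermarks` and its methods are hypothetical, and it models a task with just two input channels, one of which never receives an event.

```kotlin
// Simplified model (an assumption, not Flink code): a task remembers the last
// watermark seen on each input channel and advances its own watermark to the
// minimum over the channels that are not marked idle.
class InputChannelWatermarks(channelCount: Int) {
    private val watermarks = LongArray(channelCount) { Long.MIN_VALUE }
    private val idle = BooleanArray(channelCount)
    private var last = Long.MIN_VALUE

    // A watermark arriving on a channel makes that channel active again.
    fun onWatermark(channel: Int, watermark: Long) {
        idle[channel] = false
        watermarks[channel] = maxOf(watermarks[channel], watermark)
    }

    fun markIdle(channel: Int) {
        idle[channel] = true
    }

    // Minimum over the active channels; the result never moves backwards.
    fun current(): Long {
        val min = watermarks.indices.filter { !idle[it] }.minOfOrNull { watermarks[it] }
        if (min != null && min > last) last = min
        return last
    }
}

fun main() {
    val task = InputChannelWatermarks(channelCount = 2)

    // Channel 0 saw the 62-second event: 62000 - 0 - 1 = 61999.
    task.onWatermark(channel = 0, watermark = 61_999)
    // Channel 1 has seen nothing, so the task's watermark is still the minimum.
    println(task.current() == Long.MIN_VALUE) // prints "true"

    // Once the silent channel is marked idle, it no longer holds things back.
    task.markIdle(channel = 1)
    println(task.current()) // prints "61999"
}
```

In this model the one-minute window ending at 60 seconds can only fire after the second `println`, because until then the silent channel pins the task's watermark at its initial value.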

To get more help, please share the rest of the application, or at least the custom source and watermark strategy.
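
If an idle watermark generator instance does turn out to be the cause, one option is to let Flink mark it as idle through `WatermarkStrategy.withIdleness`. The following is a minimal sketch, not a drop-in replacement for the custom strategy: the `Event` data class is a hypothetical stand-in for the question's event type, the 5-second timeout is an arbitrary value, and the built-in bounded-out-of-orderness generator emits watermarks periodically instead of on every event. As noted above, idleness only propagates correctly through joins from Flink 1.14 onwards.

```kotlin
import java.time.Duration
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner
import org.apache.flink.api.common.eventtime.WatermarkStrategy

// Hypothetical stand-in for the question's event type.
data class Event(val id: String, val timestamp: Long)

// Built-in bounded-out-of-orderness watermarks (zero lateness allowed here)
// combined with idleness detection. The 5-second timeout is illustrative only.
fun idleAwareStrategy(): WatermarkStrategy<Event> =
    WatermarkStrategy
        .forBoundedOutOfOrderness<Event>(Duration.ZERO)
        // The explicit SAM constructor avoids overload ambiguity in Kotlin.
        .withTimestampAssigner(SerializableTimestampAssigner<Event> { event, _ -> event.timestamp })
        .withIdleness(Duration.ofSeconds(5))
```

Such a strategy would be passed to `assignTimestampsAndWatermarks` in place of `CustomWatermarkStrategy()` on both streams.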

David Anderson
  • Added an Edit with the watermark strategy and source. – Akula Nov 11 '21 at 09:02
  • So the job graph is source -> map -> watermarks -> keyed window join -> print. What is the parallelism of each of those 5 pipeline stages? – David Anderson Nov 11 '21 at 09:40
  • I only set the parallelism on the environment level to 2. – Akula Nov 11 '21 at 09:50
  • Also, as a note: if I keep the parallelism at the default value, the second set of events (sending 3 times) doesn't join anything either. To me it seems that each set of events lives on some instance(?) and waits for the next set of events to push the watermark on that instance. – Akula Nov 11 '21 at 10:01
  • I believe the problem is this: you have two instances of each watermark generator, but only one key, so one instance processes no events, and its watermark cannot progress. This holds back the join. – David Anderson Nov 11 '21 at 15:21
  • Then the solution would be for the other instance to also process events, which is what happens when we send the third set of events and it gets processed on the first instance? – Akula Nov 11 '21 at 16:03
  • As I see it right now, once elements get joined they get pushed onto a task manager. With a parallelism of two, the first set gets joined and pushed to instance 1 (for example). The second set gets joined and pushed to instance 2, so it does not push the watermark on instance 1. The third set gets joined and pushed to instance 1, which pushes the watermark there, and the window releases the first set. This might not be how things work internally in Flink at all, but it is what seems to be happening. I am not sure whether this is normal behaviour or not. – Akula Nov 11 '21 at 16:21
  • Tbh, I haven’t been able to fully understand your description of what is happening. A reproducible example is the only way forward, I fear. – David Anderson Nov 11 '21 at 21:11