I have two streams, stream A and stream B. Both streams contain the same type of event, which has an ID and a timestamp. For now, all I want the Flink job to do is join the events that have the same ID within a window of 1 minute. A watermark is emitted on every event.
sourceA = initialSourceA.map(parseToEvent)
sourceB = initialSourceB.map(parseToEvent)

streamA = sourceA
    .assignTimestampsAndWatermarks(CustomWatermarkStrategy())
    .keyBy(Event.Key)

streamB = sourceB
    .assignTimestampsAndWatermarks(CustomWatermarkStrategy())
    .keyBy(Event.Key)

streamA
    .join(streamB)
    .where(Event.Key)
    .equalTo(Event.Key)
    .window(TumblingEventTimeWindows.of(Time.of(1, TimeUnit.MINUTES)))
    .apply(giveMePairOfEvents)
    .print()
Inside my test I try to send the following:
sourceA.send(Event(ID_1, 0 seconds))
sourceB.send(Event(ID_1, 0 seconds))
//to increase the watermark
sourceA.send(Event(ID_1, 62 seconds))
sourceB.send(Event(ID_1, 62 seconds))
For parallelism = 1, I can see the events from time 0 getting joined together.
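For reference, the arithmetic behind this expectation can be sketched as follows. This is a minimal sketch, not Flink code: it assumes timestamps are in milliseconds, that the tumbling windows have no offset, and that a window fires once the watermark reaches its last timestamp (window end minus 1). The helper names `windowStart`, `windowMaxTimestamp` and `watermarkAfter` are made up for illustration.

```kotlin
// Size of the tumbling window used in the job: 1 minute, in milliseconds.
const val WINDOW_SIZE_MS = 60_000L

// Start of the tumbling window containing the timestamp
// (assumes non-negative timestamps and no window offset).
fun windowStart(timestampMs: Long): Long =
    timestampMs - (timestampMs % WINDOW_SIZE_MS)

// Last timestamp that still belongs to the window. An event-time window
// is assumed to fire once the watermark is at least this value.
fun windowMaxTimestamp(timestampMs: Long): Long =
    windowStart(timestampMs) + WINDOW_SIZE_MS - 1

// Watermark produced by the generator from this question after the
// largest timestamp it has seen so far is maxSeenMs.
fun watermarkAfter(maxSeenMs: Long, maxOutOfOrdernessMs: Long = 0L): Long =
    maxSeenMs - maxOutOfOrdernessMs - 1
```

Under these assumptions, `windowMaxTimestamp(0)` is 59999 and `watermarkAfter(62_000)` is 61999, so the 62-second event should be enough to close the first window when only one generator instance is involved.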
However, for parallelism = 2 the print does not display anything getting joined. To figure out the problem, I tried printing the events after the keyBy of each stream, and I can see they all end up on the same instance. Placing the print right after the watermark assignment shows, as expected, that the events are still on different instances at that point.
This leads me to believe that I am doing something wrong when it comes to watermarking, since for a parallelism higher than 1 the watermark does not seem to advance. So here are a couple of questions I asked myself:
- Is it possible that each event has a separate watermark generator and I have to advance each one specifically?
- Should I run keyBy first and then assign watermarks, so that the events from each stream use the same watermark generator?
Sending another set of events as follows:
sourceA.send(Event(ID_1, 0 seconds))
sourceB.send(Event(ID_1, 0 seconds))
//to increase the watermark
sourceA.send(Event(ID_1, 62 seconds))
sourceB.send(Event(ID_1, 62 seconds))
sourceA.send(Event(ID_1, 122 seconds))
sourceB.send(Event(ID_1, 122 seconds))
This ended up emitting the joined first pair of events. Further inspection showed that the third set of events went to the same watermark generator that the second set did not use. I am not clear on why this happens. How can I assign and advance watermarks correctly when using a join function in Flink?
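The behaviour above would be consistent with the following simplified model. It is only a sketch under assumptions I have not confirmed: that events from the single source are handed to the parallel watermark generators in round-robin order, and that the operator after the keyBy uses the minimum of the watermarks it has received from all upstream instances. `GeneratorModel` and `downstreamWatermarks` are hypothetical names, and the model covers one input stream only.

```kotlin
// Simplified model, not Flink code: one generator per parallel instance,
// using the same formula as CustomWaterMarkGenerator with maxOutOfOrderness = 0.
class GeneratorModel(private val maxOutOfOrdernessMs: Long = 0L) {
    // Long.MIN_VALUE stands for "no watermark emitted yet".
    var watermark: Long = Long.MIN_VALUE
        private set
    private var currentMaxTimestamp: Long = 0L

    fun onEvent(timestampMs: Long) {
        currentMaxTimestamp = maxOf(currentMaxTimestamp, timestampMs)
        watermark = currentMaxTimestamp - maxOutOfOrdernessMs - 1
    }
}

// Sends the timestamps round-robin to `parallelism` generators (assumption)
// and returns the minimum watermark across all generators after each event,
// which is what the downstream window operator is assumed to see.
fun downstreamWatermarks(timestampsMs: List<Long>, parallelism: Int): List<Long> {
    val generators = List(parallelism) { GeneratorModel() }
    return timestampsMs.mapIndexed { index, ts ->
        generators[index % parallelism].onEvent(ts)
        generators.minOf { it.watermark }
    }
}
```

In this model, `downstreamWatermarks(listOf(0L, 62_000L), 2)` never gets past -1, because the 62-second event only advances the second generator while the first one stays at -1. Adding the 122-second event advances the first generator too, the minimum becomes 61999, and the window ending at 60000 can fire, which matches what I see in the test.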
EDIT 1:
The custom watermark generator:
class CustomWaterMarkGenerator(
    private val maxOutOfOrderness: Long,
    private var currentMaxTimeStamp: Long = 0,
) : WatermarkGenerator<EventType> {

    override fun onEvent(event: EventType, eventTimestamp: Long, output: WatermarkOutput) {
        // Track the largest timestamp seen so far and emit a watermark on every event.
        currentMaxTimeStamp = currentMaxTimeStamp.coerceAtLeast(eventTimestamp)
        output.emitWatermark(Watermark(currentMaxTimeStamp - maxOutOfOrderness - 1))
    }

    override fun onPeriodicEmit(output: WatermarkOutput?) {
        // Intentionally empty: watermarks are only emitted in onEvent.
    }
}
The watermark strategy:
class CustomWatermarkStrategy : WatermarkStrategy<Event> {

    override fun createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context?): WatermarkGenerator<Event> {
        return CustomWaterMarkGenerator(0)
    }

    override fun createTimestampAssigner(context: TimestampAssignerSupplier.Context?): TimestampAssigner<Event> {
        return TimestampAssigner { event: Event, _: Long ->
            event.timestamp
        }
    }
}
Custom source:
The SourceFunction is currently an RSocket connection that connects to a mock stream, to which I can send events through mockStream.send(event). The first thing I do with the events is parse them using a map function (from string into my event type), and then I assign timestamps and watermarks as shown above.