I'm having a hard time understanding how Beam's windowing is supposed to work with respect to discarding late data.
Here is my pipeline:
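from apache_beam import GroupByKey, Map, WindowInto
from apache_beam.io import ReadFromPubSub
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)
from apache_beam.transforms.window import FixedWindows

(
    pipeline
    | ReadFromPubSub()
    | WindowInto(
        FixedWindows(60),  # 60-second fixed windows
        trigger=AfterWatermark(
            early=AfterCount(1000),
            late=AfterProcessingTime(60),
        ),
        accumulation_mode=AccumulationMode.DISCARDING,
        allowed_lateness=10 * 60,  # 10 minutes, in seconds
    )
    | GroupByKey()
    | SomeTransform()
    | Map(print)
)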
My understanding is that elements read from Pub/Sub are assigned the timestamp at which they were published to a subscription.
When I start my pipeline to consume data from Pub/Sub, which also includes old data, I expect only data from the last 10 minutes (as set by the allowed_lateness parameter of WindowInto) to be printed. Instead, all of the data from the subscription is printed.
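To make that expectation concrete, here is a rough plain-Python sketch of the drop rule as I read it from the docs: an element is discarded once the watermark has passed the end of its window plus allowed_lateness. This is not Beam API; `is_dropped` and the integer-second timestamps are illustrative assumptions, and the exact boundary handling in Beam may differ. My expectation implicitly assumes the watermark is already close to the current time when the pipeline starts, which may not be what actually happens.

```python
def is_dropped(event_ts, watermark, window_size=60, allowed_lateness=10 * 60):
    """Hypothetical model of late-data dropping; all times are in seconds."""
    # End of the fixed window that contains event_ts.
    window_end = (event_ts // window_size + 1) * window_size
    # The element is dropped only if the watermark is already past
    # the window end plus the allowed lateness.
    return window_end + allowed_lateness <= watermark


# Watermark near "now" (t=3600): an hour-old element would be dropped.
print(is_dropped(event_ts=0, watermark=3600))
# Same watermark: a 5-minute-old element is still within allowed lateness.
print(is_dropped(event_ts=3300, watermark=3600))
# Watermark still at the old data itself: nothing counts as late.
print(is_dropped(event_ts=0, watermark=0))
```

Under this model, whether old messages are dropped depends entirely on where the watermark sits when they arrive, not on the wall-clock time at which the pipeline starts.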
Am I missing something, or do I misunderstand windowing?
I'm using Python and Beam 2.42.0.