I'm trying to aggregate data from a Google Pub/Sub subscription in 1-hour windows using the Apache Beam Python SDK. I've configured my pipeline's windowing like so:
    beam.WindowInto(
        window.FixedWindows(60 * 60, 0),
        trigger=AfterWatermark(
            early=AfterCount(1),
            late=AfterCount(1)),
        accumulation_mode=AccumulationMode.ACCUMULATING)
My issue is that I expect to see about 60 messages per window, but I'm seeing 45-46 at most, and usually fewer.
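For reference, this is roughly how I arrive at the expected count. It's a plain-Python sketch, not Beam code: it buckets event timestamps into 1-hour fixed windows with the same size and offset as my `FixedWindows(60 * 60, 0)`. The helper names and the one-message-per-minute input are made up for illustration.

```python
from collections import Counter

WINDOW_SIZE = 60 * 60  # seconds, same size as FixedWindows(60 * 60, 0)
OFFSET = 0             # same offset as in the pipeline


def window_start(timestamp, size=WINDOW_SIZE, offset=OFFSET):
    """Return the start of the fixed window containing `timestamp` (epoch seconds)."""
    return timestamp - (timestamp - offset) % size


def count_per_window(timestamps):
    """Count how many event timestamps fall into each fixed window."""
    return Counter(window_start(ts) for ts in timestamps)


# Hypothetical input: one message per minute for two hours.
timestamps = [i * 60 for i in range(120)]
counts = count_per_window(timestamps)
print(dict(counts))  # {0: 60, 3600: 60}
```

Running the same bucketing over the raw Pub/Sub event timestamps should give the per-window counts I'd expect the pipeline to reach once all panes have fired.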
After some research, I now suspect that Beam might be discarding any data it considers late, even though I've configured a late trigger. The Beam streaming documentation states: "The Beam SDK for Python does not currently support allowed lateness." What is not clear to me is whether this means the SDK doesn't support setting a specific allowed-lateness value, or whether it discards late data completely.
Later edit: It appears that my full data set is indeed present. Still, some clarification of how the Python SDK handles late data would be helpful in setting expectations.
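To make the two interpretations concrete, here is a toy model in plain Python. It is not Beam code and I'm not claiming it matches what the Python SDK does. It assumes an element is dropped once the watermark has passed the end of its window by more than some allowed lateness. The function names and the sample arrivals are hypothetical.

```python
def is_dropped(window_end, watermark, allowed_lateness=0):
    """Toy rule (an assumption): drop an element once the watermark has
    passed the end of its window by more than `allowed_lateness` seconds."""
    return watermark > window_end + allowed_lateness


def surviving(elements, allowed_lateness=0, size=3600):
    """Keep the event timestamps that the toy rule does not drop.

    elements: list of (event_timestamp, watermark_at_arrival) pairs.
    """
    kept = []
    for ts, watermark in elements:
        window_end = ts - ts % size + size  # end of the fixed window holding ts
        if not is_dropped(window_end, watermark, allowed_lateness):
            kept.append(ts)
    return kept


# Hypothetical arrivals, all belonging to the window [0, 3600):
elements = [
    (100, 50),     # on time
    (3500, 3590),  # on time
    (3550, 3700),  # arrives when the watermark is 100 s past the window end
    (3580, 4300),  # arrives when the watermark is 700 s past the window end
]
print(surviving(elements, allowed_lateness=0))    # [100, 3500]
print(surviving(elements, allowed_lateness=600))  # [100, 3500, 3550]
```

If "does not support allowed lateness" means the value is fixed at zero, I'd expect behaviour like the first call; if it only means the value can't be configured but late data is still processed, I'd expect something closer to the second.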