My use case is that I'm trying to aggregate data from a Google Pub/Sub subscription using the Apache Beam Python SDK, with 1-hour windows. I've configured my pipeline windowing like so:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

beam.WindowInto(
    window.FixedWindows(60 * 60, 0),  # 1-hour fixed windows, no offset
    trigger=AfterWatermark(
        early=AfterCount(1),
        late=AfterCount(1)),
    accumulation_mode=AccumulationMode.ACCUMULATING)
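
For reference, here is a minimal sketch of how a transform like this might sit in a streaming pipeline end to end (the subscription path and the per-window count stage are placeholders, not my actual pipeline):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # Pub/Sub reads require streaming mode

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")  # placeholder path
     | "Window" >> beam.WindowInto(
         window.FixedWindows(60 * 60, 0),
         trigger=AfterWatermark(early=AfterCount(1), late=AfterCount(1)),
         accumulation_mode=AccumulationMode.ACCUMULATING)
     | "CountPerWindow" >> beam.CombineGlobally(
         beam.combiners.CountCombineFn()).without_defaults()
     | "Print" >> beam.Map(print))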

My issue is that I should be seeing about 60 messages per window, but I'm only seeing 45-46 at most, and usually fewer.

Some research now leads me to believe that Beam might be discarding any data it considers late, even though I've set up my triggers this way. The Beam streaming documentation mentions that "The Beam SDK for Python does not currently support allowed lateness." What is not clear to me is whether it doesn't support setting a specific allowed-lateness configuration, or whether it discards late data completely.

Later edit: It appears that my full data set is indeed present; however, some clarification regarding the handling of late data in Beam using the Python SDK would be helpful for setting expectations.

  • What do you mean by "if I've set up my triggers this way"? – Rim Oct 15 '19 at 22:16
  • I just mean that I set up the window triggers to fire after each early and late event. Although I'm still not 100% sure how Beam for Python handles data lateness, my issue was different (specified in the answer below). – andreimarinescu Oct 17 '19 at 09:43

1 Answer

So my issue actually was that Pub/Sub sometimes delivers messages wildly out of order. While the general direction is from old to new, if there's a backlog of 2-3 days' worth of data, you can see spreads of 10-48 hours between messages. Once the full buffer is collected, no data is actually discarded.
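
One way to actually see this spread is to log each element's event timestamp next to the window it lands in. Here's a rough sketch of such a check (this DoFn is illustrative, not part of my actual pipeline):

import apache_beam as beam

class LogTimestampFn(beam.DoFn):
    """Illustrative DoFn: prints each element's event timestamp and window bounds."""
    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                win=beam.DoFn.WindowParam):
        print("event time %s falls in window [%s, %s)" % (
            timestamp.to_utc_datetime(),
            win.start.to_utc_datetime(),
            win.end.to_utc_datetime()))
        yield element

# Applied right after the windowing step:
#   ... | beam.WindowInto(...) | beam.ParDo(LogTimestampFn()) | ...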

The issue is alleviated when using DataflowRunner instead of DirectRunner, as throughput is much higher when the pipeline runs on Dataflow workers.
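
For reference, switching runners is just a pipeline-options change (a minimal sketch; the project, region and bucket values below are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: substitute your own project, region and staging bucket.
options = PipelineOptions(
    runner="DataflowRunner",  # use "DirectRunner" for local testing
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    streaming=True,
)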

The behaviour around discarding late data is still undocumented (the documentation only mentions that configuring an allowed-lateness policy is currently unsupported for Python, as of September 2019). Late data does appear to fire the late trigger correctly with the settings above.
