
I'm having a hard time understanding how Beam's windowing is supposed to work with respect to discarding late data.

Here is my pipeline:

```
pipeline
| ReadFromPubSub()
| WindowInto(
    FixedWindows(60),
    trigger=AfterWatermark(
      early=AfterCount(1000),
      late=AfterProcessingTime(60),
    ),
    accumulation_mode=AccumulationMode.DISCARDING,
    allowed_lateness=10*60,
  )
| GroupByKey()
| SomeTransform()
| Map(print)
```

My understanding is that elements from Pub/Sub are assigned the timestamp at which they were published to the subscription.

When I started my pipeline to consume data from the subscription, which also contained old data, I expected that only data from the last 10 minutes (per the `allowed_lateness` parameter of `WindowInto`) would be printed. Instead, all of the data in the subscription was printed.

Am I missing something, or do I misunderstand windowing?

I'm using Python and Beam 2.42.0.

kayteepee

1 Answer


In the above pipeline, there is no late data.

Data is only late relative to the watermark. To compute the watermark, Beam uses the timestamps of the messages being read — in your case, the timestamps assigned to the messages by the `ReadFromPubSub` transform.

Pub/Sub messages have attributes in addition to the payload, and Beam can use one of these attributes to set the timestamp of each message. If you don't specify one, Beam uses the publish time as the timestamp. So in your case you are effectively working in processing time (the timestamp is the moment the message entered Pub/Sub), and therefore it is essentially impossible to have late data: the watermark will always be very close to the timestamps of the messages being processed.

See this question on SO for more details on how the timestamp attribute is used with Pub/Sub (for Python, check the `timestamp_attribute` parameter of `ReadFromPubSub`).

If there is no timestamp attribute in the messages you are reading, but the payload contains a field with the event timestamp, you can use it to produce a timestamped value: parse the payload and then either use a `DoFn` that emits `TimestampedValue`, or apply the `WithTimestamps` transform.

Israel Herraiz
  • Hi, thank you for your answer. However, I did print out the timestamp assigned to the data, and it's the timestamp at which it was published to the Pub/Sub subscription, not when it is read in my pipeline. The data freshness chart also showed that the data is old, consistent with when I pushed it to Pub/Sub. The description of the **timestamp_attribute** parameter also states that the default is the publishing time: ``` timestamp_attribute – Message value to use as element timestamp. If None, uses message publishing time as the timestamp. ``` – kayteepee Dec 20 '22 at 02:11
  • As mentioned in the answer, `allowed_lateness` applies to data beyond the computed (but imperfect) watermark. When you start reading, the system *knows* there is old data in the pipeline, and holds the watermark back accordingly. – robertwb Dec 20 '22 at 20:57
  • @robertwb So when does `allowed_lateness` drop the data? I expected data to be dropped if **processing_time - event_time > allowed_lateness**. Is that correct? What I've observed is not so, as explained in my question above. – kayteepee Dec 21 '22 at 03:00
  • Data is dropped once **watermark - event_time > allowed_lateness**. The watermark is an estimate of the oldest timestamp in the unprocessed data. – robertwb Dec 21 '22 at 20:11
  • As a concrete example, suppose you have a bunch of devices publishing data. At 12:05 the system might estimate that it has all data published up to 12:00, and set that as the watermark. One of your devices is a bit behind and finally publishes its 11:59 info; this is "late" but not dropped, as it's within 10 minutes. Another device has been offline for an hour, finally connects, and pushes its 11:00 info. That's beyond the allowed lateness and is dropped. – robertwb Dec 21 '22 at 20:12
  • I made a mistake in my answer; the timestamp used when no attribute is specified is the publish time, not the clock time. Everything else in the explanation does not change :). – Israel Herraiz Dec 22 '22 at 08:11
  • Thank you both! It's clearer to me now. This means **allowed_lateness** uses the watermark, not the difference between processing time and event time. And the watermark doesn't depend on processing time; it is calculated from the timestamps of the events. In my case, when I started my pipeline, the watermark started out old because the data in my Pub/Sub subscription was old as well, and therefore the old data wasn't dropped. – kayteepee Dec 23 '22 at 14:12
  • (answer edited to replace clock time with publish time) – Israel Herraiz Dec 24 '22 at 10:56
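The drop rule discussed in the comments can be sketched in plain Python. This is only an illustration of the semantics (**watermark - event_time > allowed_lateness**), not Beam internals; all values are Unix seconds:

```
def is_droppably_late(event_time, watermark, allowed_lateness):
    """An element is dropped once the watermark has passed its timestamp
    by more than allowed_lateness."""
    return watermark - event_time > allowed_lateness

# robertwb's example: watermark at 12:00, allowed_lateness of 10 minutes.
watermark = 12 * 3600
allowed_lateness = 10 * 60

# The straggler that published its 11:59 info: late, but kept.
print(is_droppably_late(11 * 3600 + 59 * 60, watermark, allowed_lateness))  # False

# The device that was offline and finally pushed its 11:00 info: dropped.
print(is_droppably_late(11 * 3600, watermark, allowed_lateness))  # True
```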