I want to associate each record with a timestamp that is already present in the record itself. According to the Beam docs, it is enough to do this with a transform that emits beam.window.TimestampedValue(). Unfortunately, the pipeline then does not produce any aggregated data.

My Python code with the Beam pipeline:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=options)

def encode_byte_string(element):
    return str(element).encode('utf-8')

class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        # unix_timestamp is the last CSV column (index 7)
        unix_timestamp = element[7]
        yield beam.window.TimestampedValue(element, int(unix_timestamp))

def calculateProfit(elements):
    profit = elements[6]
    elements.append(str(profit))
    return elements

def checkit(row):
    r = row.decode('utf-8').split(',')
    print(r)
    return r

# input_subscription and output_topic are defined elsewhere
pubsub_data = (
    p
    | 'Read from pub sub' >> beam.io.ReadFromPubSub(subscription=input_subscription)
    | 'Remove extra chars' >> beam.Map(lambda data: data.strip())
    | 'Split Row' >> beam.Map(checkit)
    | 'Filter By Country' >> beam.Filter(lambda elements: elements[1] in ("Mumbai", "Bangalore"))
    | 'Create Profit Column' >> beam.Map(calculateProfit)
    | 'Apply custom timestamp' >> beam.ParDo(AddTimestampDoFn())
    | 'Form Key Value pair' >> beam.Map(lambda elements: (elements[0], int(elements[8])))
    | 'Window' >> beam.WindowInto(window.FixedWindows(5))
    | 'Sum values' >> beam.CombinePerKey(sum)
    | 'Encode to byte string' >> beam.Map(encode_byte_string)
    | 'Write to pub sub' >> beam.io.WriteToPubSub(output_topic)
)

result = p.run()
result.wait_until_finish()

A sample of the data (read from a CSV file; the unix_timestamp column holds the actual wall-clock time +/- 30 seconds):

Store_id,Store_location,Product_id,Product_category,number_of_pieces_sold,buy_rate,sell_price,unix_timestamp
STR_2,Mumbai,PR_265,Cosmetics,8,39,66,1659964835
STR_2,Mumbai,PR_347,Cosmetics,6,13,56,1659964836
STR_2,Mumbai,PR_566,Electronics,4,47,70,1659964837
STR_1,Bangalore,PR_314,Groceries,8,31,75,1659964838
STR_2,Mumbai,PR_854,Groceries,3,28,62,1659964839
STR_2,Mumbai,PR_234,Education,8,15,64,1659964840
STR_1,Bangalore,PR_854,Groceries,6,33,70,1659964841
STR_2,Mumbai,PR_243,Groceries,1,39,69,1659964842
STR_1,Bangalore,PR_124,Education,4,27,58,1659964843
STR_1,Bangalore,PR_265,Groceries,6,25,43,1659964844
STR_2,Mumbai,PR_111,Electronics,8,25,48,1659964845
STR_1,Bangalore,PR_101,Kitchen,6,11,60,1659964846
STR_1,Bangalore,PR_124,Groceries,10,18,20,1659964847
STR_2,Mumbai,PR_265,Groceries,7,43,56,1659964848

When I remove the timestamp-association line ('Apply custom timestamp' >> beam.ParDo(AddTimestampDoFn())) from the pipeline, 'Window' >> beam.WindowInto(window.FixedWindows(5)) works fine: it prints aggregated data every five seconds (I guess because the Pub/Sub-generated timestamps are used then).

What can cause the pipeline to produce no results when the timestamp is taken from the record rather than from the Pub/Sub metadata?


2 Answers


Using the CSV file above, the pipeline aggregates your data successfully in five-second windows when the 'Apply custom timestamp' step is present:

('STR_2', 254)
('STR_2', 133)
('STR_2', 104)
('STR_1', 75)
('STR_1', 171)
('STR_1', 80)
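
For what it's worth, here is a minimal batch sketch of how such a reproduction can look, assuming the sample above is saved as sales.csv (the file name is my invention). Summing sell_price (index 6) directly matches the profit column, since calculateProfit just copies sell_price:

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:  # batch mode, no streaming flag
    (p
     | beam.io.ReadFromText('sales.csv', skip_header_lines=1)
     | beam.Map(lambda line: line.strip().split(','))
     | beam.Filter(lambda e: e[1] in ('Mumbai', 'Bangalore'))
     # attach the event time from the unix_timestamp column (index 7)
     | beam.Map(lambda e: beam.window.TimestampedValue(e, int(e[7])))
     | beam.Map(lambda e: (e[0], int(e[6])))  # (store_id, sell_price)
     | beam.WindowInto(window.FixedWindows(5))
     | beam.CombinePerKey(sum)
     | beam.Map(print))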

When it's not present, the pipeline uses the default global window in the batch context (reading the CSV file), and so it aggregates the whole sample:

('STR_2', 491)
('STR_1', 326)

So I think this is behaving as you want it to in the batch context. I don't know why it behaves differently when reading from Pub/Sub. To debug that, I suggest copying the AnalyzeElement example from the Beam documentation. You can add a beam.ParDo(AnalyzeElement()) to your pipeline after the Window step, with and without your 'Apply custom timestamp' step, and see how your timestamps compare to what Pub/Sub assigns.
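
A minimal sketch of such a debugging DoFn, in the spirit of the docs example (the exact fields printed here are my own choice):

import apache_beam as beam

class AnalyzeElement(beam.DoFn):
    def process(
        self,
        element,
        timestamp=beam.DoFn.TimestampParam,
        window=beam.DoFn.WindowParam):
        # Emit the element together with its event-time timestamp
        # and the window it landed in.
        yield '\n'.join([
            'element: {}'.format(element),
            'timestamp: {} ({})'.format(timestamp, timestamp.to_utc_datetime()),
            'window: [{}, {})'.format(window.start, window.end),
        ])

Follow it with a beam.Map(print) so the comparison shows up in the worker output.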


Ok, I found the source of the problem. The associated timestamp was not equal to the actual current time, so the difference between the manually assigned event time and the automatically advancing watermark was so big that all data was discarded as late data. To make it work, it is enough to add an allowed lateness to the window options:

from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterWatermark)

                | beam.WindowInto(
                    beam.window.FixedWindows(30),
                    trigger=AfterWatermark(early=AfterCount(5)),
                    accumulation_mode=AccumulationMode.ACCUMULATING,
                    # keep accepting data whose event time is up to
                    # 100 days behind the watermark instead of dropping it
                    allowed_lateness=60 * 60 * 24 * 100)
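
To confirm this kind of skew up front, you can log how far each record's embedded timestamp lags behind the wall clock. This LogSkew DoFn is a hypothetical helper of mine, not part of the original pipeline; drop it in just before the 'Apply custom timestamp' step:

import time
import apache_beam as beam

class LogSkew(beam.DoFn):
    """Prints how far each record's event time lags behind processing time."""
    def process(self, element):
        # unix_timestamp is the last CSV column (index 7), as in the pipeline above
        skew = time.time() - int(element[7])
        print('event-time skew: {:.0f}s for store {}'.format(skew, element[0]))
        yield element

If the printed skew is larger than the window size plus the allowed lateness, the elements will be dropped as late data.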

More info:

Apache Beam GroupByKey Produces No Output

Beam: CombinePerKey(max) hang in dataflow job
