I want to associate each record with a timestamp that is already present in the record itself. According to the Beam docs, it is enough to do this in a transform that emits beam.window.TimestampedValue(). Unfortunately, once I do that, the pipeline does not produce any aggregated data.
My Python code with the Beam pipeline:
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# input_subscription and output_topic are not shown in this snippet
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=options)

def encode_byte_string(element):
    element = str(element)
    return element.encode('utf-8')

class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        # element[7] is the unix_timestamp column of the record
        unix_timestamp = element[7]
        yield beam.window.TimestampedValue(element, int(unix_timestamp))

def calculateProfit(elements):
    # Appends a ninth field (index 8) that is summed per key later on
    profit = elements[6]
    elements.append(str(profit))
    return elements

def checkit(row):
    # Decode the Pub/Sub payload and split the CSV row into fields
    r = row.decode('utf-8').split(',')
    print(r)
    return r

pubsub_data = (
    p
    | 'Read from pub sub' >> beam.io.ReadFromPubSub(subscription=input_subscription)
    | 'Remove extra chars' >> beam.Map(lambda data: data.strip())
    | 'Split Row' >> beam.Map(checkit)
    | 'Filter By Country' >> beam.Filter(lambda elements: elements[1] == "Mumbai" or elements[1] == "Bangalore")
    | 'Create Profit Column' >> beam.Map(calculateProfit)
    | 'Apply custom timestamp' >> beam.ParDo(AddTimestampDoFn())
    | 'Form Key Value pair' >> beam.Map(lambda elements: (elements[0], int(elements[8])))
    | 'Window' >> beam.WindowInto(window.FixedWindows(5))
    | 'Sum values' >> beam.CombinePerKey(sum)
    | 'Encode to byte string' >> beam.Map(encode_byte_string)
    | 'Write to pub sub' >> beam.io.WriteToPubSub(output_topic)
)

result = p.run()
result.wait_until_finish()
Sample data (read from a CSV file; the timestamps are within +/- 30 seconds of the current time):
Store_id,Store_location,Product_id,Product_category,number_of_pieces_sold,buy_rate,sell_price,unix_timestamp
STR_2,Mumbai,PR_265,Cosmetics,8,39,66,1659964835
STR_2,Mumbai,PR_347,Cosmetics,6,13,56,1659964836
STR_2,Mumbai,PR_566,Electronics,4,47,70,1659964837
STR_1,Bangalore,PR_314,Groceries,8,31,75,1659964838
STR_2,Mumbai,PR_854,Groceries,3,28,62,1659964839
STR_2,Mumbai,PR_234,Education,8,15,64,1659964840
STR_1,Bangalore,PR_854,Groceries,6,33,70,1659964841
STR_2,Mumbai,PR_243,Groceries,1,39,69,1659964842
STR_1,Bangalore,PR_124,Education,4,27,58,1659964843
STR_1,Bangalore,PR_265,Groceries,6,25,43,1659964844
STR_2,Mumbai,PR_111,Electronics,8,25,48,1659964845
STR_1,Bangalore,PR_101,Kitchen,6,11,60,1659964846
STR_1,Bangalore,PR_124,Groceries,10,18,20,1659964847
STR_2,Mumbai,PR_265,Groceries,7,43,56,1659964848
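To make clear what output I expect, here is a minimal plain-Python sketch (no Beam involved) of the aggregation applied to the sample rows above, using the timestamp stored in each record. The helper names window_start and sums_per_window are made up for this illustration, and the sketch assumes that 5-second fixed windows are aligned to the Unix epoch with no offset.

```python
from collections import defaultdict

WINDOW_SIZE = 5  # seconds, mirroring FixedWindows(5) in the pipeline

SAMPLE_ROWS = """\
STR_2,Mumbai,PR_265,Cosmetics,8,39,66,1659964835
STR_2,Mumbai,PR_347,Cosmetics,6,13,56,1659964836
STR_2,Mumbai,PR_566,Electronics,4,47,70,1659964837
STR_1,Bangalore,PR_314,Groceries,8,31,75,1659964838
STR_2,Mumbai,PR_854,Groceries,3,28,62,1659964839
STR_2,Mumbai,PR_234,Education,8,15,64,1659964840
STR_1,Bangalore,PR_854,Groceries,6,33,70,1659964841
STR_2,Mumbai,PR_243,Groceries,1,39,69,1659964842
STR_1,Bangalore,PR_124,Education,4,27,58,1659964843
STR_1,Bangalore,PR_265,Groceries,6,25,43,1659964844
STR_2,Mumbai,PR_111,Electronics,8,25,48,1659964845
STR_1,Bangalore,PR_101,Kitchen,6,11,60,1659964846
STR_1,Bangalore,PR_124,Groceries,10,18,20,1659964847
STR_2,Mumbai,PR_265,Groceries,7,43,56,1659964848
""".splitlines()


def window_start(unix_timestamp, size=WINDOW_SIZE):
    """Start of the epoch-aligned fixed window that contains the timestamp."""
    return unix_timestamp - unix_timestamp % size


def sums_per_window(rows):
    """Sum the value column per (window start, store id), using the record's own timestamp."""
    totals = defaultdict(int)
    for row in rows:
        fields = row.strip().split(',')
        if fields[1] not in ("Mumbai", "Bangalore"):
            continue  # same filter as in the pipeline
        store_id = fields[0]
        value = int(fields[6])       # the pipeline copies elements[6] into elements[8]
        event_time = int(fields[7])  # unix_timestamp column
        totals[(window_start(event_time), store_id)] += value
    return dict(totals)


for (start, store_id), total in sorted(sums_per_window(SAMPLE_ROWS).items()):
    print(start, store_id, total)
```

For the window starting at 1659964835 this gives 254 for STR_2 and 75 for STR_1, which is the kind of per-window, per-store sum I would expect the pipeline to publish.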
When I remove the step that associates the timestamp from the pipeline and keep 'Window' >> beam.WindowInto(window.FixedWindows(5)),
it works fine: it prints aggregated data every five seconds (I guess because it then uses the timestamp generated by Pub/Sub).
Why would there be no output when the timestamp is taken from the record rather than from the Pub/Sub metadata?
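As a possible first diagnostic, and only under the unconfirmed assumption that the problem is related to how far the record timestamps are from the time the messages are published, the plain-Python sketch below computes that offset for a few sample rows. The function name timestamp_offsets and the fixed reference_time are hypothetical; in a real run the reference would be the publish time of each message.

```python
def timestamp_offsets(rows, reference_time):
    """Return (product_id, offset_seconds) pairs.

    A negative offset means the record's timestamp is older than reference_time,
    a positive offset means it lies in the future relative to it.
    """
    offsets = []
    for row in rows:
        fields = row.strip().split(',')
        event_time = int(fields[7])  # unix_timestamp column
        offsets.append((fields[2], event_time - reference_time))
    return offsets


rows = [
    "STR_2,Mumbai,PR_265,Cosmetics,8,39,66,1659964835",
    "STR_2,Mumbai,PR_234,Education,8,15,64,1659964840",
    "STR_2,Mumbai,PR_265,Groceries,7,43,56,1659964848",
]

# Hypothetical stand-in for the moment the messages were published
reference_time = 1659964840

for product_id, offset in timestamp_offsets(rows, reference_time):
    print(product_id, offset)
```

If many offsets turn out to be negative, it may be worth checking whether those elements are being treated as late data, since windows do not accept late elements unless an allowed lateness is configured.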