
I currently have an Apache Beam pipeline that writes Pub/Sub messages to BigQuery and GCS in real time. My next goal is to pull messages from Pub/Sub in 5-minute windows, collect those windowed batches for an hour (so 12 windowed batches per hour), perform the analysis on all 12 batches together at the end of the hour, and write the results to my desired sinks. How can I achieve this?

I hope I've made myself clear; please ask if anything needs clarification. Thanks in advance!

Mihir Sharma
  • Every 5th minute, write into another Pub/Sub queue; every 60th minute, read from that queue. :-) – oakad Oct 10 '22 at 04:37
  • Thanks for the answer, but can I achieve it using Apache Beam itself? – Mihir Sharma Oct 10 '22 at 05:03
  • In the general case, you don't know how many workers your pipeline will run on. The pipeline stages can be moved around between workers and so on (Beam does a lot of serializing behind the scenes). Thus, for anything like your question you need to store your data externally to Beam. – oakad Oct 10 '22 at 06:31

1 Answer


So, I have found the answer: it comes down to setting up timers, which are used to build larger batches of data before processing; the analysis can then be performed on those batches via windowing of the input. A sketch of this approach follows below.
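
For anyone looking for a concrete starting point, here is a minimal sketch using the Beam Python SDK: the stream keeps its 5-minute fixed windows and is grouped into per-5-minute batches, then those batches are re-windowed into hourly windows and buffered by a stateful `DoFn` whose watermark timer flushes and analyses all ~12 of them when the hour closes. The subscription path, the `'all'` key, and `analyse()` are placeholders for your own names and logic, not parts of the original pipeline:

```python
import apache_beam as beam
from apache_beam.coders import IterableCoder, StrUtf8Coder
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.transforms.window import FixedWindows


def analyse(batches):
    # Placeholder for your hourly analysis over the ~12 buffered
    # 5-minute batches.
    return {'num_batches': len(batches),
            'num_messages': sum(len(b) for b in batches)}


class HourlyBatchFn(beam.DoFn):
    """Buffers the 5-minute batches that fall into one hourly window and
    emits a single analysis result when that window closes."""
    BUFFER = BagStateSpec('buffer', IterableCoder(StrUtf8Coder()))
    FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

    def process(self,
                element,
                window=beam.DoFn.WindowParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        _, batch = element          # element is (key, one 5-minute batch)
        buffer.add(batch)
        flush.set(window.end)       # fire when the hourly window closes

    @on_timer(FLUSH)
    def on_flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
        batches = list(buffer.read())
        buffer.clear()
        yield analyse(batches)


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromPubSub(
         subscription='projects/<project>/subscriptions/<sub>')
     | 'Decode' >> beam.Map(lambda b: b.decode('utf-8'))
     # The 5-minute windows the pipeline already uses:
     | 'Window5m' >> beam.WindowInto(FixedWindows(5 * 60))
     | 'KeyAll' >> beam.Map(lambda v: ('all', v))  # state/timers need keyed input
     | 'Batch5m' >> beam.GroupByKey()              # one batch per 5-minute window
     | 'ToList' >> beam.MapTuple(lambda k, vs: (k, list(vs)))
     # Regroup the 5-minute batches into hourly windows and buffer them:
     | 'Window1h' >> beam.WindowInto(FixedWindows(60 * 60))
     | 'AnalyseHourly' >> beam.ParDo(HourlyBatchFn())
     | 'Write' >> beam.Map(print))  # replace with your BigQuery/GCS sinks
```

Note that with the default watermark trigger, a plain `GroupByKey` on the hourly window would batch the elements in much the same way; the timer just makes the flush point explicit and gives you a hook to extend, for example with early or processing-time firings.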

Mihir Sharma