I have a stream of events that needs to be enriched with subscription information. Some of the events are broadcast events: when one is received, I need to query a database table to find all subscribers of that event (up to 10,000 rows in my use case) and transform the single broadcast event into 10,000 notification events. Normal events carry an additional user_id key that can be used to join the subscription table directly, so they don't have this problem.
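To make the fan-out concrete, here is roughly the transformation I have in mind, written as plain Java without any Flink dependencies (all type and method names are mine; in a real job this would presumably live in something like a FlatMapFunction, with the subscriber IDs coming from the subscription table):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event types, just to illustrate the shape of the data.
record BroadcastEvent(String topic, String payload) {}
record NotificationEvent(long userId, String payload) {}

class FanOut {
    // Turn one broadcast event into one notification event per subscriber.
    static List<NotificationEvent> fanOut(BroadcastEvent e, List<Long> subscriberIds) {
        List<NotificationEvent> out = new ArrayList<>(subscriberIds.size());
        for (long id : subscriberIds) {
            out.add(new NotificationEvent(id, e.payload()));
        }
        return out;
    }
}
```

The question is essentially how to run this fan-out at a scale of 10,000 subscribers per event without holding everything in one task's memory.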
The challenges are:
- How to join against a large ResultSet: loading all 10,000 rows into memory doesn't seem like a scalable solution. Is there a way to partition this into many smaller parallel tasks?
- How to organize the processing pipeline so that normal events and broadcast events don't interfere with each other. I don't want consecutive long-running broadcast events to block the processing of normal events.
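For the first challenge, one idea I'm considering is splitting the subscriber key space into contiguous ranges so that each parallel task runs its own bounded query (e.g. `... WHERE user_id BETWEEN ? AND ?`) instead of one task fetching all 10,000 rows. A minimal sketch of that splitting logic, again in plain Java with names of my own choosing:

```java
import java.util.ArrayList;
import java.util.List;

// Split an id key space [minId, maxId] into `parts` contiguous ranges,
// so each parallel task can issue its own bounded subscriber query.
class RangeSplitter {
    record Range(long lo, long hi) {} // inclusive bounds

    static List<Range> split(long minId, long maxId, int parts) {
        List<Range> ranges = new ArrayList<>(parts);
        long span = maxId - minId + 1;
        long base = span / parts;
        long rem = span % parts;  // first `rem` ranges get one extra id
        long lo = minId;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < rem ? 1 : 0);
            if (size == 0) break; // more parts than ids
            ranges.add(new Range(lo, lo + size - 1));
            lo += size;
        }
        return ranges;
    }
}
```

I don't know whether this maps cleanly onto Flink, though; is there a built-in mechanism (e.g. something like the JDBC source's partitioned scans, or async I/O) that achieves the same effect?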
I'm just getting started with Flink, so what would be a correct and performant architecture for this use case? If it helps, the broadcast event type and the normal event type can be split into two separate sources.