
I have a stream of events that needs to be enriched with subscription information. Some of the events are broadcast events: when one is received, I have to go to a database table, find all the subscribers of the event (it can be 10,000 rows in my use case), and then transform the single broadcast event into 10,000 notification events. Normal events carry an additional user_id key that can be used to join the subscription table, so they don't have this issue.

The challenges are:

  • How to join against a large ResultSet: returning all the rows to memory doesn't seem like a scalable solution. Is there a way to partition this into many smaller parallel tasks?
  • How to organize the processing pipeline so that normal events and broadcast events don't interfere with each other. I don't want consecutive long-running broadcast events to block the processing pipeline for normal events.

I'm just getting started with Flink. What would be a correct and performant architecture for this use case? If needed, the broadcast event type and the normal event type can be separated into two sources.

1 Answer


Ideally, you can provide the secondary information (the database table) as an additional input to Flink and then simply use a join. That is only viable if the information can be fetched through a Flink connector. The advantage is that, if you do it correctly, even updates to the table are reflected appropriately in the output. You also don't need to worry about the result size, as that is handled automatically by Flink.
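
For example, with the SQL/Table API you could register the subscription table through the JDBC connector and join it with the event stream. The following is only a rough sketch: the table and column names, the connection URL, and the `events`/`notifications` tables are placeholders, and the exact connector options depend on your Flink version.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SubscriptionJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register the subscription table via the JDBC connector
        // (URL and table/column names are placeholders for illustration).
        tEnv.executeSql(
            "CREATE TABLE subscriptions (" +
            "  event_type STRING," +
            "  subscriber_id BIGINT" +
            ") WITH (" +
            "  'connector'  = 'jdbc'," +
            "  'url'        = 'jdbc:postgresql://dbhost:5432/mydb'," +
            "  'table-name' = 'subscriptions'" +
            ")");

        // 'events' and 'notifications' are assumed to be tables you have registered
        // for the incoming event stream and the output sink (e.g. Kafka).
        // The join fans a single broadcast event out into one row per subscriber.
        tEnv.executeSql(
            "INSERT INTO notifications " +
            "SELECT e.event_id, s.subscriber_id " +
            "FROM events e " +
            "JOIN subscriptions s ON e.event_type = s.event_type");
    }
}
```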

Alternatively, you can use async I/O, which is made specifically for interacting with external systems. The downside of async I/O is that currently all results of all active requests have to fit into main memory. But that should be viable for 10,000 rows, especially since the respective events seem to occur rather rarely.
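
A rough sketch of the async I/O variant with the DataStream API, assuming a stream of broadcast events and a DAO that can fetch subscribers; `Event`, `Notification`, and `SubscriptionDao` are placeholders for your own types.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class SubscriberLookup extends RichAsyncFunction<Event, Notification> {

    private transient SubscriptionDao dao; // placeholder for your own DAO

    @Override
    public void open(Configuration parameters) {
        dao = new SubscriptionDao();
    }

    @Override
    public void asyncInvoke(Event event, ResultFuture<Notification> resultFuture) {
        // Run the blocking DAO call on a separate thread and complete the result
        // future with one notification per subscriber of the broadcast event.
        CompletableFuture
            .supplyAsync(() -> dao.fetchSubscribers(event.getEventType()))
            .thenAccept(subscribers -> resultFuture.complete(
                subscribers.stream()
                    .map(sub -> new Notification(event, sub))
                    .collect(Collectors.toList())));
    }
}
```

You would then wire it into the pipeline with something like `AsyncDataStream.unorderedWait(broadcastEvents, new SubscriberLookup(), 30, TimeUnit.SECONDS, 10)`, where the last parameter caps how many lookups may be in flight at the same time.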

Arvid Heise
  • How can I feed a database table to Flink? I don't know how this is possible. The physical table is sharded into many partitions, though I can query it through a DAO. I researched AsyncIO, but it doesn't seem to resolve the issue. The ideal solution for me is not to load 10k rows into memory all at once. Is there a way to progressively load the records (like a JDBC cursor or pagination)? – Daniel S. Hatten Apr 27 '20 at 12:13
  • From your question, it's not easy to see which API you are using, so here is the [JDBC connector for SQL/Table](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connect.html#jdbc-connector). – Arvid Heise Apr 27 '20 at 12:29
  • The tables are in Oracle, and I didn't see that the JDBC connector has support for that. :( – Daniel S. Hatten Apr 27 '20 at 13:32
  • @DanielS.Hatten, did you resolve this issue? If yes, can you please share your solution? – deeplay Sep 20 '21 at 14:24
  • Just note that an Oracle connector is currently being developed. Because of licensing, it's in the Ververica repo: https://github.com/ververica/flink-cdc-connectors/pull/418. – Arvid Heise Sep 29 '21 at 07:47