I would like to create a per-user view of data tables stored in Flink that is continuously updated as the source data changes, so that I can drive a live UI from a toChangelogStream() of the user's view of the data.
of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X
and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seem like this is possible.
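Concretely, I have in mind something like this per session (a minimal sketch; the table foo, the userid column, and the literal 42 are placeholders for my actual schema):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class PerUserViewJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Assumes a dynamic table `foo` has already been registered,
        // e.g. from a CDC or upsert connector, so it carries changes.
        Table userView = tEnv.sqlQuery("SELECT * FROM foo WHERE userid = 42");

        // The changelog stream should begin with +I rows for the current
        // per-user state, then carry -U/+U/-D rows as the source changes.
        DataStream<Row> changelog = tEnv.toChangelogStream(userView);

        // In the real setup this sink would push rows to the user's UI session.
        changelog.print();

        env.execute("per-user-view-42");
    }
}
```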
However, I anticipate that each ad-hoc query may carry significant overhead if I do it this way. Based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like each SQL query is internally compiled into a new JAR and creates new pipeline stages, which I assume consumes additional JVM metaspace per user. I can have tens of thousands of users on the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other approaches I'm considering are:
- I could maybe use queryable state, since I could group the current rows behind the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to re-query the state periodically, which is not ideal for my use case (the per-user state can sometimes be large but doesn't change quickly). See the polling sketch after this list.
- Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink (the fan-out itself is sketched below), but if I do that, what's the best pattern for joining those together in the client?
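For the queryable state option, the client side would look roughly like this, assuming the job's keyed state has been made queryable; the state name "rows-per-user", the proxy host/port, and the ListState-of-Row shape are all assumptions about how I'd structure it:

```java
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.queryablestate.client.QueryableStateClient;
import org.apache.flink.types.Row;

public class PollUserRows {
    public static void main(String[] args) throws Exception {
        // Hypothetical host/port of a TaskManager's queryable state proxy.
        QueryableStateClient client = new QueryableStateClient("flink-tm-host", 9069);

        JobID jobId = JobID.fromHexString(args[0]);
        Long userId = Long.parseLong(args[1]);

        // Must match the descriptor under which the keyed state was made
        // queryable in the job (the name "rows-per-user" is made up here).
        ListStateDescriptor<Row> descriptor =
            new ListStateDescriptor<>("rows-per-user", Row.class);

        // The periodic re-query I'd rather avoid: every poll fetches the
        // user's full row set even when nothing has changed.
        while (true) {
            ListState<Row> rows = client
                .getKvState(jobId, "rows-per-user", userId, Types.LONG, descriptor)
                .join();
            rows.get().forEach(System.out::println);
            Thread.sleep(5_000);
        }
    }
}
```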
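For the second option, I believe the fan-out itself is straightforward with a StatementSet; the sink table names foo_kafka and foo_jdbc are made up, and the hard part remains the client-side join I'm asking about:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class DualSinkJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Assumed DDL elsewhere: `foo` (the source), `foo_kafka` (a
        // changelog sink, e.g. upsert-kafka) and `foo_jdbc` (a JDBC sink
        // on the external RDBMS) -- all hypothetical table names.
        StatementSet statements = tEnv.createStatementSet();
        statements.addInsertSql("INSERT INTO foo_kafka SELECT * FROM foo");
        statements.addInsertSql("INSERT INTO foo_jdbc SELECT * FROM foo");

        // Both inserts run as a single job writing to the two sinks.
        statements.execute();
    }
}
```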