I have a Flink application that relies on the Table API. I have a Kafka topic from which I create a table. We also maintain an S3 object containing a list of IP addresses and some metadata.

We also want to create a table on this S3 object. The object's path is static and does not change, but I can overwrite the object, and I want the table to refresh with the new data.

Basically, I have an in-memory collection read from the S3 object. How can I create a table from it and join it with the Kafka table most efficiently? The table should be refreshed whenever the S3 object is updated.

– lalala

1 Answer

If you create a Table backed by the S3 object using the FileSystem SQL Connector, it might do what you are looking for. Note, however, that file system sources are not fully developed, and you may run into limitations that affect your use case.
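
For illustration, here is a minimal sketch of such a table definition, assuming a CSV object; the schema, bucket, and object name are hypothetical, while the connector options are the standard FileSystem SQL Connector ones:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());

// Hypothetical schema and S3 path; adjust to the actual object layout.
tEnv.executeSql(
        "CREATE TABLE ip_metadata (" +
        "  ip STRING," +
        "  label STRING" +
        ") WITH (" +
        "  'connector' = 'filesystem'," +
        "  'path' = 's3://my-bucket/ip-metadata.csv'," + // static object path
        "  'format' = 'csv'" +
        ")");
```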

You could instead use StreamExecutionEnvironment#readFile (docs) and convert the DataStream it produces into a Table. Note that if you read a file with readFile in FileProcessingMode.PROCESS_CONTINUOUSLY mode and then modify the file, the entire file will be re-ingested.
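
A minimal sketch of that alternative, with a hypothetical path and a 60-second monitoring interval; with PROCESS_CONTINUOUSLY, Flink re-scans the path at that interval and re-ingests the whole file whenever it changes:

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

String path = "s3://my-bucket/ip-metadata.csv"; // hypothetical object path

// Monitor the path every 60 seconds; when the file changes, it is re-read in full.
DataStream<String> lines = env.readFile(
        new TextInputFormat(new Path(path)),
        path,
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        60_000L);

// The resulting Table has a single STRING column (f0);
// parse each line into real columns before joining.
Table ipMetadata = tEnv.fromDataStream(lines);
```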

– David Anderson
  • Isn't the connector expecting a directory (prefix) instead of an object path? – lalala Nov 12 '20 at 18:31
  • I'm not sure; I know the filesystem SQL sink works that way. I've expanded my answer to include an alternative. – David Anderson Nov 12 '20 at 18:42
  • Thank you, I guess `PROCESS_CONTINUOUSLY` would work. If I don't misunderstand, it will monitor the file and re-ingest it. I hope it will drop the existing content when a new file arrives, and do this atomically. – lalala Nov 12 '20 at 19:10
  • No, it won't happen atomically. – David Anderson Nov 12 '20 at 19:16
  • Do you think I could use the concept of `Temporal Tables`? For example, something similar to a Hive temporal table (https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/hive/hive_streaming.html#hive-table-as-temporal-tables). My data volume is relatively small; I just need a proper abstraction that lets me refresh my metadata and provides an SQL interface for joins. – lalala Nov 12 '20 at 21:14
  • Flink has temporal tables (https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/temporal_tables.html), and yes, I think that can be part of a solution (see the sketch after these comments). But if you need to update the entire table as one atomic transaction, that will be an interesting challenge. – David Anderson Nov 13 '20 at 09:21
  • @DavidAnderson is `PROCESS_CONTINUOUSLY` supported by pyflink? – Amir Afianian Jan 10 '23 at 15:50
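
To illustrate the temporal table idea from the comments, here is a minimal sketch that reuses `tEnv` and the metadata table from the readFile sketch above. It assumes the metadata table `ipTable` has a processing-time attribute `proc_time` and key column `ip`, and that the Kafka table is registered as `kafka_events` with its own `proc_time` attribute; all of these names are hypothetical:

```java
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.Table;
import org.apache.flink.table.functions.TemporalTableFunction;

// Turn the metadata table into a temporal table function keyed by ip,
// versioned by its processing-time attribute.
TemporalTableFunction ipMetadata =
        ipTable.createTemporalTableFunction($("proc_time"), $("ip"));
tEnv.registerFunction("IpMetadata", ipMetadata);

// Join each Kafka event against the metadata version current at processing time.
Table joined = tEnv.sqlQuery(
        "SELECT k.ip, m.label " +
        "FROM kafka_events AS k, " +
        "     LATERAL TABLE (IpMetadata(k.proc_time)) AS m " +
        "WHERE k.ip = m.ip");
```

As noted in the comments, this gives a convenient SQL join abstraction over small, refreshable metadata, but it does not make the swap of the whole file atomic.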