Surrogate Key Mapping for large (50 Million) keysets in Apache Flink

Question

I have a use case where an Apache Flink job must integrate near-real-time data streams (events) from multiple sources. Because the source systems lack uniform keys, I need to use a surrogate key (SK) lookup from an existing database. The SK data set is very large (50 million+ keys). Is it possible/advisable to cache such a data set for in-stream transformation (mapping) without a DB lookup? If yes, what are the caching limitations? If not, what alternatives are possible with Flink?

score 1 · Answer 1 · answered Dec 11 '19 at 11:15

There are a few options

Local map

If the surrogate key mapping never changes, you could just load it in RichMapFunction#open and perform the lookup in map(). That of course means that you will have to adjust the memory settings so that Flink doesn't try to take all the memory for its own operations.

Some quick math: assume both keys are strings of length 10. At 2 bytes per char, the two keys together need 40 bytes of chars in memory. With some object overhead, we get to ~50 bytes per entry. With 50M entries, we need 2.5 GB of RAM to store that. Because the hash map will have some overhead, I'd plan with 3 GB of RAM.
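The arithmetic above can be reproduced with a small helper, which makes it easy to plug in other key lengths or entry counts. This is only a sketch: the class and method names are made up for illustration, and the 10 bytes of per-entry overhead is simply the figure implied by the estimate above. Real JVM overhead per String and per hash map node is usually higher, so it is worth measuring with your actual data.

```java
final class MemoryEstimate {
    // Assumption from the estimate above: a Java char takes 2 bytes.
    static final int BYTES_PER_CHAR = 2;

    // Bytes for one entry: two keys of the given length plus a flat overhead.
    static long bytesPerEntry(int keyLength, int overheadBytes) {
        return 2L * keyLength * BYTES_PER_CHAR + overheadBytes;
    }

    // Total bytes for the whole key set (before hash map overhead).
    static long totalBytes(long entries, long bytesPerEntry) {
        return entries * bytesPerEntry;
    }

    public static void main(String[] args) {
        long perEntry = bytesPerEntry(10, 10);            // 40 bytes of chars + 10 overhead
        long total = totalBytes(50_000_000L, perEntry);   // 50M entries
        System.out.println(perEntry + " bytes per entry, " + total + " bytes in total");
    }
}
```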

So if your task manager has 8 GB, I'd set taskmanager.memory.size to 4 GB.

Ofc, you need to ensure that different tasks of the same task manager are not loading the same map twice. Also, I'd choose a format that can be loaded as quickly as possible (e.g., Avro), because slow parsing will greatly increase startup and recovery time.
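One way to load the map only once per task manager JVM is a static holder with synchronized lazy initialization, sketched below in plain Java. The names `SharedLookup` and `getOrLoad` are hypothetical, and the loader stands in for whatever code reads your SK data set. The idea would be to call `getOrLoad` from each task's `open()`. It assumes all tasks of the job share one class loader inside the task manager, since static fields are only shared within a class loader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class SharedLookup {
    // Dedicated lock object: the map is still null before the first load,
    // so the map itself cannot serve as the monitor.
    private static final Object LOCK = new Object();
    private static volatile Map<String, String> sharedMap;

    // Returns the JVM-wide map; the loader runs at most once.
    static Map<String, String> getOrLoad(Supplier<Map<String, String>> loader) {
        Map<String, String> local = sharedMap;
        if (local == null) {
            synchronized (LOCK) {
                local = sharedMap;            // re-check under the lock
                if (local == null) {
                    local = loader.get();     // expensive load happens here
                    sharedMap = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Hypothetical loader with a single made-up entry.
        Supplier<Map<String, String>> loader = () -> {
            Map<String, String> m = new HashMap<>();
            m.put("crm-42", "SK-1");
            return m;
        };
        Map<String, String> first = getOrLoad(loader);
        Map<String, String> second = getOrLoad(loader);
        System.out.println(first == second);  // same instance for both callers
    }
}
```

Since the map is only read after loading, a plain HashMap is enough here; if tasks ever had to update it, a concurrent map or the state-based approach would be the safer choice.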

State-based

If memory is an issue or the data is changing, you can also model the lookup data as map state. I'd add a second input for the lookup data and use a KeyedCoProcessFunction. Then feed whatever comes from the second input into the map state. The state should use the RocksDB backend, so that the data effectively resides on disk.

Joining data

A lookup can also be modeled as a join. If you are already using Table API, have a look at Join with Temporal Table. This will internally use the state-based approach but is much more concise. You can also mix DataStream with Tables.

"Ofc, you need to ensure that different tasks of the same task manager are not loading the same map twice" How can you ensure that? — damjad, Nov 28 '20 at 22:56
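You'd typically use a static variable initialized in `open()` with some synchronization around it. `synchronized (LOCK) { if (staticMap == null) { staticMap = initializeMap(); } }`, where `LOCK` is a static final lock object (synchronizing on `staticMap` itself would throw while it is still null). — Arvid Heise, Dec 03 '20 at 08:30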
