
I have some slow-changing reference data that I want to have available when processing events in Flink using PyFlink. For example, imagine there is information about employee IDs, teams and departments and how they relate to one another. The reference data can fit into memory.

I then want to process events that make reference to this reference data, perhaps things that people did and which I then want to aggregate later downstream by team or department.

I am currently thinking of having two streams: one for reference data and the other for the main data. The reference data stream holds state (the map of employee → team → department) and I intend to broadcast that state to the main event stream; this seems to fit the Broadcast State pattern in the Flink docs. Both streams will live in some form of event log, e.g. Kinesis or Kafka, so in the event of a restart I can go back to the beginning of the available log and reference data should always be available.

My questions:

  1. How do I ensure that there is reference data available when I start to process the main event stream? (I do not want to do something like start a job with only reference data, snapshot state, restart job from that snapshot as that is very difficult to orchestrate inside Kubernetes in an automated fashion).
  2. Do I need to buffer the main events and only emit them downstream once the broadcast state that a particular main event needs is available?
  3. Is there an example of this buffering approach (ideally in Python)?
  4. What watermark and watermarking strategy would be best to use for the reference data events?
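To make question 2 concrete, this is the buffering behaviour I have in mind, modelled in plain Python (no Flink APIs; all class and field names here are illustrative). In a real job, the two handler methods would correspond to `process_element` and `process_broadcast_element` of a broadcast process function, with `ref` living in broadcast state and `buffer` in keyed state:

```python
# Plain-Python model of the buffering logic -- no Flink APIs involved.
# All names (BufferingJoin, on_reference, on_event) are illustrative only.

class BufferingJoin:
    """Holds main events until the reference entry they need has arrived."""

    def __init__(self):
        self.ref = {}     # employee_id -> (team, dept): the "broadcast state"
        self.buffer = {}  # employee_id -> events waiting for reference data

    def on_reference(self, employee_id, team, dept):
        """Broadcast side: update the mapping, then flush any waiting events."""
        self.ref[employee_id] = (team, dept)
        ready = self.buffer.pop(employee_id, [])
        return [self._enrich(e) for e in ready]

    def on_event(self, event):
        """Main side: emit the enriched event if reference data exists,
        otherwise buffer it and emit nothing."""
        emp = event["employee_id"]
        if emp in self.ref:
            return [self._enrich(event)]
        self.buffer.setdefault(emp, []).append(event)
        return []

    def _enrich(self, event):
        team, dept = self.ref[event["employee_id"]]
        return {**event, "team": team, "dept": dept}
```

So an event arriving before its reference row produces nothing, and flushes as soon as the matching reference row arrives.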
John

1 Answer


I wouldn't recommend starting with the DataStream API and broadcast state for the scenario you've described. It's much easier to start with a Table API application, where you either use a regular join (if the reference data is small) or a temporal join (if the reference data is bigger).
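As a rough sketch of what both variants look like in Flink SQL (table names, columns and connector details are placeholders for your Kafka/Kinesis setup; the execution lines are commented out because they need a running `TableEnvironment`):

```python
# Sketch of the two Table API join variants. Table and column names are
# placeholders; adapt the DDL to your actual Kafka/Kinesis sources.

# Regular join: simple, but Flink keeps both sides in state indefinitely,
# so it is only appropriate while the reference table stays small.
regular_join = """
    SELECT e.employee_id, e.action, r.team, r.dept
    FROM events AS e
    JOIN reference_data AS r
      ON e.employee_id = r.employee_id
"""

# Temporal join: enriches each event with the reference row as of the
# event's time. The versioned (reference) side needs a primary key and
# a watermark; old versions are cleaned up automatically.
temporal_join = """
    SELECT e.employee_id, e.action, r.team, r.dept
    FROM events AS e
    JOIN reference_data FOR SYSTEM_TIME AS OF e.event_time AS r
      ON e.employee_id = r.employee_id
"""

# Against a running cluster this would be executed along the lines of:
# from pyflink.table import EnvironmentSettings, TableEnvironment
# t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# t_env.execute_sql("CREATE TABLE events (...) WITH (...)")
# t_env.execute_sql("CREATE TABLE reference_data (...) WITH (...)")
# t_env.execute_sql(temporal_join).print()
```

The temporal join also answers your watermarking question indirectly: the reference table's watermark controls how far the join waits before resolving each event against the reference versions, which is exactly the "don't emit until reference data is available" behaviour you were planning to build by hand.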

Martijn Visser
  • I’ve tried using the Table API previously but found that state just built up and was never cleared. That caused snapshots to get slower and slower and eventually killed the cluster. I couldn’t find any documentation on how information in tables and views is cleared. – John Jul 27 '23 at 20:26
  • The configuration setting to set a TTL is documented at https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/#table-exec-state-ttl – Martijn Visser Jul 28 '23 at 09:43
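For completeness, in PyFlink that TTL can be set on the table environment's configuration (a sketch assuming Flink 1.15+, where `TableConfig.set` is available; the 12-hour duration is only an example to tune against how stale your reference data may be):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# Expire join state that has been idle for 12 hours, so snapshots
# stop growing without bound. Duration value is an example only.
t_env.get_config().set("table.exec.state-ttl", "12 h")
```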