
Flink 1.14, Java, Table API + DataStream API (toDataStream/toAppendStream). I'm trying to: read events from Kafka, aggregate them hourly (sum, count, etc.), and upsert the results to Cassandra as soon as new events arrive; in other words, create new records or recalculate existing ones on every new event and sink the results to Cassandra immediately. The aim is to see continuously updating sum and count values for primary-keyed records. For this purpose I'm using SQL:

```
...
TUMBLE(TABLE mytable, DESCRIPTOR(action_datetime), INTERVAL '1' HOURS)
...
```
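
For context, a minimal sketch of the setup (the SELECT list and the `user_id`/`amount` columns are placeholders, not my real schema; note that the window TVF has to be wrapped in `TABLE(...)` in the FROM clause):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// Hourly tumbling-window aggregation over the Kafka-backed table.
// `user_id` and `amount` are placeholder column names.
Table hourly = tEnv.sqlQuery(
    "SELECT window_start, window_end, user_id, " +
    "       SUM(amount) AS total_amount, COUNT(*) AS event_count " +
    "FROM TABLE(" +
    "  TUMBLE(TABLE mytable, DESCRIPTOR(action_datetime), INTERVAL '1' HOURS)) " +
    "GROUP BY window_start, window_end, user_id");
```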

But the task only sends results to Cassandra after the window interval expires (every 1 hour). I know it works as described in the docs:

Unlike other aggregations on continuous tables, window aggregation do not emit intermediate results but only a final result, the total aggregation at the end of the window.

Question: how can I achieve this behavior (emit intermediate results to the sink as soon as a new event comes in), instead of waiting 1 hour for the window to close?

deeplay
  • I am also looking for something similar: continuously updating a sum and emitting an event on every update instead of after the window interval expires. Any updates on this one? – Chaos Jan 24 '23 at 09:08
  • I used a CUMULATE window (as David suggested). But (IMHO) beware of using small values for the cumulate window step; it can lead to high resource consumption. If you are using the old windowing type (not TVFs), you can try the `table.exec.emit.early-fire.enabled` option as mentioned here - https://stackoverflow.com/questions/69904203/getting-partial-results-from-windowed-aggregation-in-apache-flinks-table-api - but I never tried that. P.S.: this comment is about the Table API. – deeplay Jan 24 '23 at 11:07
  • I'm using it like this: ```CUMULATE(TABLE my_table_name, DESCRIPTOR(proc_time), INTERVAL '10' MINUTES, INTERVAL '1' HOURS)```. It will fire updates for the current hour every 10 minutes. – deeplay Jan 24 '23 at 11:08
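
For reference, a sketch of setting the early-fire options mentioned in the comments above (experimental, and they apply only to the legacy GROUP BY windows, not to window TVFs; assuming a `StreamTableEnvironment` named `tEnv` as in the question's setup):

```java
// Experimental options for legacy GROUP BY windows; untested by the commenters above.
tEnv.getConfig().getConfiguration()
    .setString("table.exec.emit.early-fire.enabled", "true");
tEnv.getConfig().getConfiguration()
    .setString("table.exec.emit.early-fire.delay", "1 min"); // emit partial results every minute
```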

1 Answer


Here are some options. Perhaps none of them is precisely what you want:

(1) Use CUMULATE instead of TUMBLE. This won't give you updated results with every new event, but you can have the result updated frequently -- e.g., once a minute.
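
A sketch of option (1), reusing the table and timestamp column from the question, the placeholder `user_id`/`amount` columns, and a `StreamTableEnvironment` named `tEnv`:

```java
// CUMULATE emits an updated result for the current hour on every step
// (here: every 10 minutes), instead of once at the end of the hour.
Table cumulated = tEnv.sqlQuery(
    "SELECT window_start, window_end, user_id, " +
    "       SUM(amount) AS total_amount, COUNT(*) AS event_count " +
    "FROM TABLE(" +
    "  CUMULATE(TABLE mytable, DESCRIPTOR(action_datetime), " +
    "           INTERVAL '10' MINUTES, INTERVAL '1' HOURS)) " +
    "GROUP BY window_start, window_end, user_id");
```

On the Cassandra side, keeping `window_start` (but not `window_end`) in the primary key makes each later firing overwrite the earlier partial result for the same hour, as discussed in the comments.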

(2) Use OVER aggregation. This will give you a continuously updated aggregation over the previous 60 minutes (aligned to each event, rather than to the epoch).
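
A sketch of option (2), under the same assumptions:

```java
// One output row per input event, aggregating over the hour that
// precedes each event's timestamp.
Table rolling = tEnv.sqlQuery(
    "SELECT action_datetime, user_id, " +
    "       SUM(amount) OVER w AS rolling_sum, " +
    "       COUNT(*) OVER w AS rolling_count " +
    "FROM mytable " +
    "WINDOW w AS (" +
    "  PARTITION BY user_id " +
    "  ORDER BY action_datetime " +
    "  RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW)");
```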

(3) Use DataStream windows with a custom Trigger that fires with each event. This will provide the behavior you've asked for, but will require a rewrite using the DataStream API.
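
A sketch of option (3): a custom Trigger that fires on every element and purges the window state at the end of the event-time window (the class name is mine):

```java
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class FireOnEveryEventTrigger extends Trigger<Object, TimeWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp,
                                   TimeWindow window, TriggerContext ctx) {
        // Register a clean-up timer for the end of the window.
        ctx.registerEventTimeTimer(window.maxTimestamp());
        // Emit the updated aggregate for every incoming event.
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        // Final firing at the end of the window, then discard its state.
        return time == window.maxTimestamp()
                ? TriggerResult.FIRE_AND_PURGE
                : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}
```

It would be attached with something like `.keyBy(...).window(TumblingEventTimeWindows.of(Time.hours(1))).trigger(new FireOnEveryEventTrigger()).aggregate(...)`.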

David Anderson
  • Thanks, David. I'll try the CUMULATE approach; it looks fine to emit every 5 minutes and, on the Cassandra side, to leave the `window_end` field out of the primary key. I'm asking this question here because I remember that in `ksqldb/kafka-streams` windows are implemented exactly like this: they emit every change. I wonder why Flink chose such an implementation; was it to save resources? – deeplay Sep 27 '22 at 11:40
  • It makes things simpler and more flexible if the output of the window operator is an append-only stream. Update streams are more constrained in how they can be processed. – David Anderson Sep 27 '22 at 13:51