
I'm trying to replicate Flink's upsert-kafka connector example.

Using the following input:

event_id,user_id,page_id,user_region,viewtime
e0,1,11,TR,2022-01-01T13:26:41.298Z
e1,1,22,TR,2022-01-02T13:26:41.298Z
e2,2,11,AU,2022-02-01T13:26:41.298Z

and created a topic whose event structure looks like the following:

key: {"event_id":"e2"}, 
value: {"event_id": "e2", "user_id": 2, "page_id": 11, "user_region": "AU", "viewtime": "2022-02-01T13:26:41.298Z"}

Using the following logic, with a kafka source and an upsert-kafka sink:

CREATE TABLE pageviews_per_region (
  user_region STRING,
  pv BIGINT,
  uv BIGINT,
  PRIMARY KEY (user_region) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'pageviews_per_region',
  'properties.bootstrap.servers' = '...',
  'key.format' = 'json',
  'value.format' = 'json'
);

CREATE TABLE pageviews (
  user_id BIGINT,
  page_id BIGINT,
  viewtime TIMESTAMP(3),
  user_region STRING,
  WATERMARK FOR viewtime AS viewtime - INTERVAL '2' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'pageviews',
  'properties.bootstrap.servers' = '...',
  'format' = 'json'
);

-- calculate the pv, uv and insert into the upsert-kafka sink
INSERT INTO pageviews_per_region
SELECT
  user_region,
  COUNT(*),
  COUNT(DISTINCT user_id)
FROM pageviews
GROUP BY user_region;

I'm expecting to get just one key for {"user_region":"TR"} with an updated pv of 2. However, the created topic doesn't seem to be log compacted, so I'm observing two events for the same user_region:

k: {"user_region":"AU"}, v: {"user_region":"AU","pv":1,"uv":1}
k: {"user_region":"TR"}, v: {"user_region":"TR","pv":2,"uv":1}
k: {"user_region":"TR"}, v: {"user_region":"TR","pv":1,"uv":1}

Isn't the upsert-kafka connector supposed to create a log-compacted topic for this use case, or is it the developer's responsibility to update the topic configuration?

Alternatively, I may have misinterpreted something or made a mistake. Looking forward to hearing your thoughts. Thanks.

Hako

1 Answer


When you use CREATE TABLE to create a table for use with Flink SQL, you are describing how to interpret an existing data store as a table. In other words, you are creating metadata in a Flink catalog. It is Kafka that creates the topic on first access, and it's the developer's responsibility to adjust the log configuration to use a compaction strategy.
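
For completeness, here is a minimal sketch of pre-creating the compacted sink topic with Kafka's Java AdminClient before starting the Flink job. The topic name matches the DDL above; the bootstrap servers, partition count, and replication factor are placeholder values to adapt to your cluster:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Placeholder: use the same bootstrap servers as in the Flink DDL.
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      // Partition count and replication factor are illustrative values only.
      NewTopic topic = new NewTopic("pageviews_per_region", 1, (short) 1)
          .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                          TopicConfig.CLEANUP_POLICY_COMPACT));
      // Create the topic with cleanup.policy=compact before Flink writes to it.
      admin.createTopics(Collections.singletonList(topic)).all().get();
    }
  }
}

The same thing can be done from the command line with the kafka-topics tool by passing --config cleanup.policy=compact at creation time, or applied to an existing topic via kafka-configs.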

David Anderson
  • Also worth pointing out that only closed Kafka log segments get compacted; it's not immediate. – OneCricketeer Feb 21 '22 at 15:40
  • Thank you! So if I'd like to create a Flink app with upsert-kafka sink functionality, I should manually create the sink topic with log compaction first, before I use it in my application. Sounds good. – Hako Feb 21 '22 at 20:33
  • Have a quick question to validate my understanding. Is it accurate to say that the upsert-kafka connector as a sink only has the advantage over the regular kafka sink connector of publishing tombstones in deletion cases? – Hako Mar 08 '22 at 23:58