Flink Version - 1.16.1
Use case ->
- Input - I have data coming in from a Kafka topic (not compacted). Say this data describes user activity on a computer.
- I have a file that contains a list of blacklisted users. Users can be added to this list and may also be deleted from it. I am using the filesystem connector for this. Whenever the list changes, I create another file in the directory monitored by Flink. I always want to refer to the latest data, which means I can either use the latest file in the directory or use the update_time tagged to each row in the list.
- Output - Another Kafka topic
Tables:
create table input (
  `sessionid` string,
  `usern` string,
  `time_sql_timestamp` TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,
  WATERMARK FOR time_sql_timestamp AS time_sql_timestamp - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'input',
  'properties.bootstrap.servers' = 'kafka:9092',
  'json.ignore-parse-errors' = 'true',
  'json.fail-on-missing-field' = 'false',
  'properties.group.id' = 'input',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json');
create table context (
  id string,
  username string,
  update_time TIMESTAMP(3),
  `file.path` STRING NOT NULL METADATA,
  `file.modification-time` TIMESTAMP_LTZ(3) NOT NULL METADATA,
  WATERMARK FOR `update_time` AS `update_time` - INTERVAL '1' SECOND
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/context',
  'format' = 'csv',
  'source.monitor-interval' = '10');
create table output (
  `sessionid` string,
  `usern` string,
  `time_sql_timestamp` TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'output',
  'properties.bootstrap.servers' = 'kafka:9092',
  'json.ignore-parse-errors' = 'true',
  'json.fail-on-missing-field' = 'false',
  'properties.group.id' = 'input',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json');
Flink SQL> show tables;
+------------+
| table name |
+------------+
| context |
| input |
| output |
+------------+
3 rows in set
Flink SQL> describe context;
+------------------------+------------------------+-------+-----+----------+-------------------------------------+
| name | type | null | key | extras | watermark |
+------------------------+------------------------+-------+-----+----------+-------------------------------------+
| id | STRING | TRUE | | | |
| username | STRING | TRUE | | | |
| update_time | TIMESTAMP(3) *ROWTIME* | TRUE | | | `update_time` - INTERVAL '1' SECOND |
| file.path | STRING | FALSE | | METADATA | |
| file.modification-time | TIMESTAMP_LTZ(3) | FALSE | | METADATA | |
+------------------------+------------------------+-------+-----+----------+-------------------------------------+
5 rows in set
Flink SQL> describe input;
+--------------------+------------------------+------+-----+-----------------------------------+--------------------------------------------+
| name | type | null | key | extras | watermark |
+--------------------+------------------------+------+-----+-----------------------------------+--------------------------------------------+
| sessionid | STRING | TRUE | | | |
| usern | STRING | TRUE | | | |
| time_sql_timestamp | TIMESTAMP(3) *ROWTIME* | TRUE | | METADATA FROM 'timestamp' VIRTUAL | `time_sql_timestamp` - INTERVAL '5' SECOND |
+--------------------+------------------------+------+-----+-----------------------------------+--------------------------------------------+
3 rows in set
Flink SQL> describe output;
+--------------------+--------------+------+-----+--------+-----------+
| name | type | null | key | extras | watermark |
+--------------------+--------------+------+-----+--------+-----------+
| sessionid | STRING | TRUE | | | |
| usern | STRING | TRUE | | | |
| time_sql_timestamp | TIMESTAMP(3) | TRUE | | | |
+--------------------+--------------+------+-----+--------+-----------+
3 rows in set
What I want to do is this: select a row from input if its username exists in the latest context data, and insert it into the output topic. Something like:
insert into output (SELECT input.* from input, context where usern = context.username and context.`file.modification-time` = (select max(`file.modification-time`) from context));
or
insert into output (SELECT input.* from input, context where usern = context.username and context.update_time = (select max(context.update_time) from context));
The above queries do not give me an append-only result; they produce an updating result, so I am not able to execute them successfully. I get this error:
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.output' doesn't support consuming update changes which is produced by node Join(joinType=[InnerJoin], where=[(usern = username)], select=[sessionid, usern, time_sql_timestamp, username], leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey])
I have also tried using a tumbling window on the input data, and using a compacted Kafka topic with the upsert-kafka connector for the context table, as below, but no luck:
Flink SQL> INSERT INTO output WITH input_cte AS (SELECT time_sql_timestamp, window_start, window_end, window_time FROM
> (SELECT * from TABLE (TUMBLE(TABLE input, DESCRIPTOR(time_sql_timestamp), INTERVAL '10' SECOND)),
> context where (usern = context.username) ORDER BY time_sql_timestamp))
> (SELECT i.* FROM input_cte AS w, input as i WHERE
> i.time_sql_timestamp BETWEEN ASYMMETRIC w.window_start AND w.window_end and EXISTS
> (SELECT * from context where context.username = i.usern));
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.output' doesn't support consuming update and delete changes which is produced by node Join(joinType=[InnerJoin], where=[((time_sql_timestamp >= window_start) AND (time_sql_timestamp <= window_end))], select=[sessionid, usern, time_sql_timestamp, window_start, window_end], leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey])
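For reference, the upsert-kafka variant of the context table looked roughly like this (a sketch from memory; the topic name and exact options are illustrative):
-- Sketch of the upsert-kafka variant of the context table, keyed by username
create table context_upsert (
  id string,
  username string,
  update_time TIMESTAMP(3),
  PRIMARY KEY (username) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'context',
  'properties.bootstrap.servers' = 'kafka:9092',
  'key.format' = 'json',
  'value.format' = 'json');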
I also tried using views to get the latest snapshot of the context data, but I get the same error: 'default_catalog.default_database.output' doesn't support consuming update changes.
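The view was essentially a deduplication on username that keeps the row with the latest update_time, roughly like this (a sketch; the view name is illustrative):
-- Sketch of the "latest context" view: keep only the newest row per username
CREATE VIEW latest_context AS
SELECT id, username, update_time
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY username ORDER BY update_time DESC) AS row_num
  FROM context
)
WHERE row_num = 1;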
Is there a way to achieve what I intend to, keeping the output table append-only rather than result-updating?