
Flink Version - 1.16.1

Use case ->

  1. Input - Data arrives on a Kafka topic (not compacted). Say this data describes user activity on a computer.
  2. I have a file that holds a list of blacklisted users. Users can be added to this list and users may also be deleted from it. I am using the filesystem connector for this. Whenever the list of users changes, I create another file in the directory monitored by Flink. I always want to refer to the latest data, which means I can either use the latest file in the directory or use the update_time tagged to each row in the list.
  3. Output - Another kafka topic

Tables-

create table input
(
    `sessionid` string,
    `usern` string,
    `time_sql_timestamp` TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,
    WATERMARK FOR time_sql_timestamp AS time_sql_timestamp - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'input',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'input',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json',
    'json.ignore-parse-errors' = 'true',
    'json.fail-on-missing-field' = 'false');
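
For reference, an event on the input topic looks something like this (made-up values; time_sql_timestamp does not come from the payload, it is read from the Kafka record timestamp metadata):

{"sessionid": "session-123", "usern": "alice"}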


create table context
(
    id string,
    username string,
    update_time TIMESTAMP(3),
    `file.path` STRING NOT NULL METADATA,
    `file.modification-time` TIMESTAMP_LTZ(3) NOT NULL METADATA,
    WATERMARK FOR `update_time` AS `update_time` - INTERVAL '1' SECOND
) WITH (
    'connector' = 'filesystem',
    'path' = '/tmp/context',
    'format' = 'csv',
    'source.monitor-interval' = '10');
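
The files in /tmp/context look roughly like this (hypothetical rows; columns are id, username, update_time; file names are arbitrary):

part-001.csv:
1,alice,2023-05-01 10:00:00
2,bob,2023-05-01 10:00:00

part-002.csv (written after bob was removed and carol was added):
1,alice,2023-05-02 09:30:00
3,carol,2023-05-02 09:30:00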



create table output
(
    `sessionid` string,
    `usern` string,
    `time_sql_timestamp` TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'output',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'input',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json',
    'json.ignore-parse-errors' = 'true',
    'json.fail-on-missing-field' = 'false');

Flink SQL> show tables;
+------------+
| table name |
+------------+
|    context |
|      input |
|     output |
+------------+
3 rows in set

Flink SQL> describe context;
+------------------------+------------------------+-------+-----+----------+-------------------------------------+
|                   name |                   type |  null | key |   extras |                           watermark |
+------------------------+------------------------+-------+-----+----------+-------------------------------------+
|                     id |                 STRING |  TRUE |     |          |                                     |
|               username |                 STRING |  TRUE |     |          |                                     |
|            update_time | TIMESTAMP(3) *ROWTIME* |  TRUE |     |          | `update_time` - INTERVAL '1' SECOND |
|              file.path |                 STRING | FALSE |     | METADATA |                                     |
| file.modification-time |       TIMESTAMP_LTZ(3) | FALSE |     | METADATA |                                     |
+------------------------+------------------------+-------+-----+----------+-------------------------------------+
5 rows in set

Flink SQL> describe input;
+--------------------+------------------------+------+-----+-----------------------------------+--------------------------------------------+
|               name |                   type | null | key |                            extras |                                  watermark |
+--------------------+------------------------+------+-----+-----------------------------------+--------------------------------------------+
|          sessionid |                 STRING | TRUE |     |                                   |                                            |
|              usern |                 STRING | TRUE |     |                                   |                                            |
| time_sql_timestamp | TIMESTAMP(3) *ROWTIME* | TRUE |     | METADATA FROM 'timestamp' VIRTUAL | `time_sql_timestamp` - INTERVAL '5' SECOND |
+--------------------+------------------------+------+-----+-----------------------------------+--------------------------------------------+
3 rows in set

Flink SQL> describe output;
+--------------------+--------------+------+-----+--------+-----------+
|               name |         type | null | key | extras | watermark |
+--------------------+--------------+------+-----+--------+-----------+
|          sessionid |       STRING | TRUE |     |        |           |
|              usern |       STRING | TRUE |     |        |           |
| time_sql_timestamp | TIMESTAMP(3) | TRUE |     |        |           |
+--------------------+--------------+------+-----+--------+-----------+
3 rows in set

What I want to do is this - select a row from input if its username exists in the latest context and insert it into the output topic. Something like -

insert into output (SELECT input.* from input, context where usern = context.username and context.`file.modification-time` = (select max(`file.modification-time`) from context));

or

insert into output (SELECT input.* from input, context where usern = context.username and context.update_time = (select max(context.update_time) from context));

The above queries do not give me an append-only result. They produce updating results, so I am not able to execute them successfully. I get this error -

[ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.output' doesn't support consuming update changes which is produced by node Join(joinType=[InnerJoin], where=[(usern = username)], select=[sessionid, usern, time_sql_timestamp, username], leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey])

I have also tried using a tumbling window on the input data, and a compacted Kafka topic with the upsert-kafka connector for the context table, like below, but no luck -

Flink SQL> INSERT INTO output WITH input_cte AS (SELECT time_sql_timestamp, window_start, window_end, window_time FROM
>  (SELECT * from TABLE (TUMBLE(TABLE input, DESCRIPTOR(time_sql_timestamp), INTERVAL '10' SECOND)),
>  context where (usern = context.username) ORDER BY time_sql_timestamp))
>  (SELECT i.* FROM input_cte AS w, input as i WHERE
>  i.time_sql_timestamp BETWEEN ASYMMETRIC w.window_start AND w.window_end and EXISTS
>  (SELECT * from context where context.username = i.usern));
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.output' doesn't support consuming update and delete changes which is produced by node Join(joinType=[InnerJoin], where=[((time_sql_timestamp >= window_start) AND (time_sql_timestamp <= window_end))], select=[sessionid, usern, time_sql_timestamp, window_start, window_end], leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey])
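
For completeness, the upsert-kafka variant of the context table was declared roughly like this (a sketch; the topic name and formats here are illustrative):

create table context_upsert
(
    username string,
    update_time TIMESTAMP(3),
    PRIMARY KEY (username) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'context',
    'properties.bootstrap.servers' = 'kafka:9092',
    'key.format' = 'json',
    'value.format' = 'json');

Joining input against this table fails with the same sink error, which makes sense to me: an upsert-kafka source produces an updating changelog, so the join result is updating as well.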

I also tried using views to get the latest version of the context data, but I get the same error - 'default_catalog.default_database.output' doesn't support consuming update changes.
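
The view was essentially the top-1 deduplication pattern, roughly like this (a sketch; latest_context is just my name for it):

create view latest_context as
select id, username, update_time
from (
    select *,
           row_number() over (partition by username order by update_time desc) as rn
    from context
) where rn = 1;

As far as I understand, a deduplication ordered by update_time descending emits an update whenever a newer row arrives for a username, so joining input against latest_context still produces an updating result.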

Is there a way to achieve what I intend to, keeping the output table append-only and not result-updating?
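
For example, would an event-time temporal join be the right tool here? A sketch of what I have in mind, assuming context could somehow be exposed as a versioned table keyed on username (latest_context below is hypothetical):

insert into output
select i.sessionid, i.usern, i.time_sql_timestamp
from input as i
join latest_context for system_time as of i.time_sql_timestamp as c
on i.usern = c.username;

I could not get anything like this to work end to end, hence the question.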

Neha