1

I want to join two tables (left and right) which are generated by below queries.

CREATE TABLE
  left_table (
    `experiment_id` BIGINT,
    `f_sequence` BIGINT,
    `line_string` STRING,
    `log_time` TIMESTAMP(3),
    WATERMARK FOR log_time AS log_time - INTERVAL '30' SECONDS,
    PRIMARY KEY (f_sequence) NOT ENFORCED
  )
  WITH (
    'connector' = 'filesystem',
    'path' = '<left_table_csv_path>',
    'format' = 'csv',
    'csv.ignore-parse-errors' = 'true'
    );

where left_table_csv is as below

1,0,60ee1985-a8a5-44ce-b2c3-416477c92150,2022-08-11 14:55:59.350333
1,1,a69bf93e-a097-42f0-93db-02772eb71181,2022-08-11 14:55:59.359333
1,2,3dbaf345-956e-49f0-af20-41eda50c97fa,2022-08-11 14:55:59.364333

and

CREATE TABLE
  right_table (
    `experiment_id` BIGINT,
    `model_id` STRING,
    `model_time` TIMESTAMP(3),
    WATERMARK FOR model_time AS model_time - INTERVAL '30' SECONDS,
    PRIMARY KEY (model_id) NOT ENFORCED
  )
  WITH (
    'connector' = 'filesystem',
    'path' = '<right_table_csv_path>',
    'format' = 'csv',
    'csv.ignore-parse-errors' = 'true'
    );

where right_table_csv is as below

1,1b7af6e8-149a-4602-abf3-e3fc243d3840,2022-08-11 14:55:59.349333
1,17ca2ce2-f0e6-4a8b-92e8-9cd5a391c104,2022-08-11 14:55:59.358333
1,8f74ef65-f619-4c37-a2f4-04c31a9c6469,2022-08-11 14:55:59.365333

When I run the below query

(SELECT l.experiment_id, 
    l.f_sequence,
    l.log_time, (
    SELECT r.model_time
    FROM right_table as r
    WHERE l.log_time>=r.model_time
    ORDER BY r.model_time DESC
    LIMIT 1
) AS model_time FROM left_table as l);

I expect to see result below (result of corresponding SQL query)

experiment_id   f_sequence  log_time                    model_time
1               1           2022-08-11T14:55:59.35Z     2022-08-11T14:55:59.349Z
1               2           2022-08-11T14:55:59.359Z    2022-08-11T14:55:59.349Z
1               3           2022-08-11T14:55:59.364Z    2022-08-11T14:55:59.349Z

However I get the result below

experiment_id   f_sequence   log_time                 model_time
1               2            2022-08-11 14:55:59.364  <NULL>
1               0            2022-08-11 14:55:59.350  <NULL>
1               1            2022-08-11 14:55:59.359  2022-08-11 14:55:59.358

I couldn't figure out what I am missing. I think this behavior of Flink is wrong and contradicts with sql. Is it the expected behavior or am I doing something wrong. Kind regards.

akurmustafa
  • 122
  • 10
  • Can you try executing this in batch mode and verify if it works correctly then? – David Anderson Aug 15 '22 at 15:03
  • @DavidAnderson it doesn't work correctly either in batch mode. It produces streaming case result except that all 'model_time' columns are NULL (I didn't paste the result here since formatting was really bad) – akurmustafa Aug 16 '22 at 08:50

0 Answers0