
Context

I have a Flink job written with the PyFlink SQL API. It consumes source data from Kinesis and produces results back to Kinesis. I want to write a local test to verify that the Flink application code is correct, so I mocked out both the source and sink Kinesis streams with the filesystem connector and ran the test pipeline locally. The local Flink job always runs successfully, but when I look at the sink file, it is always empty. The same thing happens when I run the code in the Flink SQL Client.

Here is my code:

CREATE TABLE incoming_data (
        requestId VARCHAR(4),
        groupId VARCHAR(32),
        userId VARCHAR(32),
        requestStartTime VARCHAR(32),
        processTime AS PROCTIME(),
        requestTime AS TO_TIMESTAMP(SUBSTR(REPLACE(requestStartTime, 'T', ' '), 0, 23), 'yyyy-MM-dd HH:mm:ss.SSS'),
        WATERMARK FOR requestTime AS requestTime - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'filesystem',
        'path' = '/path/to/test/json/file.json',
        'format' = 'json',
        'json.timestamp-format.standard' = 'ISO-8601'
    )
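
The source file contains one JSON object per line; a record shaped roughly like this (illustrative values, not my real data):

{"requestId": "ab12", "groupId": "group-1", "userId": "user-1", "requestStartTime": "2021-11-15T17:29:00.123"}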

CREATE TABLE user_latest_request (
        groupId VARCHAR(32),
        userId VARCHAR(32),
        latestRequestTime TIMESTAMP
    ) WITH (
        'connector' = 'filesystem',
        'path' = '/path/to/sink',
        'format' = 'csv'
    )

INSERT INTO user_latest_request
    SELECT groupId,
           userId,
           MAX(requestTime) as latestRequestTime
    FROM incoming_data
    GROUP BY TUMBLE(processTime, INTERVAL '1' SECOND), groupId, userId;

Curious what I am doing wrong here.

Note:

  • I am using Flink 1.11.0
  • If I directly dump data from the source to the sink, without windowing and grouping, it works fine. That means the source and sink tables are set up correctly, so the problem seems to be in the tumbling window and grouping when running against the local filesystem.
  • This code works fine with the Kinesis source and sink.
Alfred

2 Answers


Have you enabled checkpointing? This is required if you are running in `STREAMING` mode, which appears to be the case. See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/file_sink/
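
As a sketch (assuming the PyFlink Table API, since that is what you are using; the 500 ms interval is just an example value):

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Checkpoint every 500 ms. In STREAMING mode the filesystem sink only
# finalizes its in-progress part files when a checkpoint completes,
# so without checkpointing the output files are never committed.
env.enable_checkpointing(500)
t_env = StreamTableEnvironment.create(env)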

Martijn Visser
  • Hi @Martijn, yes, I enabled checkpointing with `env.enable_checkpointing(500)`, but it still does not work. Is there any other configuration I may have missed? Interestingly, it seems to fail only with the window function: if I simply read from the source (with a SELECT) and dump to the sink, it works fine. – Alfred Nov 15 '21 at 17:29

The most likely cause is that there isn't enough data in the file being read to keep the job running long enough for the window to close. You have a processing-time-based window that is 1 second long, which means that the job will have to run for at least one second to guarantee that the first window will produce results.

Otherwise, once the source runs out of data the job will shut down, regardless of whether the window contains unreported results.

If you switch to event-time-based windowing, then when the file source runs out of data it will send one last watermark with the value MAX_WATERMARK, which will trigger the window.
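
For example, rewriting the insert from the question to window on the event-time column (requestTime, which already has a watermark defined) would look like this:

INSERT INTO user_latest_request
    SELECT groupId,
           userId,
           MAX(requestTime) as latestRequestTime
    FROM incoming_data
    GROUP BY TUMBLE(requestTime, INTERVAL '1' SECOND), groupId, userId;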

David Anderson
  • Exactly. I changed to event-time-based windowing on the requestTime column and it worked as expected. So what is the suggested way to unit test a processing-time-based window? Is there any way to force the window to close when the data stream hits EOF? – Alfred Nov 15 '21 at 22:36
  • Or is there a way to keep the Flink cluster running so we can shut it down manually once all the tests finish? – Alfred Nov 16 '21 at 00:05