
I am trying to perform a stream-stream join using a Flink v1.11 app on KDA (Kinesis Data Analytics). The join works with ProcessingTime, but with EventTime I don't see any output records from Flink.

Here is my code with EventTime processing, which is not working:

public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    DataStream<Trade> input1 = createSourceFromInputStreamName1(env)
            .assignTimestampsAndWatermarks(
                    WatermarkStrategy.<Trade>forMonotonousTimestamps()
                            .withTimestampAssigner((event, l) -> event.getEventTime())
            );
    DataStream<Company> input2 = createSourceFromInputStreamName2(env)
            .assignTimestampsAndWatermarks(
                    WatermarkStrategy.<Company>forMonotonousTimestamps()
                            .withTimestampAssigner((event, l) -> event.getEventTime())
            );
    DataStream<String> joinedStream = input1.join(input2)
            .where(new TradeKeySelector())
            .equalTo(new CompanyKeySelector())
            .window(TumblingEventTimeWindows.of(Time.seconds(30)))
            .apply(new JoinFunction<Trade, Company, String>() {
                @Override
                public String join(Trade t, Company c) {
                    return t.getEventTime() + ", " + t.getTicker() + ", " + c.getName() + ", " + t.getPrice();
                }
            });
    joinedStream.addSink(createS3SinkFromStaticConfig());
    env.execute("Flink S3 Streaming Sink Job");
}

I got a similar join working with ProcessingTime:

public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
    DataStream<Trade> input1 = createSourceFromInputStreamName1(env);
    DataStream<Company> input2 = createSourceFromInputStreamName2(env);
    DataStream<String> joinedStream = input1.join(input2)
            .where(new TradeKeySelector())
            .equalTo(new CompanyKeySelector())
            .window(TumblingProcessingTimeWindows.of(Time.milliseconds(10000)))
            .apply(new JoinFunction<Trade, Company, String>() {
                @Override
                public String join(Trade t, Company c) {
                    return t.getEventTime() + ", " + t.getTicker() + ", " + c.getName() + ", " + t.getPrice();
                }
            });
    joinedStream.addSink(createS3SinkFromStaticConfig());
    env.execute("Flink S3 Streaming Sink Job");
}

Sample records from the two streams I am trying to join:

{'eventTime': 1611773705, 'ticker': 'TBV', 'price': 71.5}
{'eventTime': 1611773705, 'ticker': 'TBV', 'name': 'The Bavaria'}
Sairam Sankaran

1 Answer


I don't see anything obviously wrong, but any of the following could cause this job to not produce any output:

  • A problem with watermarking (see the watermark sketch after this list). For example, if one of the streams becomes idle, the watermarks will cease to advance. Or if there are no events after a window, the watermark will not advance far enough to close that window. Or if the timestamps aren't actually in ascending order (with the forMonotonousTimestamps strategy, the events must be in order by timestamp), the pipeline could be silently dropping all of the out-of-order events.
  • The StreamingFileSink only finalizes its output during checkpointing, and does not finalize pending files if and when the job is stopped. If checkpointing isn't enabled, nothing is ever committed (see the checkpointing sketch after this list).
  • A windowed join behaves like an inner join and requires at least one event from each input stream in order to produce any results for a given window interval. From the sample records you shared, this does not appear to be the issue.
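
For the watermarking point, here is a minimal sketch of a more defensive strategy, assuming the same Trade type and source helper as in your job: forBoundedOutOfOrderness tolerates some disorder instead of silently dropping out-of-order events, and withIdleness prevents an idle source from stalling the watermark. Also note that Flink expects timestamps in epoch milliseconds; if getEventTime() returns epoch seconds (as the sample records suggest), it would need to be multiplied by 1000.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

DataStream<Trade> input1 = createSourceFromInputStreamName1(env)
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Trade>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        // Flink expects millisecond timestamps; multiply by 1000
                        // here if getEventTime() is in epoch seconds.
                        .withTimestampAssigner((event, previous) -> event.getEventTime())
                        // Mark this source idle after 1 minute without events so
                        // it cannot hold back the downstream watermark.
                        .withIdleness(Duration.ofMinutes(1)));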
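
And for the sink point, a sketch of enabling checkpointing so the StreamingFileSink can commit its pending files. (On KDA, checkpointing is normally configured through the service rather than in code; on a self-managed cluster it would look like this.)

// Checkpoint every 60 seconds so the StreamingFileSink rolls pending
// files into finished, visible output.
env.enableCheckpointing(60_000);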

Update:

Given that what you appear to want is to join each Trade with the latest Company record available at the time of the trade, a lookup join or a temporal table join might be a good approach (a sketch follows the documentation links below).

Here are a couple of examples:

https://github.com/ververica/flink-sql-cookbook/blob/master/joins/04/04_lookup_joins.md

https://github.com/ververica/flink-sql-cookbook/blob/master/joins/03/03_kafka_join.md

Some documentation:

https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/joins.html#event-time-temporal-join

https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/versioned_tables.html
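
To make the idea concrete, here is a sketch of an event-time temporal join expressed through the Table API. This assumes Flink 1.12+ (where FOR SYSTEM_TIME AS OF is supported for event time), that Trades and Companies have been registered as tables with event-time attributes, and that Companies is declared as a versioned table keyed on ticker; all of these names are illustrative, not taken from your job.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// Each trade is joined with the version of the company row that was
// current as of the trade's event time.
tEnv.executeSql(
        "SELECT t.eventTime, t.ticker, c.name, t.price " +
        "FROM Trades AS t " +
        "LEFT JOIN Companies FOR SYSTEM_TIME AS OF t.eventTime AS c " +
        "ON t.ticker = c.ticker");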

David Anderson
  • I don't fully understand what the second issue you mentioned could be. Since the ProcessingTime join works, I am hoping to eliminate the third issue as well. I think this might be a watermark issue, which I am still getting my feet wet with. As soon as I start the Flink app, the .currentInputWatermark is -9223372036854776000 (about Long.MIN_VALUE). As I start ingesting events into the stream, it moves to 1611929908 (about 3 hours ago), while my records have a timestamp of 1611939889 (about now). I am using the forBoundedOutOfOrderness(Duration.ofSeconds(10)) strategy. – Sairam Sankaran Jan 29 '21 at 17:10
  • Yes, I agree that this is most likely a watermarking problem. Since the code you shared shows `forMonotonousTimestamps` rather than `forBoundedOutOfOrderness`, I'm not sure I have an accurate idea of what your job is doing with watermarks. – David Anderson Jan 29 '21 at 17:21
  • I changed back to the exact code in the description, and now I see the output from the join!!! I see the .currentInputWatermark close to the event time as well. I am confident that the same configuration did not work before. I no longer know what I am doing :) – Sairam Sankaran Jan 29 '21 at 17:40
  • There's a tutorial on watermarks at https://ci.apache.org/projects/flink/flink-docs-stable/learn-flink/streaming_analytics.html#event-time-and-watermarks that you might find helpful. – David Anderson Jan 29 '21 at 20:30
  • Thank you so much. I have a use case where the Company stream in the above example is a low-volume stream. Let's assume the company name changes every hour. And the Trade stream is a high-volume stream with a trade every second. I want to join all the trades with the company. Since the window size is common to both streams, I don't want to hold the Trade stream's state for an hour (for example, in a tumbling window) before performing the join. – Sairam Sankaran Jan 29 '21 at 21:32
  • Would an interval join be a good candidate here, where for each trade I can look back over the past hour for the company event? In case there are two company events within the window, is there a way to pick just the latest event for the join? I am wondering if versioned tables would solve this. – Sairam Sankaran Jan 29 '21 at 21:32
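
For reference, the interval join asked about in the last comment might look roughly like this sketch (reusing the Trade/Company types and key selectors from the question). Note that it emits one result per matching company event within the interval; picking only the latest match would need additional logic, which is where a temporal table join or a versioned table is a better fit.

import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

DataStream<String> joined = input1
        .keyBy(new TradeKeySelector())
        .intervalJoin(input2.keyBy(new CompanyKeySelector()))
        // For each trade, consider company events from the preceding hour
        // up to and including the trade's own timestamp.
        .between(Time.hours(-1), Time.seconds(0))
        .process(new ProcessJoinFunction<Trade, Company, String>() {
            @Override
            public void processElement(Trade t, Company c, Context ctx, Collector<String> out) {
                out.collect(t.getEventTime() + ", " + t.getTicker() + ", " + c.getName() + ", " + t.getPrice());
            }
        });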