
I want to join three or more data streams or tables on a given key and a common window, but I don't know how to write the code correctly. The official documentation https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/operators/ gives the example below, but it only joins two data streams. How do I join three or more data streams on a given key and a common window?

dataStream.join(otherStream)
.where(<key selector>).equalTo(<key selector>)
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.apply (new JoinFunction () {...});

My idea was to first join two of the data streams with the common window, and then join the resulting stream with the third data stream using the same window. However, it seems the event-time semantics of these three data streams would change when the TimeCharacteristic is set to event time.

==================

I have the same question for the Flink Table API and SQL: how do I join three or more tables on a given key and a common window? The official documentation https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/table/sql.html only gives the example below, which uses a single table.

Table result1 = tableEnv.sqlQuery(
"SELECT user, " +
"  TUMBLE_START(rowtime, INTERVAL '1' DAY) as wStart,  " +
"  SUM(amount) FROM Orders " +
"GROUP BY TUMBLE(rowtime, INTERVAL '1' DAY), user");

I tried to write the SQL below to join three tables on a given key and a common window, but I don't think it is right.

String SQL = "SELECT grades.user1, SUM(salaries.amount) FROM grades " +
             "INNER JOIN salaries ON grades.user1 = salaries.user1 " +
             "INNER JOIN person ON grades.user1 = person.user1 " +
             "GROUP BY grades.user1, TUMBLE(grades.proctime, INTERVAL '5' SECOND)";

So what is the correct way to join three or more data streams or tables on a given key and a common window, using either the DataStream API or the Flink Table API/SQL?

Update (6/16/2018) to make the question clearer.

For Flink SQL, what I need is shown in the pseudocode below: a join of three tables over a common TumblingEventTimeWindow. In other words, I want the Flink SQL equivalent of the DataStream API version, joining all events from the three tables that fall into the same tumbling event-time window.

SELECT A.a, B.b, C.c
FROM A, B, C
WHERE A.x = B.x AND A.x = C.x AND
window(TumblingEventTimeWindows.of(Time.seconds(3)))

This kind of join also seems to be mentioned in the following Flink design document: "Event-time tumbling-windowed Stream-Stream joins: Joins tuples of two streams that are in the same tumbling event-time window". I don't know whether Flink SQL has implemented this type of join.

https://docs.google.com/document/d/1TLayJNOTBle_-m1rQfgA6Ouj1oYsfqRjPcp1h2TVqdI/edit#

YuFeng Shen

1 Answer


It is hard to give a definite answer to your question because the semantics of the join that you need are not clear. The semantics of the windowed join in the DataStream API are different from those of the windowed join in the Table API / SQL.

On the DataStream API, you can simply define another join as follows:

firstStream
  .join(secondStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply (new JoinFunction () {...})
  .join(thirdStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply (new JoinFunction () {...})
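On the concern about event-time semantics in the chained join: a minimal sketch of the window arithmetic, in plain Java without any Flink dependency. It assumes that the first windowed join emits its results with the largest timestamp that still lies inside the window (window end minus 1), which you should verify against your Flink version. The class and method names (`TumblingWindowMath`, `windowStart`, `maxTimestamp`) are made up for illustration. Under that assumption, a joined record falls into the same tumbling window again as long as the second join uses the same window size.

```java
// Illustrative only: mirrors the arithmetic of tumbling event-time windows
// with offset 0 and non-negative timestamps. This is not Flink API code.
final class TumblingWindowMath {

    /** Start of the tumbling window of length sizeMs that contains timestampMs. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    /** Largest timestamp that still lies inside that window (window end - 1). */
    static long maxTimestamp(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs - 1;
    }

    public static void main(String[] args) {
        long size = 3000L;     // 3-second windows, as in the example above
        long first = 3200L;    // event from the first stream
        long second = 5900L;   // event from the second stream

        // Both events are in the window [3000, 6000), so the first join matches them.
        System.out.println(windowStart(first, size) == windowStart(second, size)); // true

        // Assumed timestamp of the joined record: 5999, still inside [3000, 6000).
        long joined = maxTimestamp(first, size);
        System.out.println(windowStart(joined, size)); // 3000
    }
}
```

If the second join used a different window size or offset, the joined record could land in a window that does not line up with the first one, so keeping both window definitions identical is the safer choice.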

Since Flink implements standard SQL, you can define a join of three tables as usual:

SELECT A.a, B.b, C.c
  FROM A, B, C
  WHERE A.x = B.x AND A.x = C.x AND
        A.ts BETWEEN B.ts - INTERVAL '10' MINUTE AND B.ts + INTERVAL '10' MINUTE AND
        A.ts BETWEEN C.ts - INTERVAL '10' MINUTE AND C.ts + INTERVAL '10' MINUTE

The window ranges (A.ts BETWEEN B.ts - X AND B.ts + Y) can be defined as necessary.
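To illustrate how these range predicates differ from a tumbling-window join, here is a small plain-Java sketch (no Flink dependency). The names `JoinSemantics`, `withinInterval` and `sameTumblingWindow` are hypothetical, and timestamps are assumed to be non-negative epoch milliseconds. The range predicate compares the two timestamps relative to each other, while a tumbling window compares fixed, aligned buckets.

```java
// Illustrative only: contrasts the two join conditions on raw timestamps.
final class JoinSemantics {

    /** a.ts BETWEEN b.ts - lowerMs AND b.ts + upperMs (bounds inclusive, as in SQL BETWEEN). */
    static boolean withinInterval(long aTs, long bTs, long lowerMs, long upperMs) {
        return aTs >= bTs - lowerMs && aTs <= bTs + upperMs;
    }

    /** True if both timestamps fall into the same tumbling window of length sizeMs (offset 0). */
    static boolean sameTumblingWindow(long aTs, long bTs, long sizeMs) {
        return aTs / sizeMs == bTs / sizeMs;
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        long a = 2900L;
        long b = 3100L; // only 200 ms after a, but across a 3-second window boundary

        System.out.println(withinInterval(a, b, tenMinutes, tenMinutes)); // true
        System.out.println(sameTumblingWindow(a, b, 3000L));              // false
    }
}
```

Two events that are close in time can therefore match under the range predicate and still miss each other in a tumbling-window join, which is why the two APIs give different results.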

Fabian Hueske
  • Thank you, Fabian. The sample code for the DataStream API is what I needed. For Flink SQL, however, what I need is the equivalent of the DataStream API version you posted, that is, a join of three tables with TumblingEventTimeWindows. Please see the updated question. – YuFeng Shen Jun 16 '18 at 10:13
  • To be clearer, what I need is "Event-time tumbling-windowed Stream-Stream joins: Joins tuples of two streams that are in the same tumbling event-time window" using Flink SQL, as mentioned in the following Flink design document: https://docs.google.com/document/d/1TLayJNOTBle_-m1rQfgA6Ouj1oYsfqRjPcp1h2TVqdI/ – YuFeng Shen Jun 17 '18 at 13:08
  • You can implement such a join in SQL using a UDF that computes a window ID based on the timestamp, and then join the streams on that window ID. However, the join would not be done in a streaming fashion and would materialize all input tables. Flink SQL is not able to do such joins in a streaming fashion yet. – Fabian Hueske Jun 17 '18 at 19:51
  • Sorry, I'm confused. Do you mean that even if I implement it "using a UDF that computes a window ID based on the timestamp and join the streams on that window ID", such a UDF still cannot be used to join the two streams in a tumbling-window style? – YuFeng Shen Jun 26 '18 at 15:36
  • Such a UDF can be used to join the tables, but both tables would be completely materialized in state. You would need to define a state retention time to clean up state that is no longer used. I'd recommend continuing this discussion on the Flink user mailing list. SO is not the right place for this. – Fabian Hueske Jun 27 '18 at 07:40
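The window-ID idea from the comments can be sketched as follows. This is plain Java with no Flink dependency, and every name in it (`WindowIdFunction`, `eval` with these parameters) is a hypothetical illustration. In a real job the class would presumably extend `org.apache.flink.table.functions.ScalarFunction` and be registered with the table environment; the sketch also assumes the event timestamp is available as epoch milliseconds.

```java
// Illustrative only: the core logic a window-ID UDF would need.
// In Flink this class would extend ScalarFunction (assumption, not compiled here).
final class WindowIdFunction {

    /**
     * Maps a timestamp to the index of the tumbling window of length sizeMs
     * that contains it. Timestamps in the same window get the same ID.
     */
    public long eval(long timestampMs, long sizeMs) {
        if (sizeMs <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        // floorDiv keeps negative timestamps in the correct window as well.
        return Math.floorDiv(timestampMs, sizeMs);
    }

    public static void main(String[] args) {
        WindowIdFunction windowId = new WindowIdFunction();
        System.out.println(windowId.eval(3200L, 3000L)); // 1
        System.out.println(windowId.eval(5999L, 3000L)); // 1
        System.out.println(windowId.eval(6000L, 3000L)); // 2
    }
}
```

Registered under a hypothetical name such as `WINDOW_ID`, the join condition would then add equality predicates on `WINDOW_ID(A.ts, 3000)`, `WINDOW_ID(B.ts, 3000)` and `WINDOW_ID(C.ts, 3000)` to the key equality. As noted in the comments, this is a regular join, so the inputs are fully materialized in state and a state retention time is needed to clean them up.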