
I have two streaming tables from two Kafka topics, and I want to join these streams and perform an aggregate function on the joined data. The streams need to be joined using a sliding window. On joining and windowing the data, I am getting the error: `Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.`

Below is the code snippet:

    SELECT cep.payload['id'], ep.payload['id'],
        ep.event_flink_time,
        ep.rowtime,
        TIMESTAMPDIFF(SECOND, ep.event_flink_time, cep.event_flink_time) AS timediff,
        HOP_START(cep.event_flink_time, INTERVAL '5' MINUTES, INTERVAL '10' MINUTES) AS hop_start,
        HOP_END(cep.event_flink_time, INTERVAL '5' MINUTES, INTERVAL '10' MINUTES) AS hop_end
    FROM table1 cep
    JOIN table2 ep
        ON cep.payload['id'] = ep.payload['id']
    GROUP BY HOP(cep.event_flink_time, INTERVAL '5' MINUTES, INTERVAL '10' MINUTES),
        cep.payload, ep.payload, cep.event_flink_time, ep.event_flink_time, ep.rowtime

I am using an AWS Zeppelin notebook with the Flink SQL Table API. For streaming data, how can I join the data using the sliding window function? Or should I use a different type of join for streaming data along with window functions? Here is a ticket for the same error: https://issues.apache.org/jira/browse/FLINK-10211

data_adi

2 Answers


From your SQL, I suggest you split this task into two parts: first join the two streaming data sources, then create a view over the join result and run the GROUP BY on that view; a sketch follows below. In addition, confirm that the types of the event-time attributes are correct.
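A rough sketch of what that split might look like, reusing the question's table and column names and applying the cast-to-TIMESTAMP workaround from the error message. Note the view name `joined` and the `AVG` aggregation are only illustrative, and after the cast the column is no longer a time attribute, so an event-time HOP window can't be applied to the view:

    -- Step 1: regular join, with the time attributes cast to plain
    -- TIMESTAMP(3) so they are no longer rowtime attributes.
    CREATE VIEW joined AS
    SELECT
        cep.payload['id'] AS cep_id,
        ep.payload['id']  AS ep_id,
        CAST(cep.event_flink_time AS TIMESTAMP(3)) AS cep_ts,
        CAST(ep.event_flink_time  AS TIMESTAMP(3)) AS ep_ts
    FROM table1 cep
    JOIN table2 ep
        ON cep.payload['id'] = ep.payload['id'];

    -- Step 2: aggregate over the view (non-windowed here, since cep_ts
    -- is no longer a time attribute).
    SELECT
        cep_id,
        AVG(TIMESTAMPDIFF(SECOND, ep_ts, cep_ts)) AS avg_timediff
    FROM joined
    GROUP BY cep_id;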

张奇文

Streaming SQL relies on time attributes to expire state that is no longer needed to produce results. This is meaningful in the context of specific temporal queries where the timestamps on both the input and output records are advancing -- which happens with queries like windows and interval joins.
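For reference, a time attribute is typically declared in the table DDL together with a watermark. A minimal sketch, assuming hypothetical column names and Kafka connector options roughly matching the question's setup:

    -- Illustrative DDL only: the WATERMARK clause is what turns
    -- event_flink_time into a rowtime attribute.
    CREATE TABLE table1 (
        payload MAP<STRING, STRING>,
        event_flink_time TIMESTAMP(3),
        WATERMARK FOR event_flink_time AS event_flink_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'topic1',                              -- placeholder topic name
        'properties.bootstrap.servers' = 'broker:9092',  -- placeholder brokers
        'format' = 'json'
    );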

A regular join (a join without any temporal constraints) doesn't work this way. Any previously ingested record might be updated at any point in time, and the corresponding output record(s) would then need to be updated. This means that both input streams must be fully materialized in Flink state, and the output stream has no temporal ordering that downstream operations can leverage to do state retention optimization.

Given how this all works, Flink's stream SQL planner can't handle having a window after a regular join -- the regular join can't produce time attributes, and the HOP insists on having them.

One potential solution would be to reformulate the join as an interval join, if that will meet your needs.
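A minimal sketch of what that could look like with the question's tables, assuming a 10-minute bound is acceptable (the bound and the column aliases are just examples):

    -- Interval join: the ON clause constrains how far apart the two
    -- rowtime attributes may be, so Flink can expire state and the
    -- time attributes are preserved for downstream windowing.
    SELECT
        cep.payload['id'] AS id,
        TIMESTAMPDIFF(SECOND, ep.event_flink_time, cep.event_flink_time) AS timediff
    FROM table1 cep
    JOIN table2 ep
        ON cep.payload['id'] = ep.payload['id']
        AND cep.event_flink_time BETWEEN ep.event_flink_time - INTERVAL '10' MINUTE
                                     AND ep.event_flink_time + INTERVAL '10' MINUTE;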

David Anderson
  • 1. I tried to use an interval join along with a sliding window, but Flink considers this a regular join and gives the same original `rowtime attribute` error. 2. Can we even use a sliding window with an interval join, given that an interval join is itself a time-windowed join on unbounded streams? Or will the interval join create the window for us, so we can avoid using a sliding window? TIA – data_adi May 25 '22 at 20:25
  • 3. The datatype of rowtime is TIMESTAMP(3), and event_flink_time is converted to a timestamp, for example: `2022-04-18 22:54:00.196`. 4. Also, I cannot use event_flink_time directly in the SELECT statement, as Flink SQL considers it a rowtime attribute, so I have to convert it into `yyyy MM dd HH:mm:ss` format. – data_adi May 25 '22 at 20:25
  • I believe this should be possible, but it has to be done carefully, while respecting the rules governing time attributes and joins. If you share more details of what you've tried perhaps we can spot why it isn't working. Also, I'm confused by the original query you shared -- why isn't there any aggregation being done? What's the reason for the window, if not to do aggregation? – David Anderson May 26 '22 at 13:01
  • I also tried other approaches, such as creating a sliding window on one stream table and then joining it with the other stream table using an inner join. The window and join are required to calculate the time difference between the two streams; one is the source stream and the other is the target stream. Also, I have updated the SQL with the aggregation. – data_adi May 26 '22 at 18:18