
I tried the SQL query below, using a window TVF.

Steps followed:

  1. Populated 1 million records into a Kafka topic within 2 minutes.
  2. Read the data from Kafka as a table source with a watermark strategy and ran the query below, using the window TVF with a 1-minute tumbling window.
tableEnv.executeSql("CREATE TABLE cdrTable (\r\n"
         + "    orgid STRING\r\n"
         + "    ,clusterid STRING\r\n"
         ...
         + "    ,rowtime TIMESTAMP(3) METADATA FROM 'timestamp'\r\n"
         + "    ,proctime AS PROCTIME()\r\n"
         + "    ,WATERMARK FOR rowtime AS rowtime - INTERVAL '1' SECOND\r\n"
         + "    )\r\n"
         + "    WITH (\r\n"
         + "    'connector' = 'kafka'\r\n"
         + "    ,'topic' = 'cdr-direct'\r\n"
         + "    ,'properties.bootstrap.servers' = 'localhost:9092'\r\n"
         + "    ,'scan.startup.mode' = 'latest-offset'\r\n"
         + "    ,'format' = 'json'\r\n"
         + "    )");
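For reference, the watermark declared above (`rowtime - INTERVAL '1' SECOND`) is a bounded-out-of-orderness strategy: the watermark trails the largest rowtime seen so far by one second and never moves backwards. A minimal sketch of that bookkeeping in plain Java (a toy model, not Flink code; the class name is made up for illustration):

```java
// Toy model of the bounded-out-of-orderness watermark declared by
// WATERMARK FOR rowtime AS rowtime - INTERVAL '1' SECOND.
public class WatermarkSketch {
    private final long delayMs;
    private long maxTimestamp = Long.MIN_VALUE;

    public WatermarkSketch(long delayMs) { this.delayMs = delayMs; }

    // Called once per record; the watermark trails the largest
    // rowtime seen by the declared delay and never regresses.
    public long onEvent(long rowtimeMs) {
        maxTimestamp = Math.max(maxTimestamp, rowtimeMs);
        return maxTimestamp - delayMs;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(1_000L);
        System.out.println(wm.onEvent(10_000L)); // 9000
        System.out.println(wm.onEvent(12_000L)); // 11000
        System.out.println(wm.onEvent(11_000L)); // still 11000: out-of-order record, watermark holds
    }
}
```

The key property for what follows: the watermark only advances when new records arrive, so it stalls as soon as the source stops producing.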



String sql = "SELECT orgid, clusterid, ...
    FROM (SELECT * FROM TABLE(TUMBLE(TABLE cdrTable, DESCRIPTOR(rowtime), INTERVAL '1' MINUTES)))
    GROUP BY orgid, clusterid, ..., window_start, window_end";


Table order20 = tableEnv.sqlQuery(sql);
order20.executeInsert("outputCdrTable");

The problem is with the output/sink counts produced by the query above. Ideally the total should be 1 million, but each run produces less output, and the shortfall varies from run to run: roughly a 10 to 20 percent difference is observed.
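For context, the shortfall pattern is consistent with how event-time tumbling windows fire: a window only emits once the watermark passes its end, and if the source goes idle after the last record, the watermark stalls just short of the final window's end, so that window's rows never reach the sink. A hand-rolled sketch of that mechanic (a toy model in plain Java, not Flink code; the class name is made up for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy event-time tumbling-window counter. A window [start, start + size)
// fires only once the watermark reaches its end; if the stream then goes
// idle, the final window stays buffered and its rows never reach the sink.
public class TumbleSketch {
    private final long windowMs;
    private final Map<Long, Long> pending = new TreeMap<>(); // window_start -> row count
    private long emitted = 0;

    public TumbleSketch(long windowMs) { this.windowMs = windowMs; }

    public void onEvent(long rowtimeMs) {
        long windowStart = rowtimeMs - (rowtimeMs % windowMs);
        pending.merge(windowStart, 1L, Long::sum);
    }

    public void onWatermark(long watermarkMs) {
        // fire every window whose end is at or below the watermark
        pending.entrySet().removeIf(e -> {
            boolean fire = e.getKey() + windowMs <= watermarkMs;
            if (fire) emitted += e.getValue();
            return fire;
        });
    }

    public long emitted() { return emitted; }
    public long buffered() { return pending.values().stream().mapToLong(Long::longValue).sum(); }

    public static void main(String[] args) {
        TumbleSketch t = new TumbleSketch(60_000L); // 1-minute tumble, as in the query
        for (long ts = 0; ts < 90_000L; ts += 1_000L) {
            t.onEvent(ts);
            t.onWatermark(ts - 1_000L); // watermark trails rowtime by 1 second
        }
        // 90 rows went in, but only the first window [0, 60000) has fired.
        // The last 30 rows sit in the unfired second window until a watermark
        // past its end arrives -- which never happens once the source is idle.
        System.out.println(t.emitted());  // 60
        System.out.println(t.buffered()); // 30
    }
}
```

In the real pipeline the same thing happens per Kafka partition, which is why the amount of missing output varies from run to run.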

Please help!

  • What are you doing to arrange for the 1 million records in that topic to all have timestamps that fall within the same minute? – David Anderson Jan 31 '22 at 16:01
  • I am generating simulated data with current_timestamp – ronak beejawat Feb 01 '22 at 07:50
  • Please share enough that someone could reproduce the problem. Right now it's hard to guess why the results aren't as expected, but it might have to do with how the windows are aligned. – David Anderson Feb 01 '22 at 10:16
  • It looks like the issue has something to do with an idle source holding back the last watermark, which is causing the randomly lower counts. I tried setting 'table.exec.source.idle-timeout', but that did not help either. So it looks like the last watermark is not processed until the next set of records is pushed. – ronak beejawat Feb 02 '22 at 10:06
  • Hi David, I pushed 1000 records to topic1 and 1801 to topic2, reading both topics as two different table sources with 5-second watermarking (rowtime TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL) and doing a window join with a 1-minute tumble window over rowtime, and I also tried setting 'table.exec.source.idle-timeout' = 1000ms. I am not pushing data continuously; I want the output to be processed via the sink, but that is not happening because of idleness at the source. When I push 1 record after some time, it does process the data via the sink. Can you please suggest what I am still doing wrong? – ronak beejawat Feb 03 '22 at 16:09
  • Both topic1 and topic2 have 16 partitions, and I am working with Flink release 1.14. – ronak beejawat Feb 04 '22 at 09:16
  • See the last two paragraphs of this answer -- https://stackoverflow.com/a/63299797/2000823 -- for an explanation of what I believe is causing this. – David Anderson Feb 04 '22 at 10:41
  • Is there any solution for this with the SQL API? The reference above is related to the DataStream API, I think, and I did not fully understand it. I don't want to have to push a record when the source is idle; the output should be produced within the same window. One more question: when I set table.exec.source.idle-timeout to 2 minutes, I saw a higher watermark generated, but it still did not emit the last window's data. – ronak beejawat Feb 04 '22 at 13:59
  • This discussion has strayed far from the original question. If you want to follow up further, please either edit the question or ask a new one. – David Anderson Feb 04 '22 at 15:25
  • I think the question is simple and straightforward for most cases: if we have an idle source and the last watermark is not processed, what should we do about it in the Flink SQL API, with or without a join? Even with advanced watermarking in place, this internally causes incorrect output for a window using a window TVF. David, can you please help me with this? I would really appreciate your help. – ronak beejawat Feb 04 '22 at 19:04
  • Ok, I've answered this here: https://stackoverflow.com/questions/71006465/flink-sql-windows-not-reporting-final-results/ – David Anderson Feb 06 '22 at 11:01

0 Answers