2

We are using KSQL to perform some aggregations / filtering on a real time data. One of the use case we have is, we need to perform some operation on last N days of a particular activity, this would be continuous operation. So this needs to be hopping window.

When I tried the query, with hopping duration as M days, KSQL query returned M records instead of 1 (which was hoping for).

Query :

select PROTO, 
       TIMESTAMPTOSTRING(WindowStart(), 'yyyy-MM-dd''T''HH:mm:ss''Z''', 'UTC') as "timestamp", 
       TIMESTAMPTOSTRING(WindowEnd(), 'yyyy-MM-dd''T''HH:mm:ss''Z''', 'UTC'), 
       COUNT(PROTO) AS Count 
  FROM DATASTREAM  
         WINDOW HOPPING (SIZE 5 DAYS, ADVANCE BY 1 DAY) 
 WHERE MSG like '%SOMESTRING%' 
   AND SPLIT(PROTO, '/')[0] = 'tcp' 
 GROUP BY PROTO; 

tcp/22 | 2020-01-27T00:00:00Z | 2020-02-01T00:00:00Z | 1
tcp/22 | 2020-01-28T00:00:00Z | 2020-02-02T00:00:00Z | 1
tcp/22 | 2020-01-29T00:00:00Z | 2020-02-03T00:00:00Z | 1
tcp/22 | 2020-01-30T00:00:00Z | 2020-02-04T00:00:00Z | 1
tcp/22 | 2020-01-31T00:00:00Z | 2020-02-05T00:00:00Z | 1

Is there any way to get only first record only, or the records for which end time <= current time or any other workaround to get 1 result per window ?


Please consider below data records.


{ "time": "2020-01-25 23:36:37 UTC", "msg": "Error"} 
{ "time": "2020-01-25 23:36:38 UTC", "msg": "Error"} 
{ "time": "2020-01-25 23:36:40 UTC", "msg": "Error"} 
{ "time": "2020-01-26 23:36:37 UTC", "msg": "Error"} 
{ "time": "2020-01-26 23:36:38 UTC", "msg": "Error"} 
{ "time": "2020-01-26 23:36:39 UTC", "msg": "Error"} 
{ "time": "2020-01-26 23:36:40 UTC", "msg": "Error"} 
{ "time": "2020-01-27 23:36:37 UTC", "msg": "Error"} 
{ "time": "2020-01-27 23:36:38 UTC", "msg": "Error"} 
{ "time": "2020-01-27 23:36:39 UTC", "msg": "Error"} 
{ "time": "2020-01-28 23:36:37 UTC", "msg": "Error"} 
{ "time": "2020-01-28 23:36:38 UTC", "msg": "Error"} 
{ "time": "2020-01-29 23:36:37 UTC", "msg": "Error"} 
{ "time": "2020-01-29 23:36:38 UTC", "msg": "Error"} 
{ "time": "2020-01-29 23:36:39 UTC", "msg": "Error"} 
{ "time": "2020-01-29 23:36:40 UTC", "msg": "Error"} 

I am looking for count of records which have, msg as Error past 2 days.

If I fire the KSQL query on 25th at 23:36:37, I would be expecting result as :

2020-01-25T23:36:37Z | 1
2020-01-25T23:36:38Z | 2
2020-01-25T23:36:40Z | 3

2020-01-26T23:36:37Z | 4
2020-01-26T23:36:38Z | 5
2020-01-26T23:36:39Z | 6
2020-01-26T23:36:40Z | 7

2020-01-27T23:36:37Z | 5
2020-01-27T23:36:38Z | 6
2020-01-27T23:36:39Z | 7

2020-01-28T23:36:37Z | 4
2020-01-28T23:36:38Z | 5

2020-01-29T23:36:37Z | 3
2020-01-29T23:36:38Z | 4
2020-01-29T23:36:39Z | 5
2020-01-29T23:36:40Z | 6

1 Answers1

0

I think you need a TUMBLING window instead if you want one window for the five day period. You're instead getting five windows because you've used HOPPING with an advance of 1 DAY - see the WINDOWSTART() changes per day.

References:

Robin Moffatt
  • 30,382
  • 3
  • 65
  • 92
  • Ok, understood why it is behaving that way. But I can't use Tumbling window either, the windows would be non overlapping in that case, right ? The use case was to get Last N days of data continuously, Agg. data (one data point) for 1st Jan to 1st Feb, 2nd Jan to 2nd Feb, and so on. If I don't get any other solution, I was thinking, may be just run the tumbling query once, kill it, do it again next day ? – user12818876 Feb 05 '20 at 05:46
  • When data for the tumbling window is received, the aggregate is updated and re-emitted. Maybe I don't understand your question fully. Could you update it with an example set of input records and your desired output? – Robin Moffatt Feb 05 '20 at 10:27
  • Agree, but in tumbling window the counter will be initialised after the window is over. It won't keep track of partial window. Sorry for being unclear, I have updated the question with the example. Please let me know if I need to add any information. – user12818876 Feb 05 '20 at 11:39