
I have a bunch of firewall data. I would like to:

A) sum the bytes per IP per hour, and then

B) calculate the min and max sums across all IPs in that hour

I have been able to do A in ksql; however, I cannot figure out how to do B. I've been poring over the docs and feel like I'm closing in, but I always seem to find only part of the solution.

I have my firewall_stream going great.

client.create_stream(
    table_name='firewall_stream',
    columns_type=['src_ip VARCHAR',
                  'dst_ip VARCHAR',
                  'src_port INTEGER',
                  'dst_port INTEGER',
                  'protocol VARCHAR',
                  'action VARCHAR',
                  'timestamp VARCHAR',
                  'bytes BIGINT',
    ],
    topic='firewall',
    value_format='JSON'
)
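
Since value_format is JSON, the records on the firewall topic are plain JSON objects with one field per column above. For reference, producing a test record looks roughly like this (a sketch using kafka-python; the broker address and all values are made up):

import json
from kafka import KafkaProducer

# Send one sample record matching the stream's columns (values made up).
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)
producer.send('firewall', {
    'src_ip': '10.0.0.1',
    'dst_ip': '192.168.1.5',
    'src_port': 51234,
    'dst_port': 443,
    'protocol': 'TCP',
    'action': 'ALLOW',
    'timestamp': '2020-06-01T12:00:00Z',
    'bytes': 4096,
})
producer.flush()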

I created a materialized view bytes_sent with a tumbling window of 1 hour, sum(bytes), and a group by on IP address. This works great!

client.ksql('''
CREATE TABLE bytes_sent AS
  SELECT src_ip, sum(bytes) AS bytes_sum
  FROM firewall_stream
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY src_ip
  EMIT CHANGES
''')
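
To sanity-check the hourly buckets, you can tail the table with a push query; on recent ksqlDB versions the WINDOWSTART pseudo-column carries each row's window (a minimal sketch using the ksql-python client's query method, which streams results back):

# Tail the windowed table; each row carries its window start.
for chunk in client.query('''
SELECT src_ip,
       WINDOWSTART AS window_start,
       bytes_sum
FROM bytes_sent
EMIT CHANGES
'''):
    print(chunk)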

This is where I get stuck. First I tried to just create another materialized view off of bytes_sent that did max(bytes_sum) grouped by WINDOWSTART, but I got an error saying you can't run an aggregation on a windowed materialized view.
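
For reference, this is roughly the second materialized view I attempted (reconstructed from memory, so the exact statement is an approximation):

# Roughly the second aggregation I tried; ksql rejects it because
# bytes_sent is a windowed table and can't be aggregated again.
client.ksql('''
CREATE TABLE bytes_max AS
  SELECT WINDOWSTART AS window_start,
         max(bytes_sum) AS max_bytes
  FROM bytes_sent
  GROUP BY WINDOWSTART
  EMIT CHANGES
''')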

So then I removed the time window (figuring I'd put it back in the second materialized view), but then I don't have any field for my GROUP BY clause. In Postgres I could do max without a GROUP BY and it would calculate it across the whole table, but ksql always requires the GROUP BY, and now I'm not sure what to use.

From the docs, it appears one cannot do joins with windowed tables (although I haven't tried it and may be misunderstanding).

My only other guess is to create another stream from that materialized view bytes_sent, look at the changelog events, and then somehow turn them into a max bytes across all IPs within a given time window.

Any feedback on how to approach this would be greatly appreciated!!

jojo

1 Answer


The short answer is you can't currently do this.

My quick solution was essentially to:

  • create a topic as the output of the first aggregation
  • create a new stream on that new topic, but outside of ksql
  • run the new ksql aggregation on that second stream (sketched below)
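
Roughly, the steps look like this. This is a minimal sketch, not a drop-in solution: the changelog topic name BYTES_SENT, the new stream and table names, and the constant-key GROUP BY trick are all assumptions (check SHOW TOPICS or DESCRIBE bytes_sent EXTENDED for the real topic name, and note that some ksqlDB versions reject constant GROUP BY expressions). Here the new stream is registered with a plain CREATE STREAM; if you create it outside ksql, the idea is the same.

# 1. Re-register the first aggregation's changelog topic as a plain,
#    un-windowed stream. (BYTES_SENT is an assumed topic name.)
client.ksql('''
CREATE STREAM bytes_sent_stream (
    src_ip VARCHAR,
    bytes_sum BIGINT
  )
  WITH (KAFKA_TOPIC='BYTES_SENT', VALUE_FORMAT='JSON')
''')

# 2. Re-window by the same hour and aggregate across all IPs by grouping
#    every row onto one constant key (an assumed trick; support varies
#    by ksqlDB version).
client.ksql('''
CREATE TABLE bytes_minmax AS
  SELECT 'all' AS scope,
         min(bytes_sum) AS min_bytes,
         max(bytes_sum) AS max_bytes
  FROM bytes_sent_stream
  WINDOW TUMBLING (SIZE 1 HOUR)
  GROUP BY 'all'
  EMIT CHANGES
''')

One thing to watch: the first aggregation's changelog emits every intermediate sum as an hour's total grows, so the second aggregation's min can pick up partial sums rather than only final hourly totals.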

That said, there's a lot that can go wrong in this sort of setup. At this point we're just ruling out ksql for this particular use case and will likely use Kafka Streams directly.

jojo