I'm using PyFlink 1.13 for a project and I'm trying to do the following:
- Read data from a Kafka topic where each message contains a UserId
- Apply a 2-second tumbling window to the data
- Call a Python UDF with the values of each window
Here's a visual representation of the data flow I'm trying to achieve:
I'm using PyFlink's Table API, and both of my tables are declared using SQL DDL.
My query execution looks like this:
SELECT UserId, Timestamp, my_udf(Data) AS Result
FROM InputTable
GROUP BY TUMBLE(Timestamp, INTERVAL '2' SECOND), UserId, Data
Here's my Python UDF function:
@udf(input_types=SOME_INPUT_TYPE, result_type=SOME_OUTPUT_TYPE)
def my_udf(window_data):
    # ...business logic here with window_data
    return some_result
My current problem is that the my_udf function receives each row separately, so in the example above it would be called 4 times instead of 2.
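To make the expected behaviour concrete, here is a plain-Python sketch (not PyFlink code) of what I'd like to happen. The sample rows, the `apply_per_window` helper, and the placeholder body of `my_udf` are made up for illustration; the point is only that four rows falling into two (user, window) groups should produce two UDF calls, each receiving all values of its window.

```python
from collections import defaultdict

WINDOW_SIZE_SECONDS = 2  # tumbling window length from the question


def my_udf(window_data):
    # Placeholder business logic: report how many values the window held.
    return len(window_data)


def apply_per_window(rows, udf):
    """Group (user_id, timestamp_seconds, data) rows into tumbling windows
    per user and call `udf` once per (user, window) group."""
    windows = defaultdict(list)
    for user_id, ts, data in rows:
        # Align the timestamp to the start of its 2-second window.
        window_start = int(ts // WINDOW_SIZE_SECONDS) * WINDOW_SIZE_SECONDS
        windows[(user_id, window_start)].append(data)
    # One UDF call per group, receiving every value of that window.
    return {key: udf(values) for key, values in windows.items()}


# Hypothetical sample: four rows that fall into two windows.
rows = [
    ("user-1", 0.2, "a"),
    ("user-1", 1.5, "b"),
    ("user-2", 0.7, "c"),
    ("user-2", 1.9, "d"),
]
results = apply_per_window(rows, my_udf)
```

With this sample, `results` has two entries, one per user and window, rather than one entry per input row.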
I've been looking into the PyFlink docs and I'm not able to find how to achieve what I want.
The info is probably in the docs but it seems I failed to find/understand it.
Any help would be appreciated.
Thanks!