I am using the Apache Beam Python SDK and I cannot get a per-key, per-window aggregation to work on an unbounded PCollection. The data comes from a Kafka topic, organised as dictionaries with a key, a value, and a timestamp. I read it with the Kafka consumer from the beam_nuggets package (I have not been able to make the built-in Kafka connector work), apply a three-minute fixed window, GroupByKey, and compute the mean. I am not interested in handling late data for now (the default trigger should be fine). All data seems to be assigned to windows correctly, but the aggregating function after GroupByKey is never called.
Here is the code I used:
import json
import apache_beam as beam
from apache_beam.transforms import window, trigger
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio

beam_options = PipelineOptions(
    runner="DirectRunner",
    streaming=True,
)

class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        # "datetime" is in milliseconds; Beam timestamps are in seconds
        unix_timestamp = element["datetime"] / 1000
        yield beam.window.TimestampedValue(element, unix_timestamp)

def add_key(x):
    print("add key", x["datetime"])
    return (x["key"], x)

def process_group(kv):
    print("process_group")
    # GroupByKey emits (key, iterable-of-elements)
    key, values = kv
    values = list(values)
    return key, sum(v["value"] for v in values) / len(values)

with beam.Pipeline(options=beam_options) as pipeline:
    data = (pipeline
            | kafkaio.KafkaConsume(
                consumer_config={
                    "bootstrap_servers": "localhost:9092",
                    "topic": "foo",
                    "group_id": "consumer_group",
                    "auto_offset_reset": "latest",
                },
                value_decoder=bytes.decode,
            )
            | "ToDict" >> beam.MapTuple(lambda k, v: json.loads(v))
            | "Add timestamp" >> beam.ParDo(AddTimestampDoFn())
            | "Add key" >> beam.Map(add_key)
            | "Window" >> beam.WindowInto(window.FixedWindows(60 * 3))
            )

    grouped = (data
               | "Group" >> beam.GroupByKey()
               | "ProcessGroup" >> beam.Map(process_group)
               )
The first part seems to work correctly: the "add key" debug line is printed for every message the Kafka consumer receives, and each datapoint appears to be assigned to a window. However, the "process_group" line is never printed, as if the pipeline never reaches that point.
I know there are a couple of similar questions on StackOverflow (like this one, this one, or this one) but none of the solutions seems to work.
I also tried different triggers (such as AfterWatermark), but it still does not seem to work.
The Apache Beam version is 2.41.0.