
I am using the Python SDK for Apache Beam, and I am not able to perform an aggregation by window and key on an unbounded PCollection. Data comes from a Kafka topic and is organised as a dictionary with a key, a value, and a timestamp. I read it with the Kafka consumer from the beam_nuggets package (as I have not been able to make the default Kafka consumer work), apply a three-minute fixed window, group with GroupByKey, and calculate the mean. I am not interested in dealing with late data at the moment (the default trigger should work well). All data seems to be assigned to windows correctly, but the aggregating function after GroupByKey is never called.

Here is the code I used:

import json
import apache_beam as beam
from apache_beam.transforms import window, trigger
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio

beam_options = PipelineOptions(
    runner="DirectRunner",
    streaming=True,
)

class AddTimestampDoFn(beam.DoFn):
    def process(self, element):
        # "datetime" is a Unix timestamp in milliseconds; Beam expects seconds
        unix_timestamp = element["datetime"] / 1000
        yield window.TimestampedValue(element, unix_timestamp)

def add_key(x):
    print("add key", x["datetime"])
    return (x["key"], x)


def process_group(kv):
    print("process_group")
    # GroupByKey emits (key, iterable of grouped records); each record keeps
    # its reading under "value", per the record layout described above
    key, records = kv
    values = [r["value"] for r in records]
    return key, sum(values) / len(values)

with beam.Pipeline(options=beam_options) as pipeline:
    data = (
        pipeline
        | kafkaio.KafkaConsume(
            consumer_config={
                "bootstrap_servers": "localhost:9092",
                "topic": "foo",
                "group_id": "consumer_group",
                "auto_offset_reset": "latest",
            },
            value_decoder=bytes.decode,
        )
        | "ToDict" >> beam.MapTuple(lambda k, v: json.loads(v))
        | "Add timestamp" >> beam.ParDo(AddTimestampDoFn())
        | "Add key" >> beam.Map(add_key)
        | "Window" >> beam.WindowInto(window.FixedWindows(60 * 3))
    )
    grouped = (
        data
        | "Group" >> beam.GroupByKey()
        | "ProcessGroup" >> beam.Map(process_group)
    )
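To make the aggregated means visible once the group actually fires, I also attach a simple print sink at the end (just for debugging on the DirectRunner):

# still inside the "with beam.Pipeline(...)" block
grouped | "Print" >> beam.Map(print)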

The first part seems to work correctly: the "add key" debug log is printed for each message the Kafka consumer receives, and each datapoint appears to be assigned to a window. However, the "process_group" log is never printed, as if the pipeline never reaches that point.

I know there are a couple of similar questions on Stack Overflow, but none of their solutions seems to work.

I also tried defining an explicit trigger (like AfterWatermark), but it still does not seem to work.
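For reference, the triggered variant of the windowing step looked roughly like this (a sketch of the arguments I used, also quoted in the comments below; it is a drop-in replacement for the "Window" step above):

# replaces: "Window" >> beam.WindowInto(window.FixedWindows(60 * 3))
"Window" >> beam.WindowInto(
    window.FixedWindows(60 * 3),
    trigger=trigger.AfterWatermark(),
    accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
    allowed_lateness=window.Duration.of(0),
)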

The Apache Beam version is 2.41.0.

  • You may have to define a trigger for it to work on the DirectRunner. The DirectRunner is mostly used for testing (especially for streaming), and I've seen some bad behaviors before. Also try running on the DataflowRunner to make sure it works, as it has much better and more robust support for streaming use cases. – Bruno Volpato Nov 02 '22 at 18:05
  • Thank you for your answer. Unfortunately Dataflow is not an option for me, and I've tried using the Flink PortableRunner with no luck. I have also defined a trigger like this: `trigger=trigger.AfterWatermark(), accumulation_mode=trigger.AccumulationMode.ACCUMULATING, allowed_lateness=window.Duration.of(0)`, but I have the same issue with both the DirectRunner and the PortableRunner. – amnt Nov 03 '22 at 12:05

1 Answer


I had a similar issue with Flink and CoGroupByKey while reading from my test Kafka topic, which held a static set of records. Once I started producing new messages every few seconds, the issue disappeared and CoGroupByKey started to operate as expected, presumably because the watermark only advances when new data arrives, so with a static topic the window never closes and the aggregation never fires.

Solution found here:

https://github.com/apache/beam/issues/22809#issuecomment-1310971785
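For anyone trying to reproduce this, here is a minimal producer sketch that keeps the watermark advancing by emitting a record every few seconds. It assumes the kafka-python client (the one beam_nuggets builds on) and the topic and record layout from the question; the key and value below are made-up sample data:

import json
import time

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # serialize each record dict as a UTF-8 JSON payload
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # "key", "value" and "datetime" (epoch milliseconds) mirror the record
    # layout described in the question; the values are samples
    record = {"key": "sensor-1", "value": 42.0, "datetime": int(time.time() * 1000)}
    producer.send("foo", record)
    producer.flush()
    time.sleep(5)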

– Hej Ja