I'm trying to set up a simple pipeline using Apache Beam to read data from Kafka. Since this is a test, I run the pipeline on the DirectRunner. For authorization reasons, my consumer group ID must be prefixed with X. However, Apache Beam already prepends its own autogenerated prefix to the group ID. Is there a way to override this autogenerated prefix?

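import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

# username and password are placeholders that must be defined elsewhere.
def run_pipeline():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read from Kafka"
            >> ReadFromKafka(
                consumer_config={
                    # Comma-separated broker list (placeholder hosts).
                    "bootstrap.servers": "host:port,host:port,host:port,host:port",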
                    "auto.offset.reset": "earliest",
                    "group.id": "X_group",
                    "security.protocol": "SASL_PLAINTEXT",
                    "sasl.mechanism": "PLAIN",
                    "sasl.jaas.config": f"org.apache.kafka.common.security.plain.PlainLoginModule required username={username} password={password};",
                },
                topics=["topic"],
                max_num_records=10,
            )
            | "Print to console" >> beam.Map(print)
        )

The error message below shows the problem: my group ID should start with X_group, but the group ID that Apache Beam generates only ends with X_group:

trace: "org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: Reader-0_offset_consumer_1220072650_X_group\n"
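To make the mismatch concrete, here is a minimal sketch in plain Python that splits the group name from the trace and compares a prefix-based rule against a substring-based one. The pattern is only inferred from this single trace, not taken from Beam's source, and the helper names (`split_offset_group`, `allowed_by_prefix`, `allowed_by_substring`) are hypothetical. It does not model how a real Kafka ACL is evaluated.

```python
import re

# Group name copied from the GroupAuthorizationException trace above.
OBSERVED_GROUP = "Reader-0_offset_consumer_1220072650_X_group"

# Assumption: naming pattern inferred from that one trace only.
OFFSET_GROUP_PATTERN = re.compile(
    r"^(?P<reader>.+)_offset_consumer_(?P<nonce>\d+)_(?P<group_id>.+)$"
)


def split_offset_group(name):
    """Split a generated group name into its parts, or return None."""
    match = OFFSET_GROUP_PATTERN.match(name)
    return match.groupdict() if match else None


def allowed_by_prefix(name, prefix):
    """Hypothetical rule: the group name must start with the prefix."""
    return name.startswith(prefix)


def allowed_by_substring(name, fragment):
    """Hypothetical rule: the group name only has to contain the fragment."""
    return fragment in name


parts = split_offset_group(OBSERVED_GROUP)
print(parts)
print("prefix rule passes:", allowed_by_prefix(OBSERVED_GROUP, "X"))
print("substring rule passes:", allowed_by_substring(OBSERVED_GROUP, "X"))
```

Under these assumptions, the configured `group.id` survives only as the trailing part of the generated name, so a rule that requires the name to start with X rejects it, while a rule that only requires X to appear somewhere in the name accepts it.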
nerdizzle
  • We are facing a similar issue. Have you found a way to fix it or work around it? – DSchmidt Jul 05 '22 at 08:05
  • @DSchmidt Sadly, there is no fix. We went with the obvious workaround of changing the authorisation to allow all groups that contain X. But it gets even more frustrating once the pipeline is running. If you run the pipeline on Dataflow, everything seems to work as long as you don't restart the pipeline. Once the pipeline is restarted, all data from Kafka is read again from the earliest message, and this cannot be controlled by setting `"auto.offset.reset": "latest"`. I may be wrong, but I would expect Apache Beam to consume only the messages that have not yet been consumed. – nerdizzle Jul 06 '22 at 15:58
  • Okay, thanks, that's unfortunate to hear. We applied the same workaround of allowing all groups, but we are now facing other issues. I read that there are two consumers: the "data consumer" with the given group.id and an "offset consumer" with the generated ID. But I am not sure how they work together. I had hoped it would work the way you expect as well. Maybe you are missing some offset commits? – DSchmidt Jul 07 '22 at 07:39

0 Answers