
I am trying to run an app that uses a Kafka producer (Python client) and an Apache Beam pipeline that (for now) simply consumes those messages by printing them to STDOUT.

I understand that using the Kafka external transform with Apache Beam is a cross-language endeavor, since it calls out to an external Java service. I followed Option 1 from the linked documentation:

Option 1: Use the default expansion service

This is the recommended and easiest setup option for using Python Kafka transforms. This option is only available for Beam 2.22.0 and later.

This option requires following pre-requisites before running the Beam pipeline.

Install Java runtime in the computer from where the pipeline is constructed and make sure that ‘java’ command is available.
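As a quick sanity check of that prerequisite (nothing Beam-specific here, just the standard library), something along these lines confirms that the java command the expansion service relies on is actually reachable:

import shutil
import subprocess

# Make sure the 'java' executable the default expansion service needs is on PATH.
if shutil.which("java") is None:
    raise RuntimeError("No 'java' executable found on PATH")

# Print the runtime version (java -version writes to stderr).
subprocess.run(["java", "-version"], check=True)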

I am running apache-beam==2.31.0 and just installed Java:

openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)

I am not completely sure which runner I should use: the portability documentation seems to point towards a Universal Local Runner, but I can't find that runner anywhere in the documentation.

Here's the code sample I'm trying to make work:

import argparse
import apache_beam as beam
from helpers import ccloud_lib

from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    """Main entry point; runs a word_count pipeline"""

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input_topic",
        dest="input_topic",
        default="wordcount",
        help="Kafka topic to use for input",
    )
    parser.add_argument(
        "--kafka_config",
        dest="config_file",
        default="config/confluent/python.config",
    )

    args = parser.parse_known_args(argv)[0]
    beam_options = PipelineOptions(runner="DirectRunner")

    consumer_conf = ccloud_lib.read_ccloud_config(args.config_file)
    consumer_conf["group.id"] = "python_wordcount_group_1"
    consumer_conf["auto.offset.reset"] = "earliest"

    with beam.Pipeline(options=beam_options) as pipeline:
        (
            pipeline
            | "Read"
            >> ReadFromKafka(
                consumer_config=consumer_conf,
                topics=[args.input_topic],
            )
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()

I launch the module, but I don't fully understand how this works, as some Java artifacts seem to be downloaded and a Docker image launched. I then get the following warning message:

INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'2021/08/25 14:38:05 Failed to obtain provisioning information: failed to dial server at localhost:36071\n\tcaused by:\ncontext deadline exceeded\n'

To summarize my questions: what exactly happens when I launch the script? Which runner should I be using here? And how do I fix this error?

Imad

1 Answer


I think the universal runner is located under apache_beam.runners.portability.portable_runner.
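If that is indeed the runner to use, I believe it can be selected through pipeline options along these lines; this is only a sketch, and it assumes a Beam job server already running locally on its default port 8099:

from apache_beam.options.pipeline_options import PipelineOptions

# Select the portable runner and point it at a locally running job server.
# LOOPBACK keeps the Python workers in the submitting process instead of Docker.
beam_options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK",
])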

blais