
I am trying to read data from a Kafka topic using the kafka.ReadFromKafka() method in Python. My code looks like this:

import apache_beam as beam
from apache_beam.io.external import kafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    plants = (
        p
        | 'read' >> kafka.ReadFromKafka({'bootstrap.servers': 'public_ip:9092'}, ['topic1']))

But I am getting the error message below.

ERROR:apache_beam.runners.runner:Error while visiting read
Traceback (most recent call last):
  File "test_file.py", line 16, in <module>
    | 'read' >> kafka.ReadFromKafka({'bootstrap.servers': 'localhost:9092'}, ['topic1'])
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/pipeline.py", line 547, in __exit__
    self.run().wait_until_finish()
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/pipeline.py", line 526, in run
    return self.runner.run_pipeline(self, self._options)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 565, in run_pipeline
    self.visit_transforms(pipeline, options)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/runner.py", line 224, in visit_transforms
    pipeline.visit(RunVisitor(self))
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/pipeline.py", line 572, in visit
    self._root_transform().visit(visitor, self, visited)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/pipeline.py", line 1075, in visit
    part.visit(visitor, pipeline, visited)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/pipeline.py", line 1078, in visit
    visitor.visit_transform(self)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/runner.py", line 219, in visit_transform
    self.runner.run_transform(transform_node, options)
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/runner.py", line 249, in run_transform
    (transform_node.transform, self))
NotImplementedError: Execution of [<ReadFromKafka(PTransform) label=[ReadFromKafka(beam:external:java:kafka:read:v1)]>] not implemented in runner <apache_beam.runners.dataflow.dataflow_runner.DataflowRunner object at 0x7f72463344a8>.

Is it because the Apache Beam Dataflow runner doesn't support KafkaIO?

Joseph N
  • You are running as `DataflowRunner` and connecting to `localhost:9092`? – bigbounty Jul 07 '20 at 12:44
  • I have added the public IP in `advertised.listeners` in the server.properties file as well as in the code. Still facing the same error. – Joseph N Jul 07 '20 at 13:40
  • But Dataflow can't access your localhost if you are running on Google's servers – bigbounty Jul 07 '20 at 13:43
  • Any suggestions, what should I do in this situation? I have my Kafka on a GCP Compute Engine instance – Joseph N Jul 07 '20 at 14:08
  • If you want to test your pipeline, you can run your pipeline using `DirectRunner` on your local machine. If you want the code on direct runner, I can write it as an answer – bigbounty Jul 07 '20 at 14:09
  • Note also that as of https://issues.apache.org/jira/browse/BEAM-3788 Beam Python supports using the Java KafkaIO classes via cross-language transforms. Note that the `use_runner_v2` experiment must be set to use this on Dataflow (see the sketch after these comments). – robertwb Jul 07 '20 at 22:46
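For reference, here is a minimal sketch of the cross-language setup described in the comment above, assuming a Beam release with cross-language KafkaIO support and a broker reachable at a public address; the project, region, and bucket values are hypothetical placeholders, not verified settings:

import apache_beam as beam
from apache_beam.io.external import kafka
from apache_beam.options.pipeline_options import PipelineOptions

# All option values below are placeholders for your own setup.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',                 # hypothetical project id
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',   # hypothetical bucket
    '--experiments=use_runner_v2',          # required for cross-language KafkaIO on Dataflow
    '--streaming',
])

with beam.Pipeline(options=options) as p:
    messages = (
        p
        | 'read' >> kafka.ReadFromKafka(
            consumer_config={'bootstrap.servers': 'PUBLIC_IP:9092'},
            topics=['topic1']))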

1 Answer


The Python SDK for Beam does support connecting to Kafka, via the third-party beam_nuggets library. Below is a code snippet:

from __future__ import print_function
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio

kafka_topic = "notifications"
# Consumer configuration: topic to subscribe to, broker address(es),
# and the consumer group this pipeline joins.
kafka_config = {"topic": kafka_topic,
                "bootstrap_servers": "localhost:9092",
                "group_id": "notification_consumer_group"}

with beam.Pipeline(options=PipelineOptions()) as p:
    # Each element emitted by KafkaConsume is a (key, value) message pair.
    notifications = p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(kafka_config)
    notifications | 'Writing to stdout' >> beam.Map(print)

The `bootstrap_servers` value is a comma-separated list of `host:port` pairs for the brokers in your cluster. You can get this information from your Kafka cluster configuration.
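For example, a config pointing at a hypothetical two-broker cluster might look like this (the hostnames are placeholders):

# broker1/broker2 hostnames are illustrative, not real endpoints
kafka_config = {"topic": "notifications",
                "bootstrap_servers": "broker1.example.com:9092,broker2.example.com:9092",
                "group_id": "notification_consumer_group"}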

Jayadeep Jayaraman
  • "bootstrap_servers": "localhost:9092", will "localhost" work here? i am running my code on DataflowRunner. – Joseph N Jul 07 '20 at 13:45
  • You need to host Kafka in the cloud and provide its endpoints so that the code in your Dataflow job can access it – bigbounty Jul 07 '20 at 13:49
  • localhost refers to the local IP of the machine. Dataflow workers will not host Kafka; you need to host Kafka on a separate VM, as @bigbounty mentioned, and you can connect to it from your Dataflow job. – Jayadeep Jayaraman Jul 07 '20 at 13:52
  • You need to specify `"bootstrap_servers": "PUBLIC_IP:9092"` where `PUBLIC_IP` is the public IP address of your Kafka host (e.g. your GCP Compute Engine VM). – robertwb Jul 07 '20 at 22:40
  • How can I set up SASL/SSL configuration in the `kafka_config`? – ShahNewazKhan Aug 21 '22 at 12:33