
I'm trying to run Apache Beam backed by a local Flink cluster in order to consume from a Kafka topic, as described in the documentation for ReadFromKafka.

The code is basically this pipeline and some other setup as described in the Beam examples:

    with beam.Pipeline() as p:

        # Read from Kafka and window the stream into 1-second fixed windows.
        lines = p | ReadFromKafka(
            consumer_config={'bootstrap.servers': bootstrap_servers},
            topics=[topic],
        ) | beam.WindowInto(beam.window.FixedWindows(1))

        # Print each record, then try to write the results to `output`
        # (the path passed via --output).
        printed = lines | beam.FlatMap(lambda x: print(x))

        printed | WriteToText(output)
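The "other setup" is essentially argument parsing and pipeline options along these lines (a rough sketch of my own wiring, loosely following the Beam examples; the exact flags and variable names are my assumption):

    import argparse

    import apache_beam as beam
    # ReadFromKafka lives in apache_beam.io.kafka on recent SDKs
    # (apache_beam.io.external.kafka on older ones).
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.io import WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions

    parser = argparse.ArgumentParser()
    parser.add_argument('--topic')
    parser.add_argument('--bootstrap_servers')
    parser.add_argument('--output')
    known_args, pipeline_args = parser.parse_known_args()

    # The remaining flags (--runner, --flink_master, ...) are handed to Beam
    # as pipeline options.
    pipeline_options = PipelineOptions(pipeline_args)

    topic = known_args.topic
    bootstrap_servers = known_args.bootstrap_servers
    output = known_args.output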

To run this on Flink, I followed this doc for Beam on Flink and did the following:

--> Downloaded the binaries for Flink 1.10 and followed these instructions to properly set up the cluster.

I checked the logs for the job manager and the task manager; both were properly initialized.

--> Started Kafka using Docker, exposing it on port 9092.

--> Executed the following in the terminal:

    python example_1.py --runner FlinkRunner --topic myTopic --bootstrap_servers localhost:9092 --flink_master localhost:8081 --output output_folder

The terminal outputs:

    2.23.0: Pulling from apache/beam_java_sdk
    Digest: sha256:3450c7953f8472c2312148a2a8324b0115fd71b3a7a01a3b017f6de69d89dfe1
    Status: Image is up to date for apache/beam_java_sdk:2.23.0
    docker.io/apache/beam_java_sdk:2.23.0

But then, after writing some messages to myTopic, the terminal remains frozen and I don't see anything in the output folder. I checked flink-conf.yaml, and given these two lines:

    jobmanager.rpc.address: localhost
    jobmanager.rpc.port: 6123

I assumed that the port for submitting jobs would be 6123 instead of the 8081 specified in the Beam documentation, but the behaviour is the same for both ports.

I'm very new to Beam/Flink, so I'm not quite sure what the problem could be. I have two hypotheses at the moment, but can't quite figure out how to investigate them:

1. Something related to the port that Beam uses to communicate with Flink when submitting jobs.

2. The Expansion Service for the Python SDK mentioned in the apache_beam.io.external.kafka.ReadFromKafka docs:

Note: To use these transforms, you need to start a Java Expansion Service. Please refer to the portability documentation on how to do that. Flink Users can use the built-in Expansion Service of the Flink Runner’s Job Server. The expansion service address has to be provided when instantiating the transforms.

But the portability documentation just refers me back to the same Beam-on-Flink doc.
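For hypothesis 2, if I did have to point the transform at an expansion service explicitly, I assume it would look something like this (ReadFromKafka accepts an expansion_service argument; the address below is just a placeholder, since the Flink runner's job server is supposed to provide a built-in expansion service):

    lines = p | ReadFromKafka(
        consumer_config={'bootstrap.servers': bootstrap_servers},
        topics=[topic],
        # Hypothetical address of a manually started Java expansion service;
        # normally the Flink runner's job server should expose one itself.
        expansion_service='localhost:8097',
    )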

Could someone please help me out?

Edit: I was writing to the topic using the Debezium Source Connector for PostgreSQL and seeing the behavior described above. But when I tried writing to the topic manually, the application crashed with the following:

    RuntimeError: org.apache.beam.sdk.util.UserCodeException: org.apache.beam.sdk.coders.CoderException: cannot encode a null byte[]
        at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)

1 Answer


You are doing everything correctly; the Java Expansion Service no longer needs to be started manually (see the latest docs). Also, Flink serves the web UI at 8081, but accepts job submission there just as well, so either port works fine.

It looks like you may be running into the issue that Python's TextIO does not yet support streaming.
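If you do need files out of a streaming pipeline, something along these lines with fileio (a rough, untested sketch that relies on its default text sink) is the usual direction:

    from apache_beam.io import fileio

    # Sketch: turn each Kafka record into a string, then use the file-based
    # sink that handles windowed/streaming writes.
    (lines
     | 'Format' >> beam.Map(str)
     | 'Write' >> fileio.WriteToFiles(path=output))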

Additionally, there is the complication that when running Python pipelines on Flink, the actual code runs in a Docker container, and so if you are trying to write to a "local" file it will be a file inside the container, not on your machine.
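If you just want output to land on your machine while experimenting locally, one option (assuming everything runs on a single host) is to run the SDK workers in the submitting process rather than in Docker by setting the environment type, roughly like this:

    from apache_beam.options.pipeline_options import PipelineOptions

    # LOOPBACK runs the Python workers in the process that submits the job,
    # so "local" paths really are local. Intended for testing, not production.
    options = PipelineOptions([
        '--runner=FlinkRunner',
        '--flink_master=localhost:8081',
        '--environment_type=LOOPBACK',
    ])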

  • thanks for the update. I checked for files with the output prefix in the container, but didn't find any. Also, I was writing to the topic using Debezium connected to PostgreSQL. I tried writing some messages manually to the topic and then the application crashed with the error `RuntimeError: org.apache.beam.sdk.util.UserCodeException: org.apache.beam.sdk.coders.CoderException: cannot encode a null byte[] at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)`. Any additional ideas would be much appreciated – Lucas Abreu Aug 31 '20 at 22:54
  • This looks like https://issues.apache.org/jira/browse/BEAM-10529 . You might want to follow https://lists.apache.org/thread.html/r439c1a9a801e1f01752be272ba9e5b6187cdd617fc750dd63384ef80%40%3Cuser.beam.apache.org%3E – robertwb Sep 01 '20 at 07:43
  • Hello, I seem to be having kind of a similar issue, except perhaps my setup is not correct? Could you please check out my post here: https://stackoverflow.com/questions/68925332/how-to-consume-messages-using-beams-external-kafka-transform-locally – Imad Aug 26 '21 at 13:12