I am trying to read data from a Kafka topic. Kafka itself is set up fine. I wrote the following code using PyFlink, and no matter whether I add the jars or not, the error remains the same.
```python
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer
from pyflink.datastream.stream_execution_environment import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.common import SimpleStringSchema, Configuration


class SourceData(object):
    def __init__(self, env):
        self.env = env
        self.env.set_runtime_mode(RuntimeExecutionMode.STREAMING)
        self.env.set_parallelism(1)
        self.config = Configuration()
        self.config.set_string("pipeline.jars", "file:///../jars/flink-sql-connector-kafka-1.17.1.jar")
        self.env.configure(self.config)

    def get_data(self):
        source = KafkaSource.builder() \
            .set_bootstrap_servers("localhost:9092") \
            .set_topics("test-topic") \
            .set_starting_offsets(KafkaOffsetsInitializer.earliest()) \
            .set_value_only_deserializer(SimpleStringSchema()) \
            .build()
        self.env \
            .add_source(source) \
            .print()
        self.env.execute("source")


SourceData(StreamExecutionEnvironment.get_execution_environment()).get_data()
```
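Since the relative `file:///../jars/...` value looked suspicious to me (a `file://` URL normally has to be absolute), I also tried building an absolute URL for `pipeline.jars` before passing it to the config (a sketch; `jar_url` is just a hypothetical helper, and the jar path is my local one):

```python
from pathlib import Path


def jar_url(path: str) -> str:
    # Hypothetical helper: turn a (possibly relative) jar path into the
    # absolute file:// URL that pipeline.jars expects.
    return Path(path).resolve().as_uri()


# e.g. passed to config.set_string("pipeline.jars", jar_url(...))
url = jar_url("jars/flink-sql-connector-kafka-1.17.1.jar")
```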
Environment:
- Flink 1.17.1
- Java 11
- Kafka client: latest
- Python 3.10.11

Error:

```
TypeError: Could not found the Java class 'org.apache.flink.connector.kafka.source.KafkaSource.builder'. The Java dependencies could be specified via command line argument '--jarfile' or the config option 'pipeline.jars'
```
I also tried dropping the config option and using `env.add_jars` instead, but the error remains the same. Do I need to configure anything else?
The second option I tried was copying the jar into `pyflink/lib` inside the site-packages of my virtual environment. After doing this, I get the following error:
```
py4j.protocol.Py4JError: An error occurred while calling o12.addSource. Trace:
org.apache.flink.api.python.shaded.py4j.Py4JException: Method addSource([class org.apache.flink.connector.kafka.source.KafkaSource, class java.lang.String, null]) does not exist
```
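For reference, this is how I located the `pyflink/lib` directory I copied the jar into (a sketch; it assumes PyFlink is installed into the active environment's site-packages, which is the standard virtualenv layout):

```python
import sysconfig
from pathlib import Path

# site-packages of the active environment (the venv, when one is activated)
site_packages = Path(sysconfig.get_paths()["purelib"])

# directory where PyFlink keeps its bundled jars
pyflink_lib = site_packages / "pyflink" / "lib"
print(pyflink_lib)
```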