
I have set up Kafka 2.1 on Windows and am able to successfully send messages on a topic from a producer to a consumer over localhost:9092.

I now want to consume this topic in a Spark Structured Streaming job.

For this I set up Spark 3.4 and installed PySpark on a Jupyter kernel, and it's working well.

The issue I have now is how to correctly configure the Kafka Spark dependency JARs in Jupyter. I have tried the following:

spark = SparkSession \
    .builder \
    .appName("KafkaStreamingExample") \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0') \
    .getOrCreate()

stream_df = spark.readStream\
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "stocky") \
    .load()

I get the error

Failed to find data source: kafka

I know there are options to load the packages with spark-submit, but I particularly need to know if it's possible to get it working within the Jupyter notebook environment.
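One approach I have come across is passing the same --packages flag through the PYSPARK_SUBMIT_ARGS environment variable before the kernel creates its first session; a rough sketch of that (the package coordinate is illustrative and has to match the Spark/Scala build):

import os

# Rough sketch: this must run before the first SparkSession (and hence the
# JVM) is created in the kernel. The trailing "pyspark-shell" token is
# required. The package coordinate is illustrative and must match the
# installed Spark/Scala build.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStreamingExample").getOrCreate()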

It would be great if someone can point me in the right direction.


1 Answer


What you've done is correct, but the versions matter. You're running Spark 3.4.0? Then you won't be able to use spark-sql-kafka-0-10 version 3.3.0; the connector version needs to match your Spark version.

You also need the kafka-clients JAR to use this with PySpark.

https://github.com/OneCricketeer/docker-stacks/blob/master/hadoop-spark/spark-notebooks/kafka-sql.ipynb
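
For example, a minimal sketch with the versions lined up (assuming a Spark 3.4.0 build on Scala 2.12; verify the _2.12 vs _2.13 suffix against the JARs in $SPARK_HOME/jars):

from pyspark.sql import SparkSession

# Sketch: the connector version (3.4.0) and Scala suffix (_2.12) are
# assumptions that must match your installation. Note that
# spark.jars.packages only takes effect when the JVM first starts, so
# restart the kernel if a session already exists.
spark = SparkSession \
    .builder \
    .appName("KafkaStreamingExample") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.0,"
            "org.apache.kafka:kafka-clients:3.4.0") \
    .getOrCreate()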

OneCricketeer
  • Thanks for that, I have updated my code with what I believe is the correct version as follows: spark = SparkSession.builder \ .appName("KafkaStreamingExample") \ .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.4.0,org.apache.kafka:kafka-clients:3.4.0") \ .getOrCreate() However, it is still failing to find the kafka source. Are there additional configs required? – rarpal Aug 17 '23 at 10:14
  • Make sure the Scala version is correct as well: the `2.12` vs `2.13` part of the package name. You'll find this listed in the other JARs that are part of the Spark classpath (see the sketch after these comments). – OneCricketeer Aug 18 '23 at 21:23
  • Thanks, I heard through the grapevine that the more recent versions of PySpark do not work well with the kafka driver 0.10. So to be on the safe side I dropped way back to the earlier version spark-sql-kafka-0-10_2.11:2.3.4 and downgraded my PySpark to 2.3.4, and now it's working as expected. It's quite possible a couple of versions higher may also work, but for the moment I am happy. Thanks for your input. – rarpal Aug 19 '23 at 20:12
  • Spark 3+ would definitely be recommended over a 4+ year old version... I've not personally had issues with it – OneCricketeer Aug 19 '23 at 23:37
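
A quick way to confirm which Scala build a running Spark installation uses, as mentioned in the comments above: a minimal sketch, assuming an active SparkSession named spark, reading the JVM's Scala version via Py4J.

# Minimal sketch: print the Spark version and the Scala version of the
# running JVM, so the _2.12 vs _2.13 package suffix can be chosen to match.
# Assumes an active SparkSession named `spark`.
print(spark.version)  # e.g. 3.4.0
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
# e.g. "version 2.12.17" -> use the _2.12 artifacts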