I have set up Kafka 2.1 on Windows and can successfully send messages on a topic from a producer to a consumer over localhost:9092.
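(For context, I verified the broker with the console tools, roughly like this, where "stocky" is the topic I use later:

bin\windows\kafka-console-producer.bat --broker-list localhost:9092 --topic stocky
bin\windows\kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic stocky --from-beginning

Messages typed into the producer show up in the consumer, so the broker side works.)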
I now want to consume this topic in a Spark Structured Streaming job.
For this I set up Spark 3.4 and installed PySpark on a Jupyter kernel, and it is working well.
The issue I have now is how to correctly configure the Kafka dependency JARs for Spark in Jupyter. I have tried the following:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("KafkaStreamingExample") \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0') \
    .getOrCreate()

stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "stocky") \
    .load()
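(Once .load() succeeds, my plan is to sink the stream to the console for testing, something like the following; the cast to STRING is just so the byte payloads print readably:

query = stream_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()
# query.awaitTermination()  # would block the notebook cell until the stream stops

But I never get that far.)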
I get the error:

Failed to find data source: kafka
I know there are options to load the packages with spark-submit (see the command below), but I particularly need to know whether it's possible to get this working from within the Jupyter notebook environment.
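For reference, this is the kind of spark-submit invocation I mean, using the same package coordinate as in my config above (the script name is just a placeholder):

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 kafka_stream.py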
It would be great if someone could point me in the right direction.