PySpark version: 2.4.7
Kafka version: 2.13_3.2.0
Hi, I am new to PySpark and streaming. I have come across a few resources on the internet, but I am still not able to figure out how to send a PySpark DataFrame to a Kafka broker. I need to write producer code: I am reading the data from a CSV file and trying to send it to a Kafka topic. Please help me out with the code and the configuration.
import findspark
findspark.init("/usr/local/spark")

import os

from pyspark.sql import SparkSession
from kafka import KafkaProducer  # only used by the commented-out kafka-python attempt below


def spark_session():
    '''
    Description:
        Opens a Spark session. Returns a SparkSession object.
    '''
    spark = SparkSession \
        .builder \
        .appName("Test_Kafka_Producer") \
        .master("local[*]") \
        .getOrCreate()
    return spark


if __name__ == '__main__':
    spark = spark_session()
    topic = "Kafkatest"

    spark_version = '2.4.7'
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.13:{}'.format(spark_version)

    # producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
    #                          value_serializer=lambda x: x.encode('utf-8'))

    # read the CSV into a DataFrame
    df1 = spark.read.csv("annual-enterprise-survey-2020-financial-year-provisional-size-bands-csv.csv",
                         inferSchema=True, header=True)
    df1.show(10)

    print("sending df===========")
    df1.write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", topic) \
        .save()
    print("End------")
The error that I am encountering for this bit of code is:
py4j.protocol.Py4JJavaError: An error occurred while calling o41.save. : org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
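Update: reading the "Structured Streaming + Kafka Integration Guide" again, I suspect two problems with my setup. First, PYSPARK_SUBMIT_ARGS is only read when the JVM starts, so setting it after getOrCreate() has no effect; second, the stock Spark 2.4.7 build is compiled against Scala 2.11, so the artifact should be spark-sql-kafka-0-10_2.11:2.4.7 (the 2.13 in my Kafka download name is the Scala build of the broker, which the client should not care about). The Kafka sink also only accepts specific columns, so each row has to be packed into a string column named "value". Below is a minimal sketch of what I think the fix looks like, still assuming a broker on localhost:9092 and the topic Kafkatest:

import os

# must be set BEFORE the JVM / SparkSession is created; the trailing
# "pyspark-shell" is required when launching with plain python
os.environ['PYSPARK_SUBMIT_ARGS'] = \
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7 pyspark-shell'

import findspark
findspark.init("/usr/local/spark")

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder \
    .appName("Test_Kafka_Producer") \
    .master("local[*]") \
    .getOrCreate()

df1 = spark.read.csv("annual-enterprise-survey-2020-financial-year-provisional-size-bands-csv.csv",
                     inferSchema=True, header=True)

# the Kafka sink only understands the columns key/value/topic, so
# serialize each row into a single JSON string column named "value"
df1.select(to_json(struct([df1[c] for c in df1.columns])).alias("value")) \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "Kafkatest") \
    .save()

Is this sketch right? If the package coordinates match the Scala version of my local Spark install, should the kafka data source resolve and the AnalysisException go away?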