
PySpark version: 2.4.7, Kafka version: 2.13_3.2.0

Hi, I am new to PySpark and streaming. I have come across a few resources on the internet, but I still cannot figure out how to send a PySpark DataFrame to a Kafka broker. I need to write producer code that reads data from a CSV file and sends it to a Kafka topic. Please help me out with the code and the configuration.

import findspark
findspark.init("/usr/local/spark")
from pyspark.sql import SparkSession
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql.functions import *
import os
from kafka import KafkaProducer

import csv

def spark_session():
    '''
    Description:
        To open a spark session. Returns a spark session object.
    '''
    spark = SparkSession \
        .builder \
        .appName("Test_Kafka_Producer") \
        .master("local[*]") \
        .getOrCreate()
    
    return spark
   
if __name__ == '__main__':

    spark = spark_session()
    topic = "Kafkatest"
    spark_version = '2.4.7'
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.13:{}'.format(spark_version)
 
    # producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
    #                          value_serializer=lambda x: x.encode('utf-8'))

    df1 = spark.read.csv("annual-enterprise-survey-2020-financial-year-provisional-size-bands-csv.csv", inferSchema = True, header = True)
    df1.show(10)

    print("sending df===========")

    df1.write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", topic) \
    .save()

    print("End------")

The error that I am encountering for this bit of code is:

py4j.protocol.Py4JJavaError: An error occurred while calling o41.save.
: org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;


2 Answers


You don't need Spark to read a CSV file and run a Kafka Producer in Python (I see you already tried to import KafkaProducer, which should have worked).

E.g.

from kafka import KafkaProducer

topic = "Kafkatest"
producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         value_serializer=lambda x: x.encode('utf-8'))

with open("annual-enterprise-survey-2020-financial-year-provisional-size-bands-csv.csv") as f:
    for i, line in enumerate(f):
        if i > 0:  # skip the CSV header row
            producer.send(topic, line)
producer.flush()

But if PYSPARK_SUBMIT_ARGS doesn't work, as it looks like it doesn't here (it is set after the SparkSession has already been created, so it has no effect, and the _2.13 Scala suffix doesn't match any Spark 2.4.7 build), you should pass the same option on the CLI

spark-submit --packages ... app.py

Or you can use config("spark.jars.packages", "...") on the session, as shown below.


You'll also need to ensure that the Kafka dataframe only has the schema mentioned in the documentation (topic, key, value, etc.). In other words, all CSV columns should be encoded as one string, so you'd be better off using spark.read.text and filtering out the first header row before you produce anything. (If you want to keep the typed CSV dataframe instead, see the to_json sketch after the example.)

Example

from pyspark.sql import SparkSession

scala_version = '2.12'  # TODO: Ensure this is correct
spark_version = '3.2.1'
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.2.0'
]
spark = SparkSession.builder\
   .master("local")\
   .appName("kafka-example")\
   .config("spark.jars.packages", ",".join(packages))\
   .getOrCreate()

# Read all lines into a single-column dataframe with column 'value'
# TODO: Replace with real file.
df = spark.read.text('file:///tmp/data.csv')

# TODO: Remove the file header, if it exists
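# e.g. (one possible approach, assuming the header is the first line
# and its text doesn't also appear as a data row):
#   header = df.first()[0]
#   df = df.filter(df.value != header)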

# Write
df.write.format("kafka")\
  .option("kafka.bootstrap.servers", "localhost:9092")\
  .option("topic", "foobar")\
  .save()

Verified on host with

$ kcat -b localhost:9092 -C -t foobar
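If you'd rather keep the typed dataframe from spark.read.csv instead of raw text lines, one common pattern (a sketch reusing the session and placeholder topic from above, not part of the verified example) is to serialize all columns into a single JSON string:

from pyspark.sql.functions import struct, to_json

# Read the typed CSV as in the question
df1 = spark.read.csv("annual-enterprise-survey-2020-financial-year-provisional-size-bands-csv.csv",
                     inferSchema=True, header=True)

# Pack all columns into one JSON string column named 'value',
# which satisfies the schema the Kafka writer expects
df1.select(to_json(struct(*df1.columns)).alias("value")) \
    .write.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "foobar") \
    .save()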
  • Hey, thanks for the answer. But I would be extremely grateful if you could send me a link showing exactly how to encode it. As I said, I am fairly new to pyspark-kafka streaming. Code bits will help a lot. – subh Jun 13 '22 at 13:23
  • Like I said, `spark.read.text` will already have a single column dataframe that can be produced – OneCricketeer Jun 13 '22 at 13:26
  • Thanks for the code. I actually did it like this before, but it doesn't solve my problem, as I will be working solely on pyspark dfs, and converting them to CSVs every time for streaming will not be an optimised solution. I tried reading a csv just as an example. Sending the dataframe is the concern for me. – subh Jun 13 '22 at 13:45
  • Once again, `spark.read.text` is what you should try. However, if you're only ever going to be reading a local csv file, then there's nothing to "stream" or "optimize". Adding pyspark will be slower with added JVM overhead for reading local files rather than directly looping them in Python – OneCricketeer Jun 13 '22 at 13:52
  • @subh I've added a working Spark example – OneCricketeer Jun 13 '22 at 21:43
  • Thanks a lot for your code. It works fine. You saved me in my job today! – subh Jun 14 '22 at 05:42

You are trying to write the df directly, but does it follow the required Kafka schema, where a value column is necessary?

Please check this link for details; you might then need to encode your dataframe into a value column to send it to Kafka.
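For instance, a minimal sketch along the lines of the linked answer (the 'id' key column is a placeholder; substitute one of your own columns):

from pyspark.sql.functions import col, struct, to_json

# df is the dataframe read from your CSV.
# Pick a key and collapse every column into one JSON string,
# matching the key/value schema the Kafka writer expects
kafka_ready = df.select(
    col("id").cast("string").alias("key"),      # placeholder key column
    to_json(struct(*df.columns)).alias("value")
)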

  • The error has nothing to do with the dataframe encoding – OneCricketeer Jun 13 '22 at 13:07
  • 1
    I have gone through this one, but I am not understanding how do I change my usual dataframe having differnt columns obtained from reading a csv, to a spark streaming df? – subh Jun 13 '22 at 13:08
  • Can you try using [this](https://stackoverflow.com/questions/57258461/how-do-i-convert-a-dataframe-to-json-and-write-to-kafka-topic-with-key) link – Anjaneya Tripathi Jun 13 '22 at 13:14