I am trying to send a large CSV to Kafka. The basic approach is to read a line of the CSV and zip it with the header:
a = dict(zip(header, line.split(",")))
This then gets converted to JSON with:
message = json.dumps(a)
I then use the kafka-python library to send the message:
from kafka import SimpleProducer, KafkaClient
kafka = KafkaClient("localhost:9092")
producer = SimpleProducer(kafka)
producer.send_messages("topic", message)
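(For what it's worth, I believe send_messages can also take several messages in one call, which I assume I could use for batching later; a hypothetical example, reusing the producer defined above:)
# hypothetical batch of three JSON strings sent in a single call
batch = ['{"a": 1}', '{"a": 2}', '{"a": 3}']
producer.send_messages("topic", *batch)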
Using PySpark I have easily created an RDD of messages from the CSV file:
sc = SparkContext()
text = sc.textFile("file.csv")
header = text.first().split(',')
def remove_header(itr_index, itr):
    return iter(list(itr)[1:]) if itr_index == 0 else itr
noHeader = text.mapPartitionsWithIndex(remove_header)
messageRDD = noHeader.map(lambda x: json.dumps(dict(zip(header, x.split(",")))))
Now I want to send these messages. I define a function:
def sendkafka(message):
    kafka = KafkaClient("localhost:9092")
    producer = SimpleProducer(kafka)
    return producer.send_messages('topic', message)
Then I create a new RDD to send the messages:
sentRDD = messageRDD.map(lambda x: sendkafka(x))
I then call sentRDD.count(), which starts churning and sending messages.
Unfortunately this is very slow: it sends only about 1,000 messages per second. This is on a 10-node cluster, each node with 4 CPUs and 8 GB of memory.
In comparison, creating the messages takes only about 7 seconds for the 10-million-row (~2 GB) CSV.
I think the issue is that I am instantiating a Kafka producer inside the function, so a new connection is opened for every single message. However, if I don't, Spark complains that the producer doesn't exist, even though I have tried defining it globally.
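The only alternative I can think of is to create the producer once per partition instead of once per message. A rough, untested sketch of what I have in mind, reusing the same "localhost:9092" broker and 'topic' as above:
def send_partition(messages):
    # one producer per partition instead of one per message (untested idea)
    kafka = KafkaClient("localhost:9092")
    producer = SimpleProducer(kafka)
    for m in messages:
        yield producer.send_messages('topic', m)

sentRDD = messageRDD.mapPartitions(send_partition)
sentRDD.count()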
Perhaps someone can shed some light on how this problem may be approached.
Thank you,