
The following code ingests 10k-20k records per second, and I want to improve its performance. I am reading JSON-formatted data and ingesting it into a database through Kafka. I am running it on a cluster of five nodes with ZooKeeper and Kafka installed on them.

Can you give me some tips to improve it?

import os
import json
from multiprocessing import Pool
from kafka.client import KafkaClient
from kafka.producer import SimpleProducer


def process_line(line):
    producer = SimpleProducer(client)
    try:
        jrec = json.loads(line.strip())
        producer.send_messages('twitter2613', json.dumps(jrec))
    except ValueError:
        # skip lines that are not valid JSON
        pass


if __name__ == "__main__":
    client = KafkaClient('10.62.84.35:9092')
    myloop = True
    pool = Pool(30)


    direcToData = os.listdir("/FullData/RowData")
    for loop in direcToData:
        mydir2 = os.listdir("/FullData/RowData/" + loop)

        for i in mydir2:
            if myloop:
                with open("/FullData/RowData/" + loop + "/" + i) as source_file:
                    # chunk the work into batches of 30 lines at a time
                    results = pool.map(process_line, source_file, 30)

1 Answer


You could maybe import only the functions that you need from the os module instead of the whole module. It can be a first optimization.
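For instance, a minimal sketch of that change (assuming the script only uses listdir from os):

from os import listdir

direcToData = listdir("/FullData/RowData")  # no os. prefix needed

The other os.listdir calls in the script would be updated the same way.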
