
I have around 30 CSV files that I am trying to convert to JSON in parallel. The conversion works, but it takes quite some time: around 25 minutes. Each file has about 2 million records. Below is my code. I am new to Python; can you suggest possible ways to tune this code so that the conversion runs faster?

import csv
import json
import os 
import multiprocessing as mp

path = '<some_path>'


""" Multiprocessing module to generate json"""

total_csv_file_list = []
for filename in os.listdir(path):
    total_csv_file_list.append(os.path.join(path,filename))

total_csv_file_list = list(filter(lambda x:x.endswith('.csv'), total_csv_file_list))
total_csv_file_list = sorted(total_csv_file_list)
print(total_csv_file_list) 


def gen_json(csv_path):
    """ Convert one csv file into a json file with the same base name """
    with open(csv_path, 'r') as csvfile, \
         open(csv_path.split('.')[0] + '.json', 'w') as jsonfile:
        fieldnames = ("<field_names>")
        reader = list(csv.DictReader(csvfile, fieldnames))
        json.dump(reader, jsonfile, indent=4)

p_json = mp.Pool(processes=mp.cpu_count())
try:
    total_json_file_list = p_json.map(gen_json, total_csv_file_list)
finally:
    p_json.close()
    p_json.join()
    print("done")
Aritra Bhattacharya

1 Answer


Staying in pure Python, there is not much to gain, and the little that can be gained is probably not justified by the added complexity.

Try using one fewer worker than you have cores. The OS can still do its own tasks on the free core, so your program should see less context switching.
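
A minimal sketch of that change, assuming the rest of the question's code stays the same:

import multiprocessing as mp

# Leave one core free for the OS; max() guards against single-core machines.
n_workers = max(1, mp.cpu_count() - 1)
p_json = mp.Pool(processes=n_workers)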

As you are not interested in the result, map_async could be faster than map. You do not return anything from this function, but map still has some overhead for collecting and returning the results.
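
For example, a sketch reusing gen_json and the file list from the question (map_async returns immediately, so close/join is still needed to wait for the workers):

p_json = mp.Pool(processes=max(1, mp.cpu_count() - 1))
try:
    # The AsyncResult is ignored on purpose: gen_json writes its output
    # to disk and returns nothing useful.
    p_json.map_async(gen_json, total_csv_file_list)
finally:
    p_json.close()   # no further tasks will be submitted
    p_json.join()    # block until all queued conversions have finished
    print("done")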

Double-check that you are not hitting memory limits and that the OS is not starting to swap to disk. I am not sure whether DictReader together with json loads the whole file into memory or buffers it, but note that wrapping DictReader in list() definitely materialises all rows at once. If swapping is the problem, you will need to do the buffering and chunked writing yourself.
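
A sketch of such a chunked/streaming variant, assuming each CSV has a header row (otherwise pass fieldnames as in the original) and that the output only needs to be a JSON array of row objects:

import csv
import json

def gen_json_streaming(csv_path):
    json_path = csv_path.rsplit('.', 1)[0] + '.json'
    with open(csv_path, 'r', newline='') as csvfile, open(json_path, 'w') as jsonfile:
        reader = csv.DictReader(csvfile)
        jsonfile.write('[\n')
        first = True
        for row in reader:
            # Write one record at a time instead of keeping 2 million dicts in memory.
            if not first:
                jsonfile.write(',\n')
            json.dump(row, jsonfile)
            first = False
        jsonfile.write('\n]\n')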

Staying with pure Python, one could also try to leverage asyncio, but it would require custom chunking and custom producer-consumer code with a queue and multiple event loops, one per consumer.
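
The full multi-loop setup is beyond a quick answer, but a simplified single-loop sketch of the producer-consumer idea, reusing gen_json from the question and pushing the blocking work off the loop with asyncio.to_thread (Python 3.9+), could look like this; because of the GIL it mostly overlaps I/O rather than the CPU-bound parsing:

import asyncio

async def consumer(queue):
    while True:
        path = await queue.get()
        try:
            # gen_json is blocking, so run it outside the event loop.
            await asyncio.to_thread(gen_json, path)
        finally:
            queue.task_done()

async def convert_all(paths, n_consumers=4):
    queue = asyncio.Queue()
    for p in paths:
        queue.put_nowait(p)
    workers = [asyncio.create_task(consumer(queue)) for _ in range(n_consumers)]
    await queue.join()            # wait until every queued file has been processed
    for w in workers:
        w.cancel()                # the consumers loop forever, so stop them
    await asyncio.gather(*workers, return_exceptions=True)

asyncio.run(convert_all(total_csv_file_list))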

Real speed gains can be achieved with Cython. Define a class for the row data with well-specified C-typed fields, read the CSV with the plain reader, create an object of that class for each row, and serialize a list of such objects to JSON. Such code can be compiled with Cython into a Python C extension. Even without proper typing, Cython-compiled Python code is roughly twice as fast.
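
A hypothetical sketch of such a typed row class (the field names are made up for illustration; the .pyx file would be built with cythonize("row.pyx") in setup.py):

# row.pyx - illustrative only, field names are placeholders
cdef class Row:
    cdef public long id
    cdef public str name
    cdef public double amount

    def __init__(self, long id, str name, double amount):
        self.id = id
        self.name = name
        self.amount = amount

    cpdef dict as_dict(self):
        # json.dump cannot serialize extension types directly,
        # so convert back to a plain dict right before dumping.
        return {"id": self.id, "name": self.name, "amount": self.amount}

Even compiling the existing, untyped conversion module with cythonize is a quick way to test the "roughly twice as fast" claim before investing in typed classes.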

dre-hh