I am trying to read JSON files of about 30 GB. I can do it with ijson, but to speed up the process I am trying to add multiprocessing. However, I cannot make it work: I can see the n workers ready, but only one worker ends up doing all of the work.
Does anyone know if it is possible to combine multiprocessing with ijson?
Here is a sample of the code:
import ijson
import pandas as pd
import multiprocessing

file = 'jsonfile'
player = []

def games(record):
    # collect the player from each game in the record, or a placeholder if the key is missing
    games01 = record["games"]
    for game01 in games01:
        try:
            player.append(game01['player'])
        except KeyError:
            player.append('No_record_found')

if __name__ == '__main__':
    with open(file, "rb") as f:
        pool = multiprocessing.Pool()
        pool.map(games, ijson.items(f, "game.item"))
        pool.close()
        pool.join()
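To make the question more concrete: would something along the lines of the sketch below be the right direction? This is only a rough, untested sketch, not a working solution; the extract_players name, the use of imap, and the chunksize=100 value are my guesses. It has each worker return its results instead of appending to the global player list, since I understand each process only sees its own copy of that list. Or does the ijson parsing in the parent process still serialize everything, which would explain why only one core is busy?

import ijson
import multiprocessing

file = 'jsonfile'

def extract_players(record):
    # each worker builds and returns its own list instead of touching a shared global
    found = []
    for game in record.get("games", []):
        found.append(game.get('player', 'No_record_found'))
    return found

if __name__ == '__main__':
    with open(file, "rb") as f:
        with multiprocessing.Pool() as pool:
            # chunksize is a guess; larger chunks should mean fewer pickling round-trips
            results = pool.imap(extract_players, ijson.items(f, "game.item"), chunksize=100)
            player = [p for chunk in results for p in chunk]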