
I have a piece of code that processes thousands of files in a directory. For each file, it generates an object (a dictionary) whose key-value pairs look in part like:

{
    ........
    'result': [...a very long list...]
}

If I process all the files, save the results in a list, and then use the jsonlines library to write everything at once, my laptop (Mac) runs out of memory.
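
Simplified, my current approach looks roughly like this (the output path is a placeholder, and get_result stands for my real per-file processing):

import jsonlines

# current approach: build every result dict first, then dump them all at once;
# holding the full all_results list is what exhausts the memory
all_results = [get_result(path) for path in file_lst]

with jsonlines.open('output.jsonl', mode='w') as writer:
    writer.write_all(all_results)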

So my plan is to process the files one by one: get the result for each file, append it to the jsonlines file, then delete the object and release the memory.

After checking the official documentation (https://jsonlines.readthedocs.io/en/latest/), I couldn't find a method that writes to a jsonlines file without overwriting it.

So how can I handle such a big output?

Besides, I'm using parallel threads to process the files:

from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API
Pool(4).map(get_result, file_lst)  # get_result builds the result dict for one file

What I'd like is to open the jsonlines file, write each result as it becomes available, and then release the memory.

Jie Hu
    Can you change the structure of the file to only write a small JSON object on each line? In other words, avoid putting a very long list inside a JSON object. – Janne Karila Jan 30 '19 at 06:55
  • Alternatively, you can have another logger process running which writes to the file, while your worker processes running `get_result` only send their part of the result to the logger. The logger process can additionally decorate the result, ensuring the written file is still valid JSON even though it is written line by line. – vin Jan 30 '19 at 07:16
  • The result is from a PyTorch process that generates a vector for an image, so splitting the result could be a good idea, but it would be better to have a line-by-line writer. – Jie Hu Jan 30 '19 at 07:44

1 Answer


If I understand your question correctly, I think this will solve it:

import jsonlines

with jsonlines.open('yourTextFile', mode='a') as writer:  # 'a' appends instead of overwriting
    writer.write(...)  # write one JSON object as one line

Since you mentioned that you are overwriting the file, I think that is because you use mode='w' (w = write), which truncates the file, instead of mode='a' (a = append).
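
Putting this together with the thread pool from your question, one option is to keep all the writing in the main thread: Pool.imap hands results back one at a time, so only one result dict needs to be in memory before it is written and released. This is just a sketch under assumptions about your setup; get_result, the file list, and 'output.jsonl' are placeholders for your own function, files, and output path:

import jsonlines
from multiprocessing.dummy import Pool  # thread pool, as in the question

def get_result(path):
    # placeholder for your existing per-file processing (e.g. the PyTorch step);
    # it should return the dict for one file
    return {'file': path, 'result': []}

def process_all(file_lst, out_path='output.jsonl'):
    # open the output once; mode='a' appends, so nothing already written is lost
    with jsonlines.open(out_path, mode='a') as writer, Pool(4) as pool:
        # imap yields results one by one, so only one result dict is
        # held in memory before being written out
        for result in pool.imap(get_result, file_lst):
            writer.write(result)

if __name__ == '__main__':
    process_all(['img1.jpg', 'img2.jpg'])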