I have two files with book ids
- current.json [~10,000 lines] -> books saved in the system
- feed.json [~300,000 lines] -> the feed file, containing all books from a book store
From these two files I want to generate three files:
- not_available.json -> books that exist in current but not in the feed
- to_be_updated.json -> books that exist in both current and the feed
- new.json -> books that exist only in the feed
Because the files are huge, I read them line by line; I can't load the data into an in-memory array.
My pseudocode for this is as follows:
// export to_be_updated.json and new.json
feed <- initstream(feed.json)
while (lf <- feed.nextline())
    found <- false
    current <- initstream(current.json)
    while (lc <- current.nextline())
        if (JSON.parse(lf).id == JSON.parse(lc).id)
            found <- true
            break
    if (found) then append(lf, to_be_updated.json)
    else append(lf, new.json)
// export not_available.json
current <- initstream(current.json)
while (lc <- current.nextline())
    found <- false
    feed <- initstream(feed.json)
    while (lf <- feed.nextline())
        if (JSON.parse(lf).id == JSON.parse(lc).id)
            found <- true
            break
    if not(found) then append(lc, not_available.json)
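For concreteness, the pseudocode above translates roughly to the Python sketch below (the file names and the `id` field come from my files; the function name and everything else are illustrative). Only the first pass is shown; the not_available pass is symmetric with the roles of the two files swapped:

```python
import json

def export_matches(feed_path, current_path, updated_path, new_path):
    """Nested-loop pass: for each feed line, rescan current.json for the same id."""
    with open(updated_path, "w") as updated, open(new_path, "w") as new:
        with open(feed_path) as feed:
            for lf in feed:  # one feed record per line (JSON Lines)
                book_id = json.loads(lf)["id"]
                found = False
                # current.json is re-opened and re-read for every feed line,
                # which is exactly where the O(n*m) cost comes from
                with open(current_path) as current:
                    for lc in current:
                        if json.loads(lc)["id"] == book_id:
                            found = True
                            break
                (updated if found else new).write(lf)
```

This mirrors the pseudocode one-to-one, including the repeated re-parsing of every line, so it has the same quadratic behaviour.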
This code has a time complexity of O(n·m), with n = 10,000 and m = 300,000, and a space complexity of O(1).
As a result it takes a very long time: about 2 hours for 500 MB on a Core i5.
I tried to combine the logic into a single nested loop, but that wasn't possible. I'm trying to reach better complexity with unsorted files.
Do you think this is the best way to do it? Is there a better approach?
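One direction I've been wondering about: current.json is only ~10,000 lines, so keeping just its ids (not the whole records) in a set might be acceptable even though the full data is not. Under that assumption, a single streaming pass over the feed plus one final pass over current would be O(n + m). A rough sketch of what I mean (all names are mine, not a tested solution):

```python
import json

def export_all(feed_path, current_path, not_available_path, updated_path, new_path):
    # Assumption: the ~10,000 ids from the smaller file fit in memory as a set.
    with open(current_path) as current:
        current_ids = {json.loads(line)["id"] for line in current}

    seen = set()  # ids from current that also appear in the feed
    with open(feed_path) as feed, \
         open(updated_path, "w") as updated, open(new_path, "w") as new:
        for lf in feed:  # single streaming pass over the big feed file
            book_id = json.loads(lf)["id"]
            if book_id in current_ids:
                seen.add(book_id)
                updated.write(lf)
            else:
                new.write(lf)

    # Anything in current that was never seen in the feed is not available.
    with open(current_path) as current, open(not_available_path, "w") as na:
        for lc in current:
            if json.loads(lc)["id"] not in seen:
                na.write(lc)
```

The feed is still never held in memory; only the id sets are, which is O(n) space instead of O(1).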
Update (file format)
feed.json has the following format (example):
{"id": "12340", "title": "A life journey", "price": "34.00"}
{"id": "12341", "title": "all over the world", "price": "42.00"}
{"id": "12342", "title": "good to remember", "price": "60.00"}
{"id": "12343", "title": "A night in Mars", "price": "14.00"}
...