I have two files with book ids
- current.json [~10,000 lines] -> books saved in the system
- feed.json [~300,000 lines] -> the feed file, containing all books from a book store
From these two files I want to generate three files:
- not_available.json -> books that exist in current but not in the feed
- to_be_updated.json -> books that exist in both current and the feed
- new.json -> books that exist only in the feed
Because the files are huge, I read them line by line; I can't load the data into an in-memory array.
My pseudocode for this is as follows:
// export to_be_updated.json and new.json
feed <- initstream(feed.json)
while (lf <- feed.nextline())
    found <- false
    current <- initstream(current.json)
    while (lc <- current.nextline())
        if (JSON.parse(lf).id == JSON.parse(lc).id)
            found <- true
            break
    if (found) then append(lf, to_be_updated.json)
    else append(lf, new.json)
// export not_available.json
current <- initstream(current.json)
while (lc <- current.nextline())
    found <- false
    feed <- initstream(feed.json)
    while (lf <- feed.nextline())
        if (JSON.parse(lf).id == JSON.parse(lc).id)
            found <- true
            break
    if not(found) then append(lc, not_available.json)
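For concreteness, the pseudocode above translates roughly to the Python sketch below (the file names and the `id` field come from my files; the function name and everything else are illustrative). Only the first pass is shown; the not_available pass is symmetric with the roles of the two files swapped:

```python
import json

def export_matches(feed_path, current_path, updated_path, new_path):
    """Nested-loop pass: for each feed line, rescan current.json for the same id."""
    with open(updated_path, "w") as updated, open(new_path, "w") as new:
        with open(feed_path) as feed:
            for lf in feed:  # one feed record per line (JSON Lines)
                book_id = json.loads(lf)["id"]
                found = False
                # current.json is re-opened and re-read for every feed line,
                # which is exactly where the O(n*m) cost comes from
                with open(current_path) as current:
                    for lc in current:
                        if json.loads(lc)["id"] == book_id:
                            found = True
                            break
                (updated if found else new).write(lf)
```

This mirrors the pseudocode one-to-one, including the repeated re-parsing of every line, so it has the same quadratic behaviour.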
This code has a time complexity of O(n·m), with n = 10,000 and m = 300,000, and a space complexity of O(1).
As a result it takes a very long time: about 2 hours for 500 MB on a Core i5.
I tried to combine the logic into a single nested loop, but that wasn't possible. I'm trying to reach better complexity with unsorted files.
Do you think this is the best way to do it? Is there a better approach?
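One direction I've been wondering about: current.json is only ~10,000 lines, so keeping just its ids (not the whole records) in a set might be acceptable even though the full data is not. Under that assumption, a single streaming pass over the feed plus one final pass over current would be O(n + m). A rough sketch of what I mean (all names are mine, not a tested solution):

```python
import json

def export_all(feed_path, current_path, not_available_path, updated_path, new_path):
    # Assumption: the ~10,000 ids from the smaller file fit in memory as a set.
    with open(current_path) as current:
        current_ids = {json.loads(line)["id"] for line in current}

    seen = set()  # ids from current that also appear in the feed
    with open(feed_path) as feed, \
         open(updated_path, "w") as updated, open(new_path, "w") as new:
        for lf in feed:  # single streaming pass over the big feed file
            book_id = json.loads(lf)["id"]
            if book_id in current_ids:
                seen.add(book_id)
                updated.write(lf)
            else:
                new.write(lf)

    # Anything in current that was never seen in the feed is not available.
    with open(current_path) as current, open(not_available_path, "w") as na:
        for lc in current:
            if json.loads(lc)["id"] not in seen:
                na.write(lc)
```

The feed is still never held in memory; only the id sets are, which is O(n) space instead of O(1).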
Update (file format)
feed.json has the following format (example):
{"id": "12340", "title": "A life journey", "price": "34.00"}
{"id": "12341", "title": "all over the world", "price": "42.00"}
{"id": "12342", "title": "good to remember", "price": "60.00"}
{"id": "12343", "title": "A night in Mars", "price": "14.00"}
...