
I am parsing ~250K XML files and loading the extracted data into a SQLite database. I am using node version 10.15.1 with cheerio and better-sqlite3 on a Mac OS X laptop with 8 GB of memory. I readdirSync the entire folder of ~250K files, then parse each XML file and insert the extracted data using transactions in batches of 10K files. I am running node with `--max_old_space_size=4096` but still get `FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory`.

Now, if I process 100K files, quit node, then start again and process the remaining ~150K files, it all works. But I'd rather do it all in one go, as this is something that has to be done unattended. Is there anything else I can do given my constraints? I can't use a machine with more memory because I don't have access to one. I could try bumping up `--max_old_space_size` a bit more, or I could try smaller transaction batches, but I am not sure that will help (I tried 8000 files per transaction instead of 10K, and that too ran out of memory). The only thing that seems to help right now is quitting node in between. Is there any way I can simulate that? That is, tell node to release all its memory and pretend it has been restarted? Any other thoughts?
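For context, the loop is roughly shaped like the sketch below. This is simplified: the database filename, the table schema, and the `id`/`title` selectors are placeholders, not my actual extraction code.

```js
const fs = require('fs');
const path = require('path');
const cheerio = require('cheerio');
const Database = require('better-sqlite3');

const db = new Database('data.db');
db.exec('CREATE TABLE IF NOT EXISTS records (id TEXT, title TEXT)');
const insert = db.prepare('INSERT INTO records (id, title) VALUES (?, ?)');

// better-sqlite3 runs the wrapped callback inside a single transaction
const insertBatch = db.transaction((rows) => {
    for (const row of rows) insert.run(row.id, row.title);
});

const dir = '/path/to/xml/folder';
const files = fs.readdirSync(dir); // ~250K filenames in one array

let rows = [];
for (const file of files) {
    const xml = fs.readFileSync(path.join(dir, file), 'utf8');
    const $ = cheerio.load(xml, { xmlMode: true });

    // placeholder selectors standing in for the real extraction
    rows.push({ id: $('id').text(), title: $('title').text() });

    // flush to SQLite once 10K files have been parsed
    if (rows.length === 10000) {
        insertBatch(rows);
        rows = [];
    }
}
if (rows.length) insertBatch(rows);
```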

punkish
    It sounds like you have a memory leak, though without any example code it is hard to diagnose. Node will automatically reclaim any memory that is unused, but in some cases you create so many small objects in such a short amount of time that the garbage collector can't keep up and you still run out of memory. I suggest you check out `heapdump` and see if there are any memory leaks. – icirellik Feb 18 '19 at 20:37
  • I've had similar issues with cheerio and switching to whacko helped. – pguardiario Feb 18 '19 at 23:53
  • be careful @pguardiario [whacko is no longer maintained](https://github.com/inikulin/whacko#readme) – punkish Feb 19 '19 at 08:25

1 Answer


So, I finally stumbled on a way around my problem (I use "stumbled" because I am not sure if this is definitively the right strategy, but it works for me).

I found that increasing the `--max_old_space_size` value didn't really help me. In any case, as I mentioned above, my MacBook has only 8 GB, so I have a low ceiling anyway. Quite the contrary, what actually helped was simply reducing my batch size. So, instead of processing 10K XML files, storing their data in memory, and then inserting them in a single SQLite transaction, I processed 1K XML files at a time.

Sure, to process ~250K files I now had 250 loops instead of 25, but it didn't really increase my total time by much. I have found that the processing time is pretty linear, about 5K ms per 1K files (or 50K ms per 10K files). SQLite is pretty fast whether I throw 1K or 10K INSERTs at it in a transaction; it is my XML parsing step that starts acting up when dealing with very large amounts of data. In fact, this may not be an issue with cheerio at all (which I've found to be very good); it may just be my coding style, which can be improved a lot.
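A minimal sketch of the chunked loop, assuming the same kind of setup as in the question (the table schema and the `id`/`title` selectors are placeholders, not my actual extraction code):

```js
const fs = require('fs');
const path = require('path');
const cheerio = require('cheerio');
const Database = require('better-sqlite3');

const db = new Database('data.db');
db.exec('CREATE TABLE IF NOT EXISTS records (id TEXT, title TEXT)');
const insert = db.prepare('INSERT INTO records (id, title) VALUES (?, ?)');
const insertBatch = db.transaction((rows) => {
    for (const row of rows) insert.run(row.id, row.title);
});

const dir = '/path/to/xml/folder';
const files = fs.readdirSync(dir);
const BATCH_SIZE = 1000; // the key change: 1K files per transaction instead of 10K

for (let i = 0; i < files.length; i += BATCH_SIZE) {
    const rows = files.slice(i, i + BATCH_SIZE).map((file) => {
        const xml = fs.readFileSync(path.join(dir, file), 'utf8');
        const $ = cheerio.load(xml, { xmlMode: true });
        return { id: $('id').text(), title: $('title').text() };
    });
    insertBatch(rows); // one transaction per 1K files
    // `rows` goes out of scope here, so only ~1K parsed documents
    // ever need to be live in memory at the same time
}
```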

In any case, processing 1K files per transaction with `--max_old_space_size=2048` did the job for me. The memory use by node (as shown in Activity Monitor) was pretty stable, and the entire dump of ~250K files was parsed and loaded into the db in about 42 minutes. I can live with that.

punkish