I have a 2GB JSON file. I create a read stream from it with fs.createReadStream, pipe it through JSONStream.parse (https://github.com/dominictarr/JSONStream), and push each record into the database. The setup works fine at first, but over time the whole process slows down, noticeably after processing roughly 200 thousand records, and memory usage grows slowly throughout.
There was initially a known issue in JSONStream (https://github.com/dominictarr/JSONStream/issues/101), which was fixed by https://github.com/dominictarr/JSONStream/pull/154. I'm now on version 1.3.4, so JSONStream itself can be ruled out.
I've verified the database (Cassandra) and there is no problem on that side, so I suspect this has something to do with backpressure (https://nodejs.org/en/docs/guides/backpressuring-in-streams/), but I'm not sure what the solution would be. Please share any thoughts/suggestions.
const fs = require('fs');
const JSONStream = require('JSONStream');
const es = require('event-stream');

fs.createReadStream('../path/to/large/json/file')
  .pipe(JSONStream.parse('*'))
  .pipe(processData());

function processData() {
  return es.mapSync((data) => {
    pushDatatoDB(data);
  });
}
I tried the highWaterMark option so that only a certain number of records is buffered at a time, pausing the stream and resuming it once that batch is written (roughly as sketched below). This was a slight improvement, but it does not resolve the slowness completely.
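For context, the pause/resume variant I tried looked roughly like this (pushDatatoDB is assumed here to return a promise, and the batch size of 100 is an arbitrary number I picked):

const fs = require('fs');
const JSONStream = require('JSONStream');

const stream = fs.createReadStream('../path/to/large/json/file', { highWaterMark: 2048 })
  .pipe(JSONStream.parse('*'));

let batch = [];

stream.on('data', (record) => {
  batch.push(record);
  if (batch.length >= 100) {
    // stop reading until the current batch has been written
    stream.pause();
    Promise.all(batch.map((r) => pushDatatoDB(r)))
      .then(() => {
        batch = [];
        stream.resume();
      })
      .catch((err) => console.error('insert failed', err));
  }
});

stream.on('end', () => {
  // flush whatever is left in the last, partial batch
  Promise.all(batch.map((r) => pushDatatoDB(r)))
    .then(() => console.log('done'));
});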
I also tried a plain .on('data') handler, as below, but the issue persists:
const fs = require('fs');
const JSONStream = require('JSONStream');

fs.createReadStream('../path/to/large/json/file', { highWaterMark: 2048 })
  .pipe(JSONStream.parse('*'))
  .on('data', (record) => {
    pushDatatoDB(record);
  });
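Based on the backpressure guide, the direction I'm considering is to pipe into a Writable whose write callback only fires once the insert has completed, so that pipe() can throttle the reader when the database falls behind. A rough sketch of what I have in mind (again assuming pushDatatoDB returns a promise; highWaterMark: 100 is just a guess at a reasonable buffer of parsed records):

const fs = require('fs');
const JSONStream = require('JSONStream');
const { Writable } = require('stream');

const dbWriter = new Writable({
  objectMode: true,
  highWaterMark: 100, // buffer at most ~100 parsed records ahead of the inserts
  write(record, _encoding, callback) {
    // signal completion only after the insert finishes, so the
    // upstream read stream is paused whenever inserts lag behind
    pushDatatoDB(record)
      .then(() => callback())
      .catch(callback);
  },
});

fs.createReadStream('../path/to/large/json/file')
  .pipe(JSONStream.parse('*'))
  .pipe(dbWriter)
  .on('finish', () => console.log('all records written'))
  .on('error', (err) => console.error(err));

Would something along these lines be the right way to propagate backpressure here, or is there a better pattern for this kind of bulk load?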