
I have a 2GB JSON file. I create a read stream from it with fs.createReadStream, pipe it through JSONStream.parse (https://github.com/dominictarr/JSONStream), and push each record into the database. The setup works fine at first, but the whole process slows down over time, especially after roughly 200 thousand records, and memory usage climbs slowly throughout.

Initially there was an issue with JSONStream itself (https://github.com/dominictarr/JSONStream/issues/101), but it was resolved by this change: https://github.com/dominictarr/JSONStream/pull/154. I now use version 1.3.4, so JSONStream can be ruled out.

I have verified the database (Cassandra) and there is no problem on that side, so I am wondering whether this has something to do with backpressure (https://nodejs.org/en/docs/guides/backpressuring-in-streams/), but I am not sure what the solution would be. Please share any thoughts/suggestions.

const fs = require('fs');
const JSONStream = require('JSONStream');
const es = require('event-stream');

fs.createReadStream('../path/to/large/json/file')
  .pipe(JSONStream.parse('*'))
  .pipe(processData())

function processData() {
  return es.mapSync((data) => {
    // fire-and-forget insert: nothing here tells the upstream parser to wait
    pushDatatoDB(data)
  });
}
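
For what it's worth, the direction I'm wondering about is replacing the mapSync step with a proper Writable sink, so that .pipe() backpressure also covers the database writes. This is only a sketch of the idea (the object-mode highWaterMark of 16 is arbitrary, and it assumes pushDatatoDB returns a promise, which mine does), not something I've confirmed fixes the slowdown:

const fs = require('fs');
const JSONStream = require('JSONStream');
const { Writable } = require('stream');

// Sketch of a backpressure-aware sink: done() is only called once the
// insert has settled, so pipe() stops feeding the parser whenever the
// sink (and therefore Cassandra) falls behind.
const dbSink = new Writable({
  objectMode: true,
  highWaterMark: 16, // buffer at most 16 parsed records
  write(record, _encoding, done) {
    pushDatatoDB(record)           // returns a promise in my setup
      .then(() => done(), done);
  }
});

fs.createReadStream('../path/to/large/json/file')
  .pipe(JSONStream.parse('*'))
  .pipe(dbSink);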

I tried using the highWaterMark option so I could process a fixed number of records at a time and then resume the stream. That was a slight improvement, but it does not resolve the slowness completely.
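
Roughly, the batching I tried looked like the sketch below (reconstructed from memory, so the exact counter and error handling differ; 2048 is the batch size I used):

const BATCH_SIZE = 2048;
let pending = [];

const parser = fs.createReadStream('../path/to/large/json/file')
  .pipe(JSONStream.parse('*'));

parser.on('data', (record) => {
  pending.push(pushDatatoDB(record)); // each insert returns a promise

  if (pending.length >= BATCH_SIZE) {
    // stop the parser until this batch of inserts has resolved
    parser.pause();
    Promise.all(pending).then(() => {
      pending = [];
      parser.resume();
    });
  }
});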

I also tried a plain .on('data') handler, as below, but the issue persists:

const fs = require('fs');
const JSONStream = require('JSONStream');
const es = require('event-stream');

fs.createReadStream('../path/to/large/json/file', { highWaterMark: 2048 })
  .pipe(JSONStream.parse('*'))
  .on('data', (record) => {
      pushDatatoDB(record)
  })
Sai
  • Could you share some details of the spec of the machine it is running on? RAM size, CPU count, CPU type, swap size, etc.? – Chris Cousins Aug 21 '18 at 02:41
  • 4 CPUs and 16GB RAM – Sai Aug 21 '18 at 02:46
  • Thanks - when you say RAM increases slowly, what does it begin at and what does it end at (and is the increase fairly linear)? – Chris Cousins Aug 21 '18 at 02:49
  • What happens if `pushDataToDb()` is *not* done (maybe replaced with a console-out of a counter), but everything else is the same? That should determine whether the issue is in the 'reading' or the 'using'. – user2864740 Aug 21 '18 at 02:53
  • @ChrisCousins the increase seems to be fairly linear. I'm trying to get the start value; I don't have an end value because the process slows down and never finishes, even after hours. – Sai Aug 21 '18 at 02:56
  • @user2864740 I tried that too: I just incremented a counter without calling `pushDataToDb()`, and in that case the read stream is very fast and completes within a few minutes. `pushDataToDb()` just inserts a record into the database and returns a promise; based on the promise count I pause the stream and resume it once 2048 promises resolve. Writes to Cassandra seem to have no issues. – Sai Aug 21 '18 at 02:58
  • @Sai Then maybe there is an 'unbounded promise leak' (e.g. upward concurrency is not as expected, if read time << write time)? That is, maybe `pushDataToDb` *never* hard-blocks and a backlog of requests/promises starts to accumulate. Node itself is going to read as fast as it can with the pipe model. – user2864740 Aug 21 '18 at 03:25
  • @user2864740 Yep, that is a valid scenario. I also tried without promises: the Cassandra insert has a callback option, `client.execute(insertQuery, [ id ], { prepare: true }, callback);`, so I used that instead of promises, incrementing the recordCount inside the callback, and the stream is never paused or resumed. Even in that case the process slows down. – Sai Aug 21 '18 at 03:32
  • @Sai It's because the stream is never backpressured and you're reading the whole JSON into memory while sending all of it to Cassandra. You could take a look at [`scramjet`](https://www.npmjs.com/package/scramjet) (I wrote it) - you could then simply return the promise from pushDataToDB and that would be used to keep backpressure. You could also spread the work across a couple of CPUs with `distribute`, but I doubt it would gain you much. – Michał Karpacki Nov 26 '18 at 15:43

0 Answers