
I'm parsing a fairly large dataset from MongoDB (about 40,000 documents, each with a decent amount of data inside).

The stream is being accessed like so:

  var cursor = db.domains.find({ html: { $exists: true } });

  var i = 0;
  cursor.on('data', function(rec) {
      i++;
      var url = rec.domain;
      var $ = cheerio.load(rec.html);
      checkList($, rec, url, i);
      // checkList parses the HTML with Cheerio to find different elements
      // on the page - lots of if/else statements
  });

  cursor.on('end', function(){
      console.log("Streamed all objects!");
  });

Each record gets parsed with Cheerio (the record contains HTML from a page scraped earlier), I process the Cheerio data to look for various selectors, and the results are saved back to MongoDB.
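
For reference, here's a simplified sketch of what checkList does (the selectors and the update call below are placeholders; the real function has many more checks):

  function checkList($, rec, url, i) {
      // Placeholder checks - the real function has many nested if/else branches
      var result = {};
      if ($('title').length) {
          result.title = $('title').text();
      }
      if ($('meta[name="description"]').length) {
          result.description = $('meta[name="description"]').attr('content');
      }
      // Save the findings back to MongoDB
      db.domains.update({ domain: url }, { $set: { checks: result } }, function(err) {
          if (err) console.error(err);
      });
  }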

For the first ~2,000 objects the data is parsed quite quickly (in ~30 seconds), but after that it becomes far slower, dropping to around 50 records parsed per second.

Looking in my MacBook Air's Activity Monitor I see that it's not using a crazy amount of memory (226.5 MB of 8 GB RAM), but it is using a whole lot of CPU (io.js is taking up 99% of my CPU).

Is this a possible memory leak? The checkList function isn't particularly intensive (or at least, as far as I can tell - there are quite a few nested if/else statements but not much else).

Am I meant to be clearing my variables after they're used, like setting $ = '' or similar? Is there any other reason Node would be using so much CPU?

JVG
  • Well, the "cursor" result is being converted into a stream, but your "parsed" results could easily be staying in memory. But mainly see `stream.pause`, since processing the stream does not "stop" with "on data" - you're just stacking up operations. – Blakes Seven Jul 16 '15 at 02:37
  • Maybe garbage collection and BSON parsing. Use a profiler! node-inspector is a good place to start. – NG. Jul 16 '15 at 02:38

1 Answer


You basically need to "pause" the stream, or otherwise "throttle" it from executing on every data item received straight away. The code in the "data" event handler does not wait for completion before the next event fires, so unless you stop the events emitting, the operations stack up.

  var cursor = db.domains.find({ html: { $exists: true } });

  var i = 0;
  cursor.on('data', function(rec) {
      cursor.pause();    // stop processing new "data" events
      i++;
      var url = rec.domain;
      var $ = cheerio.load(rec.html);
      checkList($, rec, url, i);
      // if checkList() is synchronous, resume here
      cursor.resume();  // start events again
  });

  cursor.on('end', function(){
      console.log("Streamed all objects!");
  });

If checkList() contains async methods, then pass in the cursor:

      checkList($, rec, url, i, cursor);

And call "resume" inside the async callback:

  function checkList($, rec, url, i, cursor) {

      somethingAsync(args, function(err, result) {
          // We're done - start events again
          cursor.resume();
      });
  }

The "pause" stops the events emitting from the stream until the "resume" is called. This means your operations don't "stack up" in memory and wait for each to complete.

You probably want more advanced flow control for some parallel processing, but this is basically how you do it with streams: pause on "data", do your work, then resume when it completes. One parallel approach is sketched below.
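
For example, a minimal sketch using [async](https://github.com/caolan/async)'s queue (this assumes checkList is rewritten to take a completion callback instead of the counter; the concurrency of 5 and the buffer of 10 are arbitrary numbers):

  var async = require('async');

  // Run up to 5 checkList() operations at once
  var queue = async.queue(function(rec, done) {
      var $ = cheerio.load(rec.html);
      checkList($, rec, rec.domain, done);  // checkList calls done() when finished
  }, 5);

  cursor.on('data', function(rec) {
      queue.push(rec);
      if (queue.length() > 10) {
          cursor.pause();   // too much buffered - stop reading for now
      }
  });

  // When the queue empties, ask the stream for more
  queue.drain = function() {
      cursor.resume();
  };

This keeps a handful of documents in flight at once instead of strictly one at a time.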

Blakes Seven
  • Wow, this is amazing, I've been looking hard for this kind of code optimisation thing and this looks like exactly what I'm after. I'll test it out and report back. – JVG Jul 16 '15 at 02:55
  • Gave it a go - when I just resumed the cursor after sending the record to `checkList()` it didn't have any effect, but when I put `cursor.resume()` in the callback of my `mongo.save()` function it became far, far faster! Will need to look into parallel processing to see what I can improve from here. Cheers mate. – JVG Jul 16 '15 at 04:18
  • @Jascination That's basically what I said: put the "resume" inside any async callback contained there. That's the thing you want to wait for. You can likely get some more improvements here. Take a look at [async](https://github.com/caolan/async) for varying takes on iterators where you can have multiple operations working at once. – Blakes Seven Jul 16 '15 at 04:25