
I have a function that fetches thread (Gmail conversation) ids from the database and then asks the Google API for the full data of each thread id. Once it receives a thread object, it stores it in the database. This works fine for my inbox, which has ~1k messages, but I am not sure whether it would work for accounts with over 100k messages.

Now what I am asking is: once a machine runs out of memory, will it break, or will it continue executing callback functions whenever enough RAM is available again? Should I modify this code to work part by part (rerun the whole script at certain points and continue with fresh RAM from where it last ended)?

function eachThread(auth) {
  var gmail = google.gmail('v1');

  MongoClient.connect(mongoUrl, function(err, db){
    assert.equal(null, err);
    var collection = db.collection('threads');
    // Find all data in collection and convert it to array
    collection.find().toArray(function(err, docs){
      assert.equal(null, err);
      var threadContents = [];
      // For each doc in array...
      for (var i = 0; i < docs.length; i++) {
        gmail
        .users
        .threads
        .get( {auth:auth,'userId':'me', 'id':docs[i].id}, function(err, resp){
          assert.equal(null, err);
          threadContents.push(resp);
          console.log(threadContents.length);
          console.log(threadContents[threadContents.length - 1].id);
          var anotherCollection = db.collection('threadContents');
          anotherCollection.updateOne(
            {id: threadContents[threadContents.length - 1].id},
            threadContents[threadContents.length - 1],
            {upsert:true},
            function(err, result){
              assert.equal(null, err);
              console.log('updated one.');
          });
          if (threadContents.length === docs.length) {
            console.log('Length matches!');
            db.close();
          }
        });//end(callback(threads.get))
      }//end(for(docs.length))
    });//end(find.toArray)
  });//end(callback(mongo.connect))
}//end(func(eachThread))
Kunok
  • I don't know if it is because of the RAM, but I do remember implementing a part-by-part approach in a SQL-to-MongoDB tool. The part-by-part version was twice as fast, even before testing different part sizes. – DrakaSAN Sep 01 '16 at 09:32
  • @DrakaSAN I did exactly the same. I had a MySQL database with many millions of rows and I made a cron job that did a part-by-part migration from SQL to Mongo. But that was PHP. I believe this callback world might work better as a one-time execution, no matter how long it would take; I just need it to do the work. – Kunok Sep 01 '16 at 09:34
  • What you could do is avoid `threadContents` and insert the `resp` itself. Also, you are creating `anotherCollection` inside the loop, which makes no sense since it's the same object over and over again. Then you will definitely not have any problems with RAM. – Stan Sep 01 '16 at 09:36
  • Also, part by part in your case is as easy as using `async.mapLimit` in the right place; I'd recommend doing that anyway. – DrakaSAN Sep 01 '16 at 09:40
  • @Stan I agree. That was my mistake, the code is a bit messy. Even after the suggested modification, it is only reduced by half. But there is an account that has ~200k messages. Would that work? Is there a way to calculate it? – Kunok Sep 01 '16 at 09:41
  • I would recommend replacing your for loop with `async.forEachSeries`; this way you run one thread after another. In your current code, `gmail.users.threads.get` is called immediately as many times as `docs.length`. – Molda Sep 01 '16 at 09:46
  • @Molda: `forEachLimit` would be even better, since he can decide how many concurrent transfers happen. – DrakaSAN Sep 01 '16 at 09:52

3 Answers


You will not run out of memory if you don't fetch everything and push it into an array. Also, I would not instantiate objects that are the same for every element inside the loop.

Here is an example of code that will not run out of memory. However, it is fire-and-forget, meaning that you will not get a callback when it's finished, etc. If you wish to have that, you will need to use promises/async (a minimal sketch of that follows the example below).

// Fire-and-forget type of function
// Will not run out of memory, GC will take care of that
function eachThread(auth, cb) {
  var gmail = google.gmail('v1');

  MongoClient.connect(mongoUrl, (err, db) => {
    if (err) {
      return cb(err);
    }

    var threadsCollection = db.collection('threads').find();
    var contentsCollection = db.collection('threadContents');

    threadsCollection.on('data', (doc) => {
      gmail.users.threads.get({ auth: auth, 'userId': 'me', 'id': doc.id }, (err, res) => {
        if (err) {
          return cb(err);
        }

        contentsCollection.updateOne({ id: doc.id }, res, { upsert: true }, (err, result) => {
          if (err) {
            return cb(err);
          }
        });
      });
    });

    threadsCollection.on('end', () => { db.close() });
  });
}
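
If you do want to be notified when everything has finished, one option is to count the operations that are still in flight and only close the connection and call back once the cursor has ended and every pending operation has completed. This is only a minimal sketch under the same assumptions as the code above (`eachThreadWithCallback` and `done` are illustrative names, not existing APIs):

// Minimal sketch: same logic as above, but with a single completion callback.
// Assumes the same `google`, `MongoClient` and `mongoUrl` as in the question.
function eachThreadWithCallback(auth, done) {
  var gmail = google.gmail('v1');

  MongoClient.connect(mongoUrl, (err, db) => {
    if (err) { return done(err); }

    var cursor = db.collection('threads').find();
    var contentsCollection = db.collection('threadContents');
    var pending = 0;      // Gmail/Mongo operations still in flight
    var ended = false;    // the cursor has emitted all documents
    var firstErr = null;  // remember the first error, keep draining

    function maybeFinish() {
      if (ended && pending === 0) {
        db.close();
        done(firstErr);
      }
    }

    cursor.on('data', (doc) => {
      pending++;
      gmail.users.threads.get({ auth: auth, 'userId': 'me', 'id': doc.id }, (err, res) => {
        if (err) {
          firstErr = firstErr || err;
          pending--;
          return maybeFinish();
        }
        contentsCollection.updateOne({ id: doc.id }, res, { upsert: true }, (err) => {
          firstErr = firstErr || err;
          pending--;
          maybeFinish();
        });
      });
    });

    cursor.on('end', () => {
      ended = true;
      maybeFinish();
    });
  });
}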
Stan
  • You added `cb` as a parameter. Is that basically a callback variable that doesn't need to be defined earlier, or do I need to define it somewhere? – Kunok Sep 01 '16 at 10:21
  • It is a `cb` for error handling. Here is an example: `eachThread('my auth', (err) => { if (err) { console.error(err) } })`. – Stan Sep 01 '16 at 10:28
  • I get this error: `TypeError: cb is not a function` – Kunok Sep 01 '16 at 10:30
  • Well if you want you can just remove the `cb(err)` everywhere so you don't have a callback: https://jsfiddle.net/3pjkuy9g/ – Stan Sep 01 '16 at 10:32
  • I updated the answer, it's on the `'end'` event. Also, I'm pretty sure it's not mandatory because MongoClient handles pooling for you. – Stan Sep 01 '16 at 10:44
  • Thank you. This is actually one of my very first Node scripts, so it was a bit difficult to digest. – Kunok Sep 01 '16 at 10:46

Now what I am asking, once a machine runs out of memory, will it break or will it continue executing callback functions whenever enough RAM is available again?

If you run out of memory, the OS will kill your process. On Linux you will see an OOM (Out of Memory) kill. So yes, it'll break.

In these scenarios, you may consider using either streams or generators so you keep in memory just the chunk of data you need to process.

In your case, MongoDB provides streams on the find method: https://mongodb.github.io/node-mongodb-native/2.0/tutorials/streams/

Something like this should work:

var collection = db.collection('threads');
var cursor = collection.find()

cursor.on('data', function(doc) {
  gmail
  .users
  .threads
  .get( {auth:auth,'userId':'me', 'id': doc.id}, function(err, resp) {
    ...
  })
})
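
One thing to keep in mind with the raw 'data' handler: the cursor can emit documents much faster than the Gmail API calls complete, so requests pile up. A hedged sketch of one way around that, assuming the 2.x native driver (whose cursor is a readable stream) and the same `db`, `gmail` and `auth` as in the question, is to pause the cursor while each thread is being processed:

var cursor = db.collection('threads').find();
var contents = db.collection('threadContents');

cursor.on('data', function(doc) {
  cursor.pause(); // stop emitting documents until this one is stored

  gmail
  .users
  .threads
  .get( {auth:auth,'userId':'me', 'id': doc.id}, function(err, resp) {
    if (err) { console.error(err); return cursor.resume(); }

    contents.updateOne({id: doc.id}, resp, {upsert: true}, function(err) {
      if (err) { console.error(err); }
      cursor.resume(); // ask for the next document
    });
  });
});

cursor.on('end', function() { db.close(); });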
Carlos Hernando

Replacing your for loop with async.mapLimit is enough to add the part-by-part functionality. I've also taken the liberty of moving the anotherCollection creation alongside collection, since opening it once is better than opening it hundreds if not thousands of times.

I've also replaced your assert.equal with callback(err). async's functions will understand that they should stop everything, and it allows you to exit cleanly instead of throwing an exception.

EDIT:

As @chernando remarked, using collection.find().toArray will import the whole collection into RAM. A better way of doing the part-by-part processing would be to stream the data, or to ask the DB to return the data in chunks (a rough sketch of the streamed variant follows the code below).

This version assumes you have enough RAM to get collection.find().toArray working without issue.

I will probably come back later with an adaptation of the tool I talked about in the comments, when I have the time.

var async = require('async');

function eachThread(auth) {
  var gmail = google.gmail('v1'),
      limit = 100; //Size of the parts

  MongoClient.connect(mongoUrl, function(err, db){
    assert.equal(null, err);
    var collection = db.collection('threads'),
        anotherCollection = db.collection('threadContents');
    // Find all data in collection and convert it to array
    collection.find().toArray(function(err, docs){
      assert.equal(null, err);
      var threadContents = [];
//Change here
      async.mapLimit(docs, limit, (doc, callback) => {
        gmail
        .users
        .threads
        .get( {auth:auth,'userId':'me', 'id':doc.id}, function(err, resp){
          if(err) {
            return callback(err);
          }
          threadContents.push(resp);
          console.log(threadContents.length);
          console.log(threadContents[threadContents.length - 1].id);
          anotherCollection.updateOne(
            {id: threadContents[threadContents.length - 1].id},
            threadContents[threadContents.length - 1],
            {upsert:true},
            function(err, result){
              if(err) {
                console.error(err);
              } else {
                console.log('updated one.');
              }
              callback(err);
          });
        });//end(callback(threads.get))
//Change here
      }, (error) => {
        if(error) {
          console.error('Transfer stopped because of error: ' + error);
        } else {
          console.log('Transfer successful');
        }
      });//end(async.mapLimit)
    });//end(find.toArray)
  });//end(callback(mongo.connect))
}//end(func(eachThread))
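
In the meantime, here is a rough sketch of what the streamed, part-by-part variant could look like. It is only a sketch under the same assumptions as the question (`google`, `MongoClient`, `mongoUrl`); `eachThreadStreamed` and `done` are illustrative names, and `async.queue` is used so that at most `limit` Gmail requests are in flight, with the cursor paused whenever all workers are busy:

var async = require('async');

// Rough sketch: stream the threads collection instead of toArray(),
// keeping at most `limit` Gmail requests running at any time.
function eachThreadStreamed(auth, done) {
  var gmail = google.gmail('v1'),
      limit = 100; // number of concurrent Gmail requests

  MongoClient.connect(mongoUrl, function(err, db) {
    if (err) { return done(err); }

    var cursor = db.collection('threads').find(),
        anotherCollection = db.collection('threadContents'),
        ended = false;

    // Worker: fetch one thread from Gmail and upsert it.
    var queue = async.queue(function(doc, callback) {
      gmail.users.threads.get({auth: auth, 'userId': 'me', 'id': doc.id}, function(err, resp) {
        if (err) { return callback(err); }
        anotherCollection.updateOne({id: doc.id}, resp, {upsert: true}, callback);
      });
    }, limit);

    // Backpressure: stop reading from Mongo while all workers are busy.
    queue.saturated = function() { cursor.pause(); };
    queue.unsaturated = function() { cursor.resume(); };
    queue.drain = function() {
      if (ended) {
        db.close();
        done(null);
      }
    };

    cursor.on('data', function(doc) {
      queue.push(doc, function(err) {
        if (err) { console.error(err); }
      });
    });

    cursor.on('end', function() {
      ended = true;
      if (queue.idle()) { queue.drain(); }
    });
  });
}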
DrakaSAN
  • Be aware that `collection.find().toArray` will consume your RAM eagerly. – Carlos Hernando Sep 01 '16 at 10:05
  • Ok, I will keep an eye on this answer until you come back later. – Kunok Sep 01 '16 at 10:19
  • @Kunok: My tool used `mongoose`, which has the options `skip` and `limit`, which I don't find in `mongo-client`. I have the choice between changing your code extensively or finding an alternative. I'll see later. – DrakaSAN Sep 01 '16 at 12:04
  • @DrakaSAN I decided not to use `mongoose` at this point because I work with raw objects here and I am not about to define models and schema. – Kunok Sep 01 '16 at 12:14
  • It's fine, I was just explaining why you will probably have to wait a long time before I get back to providing this :) – DrakaSAN Sep 01 '16 at 12:17