I am working on a scraper in Node.js. I use request to connect to the site, cheerio to access the data, and MongoDB to store the extracted data. I also use async.js to avoid infinite recursion.

I have a memory problem: my process keeps taking memory and never frees it. I think the problem is in MongoDB, because if I don't use MongoDB the memory usage remains stable.

This is my summarized code:

// Use function scrape_urls to process the urls
var q = self.asyn.queue(scrape_urls, 3);

//I push a bunch of urls ...    
for (var j = 0; j < self.urls_data.length; j++) {
    q.push(self.urls_data[j]);
}

q.drain = function () {
    console.log("END");
};

function scrape_urls(data_url, next_action) {
    request({
        method: 'GET',
        url: data_url.url
    }, function (err, response, body) {

        var $ = cheerio.load(body);
        var data = { /* ... scraped data ... */ };

        mongo_client.connect(connection_string, function (err, db) {

            // report the error and release the queue worker instead of stalling it
            if (err) { console.dir(err); return next_action(err); }

            var collection = db.collection('foo');

            collection.insert(data);

            next_action();

        });
    });
};

As I said, if I skip MongoDB and only connect to the URLs using request, the memory does not grow endlessly. I think that connecting to MongoDB is the problem.
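To compare the two cases, memory can be logged at a few points during the run. This is only a minimal sketch: `logMemory` is a hypothetical helper name, and it relies on nothing beyond Node's built-in `process.memoryUsage()`.

```javascript
// Hypothetical helper: log the resident set size (RSS) in MB with a label.
function logMemory(label) {
    var rssMb = process.memoryUsage().rss / (1024 * 1024);
    var line = label + ' rss=' + rssMb.toFixed(1) + 'MB';
    console.log(line);
    return line;
}

// Example: call it before filling the queue and again inside q.drain.
logMemory('before queue');
```

Calling it once per processed URL, with and without the MongoDB step, should show whether the growth follows the number of `connect` calls.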

Any ideas?

dlopezgonzalez
  • I see that you create a connection manually each time, but it seems that you do not close those connections. Can you try again with `mongo_client.close()` at the end of the process, before the `next_action()`? – Metux Dec 14 '15 at 16:31
  • I cannot close the connection on that line because the callback may not have completed yet. I am thinking of keeping the connection alive, but if the connection goes down I would have trouble reconnecting in my code. – dlopezgonzalez Dec 14 '15 at 16:50

1 Answer

Problem solved.

Here is the solution. I wrote a helper that reuses the connection and keeps only one open (after all, Node.js is single-threaded):

var MongoDbHelper = function (mongo_client, connection_string){  
    var self = this;

    this.mongo_client = mongo_client;       
    this.connection_string = connection_string;
    this.db = undefined;

    self.log = function (thread, str)
    {
        console.log(new Date().toISOString() + ' ' + process.memoryUsage().rss + ' [' + thread + '] ' + str);
    }   

    self.getcollection = function(collection_name, callback)
    {
        var collection = null;

        try
        {
            collection = self.db.collection(collection_name);       
        }
        catch(ex)
        {
            self.db = undefined;    
        }               

        // reconnecting if the connection is lost
        if(self.db == undefined)
        {
            self.mongo_client.connect(self.connection_string, function(err, db) {

                // on failure, report the error instead of dereferencing an undefined db
                if (err) { return callback(err); }

                self.db = db;
                callback(null, self.db, self.db.collection(collection_name));

            });                         
        }   
        else
        {           
            callback(null, self.db, collection);    
        }
    }

};

module.exports = MongoDbHelper
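
One caveat, offered as a sketch rather than as part of the original solution: with a queue concurrency of 3, several workers may call `getcollection` before the first `connect` has finished, and each would then open its own connection. A possible way around that is to queue callers while a connect is in flight. The `SharedConnection` name and the injected `connect` function are hypothetical, and the fake connect at the end only stands in for a real `mongo_client.connect` call.

```javascript
// Sketch (hypothetical names): hand the same connection to concurrent callers.
// `connect` is any function that takes a Node-style callback(err, db).
function SharedConnection(connect) {
    var db = null;       // cached handle once a connect has succeeded
    var waiting = null;  // callbacks queued while a connect is in flight

    this.get = function (callback) {
        if (db) { return callback(null, db); }
        if (waiting) { waiting.push(callback); return; }  // join the pending connect

        waiting = [callback];
        connect(function (err, handle) {
            var queued = waiting;
            waiting = null;
            db = err ? null : handle;
            queued.forEach(function (cb) { cb(err, db); });
        });
    };

    // Forget the cached handle so that the next get() reconnects.
    this.reset = function () { db = null; };
}

// Demo with a fake connect that counts how often it is called.
var connectCalls = 0;
var shared = new SharedConnection(function (done) {
    connectCalls++;
    setImmediate(function () { done(null, { name: 'fake-db' }); });
});

for (var i = 0; i < 3; i++) {
    shared.get(function (err, db) {
        console.log('got', db.name, 'after', connectCalls, 'connect call(s)');
    });
}
```

In the demo all three callers are served by a single connect call. With the real driver, the injected function would presumably wrap `mongo_client.connect(connection_string, done)`, and `reset()` could be called when an operation fails because the connection dropped.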
dlopezgonzalez