
I need to download ~50k webpages, extract some data from them, and store it in a variable.

I wrap each request in a Promise and then Promise.all() them. I use the Request library.

Simplified code:

const request = require('request');
const urls = [url1, url2, ...];
const promises = [];

urls.forEach(url => {
    promises.push((resolve, reject) => {
        request(url, (error, response, body) => {
            if(error){ reject(error); return; }

            // do something with page

            resolve(someData);
        });
    });
});

Promise.all(promises.map(pr => new Promise(pr)))
    .then((someDataArray) => { /* process data */ });

But I receive an ENFILE exception, which means there are too many open files in the system (on my desktop the maximum number of open files is 2048).

I know that Promises start executing on creation, but I can't solve this problem.

Maybe there is another approach to do this? Thanks for any response.

Ted Romanus
  • You can try this using async.forEachLimit, where you can define the limit on the number of requests. It will execute the next batch of limited requests once the previous batch is complete. https://caolan.github.io/async/docs.html – Aman Gupta May 09 '17 at 11:55
  • Why are there 3 answers using the same library? Wouldn't one be enough? – Denys Séguret May 09 '17 at 12:12

5 Answers


What you want is to launch N requests then start a new one whenever one finishes (be it successful or not).

There are many libraries for that but it's important to be able to implement this kind of limitation yourself:

const request = require('request');
const urls = [url1, url2, ...];
const MAX_QUERIES = 10;
var remaining = urls.length;

function startQuery(url){
    if (!url) return;
    request(url, (error, response, body) => {
        if (error) {
            // handle error
        } else {
            // handle result
        }
        startQuery(urls.shift());           // launch the next query, if any remain
        if (--remaining === 0) allFinished();
    });
}

// prime the pool with MAX_QUERIES parallel requests
for (var i = 0; i < MAX_QUERIES; i++) startQuery(urls.shift());

function allFinished(){
    // all done
}
Denys Séguret

You can try this using async.forEachLimit, where you define a limit on the number of requests. It never runs more than that many requests at the same time, starting a new one as soon as an earlier one finishes.

Install the package using npm install --save async

const async = require('async');
const request = require('request');

async.forEachLimit(urls, 50, function(url, callback) {
    // process each url using the request module, then signal completion
    request(url, (error, response, body) => {
        // ... extract the data you need from body here ...
        callback(error);
    });
}, function(err) {
    if (err) return console.error(err);
    console.log("All urls are processed");
});

For further help, see: https://caolan.github.io/async/docs.html

Aman Gupta

Others have said how to do the flow control using async or promises, and I won't repeat them. Personally, I prefer the async JS method but that's just my preference.

There are two things they did not cover, however, which I think are just as important as flow control if you want your script to be performant and reliable.

1) Don't rely on the callbacks or promises to handle processing the files. All examples provided so far use those. Myself, I would use the request streams API instead to treat the request as a readable stream and pipe that stream to a writable that processes it. The simplest example is to use fs to write the file to the file system. This makes much better use of your system resources, as it writes each data chunk to storage as it comes in rather than having to hold the whole file in memory. You can then call the callback or resolve the promise when the stream ends (see the first sketch after these two points).

2) You should not try to process an in-memory list of 50k URLs. If you do and you fail on, say, the 20,000th URL, you then have to figure out how to sort the finished ones from the unfinished ones and update your code or the JSON file you read them from. Instead, use a database (any will do) that has a collection/table/whatever of URLs and metadata about them. When your program runs, query for the ones that don't have the attributes indicating they have been successfully fetched, and when a fetch succeeds or fails, use that same data structure to record why it failed or when it succeeded (see the second sketch below).
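
A minimal sketch of the streaming idea from point 1, assuming you simply want each page written to a file on disk; the function name and output path are illustrative, not something from the original post:

const fs = require('fs');
const request = require('request');

// Stream one URL straight to disk instead of buffering the whole body in memory.
function fetchToFile(url, outPath, done) {
    request(url)
        .on('error', done)                        // network-level errors
        .pipe(fs.createWriteStream(outPath))
        .on('error', done)                        // file-system errors
        .on('finish', () => done(null, outPath)); // called once the file is fully written
}

You can call fetchToFile from any of the concurrency limiters shown in the other answers and invoke your callback (or resolve your promise) from done.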
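
And a rough sketch of point 2, here using the sqlite3 module purely as an example; the table layout and column names are assumptions, and any database would work just as well:

const sqlite3 = require('sqlite3');
const db = new sqlite3.Database('pages.db');

// Assumed schema: urls(url TEXT PRIMARY KEY, fetched INTEGER DEFAULT 0, error TEXT)
function loadPendingUrls(cb) {
    // only the URLs that have not been fetched successfully yet
    db.all('SELECT url FROM urls WHERE fetched = 0', cb);
}

function markResult(url, err) {
    if (err) {
        db.run('UPDATE urls SET error = ? WHERE url = ?', [String(err), url]);
    } else {
        db.run('UPDATE urls SET fetched = 1, error = NULL WHERE url = ?', [url]);
    }
}

Re-running the script then picks up exactly where the previous run left off.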

Paul

Install the async package and use forEachLimit to limit the number of concurrent operations.

const request = require('request');
const async = require("async");

const urls = [];
for (var temp = 0; temp < 1024; temp++) {
  urls.push("http://www.google.com");
}

var i = 0;
// at most 10 requests are in flight at any time
async.forEachLimit(urls, 10, function(url, callback) {
  request(url, (error, response, body) => {
    if (error) {
      callback(error);
      return;
    }

    var somedata = null; // extract whatever you need from body here
    console.log(++i);
    callback(null, somedata); // forEachLimit ignores the result; use mapLimit if you need to collect it
  });
}, function(err) {
  /* process data */
});
Vishnu

As said in the comments, you could use the async.js module:

const request = require('request');
const async = require('async');

var listOfUrls = [url1, url2, ...];

async.mapLimit(listOfUrls, 10, function(url, callback) {
  // iterator function: fetch one url and hand its data to the callback
  request(url, function(error, response, body) {
    if (!error && response.statusCode == 200) {
      var dataFromPage = ""; // get data from the page
      callback(null, dataFromPage);
    } else {
      callback(error || response.statusCode);
    }
  });
}, function(err, results) {
  // completion function
  if (!err) {
    // process all results in the array here
    // Do something with the data
  } else {
    // handle error here
    console.log(err);
  }
});

Here you will process 10 URLs at a time; when all URLs have been processed, the completion callback is called, and there you can process your data.