
I have a NodeJS project with a BIG array (about 9000 elements) containing URLs. Those URLs are going to be requested using the request-promise package. However, 9000 concurrent GET requests to the same website from the same client are liked by neither the server nor the client, so I want to spread them out over time. I have looked around a bit and found Bluebird's Promise.map together with the {concurrency: int} option here, which sounded like it would do what I want. But I cannot get it to work. My code looks like this:

const rp = require('request-promise');
var MongoClient = require('mongodb').MongoClient;
var URLarray = []; //This contains 9000 URLs

function getWebsite(url) {
  rp(url)
  .then(html => { /* Do some stuff */ })
  .catch(err => { console.log(err) });
}

MongoClient.connect('mongodb://localhost:27017/some-database', function (err, client) {
  Promise.map(URLArray, (url) => {
    db.collection("some-collection").findOne({URL: url}, (err, data) => {
      if (err) throw err;
      
      getWebsite(url, (result) => {
        if(result != null) {
          console.log(result);
        }
      });
      
    }, {concurrency: 1});
});

I think I probably misunderstand how to deal with promises. In this scenario I would have thought that, with the concurrency option set to 1, each URL in the array would in turn be used in the database search and then passed as a parameter to getWebsite, whose result would be displayed in its callback function. THEN the next element in the array would be processed.

What actually happens is that a few (maybe 10) of the URLs are fetched correctly, then the server starts to respond sporadically with 500 internal server error. After a few seconds, my computer freezes and then restarts (which I guess is due to some kind of panic?).

How can I attack this problem?

  • 9000 requests? That's too many. I'd take a step back and consider if there's any more suitable approach - such as setting up an API on the other server that can respond with multiple batches of data at once. – CertainPerformance Feb 02 '21 at 19:20
  • Yes, it's ugly, but it's sort of a one-time web scrape and I have no problem with it taking a whole day to complete if I can manage to spread it out over time. But, as you say, there are probably other, better approaches that don't require all 9000 requests to be sent in one run. – Samuel Larsson Feb 02 '21 at 19:25
  • Ok, that's reasonable. Are tons of parallel calls of `findOne` a problem? – CertainPerformance Feb 02 '21 at 19:26
  • I'm not entirely sure what causes my computer to act the way it does. From what I can tell, the `rp` promise in `getWebsite` is what is producing the error messages. The database is local, so parallel `findOne` calls _shouldn't_ be the bottleneck, but it's possible. – Samuel Larsson Feb 02 '21 at 19:31

1 Answer


If the problem is really about concurrency, you can divide the work into chunks and chain the chunks.

Let's start with a function that does a mongo lookup and a get....

// answer a promise that resolves to data from mongo and a get from the web
// for a given url, return { mongoData, webData }
// (assuming this is what OP wants. the OP appears to discard the mongo result)
//
// `db` is assumed to be the connected database handle from MongoClient.connect
function lookupAndGet(url) {
  // use the promise-returning variant of findOne
  let result = {}
  return db.collection("some-collection").findOne({URL: url}).then(mongoData => {
    result.mongoData = mongoData
    return rp(url) 
  }).then(webData => {
    result.webData = webData
    return result
  })
}

lodash and underscore both offer a chunk method that breaks an array into an array of smaller arrays. Write your own (a minimal sketch follows the snippet below) or use theirs.

const _ = require('lodash')
let chunks = _.chunk(URLArray, 5)  // say 5 is a reasonable concurrency
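
If you'd rather not add a dependency just for this, a hand-rolled chunk helper is only a few lines (a minimal sketch; the name chunkArray is hypothetical and can be used in place of _.chunk above):

// split an array into consecutive slices of at most `size` elements
function chunkArray(array, size) {
  const out = []
  for (let i = 0; i < array.length; i += size) {
    out.push(array.slice(i, i + size))
  }
  return out
}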

Here's the point of the answer: make a chain of chunks so that only one chunk-sized batch of requests runs at a time...

let chain = chunks.reduce((acc, chunk) => {
  // wait for the previous chunks to finish before starting this one,
  // then append this chunk's results to the accumulated array of arrays
  return acc.then(results =>
    Promise.all(chunk.map(url => lookupAndGet(url)))
      .then(chunkResults => [...results, chunkResults])
  )
}, Promise.resolve([]))

Now execute the chain. The chunk promises will return chunk-sized arrays of results, so your reduced result will be an array of arrays. Fortunately, lodash and underscore both have a method to "flatten" the nested array.

// turn [ url, url, ...] into [ { mongoData, webData }, { mongoData, webData }, ...]
// running only 5 requests at a time
chain.then(results => {
  console.log(_.flatten(results))
}).catch(err => console.log(err))
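
If you prefer async/await, the same chunk-at-a-time behavior can be written as a plain loop (a sketch under the same assumptions, reusing lookupAndGet; the name runInChunks is just for illustration):

async function runInChunks(chunks) {
  const results = []
  for (const chunk of chunks) {
    // wait for the whole chunk to settle before starting the next one
    results.push(await Promise.all(chunk.map(url => lookupAndGet(url))))
  }
  return results // an array of arrays, flatten as above
}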
danh
  • Or use [Bluebird's Promise.map()](http://bluebirdjs.com/docs/api/promise.map.html) with `concurrency` option. It's slightly more sophisticated than chunking in that it works on a "one-in-one-out" basis, in which the number of in-flight requests (or whatever) is maintained at a constant level (until the final few). – Roamer-1888 Feb 03 '21 at 05:29
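
For reference, the Bluebird approach suggested in the comment above would look roughly like this (a sketch; it assumes the bluebird package is installed and reuses lookupAndGet):

const Promise = require('bluebird')

Promise.map(URLArray, url => lookupAndGet(url), { concurrency: 5 })
  .then(results => {
    // results is a flat array in the same order as URLArray
    console.log(results)
  })
  .catch(err => console.log(err))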