
I have a for-loop in a program I am running with Node.js. The function is x() from the x-ray package, and I am using it to scrape data from a webpage and then write that data to a file. The program is successful when used to scrape ~100 pages, but I need to scrape ~10000 pages. When I try to scrape a very large number of pages, the files are created but they do not hold any data. I believe this problem exists because the for-loop is not waiting for x() to return the data before moving on to the next iteration.

Is there a way to make node wait for the x() function to complete before moving on to the next iteration?

//takes in file of urls, 1 on each line, and splits them into an array. 
//Then scrapes webpages and writes content to a file named for the pmid number that represents the study
 
//split urls into arrays
var fs = require('fs');
var array = fs.readFileSync('Desktop/formatted_urls.txt').toString().split("\n");


var Xray = require('x-ray');
var x = new Xray();
 
for (var i in array) {
    // get the unique number and url from the array to be put into the text file name
    var number = array[i].substring(35);
    var url = array[i];

    // use the .write function of x from x-ray to write the info to a file
    x(url, 'css selectors').write('filepath' + number + '.txt');
}

Note: Some of the pages I am scraping do not return any value.

Hannah Murphy
    Promises are very helpful here. One of the most popular libraries is called Bluebird. – Jared Dykstra Nov 17 '15 at 03:46
    I agree with Jared Dykstra. You want to 1) get rid of the loop body, and structure it as a "promise". 2) Set a counter up front, e.g. `ct = array.length`, before the first call, 3) keep calling yourself until the counter decrements to 0. – paulsm4 Nov 17 '15 at 03:51

2 Answers


You can't make a for loop wait for an async operation to complete. To solve this type of problem, you have to do a manual iteration and you need to hook into a completion function for the async operation. Here's the general outline of how that would work:

var index = 0;
function next() {
    if (index < array.length) {
        var url = array[index];
        x(url, ....)(function(err, data) {
            ++index;
            next();
        });
    }
}
next();

Or, perhaps this:

var index = 0;
function next() {
    if (index < array.length) {
        var url = array[index];
        var number = url.substring(35);
        x(url, 'css selectors').write('filepath' + number + '.txt').on('end', function() {
            ++index;
            next();
        });
    }
}
next();
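
Since the question notes that some pages return no value, you may also want to guard against a failed download stalling the chain. Here's a minimal sketch; I don't know the x-ray library, so this assumes the stream returned by .write() emits a standard 'error' event like other Node streams:

var index = 0;
function next() {
    if (index < array.length) {
        var url = array[index];
        var number = url.substring(35);
        var stream = x(url, 'css selectors').write('filepath' + number + '.txt');
        stream.on('end', function() {
            ++index;
            next();
        });
        // assumption: the stream emits 'error' like a standard Node stream;
        // advance anyway so one bad page doesn't stop the whole run
        stream.on('error', function(err) {
            console.error('Failed on ' + url + ': ' + err.message);
            ++index;
            next();
        });
    }
}
next();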
jfriend00
  • the x().write() returns a writeStream, so you need to iterate to the next item when the `end` event is emitted: `x(url, 'css selectors').write('filepath').on('end', function() { next() })` – Sean Nov 17 '15 at 03:36
  • @Sean - I don't know that library, but think I found the doc and edited my answer to use a different form. Presumably the OP can adapt this general form to their specific needs. – jfriend00 Nov 17 '15 at 03:38
  • Thank you, do I change 'end' to a specification or is that a keyword? – Hannah Murphy Nov 17 '15 at 05:17
  • This only works for one iteration of the function, do I customize 'end', or is there another problem? – Hannah Murphy Nov 17 '15 at 05:22
  • @HannahMurphy - you add your own logic to use the `index` variable and your array to vary the query for each iteration. It will iterate once for each element in the array. I don't know the x-ray library myself, but you can only make any of this work if you can get a notification callback when the async operation is done. Sean (see earlier comment) suggested that the `end` event on the stream was the way to do that. I don't know that library so that is something you will have to figure out. This answer show you the structure for iterating an async operation one at a time. – jfriend00 Nov 17 '15 at 06:11

The problem with your code is that you're not waiting for the files to be written to the file system. Rather than downloading and processing the pages strictly one by one, a better approach is to start all of the downloads in one go and then wait for them all to complete.

One of the recommended libraries for dealing with promises in Node.js is Bluebird.

http://bluebirdjs.com/docs/getting-started.html

In the updated sample below, we iterate through all of the urls and start each download, keeping track of the promises; each promise is resolved once its file has been written. Finally, we wait for all of the promises to resolve using Promise.all().

Here's the updated code:

var promises = [];
var getDownloadPromise = function(url, number){
    return new Promise(function(resolve){
        x(url, 'css selectors').write('filepath' + number + '.txt').on('finish', function(){
            console.log('Completed ' + url);
            resolve();
        });
    });
};

for (var i in array) {
    var number = array[i].substring(35);
    var url = array[i];

    promises.push(getDownloadPromise(url, number));
}

Promise.all(promises).then(function(){
    console.log('All urls have been completed');
});
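
Note that as written, the promise for a page that fails to download never settles, so Promise.all() would wait forever on a bad page. Here's a sketch of a variant that also settles on stream errors; this assumes the stream returned by .write() emits a standard 'error' event like other Node streams:

var getDownloadPromise = function(url, number){
    return new Promise(function(resolve, reject){
        x(url, 'css selectors').write('filepath' + number + '.txt')
            .on('finish', function(){
                console.log('Completed ' + url);
                resolve();
            })
            // assumption: the stream emits 'error' like a standard Node stream
            .on('error', function(err){
                reject(err);
            });
    });
};

Keep in mind that Promise.all() rejects as soon as any promise rejects; if you would rather skip failed pages and keep going, call resolve() instead of reject(err) in the error handler.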
Don
  • This works, thank you so much! The only issue, which isn't the code's fault, is that it throws an ECONNRESET error when I pass in a very large array. Any ideas on how to avoid this? – Hannah Murphy Nov 18 '15 at 04:52
    This might be an indication that the request is taking too long to process and has timed out. You may need to increase the request timeout. Either way, you need to trap the exceptions on your server to view the details. Use app.use(function(err, req, res, next){...}); if using Express. – Don Nov 18 '15 at 05:30
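
If the resets happen because all ~10000 requests are started at once, capping the number of concurrent downloads may also help. Here's a sketch using Bluebird's Promise.map with its concurrency option (the limit of 10 is arbitrary, and getDownloadPromise is the function from the answer above):

var Promise = require('bluebird');

// start at most 10 downloads at a time instead of all of them at once
Promise.map(array, function(line) {
    return getDownloadPromise(line, line.substring(35));
}, { concurrency: 10 }).then(function() {
    console.log('All urls have been completed');
});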