
I am trying to return the names of all the files (in each existing folder) on a particular website. My web application uses Node.js, Express, Cheerio, and Request for web scraping. My code first gets a list of all the folder names. It then requests each folder to get a list of its file names and stores them in the 'files' array. Finally, the 'files' array is sent to the client side.

Right now I have a big problem with asynchronous code: my request always returns an empty 'files' list. I have the Q node module installed and have tried using promises, but have had no luck getting the results I want. I am still new to Node.js and would love it if someone could help me out. :)

exports.getAllImages = function(req, res) {
    var folders = [];
    var files = [];

    //Step 1: Get folder names and store all of them in the 'folders' array
    var foldersUrl = 'http://students.washington.edu/jmzhwng/Images/';
    request(foldersUrl, function(error, response, html){
        if(!error){
            var $ = cheerio.load(html);
            $("a:contains('-')").filter(function(){
                var data = $(this)[0].attribs.href;
                folders.push(data);
            })

            //Step 2: Using the 'folders' array, get file names in each folder and store all of them in the 'files' array
            for (var i=0; i < folders.length; i++) {
                var imagesUrl = 'http://students.washington.edu/jmzhwng/Images/' + folders[i];
                request(imagesUrl, function(error, response, html){
                    if(!error){
                        var $ = cheerio.load(html);
                        $("a:contains('.')").filter(function(){
                            var data = $(this)[0].attribs.href;
                            files.push(data);
                        })
                    }
                })
            }

            //Step 3: Return all file names to client-side
            res.json({
                images: files
            }, 200);
            console.log('GET ALL IMAGES - ' + JSON.stringify(files));
        }
    })
};

For better readability or support, you can view the JSFiddle I created here: http://jsfiddle.net/fKGrm/

user3314402

1 Answer


You don’t necessarily need promises for this; you’re 95% of the way there already without them. The main issue, as I think you’re aware, is that your response is sent before the image requests come back. You just need to wait for those to finish before you send the response.

The most basic way is to count the callbacks you receive in Step 2. When the count equals folders.length, every folder request has finished, so you can send your response.

Here’s a simplified version of that:

var request = require('request'),
    cheerio = require('cheerio');

var baseUrl = 'http://students.washington.edu/jmzhwng/Images/';

var files = [];

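request(baseUrl, function (error, res, body) {
  // The comma keeps `count` inside this `var` statement, so it is not an implicit global
  var folders = folderLinks(cheerio.load(body)),
      count = 0;

  folders.forEach(function (folder) {
    request(baseUrl + folder, function (error, res, body) {
      // Append every file link found in this folder
      files.push.apply(files, fileLinks(cheerio.load(body)));

      // The last callback to arrive means all folder requests are done
      if (++count === folders.length) {
        console.log(files);
      }
    });
  });
});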

function folderLinks ($) {
  return $('a:contains(-)').get().map(function (a) {
    return a.attribs.href;
  });
}

function fileLinks ($) {
  return $('a:contains(.)').get().map(function (a) {
    return a.attribs.href;
  });
}
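
The counter above leaves out error handling, and it never fires when there are no folders at all. A minimal sketch of the same counting idea with those two cases covered could look like the following. Note that collectAll, worker, and done are hypothetical names introduced only for this illustration; they are not part of request, cheerio, or the code above.

```javascript
// Hypothetical helper (an assumption for illustration, not a library API):
// runs `worker` on every item in parallel and calls `done` exactly once.
function collectAll(items, worker, done) {
  var results = new Array(items.length);
  var pending = items.length;
  var finished = false;

  // With nothing to wait for, no callback would ever fire, so finish right away.
  if (pending === 0) {
    return done(null, []);
  }

  items.forEach(function (item, index) {
    worker(item, function (error, value) {
      if (finished) return;      // ignore callbacks that arrive after a failure
      if (error) {
        finished = true;
        return done(error);      // report the first error only
      }
      results[index] = value;    // store by index so output follows input order
      if (--pending === 0) {
        finished = true;
        done(null, results);
      }
    });
  });
}

// Usage sketch, assuming request, cheerio, baseUrl, folders and fileLinks
// are defined as in the code above:
//
// collectAll(folders, function (folder, cb) {
//   request(baseUrl + folder, function (error, res, body) {
//     if (error) return cb(error);
//     cb(null, fileLinks(cheerio.load(body)));
//   });
// }, function (error, lists) {
//   if (error) return console.error(error);
//   console.log([].concat.apply([], lists)); // flatten into one list of files
// });
```

Storing each result by index rather than pushing in arrival order is a design choice: it keeps the files grouped in the same order as the folders, whichever request happens to finish first.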
Todd Yandell
  • Thank you so much for your help, Todd! Your clean and simplified version works perfectly. You are amazing!!! – user3314402 May 09 '14 at 20:58
  • You’re welcome! One more thing: Remember to add error checking if you use this in production. I left it out for clarity. Also if you find yourself writing this pattern often, try [run-parallel](http://npmjs.org/package/run-parallel) or [batch](http://npmjs.org/package/batch). – Todd Yandell May 09 '14 at 21:05
  • Sounds great! Alright, I will. Thanks again for all your help! Saved me countless hours – user3314402 May 09 '14 at 22:51