I am new to web crawling and need some pointers on two Node.js crawlers, node-crawler and simplecrawler.

Aim: My aim is to crawl a website and obtain ONLY the internal (local) URLs within that domain. I am not interested in any page data or scraping. Just the URLs.

My Confusion: When using node-crawler or simplecrawler, do they have to download each page in full before they return a response? Is there a way to find a URL, ping it (for example with a GET request) and, if the response is 200, proceed to the next link without requesting the entire page data?

Is there any other NodeJS crawler or spider which can request and log only URLs? My concern is to make the crawl as lightweight as possible.

Thank you in advance.

Machiavelli

1 Answer

Crawling only the HTML pages of a website is usually a pretty lightweight process. Downloading the response bodies of those HTML pages is also necessary to crawl the site at all, since the HTML is what the crawler searches for additional URLs. A status code alone tells you that a URL exists, but not which links the page contains.
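To illustrate why the body is needed, here is a minimal sketch of link discovery using only Node built-ins. The helper name `extractInternalLinks` and the sample HTML are made up for this example and are not part of simplecrawler; the regular expression is a simplification, and a real crawler uses more robust discovery.

```javascript
// Sketch: collect same-host links from an HTML string.
// extractInternalLinks is a hypothetical helper, not a simplecrawler API.
function extractInternalLinks(html, pageUrl) {
    var base = new URL(pageUrl);
    // Simplified pattern: only quoted href attributes on <a> tags.
    var hrefPattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
    var found = new Set();
    var match;

    while ((match = hrefPattern.exec(html)) !== null) {
        var resolved;
        try {
            // Resolve relative links against the page URL.
            resolved = new URL(match[1], base);
        } catch (err) {
            continue; // skip hrefs that cannot be parsed as URLs
        }
        if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
            continue; // skip mailto:, javascript:, etc.
        }
        if (resolved.host === base.host) {
            resolved.hash = ""; // a fragment does not identify a separate page
            found.add(resolved.href);
        }
    }
    return Array.from(found);
}

var sampleHtml = '<a href="/about">About</a>' +
    '<a href="http://example.com/contact#form">Contact</a>' +
    '<a href="http://other.org/">Elsewhere</a>';

console.log(extractInternalLinks(sampleHtml, "http://example.com/"));
```

Without the HTML string there is nothing to pass to such a function, which is why a HEAD request or a status check cannot replace fetching the page.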

simplecrawler is configurable, so you can avoid downloading images and other non-HTML resources from a website. Here's a snippet that logs the URLs the crawler visits and skips image, video and archive files.

var Crawler = require("simplecrawler");
// moment is only used to timestamp the log output
var moment = require("moment");

var crawler = new Crawler("http://example.com");

function log() {
    var time = moment().format("HH:mm:ss");
    var args = Array.from(arguments);

    args.unshift(time);
    console.log.apply(console, args);
}

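// Don't download bodies of resources the crawler can't parse for links
crawler.downloadUnsupported = false;
// Convert response bodies to regular JavaScript strings
crawler.decodeResponses = true;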

crawler.addFetchCondition(function(queueItem) {
    return !queueItem.path.match(/\.(zip|jpe?g|png|mp4|gif)$/i);
});

crawler.on("crawlstart", function() {
    log("crawlstart");
});

crawler.on("fetchcomplete", function(queueItem, responseBuffer) {
    log("fetchcomplete", queueItem.url);
});

crawler.on("fetch404", function(queueItem, response) {
    log("fetch404", queueItem.url, response.statusCode);
});

crawler.on("fetcherror", function(queueItem, response) {
    log("fetcherror", queueItem.url, response.statusCode);
});

crawler.on("complete", function() {
    log("complete");
});

crawler.start();
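
The extension check used in the fetch condition above can also be tried out on its own. This is a minimal sketch, assuming a hypothetical helper `isSkippableAsset` that takes a full URL string; it tests only the `pathname` of the URL, so a query string after the file extension does not hide it.

```javascript
// Sketch: decide from a URL string whether it points at a file to skip.
// isSkippableAsset is a hypothetical helper for illustration only.
var SKIPPED_EXTENSIONS = /\.(zip|jpe?g|png|mp4|gif)$/i;

function isSkippableAsset(urlString) {
    // pathname excludes the query string and the fragment
    var pathname = new URL(urlString).pathname;
    return SKIPPED_EXTENSIONS.test(pathname);
}

console.log(isSkippableAsset("http://example.com/logo.PNG?v=2")); // true
console.log(isSkippableAsset("http://example.com/about"));        // false
```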
fredrikekelund