
I wanted to use a crawler in Node.js to crawl all the links in a website (internal links) and get the title of each page. I saw the crawler plugin on npm; its docs show the following example:

var Crawler = require("crawler");

var c = new Crawler({
   maxConnections : 10,
   // This will be called for each crawled page
   callback : function (error, res, done) {
       if(error){
           console.log(error);
       }else{
           var $ = res.$;
           // $ is Cheerio by default
           //a lean implementation of core jQuery designed specifically for the server
           console.log($("title").text());
       }
       done();
   }
});

// Queue just one URL, with default callback
c.queue('http://balenol.com');

But what I really want is to crawl all the internal URLs in the site. Is that built into this plugin, or does it need to be written separately? I don't see any option in the plugin to visit all the links in a site. Is this possible?

Alexander Solonik

2 Answers


The following snippet crawls all URLs found on every page it visits.

const Crawler = require("crawler");

let visited = []; // URLs that have already been queued

let c = new Crawler();

function crawlAllUrls(url) {
    console.log(`Crawling ${url}`);
    c.queue({
        uri: url,
        callback: function (err, res, done) {
            if (err) {
                console.error(err);
                return done();
            }
            let $ = res.$;
            try {
                // Collect every <a> element and queue each unseen href
                $("a").each((i, el) => {
                    let href = $(el).attr("href");
                    if (!href) return;
                    href = href.trim();
                    if (!visited.includes(href)) {
                        visited.push(href);
                        // Slow down the crawl rate
                        setTimeout(function () {
                            // Absolute links are queued as-is; relative links are
                            // resolved against the current URL. The latter might need
                            // extra code to test that it is the same site and that
                            // the base is a full domain with no URI
                            href.startsWith('http') ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`);
                        }, 5000);
                    }
                });
            } catch (e) {
                console.error(`Encountered an error crawling ${url}. Aborting crawl.`);
            }
            done();
        }
    });
}

crawlAllUrls('https://github.com/evyatarmeged/');
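
As a side note, the setTimeout above is a hand-rolled throttle; the crawler package also documents a rateLimit option (the minimum gap in milliseconds between two requests), which should achieve the same thing more cleanly:

// Alternative sketch: let the crawler throttle itself instead of using setTimeout
let c = new Crawler({
    rateLimit: 5000 // minimum time gap between two requests, in ms
});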
Evya
  • Works nicely, but it's worth noting that this script will crawl all external URLs referenced in the page too – Mike Nov 20 '18 at 13:18
  • Edit it to your pleasure :) – Evya Nov 20 '18 at 16:21
  • Thank you for this. Web crawlers in general will not work for sites whose content only loads after the JavaScript has been parsed and executed, such as Angular sites, etc. – Craig Wayne Oct 27 '22 at 13:25

In the above code, just change the following line so that only the internal links of a website are crawled...

from

href.startsWith('http') ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`)

to

href.startsWith(url) ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`)
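
Note that a startsWith check can still miss same-site links (e.g. an http:// link to an https:// site, or a root-relative path like /about). A more robust sketch, using Node's built-in URL class (the isInternal helper here is hypothetical, not part of the original answer), resolves each href against the current page and compares hostnames:

const { URL } = require("url"); // also available as a global in modern Node

// Hypothetical helper: true if href resolves to the same host as baseUrl
function isInternal(href, baseUrl) {
    try {
        // new URL(href, base) resolves relative hrefs like "/about"
        const resolved = new URL(href, baseUrl);
        return resolved.hostname === new URL(baseUrl).hostname;
    } catch (e) {
        return false; // unparseable hrefs are treated as external and skipped
    }
}

The queueing line would then become something like if (isInternal(href, url)) crawlAllUrls(new URL(href, url).href); — this also filters out mailto: and javascript: links, since their hostname never matches the site's.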