I am new to web crawling and I need some pointers about these two Node.js crawlers.
Aim: My aim is to crawl a website and obtain ONLY the internal (local) URLs within that domain. I am not interested in any page data or scraping, just the URLs.
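To illustrate, this is the kind of output I am after. A hypothetical filter that keeps only same-domain links (the base URL and the link list here are made up for the example, using Node's built-in `URL` class):

```javascript
// Hypothetical sketch: given links found on a page, keep only the
// internal ones (same hostname as the site being crawled).
const baseUrl = 'https://example.com/';

function isInternal(link, base = baseUrl) {
  const resolved = new URL(link, base); // also resolves relative links
  return resolved.hostname === new URL(base).hostname;
}

const found = ['/about', 'https://example.com/blog', 'https://other.org/x'];
const internal = found.filter((l) => isInternal(l));
console.log(internal); // keeps '/about' and the example.com link only
```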
My Confusion: When using node-crawler or simplecrawler, do they have to download the entire page before they return a response? Is there a way to just find a URL, ping it, or perhaps perform a GET request, and if it returns a 200 response, proceed to the next link without actually having to request the entire page data?
Is there any other Node.js crawler or spider that can request and log only URLs? My concern is to keep the crawl as lightweight as possible.
Thank you in advance.