I want to build a a webcrawler that goes randomly around the internet and puts broken (http statuscode 4xx) image links into a database.
So far I successfully build a scraper using the node packages request
and cheerio
. I understand the limitations are websites that dynamically create content, so I'm thinking to switch to puppeteer
. Making this as fast as possible would be nice, but is not necessary as the server should run indefinetely.
My biggest question: Where do I start to crawl?
I want the crawler to find random webpages recursively, that likely have content and might have broken links. Can someone help to find a smart approach to this problem?