
I want to build a web crawler that goes randomly around the internet and puts broken (HTTP status code 4xx) image links into a database.

So far I have successfully built a scraper using the node packages request and cheerio. I understand its limitation is websites that create their content dynamically, so I'm thinking of switching to puppeteer. Making this as fast as possible would be nice, but is not necessary as the server should run indefinitely.
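
To illustrate, here is a minimal sketch of the kind of check I mean, using request and cheerio (the start URL is only a placeholder):

```js
// Fetch a page, collect <img> sources with cheerio, and send a HEAD request
// for each one to see whether it answers with a 4xx status code.
const request = require('request');
const cheerio = require('cheerio');

const pageUrl = 'http://example.com'; // placeholder start page

request(pageUrl, (err, res, body) => {
  if (err) return console.error(err);
  const $ = cheerio.load(body);
  $('img').each((_, el) => {
    const src = $(el).attr('src');
    if (!src || src.startsWith('data:')) return; // skip inline images
    const imgUrl = new URL(src, pageUrl).href;   // resolve relative paths
    request.head(imgUrl, (headErr, headRes) => {
      if (!headErr && headRes.statusCode >= 400 && headRes.statusCode < 500) {
        console.log('broken image:', imgUrl, headRes.statusCode);
        // this is where the link would go into the database
      }
    });
  });
});
```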

My biggest question: Where do I start to crawl?

I want the crawler to recursively find random webpages that likely have content and might have broken links. Can someone help me find a smart approach to this problem?

  • There is no way to decide whether a website has broken links without actually visiting it. With Puppeteer you can first go to Google, search for some random keywords, and then use the search results to crawl random webpages. – Aman Gupta Sep 28 '19 at 13:39
  • That approach would probably give me many duplicate hosts, as Google shows me popular and SEO-compliant results, yet I am specifically looking for outdated, old and fringe content :-/ – Matthias Pitscher Sep 28 '19 at 14:19
  • Where to start crawling depends on your use case. Why do you want to find broken images and what is your goal? To give you the big picture: Google has more than 130 trillion pages indexed ([source from 2016](https://searchengineland.com/googles-search-indexes-hits-130-trillion-pages-documents-263378)). Assuming you get a similarly big list and can crawl 100 pages per second, you would need about 40,000 years (!!) to crawl them all. – Thomas Dondorf Sep 28 '19 at 14:21
  • The goal is to create a database of missing images (with hosts as diverse as possible). Google apparently has almost 2 billion sites indexed for "horse", yet decides to show me only the 341 most relevant... – Matthias Pitscher Sep 28 '19 at 14:44
  • @MatthiasPitscher I thought there might be a more specific use case... Added an answer. – Thomas Dondorf Sep 28 '19 at 20:51
  • Broken links now live at https://missing.pictures – Matthias Pitscher Oct 03 '20 at 23:21

3 Answers


List of Domains

In general, the following services provide lists of domain names:

  • Alexa Top 1 Million: top-1m.csv.zip (free)
    CSV file containing 1 million rows with the most visited websites according to Alexa's algorithms (see the sketch below this list for turning it into a seed list)
  • Verisign: Top-Level Domain Zone File Information (free IIRC)
    You can ask Verisign directly via the linked page to give you their list of .com and .net domains. You have to fill out a form to request the data. If I recall correctly, the list is provided free of charge for research purposes (and possibly others), but it might take several weeks until you get approval.
  • whoisxmlapi.com: All Registered Domains (requires payment)
    The company sells all kinds of lists containing information regarding domain names, registrars, IPs, etc.
  • premiumdrops.com: Domain Zone lists (requires payment)
    Similar to the previous one, you can get lists of different domain TLDs.
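
For example, a minimal sketch of turning the Alexa CSV into a seed list in Node (assuming the file has already been downloaded and unzipped next to the script):

```js
// Rows in top-1m.csv look like "1,google.com": rank, then domain.
const fs = require('fs');

const seeds = fs
  .readFileSync('top-1m.csv', 'utf8')
  .split('\n')
  .filter(Boolean)
  .map((line) => 'http://' + line.split(',')[1]);

console.log(seeds.length, 'seed URLs, e.g.', seeds[0]);
```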

Crawling Approach

In general, I would assume that the older a website is, the more likely it is to contain broken images (but that is already a bold assumption in itself). So, you could try to crawl older websites first if you use a list that contains the date when each domain was registered. In addition, you can speed up the crawling process by using multiple instances of puppeteer.
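
A rough sketch of how the puppeteer part might look: open a page and record every image response that comes back with a 4xx status (the start URL is just a placeholder):

```js
const puppeteer = require('puppeteer');

// Visit one URL and collect all image responses that return a 4xx status code.
async function findBrokenImages(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const broken = [];

  page.on('response', (response) => {
    const isImage = response.request().resourceType() === 'image';
    if (isImage && response.status() >= 400 && response.status() < 500) {
      broken.push({ page: url, image: response.url(), status: response.status() });
    }
  });

  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
  await browser.close();
  return broken;
}

findBrokenImages('http://example.com').then(console.log).catch(console.error);
```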

To give you a rough idea of the crawling speed: if your server can crawl 5 websites per second (which requires 10-20 parallel browser instances, assuming 2-4 seconds per page), you would need roughly two days for 1 million pages (1,000,000 / 5 / 60 / 60 / 24 = 2.3).
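
To run several browser instances in parallel, a simple worker pool is enough. The sketch below reuses the findBrokenImages() helper from the snippet above and assumes seeds is any list of start URLs (for example the Alexa list):

```js
// Run `workers` crawl loops concurrently, each pulling URLs from a shared queue.
async function crawlAll(seeds, workers = 15) {
  const queue = [...seeds];

  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        const broken = await findBrokenImages(url);
        if (broken.length > 0) console.log(broken); // or write them to your database
      } catch (err) {
        console.error('failed to crawl', url, err.message);
      }
    }
  }

  await Promise.all(Array.from({ length: workers }, () => worker()));
}
```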

Thomas Dondorf

I don't know if that's what you're looking for, but this website renders a new random website whenever you click the New Random Website button; it might be useful if you can scrape it with Puppeteer.

Stelrin

I recently had this question myself and was able to solve it with the help of this post. To clarify what other people have said previously, you can get lists of websites from various sources. Thomas Dondorf's suggestion to use Verisign's TLD zone file information is now outdated, as I learned when I tried contacting them. Instead, you should look at ICANN's CZDS (Centralized Zone Data Service). It lets you request TLD zone file information for any TLD, not just .com and .net, allowing you to potentially crawl more websites. In terms of crawling, as you said, Puppeteer would be a great choice.