I am using Nutch 1.17 to crawl over one million websites. I need to do the following:
- Run the crawler once as a deep crawl so that it fetches as many URLs as possible from the given (1 million) domains. This first run can take up to 48 hours (see the first command sketch after this list).
- After that, re-run the crawler on the same 1 million domains every 5 to 6 hours, fetching only the URLs that are new on those domains.
- After each job completes, index the URLs in Solr.
- Later on there is no need to keep the raw HTML, so to save storage (HDFS) I want to remove only the raw data while keeping each page's metadata, so that the next job does not re-fetch a page again before its scheduled time (see the second sketch after this list).
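Roughly, the crawl cycles I have in mind look like this. This is only a sketch: the seed directory, crawl directory and round counts are placeholders, and I assume the Solr endpoint is already configured in conf/index-writers.xml so that `-i` can index after each round.

```
# Initial deep crawl: inject the ~1M seed domains and run several
# generate/fetch/parse/updatedb rounds, indexing into Solr after each round (-i).
# "urls/" (seed list), "crawl/" (crawl dir) and the round count are placeholders.
bin/crawl -i -s urls/ crawl/ 10

# Re-crawl a few hours later: the CrawlDb already knows every URL fetched so far,
# so only newly discovered URLs (and pages whose fetch interval has expired)
# should be generated for fetching.
bin/crawl -i -s urls/ crawl/ 2
```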
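For the storage point, this is the kind of thing I was considering, but I am not sure these are the right knobs, which is why I am asking question (c) below. The property names and the segment path are assumptions on my side.

```
# Option 1 (assumption): parse during fetching and skip storing the raw page
# content, so the segments never hold the HTML in the first place.
bin/crawl -i -s urls/ \
  -D fetcher.parse=true \
  -D fetcher.store.content=false \
  crawl/ 2

# Option 2: delete a segment once it has been indexed; the CrawlDb
# (crawl/crawldb) keeps the per-URL status, fetch time and signature that
# prevent premature re-fetching, so it must be kept.
hadoop fs -rm -r crawl/segments/<already-indexed-segment>
```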
There is no other processing or post-analysis. I have a medium-sized Hadoop cluster available (up to 30 machines), each with 16 GB RAM, 12 cores and 2 TB of storage; the Solr machine(s) have the same specs. Given the above, I am curious about the following:
a. How can I achieve the crawl rate described above, i.e., how many machines are enough?
b. Do I need to add more machines, or is there a better solution?
c. Is it possible to remove the raw data from Nutch and keep only the metadata?
d. Is there a best strategy to achieve the above objectives?