I am using Nutch 1.17 to crawl over one million websites. I need to do the following:
- Run the crawler once as a deep crawl so that it fetches as many URLs as possible from the given (1 million) domains. This first run can take up to 48 hours (see the first command sketch after this list).
- After that, re-run the crawler on the same 1 million domains every 5 to 6 hours, fetching only the URLs that are new on those domains.
- After each job completes, index the URLs in Solr.
- Later on there is no need to keep the raw HTML, so to save storage (HDFS) I want to remove only the raw data while keeping each page's metadata, so that the next job does not re-fetch a page again before its scheduled time (see the second sketch after this list).
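Roughly, the crawl cycles I have in mind look like this. This is only a sketch: the seed directory, crawl directory and round counts are placeholders, and I assume the Solr endpoint is already configured in conf/index-writers.xml so that `-i` can index after each round.

```
# Initial deep crawl: inject the ~1M seed domains and run several
# generate/fetch/parse/updatedb rounds, indexing into Solr after each round (-i).
# "urls/" (seed list), "crawl/" (crawl dir) and the round count are placeholders.
bin/crawl -i -s urls/ crawl/ 10

# Re-crawl a few hours later: the CrawlDb already knows every URL fetched so far,
# so only newly discovered URLs (and pages whose fetch interval has expired)
# should be generated for fetching.
bin/crawl -i -s urls/ crawl/ 2
```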
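For the storage point, this is the kind of thing I was considering, but I am not sure these are the right knobs, which is why I am asking question (c) below. The property names and the segment path are assumptions on my side.

```
# Option 1 (assumption): parse during fetching and skip storing the raw page
# content, so the segments never hold the HTML in the first place.
bin/crawl -i -s urls/ \
  -D fetcher.parse=true \
  -D fetcher.store.content=false \
  crawl/ 2

# Option 2: delete a segment once it has been indexed; the CrawlDb
# (crawl/crawldb) keeps the per-URL status, fetch time and signature that
# prevent premature re-fetching, so it must be kept.
hadoop fs -rm -r crawl/segments/<already-indexed-segment>
```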
There is no other processing or post-analysis. I have a medium-sized Hadoop cluster available (up to 30 machines), each with 16 GB RAM, 12 cores and 2 TB of storage; the Solr machine(s) have the same specs. Given the above, I am curious about the following:
a. How can I achieve the crawl rate described above, i.e., how many machines are enough?
b. Do I need to add more machines, or is there a better solution?
c. Is it possible to remove the raw data from Nutch and keep only the metadata?
d. Is there a best strategy to achieve the above objectives?