I am using the Crawler4j crawler to crawl some domains. I want to improve the crawler's efficiency so that it uses my full bandwidth and crawls as many URLs as possible in a given time period. To that end I am using the following settings:
- I have increased the number of crawler threads to 10 (by passing 10 as the thread count to `controller.start(MyCrawler.class, 10)`)
- I have reduced the politeness delay to 50 ms (using `crawlConfig.setPolitenessDelay(50);`)
- I have set the crawl depth to 2 (using `crawlConfig.setMaxDepthOfCrawling(2);`)
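For reference, this is roughly how I am wiring the configuration together, following the standard crawler4j setup (the crawler class name, seed URL, and storage folder below are placeholders, not my real values):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerLauncher {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder path
        config.setPolitenessDelay(50);              // 50 ms between requests to the same host
        config.setMaxDepthOfCrawling(2);            // seed pages are depth 0

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/"); // placeholder seed

        // MyCrawler extends WebCrawler; 10 is the number of crawler threads
        controller.start(MyCrawler.class, 10);
    }
}
```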
Now what I want to know is:
1) Are there any side effects of settings like these?
2) Is there anything else I should do, apart from this, to improve the crawler's speed?
3) Can someone tell me the maximum limits of each setting (e.g., the maximum number of threads crawler4j supports at a time)? I have already gone through the Crawler4j code but did not find any limits anywhere.
4) How can I crawl a domain without checking its robots.txt file? As I understand it, crawler4j first checks a domain's robots.txt file before crawling, and I don't want that.
5) How does the page fetcher work? (Please explain it briefly.)
Any help is appreciated, and please go easy on me if the question is stupid.