I am using the crawler4j crawler to crawl some domains. Now I want to improve the crawler's efficiency: I want it to use my full bandwidth and crawl as many URLs as possible in a given time period. To that end I am using the following settings:

  • I have increased the number of crawler threads to 10 (using controller.start(MyCrawler.class, 10))
  • I have reduced the politeness delay to 50 ms (using CrawlConfig.setPolitenessDelay(50))
  • I have set the crawl depth to 2 (using CrawlConfig.setMaxDepthOfCrawling(2))
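Put together, the three settings above amount to a controller setup along these lines. This is a sketch against the crawler4j 4.x API; the storage folder, seed URL, and the MyCrawler class (your WebCrawler subclass) are placeholders you would replace with your own:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerSetup {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder path
        config.setPolitenessDelay(50);              // 50 ms between requests to the same host
        config.setMaxDepthOfCrawling(2);            // seed pages are depth 0

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://example.com/");  // placeholder seed
        controller.start(MyCrawler.class, 10);      // 10 crawler threads
    }
}
```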

What I want to know is:

1) Are there any side effects of these settings?

2) Is there anything else I can do to improve the crawler's speed?

3) Can someone tell me the maximum limits of each setting (e.g. the maximum number of threads crawler4j supports at a time)? I have already gone through the crawler4j code but did not find any limits anywhere.

4) How can I crawl a domain without checking its robots.txt file? I understand that crawler4j checks a domain's robots.txt file before crawling it, and I don't want that.

5) How does the page fetcher work (please explain it briefly)?

Any help is appreciated, and please go easy on me if the question is stupid.

Sudhir kumar

1 Answer

I'll try my best to help you here. I can't guarantee correctness or completeness.

  1. b) Reducing the politeness delay will put more load on the site being crawled and can (on small servers) increase response times in the long run. But this is not a common problem nowadays, so 50 ms should still be fine. Also note that if it takes 250 ms to receive the response from the web server, it will still take 250 ms before that thread can crawl the next page.

    c) I am not quite sure what you want to achieve by setting the crawl depth to 2. E.g. a crawl depth of 1 means you crawl the seed, then crawl every page found on the seed, and then stop (crawlDepth = 2 just goes one step further, and so on). This will not influence your crawl speed, only your crawl time and the number of pages found.

  2. Do not implement time-heavy actions within the crawler thread or in any of the methods/classes it calls. Do them at the end of the crawl or in a separate thread.
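One common way to follow this advice is to hand slow work (e.g. writing pages to disk or a database) to a background executor so the crawler thread returns immediately. The sketch below is hypothetical and not part of crawler4j's API: visit() stands in for the WebCrawler.visit(Page) callback, and saveToDisk() is a placeholder for your real persistence code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class OffloadExample {
    // One background worker that handles the slow persistence work.
    static final ExecutorService ioPool = Executors.newFixedThreadPool(1);

    // Called by the crawler thread; only enqueues the heavy work and returns.
    static Future<String> visit(String pageUrl, String html) {
        return ioPool.submit(() -> saveToDisk(pageUrl, html));
    }

    // Placeholder for real database/file I/O.
    static String saveToDisk(String url, String html) {
        return "saved " + url + " (" + html.length() + " chars)";
    }

    public static void main(String[] args) throws Exception {
        Future<String> result = visit("http://example.com/", "<html></html>");
        System.out.println(result.get()); // prints: saved http://example.com/ (13 chars)
        ioPool.shutdown();
        ioPool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The crawler thread only pays the cost of submitting the task; the actual I/O happens on the pool thread.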

  3. There are no limits imposed by the crawler configuration itself. Limits will be set by your CPU (not likely) or by the structure of the site being crawled (very likely).

  4. Add this line to your CrawlController setup: robotstxtConfig.setEnabled(false);

It should look like this now:

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
  5. The page fetcher sets some parameters and then sends an HTTP GET request, with those parameters, to the web server at the given URL. The response from the web server is then evaluated, and some information, such as the response headers and the HTML code in binary form, is saved.
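That fetch cycle (set parameters, send GET, evaluate and keep status/headers/binary body) can be illustrated with the JDK's own HttpClient rather than crawler4j's internal PageFetcher; the code below is a conceptual sketch, not crawler4j's implementation. It starts a tiny local server so the example is self-contained:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchSketch {
    public static void main(String[] args) throws Exception {
        // Tiny local server standing in for the remote site.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            byte[] body = "<html>hello</html>".getBytes();
            exchange.getResponseHeaders().set("Content-Type", "text/html");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
        int port = server.getAddress().getPort();

        // 1) Set request parameters, 2) send the GET request...
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:" + port + "/"))
                .header("User-Agent", "my-crawler/1.0")
                .GET().build();
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        // ...3) evaluate the response: status, headers, and body in binary form.
        System.out.println("status=" + response.statusCode());
        System.out.println("contentType="
                + response.headers().firstValue("Content-Type").orElse(""));
        System.out.println("bodyBytes=" + response.body().length);
        server.stop(0);
    }
}
```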

Hope I could help you a bit.

Tobias K.
  • Thanks Tobias for your answer. You have answered most of my questions and they are working well. But I did not understand your answer to 2). Can you explain it more clearly? – Sudhir kumar Oct 07 '14 at 06:49
  • Explanation of 1.c: Yes, what you said is absolutely right. How fast we finish crawling a domain depends on the crawl depth. That's why I mentioned it! – Sudhir kumar Oct 07 '14 at 06:56
  • When you start crawling, your crawler will initialize itself, get some URLs from the database, start fetching one, parse it, and at some point reach the visit(Page page) method in your Crawler class. You should not implement anything like I/O or other long-running code anywhere in this crawling process, because it will block your crawler thread. Likewise, you should not add complex objects to the classes the crawler uses internally, such as WebURL or Page. – Tobias K. Oct 07 '14 at 07:41