
I have built a multi-threaded web crawler that makes requests to fetch web pages from the corresponding servers. Because it is multi-threaded, it can overburden a server, which may cause the server to block the crawler (politeness).

I want to add a minimum delay between consecutive requests to the same server. Would storing the minimum delay from each server's (domain's) robots.txt in a HashMap and comparing it against the time of the last request made to that server be enough?
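Roughly, this is what I have in mind (a minimal sketch only; the class and method names are placeholders, and the default delay would be supplied by the caller):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Rough sketch of the per-domain delay idea; all names are placeholders.
    public class PolitenessTracker {

        private static class DomainState {
            long minDelayMillis;    // from robots.txt, or 0 if none was given
            long lastRequestMillis; // when the last request to this domain was sent
        }

        private final Map<String, DomainState> perDomain = new ConcurrentHashMap<>();

        // Called after parsing robots.txt for a domain.
        public void setMinDelay(String domain, long delayMillis) {
            DomainState state = perDomain.computeIfAbsent(domain, d -> new DomainState());
            synchronized (state) {
                state.minDelayMillis = delayMillis;
            }
        }

        // Blocks the calling thread until enough time has passed since the last
        // request to this domain, then records the new request time.
        public void acquire(String domain, long defaultDelayMillis) throws InterruptedException {
            DomainState state = perDomain.computeIfAbsent(domain, d -> new DomainState());
            synchronized (state) {
                long delay = state.minDelayMillis > 0 ? state.minDelayMillis : defaultDelayMillis;
                long earliest = state.lastRequestMillis + delay;
                long now = System.currentTimeMillis();
                if (earliest > now) {
                    Thread.sleep(earliest - now);
                }
                state.lastRequestMillis = System.currentTimeMillis();
            }
        }
    }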

What if no delay is specified in robots.txt?

Prannoy Mittal

2 Answers


The de facto standard robots.txt file format doesn't specify a delay between requests; "Crawl-delay" is a non-standard extension.

The absence of a "Crawl-delay" directive does not mean that you are free to hammer the server as hard as you like.


Would storing the minimum delay from each server's (domain's) robots.txt in a HashMap and comparing it against the time of the last request made to that server be enough?

That is not sufficient. You also need a minimum time between requests for the cases where robots.txt doesn't use the non-standard directive, and you should respect "Retry-After" headers in 503 responses.
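Purely as an illustration, a sketch of that fallback-plus-Retry-After idea could look like this (using java.net.http as one possible client; the one-second default is an arbitrary choice, and the HTTP-date form of Retry-After is ignored here):

    import java.net.http.HttpResponse;

    // Sketch: fall back to a fixed minimum delay when robots.txt has no
    // Crawl-delay, and back off according to Retry-After on a 503.
    public class DelayPolicy {
        private static final long DEFAULT_DELAY_MILLIS = 1000; // arbitrary fallback

        public static long delayAfterResponse(HttpResponse<?> response) {
            if (response.statusCode() == 503) {
                // Retry-After may be a number of seconds (it can also be an
                // HTTP date, which this sketch does not handle).
                return response.headers().firstValue("Retry-After")
                        .map(v -> {
                            try {
                                return Long.parseLong(v.trim()) * 1000;
                            } catch (NumberFormatException e) {
                                return DEFAULT_DELAY_MILLIS;
                            }
                        })
                        .orElse(DEFAULT_DELAY_MILLIS);
            }
            return DEFAULT_DELAY_MILLIS;
        }
    }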

Ideally you should also pay attention to the time taken to respond to a request. A slow response is a potential indication of congestion or server overload, and a site admin is more likely to block your crawler if it is perceived to be the cause of that congestion.
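One simple heuristic, again only a sketch, is to scale the wait by how long the server took to answer (the multiplier and cap below are arbitrary):

    // Sketch of a simple adaptive heuristic: the slower the server responds,
    // the longer we wait before the next request.
    public static long adaptiveDelayMillis(long baseDelayMillis, long responseTimeMillis) {
        long scaled = responseTimeMillis * 2;           // back off harder on slow responses
        long delay = Math.max(baseDelayMillis, scaled);
        return Math.min(delay, 60_000);                 // never wait more than a minute
    }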

Stephen C

I use 0.5 seconds as the delay in my web crawler. Use that as the default, and if a delay is specified in robots.txt, use that instead.
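For example, a minimal sketch of that fallback (it ignores which User-agent group the directive appears in, which a proper parser should take into account):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: extract Crawl-delay (in seconds) from robots.txt text,
    // defaulting to 0.5 seconds when it is absent or unparseable.
    public class CrawlDelay {
        private static final long DEFAULT_MILLIS = 500;
        private static final Pattern CRAWL_DELAY =
                Pattern.compile("(?im)^\\s*Crawl-delay\\s*:\\s*([0-9]+(?:\\.[0-9]+)?)\\s*$");

        public static long delayMillis(String robotsTxt) {
            Matcher m = CRAWL_DELAY.matcher(robotsTxt);
            if (m.find()) {
                try {
                    return (long) (Double.parseDouble(m.group(1)) * 1000);
                } catch (NumberFormatException e) {
                    return DEFAULT_MILLIS;
                }
            }
            return DEFAULT_MILLIS;
        }
    }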