
I have built a multi-threaded web crawler that makes requests to fetch web pages from the corresponding servers. Because it is multi-threaded, it can overburden a server, which may cause the server to block the crawler (politeness).

I want to add a minimum delay between consecutive requests to the same server. Would storing the minimum delay from each server's (domain's) robots.txt in a HashMap and comparing it against the time of the last request made to that server be enough?
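Roughly, this is what I have in mind (a minimal sketch only; the class and method names are placeholders, and the default delay would be supplied by the caller):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Rough sketch of the per-domain delay idea; all names are placeholders.
    public class PolitenessTracker {

        private static class DomainState {
            long minDelayMillis;    // from robots.txt, or 0 if none was given
            long lastRequestMillis; // when the last request to this domain was sent
        }

        private final Map<String, DomainState> perDomain = new ConcurrentHashMap<>();

        // Called after parsing robots.txt for a domain.
        public void setMinDelay(String domain, long delayMillis) {
            DomainState state = perDomain.computeIfAbsent(domain, d -> new DomainState());
            synchronized (state) {
                state.minDelayMillis = delayMillis;
            }
        }

        // Blocks the calling thread until enough time has passed since the last
        // request to this domain, then records the new request time.
        public void acquire(String domain, long defaultDelayMillis) throws InterruptedException {
            DomainState state = perDomain.computeIfAbsent(domain, d -> new DomainState());
            synchronized (state) {
                long delay = state.minDelayMillis > 0 ? state.minDelayMillis : defaultDelayMillis;
                long earliest = state.lastRequestMillis + delay;
                long now = System.currentTimeMillis();
                if (earliest > now) {
                    Thread.sleep(earliest - now);
                }
                state.lastRequestMillis = System.currentTimeMillis();
            }
        }
    }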

What if no delay is specified in robots.txt?

Prannoy Mittal

2 Answers


The de facto standard robots.txt file format doesn't specify a delay between requests; "Crawl-delay" is a non-standard extension.

The absence of a "Crawl-delay" directive does not mean that you are free to hammer the server as hard as you like.


Would storing the minimum delay from each server's (domain's) robots.txt in a HashMap and comparing it against the time of the last request made to that server be enough?

That is not sufficient. You also need a minimum time between requests for the cases where robots.txt doesn't use the non-standard directive, and you should respect "Retry-After" headers in 503 responses.
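Purely as an illustration, a sketch of that fallback-plus-Retry-After idea could look like this (using java.net.http as one possible client; the one-second default is an arbitrary choice, and the HTTP-date form of Retry-After is ignored here):

    import java.net.http.HttpResponse;

    // Sketch: fall back to a fixed minimum delay when robots.txt has no
    // Crawl-delay, and back off according to Retry-After on a 503.
    public class DelayPolicy {
        private static final long DEFAULT_DELAY_MILLIS = 1000; // arbitrary fallback

        public static long delayAfterResponse(HttpResponse<?> response) {
            if (response.statusCode() == 503) {
                // Retry-After may be a number of seconds (it can also be an
                // HTTP date, which this sketch does not handle).
                return response.headers().firstValue("Retry-After")
                        .map(v -> {
                            try {
                                return Long.parseLong(v.trim()) * 1000;
                            } catch (NumberFormatException e) {
                                return DEFAULT_DELAY_MILLIS;
                            }
                        })
                        .orElse(DEFAULT_DELAY_MILLIS);
            }
            return DEFAULT_DELAY_MILLIS;
        }
    }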

Ideally you should also pay attention to the time taken to respond to a request. A slow response is a potential indication of congestion or server overload, and a site admin is more likely to block your crawler if it is perceived to be the cause of that congestion.
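One simple heuristic, again only a sketch, is to scale the wait by how long the server took to answer (the multiplier and cap below are arbitrary):

    // Sketch of a simple adaptive heuristic: the slower the server responds,
    // the longer we wait before the next request.
    public static long adaptiveDelayMillis(long baseDelayMillis, long responseTimeMillis) {
        long scaled = responseTimeMillis * 2;           // back off harder on slow responses
        long delay = Math.max(baseDelayMillis, scaled);
        return Math.min(delay, 60_000);                 // never wait more than a minute
    }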

Stephen C

I use 0.5 seconds as the delay in my web crawler. Use that as the default, and if a delay is specified in robots.txt, use that instead.
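For example, a minimal sketch of that fallback (it ignores which User-agent group the directive appears in, which a proper parser should take into account):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: extract Crawl-delay (in seconds) from robots.txt text,
    // defaulting to 0.5 seconds when it is absent or unparseable.
    public class CrawlDelay {
        private static final long DEFAULT_MILLIS = 500;
        private static final Pattern CRAWL_DELAY =
                Pattern.compile("(?im)^\\s*Crawl-delay\\s*:\\s*([0-9]+(?:\\.[0-9]+)?)\\s*$");

        public static long delayMillis(String robotsTxt) {
            Matcher m = CRAWL_DELAY.matcher(robotsTxt);
            if (m.find()) {
                try {
                    return (long) (Double.parseDouble(m.group(1)) * 1000);
                } catch (NumberFormatException e) {
                    return DEFAULT_MILLIS;
                }
            }
            return DEFAULT_MILLIS;
        }
    }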