I am looking at using mod_evasive and/or mod_throttle to prevent abusive access to my web site (running Apache 2.4). By "abusive" I mean, for example, using wget or HTTrack to download the whole site. Both mod_evasive and mod_throttle have ways to limit the number of page accesses a user can make per unit of time, so I could, say, limit an IP address to 5 pages every 10 minutes or something like that.
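
For concreteness, here is a rough sketch of the kind of limit I mean, using mod_evasive directive names (the numbers are just placeholders, and mod_evasive counts its intervals in seconds, so a literal "5 pages per 10 minutes" may not map cleanly):

    # At most 5 hits on the same page per 10-second interval from one IP,
    # and at most 50 hits site-wide per interval; block offenders for 10 minutes.
    DOSPageCount        5
    DOSPageInterval     10
    DOSSiteCount        50
    DOSSiteInterval     10
    DOSBlockingPeriod   600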

However, I would like to allow search robots to exceed the limit.

So, there seem to be two options:

(1) I can somehow submit pages individually to search engines. So, I block robots from the site, but just send them pages explicitly whenever a page gets updated (can I do that?).

(2) Whitelist particular robots somehow. The problem here is that I won't know the IP address of the robot ahead of time.

What approach should be used?

Tyler Durden

1 Answer

The whitelist need not be IP-based: mod_qos can match on the User-Agent header, so you can exempt requests that identify themselves as the crawlers you care about.
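
As a sketch only (check the directive names against the mod_qos documentation for your version; the user-agent patterns are illustrative): mark every request with an environment variable, clear the mark for the crawlers you want to exempt, and rate-limit only the marked requests.

    # Mark every request, then clear the mark for known crawlers,
    # so only the remaining traffic is throttled.
    SetEnvIf        Request_URI  .                          QS_Limit=yes
    SetEnvIfNoCase  User-Agent   (googlebot|bingbot|slurp)  !QS_Limit
    # Delay requests still carrying QS_Limit to roughly 5 per second (server-wide).
    QS_EventPerSecLimit QS_Limit 5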

This won't stop anyone from pretending to be Googlebot, but it will slow down the people who leave wget's default user agent in place.

If downloads still seem excessive, try to detect spoofed user agents by analyzing your request logs, using the search engines' webmaster tools and their known IP addresses to tell genuine crawlers from impostors. How much time you spend on this depends on how valuable your web server resources are and how much you want to keep the entire site from being mirrored.
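
As an illustration of one such check (my own sketch, not part of mod_qos): the major search engines document that their crawlers can be verified by a reverse DNS lookup on the requesting IP followed by a forward lookup on the resulting hostname. Something along these lines could be run against the IPs in your logs that claim a crawler user agent:

    import socket

    # PTR suffixes the major crawlers are documented to use; extend as needed.
    CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")

    def is_genuine_crawler(ip):
        """Reverse-resolve the IP, check the hostname suffix, then
        forward-resolve the hostname and confirm it maps back to the IP."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        if not hostname.endswith(CRAWLER_DOMAINS):
            return False
        try:
            _, _, addresses = socket.gethostbyname_ex(hostname)
        except socket.gaierror:
            return False
        return ip in addresses

    if __name__ == "__main__":
        # Example: an address taken from the access log with a Googlebot user agent
        print(is_genuine_crawler("66.249.66.1"))

Anything that claims to be a crawler but fails the check is a candidate for the ordinary rate limit, or an outright block.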

John Mahowald