I am looking at using mod_evasive and/or mod_throttle to prevent abusive access to my web site (running Apache 2.4). By "abusive", I mean, for example, using wget or HTTrack to download the entire site. Both mod_evasive and mod_throttle can limit the number of page requests a client makes per unit of time, so I could, for example, limit an IP address to 5 pages every 10 minutes.
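For reference, here is a minimal sketch of the kind of mod_evasive configuration I have in mind. The exact numbers are just placeholders, and the module name inside `IfModule` (mod_evasive20.c vs. mod_evasive24.c) and the config file path vary depending on how the package was built, so treat this as an illustration rather than a drop-in config:

```apache
# e.g. /etc/apache2/mods-available/evasive.conf (path varies by distro)
<IfModule mod_evasive24.c>
    # Per-URI threshold: block an IP that requests the *same* page
    # more than DOSPageCount times within DOSPageInterval seconds.
    DOSPageCount        5
    DOSPageInterval     600

    # Per-site threshold: counts *any* object (pages, images, CSS),
    # so it needs to be set much higher than the page threshold.
    DOSSiteCount        100
    DOSSiteInterval     600

    # How long (in seconds) an offending IP stays blocked.
    DOSBlockingPeriod   600

    DOSLogDir           /var/log/mod_evasive
</IfModule>
```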
However, I would like to allow search robots to exceed the limit.
So, there seem to be two options:
(1) Submit pages to search engines individually. I would block robots from crawling the site, but push pages to the search engines explicitly whenever a page is updated (is that even possible?).
(2) Whitelist particular robots somehow (see the sketch after this list). The problem is that I won't know a robot's IP address ahead of time.
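To make the options concrete: for (1), blocking well-behaved robots would just be a robots.txt rule like this (it only stops crawlers that honour robots.txt, which wget and HTTrack can be told to ignore):

```
# robots.txt -- ask all crawlers to stay out of the whole site
User-agent: *
Disallow: /
```

For (2), mod_evasive does have a DOSWhitelist directive that exempts addresses from the limits, with wildcards allowed in the trailing octets, but it is IP-based, which is exactly my problem. The ranges below are only examples of the syntax, not a verified list of any crawler's addresses:

```apache
<IfModule mod_evasive24.c>
    # Exempt specific crawler address ranges from the rate limits.
    # These ranges are illustrative only -- each search engine publishes
    # (and occasionally changes) its own crawler IP ranges.
    DOSWhitelist    66.249.64.*
    DOSWhitelist    66.249.65.*
</IfModule>
```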
Which approach should I use?