
I have a website where:

  • We currently have a lot of bots from China, Ukraine, etc. trying to grab our content (it's a business directory)
  • 5% of IP addresses are "unresolved", according to AWStats

So my idea is to limit the number of HTTP requests per IP (except for well-known bots, for instance Googlebot):

  • That would solve my #1 problem (the bots)
  • But that would also block all "unresolved IP" traffic

-> is it a good idea to block all those "unresolved" IPs? Am I going to block some legitimate traffic?
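
To make this concrete, here is a rough sketch of the kind of per-IP throttling I have in mind - the window, limit, and bot list are placeholders, not a real configuration (and I know a User-Agent check alone can be spoofed):

```python
# Rough sketch of per-IP request throttling with an allowlist for known
# crawlers. All thresholds and names here are placeholder assumptions.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60                           # sliding window size (assumed)
MAX_REQUESTS = 120                            # max per IP per window (assumed)
ALLOWED_BOTS = ("Googlebot", "bingbot")       # well-known bots to exempt

recent = defaultdict(deque)                   # ip -> timestamps of recent requests

def should_block(ip, user_agent):
    """Return True if this request pushes the IP over the limit."""
    if any(bot in user_agent for bot in ALLOWED_BOTS):
        return False                          # never throttle exempted crawlers
    now = time.time()
    q = recent[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:  # evict timestamps outside the window
        q.popleft()
    return len(q) > MAX_REQUESTS
```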

3 Answers


Just limiting the number of HTTP requests per address would not result in blocking "unresolved" IP addresses.

Finding out which address is "unresolved" in real time would force reverse DNS lookups for every visitor at least once. This would increase your initial HTTP response times at best and create a near-DoS condition when name servers are unavailable and timing out - you really do not want that.
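
To illustrate (a minimal Python sketch, not something you would put in a request path): the lookup below blocks for as long as the OS resolver allows, and there is no easy way to cap that per call from the application.

```python
# Why per-request reverse DNS hurts: gethostbyaddr() is a blocking call
# whose worst case is set by the OS resolver's timeouts, not by your code.
import socket
import time

def reverse_lookup(ip):
    """Return the PTR hostname for an IP, or None if it is 'unresolved'."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except socket.herror:
        return None                   # no PTR record for this address

start = time.time()
print(reverse_lookup("8.8.8.8"))      # usually fast, e.g. "dns.google"
print("lookup took %.3fs" % (time.time() - start))
# With an unreachable name server the same call can hang for seconds;
# do that once per visitor and you have built the near-DoS yourself.
```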

In general, trying to protect publicly available content from being grabbed by bots is a Sisyphean task - you surely would not let every visitor pass a Turing test before admitting them to your site. Any of the available approaches would only be able to lower the load on your web servers, not prevent grabbing completely. Also, as with all statistics-based approaches for differentiation, reducing the number of bots being able to access your content inevitably would increase the number of regular human users inconvenienced by your blocking rules.

the-wabbit
  • Are you sure that finding a user's IP address requires asking a DNS server? That would find his domain name, not his IP. Those "unresolved IP addresses" are users for whom I have no IP address at all. – Julien Dubois May 15 '12 at 12:10
    @JulienDubois You cannot have a valid connection without knowing the client's IP address, so "finding" the IP address does not induce any additional effort - it is already exposed to the web server by the networking API. It is finding the appropriate DNS domain name via a [reverse DNS lookup](http://en.wikipedia.org/wiki/Reverse_DNS_lookup) which is problematic. – the-wabbit May 15 '12 at 12:33
  • OK, so back to my question: how come 5% of the IP addresses are "unresolved"? I mean, for those visitors there is no IP address at all written in the logs. – Julien Dubois May 15 '12 at 12:44

I don't know AWStats in detail, but I think the "unresolved" status applies to all IP addresses without a reverse DNS record. Blocking all traffic from IP addresses without a reverse record would kill a lot of normal visitors.

Try to block the bots selectively by inspecting your website logs. You can use fail2ban to block this traffic in an automated way. fail2ban is based on log-file analysis, so you only have to find a pattern in your access.log and configure fail2ban accordingly.
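
For illustration, this Python toy shows the mechanism fail2ban automates for you - the log path, regex and threshold are assumptions you would adapt to your own log format:

```python
# Toy version of what fail2ban automates: scan the access log, count hits
# per IP, and ban offenders at the firewall.
import re
import subprocess
from collections import Counter

LINE_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) ")  # client IP in common log format
THRESHOLD = 1000                                      # hits before banning (assumed)

counts = Counter()
with open("/var/log/apache2/access.log") as log:      # assumed log location
    for line in log:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1

for ip, hits in counts.items():
    if hits > THRESHOLD:
        # fail2ban would run a firewall action like this, and unban again later
        subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"])
```

fail2ban adds the parts this sketch lacks: incremental log scanning, time windows, and automatic unbanning.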

ercpe

Your proposal is predicated on the 5% of unresolved addresses being the same addresses that are stealing your content - but you don't say whether that's the case. Certainly I'd expect that you would block a lot of legitimate traffic.

I agree with most of what syneticon-dj says; however, there are more effective anti-leeching techniques (try googling that term): checking the Referer header, requiring a session ID, or using CSRF-style protection but passing the token in a cookie instead of a form field. This gives you a mechanism for identifying the leechers. In terms of blocking them, you really want to do it at as early a stage as possible, i.e. when you get a SYN packet from such an IP - that means blocking them on the firewall. Fail2ban provides a method for reconfiguring your firewall on the fly based on log entries. But do beware that long chains of iptables rules will affect latency and hence throughput.
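
As a sketch of the token-in-cookie idea (the secret and lifetime below are illustrative, not a recommendation): issue every browser a signed token on its first page view and require it on content pages; scrapers that ignore cookies never present a valid token and stand out immediately in the logs.

```python
# Sketch of the token-in-cookie idea: sign a timestamped token per visitor
# and require it on content pages. SECRET and max_age are assumed values.
import hashlib
import hmac
import time

SECRET = b"change-me"                 # assumed server-side secret

def make_token(ip):
    """Issue with the first page view, e.g. as a 'visitor' cookie."""
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, ("%s|%s" % (ip, ts)).encode(), hashlib.sha256).hexdigest()
    return ts + ":" + sig

def token_valid(ip, token, max_age=3600):
    """Check on every content request; repeated failures mark likely leechers."""
    ts, _, sig = token.partition(":")
    if not ts.isdigit():
        return False
    expected = hmac.new(SECRET, ("%s|%s" % (ip, ts)).encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and time.time() - int(ts) < max_age
```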

symcbean