Is there any way to block crawler/spider search bots if they're not obeying the rules written in the robots.txt file? If yes, where can I find more info about it?
I would prefer an .htaccess rule; if that's not possible, then PHP.
There are ways to prevent most bots from spidering your site.
Aside from filtering by user agent and known IP addresses, you should also implement behaviour-driven blocking. That means: if it acts like a crawler, block it.
You can find multiple lists of search engine bots and their user agents online. But most of the big players obey robots.txt.
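Since you prefer .htaccess: a minimal mod_rewrite sketch that rejects requests by user agent (the bot names here are just examples, not a vetted list):

    # Return 403 Forbidden for user agents matching known crawler names
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|SemrushBot|MJ12bot) [NC]
    RewriteRule .* - [F,L]

Keep in mind the user agent header is trivial to fake, so this only stops the honest offenders.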
So the other, rather big part is blocking based on the bot's behaviour. Things get less complicated when you are using a framework like Laravel or Symfony, because you can easily set a filter to be executed before every page load. If not, you have to implement a function which is called before every page load yourself.
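If you are on plain PHP, one way to get such a "before every page load" hook (assuming Apache with mod_php; the file path is just a placeholder) is to let PHP prepend a check script to every request via .htaccess:

    # Run the bot check before every PHP script on this host
    php_value auto_prepend_file "/var/www/includes/bot_check.php"

The sketches below assume they live in such a file.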
Now there are some things to consider. A spider usually crawls as fast as it can. So you could use the session to measure the time between page loads and the number of page loads in a given time span. If some amount X is exceeded, the client is blocked.
Sadly, this approach relies on the bot handling sessions/cookies correctly, which may not always be the case.
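As a rough illustration, a session-based rate limit could look like this (the thresholds are made up, tune them to your traffic; and again, it only works for clients that keep the session cookie):

    <?php
    // bot_check.php -- naive per-session rate limit
    session_start();

    $window  = 60;   // measuring window in seconds
    $maxHits = 30;   // allowed page loads per window

    $now = time();
    if (!isset($_SESSION['window_start']) || $now - $_SESSION['window_start'] > $window) {
        $_SESSION['window_start'] = $now;  // start a new window
        $_SESSION['hits'] = 0;
    }

    if (++$_SESSION['hits'] > $maxHits) {
        http_response_code(429);           // Too Many Requests
        exit('Slow down.');
    }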
Another or an additional approach would be to measure the number of page loads from a given IP address. This is dangerous because a large number of users may share the same IP address (think of a corporate proxy or carrier-grade NAT), so it may lock out humans.
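A sketch of the per-IP variant, here using the APCu extension as a shared counter (assuming APCu is installed; the limits are again illustrative):

    <?php
    // Count requests per IP in a rolling 60-second window
    $key = 'hits_' . $_SERVER['REMOTE_ADDR'];

    apcu_add($key, 0, 60);   // create the counter with a 60 s lifetime
    $hits = apcu_inc($key);  // atomic increment

    if ($hits > 120) {       // roughly two requests per second sustained
        http_response_code(429);
        exit;
    }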
A third approach I can think of is to use some kind of honeypot. Create a link that leads to a specific page. That link has to be visible to programs, but not to humans. Hide it away with some CSS. If someone or something accesses the page via the hidden link, you can be (close to) sure it is a program. But be aware that there are browser addons which preload every link they can find, so you cannot rely on this completely.
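A rough honeypot sketch (the /trap.php URL is a made-up name; also list it in robots.txt with "Disallow: /trap.php" so well-behaved bots never touch it):

    <!-- in your page template: present in the HTML, hidden from humans -->
    <a href="/trap.php" style="display:none" rel="nofollow">do not follow</a>

    <?php
    // trap.php -- whatever requests this URL is very likely a bot.
    // Flag the IP so the checks above can refuse it next time.
    apcu_store('banned_' . $_SERVER['REMOTE_ADDR'], true, 86400); // 24 h ban
    http_response_code(403);

The session/IP checks would then also test apcu_fetch('banned_' . $_SERVER['REMOTE_ADDR']) before serving the page.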
Depending on the nature of your site, one last approach would be to hide the complete site behind a CAPTCHA. This is a harsh measure in terms of usability, so decide carefully whether it applies to your use case.
Then there are techniques like using Flash or complicated JavaScript most bots do not understand, but it's disgusting and I don't want to talk about it. ^^
Finally, let me come to a conclusion.
By using a well-written robots.txt, most robots will leave you alone. In addition to that, you should combine some or all of the approaches mentioned above to catch the bad guys.
After all, as long as your site is publicly available, you can never evade a custom-made bot tailored specifically to your site. If a browser can parse it, a robot can as well.
For a more useful answer, I would need to know what you are trying to hide and why.