
I've seen plenty of robots.txt stuff, and some mod_rewrite solutions that looked promising… but haven't been able to find a simple solution to block spiders / scrapers / whoever the hell I want to block… I'd rather do this by hostname / domain, as it seems simpler than relying on user-agents, etc…

For example, say I were to see this in my Apache logs…

msnbot-207-46-192-48.search.msn.com - - [07/Dec/2011:23:01:41 -0500] "GET /%3f/$/bluebox/blog/2011/iphoto/ HTTP/1.1" 404 366

OK… I want to prevent *.search.msn.com from ever coming here, or to any of my sites - in any of my folders - vhost or otherwise…

Typically, I have MANY <VirtualHost *:80> blocks set up, and DO NOT want to have to repeat the config for each host… In that same vein, I have many DocumentRoots… and putting some file in each of them, aka .htaccess, really isn't an option…

I had been using something in httpd.conf that resembled…

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BadBot [OR]
RewriteRule ^(.*)$ http://go.away/

How can I use the hostnames provided by UseCanonicalName On to blanket-deny any domain I so desire?

mralexgray

2 Answers


It might not be the best idea to do it by host name, since Apache would have to do a reverse DNS lookup for each request.

Why not do it with iptables?
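
A minimal sketch of that idea, assuming you've identified the bot's source netblock (the 207.46.0.0/16 range below is only a guess based on the msnbot address in your log, so treat it as an example):

# drop the offending netblock at the firewall, before it ever reaches Apache
iptables -A INPUT -s 207.46.0.0/16 -p tcp --dport 80 -j DROP

The trade-off is that you're back to maintaining IP ranges by hand instead of matching on hostnames, but there's no per-request DNS cost.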

ckliborn

UseCanonicalName is for the server hostname, not the client's.

This will work just fine in your global config, outside of any VirtualHost, as long as you don't have an Order directive in the vhosts:

Order Allow,Deny
Allow from all
Deny from search.msn.com
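
Note that giving a hostname to Deny from makes Apache do a double-reverse DNS lookup on the client address for requests it applies to, regardless of your HostnameLookups setting - that's what makes the hostname match work. If you're on Apache 2.4 rather than 2.2, the mod_authz_host equivalent would look roughly like this (an untested sketch of the same rule):

<RequireAll>
    Require all granted
    Require not host search.msn.com
</RequireAll>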
Shane Madden