I want to stop bingbot completely and immediately.

I'd like to do this using mod_rewrite in .htaccess.

I've got these rules ...

Options +FollowSymLinks 
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT}  ^bingbot/.*         [OR]
RewriteCond %{HTTP_USER_AGENT}  ^Bingbot/.*         [OR]
RewriteRule ^(.*)$ http://go.away/                  [L]

... but they're not working. What I can see in my logs is this type of entry ...

msnbot-207-46-195-224.search.msn.com - - [11/Jul/2011:15:07:27 -0700] "GET /index.php?url_mainnav=13&url_subnav=131&url_expand=394,949,4631&url_startrow=110 HTTP/1.1" 403 502 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

... I've tried numerous variations on the regex for HTTP_USER_AGENT but I can't get the response I want, so I presume that the actual structure of the rules I'm using is incorrect.

Can anyone point me in the right direction?

By the way, I know this sort of thing is much better done in iptables etc., and I also know about robots.txt. It's shared hosting, so I don't have control of iptables, and I don't want to wait the six to eight hours for bingbot to reread robots.txt.


Well, things are moving forward. Taking the answer into account, I changed the rewrite rules to:

Options +FollowSymLinks 
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT}  ^bingbot/.*             [OR,NC]
RewriteCond %{HTTP_USER_AGENT}  .*bingbot/.*            [OR]
RewriteCond %{HTTP_USER_AGENT}  .*Bingbot/.*            [OR]
RewriteRule ^(.*)$ http://go.away/                      [L]

The entries for bingbot are still appearing in the access log, but this has made me realise that (I think) I'm misinterpreting the HTTP response codes shown in the logs. It seems that 403 is 'Forbidden', so perhaps my rule here is doing what I want (telling bingbot to go away) but the request is still getting logged? I thought the log would not reflect requests that were pushed away by mod_rewrite. I'd be interested if anyone can comment, as I'm still not 100% sure that I'm getting rid of the accesses by bingbot.

glaucon
  • But you are willing to wait an unspecified amount of time for an answer on ServerFault. How curious. – womble Jul 11 '11 at 23:05

1 Answer

Well, the regex in your RewriteCond demands that the User Agent start with bingbot. That's what the ^ in the regex does.

^bingbot/.*

Since the User Agent (from your log example) doesn't start with that, it won't match and skips the Rule.

"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

Remove the ^ and it should work, though I've not tested.
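
As a rough, equally untested sketch of that change applied to the rules from the question: the pattern below is unanchored, so it can match `bingbot/` anywhere in the User-Agent string. The `http://go.away/` target is just the question's own placeholder, and dropping the trailing `[OR]` is my assumption about what was intended, since no further condition follows it.

```apache
Options +FollowSymLinks
RewriteEngine on
# Unanchored pattern: matches "bingbot/" anywhere in the User-Agent string,
# e.g. in "Mozilla/5.0 (compatible; bingbot/2.0; ...)".
RewriteCond %{HTTP_USER_AGENT}  bingbot/
# No [OR] on the final RewriteCond: there is no following condition to combine with.
RewriteRule ^(.*)$ http://go.away/  [L]
```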

A tip: you can remove duplication from your RewriteConds by making the match case-insensitive with the [NC] option.

RewriteCond %{HTTP_USER_AGENT}  bingbot/         [NC]
Martijn Heemels
  • OK, thanks for that, but now I'm a bit puzzled: if (as you correctly point out) the UA starts with "Mozilla...", shouldn't the regex *not* start with a caret? I mean, doesn't the caret indicate the start of the user agent string? – glaucon Jul 11 '11 at 23:02
  • @glaucon Just use these lines `RewriteCond %{HTTP_USER_AGENT} bingbot [NC]` `RewriteRule . - [F,L]` and you should be fine (it will show Apache's 403 error). – LazyOne Jul 11 '11 at 23:12
  • Exactly, the caret at the beginning of your regex indicates that you require 'bingbot' to be the first characters in the string. This obviously doesn't match "Mozilla...", which means the condition is not met, and the Rule is not executed. – Martijn Heemels Jul 12 '11 at 20:43
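
For reference, the two lines from LazyOne's comment can be laid out as a complete `.htaccess` fragment, roughly as follows. This is a minimal, untested sketch. Using `^` instead of `.` as the rule pattern is my own assumption, made so that a request for the directory itself (an empty path in `.htaccess` context) is matched as well.

```apache
RewriteEngine on
# Case-insensitive, unanchored match anywhere in the User-Agent header.
RewriteCond %{HTTP_USER_AGENT} bingbot [NC]
# "-" means no substitution; [F] answers with 403 Forbidden (and implies [L]).
RewriteRule ^ - [F]
```

A request refused this way is still written to the access log, with status 403, so bingbot entries will keep appearing there even when the rule is working.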