
I'm trying to block Yandex from my site. I've tried the solutions posted in other threads, but they are not working, so I'm wondering if I'm doing something wrong?

The user-agent string is:

    Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

I have tried the following (one at a time). RewriteEngine is on

    SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block
    Order Allow,Deny
    Deny from env=bad_bot_block
    Allow from ALL

    SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block
    <RequireAll>
    Require all granted
    Require not env bad_bot_block       
    </RequireAll>

Can anyone see a reason why neither of the above works, or suggest another approach?


2 Answers


In case anyone else has this problem, the following worked for me:

    RewriteCond %{HTTP_USER_AGENT} ^.*(yandex).*$ [NC]
    RewriteRule .* - [F,L]
user3052443
    Your regex can be simplified... `^.*(yandex).*$` is the same as simply `yandex` (no need for the capturing subpattern). And `.*` can be reduced to `^` (`.*` is comparatively inefficient since it forces a traversal of the entire URL-path.) The `L` flag is not required when used with `F` - it is _implied_. The `NC` flag on the _condition_ would not seem to be required (forcing a case-insensitive match is again less efficient). – MrWhite Jul 29 '22 at 09:15
  • Thanks once again. I will make those changes. Regarding the NC flag, there will be other bots to block and I'm not sure of the case used so I thought it safer to use that flag. I know I can look it up in the user-agent string but I'm trying to make this automatic in the code. – user3052443 Jul 29 '22 at 19:09
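
Putting those comments together, a trimmed-down version of the rule might look like this (an untested sketch; the `NC` flag is kept only because other, mixed-case bot names may be added later, as the follow-up comment notes):

    # Block any request whose User-Agent contains "yandex" (case-insensitive)
    RewriteCond %{HTTP_USER_AGENT} yandex [NC]
    # Return 403 Forbidden; the L flag is implied by F
    RewriteRule ^ - [F]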
    SetEnvIfNoCase User-Agent "^yandex.com$" bad_bot_block

With the start and end-of-string anchors in the regex you are basically checking that the User-Agent string is exactly equal to "yandex.com" (except that the "." matches any single character), which clearly does not match the stated user-agent string.

You need to check that the User-Agent header contains "YandexBot" (or "yandex.com"). You can also use a case-sensitive match here, since the real Yandex bot does not vary the case.

For example, try the following instead:

    SetEnvIf User-Agent "YandexBot" bad_bot_block

Consider using the BrowserMatch directive instead, which is a shortcut for SetEnvIf User-Agent.
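
For instance, the BrowserMatch equivalent of the line above might be (a sketch reusing the same env-var name):

    # BrowserMatch is shorthand for SetEnvIf User-Agent
    BrowserMatch "YandexBot" bad_bot_block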

If you are on Apache 2.4 then you should be using the Require (second) variant of your two code blocks. The Order, Deny and Allow directives are Apache 2.2 directives and are formally deprecated on Apache 2.4.
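
Combining the corrected SetEnvIf with the Require-based block from the question, the Apache 2.4 version might look something like this (a sketch, not tested against your particular config):

    SetEnvIf User-Agent "YandexBot" bad_bot_block

    <RequireAll>
        # Allow everyone except requests flagged above
        Require all granted
        Require not env bad_bot_block
    </RequireAll>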

However, consider using robots.txt instead to block crawling in the first place. Yandex supposedly supports robots.txt.
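
If you do go the robots.txt route, a minimal file that asks Yandex not to crawl anything might look like the following (assuming the bot honours it, which the question's author disputes in the comments below):

    User-agent: Yandex
    Disallow: /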

MrWhite
  • The .com you mentioned was the problem but I still couldn't get it to work with either of the methods I showed. Instead I used rewrite as shown and that worked. Thank you for pointing out the mistake. Also, regarding the robots file, yandex doesn't honor it. It was one of the first things I tried. – user3052443 Jul 28 '22 at 18:54
  • "but I still couldn't get it to work" - You likely have a conflict with other directives. For instance, if you have mod_rewrite directives that trigger a second (or more) pass(es) through the rewrite engine then the env var is renamed with a `REDIRECT_` prefix and you will need to check for `REDIRECT_bad_bot_block` instead. (Although this can often be avoided by modifying your existing directives, which may require Apache 2.4.) – MrWhite Jul 29 '22 at 09:12