Recently I found that some people are trying to mirror my website. They are doing it in two ways:
1. Pretending to be Google spiders. The access logs look like this:
89.85.93.235 - - [05/May/2015:20:23:16 +0800] "GET /robots.txt HTTP/1.0" 444 0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "66.249.79.138"
79.85.93.235 - - [05/May/2015:20:23:34 +0800] "GET /robots.txt HTTP/1.0" 444 0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "66.249.79.154"
The http_x_forwarded_for addresses are Google addresses.
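For reference, those lines match nginx's default "main" log_format, where the last quoted field is $http_x_forwarded_for (shown here for clarity; my actual format string may differ slightly):

log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for"';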
2. Pretending to be a normal web browser.
I'm trying to use the configuration below to block their access.
For problem 1, I check the X-Forwarded-For header: if the user agent claims to be a spider and X-Forwarded-For is not empty, the request is blocked (a genuine crawler hits my server directly, so the header should be empty; the mirror proxies the crawler's requests and puts the crawler's IP into X-Forwarded-For). I'm using:
# $xf is 1 when X-Forwarded-For is present, 0 when it is empty
map $http_x_forwarded_for $xf {
    default 1;
    "" 0;
}
# spider-like UAs take the value of $xf: blocked only when X-Forwarded-For is set
map $http_user_agent $fakebots {
    default 0;
    "~*bot" $xf;
    "~*bing" $xf;
    "~*search" $xf;
}
if ($fakebots) {
    return 444;
}
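For context, the map blocks live at http level and the if at server level, roughly like this (a sketch of my layout; server_name is a placeholder):

http {
    map $http_x_forwarded_for $xf { ... }
    map $http_user_agent $fakebots { ... }

    server {
        server_name www.example.com;
        if ($fakebots) {
            return 444;
        }
        # location blocks follow
    }
}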
With this configuration, the fake Google spiders can no longer access the root of my website, and they can't access any JS or CSS files, but they can still access my PHP files. Very strange. I don't know what's wrong.
For problem 2 (user agents that don't declare themselves as spiders), I use ngx_lua to generate a random value, put it into a cookie, and then check whether the client sends the value back. If it can't, it's a robot and access is blocked.
map $http_user_agent $ifbot {
    default 0;
    "~*Yahoo" 1;
    "~*archive" 1;
    "~*search" 1;
    "~*Googlebot" 1;
    "~Mediapartners-Google" 1;
    "~*bingbot" 1;
    "~*msn" 1;
    "~*rogerbot" 3;
    "~*ChinasoSpider" 3;
}
if ($ifbot = "0") {
set $humanfilter 1;
}
#below section is to exclude flash upload
if ( $request_uri !~ "~mod\=swfupload\&action\=swfupload" ) {
set $humanfilter "${humanfilter}1";
}
if ($humanfilter = "11") {
rewrite_by_lua '
local random = ngx.var.cookie_random
if(random == nil) then
random = math.random(999999)
end
local token = ngx.md5("hello" .. ngx.var.remote_addr .. random)
if (ngx.var.cookie_token ~= token) then
ngx.header["Set-Cookie"] = {"token=" .. token, "random=" .. random}
return ngx.redirect(ngx.var.scheme .. "://" .. ngx.var.host .. ngx.var.request_uri)
end
';
}
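(One detail I'm unsure about: math.random is never seeded here, so every worker may produce the same sequence. If that matters, something like this at http level should fix it — a sketch, assuming a reasonably recent ngx_lua:)

init_worker_by_lua '
    math.randomseed(ngx.time() + ngx.worker.pid())
';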
But it seems that with the above configuration, Googlebot is also blocked, while it shouldn't be.
And the last question: I tried to use "deny" to block access from an IP, but it seems that IP can still access my server.
In my http block:
deny 69.85.92.0/23;
deny 69.85.93.235;
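So the relevant part of nginx.conf looks roughly like this (a sketch; the rest of the http block is omitted):

http {
    deny 69.85.92.0/23;
    deny 69.85.93.235;

    server {
        # server config
    }
}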
But when I check the log, I can still find entries like these:
69.85.93.235 - - [05/May/2015:19:44:22 +0800] "GET /thread-1251687-1-1.html HTTP/1.0" 302 154 "http://www.example.com" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "123.125.71.107"
69.85.93.235 - - [05/May/2015:19:50:06 +0800] "GET /thread-1072432-1-1.html HTTP/1.0" 302 154 "http://www.example.com" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "220.181.108.151"
69.85.93.235 - - [05/May/2015:20:15:44 +0800] "GET /archiver/tid-1158637.html?page=1 HTTP/1.0" 302 154 "http://www.example.com" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "180.76.5.196"
69.85.93.235 - - [05/May/2015:20:45:09 +0800] "GET /forum.php?mod=viewthread&tid=1131476 HTTP/1.0" 302 154 "http://www.example.com" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "123.125.71.53"
Can anyone help?