
I'm trying to separate the robot access log from the human access log, so I'm using the configuration below:

    http {
        ....
        map $http_user_agent $ifbot {
            default 0;
            "~*rogerbot" 3;
            "~*ChinasoSpider" 3;
            "~*Yahoo" 1;
            "~*Bot" 1;
            "~*Spider" 1;
            "~*archive" 1;
            "~*search" 1;
            "~*Yahoo" 1;
            "~Mediapartners-Google" 1;
            "~*bingbot" 1;
            "~*YandexBot" 1;
            "~*Feedly" 2;
            "~*Superfeedr" 2;
            "~*QuiteRSS" 2;
            "~*g2reader" 2;
            "~*Digg" 2;
            "~*trendiction" 3;
            "~*AhrefsBot" 3;
            "~*curl" 3;
            "~*Ruby" 3;
            "~*Player" 3;
            "~*Go\ http\ package" 3;
            "~*Lynx" 3;
            "~*Sleuth" 3;
            "~*Python" 3;
            "~*Wget" 3;
            "~*perl" 3;
            "~*httrack" 3;
            "~*JikeSpider" 3;
            "~*PHP" 3;
            "~*WebIndex" 3;
            "~*magpie-crawler" 3;
            "~*JUC" 3;
            "~*Scrapy" 3;
            "~*libfetch" 3;
            "~*WinHTTrack" 3;
            "~*htmlparser" 3;
            "~*urllib" 3;
            "~*Zeus" 3;
            "~*scan" 3;
            "~*Indy\ Library" 3;
            "~*libwww-perl" 3;
            "~*GetRight" 3;
            "~*GetWeb!" 3;
            "~*Go!Zilla" 3;
            "~*Go-Ahead-Got-It" 3;
            "~*Download\ Demon" 3;
            "~*TurnitinBot" 3;
            "~*WebscanSpider" 3;
            "~*WebBench" 3;
            "~*YisouSpider" 3;
            "~*check_http" 3;
            "~*webmeup-crawler" 3;
            "~*omgili" 3;
            "~*blah" 3;
            "~*fountainfo" 3;
            "~*MicroMessenger" 3;
            "~*QQDownload" 3;
            "~*shoulu.jike.com" 3;
            "~*omgilibot" 3;
            "~*pyspider" 3;
        }
        ....
    }

And in the server part, I'm using:

    if ($ifbot = "1") {
        set $spiderbot 1;
    }
    if ($ifbot = "2") {
        set $rssbot 1;
    }
    if ($ifbot = "3") {
        return 403;
        access_log /web/log/badbot.log  main;
    }

    access_log /web/log/location_access.log  main;
    access_log /web/log/spider_access.log main if=$spiderbot;
    access_log /web/log/rssbot_access.log main if=$rssbot;

But it seems that nginx writes some robot entries to both location_access.log and spider_access.log.

How can I keep the robot entries in a separate log?

Another question: some robot entries are not written to spider_access.log but do appear in location_access.log. It seems that my map is not working. Is there anything wrong with how I define the `map`?

Meteor
  • You don't have any conditions for `location_access.log` so it logs every request. – Alexey Ten Apr 27 '15 at 06:37
  • "not written to spider_access.log but exist in location_access.log": check whether their UA matches anything in the map. – Alexey Ten Apr 27 '15 at 06:38
  • Which condition should I use? I thought that at the same level (http, server, location), if a request is logged to one file, it will not be logged to another one. – Meteor Apr 27 '15 at 09:50
  • Your assumption is wrong. "Several logs can be specified on the same level" – Alexey Ten Apr 27 '15 at 10:03
  • How about using the `map` variable as the filename for the log file? `access_log /web/log/${logtype}_access.log main;`, and you set `$logtype` via the `map`? – Tero Kilkanen Apr 27 '15 at 21:02
  • @TeroKilkanen Sorry, I didn't follow. What do you mean by setting logtype via the map? – Meteor Apr 27 '15 at 23:40
  • Set the variable called `$logtype` via the map, and use the variable in the log file name. Maybe me calling the variable that was a bit misleading, sorry. – Tero Kilkanen Apr 28 '15 at 01:22

2 Answers


Working solution, without any other process involved:

Inspired by the comments. You can easily adapt it to several kinds of bots (bad and good ones) and put the `return 403;` statement in the appropriate place. The idea is as follows:

In the http part:

    map $http_user_agent $bot {
        default "";
        "~*Googlebot"   "yes";
        "~*MJ12bot"     "yes";
        # Add as many as desired
    }
    map $bot $no_bot {
        default "no";
        "yes"   "";
    }

Then, in the server part:

    access_log   /var/log/regular_access.log main if=$no_bot;
    access_log   /var/log/bots_access.log main if=$bot;

This works, but it is not very nice when nginx is used as a reverse proxy in front of several web servers, because it is not a very flexible way to define the log file names.
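As a rough sketch of how the same idea could be adapted to the three classes in the question: one extra `map` per log turns the asker's `$ifbot` value into a flag that is "1" for exactly one class, so each request is written to a single file. The variable names `$log_human`, `$log_spider`, `$log_rss` and `$log_badbot` are hypothetical, the `$ifbot` map is shortened here, and the `main` log format is assumed to be defined as in the question.

```nginx
# Fragment for the http block; assumes "log_format main ..." already exists.

# Shortened version of the asker's map: 0 = human, 1 = spider, 2 = RSS, 3 = bad bot.
map $http_user_agent $ifbot {
    default      0;
    "~*curl"     3;
    "~*Feedly"   2;
    "~*Bot"      1;
}

# Hypothetical helper flags: each is "1" for exactly one class, "0" otherwise.
map $ifbot $log_human  { default 0; 0 1; }
map $ifbot $log_spider { default 0; 1 1; }
map $ifbot $log_rss    { default 0; 2 1; }
map $ifbot $log_badbot { default 0; 3 1; }

server {
    # "if=" skips the write when the value is "0" or an empty string.
    access_log /web/log/location_access.log main if=$log_human;
    access_log /web/log/spider_access.log   main if=$log_spider;
    access_log /web/log/rssbot_access.log   main if=$log_rss;
    access_log /web/log/badbot.log          main if=$log_badbot;

    # Reject bad bots; the rejected request is still logged by the lines above.
    if ($ifbot = 3) {
        return 403;
    }
}
```

Keeping the `access_log` directives at server level, rather than inside the `if` block, matters here: `access_log` is not permitted inside a server-level `if`.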

Better looking but not working

I would have liked to use this solution:

http part:

    map $http_user_agent $bot_header {
        default "";
        "~*Googlebot"   "bots_";
        "~*MJ12bot"     "bots_";
        # Add as many as desired
    }

    map $server_name $log_filename {
        default          "unknown";
        "site1....."     "site1_***.log";
        "site2....."     "site2_***.log";
    }

And then, in each server part:

    server { # simple reverse-proxy...
        listen       37........:80;
        server_name  dev.****.net;
        access_log   /var/log/nginx/access/$bot_header$log_filename  main;

        # pass all requests
        location / {
            # There, your config
        }
    }

But this second approach doesn't work. Even though the path points to the right file and the file has the correct permissions, nginx logs an error saying its permissions are insufficient. The funny part is that this error is written to a file with exactly the same owner and permissions as the one it can't write to. I have no idea why, or whether it's a bug. Maybe someone can try it and fix the problem?
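One possible explanation, going by the nginx documentation for `access_log`: when the log path contains variables, the file is opened and closed by the worker process on every write instead of being opened once at startup. The worker user therefore needs permission to create files in the log directory itself, and the request's `root` directory must exist or the log is not created. A minimal sketch under those assumptions follows; the user name `www-data`, the `root` path and the cache sizes are hypothetical, and the maps and `main` format from above are assumed to be in place.

```nginx
# Main context. Hypothetical worker user: it needs write permission on the
# directory /var/log/nginx/access/, not only on files already inside it.
user www-data;

http {
    # ... the $bot_header and $log_filename maps and "log_format main" go here ...

    # Optional: cache descriptors of variable-named logs so that the file is
    # not reopened for every single request.
    open_log_file_cache max=100 inactive=20s valid=1m min_uses=2;

    server {
        listen 80;

        # Hypothetical path; it must exist, because nginx checks the request's
        # root directory on each write to a log whose name contains variables.
        root /var/www/html;

        access_log /var/log/nginx/access/$bot_header$log_filename main;
    }
}
```

The `error_log` file has a fixed name and is opened at startup, which might explain why nginx can write the error to a file with the same owner and permissions.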

zezollo

You are pushing the limits of Nginx's `if` conditional, which was intended for minimal use.

Consider using Rsyslog to follow your Nginx access log. Rsyslog has robust options for matching the contents of log strings and sending them to different logs as a result. Then you can have the three separate logs that you are looking for.
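If you go this route, one way to hand entries to Rsyslog without tailing a file is nginx's built-in syslog target for `access_log`. This is only a sketch of the nginx side: the socket path, facility and tag are assumptions, the `main` format is taken from the question, and the Rsyslog rules that split entries by user agent are not shown.

```nginx
server {
    # Keep writing the local file as before.
    access_log /web/log/location_access.log main;

    # Additionally send every entry to the local syslog socket, tagged "nginx",
    # so that Rsyslog can match on the user agent and route it to separate files.
    access_log syslog:server=unix:/dev/log,facility=local7,tag=nginx,severity=info main;
}
```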

Mark Stosberg