
I need to block a bunch of robots from crawling a few hundred sites hosted on an Nginx web server running on an Ubuntu 16.04 machine.

I've found a fairly simple example (the relevant part of the code is below), but it seems that this functionality is only available inside a server block (because of the if statement), and I think that's a horrible idea, especially with a large number of sites on the machine.

  if ($http_user_agent ~* (ahrefs|wget|crawler|majestic) ) {
    return 403;
  }

So, the question is: can something similar be achieved from the main nginx.conf file, so that it works for all the domains currently defined in the sites-enabled folder as well as any added in the future?

I've also read about the map approach and found a whole project on GitHub that uses it - https://github.com/mariusv/nginx-badbot-blocker - but it still requires editing all the files in the sites-enabled folder, and that would take too much time for a few hundred sites that are already up and running.
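
For reference, here is roughly what that map approach looks like (a minimal sketch adapted from the idea in that project; the file name blockbots.conf is just an example). The map itself can be defined once at the http level, e.g. in a file picked up by the stock include /etc/nginx/conf.d/*.conf; line in nginx.conf, but the actual check still has to be added to every server block, which is exactly the part I want to avoid:

  # /etc/nginx/conf.d/blockbots.conf (example name) - loaded once at the http level
  map $http_user_agent $bad_bot {
      default                            0;
      # case-insensitive regex match against the User-Agent header
      ~*(ahrefs|wget|crawler|majestic)   1;
  }

  # ...and this part still has to go into every server block in sites-enabled:
  if ($bad_bot) {
      return 403;
  }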

  • Re: "too much time", sed script? – HTTP500 Aug 01 '16 at 15:38
  • @HTTP500 - Learning sed will take me even more time than editing the files manually. Plus, some of them are modified from their initial template, and putting in the additional code will make things even harder (at least that's what I think, not knowing the full extent of sed's capabilities). So, unless there's absolutely no other alternative, sed is currently not an option. – Sledge Hammer Aug 01 '16 at 15:44
  • I just realized my answer wouldn't work because it didn't get around the server block issue – David Wilkins Aug 01 '16 at 16:17
  • You could use a CDN and block at the edge, before it hits the server. – Tim Aug 01 '16 at 21:00
