
We run Apache (on Windows) and nginx (on CentOS) development servers. The problem is that Google somehow keeps getting hold of the development addresses and indexing them (could it be picking them up from the Chrome address bar?). Is there a way of blocking all bot/spider traffic at the server level, before resorting to individual robots.txt files in each site or password-only access?

A related problem is on the live environment (nginx on CentOS), where we use a static asset domain to serve images, JS, etc. Again, Google has indexed this within its search results. Is there a way to prevent this?

Eric

1 Answer


First of all, you should provide a valid robots.txt file in the root of your domain. It's the standard way to ask Google and other well-behaved web crawlers not to go through your website.
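For a development server that should stay out of the index entirely, a minimal robots.txt that disallows all crawling looks like this:

User-agent: *
Disallow: /

Keep in mind that robots.txt is purely advisory: well-behaved crawlers honor it, but nothing is enforced, which is where the user-agent block below comes in.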

With nginx it's also pretty easy to ban selected user agents:

# Return 403 Forbidden to any request whose User-Agent
# matches one of the listed crawler names
if ($http_user_agent ~ (Googlebot|bingbot|whatever) ) {
    return 403;
}

You can put this code in a separate file and include it in every server block.
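For example, with the snippet saved to a standalone file (the path and vhost name below are just illustrations):

# /etc/nginx/block-bots.conf -- hypothetical location
if ($http_user_agent ~ (Googlebot|bingbot|whatever) ) {
    return 403;
}

and then, in each virtual host:

server {
    server_name dev.example.com;   # hypothetical development vhost
    include /etc/nginx/block-bots.conf;
    # ... rest of the usual configuration
}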

DukeLion
  • Thanks, I've put a robots.txt in place, and the nginx config snippet is useful too; I'll drop that in. Is there a similar method for IIS? I still have the problem on the live machines of Google indexing the static asset domain. I can't use a robots.txt in that instance because it's the same folder as the primary domain. Should I look at something like the nginx snippet but add a domain check that looks for the word "static" in the domain? – Eric Aug 12 '13 at 15:07
  • Why not put IIS behind nginx? That approach has a lot of benefits, not just user-agent blocking. You can put this block of code into a specific `server {}` or `location {}` section to enable it just for some domains (see the sketch after this thread). – DukeLion Aug 12 '13 at 16:11
  • And you can put a list of folders you don't want scanned into robots.txt: http://www.robotstxt.org/robotstxt.html – DukeLion Aug 12 '13 at 16:15
  • The live IIS is behind nginx, but the development one isn't, and it wouldn't really be practical to do so. As for blocking certain folders, that wouldn't stop Google from indexing actual pages under the static URL. Many thanks. – Eric Aug 13 '13 at 07:26
  • It is always recommended to have development configuration as close to production as possible. – DukeLion Aug 13 '13 at 09:28
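A sketch of the per-domain approach DukeLion describes, assuming the static assets are served from a vhost such as static.example.com (a hypothetical name): the user-agent block lives only in that server block, so crawlers are turned away from the asset domain while the primary domain remains indexable, even though both point at the same folder on disk.

server {
    server_name static.example.com;   # hypothetical static-asset vhost

    # Reject known crawlers on this domain only
    if ($http_user_agent ~ (Googlebot|bingbot) ) {
        return 403;
    }

    root /var/www/example;   # same document root as the primary domain
}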