20

I have something of a staging server on the public internet running copies of the production code for a few websites. I'd really rather the staging sites not get indexed.

Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers?

Changing the robots.txt wouldn't really work, since I use scripts to copy the same code base to both servers. I would also rather not change the virtual host conf files, as there are a bunch of sites and I don't want to have to remember to copy over a certain setting whenever I make a new site.

Nick Messick
  • 3,202
  • 3
  • 30
  • 41

6 Answers

39

Create a robots.txt file with the following contents:

User-agent: *
Disallow: /

Put that file somewhere on your staging server; the document root is a fine place for it (e.g. /var/www/html/robots.txt).

Add the following to your httpd.conf file:

# Exclude all robots
<Location "/robots.txt">
    SetHandler None
</Location>
Alias /robots.txt /path/to/robots.txt

The SetHandler directive is probably not required, but it may be needed if you're using a handler like mod_python; it ensures robots.txt is served as a plain file rather than passed to that handler.

That robots.txt file will now be served for all virtual hosts on your server, overriding any robots.txt file you might have for individual hosts.

(Note: My answer is essentially the same thing that ceejayoz's answer is suggesting you do, but I had to spend a few extra minutes figuring out all the specifics to get it to work. I decided to put this answer here for the sake of others who might stumble upon this question.)

jsdalton
  • 6,555
  • 4
  • 40
  • 39
  • saved me a lot of time. Thnx. – Khuram May 07 '12 at 09:23
  • What is the `Alias` referring to? If I have several vhosts should I create an `Alias` for each? – nicoX Sep 28 '14 at 18:12
  • @nicoX: You do not need to create a separate `Alias` for each vhost. The one you create here will apply to all vhosts you create. – jsdalton Sep 29 '14 at 18:44
  • From the `httpd.conf` file: we have `LoadModule vhost_alias_module modules/mod_vhost_alias.so` and `DocumentRoot /var/www/html`, which is wrong since we are using `/var/www/vhosts`, although it still works. We include our vhosts with `Include` pointing to each one's `httpd-include.conf` file. I included a `robots.txt` file for each *vhost* in its root directory, and in `httpd.conf` I have the Alias pointing to the file in just one of my *vhosts*. – nicoX Oct 03 '14 at 13:02
4

You can use Apache's mod_rewrite to do it. Let's assume that your real host is www.example.com and your staging host is staging.example.com. Create a file called 'robots-staging.txt' and conditionally rewrite the request to go to that.

This example is aimed at protecting a single staging site, which is a slightly simpler use case than what you are asking for, but it has worked reliably for me:

<IfModule mod_rewrite.c>
  RewriteEngine on

  # Dissuade web spiders from crawling the staging site
  RewriteCond %{HTTP_HOST}  ^staging\.example\.com$
  RewriteRule ^robots\.txt$ robots-staging.txt [L]
</IfModule>

You could try to redirect the spiders to a master robots.txt on a different server, but some of the spiders may balk after they get anything other than a "200 OK" or "404 not found" return code from the HTTP request, and they may not read the redirected URL.

Here's how you would do that:

<IfModule mod_rewrite.c>
  RewriteEngine on

  # Redirect web spiders to a robots.txt file elsewhere (possibly unreliable)
  RewriteRule ^robots\.txt$ http://www.example.com/robots-staging.txt [R]
</IfModule>
2

Could you alias robots.txt on the staging virtualhosts to a restrictive robots.txt hosted in a different location?
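
For instance, a minimal sketch of that idea for a single staging vhost might look like the following (the hostname and file paths here are just assumptions for illustration):

<VirtualHost *:80>
    ServerName staging.example.com
    DocumentRoot /var/www/staging/example

    # Serve a shared, restrictive robots.txt kept outside the copied code base
    Alias /robots.txt /var/www/robots/robots-staging.txt
</VirtualHost>

# Apache 2.4: allow access to the directory holding the shared file
<Directory "/var/www/robots">
    Require all granted
</Directory>

The drawback, as the question notes, is that each staging vhost needs its own Alias line; jsdalton's answer avoids that by putting a single Alias in the global httpd.conf instead.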

ceejayoz
  • 176,543
  • 40
  • 303
  • 368
2

To truly stop pages from being indexed, you'll need to hide the sites behind HTTP auth. You can do this in your global Apache config and use a simple .htpasswd file.

The only downside is that you now have to type in a username/password the first time you browse to any pages on the staging server.
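
A minimal sketch of that approach, assuming Apache 2.4 and a password file created with htpasswd (the realm name and paths are examples):

# In the global Apache config, so it covers every vhost on the staging box
<Location "/">
    AuthType Basic
    AuthName "Staging server - authorized users only"
    AuthBasicProvider file
    AuthUserFile /etc/apache2/.htpasswd-staging
    Require valid-user
</Location>

Create the password file once with something like htpasswd -c /etc/apache2/.htpasswd-staging someuser. Unlike the robots.txt approaches, this keeps both crawlers and casual visitors out entirely.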

chazomaticus
  • 15,476
  • 4
  • 30
  • 31
  • there is a problem with this approach, when you want to expose some APIs to different services that don't support HTTP Auth. In this case you'll have to disable it for that specific host, which can lead to a mess in time. – Dan Bizdadea May 13 '14 at 12:50
1

Depending on your deployment scenario, you should look for ways to deploy different robots.txt files to dev/stage/test/prod (or whatever combination you have). Assuming you have different database config files (or whatever's analogous) on the different servers, this should follow a similar process (you do have different passwords for your databases, right?)

If you don't have a one-step deployment process in place, this is probably good motivation to get one... there are tons of tools out there for different environments - Capistrano is a pretty good one, and favored in the Rails/Django world, but is by no means the only one.

Failing all that, you could probably set up a global Alias directive in your Apache config that would apply to all virtual hosts and point to a restrictive robots.txt.

Kevin
  • 1,295
  • 2
  • 13
  • 9
0

Try using Apache itself to stop bad robots. You can find lists of bot user agents online, or you can just allow known browsers rather than trying to block every bot.
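
As a rough sketch of the user-agent approach on Apache 2.4 (the bot patterns below are a short example list, not a complete one):

# Flag requests whose User-Agent matches known crawler strings
BrowserMatchNoCase "googlebot|bingbot|yandex|baiduspider" is_bot

# Deny flagged requests everywhere on the staging server
<Location "/">
    <RequireAll>
        Require all granted
        Require not env is_bot
    </RequireAll>
</Location>

Keep in mind this only deters bots that send an honest User-Agent header; if the staging sites really must stay private, HTTP auth (as suggested above) is more reliable.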

Greg
  • 316,276
  • 54
  • 369
  • 333