2

Our company has temp development URLs that are being indexed by search engines. We need to stop this via a global .htaccess file. By global, I mean I want to drop this .htaccess file into our root so that the rules apply to every site; every time we build a new site, I don't want to have to drop an .htaccess file into that folder.

I am terrible at writing .htaccess rules, otherwise I would have done it myself. I would appreciate any input from the community.

Here is an example temp URL: 1245.temp.oursite.com

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} AltaVista [OR]
RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ "http\:\/\/oursite\.com" [R=301,L]

I've tried playing with this, but as I stated above, I'm terrible at writing .htaccess rules.

Edit: My question is similar to this one; however, mine involves sub-domains.

Geoffrey
  • You don't need to escape the destination part of the rule. Just use `http://oursite.com/`; see the sketch below these comments. – Mike Rockétt May 18 '15 at 16:08
  • possible duplicate of [Block all bots/crawlers/spiders for a special directory with htaccess](http://stackoverflow.com/questions/10735766/block-all-bots-crawlers-spiders-for-a-special-directory-with-htaccess) – Jan May 18 '15 at 16:09
  • Mike, wouldn't that prevent the bots from hitting our site? – Geoffrey May 18 '15 at 16:35
  • @Geoffrey see my edit to my answer and see if that's what you're looking for. – Panama Jack May 18 '15 at 16:41
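
For reference, here is a sketch of the question's rule with the destination written plainly, as Mike Rockétt's comment suggests; the behaviour is unchanged, it still redirects the listed bots to the main site:

RewriteEngine on
# Same conditions as in the question; only the destination syntax changes (no escaping needed)
RewriteCond %{HTTP_USER_AGENT} AltaVista [OR]
RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ http://oursite.com/ [R=301,L]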

2 Answers

4

If you don't want search engines to index the sites, add a robots.txt file to those subdomains. It should contain:

User-agent: *
Disallow: /

All major search engines respect the Web Robots standard.

ceejayoz
  • We have over 3,000 temp URLs. I do not want to drop/edit robots.txt for each of them. – Geoffrey May 18 '15 at 16:29
  • @Geoffrey You have over 3,000 sites on a server with no configuration management system? Brave. Should be relatively easy to serve a single shared robots file for all your temp URLs via the server config, though (a sketch follows these comments). – ceejayoz May 18 '15 at 16:48
  • I didn't pick the server config, or how it's done. I just started here a few months ago, and I agree that their setup is garbage, but I've gotta work with what already exists. Your response is correct; however, in this case it wouldn't make much sense for the server config we have. I have also already tried to get the host to adjust the httpd.conf to point to a robots.txt file, but they won't. – Geoffrey May 18 '15 at 16:57
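
Here is a minimal sketch of the "single shared robots file" idea from the comment above, done entirely from the root .htaccess rather than httpd.conf. It assumes the temp subdomains are served through that root .htaccess, and that a hypothetical file named temp-robots.txt (containing the User-agent/Disallow lines above) is created once on the main site. Crawlers that request robots.txt on a temp host are redirected to that shared copy, and major crawlers generally follow redirects for robots.txt:

RewriteEngine on
# On any numeric temp subdomain, answer robots.txt requests with one shared file
# (temp-robots.txt is a hypothetical file created once on the main site)
RewriteCond %{HTTP_HOST} ^[0-9]+\.temp\.oursite\.com$ [NC]
RewriteRule (^|/)robots\.txt$ http://oursite.com/temp-robots.txt [R=302,L]
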
2

If you simply want a universal file to block robots, you can use something like this; it is not specific to a domain.

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^.*(AltaVista|Googlebot|msnbot|Slurp).*$ [NC]
RewriteRule .* - [F,L]

Edit: If your subdomains are served through the main root .htaccess file, then you can use a method like this; it should block access on any temp domain.

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^.*(AltaVista|Googlebot|msnbot|Slurp).*$ [NC]
RewriteCond %{HTTP_HOST} ^([0-9]+)\.temp\.oursite\.com$ [NC]
RewriteRule .* - [F,L]
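
If the original goal of sending crawlers back to the main site (rather than returning 403 Forbidden) is still wanted, a sketch combining this host check with the redirect from the question might look like the following; it assumes the same bot list and the same numeric temp host pattern:

RewriteEngine on
# Send the listed crawlers on numeric temp subdomains back to the main site
RewriteCond %{HTTP_USER_AGENT} (AltaVista|Googlebot|msnbot|Slurp) [NC]
RewriteCond %{HTTP_HOST} ^([0-9]+)\.temp\.oursite\.com$ [NC]
RewriteRule .* http://oursite.com/ [R=301,L]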

Panama Jack
  • @ceejayoz Of course there are, but I'm not a mind reader. I used what the OP put because he didn't specifically say he wanted to block them all. He may only want certain `bad ones` blocked. I answered his question in the context of what he was already doing. – Panama Jack May 18 '15 at 16:22
  • No, that's not how SO works. We don't pretend better solutions don't exist just because the OP doesn't know to ask about them. – ceejayoz May 18 '15 at 16:49
  • @ceejayoz So why was it downvoted when it's what the OP wanted? That makes no sense. – Panama Jack May 18 '15 at 18:50
  • @Geoffrey no problem glad I could help you find your answer. – Panama Jack May 18 '15 at 18:50