We have an Umbraco website with several sub-domains, and we want to exclude one of them from being crawled by search engines for now. I tried to change my robots.txt file, but it seems I am not doing it right.

URL: http://mywebsite.co.dl/

Subdomain: http://sub1.mywebsite.co.dl/

My robots.txt content is as follows:

User-agent: *
Disallow: sub1.*

What have I missed?

amir moradifard

2 Answers

The following code will block http://sub1.mywebsite.co.dl/ from being indexed:

User-agent: *
Disallow: /sub1/ 

You can also add another robots.txt file in the sub1 folder with the following code:

User-agent: *
Disallow: /

and that should help as well.
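
One way to sanity-check a rule before deploying it is Python's standard urllib.robotparser, which evaluates rules the way a conforming crawler would. A minimal sketch, assuming the second robots.txt above is the one served to crawlers (the hostname comes from the question):

from urllib.robotparser import RobotFileParser

# "Disallow: /" blocks every path for every user agent.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

print(rp.can_fetch("*", "http://sub1.mywebsite.co.dl/"))          # False: blocked
print(rp.can_fetch("*", "http://sub1.mywebsite.co.dl/any/page"))  # False: blocked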

Howli
  • In Umbraco, subdomains do not have separate folders. You can define as many HostNames as you want in the Umbraco backend and have several subdomains for your website. – amir moradifard Mar 10 '14 at 11:49

If you want to block anything on http://sub1.mywebsite.co.dl/, your robots.txt MUST be accessible at http://sub1.mywebsite.co.dl/robots.txt.

This robots.txt will block all URLs for every bot that supports robots.txt:

User-agent: *
Disallow: /
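
To verify what a crawler will actually see once the file is deployed, you can fetch and evaluate it from the subdomain exactly the way a bot does. A sketch using Python's urllib.robotparser, assuming the robots.txt above is live at the subdomain root:

from urllib.robotparser import RobotFileParser

# A bot crawling sub1 requests robots.txt from that exact host,
# so that is where we check it.
rp = RobotFileParser()
rp.set_url("http://sub1.mywebsite.co.dl/robots.txt")
rp.read()  # downloads the live file over HTTP, like a crawler would

# With "Disallow: /" in place, nothing on the subdomain is crawlable.
print(rp.can_fetch("*", "http://sub1.mywebsite.co.dl/any/page"))  # False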
unor
  • In Umbraco, subdomains do not have separate folders. You can define as many HostNames as you want in the Umbraco backend and have several subdomains for your website. – amir moradifard Mar 10 '14 at 11:49
  • @amirmoradifard: It doesn’t matter how it’s implemented on the backend. The only thing that matters is the URL that gets used by visitors. So if someone visits a page accessible on `sub.example.com`, the robots.txt **must** be accessible from `sub.example.com/robots.txt`. No other place will work (but you may redirect, I guess). – unor Mar 10 '14 at 12:40
  • It does matter, as long as the whole website contains one main root and one robots.txt. Howli's answer worked for me. – amir moradifard Mar 12 '14 at 10:45
  • @amirmoradifard: Howli's first code will not work. Disallowing `/sub1/` does *not* block `http://sub1.mywebsite.co.dl/` (it would block `http://sub1.mywebsite.co.dl/sub1/` and anything after that). – unor Mar 12 '14 at 13:24
  • I used his suggestion and changed it to Disallow: sub1.mywebsite.co.dl/*, and yes, it's working. – amir moradifard Mar 12 '14 at 13:27
  • @amirmoradifard: It’s working where? This is not valid according to the robots.txt specification. `Disallow` can not contain domains, it contains (beginnings of) URL paths. Your code would block `http://sub1.mywebsite.co.dl/sub1.mywebsite.co.dl/`. – unor Mar 12 '14 at 13:55
  • Yes, you are right IF you use Disallow: /sub1.mywebsite.co.dl/*, but by getting rid of the first slash, it works. At least in my case it's working! – amir moradifard Mar 12 '14 at 14:21
  • @amirmoradifard: May I ask how you know that it’s working? Even without the beginning slash, it would still be the URL **path**, not the host. So this should not work with any conforming bot/parser. – unor Mar 13 '14 at 13:20
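
For reference, a conforming parser treats the disputed Disallow value as a URL path prefix, not as a hostname, and this can be checked offline. A sketch with Python's urllib.robotparser, using the rule from the comments above:

from urllib.robotparser import RobotFileParser

# The rule from the comments: it looks like a host, but it is parsed as a
# path, and since it does not begin with "/" it matches no real URL path.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: sub1.mywebsite.co.dl/*"])

# The subdomain's root and ordinary pages are not blocked by it.
print(rp.can_fetch("*", "http://sub1.mywebsite.co.dl/"))      # True: still crawlable
print(rp.can_fetch("*", "http://sub1.mywebsite.co.dl/page"))  # True: still crawlable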