1

I am using Heroku pipelines. When I push my application, it is pushed to the staging app

https://appname.herokuapp.com/

and if everything is correct I promote that app to production. There is no new build process. It is the same app that was built the first time for staging.

https://appname.com/

The problem is that this causes duplicate content. The two sites are exact clones of each other. I would like to exclude the staging app from indexing by Google and other search engines.

One way I thought of is a robots.txt file.

For this to work, I think I should write it like this:

User-agent: *
Disallow: https://appname.herokuapp.com/

using the absolute URL, because this file will be on the server in both the staging and production applications, and I only want to remove the staging app from Google's index without touching the production one.

Is this the right way to do it?

Igor-Vuk

2 Answers

4

No, the Disallow field can’t take full URL references. Your robots.txt would block URLs like these:

  • https://example.com/https://appname.herokuapp.com/
  • https://example.com/https://appname.herokuapp.com/foo

The Disallow value always represents the beginning of the URL’s path.
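As a rough illustration of that prefix rule, here is a minimal Python sketch. The `is_blocked` helper is hypothetical and deliberately simplified: it ignores `Allow` lines, wildcards, and user-agent groups, and only does a literal comparison between the URL's path and each `Disallow` value. Real parsers may differ in how they treat a value with no leading slash.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, disallow_values: list[str]) -> bool:
    """Return True if the URL's path starts with any Disallow value.

    Simplified on purpose: no Allow lines, wildcards, or user-agent groups.
    """
    path = urlsplit(url).path or "/"
    # An empty Disallow value blocks nothing, so skip it.
    return any(value and path.startswith(value) for value in disallow_values)


# The rule from the question is compared against the path only ("/foo"),
# so it does not match a real page on the staging host.
full_url_rule = ["https://appname.herokuapp.com/"]
staging_page_blocked = is_blocked("https://appname.herokuapp.com/foo", full_url_rule)

# A path-only rule matches every path on whichever host serves the file.
root_rule = ["/"]
staging_blocked_by_root = is_blocked("https://appname.herokuapp.com/foo", root_rule)
```

The host never takes part in the comparison, which is why a single robots.txt file cannot single out one of two hosts.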

To block all URLs under https://appname.herokuapp.com/, you would need:

Disallow: /

So you have to use different robots.txt files for https://appname.herokuapp.com/ and https://appname.com/.
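Since the same build ends up on both apps, one possible way to get different robots.txt responses is to generate the file per request from the `Host` header instead of shipping a static file. Below is a minimal sketch as a plain WSGI callable in Python. The `PRODUCTION_HOSTS` set and both function names are assumptions for illustration, and the application in question may well use a different stack.

```python
PRODUCTION_HOSTS = {"appname.com"}  # assumed production host name

ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def robots_txt_for_host(host: str) -> str:
    """Allow crawling only on known production hosts; block everywhere else."""
    hostname = host.split(":", 1)[0].strip().lower()  # drop an optional port
    return ALLOW_ALL if hostname in PRODUCTION_HOSTS else BLOCK_ALL


def robots_app(environ, start_response):
    """WSGI callable that answers /robots.txt based on the request's Host header."""
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if environ.get("PATH_INFO") != "/robots.txt":
        start_response("404 Not Found", headers)
        return [b"Not found\n"]
    body = robots_txt_for_host(environ.get("HTTP_HOST", "")).encode("utf-8")
    start_response("200 OK", headers + [("Content-Length", str(len(body)))])
    return [body]
```

Blocking by default means any host that is not explicitly listed, including the herokuapp.com one, gets `Disallow: /`.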

If you don’t mind bots crawling https://appname.herokuapp.com/, you could make use of noindex instead. But this would also require different behaviour for both sites. An alternative that doesn’t require different behaviour could be to make use of canonical. This conveys to crawlers which URL is preferred for indexing.

<!-- on https://appname.herokuapp.com/foobar -->
<link rel="canonical" href="https://appname.com/foobar" />
<!-- on https://appname.com/foobar -->
<link rel="canonical" href="https://appname.com/foobar" />
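The canonical URL differs per page, so the tag can be generated from the request path rather than written by hand. A minimal Python sketch, assuming `https://appname.com` is the preferred origin; `canonical_link` is a hypothetical helper, not part of any framework:

```python
from html import escape

CANONICAL_ORIGIN = "https://appname.com"  # assumed preferred (production) origin


def canonical_link(path: str) -> str:
    """Build the canonical <link> element for a request path.

    The production origin is hard-coded, so staging and production
    emit the same tag for the same path.
    """
    url = CANONICAL_ORIGIN + "/" + path.lstrip("/")
    # Escape the URL so characters such as & or " stay valid inside the attribute.
    return f'<link rel="canonical" href="{escape(url, quote=True)}" />'
```

Because the function never looks at the request's host, the identical build can run on both apps without any per-environment behaviour.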
unor
  • Thanks unor. That was my conclusion also. To block one site but not the other, the only choice is canonical, but since they are clones of each other, both of them would have the canonical tag, just as you said. Is it OK for `appname.com` to have a canonical that also points to `appname.com`? Also, do I need a canonical just for the host name, or one for each route in my application? – Igor-Vuk Aug 03 '18 at 03:12
  • @Igor-Vuk: Yes, self-referential `canonical` [is fine](https://stackoverflow.com/a/20437674/1591669). And you need to specify it on each page, not just the homepage (which is why my example uses `/foobar`). --- Note that you could probably also find a way to serve different robots.txt files, so it’s not necessarily the only solution. And also note that `canonical` doesn’t *block*. – unor Aug 03 '18 at 13:00
  • If it doesn't block, it means that Google will still crawl it and index it, right? I am not sure how I would serve two different robots.txt files, because whatever the staging one has, the production one has also. – Igor-Vuk Aug 03 '18 at 21:08
  • @Igor-Vuk: Yes, search engines may still crawl (otherwise they couldn’t learn what the canonical URL for a non-canonical URL is) and index non-canonical URLs (if they think your canonical statement is wrong, or for whatever reason). – unor Aug 03 '18 at 22:01
-1

No, using what you've suggested would block all search engines / bots from accessing https://appname.herokuapp.com/.

Instead what you should use is:

User-agent: Googlebot
Disallow: /

This will only block Googlebot from accessing https://appname.herokuapp.com/. Keep in mind that bots can ignore the robots.txt file; it is more of a polite request than anything else. But Google will follow your request.

EDIT

After seeing unor's advice, it is not possible to Disallow by URL, so I've removed that from my answer. You can, however, block particular paths, e.g. /appname/, or use / to stop Googlebot from accessing anything.

Joe
  • Think you haven't understood the OP's problem. OP will be using the same content (including robots.txt) for both staging and prod. So having `Disallow: /` would block everything on prod as well. Not an option. – Ash Feb 21 '19 at 01:40