How do you disallow crawling on origin server and yet have the robots.txt propagate properly?

Question

I've come across a rather unusual issue. If you deal with scaling large sites and work with a company like Akamai, you have origin servers that Akamai talks to. Whatever you serve to Akamai, they will propagate on their CDN.

But how do you handle robots.txt? You don't want Google to crawl your origin. That can be a HUGE security issue. Think denial of service attacks.

But if you serve a robots.txt on your origin with "disallow", then your entire site will be uncrawlable!

The only solution I can think of is to serve a different robots.txt to Akamai and to the world. Disallow to the world, but allow to Akamai. But this is very hacky and prone to so many issues that I cringe thinking about it.
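To make that idea concrete, here is a minimal sketch of choosing the robots.txt body per request. It assumes the CDN forwards requests to the origin with the public Host header, which is not guaranteed for every setup; `PUBLIC_HOST` and `robots_txt_for` are hypothetical names used only for illustration.

```python
# Hypothetical public hostname that the CDN is assumed to send in the Host header.
PUBLIC_HOST = "www.example.com"

# An empty Disallow line permits crawling; "Disallow: /" blocks everything.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"


def robots_txt_for(host_header):
    """Return the robots.txt body to serve for a given Host header value."""
    # Drop an optional ":port" suffix and normalize case before comparing.
    host = (host_header or "").split(":")[0].strip().lower()
    # Only requests addressed to the public hostname get the permissive file.
    return ALLOW_ALL if host == PUBLIC_HOST else DISALLOW_ALL
```

Anything that reaches the origin under another name (a raw IP, an origin-only hostname, or no Host header at all) gets the disallow-all file, so the permissive copy is the only one the CDN should ever cache.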

(Of course, origin servers shouldn't be viewable to the public, but I'd venture to say most are for practical reasons...)

It seems like an issue the protocol should handle better. Or perhaps search engines could allow a site-specific, hidden robots.txt in their webmaster tools...

Thoughts?

score 1 · Answer 1 · answered Apr 27 '12 at 03:03

If you really want your origins not to be public, use a firewall / access control to restrict access for any host other than Akamai - it's the best way to avoid mistakes and it's the only way to stop the bots & attackers who simply scan public IP ranges looking for webservers.

That said, if all you want is to keep out non-malicious spiders, consider having your origin server redirect any request whose Host header does not specify your public hostname to the official name. You generally want something like that anyway, to avoid confusion or search rank dilution if there are variations of the canonical hostname. With Apache this could use mod_rewrite, or even a simple virtual host setup where the default server has RedirectPermanent / http://canonicalname.example.com/.

If you do use this approach, you could either simply add the production name to your test systems' hosts file when necessary or also create and whitelist an internal-only hostname (e.g. cdn-bypass.mycorp.com) so you can access the origin directly when you need to.
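The decision logic behind that redirect, including the bypass hostname, can be sketched in a few lines of Python. This is only an illustration of the rule, not the Apache configuration itself; `route` is a hypothetical helper, and the hostnames are the example names used above.

```python
# Example names taken from the answer; substitute your own.
CANONICAL_HOST = "canonicalname.example.com"
# Internal-only hostnames that may reach the origin directly (assumed allowlist).
BYPASS_HOSTS = {"cdn-bypass.mycorp.com"}


def route(host_header, path="/"):
    """Return (status, location) for a request arriving at the origin.

    (200, None) means serve the content; (301, url) means redirect.
    """
    # Drop an optional ":port" suffix and normalize case before comparing.
    host = (host_header or "").split(":")[0].strip().lower()
    if host == CANONICAL_HOST or host in BYPASS_HOSTS:
        return (200, None)
    # Any other name (raw IP, origin hostname, missing header) is sent to the
    # canonical name, keeping the requested path.
    return (301, "http://" + CANONICAL_HOST + path)
```

A well-behaved spider that stumbles on the origin by IP or by its origin hostname follows the 301 and ends up crawling the public site through the CDN instead.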

Sadly, Akamai IP addresses change so frequently that the last time I asked for a whitelist, they said no. As such, a firewall is not a viable option. Also, a permanent redirect is a similarly bad idea for origin traffic through a CDN. — milesvp, Jun 15 '16 at 23:28
