I've come across a rather unique issue. If you deal with scaling large sites and work with a company like Akamai, you have origin servers that Akamai talks to. Whatever you serve to Akamai, they will propagate on their cdn.
But how do you handle robots.txt? You don't want Google to crawl your origin. That can be a HUGE security issue. Think denial of service attacks.
But if you serve a robots.txt on your origin with "disallow", then your entire site will be uncrawlable!
The only solution I can think of is to serve a different robots.txt to Akamai and to the world. Disallow to the world, but allow to Akamai. But this is very hacky and prone to so many issues that I cringe thinking about it.
(Of course, origin servers shouldn't be viewable to the public, but I'd venture to say most are for practical reasons...)
It seems an issue the protocol should be handling better. Or perhaps allow a site-specific, hidden robots.txt in the Search Engine's webmaster tools...
Thoughts?