
I've got an Ajax-rich website with extensive _escaped_fragment_ support for Ajax indexing. All my _escaped_fragment_ URLs 301 redirect to a special module that outputs the HTML snapshots crawlers need (i.e. mysite.com/#!/content becomes mysite.com/?_escaped_fragment_=/content, which in turn 301s to mysite.com/raw/content). Still, I'm somewhat afraid of users stumbling on those "raw" URLs themselves and of those URLs ending up in search engines.
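For illustration, the module behind those redirects boils down to something like this (a simplified sketch; the sanitising and the /raw/ path are illustrative, the real module does more):

```php
<?php
// Simplified sketch of the snapshot redirect (illustrative only).
// Crawlers rewrite mysite.com/#!/content to
// mysite.com/?_escaped_fragment_=/content, which lands here.
if (isset($_GET['_escaped_fragment_'])) {
    $fragment = $_GET['_escaped_fragment_'];

    // Keep the fragment inside the /raw/ namespace.
    $fragment = '/' . ltrim(preg_replace('#[^a-zA-Z0-9/_-]#', '', $fragment), '/');

    // 301 to the HTML snapshot the crawler should index.
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: /raw' . $fragment);
    exit;
}
```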

In PHP, how do I make sure only robots can access this part of the website? (Much like Stack Overflow disallows its sitemap to normal users and only lets robots access it.)

Bill the Lizard
Swader

1 Answer


You can't, at least not reliably.

robots.txt asks spiders to keep out of parts of a site, but there is no equivalent for regular user agents.

The closest you could come would be to keep a whitelist of acceptable IP addresses or user agents and serve different content based on that … but that risks false positives.
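As a rough sketch of the user-agent route (the bot tokens below are illustrative and incomplete, and any of them can be spoofed):

```php
<?php
// Very rough gate for the /raw/ snapshot URLs; UA sniffing is unreliable.
$knownBots = array('googlebot', 'bingbot', 'yandex', 'baiduspider', 'duckduckbot');

$userAgent = isset($_SERVER['HTTP_USER_AGENT'])
    ? strtolower($_SERVER['HTTP_USER_AGENT'])
    : '';

$isBot = false;
foreach ($knownBots as $token) {
    if (strpos($userAgent, $token) !== false) {
        $isBot = true;
        break;
    }
}

if (!$isBot) {
    // Anything that doesn't look like a known crawler gets a 404
    // instead of the snapshot.
    header('HTTP/1.1 404 Not Found');
    exit;
}
```

A false positive here just means a real crawler you forgot to list sees a 404, and a spoofed UA still gets through, which is why it isn't reliable.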

Personally, I'd stop catering for old IE, scrap the #! URIs and the _escaped_fragment_ hack, switch to pushState and friends, and have the server build the initial view for any given page.
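If you go that way, the server side reduces to returning a full document for every clean URL; something along these lines (renderPage is a toy stand-in for whatever builds your markup):

```php
<?php
// Minimal sketch: every clean URL returns a complete HTML document,
// and the client-side code takes over with history.pushState afterwards.
function renderPage($path)
{
    // A real app would pull the same data the Ajax layer uses.
    $title = htmlspecialchars(trim($path, '/') ?: 'home');

    return '<!DOCTYPE html><html><head><title>' . $title . '</title></head>'
         . '<body><div id="app"><h1>' . $title . '</h1></div>'
         . '<script src="/js/app.js"></script></body></html>';
}

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
echo renderPage($path);
```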

Quentin
  • I'm afraid the project's requirement is old-IE compatibility. Is there a list or wildcard of non-robot user agents I should ban on the PHP side to accomplish the solution you propose? I wouldn't be too restrictive - of course someone can spoof a UA, but I'd like to do my best to keep the "raw" URLs out of search engines. – Swader Jul 30 '13 at 09:34
  • This looks promising, I'll be taking a look at it in the coming days: http://phpmaster.com/server-side-device-detection-with-browscap/ – Swader Jul 31 '13 at 03:27
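For reference, the browscap route would look roughly like this (it assumes the browscap directive in php.ini points at a full browscap.ini; the 404 response is just one way to handle non-crawlers):

```php
<?php
// Sketch of crawler detection via browscap.
// get_browser() returns false if browscap isn't configured in php.ini.
$info = get_browser(null, true); // null = use $_SERVER['HTTP_USER_AGENT']

if (empty($info['crawler'])) {
    // Looks like a regular browser: keep it away from the raw snapshots.
    header('HTTP/1.1 404 Not Found');
    exit;
}
```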