My CakePHP (v2.5.5) application has a dynamic sitemap that uses various methods to generate links (such as /sitemap/career-center), and then passes these links to Router::url($generated_url, true).

My sitemap is supposed to be available at the url http://example.com/sitemap (the default route) - which is working fine. However, for some reason, Google is finding my sitemap at crazy urls such as:

  • http://www.example.com/index.php/forums/general/pt-ceus/js/views/jobs/general/img/og/pt-jobs/pt-ceus/general/general/sitemap

We don't even have a forum anywhere in our system, or any mention of one. pt-jobs, pt-ceus, and general are all different areas of our website. js, img, and so on, are directories with static assets. But for some reason, this is routing successfully to my sitemap.

In turn, the sitemap found at this random URL populates every link within it with the same gibberish, saturating Webmaster Tools with 500 errors. For some reason, the sitemap is accessible at that nonsense URL, but the generated links within the sitemap that use the same structure all produce errors (as expected).

My questions are:

  • Do you have any idea what is going on?
  • How is Google finding these random urls, and why in the world are they successfully routing to my sitemap?

If you need any more information, let me know and I'll update with that info.

  • Your routes configuration and your exact CakePHP version will probably be helpful. – ndm Mar 10 '15 at 16:36
  • I've added the version to the first line. Routes file is pretty huge - but all of the routes follow either the standard "/controller/actions/params" or "/alias". Nothing very complex. –  Mar 10 '15 at 19:43

1 Answer

One of my coworkers has discovered the source of this issue.

Here is the info:

  • This bug is only occurring on certain environments. Not sure what causes the difference between Production, QA, etc, but this behavior does not occur in all cases.

Hitting a url such as: http://www.example.com/index.php/sdfasdfjklasdjfkl/x/asdkfjasd/asdfasdfeww/sitemaps/

Gives you the following:

[base] => /index.php/sdfasdfjklasdjfkl/x/asdkfjasd/asdfasdfeww
[webroot] => /index.php/sdfasdfjklasdjfkl/x/asdkfjasd/asdfasdfeww/ 

Inspecting the CakePHP file CakeRequest.php, we found the following comment:

 * If CakePHP is called with index.php in the URL even though
 * URL Rewriting is activated (and thus not needed) it swallows
 * the unnecessary part from $base to prevent issue #3318.
 *
 * @return string Base URL
 * @link https://cakephp.lighthouseapp.com/projects/42648-cakephp/tickets/3318

We don't know what this issue #3318 is - but it appears that the 'fix' to that issue causes these long crazy urls to work. In our case, this caused these strange urls to be reflected in the sitemap that was being generated.
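
To see why such a URL still routes, here is a minimal sketch in plain PHP. It assumes that, when App.baseUrl is not set, the base is taken from dirname() of PHP_SELF. That matches the [base] value shown above, but it is an assumption about CakeRequest, not a copy of its code. guessBase() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: approximates how the base path might be derived
// from PHP_SELF (assumption; this is not CakePHP's actual code).
function guessBase($phpSelf)
{
    // dirname() ignores a trailing slash and drops the last path segment.
    $base = dirname($phpSelf);
    // Collapse any duplicate slashes.
    $base = preg_replace('#/+#', '/', $base);
    // A bare "/" or "." means the app sits at the document root.
    if ($base === '/' || $base === '.') {
        $base = '';
    }
    return $base;
}

$phpSelf = '/index.php/sdfasdfjklasdjfkl/x/asdkfjasd/asdfasdfeww/sitemaps/';
$base = guessBase($phpSelf);

// Whatever remains after the base is what the router gets to match.
$routed = substr($phpSelf, strlen($base));

// Absolute links generated during this request carry the same base.
$link = $base . '/sitemap/career-center';

echo $base, "\n", $routed, "\n", $link, "\n";
```

With the base swallowed like this, only /sitemaps/ is left for the router, and every absolute link built afterwards carries the bogus prefix, which would explain the broken links in the generated sitemap.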

Note: This doesn't answer how in the hell these crazy urls are generated and reached by Google to begin with, but it does explain why they work.

Our solution was simply to disallow URLs containing index.php, since URL rewriting is enabled in our case and index.php is never needed in a URL.
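
One way to implement such a block, as a minimal sketch in plain PHP. hasIndexPhpSegment() is a hypothetical helper, not part of CakePHP, and the check assumes URL rewriting is on so that no legitimate URL contains an index.php segment.

```php
<?php
// Hypothetical helper: returns true when the request path contains a
// literal "index.php" segment anywhere in it.
function hasIndexPhpSegment($requestUri)
{
    // Look only at the path so query strings cannot trigger the check.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        // Missing or malformed path: nothing to match against.
        return false;
    }
    return in_array('index.php', explode('/', $path), true);
}

var_dump(hasIndexPhpSegment('/index.php/forums/general/sitemap'));
var_dump(hasIndexPhpSegment('/sitemap'));
```

In a CakePHP 2.x app this could be called early, for example from AppController::beforeFilter() with env('REQUEST_URI'), throwing a NotFoundException when it returns true. The same effect can also be achieved at the web server level instead.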