
I use rewrite on my nginx server so that URLs like https://www.example.com/en/product/apple.html pass en, product and apple.html to a single PHP script, like so:

rewrite ^/([a-zA-Z0-9_\-]+)/([a-zA-Z0-9_\-]+)/(.+)$ /index.php?lang=$1&page=$2&part=$3&$query_string last;

As you can see, the third part (apple.html in this case) matches any characters. When this part contains URL-encoded special characters, nginx seems to decode them on the fly, so PHP cannot tell whether the user requested the URL with the encoded or the decoded character. For example, for both /en/product/apples,oranges.html and /en/product/apples%2Coranges.html, PHP reads apples,oranges.html.

To avoid having two URLs with the same content: can nginx rewrite the URL without decoding URL-encoded special/reserved characters, so that PHP can determine whether it should redirect to the non-encoded URL? Or, perhaps even better, can it be configured to 301 redirect /en/product/apples%2Coranges.html to /en/product/apples,oranges.html?
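For illustration, here is a minimal sketch of the kind of redirect I have in mind. It assumes that $request_uri (the original request URI as sent by the client, including arguments) still contains the undecoded %2C, and that the block sits in the same server context, above the rewrite line. The regex and its capture groups are my own assumption and have not been tested on this site.

```nginx
# Sketch only: match against the raw, undecoded request URI.
# $1 = path before the encoded comma, $2 = path after it,
# $3 = optional query string (including the leading "?").
if ($request_uri ~ "^(/[^?]*)%2[Cc]([^?]*)([?].*)?$") {
    # Replaces one encoded comma per redirect; a path with several
    # encoded commas would take several 301 hops to settle.
    return 301 $1,$2$3;
}
```

With this in place, /en/product/apples%2Coranges.html would be answered with a 301 to /en/product/apples,oranges.html, while requests without an encoded comma in the path fall through to the existing rewrite.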

PS. I know the better URL would be /en/product/apples-oranges.html, which avoids the comma altogether. But since the web allows us to form URLs with special characters such as the comma, I'm interested in learning how to deal with them.

jonr
  • Everyone else has their application parse the URL. Why have you chosen this method? – Michael Hampton Jan 18 '18 at 19:30
  • @MichaelHampton I'm not sure I understand your question. You mean why don't I have an apple.html file in a folder called /en/product/? – jonr Jan 18 '18 at 19:42
  • No, I mean why aren't you fully implementing front controller routing in your application? – Michael Hampton Jan 18 '18 at 19:52
  • I guess because the site was coded 15 years ago and has not changed very much since then, except front end of course. I'm sure lots of things would have been done differently today. – jonr Jan 18 '18 at 20:03
  • Oh, the dreaded legacy codebase. Now it makes sense... more or less. :) As a reminder, RFC 3986 specifies that: "Implementations must not percent-encode or decode the same string more than once". It also specifies that, at least in the particular case of the percent encoded comma, the URLs are identical (so redirecting would be meaningless). – Michael Hampton Jan 18 '18 at 20:15
  • OK, by definition they are the same. However, Google Search Console reports an International Targeting error if, say, the originating /en/ URL (with the encoded character) points to an alternate /fr/ page without encoding: the alternate page doesn't point back to the originating URL as written, only to its non-encoded form. http://hreflang.ninja gives the same error. Hence, at least in some sense, they are treated as two different URLs. Quite disturbing! The only workaround I can think of, other than _not using a comma_, is a redirect. – jonr Jan 18 '18 at 21:21
