Linux unicode/umlauts in URL

Question

We have a website where some image filenames contain Unicode characters, e.g. wildkräuter2_big.jpg.

The problem: when anybody tries to access such a file, Apache 2.4 returns a 404 error:

$ curl -r 0-99 http://domain.tld/wp-content/uploads/2014/11/wildkräuter2_big.jpg

In Apache's log:

40...168 - - [30/Jun/2016:13:27:36 +0000] "GET /wp-content/uploads/2014/11/wildkr%C3%A4uter2_big.jpg HTTP/1.0" 404 22295 "-" "curl/7.35.0"

%C3%A4 here is an ä, according to the Deutsch - Unicode Tabelle.

A GET request with %C3%A4 does not work. A GET request with a%CC%88 does work:

$ curl -r 0-99 http://domain.tld/wp-content/uploads/2014/11/wildkra%CC%88uter2_big.jpg
��▒ExifII��Duckyd��http://ns.adobe.com/xap/1.0/<?xpacket begin="

I'm not sure where I got the a%CC%88 code from, but it works.
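
One way to see what each percent-encoded sequence actually contains is to decode it and list the resulting code points. The following is a minimal Python 3 sketch using only the standard library; the helper name describe is made up for illustration:

```python
import unicodedata
from urllib.parse import unquote


def describe(percent_encoded):
    """Decode a percent-encoded UTF-8 string and name each code point in it."""
    text = unquote(percent_encoded, encoding="utf-8")
    return ["U+{:04X} {}".format(ord(ch), unicodedata.name(ch)) for ch in text]


# One precomposed code point
print(describe("%C3%A4"))
# A plain "a" followed by a combining mark
print(describe("a%CC%88"))
```

The first call reports a single code point (U+00E4), the second reports two (U+0061 followed by U+0308), even though both render as ä.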

So, two "same" URLs:

http://domain.tld/wp-content/uploads/2014/11/wildkra%CC%88uter2_big.jpg - this works

http://domain.tld/wp-content/uploads/2014/11/wildkr%C3%A4uter2_big.jpg - this does not work.

Both a%CC%88 and %C3%A4 mean the same thing: the letter ä.
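
The two spellings are canonically equivalent in Unicode terms, but they are not the same bytes, and a byte-wise filename lookup treats them as different names. A short Python 3 sketch of that difference, assuming the two URL paths from above; the variable names are illustrative only:

```python
import unicodedata
from urllib.parse import unquote

composed = unquote("wildkr%C3%A4uter2_big.jpg")     # the form curl sent (404)
decomposed = unquote("wildkra%CC%88uter2_big.jpg")  # the form that works

# The raw strings and their UTF-8 bytes differ.
print(composed == decomposed)
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))

# Normalizing both to one form makes them compare equal:
# NFC composes "a" + U+0308 into U+00E4, NFD does the reverse.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

In other words, %C3%A4 is the composed (NFC) spelling and a%CC%88 the decomposed (NFD) spelling of the same letter.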

This site was migrated from another agency, and we have no information about its original setup.

Our current server runs Ubuntu 14.04 with an ext4 filesystem and LANG=de as the locale (apache2 was restarted after LANG was changed, but the server itself was not rebooted):

# su -s /bin/bash www-data

$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=de
LANGUAGE=
LC_CTYPE="de"
LC_NUMERIC=uk_UA.UTF-8
LC_TIME=uk_UA.UTF-8
LC_COLLATE="de"
LC_MONETARY=uk_UA.UTF-8
LC_MESSAGES="de"
LC_PAPER=uk_UA.UTF-8
LC_NAME=uk_UA.UTF-8
LC_ADDRESS=uk_UA.UTF-8
LC_TELEPHONE=uk_UA.UTF-8
LC_MEASUREMENT=uk_UA.UTF-8
LC_IDENTIFICATION=uk_UA.UTF-8
LC_ALL=

Ignore my other comment, had a brainfart and forgot that the U+ codepoints had to be calculated from UTF-8. %CC%88 is U+0308, the combining diaeresis, which means "add an umlaut to the previous character". Thus "a%CC%88" *looks* the same as "%C3%A4", but the actual bytes differ, and the filename on disk uses the former, so the latter is not found. I'm not posting an answer since I don't know what to tell you to fix this, other than to be consistent in Unicode [normalization](http://www.modernperlbooks.com/mt/2013/01/why-unicode-normalization-matters.html) - DerfK, Jun 30 '16 at 16:32
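
As a starting point for being consistent about normalization, it may help to find out which names on disk are not in composed (NFC) form. The sketch below only reports such names and does not rename anything. It assumes Python 3 and a UTF-8 filesystem encoding; the function name find_non_nfc and the relative path wp-content/uploads are placeholders, not details confirmed by the question:

```python
import os
import unicodedata


def find_non_nfc(root):
    """Yield (directory, name, nfc_name) for entries whose name is not in NFC."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            nfc = unicodedata.normalize("NFC", name)
            if nfc != name:
                yield dirpath, name, nfc


if __name__ == "__main__":
    # Hypothetical location; adjust to the real document root.
    for dirpath, name, nfc in find_non_nfc("wp-content/uploads"):
        print(os.path.join(dirpath, name), "->", os.path.join(dirpath, nfc))
```

Renaming the reported files to their NFC names would presumably make the %C3%A4 URLs resolve, but any existing links that use the decomposed a%CC%88 form would then stop working, so whether to rename is a judgment call.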

