
In a PHP project I use the idn_to_utf8 function to convert domain names from Punycode to Unicode strings.

But sometimes this function returns the Punycode instead of the Unicode string.

Example:

echo idn_to_utf8('xn--fiq57vn0d561bf5ukfonh1o');
// Returns: xn--fiq57vn0d561bf5ukfonh1o
// It should return: 中島第２駐輪場
echo idn_to_utf8('xn--fiqu6mnndw87c3ucbt0a1ea684a');
// Returns: 中味鋺自転車置場

There are libraries that convert Punycode correctly (http://idnaconv.phlymail.de/index.php?encoded=xn--fiq57vn0d561bf5ukfonh1o&decode=%3C%3C+Decode&lang=de), but I would prefer to use a built-in PHP function rather than a library.

Do you have any idea what causes this problem?

Edit / Solution and Explanation: To summarize and explain the problem, this code shows it:

echo idn_to_ascii('吉津第２自転車置場');
?><br /><?php
echo idn_to_utf8(idn_to_ascii('吉津第２自転車置場'));
?> Should be : 吉津第２自転車置場 <br /><?php

This code displays the following:

xn--2-958a11kws1a96p50fgxenr6afga

吉津第2自転車置場 (Should be) : 吉津第２自転車置場

To be clearer: before computing the Punycode of 吉津第２自転車置場, PHP converts the string to 吉津第2自転車置場 (the character "2" is different: the fullwidth "２" becomes the ASCII "2"). So the idn_to_ascii function cannot round-trip every Unicode character, because PHP maps certain Unicode characters to others (in this example, PHP converts "２" to "2").
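
The mapping described above can be observed directly. The following is a minimal sketch, assuming the intl extension is loaded (it provides Normalizer, idn_to_ascii and idn_to_utf8); the helper name idnRoundTrip is hypothetical and only used for illustration. It compares the round-tripped label with the NFKC-normalized form of the input, on the assumption that the digit folding seen here matches Unicode Normalization Form KC.

```php
<?php
// Sketch only: assumes the intl extension is available.

/**
 * Hypothetical helper: encodes a label to Punycode and decodes it back.
 * Returns false if either conversion fails.
 */
function idnRoundTrip(string $label)
{
    $ascii = idn_to_ascii($label, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($ascii === false) {
        return false;
    }
    return idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
}

// "\u{FF12}" is FULLWIDTH DIGIT TWO, written as an escape so it cannot be
// confused with the ASCII digit "2".
$original   = "吉津第\u{FF12}自転車置場";
$normalized = Normalizer::normalize($original, Normalizer::FORM_KC);
$roundTrip  = idnRoundTrip($original);

// Does the input already equal its NFKC form?
var_dump($original === $normalized);

// Does the round trip give back the NFKC form rather than the original input?
var_dump($roundTrip === $normalized);
```

If the second comparison holds, normalizing input with Normalizer::FORM_KC before storing or comparing it is one way to make the round trip predictable.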

Samuel Dauzon

2 Answers


This works fine. I think fullwidth characters such as [Ａ-Ｚ０-９] cannot be used.

echo idn_to_utf8('xn--2-kq6aw43af1e4y9boczagup'); // 中島第2駐輪場

In fact, Chrome automatically converts 中島第２駐輪場.com into 中島第2駐輪場.com before accessing it.

UPDATED:
A normalization rule named NAMEPREP seems to be provided: https://www.nic.ad.jp/ja/dom/idn.html

UPDATED:
That seems to be invalid... Validation Result

mpyw
  • Thanks for this answer. But the following name: 銘備前国長船与三左衛門尉祐定為栗山与九郎作之 doesn't contain any fullwidth [Ａ-Ｚ０-９] character, yet it can't be converted. Still, thank you, because your response allowed me to find RFC 3454. I didn't find any PHP function to convert a non-ASCII character to an ASCII character (like ２ to 2). No PHP function exists to apply NAMEPREP, and a search only returns some homemade libraries. – Samuel Dauzon Oct 24 '14 at 09:55
  • @jedema `銘備前国長船与三左衛門尉祐定為栗山与九郎作之` is invalid – mpyw Oct 24 '14 at 10:38
  • @jedema You should use [this encoder](http://mct.verisign-grs.com/convertServlet?input=%E9%8A%98%E5%82%99%E5%89%8D%E5%9B%BD%E9%95%B7%E8%88%B9%E4%B8%8E%E4%B8%89%E5%B7%A6%E8%A1%9B%E9%96%80%E5%B0%89%E7%A5%90%E5%AE%9A%E7%82%BA%E6%A0%97%E5%B1%B1%E4%B8%8E%E4%B9%9D%E9%83%8E%E4%BD%9C%E4%B9%8B) instead of the one you are using. – mpyw Oct 24 '14 at 10:40
  • Thanks for the encoder. The ones I use are too permissive. But do you know a PHP function to convert ２ to 2? I don't think I'm the only one with this problem. Thank you again. – Samuel Dauzon Oct 24 '14 at 10:47
  • [mb_convert_kana](http://php.net/manual/fr/function.mb-convert-kana.php) can convert most of them, except special chars (e.g. `。`) – mpyw Oct 24 '14 at 11:56
  • Thank you, but my script must accept all Unicode characters. I will use the library I mentioned in the question. I'm accepting your answer because it helped me and these comments can help everyone. Thank you. – Samuel Dauzon Oct 24 '14 at 12:14
  • The specific normalisation IDNA does that converts ２ to 2 is Unicode Normalization Form KC. PHP intl ext: http://php.net/manual/en/normalizer.normalize.php (FORM_KC) – bobince Oct 24 '14 at 16:22

Without PECL/intl or PECL/idn, I had trouble getting the built-in idn_to_utf8() to work!

This alternative: IdnaConv.net, works well for me. Taking the domain name as a whole:

include __DIR__ . '/IdnaConvert.php';
$IDNA = new \Mso\IdnaConvert\IdnaConvert();
$domain = 'xn--b1amarcd.xn--ehq889crwebw5c4qa.net'; // 'новини.三明治餐馆.net'
$parts = explode('.', $domain);
$utf8parts = [];
foreach ($parts as $part) {
    // Only labels that carry the "xn--" prefix need decoding.
    if (\substr($part, 0, 4) === 'xn--') {
        $utf8parts[] = $IDNA->decode($part);
    } else {
        $utf8parts[] = $part;
    }
}
$utf8domain = implode('.', $utf8parts);
Matthew Slyman
  • PECL/intl is now on my hosting plan. The idn_to_utf8() function is easier to use: you just feed in the entire IDN encoded domain name, and get the UTF8 answer returned! – Matthew Slyman Jan 13 '16 at 17:42
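
For comparison with the label-by-label loop above, here is a minimal sketch of the whole-domain call described in the comment. It assumes the intl extension is loaded; the function name decodeDomain and the choice to fall back to the input when decoding fails are illustrative assumptions, not part of either answer.

```php
<?php
// Sketch only: assumes the intl extension is available.

/**
 * Hypothetical helper: decodes a full IDN domain name in one call.
 * Falls back to the unchanged input if idn_to_utf8() reports failure.
 */
function decodeDomain(string $domain): string
{
    $utf8 = idn_to_utf8($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $utf8 === false ? $domain : $utf8;
}

// Same sample domain as in the answer above; no per-label splitting is needed.
echo decodeDomain('xn--b1amarcd.xn--ehq889crwebw5c4qa.net'), PHP_EOL;
```

Passing the variant explicitly keeps the behaviour the same on PHP versions where the default variant differs.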