
Why are some IDNs not reversible?

String domain = "aŉċăwb7rňuħ.eu"; // second character is U+0149 (ŉ)
System.out.println(domain);
domain = IDN.toASCII(domain);
System.out.println(domain);
domain = IDN.toUnicode(domain);
System.out.println(domain);

It displays:

aŉċăwb7rňuħ.eu
xn--anwb7ru-93a5e8ozmq2m.eu
aʼnċăwb7rňuħ.eu

As you can see, the second character has been split in two!

Thanks

2 Answers


This is by design. From what I can tell, the 2nd character in your string is a \u0149 codepoint. According to the latest Unicode code charts:

this character is deprecated and its use is strongly discouraged

The Unicode code chart says that the deprecated code point is equivalent to \u02bc followed by \u006e.
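
The equivalence described above can be checked from Java itself. The following is a minimal sketch, assuming that the chart's equivalence is the compatibility decomposition that java.text.Normalizer applies under NFKC; the class and method names are made up for illustration.

```java
import java.text.Normalizer;

class NfkcDecompositionDemo {
    // Render each code point as U+XXXX so that look-alike characters can be told apart.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Apply Unicode compatibility normalization (form NFKC).
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String deprecated = "\u0149"; // the deprecated code point from the question
        // prints "U+0149 -> U+02BC U+006E"
        System.out.println(codePoints(deprecated) + " -> " + codePoints(nfkc(deprecated)));
    }
}
```

Printing code points rather than the characters themselves avoids relying on how a console or font renders the two forms.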

According to the javadocs, the first thing that IDN.toASCII(String) does is apply the RFC 3491 nameprep algorithm (a profile of stringprep) to the characters in the input string. The RFC abstract says:

This document describes how to prepare internationalized domain name (IDN) labels in order to increase the likelihood that name input and name comparison work in ways that make sense for typical users throughout the world. This profile of the stringprep protocol is used as part of a suite of on-the-wire protocols for internationalizing the Domain Name System (DNS).

(In other words, stringprep is designed to make it harder to create tricky domain names that look like one thing and mean something different.)

In fact, if you drill down, you will find that the prescribed mapping in stringprep tables for \u0149 is \u02bc \u006e ; i.e. the equivalent defined in the Unicode code charts.

And ... that is what is happening.
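
To see the effect without depending on how the characters are rendered, the round trip can be repeated while printing code points. This is only a sketch: the short domain name is a hypothetical stand-in for the one in the question, and the helper names are invented for the example.

```java
import java.net.IDN;

class IdnRoundTripDemo {
    // Encode to the ASCII-compatible (ACE) form, then decode back to Unicode.
    static String roundTrip(String domain) {
        return IDN.toUnicode(IDN.toASCII(domain));
    }

    // Render each code point as U+XXXX so that look-alike characters can be told apart.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String original = "a\u0149b.eu"; // hypothetical name containing U+0149
        String back = roundTrip(original);
        System.out.println("before: " + codePoints(original));
        System.out.println("after:  " + codePoints(back));
        System.out.println("identical: " + original.equals(back));
    }
}
```

Given the behaviour reported in the question, the "after" line should show U+02BC U+006E where the "before" line shows U+0149.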


Summary

  1. Your expectation that you can round-trip IDNs is ill-founded.
  2. You shouldn't be using that character anyway, since it is deprecated. (Certainly, it is a bad idea to use it in an IDN!)
Stephen C

The IDN toASCII procedure is inherently non-reversible, as it involves performing Unicode normalization (to form NFKC) as part of the process. In general, multiple Unicode character sequences can have the same normalized form; the IDN toUnicode procedure will produce one of these forms from an ACE label, but there is no guarantee that it will be the same one that was originally encoded.

If the result of toUnicode(toASCII(x)) does differ from x then the two are nevertheless equivalent for IDN's purposes, and they should furthermore be Unicode compatibility equivalents of each other. Generally speaking, they will be rendered similarly by Unicode fonts. In that sense, it is a bit surprising that there is a noticeable difference in your case, but the bottom line is that your apparent expectation of reversibility is unfounded.
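
One way to make "equivalent for IDN's purposes" concrete is to compare names by their ACE form instead of by their original Unicode spelling. The sketch below assumes that two names count as the same IDN when IDN.toASCII yields the same string, ignoring ASCII case; sameIdn and canonical are hypothetical helpers, not part of java.net.IDN.

```java
import java.net.IDN;

class IdnEquivalenceDemo {
    // Treat two names as the same IDN when their ACE forms match (ASCII case ignored).
    static boolean sameIdn(String a, String b) {
        return IDN.toASCII(a).equalsIgnoreCase(IDN.toASCII(b));
    }

    // The Unicode form that survives a round trip; applying it again should change nothing.
    static String canonical(String domain) {
        return IDN.toUnicode(IDN.toASCII(domain));
    }

    public static void main(String[] args) {
        String x = "a\u0149b.eu"; // hypothetical name containing U+0149
        String y = canonical(x);
        System.out.println("x equals y:       " + x.equals(y));
        System.out.println("same IDN:         " + sameIdn(x, y));
        System.out.println("y is a fixed point: " + canonical(y).equals(y));
    }
}
```

If a stable Unicode spelling is needed, for example as a map key, storing canonical(x) or the ACE form avoids depending on the spelling the user happened to type.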

John Bollinger