I need to encode a URL component. The URL component can contain special characters like "?", "#" and "=", as well as Chinese characters.
Which character encoding should I use: UTF-8, UTF-16 or UTF-32? And why?
I suppose you mean percent encoding here.
RFC 3986, section 2.5 is pretty clear about this (emphasis mine):
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
Therefore, this should be UTF-8.
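To make the two steps from the RFC concrete, here is a minimal sketch in Java: encode the text as UTF-8 octets, then percent-encode every octet that is not in the unreserved set. The class name `PercentEncoder` and the method `encode` are made up for this illustration and are not part of the JDK; for real code a well-tested library is probably the better choice.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, written only to illustrate RFC 3986, section 2.5.
final class PercentEncoder {

    // Unreserved set per RFC 3986, section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
            || (b >= '0' && b <= '9')
            || b == '-' || b == '.' || b == '_' || b == '~';
    }

    static String encode(String component) {
        StringBuilder out = new StringBuilder();
        // Step 1: turn the characters into UTF-8 octets.
        for (byte raw : component.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (isUnreserved(b)) {
                out.append((char) b);
            } else {
                // Step 2: percent-encode everything else as %HH (uppercase hex).
                out.append('%')
                   .append(Character.toUpperCase(Character.forDigit(b >> 4, 16)))
                   .append(Character.toUpperCase(Character.forDigit(b & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a?b#c=d")); // a%3Fb%23c%3Dd
        System.out.println(encode("\u00C0"));  // %C3%80 (LATIN CAPITAL LETTER A WITH GRAVE)
        System.out.println(encode("\u30A2"));  // %E3%82%A2 (KATAKANA LETTER A)
    }
}
```

The last two outputs match the examples given in the RFC text quoted above. Chinese characters go through exactly the same path, since they are just more UTF-8 octets.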
Also, beware of URLEncoder.encode(): although it is recommended again and again, it is not suitable for URI encoding. Quoting the javadoc of the class itself:
This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format
which is not what URI encoding uses. (In case you are wondering, application/x-www-form-urlencoded is what is used in HTTP POST data.) What you want to use is a URI template instead. See for instance here.
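A small sketch of the difference, using only JDK classes. It does not show a URI template library; it just contrasts URLEncoder with the multi-argument java.net.URI constructor, which quotes illegal characters in each component. The class name `FormVsUriEncoding` and its two helper methods are made up for the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class contrasting form encoding with URI encoding.
final class FormVsUriEncoding {

    // application/x-www-form-urlencoded: a space becomes "+".
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // URI encoding of a path: a space becomes "%20".
    static String uriWithPath(String path) {
        try {
            return new URI("http", "example.com", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b"));   // a+b
        System.out.println(uriWithPath("/a b")); // http://example.com/a%20b
    }
}
```

A "+" in a URI path is a literal plus sign, so the form-encoded output would be read back as "a+b" rather than "a b" by anything that parses the path as a URI.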
A reference from an HTML point of view.
The HTML4 specification, section Non-ASCII characters in URI attribute values, states (my emphasis):
We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:
- Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
- Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
Similarly, in the HTML5 specification, the section Selecting a form submission encoding basically says that UTF-8 should be used if no accept-charset attribute is specified.
On the other hand, I found nothing that states UTF-8 must be used. Some older software uses ISO-8859-1 instead. For example, Apache Tomcat before version 8 has ISO-8859-1 as the default value for its URIEncoding setting.
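To illustrate why the receiving side's setting matters, here is a minimal sketch that decodes the same percent-encoded octets with two different charsets. It uses java.net.URLDecoder purely as a convenient way to decode %HH sequences; the class name `CharsetMismatch` and its `decode` helper are made up for the example, and this is not how Tomcat itself is implemented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo: same octets, different charset on the decoding side.
final class CharsetMismatch {

    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%C3%80"; // UTF-8 octets of U+00C0 (A with grave)

        // Decoded as UTF-8: one character, as the sender intended.
        System.out.println(decode(encoded, "UTF-8").length());      // 1

        // Decoded as ISO-8859-1: each octet becomes its own character.
        System.out.println(decode(encoded, "ISO-8859-1").length()); // 2
    }
}
```

So if the client sends UTF-8 but the server assumes ISO-8859-1, every non-ASCII character arrives garbled, which is the usual symptom with such older defaults.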
UTF-8 (an encoding of Unicode) is the default character encoding in HTML5, as Unicode covers almost all symbols and characters.
Go for UTF-8. You can also achieve the same thing with URLEncoder.encode(string, encoding).
In addition, you can refer to this blog, which tries to encode some Chinese characters such as '维也纳恩斯特哈佩尔球场'.
Encode your URL to escape special characters. There are several websites that can do this for you. E.g. http://www.url-encode-decode.com/