4

I need to encode a URL component. The component can contain special characters such as "?", "#", and "=", as well as Chinese characters.

Which character encoding should I use: UTF-8, UTF-16, or UTF-32? And why?

Nathan
Edi
  • [URL encoding](http://en.wikipedia.org/wiki/Percent-encoding) is something completely different than character encoding. – Jesper Mar 26 '15 at 10:49

5 Answers

5

I suppose you mean percent encoding here.

RFC 3986, section 2.5 is pretty clear about this (emphasis mine):

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".

Therefore, this should be UTF-8.
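As a rough illustration of the rule quoted above, here is a minimal sketch that encodes a string as UTF-8 octets and then percent-encodes every octet outside the RFC 3986 unreserved set. The class name `PercentEncoder` and the method `encode` are made up for this example and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, written only to illustrate RFC 3986 section 2.5.
class PercentEncoder {
    // Unreserved characters per RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    static String encode(String component) {
        StringBuilder out = new StringBuilder();
        // Step 1: turn the text into UTF-8 octets.
        for (byte b : component.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            // Step 2: keep unreserved ASCII octets, percent-encode everything else.
            if (octet < 0x80 && UNRESERVED.indexOf(octet) >= 0) {
                out.append((char) octet);
            } else {
                out.append(String.format("%%%02X", octet));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("A"));        // A
        System.out.println(encode("\u00C0"));   // %C3%80 (LATIN CAPITAL LETTER A WITH GRAVE)
        System.out.println(encode("\u30A2"));   // %E3%82%A2 (KATAKANA LETTER A)
        System.out.println(encode("a?b#c=d"));  // a%3Fb%23c%3Dd
    }
}
```

This treats the whole input as a single component, so delimiters such as "?", "#" and "=" are escaped too. It should not be applied to a complete URL.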

Also, beware of URLEncoder.encode(): although it is recommended again and again, it is not suitable for URI encoding. Quoting the Javadoc of the class itself:

This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format

which is not what URI encoding uses. (In case you are wondering, application/x-www-form-urlencoded is what is used for form data in HTTP POST bodies.) What you want to use instead is a URI template. See for instance here.
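One visible consequence of the form format is how a space is handled. The sketch below only demonstrates that difference; the class name `FormEncodingDemo` and the method `formEncode` are invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class showing what URLEncoder produces for a space.
class FormEncodingDemo {
    static String formEncode(String s) {
        try {
            // Produces application/x-www-form-urlencoded output, not RFC 3986 percent-encoding.
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A space becomes "+", whereas RFC 3986 percent-encoding would give "a%20b".
        System.out.println(formEncode("a b")); // a+b
    }
}
```

A "+" is only read as a space when the receiver decodes form data. In a URI path it is a literal plus sign, so the result can silently change meaning.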

fge
2

A reference from an HTML point of view.

The HTML4 specification, in the section "Non-ASCII characters in URI attribute values", states (my emphasis):

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

Similarly, the "Selecting a form submission encoding" section of the HTML5 specification basically says that UTF-8 should be used if no accept-charset attribute is specified.

On the other hand, I found nothing stating that UTF-8 must be used. Some older software uses ISO-8859-1 in particular. For example, Apache Tomcat before version 8 has ISO-8859-1 as the default value of its URIEncoding setting.
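To see why the choice of charset matters, here is a small sketch of the two HTML4 steps (bytes first, then %HH for each byte) run with two different charsets. The class name `CharsetEscapeDemo` and the method `escapeBytes` are assumptions made for this example only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: same text, different charsets, different escaped output.
class CharsetEscapeDemo {
    static String escapeBytes(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        // Step 1: represent the text as bytes in the given charset.
        for (byte b : text.getBytes(charset)) {
            // Step 2: write every byte as %HH.
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(escapeBytes(eAcute, StandardCharsets.UTF_8));      // %C3%A9
        System.out.println(escapeBytes(eAcute, StandardCharsets.ISO_8859_1)); // %E9

        // A Chinese character has no ISO-8859-1 representation; getBytes substitutes "?".
        System.out.println(escapeBytes("\u7EF4", StandardCharsets.ISO_8859_1)); // %3F
    }
}
```

A server that decodes with a different charset than the client used for encoding will therefore see different characters, and with ISO-8859-1 Chinese text is lost before it is even escaped.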

holmis83
0

UTF-8, an encoding of Unicode, is the default character encoding in HTML5, as Unicode covers almost all symbols and characters.

wblades
0

Go for UTF-8. You can achieve the same thing with URLEncoder.encode(string, encoding).

In addition, you can refer to this blog, which tries to encode some Chinese characters such as '维也纳恩斯特哈佩尔球场'.
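A minimal sketch of that call, round-tripping the Chinese text through URLEncoder and URLDecoder with UTF-8. The class name `ChineseRoundTrip` and its helper methods are invented for this example. Keep in mind that URLEncoder emits the form-encoded format, so this fits query-string or form values rather than arbitrary URI components.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: encode and decode Chinese text with UTF-8.
class ChineseRoundTrip {
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String original = "维也纳恩斯特哈佩尔球场";
        String encoded = encode(original);
        // Each Chinese character becomes three %HH groups, one per UTF-8 byte.
        System.out.println(encoded);
        // Decoding with the same charset restores the original text.
        System.out.println(decode(encoded).equals(original));
    }
}
```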

Community
-2

Encode your URL to escape special characters. There are several websites that can do this for you, e.g. http://www.url-encode-decode.com/

AuthenticReplica