What character set should be used for URL encoding?

Question

I need to encode a URL component. The component can contain special characters such as "?", "#" and "=", as well as Chinese characters.

Which character encoding should I use: UTF-8, UTF-16 or UTF-32? And why?

[URL encoding](http://en.wikipedia.org/wiki/Percent-encoding) is something completely different than character encoding. — Jesper, Mar 26 '15 at 10:49

score 5 · Accepted Answer · edited Oct 07 '21 at 05:57

I suppose you mean percent encoding here.

RFC 3986, section 2.5 is pretty clear about this (emphasis mine):

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".

Therefore, this should be UTF-8.
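To make the two steps in the quoted passage concrete, here is a minimal sketch: convert the text to UTF-8 octets, then percent-encode every octet that is not in the unreserved set of RFC 3986, section 2.3. The class and method names (`PercentEncodingSketch`, `encodeComponent`) are made up for this illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, written only to illustrate RFC 3986 section 2.5.
class PercentEncodingSketch {

    // Unreserved set per RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
            || (b >= '0' && b <= '9')
            || b == '-' || b == '.' || b == '_' || b == '~';
    }

    static String encodeComponent(String text) {
        StringBuilder out = new StringBuilder();
        // Step 1: encode the characters as UTF-8 octets.
        for (byte raw : text.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF; // treat the byte as unsigned
            // Step 2: percent-encode every octet outside the unreserved set.
            if (isUnreserved(b)) {
                out.append((char) b);
            } else {
                out.append('%').append(String.format("%02X", b));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeComponent("A"));       // A
        System.out.println(encodeComponent("\u00C0"));  // %C3%80 (LATIN CAPITAL LETTER A WITH GRAVE)
        System.out.println(encodeComponent("\u30A2"));  // %E3%82%A2 (KATAKANA LETTER A)
        System.out.println(encodeComponent("a?b#c=d")); // a%3Fb%23c%3Dd
    }
}
```

This sketch deliberately encodes every reserved character, which is what you want for a single component value; it is not meant for encoding a whole URI, where delimiters such as "/" and "?" must stay intact.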

Also, beware of URLEncoder.encode(): although it is recommended again and again, it is not suitable for URI encoding. Quoting the javadoc of the class itself:

This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format

which is not what URI encoding uses. (In case you are wondering, application/x-www-form-urlencoded is what is used in HTTP POST data.) What you want to use is a URI template instead. See for instance here.
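One visible difference between the two formats is how a space is handled. The sketch below compares URLEncoder.encode() with the multi-argument java.net.URI constructor. Note that java.net.URI is used here only as a stand-in from the JDK, not the URI template approach mentioned above, and that the class name `FormVersusUriSketch` and the host `example.com` are placeholders for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical helper comparing form encoding with URI path quoting.
class FormVersusUriSketch {

    // application/x-www-form-urlencoded: a space becomes "+".
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // The multi-argument URI constructor quotes illegal characters in the path;
    // toASCIIString() additionally percent-encodes non-ASCII characters as UTF-8.
    static String uriWithPath(String path) {
        try {
            return new URI("http", "example.com", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b"));   // a+b
        System.out.println(uriWithPath("/a b")); // http://example.com/a%20b
    }
}
```

A "+" in a URI path is a literal plus sign, so output from URLEncoder pasted into a path will not decode back to a space.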

holmis83 · Answer 2 · 2017-08-14T15:29:10.930

A reference from an HTML point of view.

The HTML4 specification, section Non-ASCII characters in URI attribute values, states (my emphasis):

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.

Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

Similarly, the HTML5 specification, in its section Selecting a form submission encoding, basically says that UTF-8 should be used if no accept-charset attribute is specified.

On the other hand, I found nothing that states UTF-8 must be used. Some older software uses ISO-8859-1 in particular. For example, Apache Tomcat before version 8 has ISO-8859-1 as the default value for its URIEncoding setting.
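To see why the decoder's charset matters, the following sketch decodes the same percent-encoded bytes once as UTF-8 and once as ISO-8859-1. It uses the JDK's URLDecoder purely for illustration and does not reproduce Tomcat's own decoding code; the class name `DecodingMismatchSketch` is made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper showing what happens when encoder and decoder disagree.
class DecodingMismatchSketch {

    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%C3%80"; // UTF-8 octets of U+00C0 (A with grave)

        // Decoded with the charset that was used for encoding: one character.
        String asUtf8 = decode(encoded, "UTF-8");
        System.out.println(asUtf8.length());   // 1

        // Decoded as ISO-8859-1: each octet becomes its own character
        // (U+00C3 followed by U+0080), so the text is garbled.
        String asLatin1 = decode(encoded, "ISO-8859-1");
        System.out.println(asLatin1.length()); // 2
    }
}
```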

score 0 · Answer 3 · answered Mar 26 '15 at 10:53

UTF-8 (an encoding of Unicode) is the default character encoding in HTML5, as it can represent every Unicode character, which covers almost all symbols and characters in use.

wblades

score 0 · Answer 4 · edited May 23 '17 at 12:31

Go for UTF-8. You can achieve the same thing with URLEncoder.encode(string, encoding).

In addition, you can refer to this blog, which encodes some Chinese characters such as '维也纳恩斯特哈佩尔球场'.
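As a rough sketch of that call, the example below encodes the first three characters of the string above with UTF-8 and decodes them again. The class name `ChineseEncodeSketch` is made up for this illustration, and the result is in application/x-www-form-urlencoded format, so it suits form or query data rather than arbitrary URI parts.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper around URLEncoder/URLDecoder with a fixed UTF-8 charset.
class ChineseEncodeSketch {

    static String encodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String decodeUtf8(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u7EF4\u4E5F\u7EB3"; // 维也纳
        String encoded = encodeUtf8(text);
        System.out.println(encoded);        // %E7%BB%B4%E4%B9%9F%E7%BA%B3
        // Decoding with the same charset restores the original text.
        System.out.println(decodeUtf8(encoded).equals(text)); // true
    }
}
```

Each Chinese character here takes three UTF-8 octets, hence three %HH groups per character.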

answered Mar 26 '15 at 10:54

Chirag Visavadiya

score -2 · Answer 5 · answered Mar 26 '15 at 10:53

Encode your URL to escape special characters. There are several websites that can do this for you. E.g. http://www.url-encode-decode.com/

AuthenticReplica
