When escaping a string with HTML entities, can I safely skip encoding chars above Unicode 127 if I use UTF-8?

Question

When outputting a string in HTML, one must escape special characters as HTML entities ("&<>" etc.) for understandable reasons.

I've examined two Java implementations of this:

org.apache.commons.lang.StringEscapeUtils.escapeHtml(String)
net.htmlparser.jericho.CharacterReference.encode(CharSequence)

Both escape all characters above Unicode code point 127 (0x7F), which is effectively all non-English characters.

This behavior is fine, but the strings it produces are not human-readable when the characters are non-English (for example, Hebrew or Arabic). I've seen that when characters above Unicode 127 aren't escaped like this, they still render correctly in browsers. I believe this is because the HTML page is UTF-8 encoded, so the browser can decode these characters directly.

My question: Can I safely disable escaping Unicode characters above code point 127 when escaping HTML entities, provided my web page is UTF-8 encoded?

score 6 · Accepted Answer · answered Feb 09 '11 at 10:08

You only need to use HTML entities under two circumstances:

To escape a character that has a special meaning in HTML (e.g. <)
To display a character that doesn't belong to the document encoding (e.g., the € symbol in an ISO-8859-1 document)

Given that UTF-8 can represent all Unicode characters, only the first case applies.
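The two cases above can be sketched in Java roughly as follows. This is a minimal illustration, not the behavior of either library mentioned in the question: the class name MinimalHtmlEscaper and the method escape are hypothetical, and the sketch assumes the text is being written into element content or a quoted attribute value. It always escapes the markup-significant characters and falls back to a numeric character reference only when the document charset cannot represent a character.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only.
final class MinimalHtmlEscaper {

    // Escapes characters that are special in HTML (case 1) and, only if the
    // document charset cannot represent a character, emits a numeric
    // character reference for it (case 2).
    static String escape(String text, Charset documentCharset) {
        CharsetEncoder encoder = documentCharset.newEncoder();
        StringBuilder out = new StringBuilder(text.length() + 16);
        int i = 0;
        while (i < text.length()) {
            // Work on code points so surrogate pairs stay together.
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:
                    String s = text.substring(i, i + len);
                    if (encoder.canEncode(s)) {
                        out.append(s);            // representable: emit as-is
                    } else {
                        out.append("&#").append(cp).append(';');
                    }
            }
            i += len;
        }
        return out.toString();
    }
}
```

With StandardCharsets.UTF_8 as the document charset the fallback branch is not reached for any well-formed text, so Hebrew or Arabic text passes through unchanged. With ISO-8859-1, a character such as € (U+20AC) comes out as &#8364;.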

When typing HTML manually you may find it practical to insert an HTML entity now and then if your editor and/or keyboard won't let you type a certain character (it's easier to just type &copy; than to figure out how to type an actual ©), but when escaping text automatically you just make the page size grow ;-)

I know little about Java, but other languages provide separate functions for the two jobs: one that encodes only the special characters and another that encodes every character that has an entity.

Joachim Sauer · Answer 2 · 2011-02-09T09:59:42.183

If you send the encoding in the Content-Type HTTP header:

Content-Type: text/html; charset=utf-8

then the browser will interpret your source as UTF-8 and you can send all those characters as normal UTF-8 encoded bytes.
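A small Java sketch of what this implies on the server side: the body bytes must really be UTF-8 for the declared charset to be truthful. The class and method names here (Utf8ResponseSketch, encodeBody) are hypothetical, and how the header is actually set depends on your framework; only the standard java.nio.charset API is used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8ResponseSketch {

    // Header value that tells the browser how to decode the body.
    static final String CONTENT_TYPE = "text/html; charset=utf-8";

    // Encode the already-escaped HTML as UTF-8 bytes, matching the header.
    static byte[] encodeBody(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    // What a client does when it trusts the declared charset.
    static String decodeAsDeclared(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    // What happens if the bytes are misread as ISO-8859-1 instead.
    static String decodeAsLatin1(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }
}
```

Decoding with the declared charset returns the original string, while decoding the same bytes as ISO-8859-1 turns each non-ASCII character into two or more unrelated characters, which is the familiar garbled output seen when header and bytes disagree.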

Alternatively, you can specify the encoding in the head of your HTML page like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

This has the advantage that the information is stored with the HTML page if the user saves it and re-opens it from their hard disk at a later time.

Personally I'd do both (send the right header and add the meta-tag to your HTML page). It should be fine as long as the two places agree about the encoding.

Update: HTML 5 has added a new syntax for specifying the encoding:

<meta charset="utf-8">
