In HTML 4, numeric character references are relative to the charset used by the HTML. It does not matter whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means. As such, "ABC&#291;&#311;&#299;" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABC&#196;&#163;&#196;&#183;&#196;&#171;" or "ABC&#xC4;&#xA3;&#xC4;&#xB7;&#xC4;&#xAB;" instead. Most other charsets do not support those particular Unicode characters.
In HTML 5, numeric character references contain original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC&#291;&#311;&#299;" or "ABC&#x123;&#x137;&#x12B;".
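To make the numbers concrete, here is a minimal Delphi 7 sketch that prints the decimal and hexadecimal references for the three non-ASCII characters in that example, taking their codepoints to be U+0123, U+0137 and U+012B. The program name is made up for illustration.

```pascal
program CodepointDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: WideString;
  I: Integer;
begin
  // 'ABC' followed by U+0123, U+0137 and U+012B
  S := 'ABC';
  S := S + WideChar($0123) + WideChar($0137) + WideChar($012B);

  // Skip the three ASCII letters and print both notations for the rest.
  for I := 4 to Length(S) do
    WriteLn('&#' + IntToStr(Ord(S[I])) + ';   &#x' + IntToHex(Ord(S[I]), 1) + ';');
  // Prints:
  //   &#291;   &#x123;
  //   &#311;   &#x137;
  //   &#299;   &#x12B;
end.
```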
So, to answer your question, the first thing you have to do is decide whether you need to use HTML 4 or HTML 5 semantics for numeric character references. Then, you need to assign your Unicode data to a WideString (which is the only Unicode string type that Delphi 7 natively supports), which uses UTF-16, then:
If you need HTML 4:
A. If the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. If the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
If you need HTML 5:
A. If the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. Otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values, outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
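The HTML 5 branch can be sketched roughly as follows. This is only an illustration under a few assumptions: EncodeHtml5 and the program name are hypothetical, "reserved" is taken to mean &, <, >, the double quote, and anything outside printable ASCII, and decimal notation is used. It also combines surrogate pairs by hand in a single loop, covering both cases A and B, rather than calling WideStringToUCS4String(), because I am not certain that routine merges surrogate pairs in every Delphi version.

```pascal
program Html5Refs;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns S with reserved codepoints replaced by
// decimal numeric character references holding the Unicode codepoint.
function EncodeHtml5(const S: WideString): string;
var
  I, CP, Lo: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    CP := Ord(S[I]);
    // Combine a high surrogate with the low surrogate that follows it.
    // An unpaired surrogate is left as-is and emitted as its own reference.
    if (CP >= $D800) and (CP <= $DBFF) and (I < Length(S)) then
    begin
      Lo := Ord(S[I + 1]);
      if (Lo >= $DC00) and (Lo <= $DFFF) then
      begin
        CP := $10000 + ((CP - $D800) shl 10) + (Lo - $DC00);
        Inc(I);
      end;
    end;
    // Assumption: printable ASCII other than & < > " is "unreserved".
    if (CP >= 32) and (CP <= 126) and (CP <> Ord('&')) and
       (CP <> Ord('<')) and (CP <> Ord('>')) and (CP <> Ord('"')) then
      Result := Result + Chr(CP)
    else
      Result := Result + '&#' + IntToStr(CP) + ';';
    Inc(I);
  end;
end;

var
  S: WideString;
begin
  S := 'ABC';
  S := S + WideChar($0123) + WideChar($0137) + WideChar($012B);
  WriteLn(EncodeHtml5(S));   // ABC&#291;&#311;&#299;

  // U+1F600 stored as the surrogate pair D83D DE00
  S := WideChar($D83D);
  S := S + WideChar($DE00);
  WriteLn(EncodeHtml5(S));   // &#128512;
end.
```

For hex notation, the reference line could instead build '&#x' + IntToHex(CP, 1) + ';'.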