Escape Unicode Character 'POPCORN' to HTML Entity

Question

I have a string with an emoji in it

I love 🍿

I need to escape that popcorn emoji with its HTML entity so I get

I love &#x1f37f;

I am writing my code in Java and have been trying different StringEscapeUtils libraries, but I haven't gotten any of them to work. Please help me figure out what I can use to escape special characters like the popcorn emoji.

For reference:

Unicode Character Information

Unicode 8.0 (June 2015)

If the receiving system expects an HTML document with a document encoding of US-ASCII, why not just serialize the entire document as such? Why focus on specific characters? — Tom Blodget, Aug 18 '19 at 19:57

Elliott Frisch · Answer 1 · 2019-08-17T02:58:10.837

2

It's a little hacky, because I don't believe there is a ready-made library to do this. Assuming you can't simply use UTF-8 (or UTF-16) on your HTML page (which should be able to render the emoji as is), you can use Character.codePointAt(CharSequence, int) and Character.offsetByCodePoints(CharSequence, int, int)¹ to perform the conversion whenever a character is outside the ASCII range. Something like,

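String str = "I love 🍿";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    if (ch > 127) {
        // Emit the full code point, which may span two chars (a surrogate pair).
        sb.append(String.format("&#x%x;", Character.codePointAt(str, i)));
        // offsetByCodePoints returns the index of the next code point;
        // the -1 offsets the i++ of the for loop.
        i = Character.offsetByCodePoints(str, i, 1) - 1;
    } else {
        sb.append(ch);
    }
}
System.out.println(sb);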

which outputs (as requested)

I love &#x1f37f;
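
A variation that avoids adjusting the loop index by hand is to read the whole code point and advance by Character.charCount. This is only a minimal sketch; the class and method names (EscapeLoop, escapeNonAscii) are made up for illustration and are not part of any library.

```java
class EscapeLoop {
    // Hypothetical helper: replaces every code point above 127 with a
    // hexadecimal character reference and copies ASCII through unchanged.
    static String escapeNonAscii(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            if (cp > 127) {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                sb.append((char) cp);
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F37F POPCORN followed by an ASCII character
        System.out.println(escapeNonAscii("I love \uD83C\uDF7F!")); // I love &#x1f37f;!
        // A non-ASCII BMP character must not swallow the character after it
        System.out.println(escapeNonAscii("\u0148x")); // &#x148;x
    }
}
```

The second call covers the case raised in the comments: a non-ASCII character that fits in a single char, followed directly by another character.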

¹ Edited based on helpful comments from Andreas.

edited Aug 17 '19 at 02:58

answered Aug 17 '19 at 02:05

Elliott Frisch


I’m not actually rendering this on an html page. I’m passing it to another system and my focus is on keeping the behavior the same as a legacy system. – Matt Urtnowski Aug 17 '19 at 02:09
You should encode anything above 127, not 255, so the result only consists of ASCII characters. – Andreas Aug 17 '19 at 02:41

`Character.codePointCount(str, i, i + 1)` always returns `1`. I believe you meant `i = Character.offsetByCodePoints(str, i, 1) - 1;`, with the `-1` at the end needed to offset the `i++` in the `for` loop. --- To see the problem, insert e.g. `ň` in the string, and the character immediately following will be skipped. – Andreas Aug 17 '19 at 02:49
I would prefer using `str.codePoints()` to get a stream and process the code points that way. Using `codePointCount` and `offsetByCodePoints` is too low-level, tedious, and easy to get wrong. – David Conrad Aug 17 '19 at 16:20

user11809641 · Answer 2 · 2019-08-22T00:26:37.487

1

Normally the emoji4j library works. It has a simple htmlify method for HTML encoding.

For example:

String text = "I love 🍿";

EmojiUtils.htmlify(text); //returns "I love &#127871;"

EmojiUtils.hexHtmlify(text); //returns "I love &#x1f37f;"
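
The decimal and hexadecimal forms refer to the same code point, U+1F37F. A small check using only the JDK, independent of emoji4j, might look like this; the class and method names are invented for the example.

```java
class PopcornReference {
    // Hypothetical helpers that build numeric character references by hand.
    static String decimalRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    static String hexRef(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint) + ";";
    }

    public static void main(String[] args) {
        // Code point of the popcorn emoji, read from its surrogate pair
        int popcorn = "\uD83C\uDF7F".codePointAt(0);
        System.out.println(decimalRef(popcorn)); // &#127871;
        System.out.println(hexRef(popcorn));     // &#x1f37f;
    }
}
```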

edited Aug 22 '19 at 00:26

answered Aug 17 '19 at 02:05

user11809641


This doesn't really answer the question. Given a string that contains an emoji plus other characters, this doesn't provide any way to escape that string. – David Conrad Aug 19 '19 at 16:48

@DavidConrad Thanks for pointing that out! I edited my answer so it uses the library's method for converting emojis to HTML. – user11809641 Aug 22 '19 at 00:27

Sergey Vyacheslavovich Brunov · Answer 3 · 2019-08-17T04:33:09.677

You may use the unbescape library (unbescape: powerful, fast and easy escape/unescape operations for Java).

Example

Add the dependency into the pom.xml file:

<dependency>
    <groupId>org.unbescape</groupId>
    <artifactId>unbescape</artifactId>
    <version>1.1.6.RELEASE</version>
</dependency>

The usage:

import org.unbescape.html.HtmlEscape;
import org.unbescape.html.HtmlEscapeLevel;
import org.unbescape.html.HtmlEscapeType;

<…>

final String inputString = "\uD83C\uDF7F";
final String escapedString = HtmlEscape.escapeHtml(
    inputString,
    HtmlEscapeType.HEXADECIMAL_REFERENCES,
    HtmlEscapeLevel.LEVEL_2_ALL_NON_ASCII_PLUS_MARKUP_SIGNIFICANT
);

// Here `escapedString` has the value: `&#x1f37f;`.

For your use case, either HtmlEscapeType.HTML4_NAMED_REFERENCES_DEFAULT_TO_HEXA or HtmlEscapeType.HTML5_NAMED_REFERENCES_DEFAULT_TO_HEXA should probably be used instead of HtmlEscapeType.HEXADECIMAL_REFERENCES.

score 1 · Accepted Answer · answered Aug 17 '19 at 16:39

I would use CharSequence::codePoints to get an IntStream of the code points and map them to strings, and then collect them, concatenating to a single string:

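import java.util.stream.Collectors;

public String escape(final String s) {
    return s.codePoints()
        // Keep ASCII as is; replace everything else with a hex character reference.
        .mapToObj(codePoint -> codePoint > 127 ?
            "&#x" + Integer.toHexString(codePoint) + ";" :
            new String(Character.toChars(codePoint)))
        .collect(Collectors.joining());
}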

For the specified input, this produces:

I love &#x1f37f;
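
For anyone who wants to try this outside an existing class, here is a self-contained sketch of the same stream approach. The class name CodePointEscaper is made up for the example, and the second input is my own addition to show that characters after a non-ASCII one are preserved.

```java
import java.util.stream.Collectors;

class CodePointEscaper {
    // Streams the code points, so surrogate pairs arrive as one int each
    // and no index bookkeeping is needed.
    static String escape(final String s) {
        return s.codePoints()
                .mapToObj(cp -> cp > 127
                        ? "&#x" + Integer.toHexString(cp) + ";"
                        : String.valueOf((char) cp))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(escape("I love \uD83C\uDF7F"));      // I love &#x1f37f;
        System.out.println(escape("\u0148 and \uD83C\uDF7F!")); // &#x148; and &#x1f37f;!
    }
}
```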
