
I have a string that looks like this: a👏b🙂c, and I want to split it into single characters (each as its own string).

static List<String> split(String text ) {
    List<String> list = new ArrayList<>(text.length());
    for(int i = 0; i < text.length() ; i++) {
        list.add(text.substring(i, i + 1));
    }
    return list;
}

public static void main(String... args) {
    split("a\uD83D\uDC4Fb\uD83D\uDE42c")
            .forEach(System.out::println);
}

As you might have already noticed, instead of 👏 and 🙂 I'm getting two weird characters for each:

a
?
?
b
?
?
c
MAGx2
  • Those are not UTF-16 characters, that's the problem. Those are UTF-32 code points. – rustyx Jul 05 '18 at 08:50
  • As the answers show, that can be done reasonably easily. Once you try to dabble into combining characters that render to single glyphs though, it becomes a whole other kind of hell. – kumesana Jul 05 '18 at 09:38
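The combining-character case mentioned in the comment above can be illustrated with `java.text.BreakIterator`, which groups a base letter with its combining marks. This is only a minimal sketch: the class and method names are made up for illustration, and how `BreakIterator` treats more complex emoji sequences (ZWJ sequences, flags, skin-tone modifiers) depends on the JDK version, so only the simple accent case is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question or answers.
class GraphemeSplitter {

    // Splits text at the character boundaries reported by BreakIterator,
    // so a base letter and its combining marks stay together.
    static List<String> graphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // "e" followed by the combining acute accent (U+0301), then "x"
        String text = "e\u0301x";
        System.out.println(text.codePointCount(0, text.length())); // 3 code points
        System.out.println(graphemes(text).size());                // 2 user-perceived characters
    }
}
```

Splitting by code point alone would separate the accent from its letter here, which is why code points are not always the right unit for "characters as the user sees them".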

3 Answers


As the Character and String API docs explain, you need to work with code points to correctly handle characters outside the Basic Multilingual Plane, which UTF-16 stores as surrogate pairs (two char values each).

"a👏b🙂c".codePoints().mapToObj(Character::toChars).forEach(System.out::println);

will output

a
👏
b
🙂
c
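To see why the `substring(i, i + 1)` loop in the question breaks the emoji apart, it may help to compare UTF-16 code units with code points. The sketch below uses only standard `String` and `Character` methods; the class and helper names are illustrative, and the emoji are written as escapes so the file encoding does not matter.

```java
// Hypothetical demo class, not part of the original answer.
class SurrogateDemo {

    // Number of UTF-16 code units: this is what length(), charAt() and substring() index by.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDC4Fb\uD83D\uDE42c"; // a, U+1F44F, b, U+1F642, c

        System.out.println(codeUnits(text));   // 7
        System.out.println(codePoints(text));  // 5

        // Index 1 and 2 hold the two halves of the first emoji.
        System.out.println(Character.isHighSurrogate(text.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(text.charAt(2)));  // true

        // The code point starting at index 1 needs two char values.
        System.out.println(Character.charCount(text.codePointAt(1)));  // 2
    }
}
```

Each one-char substring taken at index 1 or 2 is a lone surrogate, which the console cannot render and typically shows as `?`.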
Karol Dowbecki

The following will do the job:

List<String> split(String text) {
    return text.codePoints()
            .mapToObj(Character::toChars)
            .map(String::valueOf)
            .collect(Collectors.toList());
}
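For comparison, the same split can be sketched as a plain loop that stays close to the code in the question, stepping by `Character.charCount` instead of by one. This is an alternative under the same assumptions (split by code point, not by grapheme); the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; a loop-based variant of the stream solution.
class CodePointSplitter {

    static List<String> split(String text) {
        List<String> parts = new ArrayList<>(text.codePointCount(0, text.length()));
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            // 1 for BMP characters, 2 for characters stored as a surrogate pair
            int width = Character.charCount(codePoint);
            parts.add(text.substring(i, i + width));
            i += width;
        }
        return parts;
    }

    public static void main(String[] args) {
        // a, U+1F44F, b, U+1F642, c (emoji written as escapes)
        split("a\uD83D\uDC4Fb\uD83D\uDE42c").forEach(System.out::println);
    }
}
```

The stream version is shorter; the loop may be easier to adapt when the index of each piece is also needed.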
Tomasz Linkowski

There is an open-source library, MgntUtils (written by me), with a utility that converts any string into a sequence of Unicode escapes and vice versa, handling code points correctly. It can help you with your problem as well as show what is going on behind the scenes. For example, the code below

String result = "a👏b🙂c";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

would produce the following:

\u0061\u1f44f\u0062\u1f642\u0063
a👏b🙂c

Here is the link to the article that explains the MgntUtils library and where to get it (including Javadoc and source code): Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison. Look for the paragraph "String Unicode converter".
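For readers who only want to inspect the code points without adding a dependency, a similar escape listing can be sketched with the standard library alone. This is not how MgntUtils is implemented and it only covers the encoding direction; the class and method names are made up for illustration.

```java
import java.util.stream.Collectors;

// Hypothetical helper built on the standard library only.
class CodePointEscapes {

    // Formats every code point as a backslash, the letter u, and at least four hex digits.
    static String toEscapes(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("\\u%04x", cp))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // a, U+1F44F, b, U+1F642, c (emoji written as surrogate-pair escapes)
        System.out.println(toEscapes("a\uD83D\uDC4Fb\uD83D\uDE42c"));
    }
}
```

For the string from the question this prints the same five-entry sequence shown above, one entry per code point rather than one per `char`.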

Michael Gantman