Correct way to remove last grapheme of CharSequence

Question

The code:

val plainText = "plainText"
val plainTextWithEmoji = "plainText🥰🥰🥰"

println("plainText=$plainText, length=${plainText.length}")
println("plainTextWithEmoji=$plainTextWithEmoji, length=${plainTextWithEmoji.length}")

// Output:
// plainText=plainText, length=9
// plainTextWithEmoji=plainText🥰🥰🥰, length=15

This output implies that each emoji has a length of 2, not 1.
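To make the counting explicit, here is a minimal sketch comparing the UTF-16 length with the code point count. The helper names `utf16Length` and `codePointLength` are made up for illustration, and the sample string assumes the emoji is U+1F970, written as its surrogate pair:

```kotlin
// Number of UTF-16 code units, which is what `length` reports.
fun utf16Length(text: CharSequence): Int = text.length

// Number of Unicode code points in the same text.
fun codePointLength(text: CharSequence): Int =
    Character.codePointCount(text, 0, text.length)

fun main() {
    // Assumed sample: "a" followed by U+1F970 as a surrogate pair.
    val sample = "a\uD83E\uDD70"
    println(utf16Length(sample))     // 3
    println(codePointLength(sample)) // 2
}
```

The emoji lies outside the Basic Multilingual Plane, so it occupies two `Char` values while still being a single code point.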

When I want to remove the last character:

If I call plainTextWithEmoji.subSequence(0, plainTextWithEmoji.length - 1), the result is wrong, because the emoji's length is more than 1.

To get the correct result from subSequence, I have to call plainTextWithEmoji.subSequence(0, plainTextWithEmoji.length - 2) instead.

But in general, we cannot know whether the last character's length is 1. So when we want to remove the last character, simply calling charSequence.subSequence(0, charSequence.length - 1) can return a wrong result.
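A short sketch of what "wrong result" means here, assuming the text ends with U+1F970: cutting one `Char` leaves the first half of the surrogate pair dangling at the end. The helper `endsWithLoneHighSurrogate` is a hypothetical name used only for this illustration:

```kotlin
// True when the text ends with a high surrogate that has lost its low surrogate.
fun endsWithLoneHighSurrogate(text: CharSequence): Boolean =
    text.isNotEmpty() && Character.isHighSurrogate(text[text.length - 1])

fun main() {
    // Assumed sample: "Hi" followed by U+1F970 as a surrogate pair.
    val sample = "Hi\uD83E\uDD70"

    // Dropping one Char splits the pair and leaves a broken string.
    val cutOne = sample.subSequence(0, sample.length - 1)
    println(endsWithLoneHighSurrogate(cutOne)) // true

    // Dropping two Chars happens to work here, but only because we know the emoji's size.
    val cutTwo = sample.subSequence(0, sample.length - 2)
    println(endsWithLoneHighSurrogate(cutTwo)) // false
}
```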

So, is there any way to remove the last grapheme of a CharSequence? Thanks!

If you use UTF-32 encoding for the emoji and the text, you should see that each of them has the same size, which should be 4. Also, a proper emoji should be a 16-bit hex code, so U+0061 = (binary) 0000 0000 0110 0001, and it should have a length of 4 in UTF-8 encoding. I wonder why it's only 2 in your example. — Morph21, Mar 24 '22 at 10:45
@Szprota21 that emoji is `U+1F970`, so it actually takes two UTF "characters". — Kayaman, Mar 24 '22 at 11:01
What language is that? Java doesn't have those string formatting features. — Generous Badger, Apr 08 '22 at 10:35

score 2 · Answer 1 · answered Apr 08 '22 at 10:35

Finally, I found a solution, inspired by this post. Since UTF-8 is variable-length, to call CharSequence.subSequence and get the correct result, we can find the start index of every grapheme in the sequence with BreakIterator:

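import java.text.BreakIterator

// Removes the last grapheme; an empty sequence is returned unchanged.
fun CharSequence.removeLast(): CharSequence {
    val graphemeStartIndexes = computeGraphemesStartIndexes(this)
    // The last start index is where the final grapheme begins.
    val lastStart = graphemeStartIndexes.lastOrNull() ?: return this
    return this.subSequence(0, lastStart)
}

private fun computeGraphemesStartIndexes(sequence: CharSequence): List<Int> {
    val breakIterator = BreakIterator.getCharacterInstance()
    breakIterator.setText(sequence.toString())
    val graphemesStartIndexes = mutableListOf<Int>()

    // Collect every boundary, starting with the one at index 0.
    val start = breakIterator.first()
    graphemesStartIndexes.add(start)
    while (breakIterator.next() != BreakIterator.DONE) {
        graphemesStartIndexes.add(breakIterator.current())
    }
    // The final boundary is the end of the text, not the start of a grapheme, so drop it.
    return graphemesStartIndexes.apply { removeAt(size - 1) }
}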

Example:

val plainTextEmojiSequence = "Hello🥰"
val plainTextOnlySequence = "Hi~!"

println(plainTextEmojiSequence.removeLast()) // "Hello"
println(plainTextOnlySequence.removeLast())  // "Hi~"
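As a possible variation, only the last boundary is really needed, so the iterator can be walked backwards from the end instead of collecting every start index. This is a sketch under that assumption, not a replacement for the code above; the name `dropLastGrapheme` is hypothetical, and the sample emoji is assumed to be U+1F970:

```kotlin
import java.text.BreakIterator

// Removes the last grapheme by asking the iterator for the boundary before the end.
fun CharSequence.dropLastGrapheme(): CharSequence {
    if (isEmpty()) return this
    val boundaries = BreakIterator.getCharacterInstance()
    boundaries.setText(toString())
    boundaries.last()                      // move to the end of the text
    val lastStart = boundaries.previous()  // start of the final grapheme
    return if (lastStart == BreakIterator.DONE) this else subSequence(0, lastStart)
}

fun main() {
    println("Hello\uD83E\uDD70".dropLastGrapheme()) // Hello
    println("Hi~!".dropLastGrapheme())              // Hi~
}
```

This avoids building a list for long inputs, at the cost of exposing only the final boundary.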

This is not technically related to UTF-8, but the rest of the answer is correct. There are two issues: first, String uses UTF-16, so a single Unicode codepoint can stretch over two `char` values, and second, multiple Unicode codepoints can make up a single grapheme. Your code solves both of those. — Generous Badger, Apr 08 '22 at 10:36
@GenerousBadger Thank you for helping me pin down the issue clearly. `BreakIterator` can detect whether a single grapheme is made up of one or more codepoints, so by using `BreakIterator` we can remove any grapheme in the sentence correctly :) — ZSpirytus, Apr 08 '22 at 10:44
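The second issue raised in these comments (several code points forming one grapheme) can be seen with a combining accent. The sketch below counts boundaries reported by `java.text.BreakIterator`; the helper name `graphemeCount` and the sample string are assumptions chosen for illustration:

```kotlin
import java.text.BreakIterator

// Counts the graphemes reported by the character-instance BreakIterator.
fun graphemeCount(text: CharSequence): Int {
    val boundaries = BreakIterator.getCharacterInstance()
    boundaries.setText(text.toString())
    var count = 0
    // Each successful next() steps over exactly one grapheme.
    while (boundaries.next() != BreakIterator.DONE) count++
    return count
}

fun main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
    val accented = "e\u0301"
    println(accented.length)                                    // 2
    println(Character.codePointCount(accented, 0, accented.length)) // 2
    println(graphemeCount(accented))                            // 1
}
```

Counting code points alone would therefore still split this character, which is why the answer relies on grapheme boundaries.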
