The code:

val plainText = "plainText"
val plainTextWithEmoji = "plainText🥰🥰🥰"

println("plainText=$plainText, length=${plainText.length}")
println("plainTextWithEmoji=$plainTextWithEmoji, length=${plainTextWithEmoji.length}")

// Output:
// plainText=plainText, length=9
// plainTextWithEmoji=plainText🥰🥰🥰, length=15

This implies that each emoji character has a length of 2, not 1.
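To make the length difference concrete, here is a minimal sketch that compares the UTF-16 char count with the Unicode code point count. The helper name `charAndCodePointCounts` is made up for illustration, and the emoji is written as an escaped surrogate pair for U+1F970.

```kotlin
// Hypothetical helper: returns (UTF-16 char count, Unicode code point count).
fun charAndCodePointCounts(text: String): Pair<Int, Int> =
    text.length to text.codePointCount(0, text.length)

fun main() {
    val emoji = "\uD83E\uDD70" // U+1F970 written as its UTF-16 surrogate pair
    val (chars, codePoints) = charAndCodePointCounts(emoji)
    println("chars=$chars, codePoints=$codePoints") // chars=2, codePoints=1
    println(charAndCodePointCounts("plainText"))     // (9, 9)
}
```

`length` counts UTF-16 chars, so a code point outside the Basic Multilingual Plane contributes 2.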

Now suppose I want to remove the last character:

If I call plainTextWithEmoji.subSequence(0, plainTextWithEmoji.length - 1), the result is wrong, because the emoji's length is more than 1.

To get the correct result from subSequence, I have to call plainTextWithEmoji.subSequence(0, plainTextWithEmoji.length - 2).

But in general, we cannot know whether the last character's length is 1. So simply calling charSequence.subSequence(0, charSequence.length - 1) to remove the last character can return a wrong result.
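As a small illustration of what "wrong result" means here, the sketch below cuts one char off a string that ends in U+1F970 and shows that a lone high surrogate is left behind. `naiveRemoveLast` is a hypothetical name for the approach described above, and it assumes a non-empty input.

```kotlin
// Hypothetical helper: the naive "drop one char" approach from the question.
// Assumes the input is not empty.
fun naiveRemoveLast(text: CharSequence): CharSequence =
    text.subSequence(0, text.length - 1)

fun main() {
    val text = "Hi\uD83E\uDD70" // "Hi" followed by U+1F970 (two UTF-16 chars)
    val naive = naiveRemoveLast(text)
    println(naive.length)                            // 3
    println(Character.isHighSurrogate(naive.last())) // true
}
```

The leftover high surrogate is not a valid character on its own, which is why the truncated text does not display correctly.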

So, is there any way to remove the last grapheme of a CharSequence? Thanks!

ZSpirytus
  • If you use UTF-32 encoding for the emoji and the text, you should see that each character has the same size, which should be 4. Also, a proper emoji should be a 16-bit hex code, so U+0061 = (binary) 0000 0000 0110 0001, so it should have a length of 4 in UTF-8 encoding. I wonder why it's only 2 in your example. – Morph21 Mar 24 '22 at 10:45
  • @Szprota21 that emoji is `U+1F970`, so it actually takes two UTF "characters". – Kayaman Mar 24 '22 at 11:01
  • What language is that? Java doesn't have those string formatting features. – Generous Badger Apr 08 '22 at 10:35

1 Answer

I finally found a solution, inspired by this post. Since UTF-8 is variable length, calling CharSequence.subSequence correctly requires knowing where each grapheme begins. We can get the start index of every grapheme in the text with BreakIterator:

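// Assumption: the JDK's BreakIterator; on Android, android.icu.text.BreakIterator offers the same calls.
import java.text.BreakIterator

// Removes the last grapheme (user-perceived character), not just the last char.
fun CharSequence.removeLast(): CharSequence {
    val graphemeStartIndexes = computeGraphemesStartIndexes(this)
    // An empty sequence has no graphemes, so there is nothing to remove.
    val lastStart = graphemeStartIndexes.lastOrNull() ?: return this
    return this.subSequence(0, lastStart)
}

// Returns the start index of every grapheme in the sequence.
private fun computeGraphemesStartIndexes(sequence: CharSequence): List<Int> {
    val breakIterator = BreakIterator.getCharacterInstance()
    breakIterator.setText(sequence.toString())
    val graphemesStartIndexes = mutableListOf<Int>()

    val start = breakIterator.first()
    graphemesStartIndexes.add(start)
    while (breakIterator.next() != BreakIterator.DONE) {
        graphemesStartIndexes.add(breakIterator.current())
    }
    // The final boundary is the end of the text, not the start of a grapheme.
    return graphemesStartIndexes.apply { removeAt(size - 1) }
}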

Example:

val plainTextEmojiSequence = "Hello🥰"
val plainTextOnlySequence = "Hi~!"

println(plainTextEmojiSequence.removeLast()) // "Hello"
println(plainTextOnlySequence.removeLast())  // "Hi~"
ZSpirytus
  • This is not technically related to UTF-8, but the rest of the answer is correct. There are two issues: first, String uses UTF-16, so a single Unicode code point can stretch over two `char` values; second, multiple Unicode code points can make up a single grapheme. Your code solves both of those. – Generous Badger Apr 08 '22 at 10:36
  • @GenerousBadger Thank you for helping me pin down the issue. `BreakIterator` can detect whether a single grapheme is made up of one or more code points, so by using `BreakIterator` we can correctly remove any grapheme from the sentence :) – ZSpirytus Apr 08 '22 at 10:44
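
Since only the last boundary matters, one possible variant skips collecting every start index and instead walks backwards from the end with `BreakIterator.last()` and `previous()`. This is a sketch rather than the answer's own code: the name `dropLastGrapheme` is hypothetical, and it assumes `java.text.BreakIterator` from the JDK.

```kotlin
import java.text.BreakIterator

// Hypothetical helper: drops the final grapheme by asking the iterator for
// the boundary just before the end of the text.
fun CharSequence.dropLastGrapheme(): CharSequence {
    if (isEmpty()) return this           // nothing to remove
    val iterator = BreakIterator.getCharacterInstance()
    iterator.setText(toString())
    iterator.last()                      // move to the end of the text
    val lastStart = iterator.previous()  // boundary where the last grapheme starts
    return subSequence(0, lastStart)
}

fun main() {
    println("Hi~!".dropLastGrapheme())              // Hi~
    println("Hello\uD83E\uDD70".dropLastGrapheme()) // Hello
    println("".dropLastGrapheme().isEmpty())        // true
}
```

How well multi-code-point graphemes (for example emoji joined with zero-width joiners) are detected depends on the `BreakIterator` implementation and runtime version, so it is worth testing with the emoji you actually expect.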