UTF-16 string : how to process over U+10000?

Question

As we know, UTF-16 is variable-length: a character at or above U+10000 takes two 16-bit code units.

However, .NET, Java and Windows WCHAR strings treat UTF-16 as if it were fixed-length... What happens if I use characters above U+10000?

And if they can handle characters above U+10000, how do they do it? For example, in .NET and Java a char is 16 bits, so one char cannot hold a character above U+10000.

(.NET, Java and Windows are just examples. I'm asking how to process characters above U+10000 in general, but I'd also like to know how these platforms do it, for my own understanding.)

Thanks to @dystroy, I now know how they do it. But there is one problem: if a string uses UTF-16 surrogates, a random access operation such as str[3] is an O(N) algorithm, because any character can be 2 or 4 bytes! How is this problem handled?

TLDR: some characters are simply spread over more than one Java char... – Denys Séguret, Feb 13 '14 at 08:40
@dystroy Um.. So there is no way around random access being O(N)? That sounds bad.. – ikh, Feb 13 '14 at 08:59
What kind of answer do you really want? Yes, accessing a random code point in a string is costly. – Denys Séguret, Feb 13 '14 at 08:59
@dystroy That's the answer I wanted, even if I don't seem satisfied... Anyway, if so, every string-related operation needs O(N) to support characters above U+10000? oh;; – ikh, Feb 13 '14 at 09:03
If you have a specific question about efficiency of access, you should ask it separately. – Jukka K. Korpela, Feb 13 '14 at 10:40

score 2 · Accepted Answer · edited May 23 '17 at 12:15

I answered the first part of the question in this QA: basically, some characters are simply spread over more than one Java char.
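To make "spread over more than one char" concrete, here is a minimal sketch of the surrogate-pair arithmetic. The class and method names (SurrogatePairs, encode, decode) are made up for illustration; only Character.toChars is a real JDK call, used here as a cross-check.

```java
import java.util.Arrays;

// Illustrative sketch: how one code point above U+FFFF becomes two 16-bit chars.
class SurrogatePairs {
    // Manually encodes a supplementary code point (U+10000..U+10FFFF) as UTF-16.
    static char[] encode(int codePoint) {
        int v = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (v >> 10));    // top 10 bits -> high surrogate
        char low = (char) (0xDC00 + (v & 0x3FF));   // bottom 10 bits -> low surrogate
        return new char[] { high, low };
    }

    // Inverse operation: combines a surrogate pair back into one code point.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // a code point outside the BMP
        char[] manual = encode(cp);
        System.out.printf("U+%X -> %04X %04X%n", cp, (int) manual[0], (int) manual[1]);
        // Cross-check against the JDK's own encoder.
        System.out.println(Arrays.equals(manual, Character.toChars(cp)));
        System.out.println(decode(manual[0], manual[1]) == cp);
    }
}
```

In real code you would normally call Character.toChars and Character.toCodePoint rather than doing the bit arithmetic by hand.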

To answer the second part, about random access to Unicode code points with str[3], there is more than one method:

charAt is careless and only handles chars, in a fast and obvious way
codePointAt returns a 32-bit int (but needs a char index)
codePointCount counts code points
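A small sketch of how these three methods behave on a string that contains one character outside the BMP. The class name and the sample string are made up for illustration; the String methods are the standard ones.

```java
// Illustrative sketch: char-based vs code-point-based access on the same string.
class CodePointMethods {
    // "a", U+1F600 (stored as two chars), "b": 3 code points in 4 chars.
    static final String S = "a\uD83D\uDE00b";

    public static void main(String[] args) {
        System.out.println(S.length());                             // 4 (chars, not code points)
        System.out.println(Integer.toHexString(S.charAt(1)));       // d83d (a lone high surrogate)
        System.out.println(Integer.toHexString(S.codePointAt(1)));  // 1f600 (the pair, combined)
        System.out.println(Integer.toHexString(S.codePointAt(2)));  // de00 (index lands mid-pair)
        System.out.println(S.codePointCount(0, S.length()));        // 3
    }
}
```

Note that codePointAt(2) does not fail: when the char index points at the second half of a pair, it just returns that low surrogate.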

And yes, counting the code points is costly and basically O(N). Here's how it's done in Java:

    // Counts the code points in a[offset .. offset + count): one pass over every char.
    static int codePointCountImpl(char[] a, int offset, int count) {
        int endIndex = offset + count;
        int n = 0;
        for (int i = offset; i < endIndex; ) {
            n++;
            // A high surrogate followed by a low surrogate counts as one code point.
            if (isHighSurrogate(a[i++])) {
                if (i < endIndex && isLowSurrogate(a[i])) {
                    i++;
                }
            }
        }
        return n;
    }

UTF-16 is a bad format for dealing with code points, especially once you leave the BMP. Most programs simply don't handle code points, which is the reason this format is usable. Most String operations are fast because they don't deal with code points: all standard APIs take char indexes as arguments and don't worry about which code points lie behind them.
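If a program really does need repeated random access by code point, one possible workaround (not something the standard String API does for you) is to decode the string once into an int array, which is effectively UTF-32. The sketch below assumes Java 8 or later for codePoints(); the class and method names are made up for illustration.

```java
// Illustrative sketch: pay O(N) once, then index code points in O(1).
class CodePointArray {
    // Decodes the UTF-16 string into one int per code point.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Re-encodes the code points as a UTF-16 String.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        int[] cps = toCodePoints(s);
        System.out.println(cps.length);                     // 3
        System.out.println(Integer.toHexString(cps[1]));    // 1f600, constant-time lookup
        System.out.println(fromCodePoints(cps).equals(s));  // true
    }
}
```

The trade-off is memory: every code point takes 4 bytes, even for text that is entirely ASCII.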

Thank you! So, is ignoring characters above U+10000 not a bad choice? I guess I should consider UTF-32 if I work with characters above U+10000.. – ikh, Feb 13 '14 at 09:11
I don't understand your need. What do you do with your code points? – Denys Séguret, Feb 13 '14 at 09:12
Well, I'm writing a string library and I think I should understand this problem thoroughly.. – ikh, Feb 13 '14 at 09:19
Not knowing your problem, you probably shouldn't use UTF-32. UTF-8 is most often the way to go. – Denys Séguret, Feb 13 '14 at 09:20

score 2 · Answer 2 · answered Feb 13 '14 at 09:01

Usually this problem is not handled at all. Many languages and libraries that use UTF-8 or UTF-16 do substring and index operations on code units, not code points. That is, str[3] just returns the code unit at index 3, which may be one half of a surrogate pair. Access is of course constant-time that way, but for anything outside the BMP (or outside ASCII, for UTF-8) you have to be careful what you do.

If you're lucky there are methods to access code points, e.g. String.codePointAt in Java. But to reach the n-th code point you have to scan the string from the start and determine the code point boundaries.
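In Java, that scan can be delegated to String.offsetByCodePoints, which walks forward a given number of code points and returns the matching char index. A minimal sketch follows; the class and method names are made up for illustration.

```java
// Illustrative sketch: "give me the n-th code point" on top of a UTF-16 String.
class NthCodePoint {
    // Returns the n-th code point of s (0-based). Cost is O(n): the scan starts at index 0.
    // Throws IndexOutOfBoundsException if s has n or fewer code points.
    static int codePointAtOrdinal(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n); // skip n code points from the start
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00bc";
        // Code unit access: index 2 is the second half of the surrogate pair.
        System.out.println(Integer.toHexString(s.charAt(2)));              // de00
        // Code point access: the third code point is 'b'.
        System.out.println(Integer.toHexString(codePointAtOrdinal(s, 2))); // 62
    }
}
```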

Generally, though, even accessing code points doesn't gain you much except at the library level. Strings are often ultimately used to interact with the user, and in that case graphemes or visual string length matter more than code points. And then you have even more processing to do.
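As a rough illustration of the grapheme point, java.text.BreakIterator can walk "user-perceived character" boundaries. This is only a sketch: the class and method names are made up, and how closely the result matches Unicode grapheme clusters for complex sequences (emoji with modifiers, for instance) depends on the JDK version.

```java
import java.text.BreakIterator;

// Illustrative sketch: chars, code points and graphemes can all differ.
class GraphemeCount {
    // Counts user-perceived characters by walking character boundaries.
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'x'.
        String s = "e\u0301x";
        System.out.println(s.length());                      // 3 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(countGraphemes(s));               // 2 graphemes
    }
}
```

Here every code point fits in one char, yet the string still has fewer visible characters than code points, so indexing by code point would not be enough for user-facing work.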

Joey

Oh, in my question, `str[3]` means the third character.. and Java is just an example. I'm asking HOW to process, not which Java method does the processing. – ikh Feb 13 '14 at 09:04
ASCII? Hopefully we have more than ASCII in 32 bits. Maybe this part of your answer isn't clear. – Denys Séguret Feb 13 '14 at 09:05
@dystroy: UTF-8 and UTF-16 are variable-length. One makes access to ASCII characters straightforward, the other makes access to BMP characters straightforward. For everything else you'll have to handle the variable-length issue. That's what was meant there. That being said, unless you process strings by the tens or hundreds of gibibytes you're unlikely to really notice the difference in naïve indexing or actually finding code points. – Joey Feb 13 '14 at 09:07
Oh, you were talking of UTF-8. OK. I think this makes the answer more complex though. – Denys Séguret Feb 13 '14 at 09:08
@ikh: It makes little difference. Java is just an example in my answer too. If you want to, I can add C# or Qt to that as well, but *it doesn't matter*. Most things that gained Unicode support early and thus are stuck with UTF-16 go the easy route and ignore the code point route (unless they're specifically built for text processing, but then they're probably using UTF-32 internally anyway) and just give you code units. Python is a nice exception to that rule by even completely hiding the underlying byte representation of Unicode strings. – Joey Feb 13 '14 at 09:10
@Joey Um.. so I don't have to think about this problem, and can just act as if UTF-16 were fixed-length, right? – ikh Feb 13 '14 at 09:24
@ikh, none of us knows what you're up to. If in doubt, just use a decent framework or library that supports Unicode and ignore anything that isn't a problem yet. – Joey Feb 13 '14 at 09:43
@Joey In fact, I'm writing a "string library"... – ikh Feb 13 '14 at 10:21
