
In a C program, I want to sort a list of valid UTF-8-encoded strings in Unicode code point order. No collation, no locale-awareness.

So I need a compare function. It's easy enough to write such a function that iterates over the Unicode characters. (I happen to be using GLib, so I'd iterate with g_utf8_next_char and compare the return values of g_utf8_get_char.)
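Something like this is what I have in mind (a rough sketch, assuming valid, NUL-terminated UTF-8 input):

```c
#include <glib.h>

/* Compare two valid UTF-8 strings by Unicode code point. */
static int
utf8_codepoint_cmp (const gchar *a, const gchar *b)
{
    while (*a && *b) {
        gunichar ca = g_utf8_get_char (a);
        gunichar cb = g_utf8_get_char (b);
        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        a = g_utf8_next_char (a);
        b = g_utf8_next_char (b);
    }
    /* A string that is a prefix of the other sorts first. */
    return (*a) ? 1 : (*b) ? -1 : 0;
}
```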

But what I'm wondering, out of curiosity and possibly simplicity and efficiency, is: will a simple byte-for-byte strcmp (or g_strcmp0) actually do the same job? I'm thinking that it should, since UTF-8 encodes the most significant bits first, and a code point that needs N+1 bytes to encode will have a larger initial byte than a code point that needs only N bytes.
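For example, a quick (and certainly not exhaustive) sanity check along those lines seems to bear this out:

```c
#include <assert.h>
#include <string.h>

int
main (void)
{
    /* "z" is U+007A (byte 0x7A); "é" is U+00E9 (bytes 0xC3 0xA9 in UTF-8).
     * The multi-byte sequence starts with 0xC3 > 0x7A, so the byte-wise
     * comparison agrees with the code point comparison (0x7A < 0xE9). */
    assert (strcmp ("z", "\xC3\xA9") < 0);
    return 0;
}
```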

But maybe I'm missing something? Thanks in advance.

skagedal

1 Answer


Yes, UTF-8 preserves codepoint order, so you can just use strcmp. That's one of the (many) beautiful points of UTF-8.
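For instance, if the strings sit in a plain array of pointers, a byte-wise comparator is all you need (a minimal sketch; `strings` and `n` are hypothetical names for your array and its length):

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain byte-wise comparison of two UTF-8 strings,
 * which is also Unicode code point order. */
static int
cmp_utf8_bytes (const void *pa, const void *pb)
{
    const char *a = *(const char * const *) pa;
    const char *b = *(const char * const *) pb;
    return strcmp (a, b);
}

/* Usage, with a hypothetical array `char *strings[]` of length `n`:
 * qsort (strings, n, sizeof (char *), cmp_utf8_bytes);
 */
```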

One caveat is that codepoints in Unicode are UTF-32 values, and some people who talk about collating Unicode strings in "codepoint" order are actually using the word "codepoint" incorrectly to mean "UTF-16 code unit". If you want the order to match UTF-16 code unit collation, a good bit more work is involved.
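If you did need UTF-16 code unit order, one way (a sketch only, assuming valid Unicode scalar values) is to map each code point to a key that sorts like its UTF-16 encoding, so that supplementary characters, whose lead surrogates lie in 0xD800..0xDBFF, sort below U+E000..U+FFFF:

```c
#include <glib.h>

/* Key whose numeric order matches the order of the code point's UTF-16
 * encoding: BMP code points keep their value in the high 16 bits;
 * supplementary code points get (lead surrogate, trail surrogate)
 * packed into 32 bits. */
static guint32
utf16_order_key (gunichar c)
{
    if (c < 0x10000)
        return (guint32) c << 16;
    c -= 0x10000;
    return ((0xD800u + (c >> 10)) << 16) | (0xDC00u + (c & 0x3FFu));
}

static int
utf8_cmp_utf16_order (const gchar *a, const gchar *b)
{
    while (*a && *b) {
        guint32 ka = utf16_order_key (g_utf8_get_char (a));
        guint32 kb = utf16_order_key (g_utf8_get_char (b));
        if (ka != kb)
            return (ka < kb) ? -1 : 1;
        a = g_utf8_next_char (a);
        b = g_utf8_next_char (b);
    }
    return (*a) ? 1 : (*b) ? -1 : 0;
}
```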

R.. GitHub STOP HELPING ICE
  • Thanks a lot! I was about to follow up on my use case and how I don't think the caveat applies, and then saw that this information is right there in [the standard](http://www.w3.org/TR/xml-c14n#DocumentOrder) I'm trying to implement: "Lexicographic comparison, which orders strings from least to greatest alphabetically, is based on the UCS codepoint values, which is equivalent to lexicographic ordering based on UTF-8." `:-)` – skagedal Aug 20 '13 at 08:38