Does Java String.getBytes("UTF-8") preserve lexicographical order?

Question

If I have a lexicographically sorted list of Java Strings [s1, s2, s3, s4, ..., sn], and then convert each String into a byte array using UTF-8 encoding, bx = sx.getBytes("UTF-8"), is the list of byte arrays [b1, b2, b3, ..., bn] also lexicographically sorted?

since UTF-8 is a variable width encoding, I would say that the sort order will not be preserved — Dmitry B., Aug 15 '12 at 23:11
I'm not sure your question makes any sense; how would you sort bits/bytes lexicographically? The character set you map those bits/bytes to is the determining factor. — Brian Roach, Aug 15 '12 at 23:14
@Brian Roach Lexicographical order on byte arrays is similar to that on Strings. Just replace "character at x" with "byte at x". See e.g. http://stackoverflow.com/questions/5108091/java-comparator-for-byte-array-lexicographic — Carsten, Aug 15 '12 at 23:32
@Dmitry Not necessarily. I do not need to compare all bytes, only until the first difference. Since UTF-8 is reversible the first difference in length for 2 characters should imply difference in bytes of their encoding. I'm however not sure this is enough to preserve order. — Carsten, Aug 15 '12 at 23:39
@DmitryBeransky: But UTF-8 was specifically designed to preserve sort order nevertheless. — Mechanical snail, Aug 16 '12 at 00:32
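To make the comments above concrete, here is a minimal sketch of a lexicographic comparison on byte arrays. The class and method names are made up for illustration. Java's `byte` is signed, so the sketch masks each byte with `0xFF`; without that, any byte at or above 0x80 (every non-ASCII UTF-8 byte) would compare as negative.

```java
import java.nio.charset.StandardCharsets;

public class ByteLexOrder {
    // Lexicographic comparison of two byte arrays, treating each byte as
    // an unsigned value in 0..255. Hypothetical helper, not a JDK method.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y; // first differing byte decides
            }
        }
        return a.length - b.length; // a proper prefix sorts first
    }

    public static void main(String[] args) {
        byte[] ascii = "z".getBytes(StandardCharsets.UTF_8);         // 0x7A
        byte[] accented = "\u00e9".getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9

        System.out.println(compareUnsigned(ascii, accented) < 0); // true
        System.out.println(ascii[0] < accented[0]); // false: signed comparison gets it wrong
    }
}
```

On Java 9 and later, `java.util.Arrays.compareUnsigned(byte[], byte[])` should do the same job.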

Mechanical snail · Accepted Answer · 2012-08-16T00:30:59.383

6

Yes. According to RFC 3629:

The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers. Of course this is of limited interest since a sort order based on character numbers is almost never culturally valid.

As Ian Roberts pointed out, this applies for "true UTF-8 (such as String.getBytes will give you)", but beware of DataInputStream's fake UTF-8, which will sort [U+000000] after [U+000001] and [U+00F000] after [U+10FFFF].
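The two modified UTF-8 cases mentioned above can be checked with a small sketch. It assumes `DataOutputStream.writeUTF` as the source of modified UTF-8 and strips the two-byte length prefix that method writes; the class and helper names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Order {
    // Unsigned lexicographic comparison of byte arrays (hypothetical helper).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF, y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    static String ofCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    // Modified UTF-8 as written by writeUTF, minus its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static int utf8Order(int cpA, int cpB) {
        return compareUnsigned(ofCodePoint(cpA).getBytes(StandardCharsets.UTF_8),
                               ofCodePoint(cpB).getBytes(StandardCharsets.UTF_8));
    }

    static int modifiedOrder(int cpA, int cpB) {
        return compareUnsigned(modifiedUtf8(ofCodePoint(cpA)),
                               modifiedUtf8(ofCodePoint(cpB)));
    }

    public static void main(String[] args) {
        // True UTF-8 follows code point order.
        System.out.println(utf8Order(0x0000, 0x0001) < 0);   // true
        System.out.println(utf8Order(0xF000, 0x10FFFF) < 0); // true
        // Modified UTF-8: U+0000 becomes C0 80, and U+10FFFF becomes two
        // 3-byte surrogate encodings starting with ED, below EF for U+F000.
        System.out.println(modifiedOrder(0x0000, 0x0001) > 0);   // true
        System.out.println(modifiedOrder(0xF000, 0x10FFFF) > 0); // true
    }
}
```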

edited Aug 16 '12 at 00:30

answered Aug 15 '12 at 23:51

Mechanical snail


For completeness, note that this is correct for _true_ UTF-8 (such as `String.getBytes` will give you) but not necessarily for the "[modified UTF-8](http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8)" used by `DataInputStream` and friends. – Ian Roberts Aug 16 '12 at 00:17
@IanRoberts: Right. In fact modified UTF-8 sorts [U+000000] after [U+000001] and [U+00F000] after [U+10FFFF]. – Mechanical snail Aug 16 '12 at 00:31
Are you sure this answer is correct? Isn't the normal Java lexicographical order for Strings based on UTF-16 rather than Unicode code points? – R.. GitHub STOP HELPING ICE Oct 14 '13 at 06:20
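The concern in the last comment can be tested directly. `String.compareTo` compares UTF-16 code units, so a supplementary character (stored as a surrogate pair in the range D800..DFFF) sorts before characters in U+E000..U+FFFF, while UTF-8 bytes follow code point order. The sketch below uses U+FFFD and U+1F600 as an arbitrary pair; `compareByCodePoint` is a hypothetical helper and assumes well-formed strings with no unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;

public class Utf16VsUtf8Order {
    // Unsigned lexicographic comparison of byte arrays (hypothetical helper).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF, y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    // Compares strings by Unicode code point rather than by UTF-16 code unit.
    static int compareByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) return Integer.compare(ca, cb);
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // Whichever string still has characters left sorts last.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String a = "\uFFFD";                              // one code unit: FFFD
        String b = new String(Character.toChars(0x1F600)); // code units: D83D DE00

        byte[] ua = a.getBytes(StandardCharsets.UTF_8); // EF BF BD
        byte[] ub = b.getBytes(StandardCharsets.UTF_8); // F0 9F 98 80

        System.out.println(a.compareTo(b) > 0);           // true: a sorts after b in UTF-16
        System.out.println(compareUnsigned(ua, ub) < 0);  // true: a sorts before b in UTF-8
        System.out.println(compareByCodePoint(a, b) < 0); // true: agrees with the UTF-8 order
    }
}
```

So for lists that may contain supplementary characters, sorting with a code point comparator like this one, instead of the natural `String` order, should keep the UTF-8 byte arrays in the same order.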

score -2 · Answer 2 · answered Aug 15 '12 at 23:13


You get a list/array of objects X, in a given order.

You create a new list/array Y of such objects, applying a method.

Y will have the ordering that you created it with (normally you will have just kept X order). No reordering happens.

Also, lexicographical ordering for a byte[] is meaningless.


SJuan76



lexicographical ordering for a byte[] is not meaningless. See e.g. http://stackoverflow.com/questions/5108091/java-comparator-for-byte-array-lexicographic – Carsten Aug 15 '12 at 23:34
