
What is the theory behind Unicode sorting? I understand how it works, but I don't understand why this standard was chosen for collation.

It seems that when you have two strings to compare, using ucol_strcollIter() for example:

ucol_strcollIter(collator, &stringIter1, &stringIter2, &Status)

Then, say the two strings are:

string string1 = "hello"
string string2 = "héllo"

Under the "Secondary" collation strength, string1 is ordered before string2, and the difference between them is found at the secondary level:

<1 hello
<2 héllo

BUT

If you have trailing spaces, like:

string string1 = "hello  "
string string2 = "héllo "

then the accented hello (string2) is placed before string1, and the two strings already differ at the primary level:

<1 héllo  
<1 hello 

Why does the Unicode Collation Algorithm take trailing spaces into account?

Is there some reason behind this?

hippietrail
user3404884

3 Answers


This is an old question but I'll answer for others in the future.

The original 'they' is the International Organization for Standardization, which published ISO/IEC 14651, a standard for collation of text in any encoding scheme but with a goal of supporting Unicode. This standard was largely implementation independent.

Then the Unicode Consortium published the Unicode Collation Algorithm, which is compatible with ISO-14651 but goes much farther in terms of implementation details.

Collation depends on language sorting rules, so collation classes usually take a locale as a parameter. The default sort order is defined in the Default Unicode Collation Element Table (DUCET). If you use the ICU4J library, its default ordering tracks DUCET (through the CLDR root collation).

The comparison algorithm is based on a minimum of 3 levels for compliance with ISO/IEC 14651. The levels are defined as follows.

  1. Base characters (e.g. a, b, c, d)
  2. Accents
  3. Case / Variants
  4. Punctuation
  5. Identical

Characters that differ only by an accent share the same level-1 weight. So an accented 'á' compares equal to 'a' at level 1, and level 2 is used as a tie-breaker.
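To make the level-by-level idea concrete, here is a minimal Python sketch. The weight table is invented for illustration (it is not DUCET), and `TOY_WEIGHTS`, `sort_key` and `deciding_level` are hypothetical names. Only the structure is meant to mirror the real thing: all level-1 weights are compared first, and level-2 weights are consulted only on a tie.

```python
# Toy model of two-level collation. The weights are made up; real
# weights come from DUCET / CLDR. Each character maps to
# (primary, secondary).
TOY_WEIGHTS = {
    " ": (1, 1),        # space has a non-zero primary weight
    "e": (10, 1),
    "\u00e9": (10, 2),  # é: same primary as "e", different secondary
    "h": (11, 1),
    "l": (12, 1),
    "o": (13, 1),
}


def sort_key(text):
    """Return (all primary weights, all secondary weights) for text."""
    primaries = tuple(TOY_WEIGHTS[ch][0] for ch in text)
    secondaries = tuple(TOY_WEIGHTS[ch][1] for ch in text)
    return (primaries, secondaries)


def deciding_level(a, b):
    """Return the first level (1 or 2) at which a and b differ, else None."""
    for level, (wa, wb) in enumerate(zip(sort_key(a), sort_key(b)), start=1):
        if wa != wb:
            return level
    return None
```

With this table, "hello" and "héllo" tie at level 1 and are separated at level 2. Once trailing spaces are added, "héllo " has a shorter level-1 sequence than "hello  ", so the order is already decided at level 1 and the accent is never consulted.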

The default rules are there for a reason but can be customized for individual use cases. Note that languages sort differently and sort order does not typically match the order in which characters appear in Unicode. Language sort order does not equal binary sort order.
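A quick way to see the gap between binary order and language order, using only the Python standard library. Note that `rough_key` is a hypothetical helper and only a crude stand-in for a real collator: it strips combining marks to fake a level-1 key, which is not what the Unicode Collation Algorithm actually does.

```python
import unicodedata

words = ["zebra", "\u00e9clair", "eagle"]  # "\u00e9clair" is "éclair"

# Code point (binary) order: é is U+00E9, which is greater than
# z (U+007A), so "éclair" lands after "zebra".
binary_order = sorted(words)


def rough_key(word):
    """Crude approximation: base letters first, accents as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, decomposed)


# Accent-insensitive first pass: "éclair" now sorts among the e-words.
rough_order = sorted(words, key=rough_key)
```

Here `binary_order` puts "éclair" last, while `rough_order` places it between "eagle" and "zebra", which is what most readers of a Latin-script language would expect.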

Refer to the Unicode Collation Algorithm for a very detailed explanation.

Trent Wood

Probably the best TP would be this.

You can try various option combinations with the ICU Collation Demo. (give "alternate=shifted" a try)
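For what "alternate=shifted" changes, here is a small Python sketch under stated assumptions: the weights are invented (not DUCET), `TOY_WEIGHTS`, `VARIABLE` and `sort_key` are hypothetical names, and the model stops at level 2. In the real algorithm, shifted variable characters such as spaces are moved to a fourth level rather than dropped, so they only matter at quaternary strength or above. In ICU4C the option is, as far as I know, set with `ucol_setAttribute(collator, UCOL_ALTERNATE_HANDLING, UCOL_SHIFTED, &status)`.

```python
# Toy model of "non-ignorable" vs "shifted" handling of spaces.
# Weights are made up; each character maps to (primary, secondary).
TOY_WEIGHTS = {
    " ": (1, 1),
    "e": (10, 1),
    "\u00e9": (10, 2),  # é: same primary as "e", different secondary
    "h": (11, 1),
    "l": (12, 1),
    "o": (13, 1),
}
VARIABLE = {" "}  # characters affected by the alternate setting


def sort_key(text, shifted=False):
    """Return (primary weights, secondary weights) for text.

    With shifted=True, variable characters contribute nothing at
    levels 1 and 2 (this sketch does not model the fourth level).
    """
    primaries, secondaries = [], []
    for ch in text:
        if shifted and ch in VARIABLE:
            continue
        primary, secondary = TOY_WEIGHTS[ch]
        primaries.append(primary)
        secondaries.append(secondary)
    return (tuple(primaries), tuple(secondaries))
```

With the default handling, "héllo " sorts before "hello  " because its level-1 sequence is shorter. With `shifted=True` the spaces drop out, both strings tie at level 1, and "hello  " sorts first because of the accent at level 2.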

Scott Russell

Because the space character has a non-zero primary collation weight: 0x0209 in the Default Unicode Collation Element Table (search for "# SPACE").

Spaces, trailing or not, are part of the string.

Random832
  • Yes, that does make sense. I guess what I am trying to figure out is why ICU Collation doesn't use lexicographical ordering. It seems that if you use lexicographic ordering, then the extra spaces shouldn't make a difference. But ICU uses the entire string to form the sort key instead, which affects the weights significantly. Is there a TP (Technical Paper) on this? – user3404884 Dec 04 '14 at 22:35
  • I'm not sure what you mean by lexicographic ordering. – Random832 Dec 04 '14 at 22:35
  • By that I mean each character is compared against the other in sequence. H -> H, e -> é, .... etc. – user3404884 Dec 04 '14 at 22:52
  • Yes, and space > null. It looks like there is an option to ignore spaces and certain punctuation marks, but I can't find example code, and it would ignore space in all positions rather than only at the end of the string. – Random832 Dec 04 '14 at 22:53