How to tweak ICU's UnicodeString::caseCompare (or get the same effect)

Question

I'm not very familiar with how case folding and case-insensitive comparison work, or with ICU in general.

Right now, we have some methods that wrap various overloads of UnicodeString::caseCompare, and I want to change them to do something slightly different: I want dotted and dotless i's to compare equal (regardless of case).

I know that ICU has a collation API, but I'm not sure how to start off with exactly the same rules as UnicodeString::caseCompare, and modify from there.

Shawn · Answer 1 · 2018-09-10T23:18:17.200

I don't see a way to do this using the C++ UnicodeString class.

You have to drop down to the lower-level string-as-an-array-of-UChars functions from unicode/ustring.h. In particular, u_strCaseCompare() is probably what you want, or u_strcasecmp() combined with UnicodeString's getTerminatedBuffer() method.

Documentation for the U_FOLD_CASE_EXCLUDE_SPECIAL_I option:

Use the modified set of mappings provided in CaseFolding.txt to handle dotted I and dotless i appropriately for Turkic languages (tr, az).

I think that means to treat them as equivalent.

Edit with actual testing:

#include <stdio.h>
#include <stdlib.h>
#include <unicode/ustring.h>
#include <unicode/stringoptions.h>

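/* Compare a and b (given as UTF-8) twice: once with default case folding,
   once with the Turkic special-i folding, and print both results. */
void comp(const char *a, const char *b) {
  UChar s1[10], s2[10];
  UErrorCode err = U_ZERO_ERROR;
  int32_t len1, len2;
  u_strFromUTF8(s1, 10, &len1, a, -1, &err);
  u_strFromUTF8(s2, 10, &len2, b, -1, &err);
  if (U_FAILURE(err)) {
    fprintf(stderr, "UTF-8 conversion failed: %s\n", u_errorName(err));
    return;
  }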
  printf("%s <=> %s: %d (Without special i) %d (With special i)\n", a, b,
         u_strCaseCompare(s1, len1, s2, len2, 0, &err),
         u_strCaseCompare(s1, len1, s2, len2, U_FOLD_CASE_EXCLUDE_SPECIAL_I, &err));
}

int main(void) {
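  const char *lc_dotted_i = "i";       /* U+0069 */
  const char *lc_dotless_i = "\u0131"; /* U+0131 LATIN SMALL LETTER DOTLESS I */
  const char *uc_dotless_i = "I";      /* U+0049 */
  const char *uc_dotted_i = "\u0130";  /* U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE */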

  comp(lc_dotted_i, lc_dotless_i);
  comp(uc_dotted_i, uc_dotless_i);
  comp(lc_dotted_i, uc_dotted_i);
  comp(lc_dotless_i, uc_dotless_i);
  comp(lc_dotted_i, uc_dotless_i);
  comp(lc_dotless_i, uc_dotted_i);
  return 0;
}

Results:

i <=> ı: -200 (Without special i) -200 (With special i)
İ <=> I: 1 (Without special i) -200 (With special i)
i <=> İ: -1 (Without special i) 0 (With special i)
ı <=> I: 200 (Without special i) 0 (With special i)
i <=> I: 0 (Without special i) -200 (With special i)
ı <=> İ: 200 (Without special i) 200 (With special i)
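
The pattern in these results can be reproduced with a toy model. This is only a sketch, not ICU's implementation: fold_i and toy_compare are hypothetical names, uint16_t stands in for UChar, only the four i variants are handled, and the comparison is assumed to return the difference of the first differing folded code unit (or 1 / -1 when one folded string is a prefix of the other). The mappings themselves are the entries for U+0049 and U+0130 in CaseFolding.txt.

```c
#include <stdint.h>

/* Fold one i variant into out[] and return the folded length.
   Default folding: I -> i, İ -> i + U+0307 (full folding).
   Turkic folding:  I -> ı, İ -> i.
   The lowercase letters i and ı fold to themselves in both modes. */
static int fold_i(uint16_t c, int turkic, uint16_t out[2]) {
  if (c == 0x0049) {               /* I */
    out[0] = turkic ? 0x0131 : 0x0069;
    return 1;
  }
  if (c == 0x0130) {               /* İ */
    out[0] = 0x0069;
    if (turkic) return 1;
    out[1] = 0x0307;               /* the dot survives as a combining mark */
    return 2;
  }
  out[0] = c;
  return 1;
}

/* Toy comparison of two single characters after folding (assumed model). */
static int toy_compare(uint16_t a, uint16_t b, int turkic) {
  uint16_t fa[2], fb[2];
  int la = fold_i(a, turkic, fa);
  int lb = fold_i(b, turkic, fb);
  for (int k = 0; k < la && k < lb; k++) {
    if (fa[k] != fb[k]) return (int)fa[k] - (int)fb[k];
  }
  return (la > lb) - (la < lb);    /* the shorter folded string sorts first */
}
```

For example, toy_compare(0x0130, 0x0049, 0) gives 1 and toy_compare(0x0130, 0x0049, 1) gives -200, matching the second row. In neither mode do all four characters fold to the same thing, which is why the option on its own does not make them all compare equal.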

My understanding of that option is that it makes 'normal' i map to capital I with dot, and 'normal' capital I map to lowercase i without dot — Bwmat, Sep 10 '18 at 23:08
@Bwmat I whipped up a quick test program and included it and the results in my answer to show actual behavior. (The 0's are the cases where the two characters are considered equal). So not quite what you're looking for. — Shawn, Sep 10 '18 at 23:19
@Bwmat I tried messing with collation and normalization equivalence too, with no luck. Next best suggestion is replacing all non-dotted i's with dotted ones (or vice versa) before doing comparisons. — Shawn, Sep 11 '18 at 00:35
That's what I was thinking too... But what about combining characters? I think I read somewhere that you can form a dotted I by adding a dot to a dotless I with a combining character, so this probably needs some normalization as well — Bwmat, Sep 11 '18 at 01:07
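
The pre-pass suggested in these comments could look roughly like the following. It is a minimal sketch rather than a tested ICU recipe: unify_i is a hypothetical helper, uint16_t stands in for UChar, and it assumes a combining dot above (U+0307) only needs handling when it directly follows an i variant. The rewritten buffers would then be passed to u_strCaseCompare() with options set to 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Rewrite s in place so that I, i, İ (U+0130) and ı (U+0131) all become
   plain 'i', and drop any U+0307 that directly follows one of them.
   Returns the new length in code units. */
static size_t unify_i(uint16_t *s, size_t len) {
  size_t out = 0;
  int after_i = 0;                 /* was the last kept unit an i variant? */
  for (size_t k = 0; k < len; k++) {
    uint16_t c = s[k];
    if (c == 0x0307 && after_i) {
      continue;                    /* dot belonging to the preceding i */
    }
    after_i = (c == 0x0049 || c == 0x0069 || c == 0x0130 || c == 0x0131);
    s[out++] = after_i ? 0x0069 : c;
  }
  return out;
}
```

Applied to both strings before comparing, this makes İ, I, ı, i and the decomposed form I + U+0307 all end up as the same code unit. A dot separated from its base letter by another combining mark is not handled by this sketch.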
