I'm using PCRE2 (built with Unicode support) in a C++ program. I can successfully match umlauts case-insensitively:
auto s1 = (PCRE2_SPTR)"äöü";
auto s2 = (PCRE2_SPTR)"ÄÖÜ";
auto re = pcre2_compile(s1, PCRE2_ZERO_TERMINATED, PCRE2_CASELESS | PCRE2_UTF, ...);
pcre2_match(re, s2, ...); // match!
However different encodings of the same letter cause problems. Example:
auto s1 = "\xC3\x9C";     // U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS (precomposed, NFC)
auto s2 = "\x55\xCC\x88"; // 'U' followed by U+0308 COMBINING DIAERESIS (decomposed, NFD)
are not recognized as the same.
Example: macOS reports umlauts in directory names in the decomposed form (3 bytes, "combining"), while user input from a search field arrives precomposed (2 bytes, "with"). No match. Is there a way to make PCRE2 "see the equality" without doing some sort of normalization beforehand? I had hoped PCRE2 would handle that internally.
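For illustration, here is a minimal sketch (names are mine, not from any library) showing why a byte-oriented engine cannot see the equivalence: the two literals above are canonically equivalent text but different byte sequences.

```cpp
#include <string>

// Precomposed (NFC): U+00DC "Ü" encodes to two UTF-8 bytes.
const std::string kNfc = "\xC3\x9C";
// Decomposed (NFD): 'U' + U+0308 encodes to three UTF-8 bytes.
const std::string kNfd = "\x55\xCC\x88";

// A plain byte comparison, with no Unicode awareness -- this is
// effectively what a byte-level regex match boils down to, so the
// canonically equivalent strings are treated as distinct.
inline bool bytes_equal(const std::string& a, const std::string& b) {
    return a == b;
}
```

Since `kNfc` is 2 bytes and `kNfd` is 3 bytes, `bytes_equal(kNfc, kNfd)` is false, even though both render as "Ü".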
Edit: @Eljay: Thanks for your answer. I think you are technically right, but I guess I have to find an alternative. Normalizing the "needle" before searching is surely fine, but all the "haystacks" too? For short filenames this might be acceptable, but for crawling through gigabytes of text it seems too expensive.

Idea #1: Since the software only needs to search Latin-based text, it looks manageable to use an A-Z equivalence table with pairs of "with XXX" and "combining XXX". I tried a few examples like "(?:Ü|Ü)" (the first Ü encoded as two bytes, "U with ...", the second as three bytes, "U combining ..."). That works.

Idea #2: Since the software is supposed to run on (current) versions of macOS only, moving from PCRE2/C++ to a bridged Swift function would also solve the problem. I checked a few examples, and no special preparation is needed: Swift's regex engine simply matches, no matter which internal representation is used. Just the way I hoped PCRE2 would behave.
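Idea #1 could be sketched as a small pattern preprocessor (a hypothetical helper I'm naming `expand_equivalences`, not a PCRE2 facility): each precomposed letter in the needle is rewritten into an alternation of its NFC and NFD byte sequences before the pattern is compiled. The table below only covers Ü/ü as an example and assumes the needle itself arrives precomposed and contains no regex metacharacters.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of Idea #1: expand each precomposed umlaut in
// the needle into "(?:NFC|NFD)" so the compiled pattern matches both
// encodings. A real table would list every letter the application
// cares about; only Ü and ü are shown here.
std::string expand_equivalences(const std::string& needle) {
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"\xC3\x9C", "\x55\xCC\x88"},  // Ü: U+00DC vs. 'U' + U+0308
        {"\xC3\xBC", "\x75\xCC\x88"},  // ü: U+00FC vs. 'u' + U+0308
    };
    std::string out;
    for (std::size_t i = 0; i < needle.size();) {
        bool replaced = false;
        for (const auto& [nfc, nfd] : table) {
            if (needle.compare(i, nfc.size(), nfc) == 0) {
                out += "(?:" + nfc + "|" + nfd + ")";
                i += nfc.size();
                replaced = true;
                break;
            }
        }
        if (!replaced) out += needle[i++];  // copy other bytes unchanged
    }
    return out;
}
```

For example, `expand_equivalences("\xC3\x9C")` returns `"(?:\xC3\x9C|\x55\xCC\x88)"`, which is exactly the hand-written alternation that worked above; plain ASCII needles pass through unchanged. This keeps the haystacks untouched, so only the (short) needle pays the preprocessing cost.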