Different regex matching results among the ICU library, Rust and PCRE (https://regexr.com/)

Question

Here is the pattern I used:

"\w+|[^\w\s]+"

When I match the strings "abc.efg" and "戦場のヴァルキュリア3" using PCRE on https://regexr.com/, I get these results:

"abc" "." "efg" => 3 parts

"戦場のヴァルキュリア" "3" => 2 parts

That looks right.

But when I use ICU like this:

    #include <string>
    #include <unicode/regex.h>
    #include <unicode/unistr.h>

    //std::string ldata = "abc.efg";
    std::string ldata = "戦場のヴァルキュリア3";
    std::string m_regex = "\\w+|[^\\w\\s]+";
    UErrorCode         status = U_ZERO_ERROR;
    // Build the matcher from the UTF-8 pattern (no flags set).
    icu::RegexMatcher  matcher(icu::UnicodeString::fromUTF8(m_regex), 0, status);
    // Convert the UTF-8 input to UTF-16; `input` must outlive `matcher`.
    icu::UnicodeString input = icu::UnicodeString::fromUTF8(ldata);
    matcher.reset(input);

    // Count the matches; start/end are UTF-16 code unit offsets into `input`.
    int count = 0;
    while (matcher.find(status) && U_SUCCESS(status))
    {
        auto start_index = matcher.start(status);
        auto end_index   = matcher.end(status);
        count++;
    }

the input string "abc.efg" gives me:

"abc" "." "efg" => 3 parts

but the input string "戦場のヴァルキュリア3" gives me:

"戦場のヴァルキュリア3" => 1 part

When I use Rust like this:

    impl Pattern for &Regex {
        fn find_matches(&self, inside: &str) -> Result<Vec<(Offsets, bool)>> {
            if inside.is_empty() {
                return Ok(vec![((0, 0), false)]);
            }

            let mut prev = 0;
            let mut splits = Vec::with_capacity(inside.len());
            for m in self.find_iter(inside) {
                if prev != m.start() {
                    splits.push(((prev, m.start()), false));
                }
                splits.push(((m.start(), m.end()), true));
                prev = m.end();
            }
            if prev != inside.len() {
                splits.push(((prev, inside.len()), false))
            }
            Ok(splits)
        }
    }

the input string "abc.efg" gives me:

"abc" "." "efg" => 3 parts

but the input string "戦場のヴァルキュリア3" gives me:

"戦場のヴァルキュリア3" => 1 part

Why do ICU and Rust give a different result from PCRE (https://regexr.com/) when matching "戦場のヴァルキュリア3"?

It looks like "戦場のヴァルキュリア3" should be split into 2 parts.

Masklinn · Answer 1 · 2023-06-26T11:36:39.807

5

ICU and Rust's regex crate use Unicode semantics by default, which means that for \w, for example, they use a Unicode-aware definition of "word characters".

For regex it's

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}\p{Join_Control}]

For ICU it's

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d]

where, per tr44 (UAX #44), Alphabetic is:

Lowercase + Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic

CJK characters are generally categorised as "letter, other" (Lo), hence they are part of \w in a Unicode-aware classification. So is "3", which is a decimal number (Nd). Hence a single match: the whole string matches \w+ just fine.
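To make the classification concrete, here is a minimal sketch using only Rust's standard library. The helper names `is_unicode_word`, `is_ascii_word` and `all_word` are made up for illustration, and `char::is_alphanumeric` is only an approximation of the `\w` definitions quoted above (it does not cover every mark, connector punctuation character or the join controls).

```rust
/// Rough stand-in for a Unicode-aware `\w`: alphabetic, numeric, or `_`.
/// Assumption: this approximates, but is not identical to, the definitions
/// used by ICU or the regex crate.
fn is_unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// ASCII-only `\w`, i.e. `[A-Za-z0-9_]`.
fn is_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Returns true when `s` is non-empty and every character satisfies `pred`,
/// which is the condition for `\w+` to match the whole string.
fn all_word(s: &str, pred: fn(char) -> bool) -> bool {
    !s.is_empty() && s.chars().all(pred)
}
```

With these helpers, `all_word("戦場のヴァルキュリア3", is_unicode_word)` is true, so a Unicode-aware `\w+` can consume the entire string, while `all_word("戦場のヴァルキュリア3", is_ascii_word)` is false because only "3" is an ASCII word character.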

PCRE does not use Unicode semantics by default¹, hence it does not treat "戦場のヴァルキュリア" as letters.

regex supports non-Unicode matching (using either the bytes-based engines or the (?-u:) flag). I don't know whether ICU does, though I rather doubt it, as it would quite defeat the point.

If you want specifically ASCII matching, just ask for that.

Or is it that you misunderstood what \w does and thought it didn't include digits? And thus assumed that PCRE matched "戦場のヴァルキュリア" with \w+ and "3" with [^\w\s]+? Because what it does is the exact opposite: "3" matches \w+ and "戦場のヴァルキュリア" matches [^\w\s]+.
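The following sketch imitates the pattern `\w+|[^\w\s]+` with a hand-written tokenizer so that the alternative responsible for each piece is visible. It uses only the standard library; `Kind`, `tokenize` and the two predicates are hypothetical names, and the Unicode predicate is an approximation of `\w`, not the exact definition used by any of the engines discussed here.

```rust
/// Which alternative of `\w+|[^\w\s]+` produced a token.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Kind {
    Word,  // corresponds to `\w+`
    Other, // corresponds to `[^\w\s]+`
}

/// Approximate Unicode-aware `\w` (assumption: close enough for this demo).
fn is_unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// ASCII-only `\w`, i.e. `[A-Za-z0-9_]`.
fn is_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Classifies one character; whitespace matches neither alternative.
fn classify(c: char, is_word: fn(char) -> bool) -> Option<Kind> {
    if is_word(c) {
        Some(Kind::Word)
    } else if c.is_whitespace() {
        None
    } else {
        Some(Kind::Other)
    }
}

/// Splits `input` into maximal runs of characters of the same kind.
fn tokenize(input: &str, is_word: fn(char) -> bool) -> Vec<(Kind, &str)> {
    let mut tokens = Vec::new();
    let mut current: Option<(Kind, usize)> = None; // (kind, start byte offset)
    for (idx, c) in input.char_indices() {
        let kind = classify(c, is_word);
        match (current, kind) {
            // Same kind as the running token: keep extending it.
            (Some((k, _)), Some(new)) if k == new => {}
            // Kind changed or whitespace reached: close the running token.
            _ => {
                if let Some((k, start)) = current {
                    tokens.push((k, &input[start..idx]));
                }
                current = kind.map(|k| (k, idx));
            }
        }
    }
    if let Some((k, start)) = current {
        tokens.push((k, &input[start..]));
    }
    tokens
}
```

Calling `tokenize("戦場のヴァルキュリア3", is_ascii_word)` yields an `Other` token for the Japanese text followed by a `Word` token for "3", which mirrors the two parts reported by regexr. With `is_unicode_word` the same input yields a single `Word` token, mirroring the ICU and Rust results.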

1: The PCRE2_UCP option enables Unicode semantics.

edited Jun 26 '23 at 11:36

answered Jun 26 '23 at 07:04

Masklinn


Thanks very much! I get what you mean. But if I want the ICU library to give the same result as the PCRE-style regex, how should I write the pattern? – Damons Jun 26 '23 at 09:29

`[a-zA-Z0-9]` will only match ASCII letters and digits and should more or less correspond to the non-Unicode-aware `\w` (which also includes the underscore). – Masklinn Jun 26 '23 at 10:05

If you want to get the same result with ICU, you can indeed write out the character class. But ICU and Rust's regex crate both support character class set operations (PCRE2 does not). So you can write `[\w&&\p{ascii}]` to get the ASCII-only variant. You can also do `[[^\w\s]&&\p{ascii}]`. Note that the latter is limited to matching ASCII, whereas Rust's regex crate and PCRE2 will treat `[^\w\s]` as matching any individual byte not in `[\w\s]`, including invalid UTF-8. – BurntSushi5 Jun 26 '23 at 11:25

As for PCRE2, you can make it behave the same as Rust regex and ICU by enabling the `PCRE2_UCP` option (along with `PCRE2_UTF`). That will make `\w` Unicode-aware. See: https://www.pcre.org/current/doc/html/pcre2pattern.html – BurntSushi5 Jun 26 '23 at 11:27
@SvenMarnach Thanks, I guess I didn't scroll down far enough. I'll fix the answer. – Masklinn Jun 26 '23 at 11:34
@BurntSushi5 pcre's spec does seem rather less complicated (so either a lot less or a lot more permissive, looks like less as it's missing M, Pc, and Join_Control) than icu and regex though: "any character that matches `\p{L}` or `\p{N}`, plus underscore" – Masklinn Jun 26 '23 at 11:43
Yes, that's correct. Not all regex engines that have a Unicode-aware `\w` define it the same way. There's a fair bit of subtle variance unfortunately. – BurntSushi5 Jun 26 '23 at 13:35
