Here is the pattern I used:
"\w+|[^\w\s]+"
When I match the strings "abc.efg" and "戦場のヴァルキュリア3" using PCRE on https://regexr.com/, I get these results:
"abc" "." "efg" => 3 parts
"戦場のヴァルキュリア" "3" => 2 parts
That looks right.
But when I use ICU like this:
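#include <string>
#include <unicode/regex.h>
#include <unicode/unistr.h>

//std::string ldata = "abc.efg";
std::string ldata = "戦場のヴァルキュリア3";
std::string m_regex = "\\w+|[^\\w\\s]+";
UErrorCode status = U_ZERO_ERROR;
// Compile the pattern (converted explicitly from UTF-8) with no flags set.
icu::RegexMatcher matcher(icu::UnicodeString::fromUTF8(m_regex), 0, status);
// Convert the UTF-8 input to a UnicodeString (UTF-16) for matching.
icu::StringPiece data(ldata.data(), static_cast<int32_t>(ldata.length()));
icu::UnicodeString input = icu::UnicodeString::fromUTF8(data);
matcher.reset(input);
int count = 0;
while (matcher.find(status) && U_SUCCESS(status))
{
    // Offsets are UTF-16 code unit indices into input.
    auto start_index = matcher.start(status);
    auto end_index = matcher.end(status);
    count++;
}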
The input string "abc.efg" gives me:
"abc" "." "efg" => 3 parts
but the input string "戦場のヴァルキュリア3" gives me:
"戦場のヴァルキュリア3" => 1 part
When I use Rust like this:
impl Pattern for &Regex {
    fn find_matches(&self, inside: &str) -> Result<Vec<(Offsets, bool)>> {
        if inside.is_empty() {
            return Ok(vec![((0, 0), false)]);
        }
        let mut prev = 0;
        let mut splits = Vec::with_capacity(inside.len());
        for m in self.find_iter(inside) {
            // Record the unmatched gap before this match, if any.
            if prev != m.start() {
                splits.push(((prev, m.start()), false));
            }
            // Record the match itself.
            splits.push(((m.start(), m.end()), true));
            prev = m.end();
        }
        // Record any unmatched tail after the last match.
        if prev != inside.len() {
            splits.push(((prev, inside.len()), false));
        }
        Ok(splits)
    }
}
The input string "abc.efg" gives me:
"abc" "." "efg" => 3 parts
but the input string "戦場のヴァルキュリア3" gives me:
"戦場のヴァルキュリア3" => 1 part
Why do ICU and Rust give a different result from PCRE (https://regexr.com/) when matching "戦場のヴァルキュリア3"?
It looks like "戦場のヴァルキュリア3" should be split into 2 parts.
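One guess I have not confirmed is that the engines define \w differently. If regexr's PCRE mode treats \w and \s as ASCII-only, the Japanese characters would fall into [^\w\s]+, while ICU and Rust would count them as word characters and match everything with \w+. To make the 2-part split I expect concrete, here is a minimal sketch using only the standard library. It emulates the pattern under that ASCII-only assumption; classify and split_ascii_word are hypothetical helper names and are not part of ICU or PCRE.

```cpp
#include <string>
#include <vector>

// Character classes under the ASSUMPTION that \w and \s are ASCII-only.
enum class Kind { Word, Space, Other };

// Classify one byte of a UTF-8 string. Every byte of a multi-byte UTF-8
// sequence is >= 0x80, so it is neither an ASCII word character nor
// ASCII whitespace and ends up in Kind::Other.
Kind classify(unsigned char c) {
    if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
        (c >= 'a' && c <= 'z') || c == '_') {
        return Kind::Word;
    }
    if (c == ' ' || (c >= '\t' && c <= '\r')) {  // space, \t \n \v \f \r
        return Kind::Space;
    }
    return Kind::Other;
}

// Hypothetical helper: emulates \w+|[^\w\s]+ with ASCII-only classes by
// cutting the input into maximal runs of Word bytes or Other bytes and
// skipping whitespace runs.
std::vector<std::string> split_ascii_word(const std::string& s) {
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < s.size()) {
        const Kind k = classify(static_cast<unsigned char>(s[i]));
        std::size_t j = i + 1;
        while (j < s.size() &&
               classify(static_cast<unsigned char>(s[j])) == k) {
            ++j;
        }
        if (k != Kind::Space) {
            parts.push_back(s.substr(i, j - i));
        }
        i = j;
    }
    return parts;
}
```

Under that assumption, split_ascii_word("abc.efg") yields "abc", ".", "efg", and split_ascii_word("戦場のヴァルキュリア3") yields two parts, the Japanese run and "3", which is the split regexr shows.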