multi-byte characters in libc regcomp and regexec

Question

Is there any way to get libc6's regexp functions regcomp and regexec to work properly with multi-byte characters?

For instance, if my pattern is the UTF-8 characters 猫机+猫, matching it against the UTF-8 encoded string 猫机机机猫 fails, where it should succeed.

I think this is because the byte representation of the character 机 is \xe6\x9c\xba, so the + matches one or more of the single byte \xba rather than the whole character. I can make this instance work by putting parentheses around each multi-byte character in the pattern, but since this is for an application I can't require users to do this.
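To make the byte-level picture concrete, here is a minimal C sketch. The helper name `hex_bytes` is made up for illustration; it only dumps the bytes of a string as hex and does not touch the regex functions at all.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s into out as space-separated
   two-digit hex values. Stops early if out would overflow. */
void hex_bytes(const char *s, char *out, size_t outsz)
{
    assert(s != NULL && out != NULL && outsz > 0);
    size_t used = 0;
    out[0] = '\0';
    /* Each step writes at most 3 characters plus the terminating NUL. */
    for (size_t i = 0; s[i] != '\0' && used + 4 <= outsz; i++) {
        used += (size_t)snprintf(out + used, outsz - used, "%s%02x",
                                 i ? " " : "", (unsigned)(unsigned char)s[i]);
    }
}
```

Called on the UTF-8 string 机+, this fills the buffer with `e6 9c ba 2b`. A matcher that works byte by byte sees the `+` (0x2b) directly after the byte 0xba, so the quantifier binds to that one byte instead of to the three-byte character.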

Is there a way to flag a pattern or string to match as containing utf8 characters? Perhaps telling libc to store the pattern as wchar instead of char?

I can do that, but I am hoping for a solution that doesn't require the user to change the pattern in such a way. Thank you though! I edited the question to reflect your comment. — bill_e, Jan 23 '15 at 17:59
Why not just use codepoints `\x{nnnnnnn}`? That is, if the regex engine supports Unicode. Usually the regex and the target string should use the same encoding, but it's not a good idea to use literal Unicode chars within a regex string. If the engine supports it, though, it reads the string in character units, not byte units. — , Jan 29 '15 at 07:24
No, these options don't work because I'm hoping to use this within an application that shouldn't require users to alter their regexps. Does this mean there is no support for multi-byte chars in libc? Is there another extremely common C library I could use instead? — bill_e, Feb 02 '15 at 17:14

Regular Jo · Answer 1 · 2015-02-21T16:53:17.907

Can you use a regex to build your regex? Here's a JavaScript example (though I know you aren't using JS):

function Examp () {
  // Wrap every character that is directly followed by + or * in parentheses.
  var uString = "猫机+猫+猫ymg+sah猫";
  var plussed = uString.replace(/(.)(?=[+*])/g, "($1)");
  console.log("Starting with string: " + uString + "\r\n" + "Result: " + plussed);
  // Same idea, but keep a preceding backslash together with the escaped character.
  uString = "猫机+猫*猫ymg+s\\a+I+h猫";
  plussed = uString.replace(/(\\?.)(?=[+*])/g, "($1)");
  console.log("You can even take this a step further and account for a character being escaped, if that's a consideration.");
  console.log("Starting with string: " + uString + "\r\n" + "Result: " + plussed);
}

<input type="button" value="Run" onclick="Examp()" />
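Since the question is about C, the same rewriting idea can be sketched there as well. This is an illustration only: the names `utf8_len` and `wrap_multibyte` are hypothetical, the input is assumed to be valid UTF-8, and the sketch ignores bracket expressions and backslash escapes, which a real implementation would have to skip over.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence that starts with lead byte c.
   Returns 1 for ASCII and for stray continuation bytes. */
static size_t utf8_len(unsigned char c)
{
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: returns a malloc'd copy of an ERE pattern in which
   every multi-byte character directly followed by a quantifier is wrapped
   in parentheses. The caller frees the result; NULL on allocation failure. */
char *wrap_multibyte(const char *pat)
{
    assert(pat != NULL);
    size_t n = strlen(pat);
    /* A wrap adds 2 bytes to a sequence of at least 2, so 2n+1 is enough. */
    char *out = malloc(2 * n + 1);
    if (out == NULL) return NULL;

    size_t i = 0, o = 0;
    while (i < n) {
        size_t len = utf8_len((unsigned char)pat[i]);
        if (len > n - i) len = n - i;      /* truncated sequence: copy the rest */
        char next = pat[i + len];          /* NUL when at the end of the pattern */
        int quantified = len > 1 && next != '\0' && strchr("+*?{", next) != NULL;

        if (quantified) out[o++] = '(';
        memcpy(out + o, pat + i, len);
        o += len;
        if (quantified) out[o++] = ')';
        i += len;
    }
    out[o] = '\0';
    return out;
}
```

For the pattern 猫机+猫 this produces 猫(机)+猫, which is the parenthesized form the question says already works. The application can run the rewrite on the user's pattern before handing it to regcomp, so users never see it.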

Champignac · Accepted Answer · 2021-07-28T15:21:41.097

According to its manual page, glibc implements POSIX regexps. There is no Unicode support in POSIX regexps per se. See this answer for an excerpt of the standard that clarifies this point. This means that you can also forget about UTF-8. It also means that, whatever locale environment you're in, multi-byte characters won't fit.

The post I've mentioned (as well as this one) suggests you use a Unicode-aware regexp library, such as PCRE. If you're interested, PCRE provides a wrapper that mimics the POSIX interface, with the addition of a non-standard REG_UTF flag. You won't have to rewrite your code, except for the #include directive and the addition of REG_UTF when compiling the pattern.

Hope this covers your needs.

score 0 · Answer 3 · edited May 23 '17 at 11:59

Is there a way to flag a pattern or string to match as containing utf8 characters?

I suspect that the LC_CTYPE environment variable (or other related locale settings) is the way to make regcomp/regexec understand your encoding.

At least, the grep program seems to take it into account, as shown in https://stackoverflow.com/a/40809461/94687; I haven't tested this with the regcomp function.
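A way to try this out is sketched below. It is a test harness under assumptions, not a confirmed fix: the function name `utf8_regex_match` is hypothetical, the locale names `C.UTF-8` and `en_US.UTF-8` are guesses that may not be installed on a given system, and whether the match then succeeds depends on the libc's regex implementation.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Try to switch LC_CTYPE to some UTF-8 locale. The candidate names are
   assumptions; "" means "whatever the environment (LC_ALL, LC_CTYPE, LANG) says". */
static int set_utf8_ctype(void)
{
    static const char *const names[] = { "", "C.UTF-8", "en_US.UTF-8" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL &&
            strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
            return 1;
    }
    setlocale(LC_CTYPE, "C");   /* nothing suitable: restore the default */
    return 0;
}

/* Hypothetical helper: returns 1 on match, 0 on no match, and -1 if no
   UTF-8 locale could be selected or the pattern failed to compile. */
int utf8_regex_match(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    if (!set_utf8_ctype()) return -1;

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling `utf8_regex_match("猫机+猫", "猫机机机猫")` from a program that has not otherwise called setlocale shows whether the locale makes the difference: a C program starts in the "C" locale, so the locale environment variables have no effect on regcomp until setlocale is called.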

answered Nov 26 '16 at 23:35

imz -- Ivan Zakharyaschev
