Using regular expressions with Unicode strings in C

Question

I'm currently using regular expressions on Unicode strings, but I only need to match ASCII characters, so I effectively ignore all non-ASCII characters. So far the functions in regex.h work fine (I'm on Linux, so the encoding is UTF-8). Can someone confirm whether it's really OK to do this? Or do I need a Unicode-aware regex library (like ICU)?

UTF-8 encodes the non-ASCII characters in a way that they will *never* match ASCII characters, so if that's all you're searching or matching on it should be safe. Of course now that I've said so, someone will come along to tell me I'm wrong - I welcome a counter-example. — Mark Ransom, Dec 12 '16 at 04:18

Schwern · Answer 1 · 2016-12-12T05:21:16.613

UTF-8 is a variable-length encoding; some characters are 1 byte, some 2, others 3 or 4. You know how many bytes to read from the leading bits of each character's first byte: 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes, 11110 for 4 bytes. Every following (continuation) byte starts with 10.
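The prefix rule above can be sketched in a few lines of C. The helper names utf8_seq_len and utf8_char_count are made up for illustration, and the code assumes the input is well-formed UTF-8; it looks only at the leading bits and does not validate continuation bytes or reject overlong encodings.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by `lead`, judged
 * only from its leading bits. Returns 0 for a byte that cannot start a
 * sequence (a 10xxxxxx continuation byte, or 11111xxx).
 * Hypothetical helper: it does not validate the bytes that follow. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}

/* Counts characters (not bytes) in a NUL-terminated string that is
 * assumed to be well-formed UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        if (len == 0)
            len = 1; /* unexpected byte: step over it as one unit */
        /* Advance up to len bytes, but never past the terminator. */
        for (size_t i = 0; i < len && *s != '\0'; i++)
            s++;
        n++;
    }
    return n;
}
```

Called on the UTF-8 bytes of "aèb" (4 bytes), utf8_char_count should report 3 characters, which is the gap between byte length and character length that the rest of this answer is about.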

If you try to read a UTF-8 string as ASCII, or any other fixed-width encoding, things will go very wrong... unless that UTF-8 string contains nothing but 1 byte characters in which case it matches ASCII.

However, since no multi-byte UTF-8 sequence contains a null byte, and none of the lead or continuation bytes can be confused with ASCII (they all have the high bit set), and if you really are only matching ASCII, you might be able to get away with it... but I wouldn't recommend it: there are much better regex options than POSIX, they're easy to use, and why leave a hidden encoding bomb in your code for some sucker to deal with later? (Note: that sucker may be you.)
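For reference, here is a minimal sketch of the "get away with it" case: an ASCII-only pattern run over UTF-8 text with plain regex.h. The function name find_ascii is hypothetical, and the sketch assumes the program has not called setlocale, so the default "C" locale is in effect and the matcher works byte by byte.

```c
#include <regex.h>
#include <stddef.h>

/* Searches `text` for the POSIX extended regex `pattern`.
 * On a match, stores the BYTE offsets of the match in *start and *end
 * and returns 1. Returns 0 when nothing matches and -1 when the
 * pattern does not compile. Hypothetical helper for illustration. */
int find_ascii(const char *pattern, const char *text,
               size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, text, 1, m, 0);
    regfree(&re);

    if (rc != 0)
        return 0;

    *start = (size_t)m[0].rm_so;
    *end = (size_t)m[0].rm_eo;
    return 1;
}
```

Searching the UTF-8 text "naïve café id=42 ok" for id=[0-9]+ should find "id=42". Note that the reported offsets count bytes, not characters, so they sit two further along than a character count would suggest (ï and é take two bytes each).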

Instead, use a Unicode-aware regex library like Perl Compatible Regular Expressions (PCRE). PCRE becomes Unicode aware when you pass the PCRE2_UTF flag to pcre2_compile. PCRE regex syntax is more powerful and more widely understood than POSIX regexes, and PCRE has more features. PCRE is also what GLib (the "Gnome Lib") builds its regex support on, and GLib itself provides a feast of very handy C functions.

I'm just a newbie when it comes to Unicode. I'd prefer not to use an external library if possible, which is why I wondered whether I could get away with this. Anyway, I will try PCRE, thanks for your advice. — AtheS21, Dec 12 '16 at 18:31
@AtheS21 Standard C doesn't have much in the way of Unicode support. It doesn't have much in the way of a lot of things. Rather than pulling in bits one by one, I'd recommend looking into Gnome Lib or some other 3rd party library that supplies all the missing pieces. — Schwern, Dec 12 '16 at 20:05

Remo.D · Answer 2 · 2020-11-15T19:04:43.270

You need to be careful about your patterns and about the text you're going to match.

As an example, given the expression a.b:

"axb" matches 
"aèb" does NOT match

The reason is that è is two bytes long when UTF-8 encoded (0xC3 0xA8), but a matcher that works byte by byte lets . match only the first one.

So as long as you only match sequences of ASCII characters, you're safe. If you mix ASCII and non-ASCII characters, you're in trouble.
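The a.b example can be reproduced with regex.h along these lines. This is a sketch under one important assumption: the program never calls setlocale, so matching happens in the default "C" locale, one byte at a time. Some implementations decode multi-byte characters once a UTF-8 locale is selected, in which case the result may differ. The name posix_matches is invented for the example.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the POSIX extended regex `pattern` matches anywhere in
 * `text`, 0 if it does not, and -1 if the pattern does not compile.
 * Hypothetical helper for illustration. */
int posix_matches(const char *pattern, const char *text)
{
    regex_t re;

    /* REG_NOSUB: we only need a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    return rc == 0;
}
```

Under that assumption, posix_matches("a.b", "axb") should succeed, while the same pattern against the UTF-8 bytes of "aèb" should fail; a..b, with one dot per byte of è, should match instead.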

You can try to match a single UTF-8 encoded "character" with something like:

([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xF7]...|.)

but this assumes that the text is encoded correctly (and, frankly, I never tried it).
