My C program reads and writes data exclusively in UTF-8. It also uses the POSIX regex functions. I want these functions, in particular regexec(3), always to match UTF-8 text so that [:alpha:] will match alphabetic characters including non-ASCII ones encoded in UTF-8, e.g., 'é' (Latin small letter 'e' with acute accent). regexec(3) is locale-sensitive and, if the current locale is indeed UTF-8, this works the way I want.
However, I've also read that forcing UTF-8 for an entire program is the wrong thing to do. (Among other things, I assume system-generated errors will not be in the user's preferred locale.)
So what if I force UTF-8 only for the call to regexec(3) and then put it back the way it was? For example, given this:
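#include <locale.h>
#include <stdio.h>

void setlocale_utf8( void ) {
    // Locale names are platform-specific: a bare "UTF-8" is accepted on
    // some systems, while others need a name such as "C.UTF-8".
    if ( !setlocale( LC_CTYPE, "UTF-8" ) && !setlocale( LC_CTYPE, "UTF8" ) ) {
        fprintf( stderr, "setlocale() failed\n" );
        // do something
    }
}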
do this:
setlocale_utf8();
int err_code = regexec( re, utf8_str_to_match, re->re_nsub + 1, match, 0 );
setlocale( LC_CTYPE, "" ); // put locale back to the user's preferred locale
Is this an OK way to ensure regexec(3) is always matching using UTF-8? Is there a better way?
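
For comparison, here is a minimal sketch of a per-thread alternative built on the POSIX.1-2008 functions newlocale(3) and uselocale(3), which leave the global locale untouched. It wraps regcomp(3) as well as regexec(3), because POSIX leaves the result undefined when the locale at regexec(3) time differs from the one in effect when the expression was compiled. The helper name utf8_regex_match, its return convention, and the list of candidate locale names are assumptions for illustration only; on some systems (macOS, for example) the declarations may live in <xlocale.h> rather than <locale.h>.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  // for newlocale(), uselocale(), freelocale()
#endif
#include <locale.h>
#include <regex.h>
#include <stddef.h>

// Hypothetical helper: returns 1 on match, 0 on no match, and -1 if no
// UTF-8 locale could be created or the pattern failed to compile.
int utf8_regex_match( const char *pattern, const char *utf8_str ) {
    // Candidate names are assumptions; installed locales vary by system.
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "UTF-8" };
    locale_t utf8 = (locale_t)0;
    for ( size_t i = 0; i < sizeof names / sizeof names[0] && utf8 == (locale_t)0; ++i )
        utf8 = newlocale( LC_CTYPE_MASK, names[i], (locale_t)0 );
    if ( utf8 == (locale_t)0 )
        return -1;

    // Switch only the calling thread's locale. Categories other than
    // LC_CTYPE are the POSIX defaults while this locale is in effect.
    locale_t prev = uselocale( utf8 );

    regex_t re;
    int rv = regcomp( &re, pattern, REG_EXTENDED | REG_NOSUB );
    if ( rv == 0 ) {
        // Compile and execute under the same locale.
        rv = regexec( &re, utf8_str, 0, NULL, 0 );
        regfree( &re );
    }
    else {
        rv = -1;  // compile error
    }

    uselocale( prev );   // restore whatever was active before
    freelocale( utf8 );

    if ( rv == 0 )
        return 1;
    return rv == REG_NOMATCH ? 0 : -1;
}
```

A call such as utf8_regex_match( "^[[:alpha:]]+$", "été" ) would be expected to return 1 wherever one of the candidate locales exists. Unlike the setlocale(3) approach, this does not disturb other threads, and it restores the previously active locale rather than the one named by the environment.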