Forcing UTF-8, but just for regexec()

Question

My C program reads and writes data exclusively in UTF-8. It also uses the POSIX regex functions. I want these functions, regexec(3) in particular, always to match UTF-8 text so that [:alpha:] matches alphabetic characters including non-ASCII ones encoded in UTF-8, e.g., 'é' (Latin small letter 'e' with acute accent). regexec(3) is locale-sensitive and, if the current locale is indeed UTF-8, this works the way I want.

However, I've also read that forcing UTF-8 for an entire program is the wrong thing to do. (Among other things, I assume system-generated errors will not be in the user's preferred locale.)

So what if I force UTF-8 only for the call to regexec(3) and then put it back the way it was? For example, given this:

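#include <locale.h>
#include <stdio.h>

void setlocale_utf8( void ) {
  // locale names are system-specific; try two spellings of a bare UTF-8 locale
  if ( !setlocale( LC_CTYPE, "UTF-8" ) && !setlocale( LC_CTYPE, "UTF8" ) ) {
    fprintf( stderr, "setlocale() failed\n" );
    // do something
  }
}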

do this:

setlocale_utf8();
int err_code = regexec( re, utf8_str_to_match, re->re_nsub + 1, match, 0 );
setlocale( LC_CTYPE, "" ); // put locale back to the user's preferred locale

Is this an OK way to ensure regexec(3) is always matching using UTF-8? Is there a better way?

Note: perhaps it is better, when done, to change the locale to the _previous_ setting (put it back the way it was), rather than the user's preferred locale — chux - Reinstate Monica, Jan 16 '17 at 17:08
@chux If I never set the locale at all, then the previous setting will be the "C" locale since, according to the setlocale(3) man page: "By default, C programs start in the "C" locale." — Paul J. Lucas, Jan 16 '17 at 17:15
True, but this function could operate correctly without relying on the condition that "prior code never set the locale at all". It simply needs to set the locale, do its operations, and then _restore_. You expressed concern about "forcing UTF8 for an entire program". By restoring, rather than setting to the default, the code avoids "forcing an entire program to not elsewhere change locale". — chux - Reinstate Monica, Jan 16 '17 at 17:50
@chux: OK, if I restore, _then_ is this an OK thing to do to solve my problem? — Paul J. Lucas, Jan 16 '17 at 18:38
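
For reference, here is a minimal sketch of the save-and-restore variant suggested in the comments above. The helper names (utf8_ctype_push, utf8_ctype_pop, utf8_regex_match) and the list of candidate locale names are assumptions made for illustration; they are not part of any library, and which UTF-8 locale names exist varies by system. The sketch queries the current LC_CTYPE with setlocale( LC_CTYPE, NULL ), copies it, switches to a UTF-8 locale, and restores the copy afterwards. It also compiles the pattern inside the same window, on the assumption that some implementations consult the locale during regcomp(3) as well as regexec(3).

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup() and <regex.h> under strict modes */
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: switch LC_CTYPE to a UTF-8 locale.
 * Returns a malloc'd copy of the previous LC_CTYPE name (to pass to
 * utf8_ctype_pop()), or NULL if nothing was changed. */
char *utf8_ctype_push( void ) {
  /* Assumed candidate names; which ones exist varies by system. */
  static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "UTF-8" };

  /* A NULL locale argument only queries. Copy the result, since a later
   * setlocale() call may overwrite the returned string. */
  const char *cur = setlocale( LC_CTYPE, NULL );
  char *saved = cur ? strdup( cur ) : NULL;
  if ( !saved )
    return NULL;

  for ( size_t i = 0; i < sizeof names / sizeof names[0]; ++i )
    if ( setlocale( LC_CTYPE, names[i] ) )
      return saved;

  free( saved );  /* no UTF-8 locale found; the locale is unchanged */
  return NULL;
}

/* Hypothetical helper: restore the LC_CTYPE saved by utf8_ctype_push(). */
void utf8_ctype_pop( char *saved ) {
  setlocale( LC_CTYPE, saved );
  free( saved );
}

/* Returns 1 on match, 0 otherwise, or -1 if no UTF-8 locale is available
 * or the pattern fails to compile. */
int utf8_regex_match( const char *pattern, const char *text ) {
  char *saved = utf8_ctype_push();
  if ( !saved )
    return -1;

  regex_t re;
  int rv = -1;
  if ( regcomp( &re, pattern, REG_EXTENDED | REG_NOSUB ) == 0 ) {
    rv = regexec( &re, text, 0, NULL, 0 ) == 0 ? 1 : 0;
    regfree( &re );
  }
  utf8_ctype_pop( saved );  /* put LC_CTYPE back the way it was */
  return rv;
}
```

A call such as utf8_regex_match( "^[[:alpha:]]+$", utf8_str_to_match ) would then leave LC_CTYPE exactly as it found it, whether or not the program had set a locale earlier. Note that setlocale(3) changes process-wide state, so this pattern is only safe if no other thread depends on the locale while it is switched.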

0 Answers