regcomp (from glibc) is a POSIX function for compiling regular expressions:
int regcomp(regex_t *restrict preg, const char *restrict pattern,
            int cflags);
Some regular-expression constructs depend on the notion of a single character, for example the bracket expression [abc].
If a multibyte encoding is in use and the expression contains a multibyte character, the result depends on whether the pattern is interpreted as a sequence of bytes or as a sequence of multibyte characters.
Here I illustrate this with grep (which need not behave the same as the C function regcomp in this respect):
$ { echo Г; echo Д; } | egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=C egrep '[Д]'
Г
Д
$
LANG supplies the default for any locale category whose specific variable is not set, so the question is: which of these categories affects regcomp's idea of the encoding?
$ locale
LANG=ru_RU.utf8
LC_CTYPE="ru_RU.utf8"
LC_NUMERIC="ru_RU.utf8"
LC_TIME="ru_RU.utf8"
LC_COLLATE="ru_RU.utf8"
LC_MONETARY="ru_RU.utf8"
LC_MESSAGES=POSIX
LC_PAPER="ru_RU.utf8"
LC_NAME="ru_RU.utf8"
LC_ADDRESS="ru_RU.utf8"
LC_TELEPHONE="ru_RU.utf8"
LC_MEASUREMENT="ru_RU.utf8"
LC_IDENTIFICATION="ru_RU.utf8"
LC_ALL=
$