What constitutes one character for regcomp? Which multibyte encoding determines this?

Question

regcomp (from glibc) is a POSIX function for compiling regular expressions.

     int regcomp(regex_t *restrict preg, const char *restrict pattern, int cflags);

Some regular-expression constructs depend on the notion of a single character, for example the bracket expression [abc].

If a multibyte encoding is in use and the expression contains a multibyte letter, the result depends on whether the pattern is treated as a sequence of bytes or as a sequence of multibyte characters.

Here I illustrate this with grep (which need not behave the same as the C function regcomp in this respect):

$ { echo Г; echo Д; } | egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=C egrep '[Д]'
Г
Д
$

LANG supplies the default for every locale category whose specific LC_* variable is not set, so the question is: which of these variables affects regcomp's idea of the encoding?

$ locale
LANG=ru_RU.utf8
LC_CTYPE="ru_RU.utf8"
LC_NUMERIC="ru_RU.utf8"
LC_TIME="ru_RU.utf8"
LC_COLLATE="ru_RU.utf8"
LC_MONETARY="ru_RU.utf8"
LC_MESSAGES=POSIX
LC_PAPER="ru_RU.utf8"
LC_NAME="ru_RU.utf8"
LC_ADDRESS="ru_RU.utf8"
LC_TELEPHONE="ru_RU.utf8"
LC_MEASUREMENT="ru_RU.utf8"
LC_IDENTIFICATION="ru_RU.utf8"
LC_ALL=
$
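
To make the question concrete on the C side: a C program starts in the "C" locale, and the environment variables listed above only take effect once the program calls setlocale. The sketch below is one way to see what LC_CTYPE resolves to. The struct ctype_info and the helper query_ctype are hypothetical names introduced here; setlocale, nl_langinfo(CODESET) and MB_CUR_MAX are the standard interfaces. It is an illustration under these assumptions, not a statement of how regcomp is implemented.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container for what the LC_CTYPE category resolved to. */
struct ctype_info {
    const char *name;     /* locale name returned by setlocale()        */
    const char *codeset;  /* character encoding of the LC_CTYPE locale  */
    size_t mb_cur_max;    /* maximum bytes per character in that locale */
};

/* Hypothetical helper: select LC_CTYPE by name and report the result.
   An empty name ("") means "take the value from the environment", where
   LC_ALL overrides LC_CTYPE, which in turn overrides LANG.
   Returns 0 on success, -1 if the requested locale is not available.
   The strings in *info belong to the C library and may be overwritten
   by later setlocale() or nl_langinfo() calls. */
static int query_ctype(const char *locale_name, struct ctype_info *info)
{
    const char *resolved;

    assert(locale_name != NULL && info != NULL);

    resolved = setlocale(LC_CTYPE, locale_name);
    if (resolved == NULL)
        return -1;

    info->name = resolved;
    info->codeset = nl_langinfo(CODESET);
    info->mb_cur_max = (size_t)MB_CUR_MAX;
    return 0;
}
```

Calling query_ctype("", &info) from a shell configured like the one above would be one way to check which encoding the program ends up with before it calls regcomp.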

score 0 · Answer 1 · answered Nov 25 '16 at 16:47

As for grep (which need not behave the same as regcomp), it seems to honor LC_CTYPE when making this decision:

$ { echo Г; echo Д; } | LANG=en_US.utf8 egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_COLLATE=C egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_CTYPE=C egrep '[Д]'
Г
Д
$
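
The same experiment could be run against regcomp itself. The following is a minimal sketch, not a confirmed description of glibc's behavior: the helper matches_in_locale is a hypothetical name, and the locale name passed to it (for example en_US.utf8) is assumed to be installed on the system. It resets the program to the "C" locale, switches exactly one category, then compiles and runs the pattern.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: switch one locale category, then compile `pattern`
   and test it against `text`.
   Returns 1 on a match, 0 on no match, and -1 if the locale is not
   available or the pattern does not compile. */
static int matches_in_locale(int category, const char *locale_name,
                             const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(locale_name != NULL && pattern != NULL && text != NULL);

    /* Reset everything to "C" so that earlier calls do not leak in. */
    if (setlocale(LC_ALL, "C") == NULL)
        return -1;
    /* Change only the category under test, e.g. LC_CTYPE or LC_COLLATE. */
    if (setlocale(category, locale_name) == NULL)
        return -1;

    /* The pattern is compiled under the locale selected above. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Comparing matches_in_locale(LC_CTYPE, "en_US.utf8", "[Д]", "Г") with matches_in_locale(LC_COLLATE, "en_US.utf8", "[Д]", "Г") mirrors the grep commands above: if only the LC_CTYPE call stops Г from matching, regcomp takes its notion of a character from LC_CTYPE as well.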
