10

Hi, I am using the standard regex library (regcomp, regexec, ...). Now, on demand, I need to add Unicode support to my regular expression code.

Does the standard regex library support Unicode, or non-ASCII characters in general? I researched this on the web and I think it does not.

My project is resource-constrained, so I don't want to pull in large libraries for this (ICU or Boost.Regex).

Any help would be appreciated.

sehe
iyasar
    Not that I know of, but the Plan 9 regex library is; a Unix port is available at http://swtch.com/plan9port/unix/ under `libregexp9` – Dave Jan 04 '12 at 14:03

3 Answers

9

It looks like POSIX regex works properly with a UTF-8 locale. I just wrote a simple test (see below) and used it to match a string containing Cyrillic characters against the regex "[[:alpha:]]" (for example), and everything works just fine.

Note: the main thing you must remember is that the regex functions are locale-dependent, so you must call setlocale() before using them.

#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv) {
  int ret;
  regex_t reg;
  regmatch_t matches[10];

  if (argc != 3) {
    fprintf(stderr, "Usage: %s regex string\n", argv[0]);
    return 1;
  }

  setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

  /* Compile as a basic POSIX regex; pass REG_EXTENDED for ERE syntax */
  if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
    char buf[256];
    regerror(ret, &reg, buf, sizeof(buf));
    fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
    return 1;
  }

  if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
    int i;
    char buf[256];
    int size;
    for (i = 0; i < sizeof(matches) / sizeof(regmatch_t); i++) {
      if (matches[i].rm_so == -1) break;
      size = matches[i].rm_eo - matches[i].rm_so;
      if (size >= sizeof(buf)) {
        fprintf(stderr, "match (%d-%d) is too long (%d)\n",
                matches[i].rm_so, matches[i].rm_eo, size);
        continue;
      }
      /* strncpy() below copies exactly 'size' bytes and does not add a
         terminator, so place the terminator here in advance. */
      buf[size] = '\0';
      printf("%d: %d-%d: '%s'\n", i, matches[i].rm_so, matches[i].rm_eo,
             strncpy(buf, argv[2] + matches[i].rm_so, size));

    }
  }

  regfree(&reg);
  return 0;
}

Usage example:

$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$

The matching result is two bytes long because a Cyrillic letter takes two bytes in UTF-8.
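
For instance, a quick way to see that two-byte encoding directly (a minimal sketch; it assumes the source file itself is saved as UTF-8):

#include <stdio.h>
#include <string.h>

int main(void) {
  /* "ф" (U+0444) is encoded as the two bytes 0xD1 0x84 in UTF-8 */
  printf("%zu\n", strlen("ф"));  /* prints 2 */
  return 0;
}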

praetorian droid
8

Basically, POSIX regexes are not Unicode-aware. You can try to use them on Unicode characters, but there may be problems with glyphs that have multiple encodings, and other such issues that Unicode-aware libraries handle for you.

From the standard, IEEE Std 1003.1-2008:

Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.
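
To see what "based on the bit pattern" means in practice, here is a minimal sketch (it assumes a UTF-8 locale such as en_US.UTF-8 is installed): the precomposed "é" (U+00E9) and the decomposed "e" plus combining acute accent (U+0065 U+0301) render identically, but a pattern built from the precomposed bytes only matches the first form.

#include <locale.h>
#include <regex.h>
#include <stdio.h>

int main(void) {
  regex_t re;
  setlocale(LC_ALL, "en_US.UTF-8");  /* assumed to be available on the system */
  regcomp(&re, "\xc3\xa9", 0);       /* pattern: UTF-8 bytes of precomposed "é" */
  printf("precomposed: %s\n",
         regexec(&re, "\xc3\xa9", 0, NULL, 0) == 0 ? "match" : "no match");
  printf("decomposed:  %s\n",
         regexec(&re, "e\xcc\x81", 0, NULL, 0) == 0 ? "match" : "no match");
  regfree(&re);
  return 0;
}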

Maybe libpcre would work for you? It's slightly heavier than POSIX regexes, but I would think it's still lighter than ICU or Boost.Regex.
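
If you do go that route, here is a rough sketch (it uses the classic PCRE1 API and assumes your libpcre was built with UTF-8 support; link with -lpcre). PCRE_UTF8 tells it that the pattern and subject are UTF-8, and PCRE_UCP makes classes such as \p{L} use Unicode character properties:

#include <pcre.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  const char *err;
  int erroff;
  int ovector[30];                     /* room for 10 capture pairs */
  const char *subject = " 359 фыва";   /* source file assumed to be UTF-8 */

  pcre *re = pcre_compile("\\p{L}+", PCRE_UTF8 | PCRE_UCP,
                          &err, &erroff, NULL);
  if (re == NULL) {
    fprintf(stderr, "pcre_compile failed at %d: %s\n", erroff, err);
    return 1;
  }

  int rc = pcre_exec(re, NULL, subject, (int)strlen(subject), 0, 0,
                     ovector, 30);
  if (rc >= 0)                         /* negative means no match or error */
    printf("matched bytes %d-%d: '%.*s'\n", ovector[0], ovector[1],
           ovector[1] - ovector[0], subject + ovector[0]);

  pcre_free(re);
  return 0;
}

Note that \p{...} support requires PCRE to have been built with Unicode property tables (--enable-unicode-properties); most distribution packages include it, but it is worth checking.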

cha0site
0

If you really mean "Standard", i.e. std::regex from C++11, then all you need to do is switch to std::wregex (and std::wstring of course).
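
A minimal sketch of that route (C++11; note that how much of Unicode a wchar_t can represent is platform-dependent: it is 32 bits wide on Linux but holds UTF-16 code units on Windows):

#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main() {
  std::locale::global(std::locale(""));  // use the system locale instead of "C"
  std::wcout.imbue(std::locale());       // convert wide output with that locale

  std::wstring text = L" 359 фыва";      // source file assumed to be UTF-8
  std::wregex  re(L"[[:alpha:]]+");      // traits pick up the global locale set above

  std::wsmatch m;
  if (std::regex_search(text, m, re))
    std::wcout << L"matched '" << m.str() << L"' at position "
               << m.position(0) << L"\n";
  return 0;
}

You would still have to convert UTF-8 input (e.g. from argv) to wide strings first, for instance with mbstowcs() or std::wstring_convert.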

MSalters