10

Hi, I am using the standard regex library (regcomp, regexec, ...). Now, on demand, I need to add Unicode support to my regular expression code.

Does the standard regex library support Unicode, or non-ASCII characters in general? I researched this on the web and I think it does not.

My project is resource-constrained, so I don't want to pull in large libraries for this (ICU or Boost.Regex).

Any help would be appreciated.

sehe
iyasar
    Not that I know of, but the Plan 9 regex library is; a Unix port is available at http://swtch.com/plan9port/unix/ under `libregexp9` – Dave Jan 04 '12 at 14:03

3 Answers

9

It looks like POSIX regex works properly with a UTF-8 locale. I just wrote a simple test (see below) and used it to match a string containing Cyrillic characters against the regex "[[:alpha:]]" (for example), and everything works just fine.

Note: the main thing you must remember is that the regex functions are locale-dependent, so you must call setlocale() before using them.

#include <sys/types.h>
#include <string.h>
#include <regex.h>
#include <stdio.h>
#include <locale.h>

int main(int argc, char** argv) {
  int ret;
  regex_t reg;
  regmatch_t matches[10];

  if (argc != 3) {
    fprintf(stderr, "Usage: %s regex string\n", argv[0]);
    return 1;
  }

  setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

  /* Compile as a basic POSIX regex; pass REG_EXTENDED for ERE syntax */
  if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
    char buf[256];
    regerror(ret, &reg, buf, sizeof(buf));
    fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
    return 1;
  }

  if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
    int i;
    char buf[256];
    int size;
    for (i = 0; i < sizeof(matches) / sizeof(regmatch_t); i++) {
      if (matches[i].rm_so == -1) break;
      size = matches[i].rm_eo - matches[i].rm_so;
      if (size >= sizeof(buf)) {
        fprintf(stderr, "match (%d-%d) is too long (%d)\n",
                matches[i].rm_so, matches[i].rm_eo, size);
        continue;
      }
      /* strncpy() below copies exactly 'size' bytes and does not add a
         terminator, so place the terminator here in advance. */
      buf[size] = '\0';
      printf("%d: %d-%d: '%s'\n", i, matches[i].rm_so, matches[i].rm_eo,
             strncpy(buf, argv[2] + matches[i].rm_so, size));

    }
  }

  regfree(&reg);
  return 0;
}

Usage example:

$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
... (skip)
LC_ALL=
$ ./reg '[[:alpha:]]' ' 359 фыва'
0: 5-7: 'ф'
$

The matching result is two bytes long because a Cyrillic letter takes two bytes in UTF-8.
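
For instance, a quick way to see that two-byte encoding directly (a minimal sketch; it assumes the source file itself is saved as UTF-8):

#include <stdio.h>
#include <string.h>

int main(void) {
  /* "ф" (U+0444) is encoded as the two bytes 0xD1 0x84 in UTF-8 */
  printf("%zu\n", strlen("ф"));  /* prints 2 */
  return 0;
}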

praetorian droid
8

Basically, POSIX regexes are not Unicode-aware. You can try to use them on Unicode characters, but there may be problems with glyphs that have multiple encodings, and other such issues that Unicode-aware libraries handle for you.

From the standard, IEEE Std 1003.1-2008:

Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.
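
To see what "based on the bit pattern" means in practice, here is a minimal sketch (it assumes a UTF-8 locale such as en_US.UTF-8 is installed): the precomposed "é" (U+00E9) and the decomposed "e" plus combining acute accent (U+0065 U+0301) render identically, but a pattern built from the precomposed bytes only matches the first form.

#include <locale.h>
#include <regex.h>
#include <stdio.h>

int main(void) {
  regex_t re;
  setlocale(LC_ALL, "en_US.UTF-8");  /* assumed to be available on the system */
  regcomp(&re, "\xc3\xa9", 0);       /* pattern: UTF-8 bytes of precomposed "é" */
  printf("precomposed: %s\n",
         regexec(&re, "\xc3\xa9", 0, NULL, 0) == 0 ? "match" : "no match");
  printf("decomposed:  %s\n",
         regexec(&re, "e\xcc\x81", 0, NULL, 0) == 0 ? "match" : "no match");
  regfree(&re);
  return 0;
}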

Maybe libpcre would work for you? It's slightly heavier than POSIX regexes, but I would think it's still lighter than ICU or Boost.Regex.
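
If you do go that route, here is a rough sketch (it uses the classic PCRE1 API and assumes your libpcre was built with UTF-8 support; link with -lpcre). PCRE_UTF8 tells it that the pattern and subject are UTF-8, and PCRE_UCP makes classes such as \p{L} use Unicode character properties:

#include <pcre.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  const char *err;
  int erroff;
  int ovector[30];                     /* room for 10 capture pairs */
  const char *subject = " 359 фыва";   /* source file assumed to be UTF-8 */

  pcre *re = pcre_compile("\\p{L}+", PCRE_UTF8 | PCRE_UCP,
                          &err, &erroff, NULL);
  if (re == NULL) {
    fprintf(stderr, "pcre_compile failed at %d: %s\n", erroff, err);
    return 1;
  }

  int rc = pcre_exec(re, NULL, subject, (int)strlen(subject), 0, 0,
                     ovector, 30);
  if (rc >= 0)                         /* negative means no match or error */
    printf("matched bytes %d-%d: '%.*s'\n", ovector[0], ovector[1],
           ovector[1] - ovector[0], subject + ovector[0]);

  pcre_free(re);
  return 0;
}

Note that \p{...} support requires PCRE to have been built with Unicode property tables (--enable-unicode-properties); most distribution packages include it, but it is worth checking.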

cha0site
0

If you really mean "Standard", i.e. std::regex from C++11, then all you need to do is switch to std::wregex (and std::wstring of course).
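
A minimal sketch of that route (C++11; note that how much of Unicode a wchar_t can represent is platform-dependent: it is 32 bits wide on Linux but holds UTF-16 code units on Windows):

#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main() {
  std::locale::global(std::locale(""));  // use the system locale instead of "C"
  std::wcout.imbue(std::locale());       // convert wide output with that locale

  std::wstring text = L" 359 фыва";      // source file assumed to be UTF-8
  std::wregex  re(L"[[:alpha:]]+");      // traits pick up the global locale set above

  std::wsmatch m;
  if (std::regex_search(text, m, re))
    std::wcout << L"matched '" << m.str() << L"' at position "
               << m.position(0) << L"\n";
  return 0;
}

You would still have to convert UTF-8 input (e.g. from argv) to wide strings first, for instance with mbstowcs() or std::wstring_convert.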

MSalters