Portable way to use the regex(3) functions on a wide char string in C

Question

There are functions like regwcomp(3) etc. on some systems, but this does not seem to be a portable solution at the moment. When there is a wchar_t string, what is the suggested portable solution (not Linux or GNU specific) to use the regex(3) functions (which normally work with char strings only)? In my case it is not really necessary that the pattern or text to match is non-7-bit ASCII, the problem is that the code used wchar_t for other reasons.

[`regcomp()`](http://man7.org/linux/man-pages/man3/regcomp.3.html) et al. are specified in POSIX.1-2001, which is why they are portable. Unfortunately, POSIX has no wide character regular expression support. The only really portable option is to use [`iconv()`](http://man7.org/linux/man-pages/man3/iconv.3.html) to convert between UTF-8 and wchar_t: `iconv_open("UTF-8//TRANSLIT", "WCHAR_T")` gives a handle for conversion to UTF-8, and `iconv_open("WCHAR_T//TRANSLIT", "UTF-8")` gives one for conversion back to wchar_t. — Nominal Animal, Apr 01 '16 at 22:39
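A minimal sketch of the wchar_t to UTF-8 direction described in the comment above. The helper name `wide_to_utf8` is hypothetical, and the `"WCHAR_T"` encoding name is an assumption: glibc and GNU libiconv accept it, but POSIX does not guarantee any particular encoding names. The `//TRANSLIT` suffix is left out here because it is a GNU extension.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a newly allocated UTF-8
 * string. Returns NULL on failure; the caller frees the result. */
char *wide_to_utf8(const wchar_t *ws)
{
    size_t inleft, outleft, outsize;
    char *in, *out, *buf;
    iconv_t cd;

    assert(ws != NULL);

    /* "WCHAR_T" is accepted by glibc and GNU libiconv (assumption). */
    cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1)
        return NULL;

    inleft = wcslen(ws) * sizeof(wchar_t);
    outsize = wcslen(ws) * 4 + 1;   /* at most 4 UTF-8 bytes per wchar_t */
    buf = malloc(outsize);
    if (buf == NULL) {
        iconv_close(cd);
        return NULL;
    }

    in = (char *)ws;                /* iconv() takes a non-const char ** */
    out = buf;
    outleft = outsize - 1;          /* keep room for the terminator */
    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        free(buf);
        buf = NULL;
    } else {
        *out = '\0';
    }
    iconv_close(cd);
    return buf;
}
```

The resulting string can be passed to `regexec()`; the match offsets it reports are then byte offsets into the UTF-8 copy, not indices into the original wchar_t string.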
I currently use `wcstombs`(3) and `mbstowcs`(3) to convert between char and wchar_t strings. The problem is that when I use `regexec`(3) on a converted string, I can't simply map the `regmatch_t` offsets to positions in the wchar_t string. The same would be true when using `iconv()` (IMHO). — user3224237, Apr 02 '16 at 10:35
True. Of course, you can temporarily insert a '\0' at each position, and use [`mbstowcs(NULL, beginning, 0)`](http://man7.org/linux/man-pages/man3/mbstowcs.3.html) to count the number of wide characters to that position. For the general multibyte character set case, you cannot just count the number of wide characters between two positions, as that loses the shift state. UTF-8 on the other hand is trivial (no shift state), so a single pass over the string could convert all match positions to wchar_t positions. — Nominal Animal, Apr 02 '16 at 11:05
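For the UTF-8 case mentioned above, mapping a `regmatch_t` byte offset to a wide character position amounts to counting the bytes that are not continuation bytes (`10xxxxxx`). The following is only a sketch: the function names are hypothetical, and it assumes valid UTF-8 input and a wchar_t that holds one full code point per element (true on Linux and the BSDs, not on platforms where wchar_t is UTF-16).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Count the characters encoded in the first nbytes bytes of the UTF-8
 * string s. Every byte that is not a continuation byte (10xxxxxx) starts
 * a new character. Assumes valid UTF-8 and nbytes on a character boundary. */
size_t utf8_offset_to_index(const char *s, size_t nbytes)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}

/* Hypothetical helper: match pattern against a UTF-8 string and report
 * the first match as wide character indices [*start, *end).
 * Returns 0 on a match, otherwise the regcomp()/regexec() error code. */
int match_as_wide_indices(const char *pattern, const char *utf8,
                          size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, utf8, 1, &m, 0);
    if (rc == 0) {
        /* rm_so and rm_eo are byte offsets into utf8. */
        *start = utf8_offset_to_index(utf8, (size_t)m.rm_so);
        *end = *start + utf8_offset_to_index(utf8 + m.rm_so,
                                             (size_t)(m.rm_eo - m.rm_so));
    }
    regfree(&re);
    return rc;
}
```

Because the end index is computed from the start index, each byte of the string up to the end of the match is inspected only once. With several subexpressions the same idea extends to one pass over the string that translates all offsets in increasing order.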
In case it is not clear to OP or others, after initializing the locale one can use [`nl_langinfo(CODESET)`](http://man7.org/linux/man-pages/man3/nl_langinfo.3.html) to obtain the character set or encoding used by the current locale, in a form that should be acceptable to [`iconv_open()`](http://man7.org/linux/man-pages/man3/iconv_open.3.html) (although you might wish to append the `//TRANSLIT` and/or `//IGNORE` suffixes). — Nominal Animal, Apr 02 '16 at 11:07
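As a rough illustration of the `nl_langinfo(CODESET)` suggestion, the sketch below builds an `iconv_open()` name from the current locale's codeset and checks whether a codeset name denotes UTF-8. Both function names are hypothetical. The spelling of the codeset varies between systems (for example `UTF-8` or `utf8`), and `//TRANSLIT` is a GNU extension that other iconv implementations may reject.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <langinfo.h>
#include <locale.h>   /* for setlocale() in the caller */
#include <stdio.h>
#include <strings.h>

/* Nonzero if the codeset name denotes UTF-8. The comparison ignores case
 * and accepts the spelling without a hyphen. */
int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}

/* Write "<codeset>//TRANSLIT" for the current locale into buf.
 * Call setlocale(LC_ALL, "") first so that the codeset reflects the
 * user's environment. Returns 0 on success, -1 if buf is too small. */
int locale_iconv_name(char *buf, size_t size)
{
    int n = snprintf(buf, size, "%s//TRANSLIT", nl_langinfo(CODESET));

    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

A program that only supports UTF-8 could call `setlocale(LC_ALL, "")`, then refuse to run (or fall back to plain ASCII handling) when `codeset_is_utf8(nl_langinfo(CODESET))` is zero.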
I use `setlocale(LC_ALL, "");` to determine the locale. Indeed, I'm interested in UTF-8 only; support for other locales (except `C`) is not intended. OK, I might tinker with `wcstombs` conversion and position guessing. It's too bad that `regwcomp` etc. are not in POSIX... — user3224237, Apr 02 '16 at 11:15

Rob Arthan · Answer 1 · 2018-10-10T21:09:31.703

If anyone else has this problem, feel free to borrow the functions my_regwcomp and my_regwexec that I had to write recently. You can find them in this source file in the ProofPower system. These functions simulate the regwcomp and regwexec functions of FreeBSD using the POSIX regcomp and regexec functions.

PS: My code is part of a Motif application. If you replace XtMalloc, XtRealloc and XtFree with malloc, realloc and free, it should work in any standard C/C++ development framework. Please add a comment to this answer if you need any help getting my functions working in your environment.
