I am using the code below to try to match symbols with a regex (as an example, I am trying to match the circled star symbol, http://graphemica.com/%E2%9C%AA):

#include <boost/regex.hpp>

//...
std::wstring text = L"a✪c";
auto re = L"(\\p{S}|\\p{L})+?";
boost::wregex r(re);
boost::regex_token_iterator<std::wstring::const_iterator>
  i(boost::make_regex_token_iterator(text, r, 1)), j;
while (i != j)
{
  std::wstring x = *i;
  ++i;
}
//...

The code-unit values of text are {97, 10026, 99} (or `{0x61, 0x272A, 0x63}`), so it is a valid symbol.

The code matches the two letters, 'a' (0x61) and 'c' (0x63), but not the symbol (0x272A). I have tried it with a couple of other symbols and none of them work (© for example).

What am I missing here?

FFMG
  • Interesting, both `✪` and `©` belong to `\p{So}` category. What if you just use `\\p{So}`, or `auto re = L"[\\p{So}\\p{S}\\p{L}]";`? (Your `+?` is redundant if you want to match 1 symbol at a time). – Wiktor Stribiżew Jul 22 '16 at 11:29
  • It does not seem to work, I get the following error `Escape sequence was neither a valid property nor a valid character class name. The error occurred while parsing the regular expression: '(\p{So}>>>HERE>>>)'.` – FFMG Jul 22 '16 at 11:48

1 Answer

The Boost.Regex documentation explicitly states that there's no support for Unicode-specific character classes when using boost::wregex.

If you want this functionality, you'll need to build Boost.Regex with ICU support enabled and then use the `boost::u32regex` type instead of `boost::wregex`.

ildjarn