We are trying to match the following German string:

Munich tausendschöne Jungfräulein ausendschçne

We are able to match it with a PCRE regex that uses a positive lookahead and a sequence of Unicode code points written as \x{...} escapes.

For example, (?=.+(\x{0068}\x{00F6})){1}.

However, when we put any of the UTF-8 literals ö, ä, or ç directly into the PCRE regex, pcre_compile() reports that the pattern is an invalid UTF-8 string.

We are using PCRE from C/C++ with the PCRE_UTF8, PCRE_UCP, and PCRE_CASELESS options enabled. What would be a valid PCRE regex that uses the UTF-8 literals ö, ä, or ç?

paercebal
Frank
  • The biggest problem to start with is that the string is not German at all; it doesn't even look German. "Munich" is "München", "tausend" and "schöne" are probably meant to be two words, there's no verb, nobody has used the word "Jungfräulein" since approximately the 17th century (and it is singular, as opposed to "tausend", but otherwise correct), and I have never seen a word even _remotely_ similar to "ausendschçne". C-cedilla is not used in German, so matching for it does not make any sense. Before trying to write a parser for a sample, one should have a sample that matches. – Damon Jun 30 '12 at 10:06
  • Damon, you make some good points. I will try to find a sample of a legitimate German phrase. However, as Giuseppe D'Angelo points out in the answer below, the execution charset of our compiler is not yet set to properly output UTF-8 sequences. Thank you. – Frank Jun 30 '12 at 12:22

1 Answer

The PCRE developer Giuseppe D'Angelo answered our question on the pcre mailing list:

It is possible, but you must ensure that the execution charset of your compiler is set to properly output UTF-8 sequences. Is that the case? Try getting a hex dump of the string literal you're passing to pcre_compile (if necessary, try looking at the assembler output).

ThiefMaster
Frank