Is it possible for C/C++ PCRE to match 2 or more UTF-8 codepoints which are far apart from each other in a UTF-8 String?

Question

Good afternoon. We are using the latest C/C++ version of PCRE on Windows with Visual Studio 8.0 and 9.0, with the options PCRE_CASELESS, PCRE_UTF8, and PCRE_UCP. With the PCRE regex [\x{00E4}]{1} we are able to match the Latin-1 Supplement code point U+00E4 in the string DAS tausendschöne Jungfräulein, whose UTF-8 bytes are 44 41 53 20 74 61 75 73 65 6E 64 73 63 68 C3 B6 6E 65 20 4A 75 6E 67 66 72 C3 A4 75 6C 65 69 6E. Now we would like to match both code points U+00E4 (i.e. C3 A4) and U+00F6 (i.e. C3 B6) so that we can implement a simple prototype C/C++ search-and-replace operation using $1 $2. Is this possible to do? Thank you.
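As a quick check on the byte values involved, the UTF-8 encoding of a code point can be computed by hand. The following is a minimal sketch, independent of PCRE; `encode_utf8` is a hypothetical helper name, and the function does not reject surrogate code points.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point (U+0000..U+10FFFF) as a UTF-8 byte string.
// Hypothetical helper for illustration; surrogates are not rejected.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF && "code point outside the Unicode range");
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, `encode_utf8(0x00E4)` yields the bytes C3 A4 (ä) and `encode_utf8(0x00F6)` yields C3 B6 (ö). In the subject string, ö starts at byte offset 14 and ä starts at byte offset 25.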

We are now using the PCRE regex [\x{00F6}\x{00E4}]{1,} with the following C++ function:

void cInternational::RegExSearchReplace(cOrderedList *RegExList_, char **Input_) {
    const char *replacement;
    char substitution[dMaxRegExSubstitution];
    int subString;
    cPCRE *regEx;
    unsigned char *Buffer;

    Buffer = new unsigned char[1024];
    if (*Input_[0] != '\x0' && RegExList_->ResetIterator()) {
        do {
            regEx = new cPCRE();
            regEx->SetOptions(PCRE_CASELESS);
            if (regEx->Compile(RegExList_->GetCharacterField(1))) {
                // Search for Search RegEx:
                while (regEx->Execute((char *)Buffer) > 0) {

                    // Found it, get Replacement expression:
                    replacement = RegExList_->GetCharacterField(2);
                    int subLen = 0;
                    // Build substitution string by finding each $# in replacement and replacing
                    //   them with the appropriate found substring. Other characters in replacement
                    //   are sent through, untouched.
                    for (int i = 0; replacement[i] != '\x0'; i++) {
                        if (replacement[i] == '$' && isdigit(replacement[i + 1])) {
                            subString = atoi(replacement + i + 1);
                            if (regEx->HasSubString(subString)) {
                                strncpy(substitution + subLen,
                                        *Input_ + regEx->GetMatchStart(),
                                        regEx->GetMatchEnd() - regEx->GetMatchStart());

                                subLen += (regEx->GetMatchEnd() - regEx->GetMatchStart());
                            }
                            i++;
                        } else {
                            substitution[subLen++] = replacement[i];
                        }
                    }
                    substitution[subLen] = '\x0';

                    // Adjust the size of Input_ accordingly:
                    int sizeDiff = strlen(substitution) - (regEx->GetMatchEnd() - regEx->GetMatchStart());
                    if (sizeDiff > 0) {
                        char *newInput = new char[strlen(*Input_) + sizeDiff + 1];
                        strcpy(newInput, *Input_);
                        delete[] *Input_;
                        *Input_ = newInput;
                    }

                    memmove(*Input_ + regEx->GetMatchStart() + 1,
                            *Input_ + regEx->GetMatchEnd() + 1,
                            regEx->GetMatchEnd() - regEx->GetMatchStart());
                    strncpy(*Input_, substitution, strlen(substitution));
                    (*Input_)[strlen(substitution)] = '\x0';
                    Buffer = Buffer + regEx->GetMatchEnd();
                }
            }
            delete regEx;
        } while (RegExList_->Next());
    }
}
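The two steps in the function above (expanding `$#` references, then splicing the result over the matched bytes) can be sketched with `std::string`, which avoids the manual buffer resizing. This is only an illustration under stated assumptions: `expand_template` and `splice` are hypothetical names, the captured substrings are assumed to be available as a vector with the whole match at index 0, and only single-digit `$0` to `$9` references are handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Expand $0..$9 in tmpl using groups (groups[0] = whole match).
// References to groups that do not exist expand to nothing.
std::string expand_template(const std::string& tmpl,
                            const std::vector<std::string>& groups) {
    std::string out;
    for (std::size_t i = 0; i < tmpl.size(); ++i) {
        if (tmpl[i] == '$' && i + 1 < tmpl.size() &&
            std::isdigit(static_cast<unsigned char>(tmpl[i + 1]))) {
            std::size_t n = static_cast<std::size_t>(tmpl[i + 1] - '0');
            if (n < groups.size()) out += groups[n];
            ++i;  // skip the digit that followed '$'
        } else {
            out += tmpl[i];  // ordinary character, copied through untouched
        }
    }
    return out;
}

// Replace the byte range [start, end) of subject with replacement.
// Returns the byte offset just past the inserted text, i.e. where the
// next search should resume.
std::size_t splice(std::string& subject, std::size_t start, std::size_t end,
                   const std::string& replacement) {
    assert(start <= end && end <= subject.size());
    subject.replace(start, end - start, replacement);
    return start + replacement.size();
}
```

Resuming the search at the returned offset, rather than at the old match end, keeps the loop from rescanning text it has just inserted.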

How would you like to match them? "If both appear once anywhere in a string", "If either appear anywhere in a string", "If both appear next to each other in a string", ... — Ditmar Wendt, Jun 25 '12 at 20:40
@Daman, We would like to match them if they both appear once anywhere in a string? Thank you for your reply. — Frank, Jun 25 '12 at 20:42

Ditmar Wendt · Accepted Answer · 2012-06-25T20:57:41.997

2

Using PCRE, the regex you would use to match those appearing anywhere in a string is the following: \x{00E4}.*\x{00F6}

Explanation:

\x{00E4} matches the first Unicode character you want to find.

. matches any character except a newline (unless PCRE_DOTALL is set).

* modifies the previous period to match 0 or more times. This will allow the second Unicode character to be any number of characters away.

\x{00F6} matches the second Unicode character you want to find.

This will match if they appear at all. Let me know how it works, if you need it to do something else, etc. (For example: this doesn't seem all that useful for a search and replace operation. It's just going to tell you if those characters exist in the string at all. You'd need to modify the regex to do a substitution.)
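To see why a pattern of the form first.*second depends on the order of the two characters, its behaviour can be imitated with plain substring search. The sketch below uses `std::string` rather than PCRE, and assumes a single-line subject (so that `.` can cross every character) and UTF-8 byte strings as arguments. `Span` and `ordered_span` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte offsets [start, end) of a match; found is false when there is none.
struct Span {
    std::size_t start;
    std::size_t end;
    bool found;
};

// Imitates the leftmost, greedy match of `first.*second` on one line:
// start at the leftmost `first`, extend to the LAST `second` after it.
Span ordered_span(const std::string& s, const std::string& first,
                  const std::string& second) {
    assert(!first.empty() && !second.empty());
    Span none = {0, 0, false};
    std::size_t start = s.find(first);
    if (start == std::string::npos) return none;
    std::size_t last = s.rfind(second);
    // `second` must begin at or after the end of `first`.
    if (last == std::string::npos || last < start + first.size()) return none;
    Span hit = {start, last + second.size(), true};
    return hit;
}
```

For the subject DAS tausendschöne Jungfräulein, calling `ordered_span` with ö (C3 B6) first and ä (C3 A4) second gives start 14 and end 27. With the arguments swapped, nothing is found, because no ö follows the ä.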

edited Jun 25 '12 at 20:57

answered Jun 25 '12 at 20:51

Ditmar Wendt


Thank you for your answer. I just tried your regex '\x{00E4}.*\x{00F6}'. PCRE compiles it successfully but does not match any UTF-8 code points. Please advise us what to try to make your answer execute correctly. Thank you again. – Frank Jun 25 '12 at 21:11
Daman, I just reversed the order in your regex to '\x{00F6}.*\x{00E4}'. Your PCRE regex now compiles and matches at byte positions 14 and 27, which is correct. Could you please suggest some pseudocode to implement the search and replace? I just accepted your answer. Thank you. – Frank Jun 25 '12 at 21:17
Regarding the reverse working: The regex in this post is order-sensitive. It'll look for U+00E4 first, and then U+00F6 appearing after. Is order of appearance constant in your search? If not, the reversed version of my regex should be adequate. – Ditmar Wendt Jun 25 '12 at 21:35
Daman, thank you for your reply. The order of appearance of UTF-8 multibyte characters is random in the strings presented to the PCRE regex processor. Is there a PCRE regex that allows both orders of appearance to be matched successfully? Thank you. – Frank Jun 25 '12 at 21:44
Ah, OK, here's an order-insensitive regex: `(?=.*\x{00F6})(?=.*\x{00E4})`. It uses `?=` or **positive lookahead** to make sure both are matched. Regarding the pseudocode, I looked for your "cPCRE" library and it appears there is a cPCRE::Replace function, have you tried using that? It appears you'd use it after executing a regex. – Ditmar Wendt Jun 25 '12 at 22:10
Daman, thank you for your reply. I tried your order-insensitive regex (?=.*\x{00F6})(?=.*\x{00E4}). PCRE compiles it successfully, but it matches DAS tausendschöne Jungfräulein at PCRE match start byte position 0 and PCRE match end byte position 0 rather than (14,27). Please advise us how to get a match at byte positions 14 and 27, which we obtained with the PCRE regex '\x{00F6}.*\x{00E4}'. We also found the pcrecpp.cc code for replace. Thank you for your help. – Frank Jun 25 '12 at 22:36
I fear it isn't possible to match the position and fact that both characters are there through a single regex, when done order-insensitively. The order-insensitive regex will only tell you they are there. I'd recommend doing a different search and replace for each character separately. I've consulted a few regex gurus as well regarding this, and this is what they recommend. Why do you need to match these characters specifically? If you're replacing both of these characters, it can definitely be done through separate replacements. If you're looking for certain words, search the entire word. – Ditmar Wendt Jun 25 '12 at 23:51
This may give you some garbage matches as well, but try this: `(?=.*(a))(?=.*(b))` – Ditmar Wendt Jun 25 '12 at 23:52
Daman, thank you for consulting some regex gurus about how to search and replace multiple Unicode UTF-8 characters in a PCRE regex. I also appreciate your help in formatting my C++ code snippet. I just tried (?=.*(a))(?=.*(b)) with DAS tausendschöne Jungfräulein, and the match start result was 0 and the match end result was 0. Regards, Frank. – Frank Jun 26 '12 at 01:13

score 1 · Answer 2 · answered Jun 26 '12 at 17:16

I sent an email to the developer of PCRE, Philip Hazel, last night. Mr. Hazel believes that it is possible to implement order-insensitive PCRE regexes such as \x{00f6}.*?\x{00e4} | \x{00e4}.*?\x{00f6}

The explanation is shown below. Thank you for your help, Damon. Regards, Frank

From: Philip Hazel Date: Tue, Jun 26, 2012 at 8:55 AM To: Frank Chang Cc: pcre-dev@exim.org

On Mon, 25 Jun 2012, Frank Chang wrote:

Good evening, We are using C/C++ PCRE 8.30 with PCRE_UTF8 | PCRE_UCP | PCRE_COLLATE. Here's an order-insensitive regex: '(?=.*\x{00F6})(?=.*\x{00E4})'. It uses ?= (positive lookahead) to make sure both UTF-8 code points are matched in either order.

PCRE_compile() returns OK and PCRE_execute() returns OK on the string DAS tausendschöne Jungfräulein. In hex, it is 44 41 53 20 74 61 75 73 65 6E 64 73 63 68 C3 B6 6E 65 20 4A 75 6E 67 66 72 C3 A4 75 6C 65 69 6E. However, GetMatchStart() returns 0 and GetMatchEnd() returns 0 instead of GetMatchStart() = 14 and GetMatchEnd() = 27, which we obtain when we use the PCRE '\x{00F6}.*\x{00E4}' regex. Please advise us if it is possible to do order-insensitive matching of multiple UTF-8 code points in a PCRE regex. Thank you.

I have run your regex through the basic pcretest program, and it matches. This confirms your finding with PCRE_compile() and PCRE_execute().

Since your regex consists entirely of assertions, the actual matched string is empty (as pcretest shows). You need to modify your regex to actually match something if you want a match start and end to be given to you. If what you want is the string between these two code points, in either order, something simple like

\x{00f6}.*?\x{00e4} | \x{00e4}.*?\x{00f6}

(ignore white space) should do what you want.

I realize that this example may be a simplification of your real application, and my simple suggestion does not scale very well. But the main point stands: if you want to extract strings, your regex must do some actual matching, not just assertions.

Philip

-- Philip Hazel
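The point of the reply above is that a match position only exists when the pattern consumes text. What a lazy either-order alternation would report can be sketched without PCRE, using plain substring search. This is an illustration only: `Span` and `either_order_span` are hypothetical names, and the sketch assumes a single-line subject and two distinct UTF-8 encoded characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte offsets [start, end) of a match; found is false when there is none.
struct Span {
    std::size_t start;
    std::size_t end;
    bool found;
};

// Imitates a leftmost match of "a, lazily anything, b" OR "b, lazily
// anything, a": whichever of the two characters occurs first opens the
// match, and the first later occurrence of the other one closes it.
Span either_order_span(const std::string& s, const std::string& a,
                       const std::string& b) {
    assert(!a.empty() && !b.empty());
    Span none = {0, 0, false};
    std::size_t pa = s.find(a);
    std::size_t pb = s.find(b);
    if (pa == std::string::npos || pb == std::string::npos) return none;

    bool a_first = pa < pb;
    std::size_t start = a_first ? pa : pb;
    const std::string& opener = a_first ? a : b;
    const std::string& closer = a_first ? b : a;

    // Look for the closing character only after the opening one ends.
    std::size_t pos = s.find(closer, start + opener.size());
    if (pos == std::string::npos) return none;
    Span hit = {start, pos + closer.size(), true};
    return hit;
}
```

On DAS tausendschöne Jungfräulein this reports start 14 and end 27 regardless of which character is passed first, which is the span the order-sensitive regex produced only in one direction.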

score 0 · Answer 3 · edited Apr 23 '13 at 14:50


We wrote a PCRE order insensitive regex.

(?=.+(\x{00F6})){1}(?=.+(\x{00E4})){1}

That appears to function correctly.

edited Apr 23 '13 at 14:50

Cyril Gandon


answered Jun 26 '12 at 13:53

Frank

