I was trying to do a regex replace with boost::regex, but it doesn't seem to be working.

Here is the regular expression:

(\\w+,\\d+,\\d+,\\d+\tscript\t)(.+)(#)(.+)(\t\\d+(,\\d+)?(,\\d+)?,{)

And the formatter:

$1\"$2\"$3\"$4\"$5

The code (getInput() returns a string whose content should match):

std::string &Preprocessor::preprocess()
{
    std::string &tempString = getInput();
    boost::regex scriptRegexFullName;
    const char *scriptRegexFullNameReplace = "$1\"$2\"$3\"$4\"$5";

    scriptRegexFullName.assign("(\\w+,\\d+,\\d+,\\d+\tscript\t)(.+)(#)(.+)(\t\\d+(,\\d+)?(,\\d+)?,{)");

    tempString = boost::regex_replace(tempString, scriptRegexFullName, scriptRegexFullNameReplace, boost::match_default);

    return tempString;
}

When I put the following test cases on this website:

alberta,246,82,3    script  Marinheiro#bra2 100,{
brasilis,316,57,3   script  Marinheiro#bra1 100,{
brasilis,155,165,3  script  Orientação divina#bra1  858,{

The output of the website is correct:

alberta,246,82,3    script  "Marinheiro"#"bra2" 100,{
brasilis,316,57,3   script  "Marinheiro"#"bra1" 100,{
brasilis,155,165,3  script  "Orientação divina"#"bra1"  858,{

But with boost::regex the output is:

alberta,246,82,3    script  "Marinheiro#bra2    100,{
brasilis,316,57,3   script  Marinheiro#bra1 100,{
brasilis,155,165,3  script  Orientação divina#bra1  858,{

Does anyone know what I am doing wrong?

Thanks for the help.

RenatoUtsch
  • I suspect you're getting hit by locale headaches... – Billy ONeal Aug 11 '13 at 00:26
  • What happens if you escape the trailing `{`? `{` is a regex metacharacter. – Billy ONeal Aug 11 '13 at 00:31
  • @BillyONeal, nothing changes when I escape the trailing {... but I'll add it anyways, didn't know it was a regex metacharacter. – RenatoUtsch Aug 11 '13 at 00:59
  • @BillyONeal It is possible... the file I am opening uses Windows-1252 (ANSI) character set, and when I try to specify a character like "ç" or "á", the regex doesn't work. But the problem is that even without these characters, the regex is still not working. If I change the two `.+` to `[a-fA-F0-9_ ]` it works, but I need to add support for other characters, and doing `[a-fA-F0-9_áÁàÀâÂãÃéÉíÍóÓôÔúÚüÜçÇ ]` is not working. – RenatoUtsch Aug 11 '13 at 01:03

1 Answer

The problem comes from your first (.+), which is greedy and grabs as much as it can, probably everything up to the last # of the subject string.
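
To make the greedy behaviour concrete, here is a minimal sketch. It uses std::regex rather than boost::regex so that it compiles without Boost, and the helper name splitAtHash is hypothetical. Because the dot in std::regex never matches a newline, `[\s\S]` stands in for a dot that can cross line breaks; as far as I know, Boost.Regex lets `.` match newlines by default unless the match_not_dot_newline flag is passed, which would explain why the online tester and Boost disagree, but treat that as an assumption to verify against your Boost version.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: applies `pattern` (which must contain exactly two
// capture groups separated by a '#') to `text` and returns {group 1, group 2}.
// Returns two empty strings when the pattern does not match.
std::pair<std::string, std::string> splitAtHash(const std::string &text,
                                                const std::string &pattern)
{
    std::smatch m;
    if (!std::regex_search(text, m, std::regex(pattern)))
        return {"", ""};
    return {m[1].str(), m[2].str()};
}
```

With the two-line subject "Marinheiro#bra2\nMarinheiro#bra1", the pattern `([\s\S]+)#([\s\S]+)` puts "Marinheiro#bra2\nMarinheiro" in the first group and "bra1" in the second, because the greedy quantifier only gives back characters until the last # fits. The pattern `([^#\n]+)#([^#\n]+)` yields "Marinheiro" and "bra2", since the negated class cannot run past the first # or the end of the line.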

You can try this pattern:

const char *scriptRegexFullNameReplace = "$1\"$2\"#\"$3\"$4";

scriptRegexFullName.assign("(\\p{L}+,\\d+,\\d+,\\d+\\s+script\\s+)([^#]+)#(\\S+)(\\s+\\d+,\\{)");

Notes:

  • Escaping the curly bracket is probably unneeded; try removing it.
  • \p{L} stands for any Unicode letter; if that causes a problem, try replacing it with [^,].
  • You can replace every + with ++ (a possessive quantifier) for better performance, since no backtracking is allowed.
  • There is no need to capture the # only to replace it with itself, which is why the pattern has only four capturing groups.
  • Instead of (.+?) (the dot with a lazy quantifier), it is better for performance to use a greedy quantifier with a restricted character class: [^#]+ matches every character up to the first #.
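
Putting the notes together, a whole-string replacement might look like the sketch below. It is written against std::regex instead of boost::regex so it builds without Boost, which means no possessive quantifiers and no \p{L}; the function name quoteScriptNames is hypothetical, and the literal tabs plus the optional ,\d+ groups are taken from the pattern in the question. The classes also exclude \t and \n so a name can never spill onto the next field or line.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wraps the two halves of every "name#id" script label
// in double quotes, leaving the rest of each line untouched.
std::string quoteScriptNames(const std::string &input)
{
    // Group 1: "map,x,y,dir<TAB>script<TAB>" header.
    // Group 2: visible name, everything up to the first '#'.
    // Group 3: hidden id, everything up to the next tab.
    // Group 4: "<TAB>sprite[,a][,b],{" tail.
    static const std::regex scriptRegex(
        "(\\w+,\\d+,\\d+,\\d+\tscript\t)"
        "([^#\t\n]+)#([^\t\n]+)"
        "(\t\\d+(?:,\\d+)?(?:,\\d+)?,\\{)");

    // The '#' is not captured, so it is written literally in the format.
    return std::regex_replace(input, scriptRegex, "$1\"$2\"#\"$3\"$4");
}
```

Calling quoteScriptNames on the tab-separated line `alberta,246,82,3 script Marinheiro#bra2 100,{` gives back the same line with `"Marinheiro"#"bra2"` in the name field; lines that do not match the pattern are returned unchanged.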
Casimir et Hippolyte
  • This was my first thought, but because there are non-optional characters later in the regex, I couldn't figure out how it matched if the `+` was too greedy, so I didn't go that route. Nice work. – Billy ONeal Aug 11 '13 at 05:40
  • Following your advice I ended up with this: `(\\w++,\\d++,\\d++,\\d++\tscript\t)([^#]++)#([^\t]++)(\t\\d++(,\\d++)?(,\\d++)?,{)` (the \t's are part of the terrible old language I'm working on, so they need to stay there). It worked flawlessly. I didn't think about using [^#], that's a great idea. Thanks for the help, I learned some good things about regex from your answer. – RenatoUtsch Aug 11 '13 at 05:44