I have text like the following:

header line
  a = b
  c = d
  c = e
  f = g

I've come up with the pattern

std::string pat = 
"((.*)(\n|\r\n)(\\s|\\t)*?(?<name>([a-z]{1,100}))\\s+=)"
"((.*)(\n|\r\n)(\\s|\\t)*?(?<!\\k<name>{1,100})\\s+=)";

Using ICU's regex I get U_REGEX_LOOK_BEHIND_LIMIT. I thought the {1,100} was what I needed, but it has no effect. How do I get the look-behind to accept the limit I'm giving it?

Or is there a simpler way to do this? If it's not clear, I want a pattern that matches whenever the first word of a line is different to the first word of the previous line, so it would match when it encounters c = d and again when it encounters f = g, but not for c = e.

zcourts
  • ICU regex does not support non-fixed-width look-behind. – Wiktor Stribiżew Sep 26 '15 at 13:46
  • You might consider trying `std::regex` in C++11. – John Zwinck Sep 26 '15 at 14:00
  • @stribizhev I found that out just before posting that's why I changed the capture group from `[a-z]+` to `[a-z]{1,100}` thinking that'd make it fixed-width. @JohnZwinck - I can't switch to `std::regex` at the moment, it'd take a great deal of time to do the switch. From first glance though at http://en.cppreference.com/w/cpp/regex , it doesn't look like C++11 regex has unicode support which is a requirement – zcourts Sep 26 '15 at 14:12
  • Split by newline, check if the first word is the same as the word from the previous line. Only .NET regex (or Python regex module) and Java can handle look-behinds of variable width. – Wiktor Stribiżew Sep 26 '15 at 15:17
  • @stribizhev: ICU supports bounded variable-length lookbehinds like Java does, so `(?<![a-z]{1,100})` would be legal. The problem is the backreference. It doesn't matter how complex the referenced group is, backreferences are not allowed in lookbehinds. – Alan Moore Sep 26 '15 at 20:43
  • @AlanMoore any suggestions on achieving the comparison without the back reference? – zcourts Sep 27 '15 at 01:00
  • So that is what *The length of possible strings matched by the look-behind pattern must not be unbounded (no \* or + operators.)* means... Still, there is no way to do that with regex; I also noticed the back-reference-in-look-behind issue later when trying my hand at it. – Wiktor Stribiżew Sep 27 '15 at 08:09
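
The line-splitting approach suggested in the comments can be sketched without any regex. This is only a minimal sketch under stated assumptions: `key_of` and `changed_lines` are hypothetical helper names, a "first word" is taken to be whatever precedes the first `=` with ASCII spaces and tabs trimmed, and lines without `=` (such as the header) are never compared. Keys are compared byte-wise, so UTF-8 keys compare correctly only if they are encoded identically.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the text before the first '=' with surrounding
// spaces and tabs removed, or an empty string if the line is not "key = value".
std::string key_of(const std::string& line) {
    const std::size_t eq = line.find('=');
    if (eq == std::string::npos) return "";
    const std::string left = line.substr(0, eq);
    const std::size_t begin = left.find_first_not_of(" \t");
    if (begin == std::string::npos) return "";
    const std::size_t end = left.find_last_not_of(" \t");
    return left.substr(begin, end - begin + 1);
}

// Hypothetical helper: returns every "key = value" line whose key differs from
// the key of the line directly before it. Lines preceded by a non-key line
// (for example the header) are not reported.
std::vector<std::string> changed_lines(const std::string& text) {
    std::vector<std::string> result;
    std::istringstream in(text);
    std::string line;
    std::string prev_key;  // empty means "previous line had no key"
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        const std::string key = key_of(line);
        if (!key.empty() && !prev_key.empty() && key != prev_key) {
            result.push_back(line);
        }
        prev_key = key;
    }
    return result;
}
```

For the sample input in the question, `changed_lines` would report the `c = d` and `f = g` lines and skip `c = e`.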

1 Answer

Try this regex:

^\h*(?<name>\w++)\h*=.*\R(?=\h*(?<good>(?!\k<name>\b)\w++\h*=.*$))

I've basically turned your solution on its head. I match the previous line in the normal way, then match the current line in a lookahead. The lookahead lets me look at the whole line without advancing the current match position. This is why the next match attempt starts on the next line, not the one after it.

Although the lookahead doesn't consume what it matches, you can still capture parts of the matched text in groups. Here I've captured the current line in a group named good.

A word about some of my other changes: \R is the platform-neutral newline construct, which is much more robust than (\n|\r\n). \h matches horizontal whitespace characters like spaces and TABs, but not vertical whitespace like linefeeds and carriage returns. Note that \h is not the same as (\s|\t). Many new users assume \s only matches the space character, but it actually matches any whitespace character, horizontal or vertical.
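
The point about \s can be checked quickly. The sketch below uses `std::regex` rather than ICU purely so that it is self-contained; `std::regex` has no \h, so the character class containing a space and a tab stands in for it as an assumption, and `full_match` is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole string s matches the pattern.
bool full_match(const std::string& s, const std::string& pattern) {
    return std::regex_match(s, std::regex(pattern));
}

// \s accepts vertical whitespace as well as horizontal whitespace.
const bool s_matches_newline = full_match("\n", "\\s");
const bool s_matches_tab = full_match("\t", "\\s");

// A class of just space and tab (standing in for \h) rejects the newline.
const bool h_matches_newline = full_match("\n", "[ \t]");
const bool h_matches_tab = full_match("\t", "[ \t]");
```

This is why (\s|\t)* in the original pattern can run across line breaks, while a horizontal-only class stays on the current line.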

Here's the regex as a C string literal:

"(?m)^\\h*(?<name>\\w++)\\h*=.*\\R(?=(?<good>\\h*(?!\\k<name>\\b)\\w++\\h*=.*$))"

Note that it doesn't work on the first line, but I'm assuming it doesn't need to.

Alan Moore