Folks, I'm trying to use regular expressions to process a large set of number strings and match digit sequences for particular patterns where some digits are repeated in groups. Part of the requirement is to ensure uniqueness between sections of the given pattern.
An example of the kind of matching I'm trying to achieve
ABBBCCDD
Interpret this as a set of digits. But A,B,C,D cannot be the same. And the repetition of each is the pattern we're trying to match.
I've been using regular expressions with negative look-ahead as part of this matching and it works but not all the time and I'm confused as to why. I'm hoping someone can explain why its glitching and suggest a solution.
So to address ABBBCCDD I came up with this RE using negative look-ahead using groups..
(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}
To break this down..
(.) single character wildcard group 1 (A)
(?!\1{1,7}) negative look-ahead for 1-7 occurrences of group 1 (A)
(.) single character wildcard group 2 (B)
\2{2} A further two occurrences of group 2 (B)
(?!\2{1,4}) Negative look-ahead of 1-4 occurrences of group 2 (B)
(.) single character wildcard group 3 (C)
\3{1} One more occurrence of group 3 (C)
(?!\3{1,2}) Negative look-ahead of 1-2 occurrences of group 3 (C)
(.) single character wildcard group 4 (D)
\4{1} one more occurrence of group 4 (D)
The thinking here is that the negative look-aheads act as a means of verifying that a given character is not found where it's unexpected. So A gets checked in the next 7 chars. Once B and it's 2 repetitions are matched, we're negativdely looking ahead for B in the next 4 chars. Finally once the pair of Cs is matched, we're looking in the final 2 for a C as a means of detecting a mismatch.
For test data, this string "01110033" matches the expression. But it shouldn't because the '0' for A is repeated in the C position.
I ran checks of this expression in Python and with grep in PCRE mode (-P). Both matched the wrong pattern.
I put the expression in https://regex101.com/ along with the same test string "01110033" and it also matched there. I don't have enough rating to post images of this or of variations I tried with the test data. So here are some text grabs from command-line runs with grep -P
So our invalid expression that repeats A in CC position gets through..
$ echo "01110033" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
01110033
$
Changing DD to 11, copying BBB, we also find that gets through despite B having a forward negative check..
$ echo "01110011" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
01110011
$
Now change DD to "00", copying the CC digits and low and behold it doesn't match..
$ echo "01110000" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
Delete the forward-negative check for CC "(?!\3{1,2})" from the expression and our repeat of the C digit in the D position makes it through.
$ echo "01110000" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(.)\4{1}'
01110000
$
Back to the original test number and switch CC digits to the same use of '1' from B. It doesn't get through.
$ echo "01111133" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
And to play this out for the BBB group, set the B digits to the same 0 as encountered for A. Also fails to match..
$ echo "00002233" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
Then take out the negative lookahead for A and we can this to match..
$ echo "00002233" | grep -P '(.)(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
00002233
$
So it seems to me that the forward negative check is working but that it only works with the next adjacent set or its intended lookahead range is cut short in some form presumably by the extra things we're trying to match.
If I add an additional lookahead on A right after B and its repetition have been processed, we get it to avoid matching on the CC part reusing the A digit..
$ echo "01110033" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\1{1,4})(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
To take this further, then after matching the CC set, I would need to repeat the negative lookaheads for A and B again. This just seems wrong.
Hopefully an RE expert can clarify what I'm doing wrong here or confirm if negative-lookahead is indeed limited based on what I'm observing