
Folks, I'm trying to use regular expressions to process a large set of number strings and match digit sequences for particular patterns where some digits are repeated in groups. Part of the requirement is to ensure uniqueness between sections of the given pattern.

An example of the kind of matching I'm trying to achieve

ABBBCCDD 

Interpret this as a sequence of digits in which A, B, C and D must all be different. The repetition of each is the pattern we're trying to match.

I've been using regular expressions with negative look-ahead as part of this matching. It works, but not all the time, and I'm confused as to why. I'm hoping someone can explain why it's glitching and suggest a solution.

So to address ABBBCCDD I came up with this RE, using groups and negative look-ahead..

(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}

To break this down..

(.)           single character wildcard group 1 (A)
(?!\1{1,7})   negative look-ahead for 1-7 occurrences of group 1 (A)
(.)           single character wildcard group 2 (B)
\2{2}         A further two occurrences of group 2 (B)
(?!\2{1,4})   Negative look-ahead of 1-4 occurrences of group 2 (B)
(.)           single character wildcard group 3 (C)
\3{1}         One more occurrence of group 3 (C)
(?!\3{1,2})   Negative look-ahead of 1-2 occurrences of group 3 (C)
(.)           single character wildcard group 4 (D)
\4{1}         one more occurrence of group 4 (D)

The thinking here is that the negative look-aheads act as a means of verifying that a given character is not found where it's unexpected. So A gets checked in the next 7 chars. Once B and its 2 repetitions are matched, we're negatively looking ahead for B in the next 4 chars. Finally, once the pair of Cs is matched, we're looking in the final 2 for a C as a means of detecting a mismatch.

For test data, this string "01110033" matches the expression. But it shouldn't because the '0' for A is repeated in the C position.

I ran checks of this expression in Python and with grep in PCRE mode (-P). Both matched the wrong pattern.
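A minimal Python sketch of that check, assuming `re.search` is run against each 8-digit test string (the names `PATTERN` and `matches` are illustrative, not taken from the original script):

```python
import re

# The expression from above, unchanged.
PATTERN = re.compile(r"(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}")

def matches(number: str) -> bool:
    """Return True if the expression finds a match anywhere in the string."""
    return PATTERN.search(number) is not None

# Same test strings as used with grep -P.
for number in ("01110033", "01110011", "01110000", "01111133", "00002233"):
    print(number, matches(number))
```

The first two strings report a match and the last three do not, which is the same behaviour seen with grep -P.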

I put the expression in https://regex101.com/ along with the same test string "01110033" and it also matched there. I don't have enough rating to post images of this or of variations I tried with the test data. So here are some text grabs from command-line runs with grep -P

So our invalid number, which repeats A in the CC position, gets through..

$ echo "01110033" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
01110033
$

Changing DD to 11, copying BBB, we also find that gets through despite B having a forward negative check..

$ echo "01110011" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
01110011
$

Now change DD to "00", copying the CC digits, and lo and behold it doesn't match..

$ echo "01110000" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$

Delete the forward-negative check for CC "(?!\3{1,2})" from the expression and our repeat of the C digit in the D position makes it through.

$ echo "01110000" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(.)\4{1}'
01110000
$

Back to the original test number, and switch the CC digits to the same '1' used for B. It doesn't get through.

$ echo "01111133" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$

And to play this out for the BBB group, set the B digits to the same 0 as encountered for A. Also fails to match..

$ echo "00002233" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$ 

Then take out the negative lookahead for A and we can get this to match..

$ echo "00002233" | grep -P '(.)(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
00002233
$ 

So it seems to me that the forward negative check is working, but that it only works with the next adjacent set, or that its intended lookahead range is cut short in some form, presumably by the extra things we're trying to match.

If I add an additional lookahead on A right after B and its repetition have been processed, we get it to avoid matching on the CC part reusing the A digit..

$ echo "01110033" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\1{1,4})(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$

To take this further, then after matching the CC set, I would need to repeat the negative lookaheads for A and B again. This just seems wrong.

Hopefully an RE expert can clarify what I'm doing wrong here or confirm if negative-lookahead is indeed limited based on what I'm observing

Cormac Long

4 Answers

(.)(?!.{0,6}\1)(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}
   ^^^^^^^^^^^^

Change your lookahead to disallow a match when \1 appears anywhere in the rest of the string (here, within the next 7 characters). See the demo. You can similarly modify the other lookaheads in your regex.

https://regex101.com/r/vV1wW6/31
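A small Python sketch for trying this change outside regex101, assuming the whole 8-digit string is the candidate (the name `first_fixed` and the sample strings are made up for illustration):

```python
import re

# Regex from this answer: only the first lookahead has been changed.
first_fixed = re.compile(
    r"(.)(?!.{0,6}\1)(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}"
)

samples = {
    "01110033": "A repeated in the C position",
    "01112233": "all four digits distinct",
    "01112211": "B repeated in the D position",
}

for number, note in samples.items():
    print(number, bool(first_fixed.fullmatch(number)), "-", note)
```

With only the first lookahead changed, `01110033` is rejected, but `01112211` still matches because the lookahead for B is unchanged. That is why the other lookaheads need the same treatment.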

vks
  • Thanks for that. That will specifically work for the example I showed. Some of the more elaborate patterns however will repeat A,B,C,D later on in the sequence. But based on your response I just found that using ..(?!.{0,6}\1) would also work as a means of testing my group in the next 7 characters by pitching it as 0-6 wildcards followed by my group character. So I'll run with that approach for now. thanks again for the speedy response. – Cormac Long Sep 29 '15 at 12:27

NOTE: updated.

As vks already noted, your negative lookaheads weren't excluding what you thought -- \1{1,7} for example is only going to exclude A, AA, AAA, AAAA, AAAAA, AAAAAA, and AAAAAAA. I think you want the lookaheads to be .*\1, .*\2, .*\3, etc.
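To see the difference in isolation, here is a short Python sketch (the pattern names are illustrative). It compares `\1{1,7}`, which can only match a run that starts at the very next character, with `.*\1`, which scans the rest of the string:

```python
import re

# Rejects only when the captured character repeats immediately afterwards.
adjacent_only = re.compile(r"(.)(?!\1{1,7})")
# Rejects when the captured character shows up anywhere later on.
anywhere_later = re.compile(r"(.)(?!.*\1)")

for s in ("0110", "0011", "0123"):
    # re.match anchors at the start, so group 1 is always the first character.
    print(s, bool(adjacent_only.match(s)), bool(anywhere_later.match(s)))
```

For `0110` the first pattern matches (the next character is `1`, so the lookahead has nothing to object to) while the second does not, which is the gap the original expression fell into.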

But here's another idea: It's easy to prefilter out ANY line that has non-adjacent repeated characters:

grep -P -v '(.)(?!\1).*\1'

And then your regexp on the result is MUCH simpler: .{1}.{3}.{2}.{2}

And in fact the whole thing can be combined using the first as a negative pre-lookahead constraint:

(?!.*(.)(?!\1).*\1).{1}.{3}.{2}.{2}

Or if you need to capture the digits as you did originally:

(?!.*(.)(?!\1).*\1)(.){1}(.){3}(.){2}(.){2}

But note that those digits will now be \2 \3 \4 \5, since \1 is in the lookahead.
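A rough Python check of the combined idea, offered as a sketch rather than a drop-in solution. One assumption is added here: `.{1}.{3}.{2}.{2}` by itself accepts any eight characters, so the sketch pins the run lengths with backreferences instead. The names `NO_SPLIT_REPEATS`, `SHAPE` and `matches` are made up for illustration.

```python
import re

# Lookahead from the answer: fail if a character recurs after a different one.
NO_SPLIT_REPEATS = r"(?!.*(.)(?!\1).*\1)"
# Assumed tail: runs of 1, 3, 2, 2 pinned with backreferences (groups 2 to 5).
SHAPE = r"(.)(.)\3{2}(.)\4(.)\5"

combined = re.compile(NO_SPLIT_REPEATS + SHAPE)

def matches(s: str) -> bool:
    """True when the whole string has the 1,3,2,2 shape and passes the prefilter."""
    return combined.fullmatch(s) is not None

for s in ("01112233", "01110033", "01234567", "00002233"):
    print(s, matches(s))
```

In this sketch `01112233` matches, while `01110033` and `01234567` do not. Note that `00002233` still matches: the prefilter only rejects non-adjacent repeats, so two neighbouring groups that share a digit are not caught by it.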

Jeff Y
  • Thanks for that clarification. There's a set of over 50 patterns I need to match and they go all over the place. In some cases, the same character will be repeated further on. So I ended up having to modify vks's solution to be (?!.{0,6}\1), which would test for 0-6 wilds followed by \1, as a replacement for what I thought \1{1,7} was supposed to do for me. And the other expressions then just became variants of the same trick. – Cormac Long Sep 29 '15 at 15:50
  • You're welcome. That's the thing -- an outline of the _complete_ problem space is what you/we need before any general correct solution can be had. And if it's really "all over the place with the patterns", no general solution exists -- you'll have to do it piecemeal. – Jeff Y Sep 29 '15 at 16:41

Based on the feedback so far, I'm giving another answer that does not rely on doing arithmetic based on total length and that will self-containedly identify any sequence of 4 unique character/digit groups in the length sequence 1,3,2,2 anywhere in a string:

/(?<=^|(.)(?!\1))(.)\2{0}(?!\2)(.)\3{2}(?!\2|\3)(.)\4{1}(?!\2|\3|\4)(.)\5{1}(?!\5)/gm
 ^^^^^^^^^^^^^^^^ this is a look-behind that makes sure we're starting with a new character/digit
                 ^^^^^^^^ this is the size-1 group; yes the \2{0} is superfluous
                         ^^^^^^ this ensures the next group is unique
                               ^^^^^^^^ this is the size-3 group
etc.

Let me know if this is closer to your solution. If so, and if all of your "patterns" consist of sequences of the group sizes you're looking for (like 1,3,2,2), I can come up with some code that will generate the corresponding regexp for any such input "pattern".
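As a rough idea of what such a generator could look like, here is a Python sketch. It is a simplified variant, not the exact expression above: it assumes the whole string is the candidate (checked with `re.fullmatch`), so the look-behind and the trailing `(?!\5)` are dropped. Python's built-in `re` module also does not accept a look-behind with alternatives of different widths such as `(?<=^|(.)(?!\1))`. The function name `build_pattern` is hypothetical.

```python
import re

def build_pattern(sizes):
    """Build a regex for consecutive runs of the given sizes, each a distinct character."""
    parts = []
    for index, size in enumerate(sizes, start=1):
        if index > 1:
            # The next run must not start with any previously captured character.
            earlier = "|".join(f"\\{n}" for n in range(1, index))
            parts.append(f"(?!{earlier})")
        parts.append("(.)")
        if size > 1:
            # Remaining characters of the run must repeat the captured one.
            parts.append(f"\\{index}{{{size - 1}}}")
    return "".join(parts)

pattern = build_pattern([1, 3, 2, 2])
print(pattern)
print(bool(re.fullmatch(pattern, "01112233")))
print(bool(re.fullmatch(pattern, "01110033")))
```

For the sizes 1,3,2,2 this produces `(.)(?!\1)(.)\2{2}(?!\1|\2)(.)\3{1}(?!\1|\2|\3)(.)\4{1}`. Checking each new group against all earlier captures at the moment it starts keeps the lookaheads short, at the cost of listing every earlier group.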

Jeff Y

Just some details here on what the eventual solution looked like for me..

So fundamentally (?!\1{1,7}) was not what I had thought it would be and was the entire cause of the issues I had encountered. Sincere appreciations to you guys for finding that issue for me.

The example I had shown was one of about 50 I had to formulate from a set of patterns.

It ended up as..

ABBBCCDD
09(.)(?!.{0,6}\1)(.)\2{2}(?!.{0,3}\2)(.)\3{1}(?!.{0,1}\3)(.)\4{1}

So once \1 (A) was captured, I tested negative lookahead of 0-6 wildchars preceding A. Then I capture \2 (B), its two repetitions and then give B negative lookahead of 0-3 wilds + B and so on.

It keeps the focus oriented around looking forward negatively to make sure the caught groups do not repeat where they are not supposed to. Then the subsequent captures and their recurrence patterns will do the rest in ensuring the match.
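A quick Python sketch for exercising the ABBBCCDD expression above, assuming each candidate is a 10-character string starting with the literal 09. The sample numbers and the name `abbbccdd` are invented for illustration.

```python
import re

# ABBBCCDD expression from above, including the literal "09" prefix.
abbbccdd = re.compile(
    r"09(.)(?!.{0,6}\1)(.)\2{2}(?!.{0,3}\2)(.)\3{1}(?!.{0,1}\3)(.)\4{1}"
)

samples = {
    "0901112233": "valid ABBBCCDD",
    "0901110033": "A reused in the C position",
    "0901112211": "B reused in the D position",
    "0901112222": "C reused in the D position",
}

for number, note in samples.items():
    print(number, bool(abbbccdd.fullmatch(number)), "-", note)
```

Only the first sample matches. Each of the others is stopped by the lookahead belonging to the group whose digit is reused.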

Other examples from the final set:

ABCCDDDD
(.)(?!.{0,6}\1)(.)(?!.{0,5}\2)(.)\3{1}(?!.{0,3}\3)(.)\4{3}

AABBCCDD
(.)\1{1}(?!.{0,5}\1)(.)\2{1}(?!.{0,3}\2)(.)\3{1}(?!.{0,1}\3)(.)\4{1}

ABCCDEDE
09(.)(?!.{0,6}\1)(.)(?!.{0,5}\2)(.)\3{1}(?!.{0,3}\3)(.)(?!\4{1})(.)\4{1}\5{1}
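With around 50 templates, writing each expression by hand gets tedious. As an alternative sketch (not the approach used above), a letter template can be translated mechanically: a new letter becomes a capture that must differ from all earlier captures, and a repeated letter becomes a backreference. The function name `template_to_regex` is hypothetical, and a literal prefix such as 09 would still need to be prepended by hand.

```python
import re

def template_to_regex(template: str) -> str:
    """Translate a letter template such as 'ABCCDEDE' into a regex string."""
    group_of = {}  # letter -> capture group number
    parts = []
    for letter in template:
        if letter in group_of:
            # Seen before: must equal the character captured earlier.
            parts.append(f"\\{group_of[letter]}")
        else:
            # New letter: must differ from every character captured so far.
            if group_of:
                earlier = "|".join(f"\\{n}" for n in group_of.values())
                parts.append(f"(?!{earlier})")
            parts.append("(.)")
            group_of[letter] = len(group_of) + 1
    return "".join(parts)

regex = template_to_regex("ABCCDEDE")
print(regex)
print(bool(re.fullmatch(regex, "01223434")))
print(bool(re.fullmatch(regex, "01220404")))
```

For ABCCDEDE this gives `(.)(?!\1)(.)(?!\1|\2)(.)\3(?!\1|\2|\3)(.)(?!\1|\2|\3|\4)(.)\4\5`. Because uniqueness is checked when each new letter is first captured, no lookahead distances have to be worked out per template.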
Cormac Long