14

I was trying to find words in which a pair of consecutive repeated letters occurs two or three times. I am not able to find a way to do this with a quantifier on a capture group using ERE.

$ grep --version | head -n1
grep (GNU grep) 2.25

$ # consecutive repeated letters occurring twice
$ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed

$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words


Works with -P though

$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed

$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){3}' /usr/share/dict/words
Chattahoochee
McConnell
Mississippi
Mississippian
Mississippians


Thanks to Casimir et Hippolyte for coming up with a simpler input and regex to test this behavior:

$ echo 'aazbb' | grep -E '(([a-z])\2[a-z]*){2}' || echo 'No match'
aazbb
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){2}([a-z])\3[a-z]*' || echo 'No match'
aazbbycc
$ echo 'aazbbycc' | grep -P '(([a-z])\2[a-z]*){3}' || echo 'No match'
aazbbycc

$ # failing case
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){3}' || echo 'No match'
No match

Same behavior seen with sed as well

$ sed --version | head -n1
sed (GNU sed) 4.2.2

$ echo 'aazbb' | sed -E '/(([a-z])\2[a-z]*){2}/! s/.*/No match/'
aazbb
$ echo 'aazbbycc' | sed -E '/(([a-z])\2[a-z]*){2}([a-z])\3[a-z]*/! s/.*/No match/'
aazbbycc

$ # failing case
$ echo 'aazbbycc' | sed -E '/(([a-z])\2[a-z]*){3}/! s/.*/No match/'
No match


Related search links, I checked some of them, but didn't get anything close to this question

If this is solved in a newer version of grep or sed, let me know. Also let me know if the issue is seen in non-GNU implementations.

Sundeep
  • Note also that `echo 'aazbb' | grep -m5 -xiE '(([a-z])\2[a-z]*){2}'` works and `echo 'aazbbycc' | grep -m5 -xiE '(([a-z])\2[a-z]*){3}'` doesn't. I suspect grep silently aborts on patterns that are too complex. – Casimir et Hippolyte Apr 23 '17 at 17:07
  • @CasimiretHippolyte seems like it, thanks for this input.. I will try to search more on these lines today :) – Sundeep Apr 24 '17 at 00:32
  • About your comment on Ed Morton's answer: grep in BRE and ERE modes works in a totally different way (than with -P) that doesn't use the backtracking mechanism (in short, all possible paths are stored and the longest wins). – Casimir et Hippolyte Apr 29 '17 at 14:43
  • @CasimiretHippolyte I didn't know that, thanks :) so could the issue be caused by this implementation difference? Elsewhere, someone found and informed me of this nugget: `echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,3}){3}'` works but `echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,4}){3}'` does not – Sundeep Apr 29 '17 at 15:12
  • Just a heads up, the gnu 3 docs say back references are problematic and could quietly die due to stack overflow. There could also be a recursion limit of 2. –  May 01 '17 at 00:43
  • @sln could you add the exact words from docs? are you referring to `In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.` and `Back-references are very slow, and may require exponential time.` – Sundeep May 01 '17 at 02:10
  • @Sundeep - Yeah, I think that's it. But, about the artificial limit: a lot of the time, writers will set that limit to a default rather than _wait_ that exponential time to find out. Your group 2 construct is simple, however they usually don't make a distinction; it could be complex. I think the problem is backreferences in a nested group construct. You may be able to configure global grep environment parameters to change those, i.e. stack size, recursion limit, etc. Of this, however, I am not sure. –  May 01 '17 at 15:30
  • @sln thanks... `problem is backreferences in a nested construct` yeah I think so too... but with addition that there is also particular sort of quantifiers used inside... will look into env parameters... eventually, I think I will send a mail to bug-grep@gnu.org for clarification – Sundeep May 01 '17 at 15:35
  • @Sundeep - I installed gnugrep32 and ran some tests to narrow it down. I added some results in my updated post. –  May 01 '17 at 19:41

4 Answers

2

I suppose -E doesn't allow Quantifiers, that's why it works only with -P


to match 2 or more consecutive groups of repeated letters:

grep -P '(?:([a-z])\1*([a-z])\2){1}' /usr/share/dict/words

to match 3 or more consecutive groups of repeated letters:

grep -P '(?:([a-z])\1*([a-z])\2){2}' /usr/share/dict/words

Options:

-P, --perl-regexp         PATTERN is a Perl regular expression
Pedro Lobito
  • `-E` switches to ERE (extended regular expressions) and the quantifiers available are: `?`, `*`, `+`, `{n}`, `{m,n}` – Casimir et Hippolyte Apr 23 '17 at 16:03
  • ERE does allow quantifiers... `echo 'ac abc abbc abbbc' | grep -Eo 'ab{1,2}c'` , with groups as well.. `grep -xE '([a-d][r-z]){3}' /usr/share/dict/words` – Sundeep Apr 23 '17 at 16:05
  • I wasn't sure about it, that's why I used *"I suppose"*, I'll update my answer. Tks. – Pedro Lobito Apr 23 '17 at 16:09
  • @PedroLobito thanks for your input.. am interested in why the ERE version is not working, question edited to add more examples and info.. – Sundeep Apr 26 '17 at 09:29
0

Update

After searching around, I installed gnugrep32 on my windows box, then ran
some tests:

I read this from an old SO post:

Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep

So, we use [a-z]{0,20} as a test instead of [a-z]* or [a-z]*? where the ? is ignored (wtf?)

Below are incremental tests using the overall (){n} to see how far it will go before it STOPS BACKTRACKING
into frames.


Min to work

(([a-z])\2[a-z]{0,20}){1}   len = 2    rr
(([a-z])\2[a-z]{0,20}){2}   len = 4    rrrr
(([a-z])\2[a-z]{0,20}){3}   len = 25   rrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){4}   len = 47   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){5}   len = 69   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){6}   len = 91   rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

From {3} to {6} the delta lengths are equal to 22.

This happens to be the full length of the capture frame expression ([a-z])\2[a-z]{0,20}
when it does not backtrack into previous frames.

Conclusion is that it automatically stops backtracking after 2 frames.

It makes sense given that, for example, out of 20 frames, it gets to 16 and finds it cannot match.
Should it go back to frame 1, adjust there, and try it all over again?

Why yes it should.
However, it has now consumed so much memory, the bloated pig has to unwind it all.
This could take forever with this old archaic utility.
Hey, better cap it to 2 frames.

Of course, there is no test case for (([a-z])\2[a-z]*){3} since the greedy quantifier *
will consume the entire line on the second frame if they are all [a-z] and never even
start a third frame.
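
The probing above could be scripted roughly as follows. This is only a sketch: the helper names `rs` and `min_len` and the cap of 120 characters are made up here, and what it prints for larger repeat counts depends on the grep build, so it may not reproduce the table above.

```shell
# Hypothetical helper: print a line made of "count" copies of the letter r.
rs() {
    out=''
    k=0
    while [ "$k" -lt "$1" ]; do
        out="${out}r"
        k=$((k + 1))
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: print the shortest all-r length (up to 120) that
# (([a-z])\2[a-z]{0,20}){n} matches, or "none" if nothing matched.
min_len() {
    len=1
    while [ "$len" -le 120 ]; do
        if rs "$len" | grep -qE "(([a-z])\\2[a-z]{0,20}){$1}"; then
            echo "$len"
            return 0
        fi
        len=$((len + 1))
    done
    echo 'none'
}

# The two smallest cases; try 3 to 6 to compare with the table above.
for n in 1 2; do
    echo "{$n}: len = $(min_len "$n")"
done
```

Calling `min_len 3` through `min_len 6` on the grep under test shows whether the minimum length still jumps by 22 per extra repetition.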

  • @Sundeep - What's the `E`RE stand for? Maybe you didn't see this part of my answer `It didn't work with (([a-z])\2[a-z]*){3}` but go ahead and downvote all you want, I don't care. –  Apr 28 '17 at 19:08
  • I didn't downvote... ERE is **Extended Regular Expressions** you can see [this manual](http://www.gnu.org/software/grep/manual/html_node/Regular-Expressions.html#Regular-Expressions) for documentation... so, if you use `echo 'aazbbycc' | grep -E '(?:([a-z])\1[a-z]*){2}'` you'd get syntax error on GNU grep – Sundeep Apr 29 '17 at 02:05
  • hey, can you explain this bit more? how are you testing this and what do you mean by frame? looks like you've found a way to know what is happening within the engine... also, [Casimir mentioned in comments](https://stackoverflow.com/questions/43572924/ere-adding-quantifier-to-group-with-inner-group-and-back-reference?noredirect=1#comment74438627_43572924) that `grep doesn't use the backtracking mechanism (in short all possible paths are stored and the longest wins)` – Sundeep May 02 '17 at 04:32
  • @Sundeep - It makes absolutely no sense. `echo "aabbXcc" | grep -E "(([a-z])\2.{0,3}){3}"` gives no match, but `echo "aabbXXcc" | grep -E "(([a-z])\2.{0,3}){3}"` gives `aabbXXcc`. –  May 02 '17 at 21:15
0
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words

Because you are searching for a repeated group (the same text twice) that contains at least one double letter, something like abbcabbc [(...) = "abbc" 2 times], and not 2 (possibly similar) groups that each contain a double letter, like abbcdeef.

With 2 back-references:

$ grep -iE '[a-z]*([a-z])\1{1,}[a-z]*([a-z])\2{1,}[a-z]*'
NeronLeVelu
  • I am not able to understand your answer... `([a-z]*([a-z])\2[a-z]*){2}` is aimed to match words like `Abbott`, `Annette` etc... not only `abbcabbc`... I don't see how that can be done without using back-reference... `[a-z]{2,}` means any letter 2 or more times... which will match `ab` or `xyz` also.. not restricted to repeated letters like `ee` or `oo` – Sundeep May 02 '17 at 08:51
  • you are right, sorry. The back ref is mandatory (I was busy on another project with the same kind of issue but using a variable [so always the same content]). I removed the first (non-back-referenced) regex – NeronLeVelu May 02 '17 at 14:06
  • no probs.... the expanded version would be `[a-z]*([a-z])\1[a-z]*[a-z]*([a-z])\2[a-z]*` as already mentioned in question... or `[a-z]*([a-z])\1[a-z]*([a-z])\2[a-z]*` to remove redundant character class in middle... for 3 such repeated pairs, `[a-z]*([a-z])\1[a-z]*([a-z])\2[a-z]*([a-z])\3[a-z]*` and so on... my question is why `([a-z]*([a-z])\2[a-z]*){2}` or `([a-z]*([a-z])\2[a-z]*){3}` won't work... because that is much compact and clearer to write – Sundeep May 02 '17 at 14:16
  • probably my wording wasn't clear in question.. but I am not searching for `([a-z])\1{1,}` ... `ee` or `oo` is enough, no need for `eee` or `eeee` etc... `Abbott` contains two repeated letters `bb` and `tt`... and `Chattahoochee` contains three pairs... `tt` and `oo` and `ee` – Sundeep May 02 '17 at 14:18
  • *.. my question is why* see begin of reply. mainly because you search twice the same *whole* pattern not 2 group containing each a double letter inside. (exact opposite error than my first reply) – NeronLeVelu May 03 '17 at 12:59
  • no... `echo 'abbcdeef' | grep -P '([a-z]*([a-z])\2[a-z]*){2}'` will match because it is the pattern itself which is repeated not the matched pattern... ERE either has bug or limitation because of which this example doesn't work – Sundeep May 03 '17 at 13:17
  • `echo "aacddf" | grep -E '(.*([a-z])\2.*)\1'` failed and `echo "aacddf" | grep -E '(.*([a-z])\2.*)(.*([a-z])\4.*)'` works. so it's the -P that have a different behavior (in my old 3.6 version it's still experimental with warning about behavior) – NeronLeVelu May 03 '17 at 13:59
  • are you still saying that quantifier on capture group repeats the matched string and not the regex itself? refer to examples in second half of my question... also, for simpler case, `echo 'aacddf' | grep -xE '(([a-z])\2[a-z]){2}'` and `echo 'aacddfeex' | grep -xE '(([a-z])\2[a-z]){3}'` – Sundeep May 03 '17 at 14:18
  • in fact, it seems that it is the first `[a-z]*` that causes the problem. If you use just `[a-z]` on `abbcdeef` it works. The regex tries to take the longest possible expression, and in this case the longest seems to discard the 'small' sub-pattern case. `echo 'abbcdeef' | grep -xE '([a-z]{1,5}([a-z])\2[a-z]*){2}'` fails where `{1,4}` works. So you are right about `(...){2}` in general, but there is an internal limitation (bug?) that prevents using it without restrictions. – NeronLeVelu May 04 '17 at 05:15
  • yeah, and I will try asking about in mailing list and update here if I get an answer – Sundeep May 04 '17 at 06:02
0

I filed an issue https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864 and the manual is now updated to reflect such issues.

From https://www.gnu.org/software/grep/manual/grep.html#Known-Bugs:

Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.
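
Until this is fixed, one possible workaround is the expanded form mentioned in the comments on another answer: unroll the repetition so that every pair gets its own group and back-reference, instead of putting a quantifier on a group that contains one. A minimal sketch, where the function name `unrolled_pairs` is made up for illustration; ERE back-references only go from `\1` to `\9`, so this assumes at most nine pairs.

```shell
# Build an ERE that requires n doubled-letter pairs by unrolling the
# repetition: pair i becomes ([a-z])\i[a-z]* with its own group number.
# Hypothetical helper; assumes 1 <= n <= 9.
unrolled_pairs() {
    n=$1
    pattern=''
    i=1
    while [ "$i" -le "$n" ]; do
        pattern="${pattern}([a-z])\\${i}[a-z]*"
        i=$((i + 1))
    done
    printf '%s\n' "$pattern"
}

# Show the generated pattern, then use it on the failing input.
unrolled_pairs 3
echo 'aazbbycc' | grep -E "$(unrolled_pairs 3)" || echo 'No match'
```

For whole-word matching against a dictionary, the same pattern can be prefixed with `[a-z]*` and used with `-x`, as in the first command of the question.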

Sundeep