Grep or the like: overlapping matches

Question

For:

echo "the quick brown fox" | grep -Po '[a-z]+ [a-z]+'

I get:

the quick
brown fox

but I wanted:

the quick
quick brown
brown fox

How?

I'm no expert, but I don't think you can do this with grep. You should try writing a perl or awk script. — gmoshkin, Jun 13 '17 at 14:40
With Perl, you can do it easily, not with grep, because grep does not allow access to capturing group contents. — Wiktor Stribiżew, Jun 13 '17 at 14:45
If you use something where you can print the groups, the overlap pattern is `([a-z]+)(?=( [a-z]+))` where you print `$1$2` — , Jun 13 '17 at 14:52
What is the final goal? To match each consecutive pair of words? Could there be an odd number of words, or some digits in between? — RomanPerekhrest, Jun 13 '17 at 15:17
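
The lookahead pattern suggested in the comments above can be tried in any tool that exposes capture groups. Here is a minimal Python sketch of the idea; the function name `overlapping_pairs` is hypothetical and used only for illustration. Only the first word is consumed, while the second word is captured inside a lookahead, so it is still available as the first word of the next match.

```python
import re

def overlapping_pairs(text):
    # Group 1 consumes the first word. Group 2 is captured inside a
    # lookahead, so it is not consumed and can start the next match.
    pattern = re.compile(r'([a-z]+)(?=( [a-z]+))')
    # Joining the two groups corresponds to printing $1$2.
    return [m.group(1) + m.group(2) for m in pattern.finditer(text)]

for pair in overlapping_pairs("the quick brown fox"):
    print(pair)
```

For the input in the question this should print the three pairs, one per line.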

tso · Accepted Answer · 2017-06-19T10:31:32.823

2

with awk:

 awk '{for(i=1;i<NF;i++) print $i,$(i+1)}' <<<"the quick brown fox"

update: with python:

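#!/usr/bin/python3.5
import re

s = "the quick brown fox"
# The whole pair sits inside a lookahead, so each match is zero-width and
# the next search can start inside the pair that was just reported.
matches = re.finditer(r'(?=(\b[a-z]+\b \b[a-z]+\b))', s)
ans = [m.group(1) for m in matches]
print(ans)  # optional: show the whole list
for pair in ans:
    print(pair)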

output:

['the quick', 'quick brown', 'brown fox']
the quick
quick brown
brown fox

edited Jun 19 '17 at 10:31

answered Jun 17 '17 at 18:52

tso


The question was intended in a more general sense. I guess I didn't really make this clear. Put generally, I want to run regexes over inputs in such a way that all possible matches are returned. The behaviour I'm seeing doesn't return "quick brown" even though it's a valid match. – Adrian May Jun 18 '17 at 20:16
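
For the more general case raised in this comment, one possible approach is to attempt a match at every start position instead of resuming after the previous match. The sketch below is an assumption about what "all possible matches" could mean, not a definitive answer: it returns at most one (greedy) match per start position, and the helper name `matches_at_every_start` is hypothetical.

```python
import re

def matches_at_every_start(pattern, text):
    regex = re.compile(pattern)
    results = []
    # Try to anchor a match at each index, including the end of the text.
    for start in range(len(text) + 1):
        m = regex.match(text, start)
        # Skip failed attempts and empty matches.
        if m and m.end() > m.start():
            results.append(m.group(0))
    return results

# With word boundaries, only whole-word pairs are reported.
print(matches_at_every_start(r'\b[a-z]+ [a-z]+\b', "the quick brown fox"))
# Without them, partial words such as "b cd" are valid matches too.
print(matches_at_every_start(r'[a-z]+ [a-z]+', "ab cd"))
```

Note that a regex applied this way still reports only one match length per start position, so shorter alternatives at the same position are not listed.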

waldir · Answer 2 · 2019-02-16T02:04:59.030

Simply reusing the original solution to get the Markov chain:

echo "the quick brown fox" | grep -Po '[a-z]+ [a-z]+'
echo "the quick brown fox" | sed 's/^[a-z]* //' | grep -Po '[a-z]+ [a-z]+'

The second line (specifically the sed command) removes the first word of the input, so the rest of the command generates the missing pairs.
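
The same two-pass idea can be sketched in Python for comparison. This is only an illustration under the assumption that the input is a single line of lowercase words; the name `two_pass_pairs` is hypothetical. Note that the pairs come out grouped by pass, not in the order they appear in the input.

```python
import re

PAIR = re.compile(r'[a-z]+ [a-z]+')

def two_pass_pairs(text):
    # First pass: non-overlapping pairs starting at the first word.
    first_pass = PAIR.findall(text)
    # Second pass: drop the first word (like the sed command), then
    # collect the pairs that the first pass skipped over.
    shifted = re.sub(r'^[a-z]* ', '', text)
    second_pass = PAIR.findall(shifted)
    return first_pass + second_pass

print(two_pass_pairs("the quick brown fox"))
```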

The same approach could also be generalized using sed's ability to run loops:

 echo pattern1pattern2 | sed ':start;s/\(pattern1\)\(pattern2\)/<\1|\2>\2/;t start' | grep -o '<[^>]*>' | tr -d '<>|'

This solution works with partially overlapping patterns, where pattern2 can be overlapped by the next match. It assumes that <, > and | are reserved auxiliary characters. Furthermore, it assumes that the pattern1pattern2 regex cannot match any string that is matched by pattern2 alone.

The sed command substitutes pattern1pattern2 with <pattern1|pattern2>pattern2 and repeats this substitution as long as any matches are found (the branching t command allows matching previously substituted strings, unlike the g option). That is, every iteration leaves behind one <pattern1|pattern2> group marking a match, while the trailing instance of pattern2 can still be matched as part of the next match. Finally, we pick out the groups using the original approach and strip the auxiliary marks.
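
To make the loop easier to follow, here is a rough Python transcription of the sed pipeline. It is a sketch under the same assumptions as above (the characters <, > and | do not occur in the input, and pattern1pattern2 cannot match what pattern2 alone matches), plus one more: neither pattern contains capturing groups of its own. The function name `marked_overlaps` is hypothetical.

```python
import re

def marked_overlaps(text, pattern1, pattern2):
    combined = re.compile('(' + pattern1 + ')(' + pattern2 + ')')
    while True:
        # One substitution per iteration, like sed's s command without g;
        # the loop plays the role of ':start ... t start'.
        text, count = combined.subn(r'<\1|\2>\2', text, count=1)
        if count == 0:
            break
    # Equivalent of grep -o '<[^>]*>' followed by tr -d '<>|'.
    return [g.replace('|', '') for g in re.findall(r'<([^>]*)>', text)]

print(marked_overlaps("the quick brown fox", r'[a-z]+ ', r'[a-z]+'))
```

If the stated assumptions do not hold, the loop may not terminate, which is also true of the sed version.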

`-Po` means the same thing OP meant (i.e., Perl Regexp for -P, and "output match only" for -o). The regexes are exactly the same ones too. However, I will add explanation of the sed... — waldir, Feb 16 '19 at 00:51

score 0 · Answer 3 · answered Jun 17 '17 at 21:48

another awk:

awk '{print $1,$2 RS $2,$3 RS $3,$4}' <<<"the quick brown fox"

    the quick
    quick brown
    brown fox

Claes Wikner


My comment above from 30 seconds ago applies here too. – Adrian May Jun 18 '17 at 20:16
