
I wrote a Regex using pcregrep, and everything behaved as expected until I added a positive lookahead.

Scenario:

I have the following text file:

a
b
c
a
c

Goal:

I want to use a regex with pcregrep to return a line containing a and a line containing c, but only when a line containing b sits between them; the b line itself should not be returned. The regex would match the first three lines (a, b, c) and return the first (a) and third (c). It would not match the fourth and fifth lines, because there is no b line between them. The output would be:

a
c

What I've tried:

If I run pcregrep -M 'a\nb\nc\n' (command 1), this captures and returns:

a
b
c

as expected. I now want to modify this so that the b line is required but not returned, using a positive lookahead. I tried pcregrep -M 'a\n(?=(b\n))c\n' (command 2), but this returns nothing.

My question:

Why does command 2 not return the expected output, where command 1 does? How can I return the desired result? I know there are ways to do this other than pcregrep, but please note that I want to use pcregrep because I'll be extending the functionality to solve similar problems.

halfer
gkeenley
  • Keep in mind that when using a lookahead, you do __not match__ the characters in the lookahead. You only assert (without consuming any characters) that the lookahead pattern is there. You _still have to __match__ the entire pattern_; the lookahead does __not__ match, it only asserts. – K.Dᴀᴠɪs May 31 '19 at 18:56
  • @K.Dᴀᴠɪs Understood, thanks for that. So I'm trying to use a non-capturing group now, like this: pcregrep -M 'a\n(?:(b\n))c\n'. This, however, still returns 'a', 'b', 'c'. Do you know how I can get it to return just 'a' 'c'? – gkeenley May 31 '19 at 18:59

2 Answers


You can use two capture groups with the -o option:

pcregrep -M -o1 -o2 '(a\n)b\n(c)\n' file

a
c

Details:

  • (...): defines a capturing group
  • -o1 -o2: prints only capture groups #1 and #2

Note that your regex a\n(?=(b\n))c\n won't work because a lookahead is only an assertion: it matches with zero width and consumes no characters. Your regex asserts the presence of b\n after a\n, which succeeds, but it then attempts to match c\n at the position right after a\n, where the text is b\n, and this is where matching fails.
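To make the zero-width behaviour concrete, here is a minimal sketch that uses Python's `re` module as a stand-in for pcregrep. This is an assumption made for illustration: the two engines are different, but lookaheads and capture groups behave the same way for these patterns. The variable names are hypothetical.

```python
import re

# Same content as the text file in the question.
text = "a\nb\nc\na\nc\n"

# Command 2's pattern: the lookahead checks for "b\n" but consumes nothing,
# so the engine then tries to match "c\n" at the position where "b" sits.
lookahead = re.search(r"a\n(?=(b\n))c\n", text)

# The two-group pattern: consume the b line, capture only the a and c lines.
grouped = re.search(r"(a\n)b\n(c)\n", text)

print(lookahead)                             # None
print(grouped.group(1) + grouped.group(2))   # prints "a" and "c" on two lines
```

Printing groups 1 and 2 here plays the role that `-o1 -o2` plays on the pcregrep command line.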

anubhava
  • Yes I have written in my answer that `-o` prints `only-matching` part of matched string and `-oN` prints capture group #N – anubhava May 31 '19 at 19:12
  • Last clarification, what if instead of just having a single line with `b` in the middle, I wanted to have an arbitrary number of `b` lines? At the moment I just have `b\n`, but to have zero or more lines like that, I think I'd have to do `(b\n)*`, right? But I don't want the `b\n` inside the parentheses to be captured. How would I do that with the strategy you've given me? – gkeenley May 31 '19 at 19:37
  • ...would I have to consider it a capturing group as well, and then do `-o1 -o3` in order to ignore the middle one? – gkeenley May 31 '19 at 19:38
  • ...also how does the Regex engine treat it if you have nested parentheses like (a(abc)*) ? Would the outer one be -o1 and the inner be -o2? – gkeenley May 31 '19 at 19:41
  • 1. You can use a non-capturing group to keep the group numbering the same: `pcregrep -M -o1 -o2 '(a\n)(?:b\n)+(c)\n' f`. 2. For nested parentheses, groups are numbered by the position of their opening parenthesis. – anubhava May 31 '19 at 19:53
  • one more question: if I want to capture a line that does not start with 'a', I've been able to do that with pcregrep -M '^[^a]'. However, if all my lines start with letters surrounded by quotes, like "a", "b", "c" rather than a, b, c, pcregrep -M '^[^\"a\"]' doesn't work. Do you know why? – gkeenley May 31 '19 at 20:37
  • You can use: `pcregrep -M '^"[^a]'` – anubhava Jun 01 '19 at 04:23
  • What if I want to print a separator between the `-o` groups? For example: if `-o1` matches 'hello' and `-o2` matches 'world', I want to print "hello,world". Is it possible? – Ahmed Hussein Jan 23 '20 at 13:11
  • I figured out how to do it: it can be done with `--om-separator=","` – Ahmed Hussein Jan 23 '20 at 13:17
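The two points about group numbering in the comments above can be checked with a short sketch. It again uses Python's `re` as a stand-in for pcregrep (an assumption for illustration; non-capturing groups and numbering by opening parenthesis work the same way in both). The sample strings are made up for the demonstration.

```python
import re

# (?:...) groups without capturing, so the c line stays group 2
# even when the b line is allowed to repeat.
m = re.search(r"(a\n)(?:b\n)+(c)\n", "a\nb\nb\nb\nc\n")
print(m.groups())  # ('a\n', 'c')

# Groups are numbered by the position of their opening parenthesis:
# the outer group is 1, the inner group is 2. A repeated group keeps
# only the text of its last repetition.
n = re.search(r"(a(abc)*)", "aabcabc")
print(n.group(1), n.group(2))  # aabcabc abc
```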

"Why does command 2 not return the expected output, where command 1 does?"

Because command 2 is a different expression.

(?=…) is a ZERO WIDTH lookahead

What you specified is: I want an a, followed by a linefeed, followed by a b, followed by a linefeed. At that same position (right after the a line) I also want a c followed by a linefeed. Both cannot hold at once, so nothing matches.

P.S. To get just the a and c lines, maybe this will help:

pcregrep -M 'a\nb\nc\n' | pcregrep -M 'a|c'
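The two-stage idea can be sketched as follows, with Python's `re` standing in for the two pcregrep invocations (an assumption for illustration; `stage1` and `stage2` are hypothetical names). The first stage finds the three-line block, and the second keeps only the lines of that block that contain a or c.

```python
import re

# Same content as the text file in the question.
text = "a\nb\nc\na\nc\n"

# Stage 1: find the a/b/c block, as the first command in the pipe does.
stage1 = re.search(r"a\nb\nc\n", text).group(0)

# Stage 2: filter the lines of that block, as the second command does.
stage2 = [line for line in stage1.splitlines() if re.search(r"a|c", line)]

print(stage2)  # ['a', 'c']
```

Note that the second pattern is unanchored, so it keeps any line that contains an a or a c anywhere, not only lines that consist of exactly that letter.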

Skeeve
  • Understood, thanks for that. So I'm trying to use a non-capturing group now, like this: pcregrep -M 'a\n(?:(b\n))c\n'. This, however, still returns 'a'\n, 'b'\n, 'c'\n. Do you know how I can get it to return just 'a'\n 'c'\n? – gkeenley May 31 '19 at 19:03
  • I've added a proposal in my answer @gkeenley – Skeeve May 31 '19 at 19:05
  • This does work! Could you explain the logic behind how the "|" works in that? – gkeenley May 31 '19 at 19:12
  • The second `pcregrep` can be replaced by `grep -E '^(a|c)$'`, but IMO using more than one command to achieve this is inefficient – anubhava May 31 '19 at 19:20
  • @gkeenley the first pcregrep spits out to stdout. The pipe ("|") feeds the stdout as stdin to the second pcregrep. – Skeeve Jun 01 '19 at 22:14
  • @anubhava - Sure… But as I do not know pcregrep this is the best solution I could think of besides going to perl instead… – Skeeve Jun 01 '19 at 22:15