I am trying to search with a regex that uses a lookahead, but it's not working in pcregrep or grep.

I want to search for sections

  • which may span multiple lines,
  • which start with PQXY at the beginning of a line,
  • which end with OFEJ at the end of a line, and
  • which do not contain either PQXY or OFEJ in between.

Generally I use the following in Sublime Text's find, and it works well:

(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)

Now I want to find the count of such occurrences, so I am trying to use grep or pcregrep, but neither is working.

pcregrep -c "(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)" file.txt
zsh: event not found: PQXY|OFEJ).)

and with grep

$ grep -c -zoP "(?s)(^PQXY(?:(?!PQXY|OFEJTRANS).)*OFEJTRANS\n)" CB_raw_testing_21_feb_CORRECTIONS_0002.txt
zsh: event not found: PQXY|OFEJTRANS).)

How can I do this?

Answer based on @paxdiablo and @anubhava.

The main error was using double quotes instead of single quotes, as addressed by @paxdiablo. With that fixed, the command runs but reports no matches:

$ pcregrep -c -M '(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt 
0

The regex fix is to add (?s), as suggested by @anubhava. Of course, \n also works instead of (\R|\z).

$ pcregrep -c -M '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt
11726
Santhosh

2 Answers

zsh: event not found: PQXY|OFEJ).)

Since it's zsh raising the error, it's almost certainly because zsh is trying to process the ! within the double quotes as a history expansion. To protect the pattern from that, use single quotes, such as:

pcregrep -c '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' file.txt

I don't have pcregrep installed but here's a transcript showing the problem with just echo:

pax> echo "(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)"
zsh: event not found: PQXY|OFEJ).)

pax> echo '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)'
(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ)

In terms of solving the problem rather than using a specific tool, I would actually opt for awk(a) in this case. You can do something like:

awk '/^PQXY/     { s = $0; c = 1; next}
     /OFEJ$/     { if (c == 1) { print s""ORS""$0; c = 0 }; next }
     /OFEJ|PQXY/ { c = 0; next }
     c == 1      { s = s""ORS""$0 }' inputFile

This works by using a string (s) to hold the lines collected so far and a flag (c) to record whether you're currently collecting. Both start out unset, which awk treats as an empty string and zero.

Then, for each line:

  • If it starts with PQXY, store the line and set the collection flag, then go to next input line.
  • Otherwise, if it ends with OFEJ and you're collecting, output the collected section and stop collecting, then go to next input line.
  • Otherwise, if it has either of the strings in it, stop collecting, move to next input line.
  • Otherwise, if collecting, append current line and move (implicitly) to next input line.
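Since the question asks for a count rather than the sections themselves, the same state machine can be reduced to a counter. The following is only a sketch under that assumption: the function name count_sections and the counter n are made up for illustration, and the matching rules are the ones listed above.

```shell
# Count complete sections instead of printing them.
# count_sections and n are illustrative names, not part of the answer above.
count_sections() {
    awk '
        /^PQXY/     { c = 1; next }                    # section starts, begin collecting
        /OFEJ$/     { if (c == 1) n++; c = 0; next }   # section ends, count it if collecting
        /OFEJ|PQXY/ { c = 0 }                          # stray marker, abandon the section
        END         { print n + 0 }                    # n + 0 prints 0 when nothing matched
    ' "$@"
}

# The second section is abandoned because of the stray "OFEJ x" line.
printf 'PQXY 1\nabc\n2 OFEJ\nPQXY 2\nOFEJ x\n3 OFEJ\n' | count_sections   # prints 1
```

With no arguments the function reads standard input; with a file name, as in count_sections file.txt, it reads that file.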

I've tested this with some limited test data and it seems to work okay. Here's the bash script(b) I used for testing. You can add as many test cases as you need to be comfortable that it solves your problem.

for i in \
    "PQXY 1\nabc\n2 OFEJ\n" \
    "PQXY 1\nabc\n2 OFEJx\n" \
    "PQXY 1\nabc\n  PQXY \n2 OFEJ\n" \
    "PQXY 1\nabc\n  OFEJ \n2 OFEJ\n" \
    "PQXY 1\nabc\ndef\nPQXY 2\n2 OFEJ\n" \
; do
    echo "$i:"
    printf "$i" | awk '
        /^PQXY/     { s = $0; c = 1; next}
        /OFEJ$/     { if (c == 1) { print s""ORS""$0; c = 0 }; next }
        /OFEJ|PQXY/ { c = 0; next }
        c == 1      { s = s""ORS""$0 }' | sed 's/^/    /'
done

Here's the output so you can see it in action:

PQXY 1\nabc\n2 OFEJ\n:
    PQXY 1
    abc
    2 OFEJ
PQXY 1\nabc\n2 OFEJx\n:
PQXY 1\nabc\n  PQXY \n2 OFEJ\n:
PQXY 1\nabc\n  OFEJ \n2 OFEJ\n:
PQXY 1\nabc\ndef\nPQXY 2\n2 OFEJ\n:
    PQXY 2
    2 OFEJ

(a) In my experience, if you've tried three things with a grep-style regex without success, it's usually faster to move to a more advanced tool :-)


(b) Yes, I know it's written in bash rather than zsh but that's because:

  • it's a test program to show you that awk works, hence the language used is irrelevant; and
  • I'm far more comfortable with bash than zsh :-)
paxdiablo
  • It solves the error, but the count is shown as zero, which is not correct. Can you check the updated question? – Santhosh Feb 24 '20 at 05:32
  • @Santhosh, the regex being wrong is a *different* issue, unrelated to the error you got. You should generally ask *one* question per question and, if there are more problems, ask another question. That both keeps questions and answers compatible and better targets specific problems for people searching in future. – paxdiablo Feb 24 '20 at 05:36
  • You answered the main question. The regex part should be another question. – Santhosh Feb 24 '20 at 05:52
  • Thanks for the awk solution. I love to do things with awk, and it's only because of lookaheads that I use perl. – Santhosh Feb 24 '20 at 05:53
  • Thanks for showing how to use the awk with sample data. – Santhosh Feb 24 '20 at 06:12

Using GNU grep:

grep -ozP '(?ms)^PQXY(?:(?!PQXY|OFEJ).)*OFEJ(\R|\z)' file
  • You must use the -z option to treat input and output data as sequences of lines, each terminated by a zero byte.

  • Make sure to use single quotes for your pattern so that the shell's history expansion doesn't attempt to process !.

  • Added the (?m) (MULTILINE) modifier to allow ^ and $ in the regex to match at each line.
  • Used (\R|\z) so the pattern can still match when the file does not end with a newline. \R matches any kind of line break, including Unicode line breaks, and \z matches the end of input.
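One thing to watch when counting: -c counts matching records, and under -z a file with no NUL bytes is a single record, so grep -c -z reports at most 1 for such a file. A possible workaround, sketched here on the assumption that GNU grep terminates each -o match with a NUL byte when -z is in effect, is to count those NUL bytes. The function name count_matches is made up for illustration.

```shell
# Count regex matches rather than matching records.
# count_matches is an illustrative name; requires GNU grep built with -P support.
count_matches() {
    grep -ozP '(?ms)^PQXY(?:(?!PQXY|OFEJ).)*OFEJ(\R|\z)' "$@" |
        tr -cd '\0' |   # keep only the NUL terminators, one per match
        wc -c           # the number of NUL bytes is the number of matches
}

printf 'PQXY 1\nabc\n2 OFEJ\nzzz\nPQXY 2\n3 OFEJ\n' | count_matches
```

Run against a file it would be count_matches file.txt. Some wc implementations pad the number with leading spaces, so compare it numerically rather than as a string.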

Working Demo


Equivalent solution in pcregrep

pcregrep -M '(?s)^PQXY(?:(?!PQXY|OFEJ).)*OFEJ(\R|\z)' file

-M enables the multiline option in pcregrep.

anubhava
  • It solves the error but shows the wrong results. The `(?:(?!PQXY|OFEJ).)` is not working. – Santhosh Feb 24 '20 at 05:36
  • @SanthoshYedidi: I have added a working demo of above solution https://ideone.com/lAKMJi – anubhava Feb 24 '20 at 05:40
  • Yes, the pcregrep command works as per your answer. I updated the question with what is working for me. But grep is not working. – Santhosh Feb 24 '20 at 05:54
  • I am checking it with some sample text. – Santhosh Feb 24 '20 at 06:02
  • Since your main question of `event not found` got answered, I suggest you post a different question to solve secondary problem :) – anubhava Feb 24 '20 at 06:03
  • https://pastebin.com/hUbce498 (this is a sample text). Checked with `grep -c -zoP '(?ms)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ(\R|\z))' test.txt`, it shows `1`, whereas `pcregrep -c -M '(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)' test.txt` shows the correct result, `2`. – Santhosh Feb 24 '20 at 06:08
  • Here I am still using the same regex, `(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)`, which I used in Sublime Text. In the question I was not able to show that I had already tried `pcregrep -c -M "(?s)(^PQXY(?:(?!PQXY|OFEJ).)*OFEJ\n)" test.txt`, which failed due to the double quotes. That failure made me think I was not using the right regex. When you showed it working for you, I got my faith back. But with grep it's not working. – Santhosh Feb 24 '20 at 06:15
  • I have already shown in a demo that is working fine and since you've accepted other answer without solving your actual problem there is no point in having extended discussion in comments section. I suggest you use proposed awk solution. – anubhava Feb 24 '20 at 06:22