Suppose we are running a multiline regex search over a bunch of files with grep and want to extract the individual matches. By default, grep separates the matches it outputs with newlines, but multiline matches contain newlines themselves, so the individual matches cannot easily be told apart.

Example

grep -rzPIho '}\n\n\w\w\b' | od -a

Depending on the files in your file tree, this may yield output like:

0000000   }  nl  nl   m   y  nl   }  nl  nl   i   f  nl   }  nl  nl   m
0000020   y  nl   }  nl  nl   m   y  nl   }  nl  nl   i   f  nl   }  nl
0000040  nl   m   y  nl
0000044

As you can see, we cannot split on newlines to obtain the matches for further processing, since the matches contain newline characters themselves.

What doesn't work

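The --null (or -Z) option only changes how file names are terminated in the output, which is useful in conjunction with -l, where grep lists file names instead of matches. It does not separate the matches themselves, so it doesn't help here.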

Note that this is not a duplicate of "Is there a grep equivalent for find's -print0 and xargs's -0 switches?", because the requirements in that question are different, allowing it to be answered using alternative techniques.

So, how can we make this work? Maybe by using grep in conjunction with other tools?

chtenb
  • I'm going to go with "you can't" here if `grep` itself can't do that for you (and who's to say you don't have `NUL` in your matched data to begin with). You've abused `grep` a fair bit to make this work already. I'd either use the `od` (or similar) output or use a tool other than `grep` that you can control the output with better (`awk` or `perl` or whatever). – Etan Reisner Mar 17 '16 at 17:06
  • A tuple of file name, byte offset, and match length would allow you to collect the actual matches when you need them. I don't think this is doable with `grep` but implementing this in Python or Perl should not be hard. – tripleee Mar 17 '16 at 17:13
  • Can you add sample text to your question, and expected output? I'd also recommend using `awk` for this. – miken32 Mar 17 '16 at 18:46
  • @EtanReisner Yes, our files could contain null characters, but for ASCII source code (I usually grep source code) null characters are not very common, while newline characters are :) Also, I don't feel I abused grep to do this. Grep is a pattern matching engine, and I just want to match patterns and extract them. – chtenb Mar 18 '16 at 09:22
  • The problem is now solved definitively: http://stackoverflow.com/a/36090268/1546844 – chtenb Mar 18 '16 at 17:06

3 Answers

I filed this issue as a feature request on the GNU grep bug mailing list, and it turned out to be a bug in the code.

It has been fixed and pushed to master, so it will be available in the next release of GNU grep: http://git.savannah.gnu.org/cgit/grep.git/commit/?id=cce2fd5520bba35cf9b264de2f1b6131304f19d2

To summarize: this patch makes sure that the -z flag not only works in conjunction with -l, but also with -o.
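
Once the matches are NUL-terminated, they can be split safely even though they contain newlines. The following is a rough sketch of one way to consume them, not part of the patch itself. The function name print_matches is hypothetical, and the input is simulated with printf so that the example does not depend on which grep version is installed:

```shell
# print_matches (hypothetical helper): read NUL-terminated records from stdin
# and report each one. xargs -0 splits on NUL bytes only, so newlines inside
# a match are preserved. For very large inputs xargs may run the command
# more than once, in which case the numbering restarts.
print_matches() {
    xargs -0 sh -c '
        i=0
        for m in "$@"; do
            i=$((i + 1))
            printf "match %d: length %d\n" "$i" "${#m}"
        done
    ' sh
}

# With a grep that contains the fix, the pipeline would presumably be:
#   grep -rzPIho '}\n\n\w\w\b' | print_matches
# Simulated output of such a grep: two matches, each followed by a NUL byte.
printf '}\n\nmy\0}\n\nif\0' | print_matches
```

Inside the loop, each match is available as an ordinary shell string in "$m", so any further processing can go there.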

chtenb

What comes to mind is to use a group separator, for example something like:

grep -rzPIoH '}\n\n\w\w\b' "$FILE" | sed "s/^$FILE:/\x0/"
bufh
  • Yeah, that looks pretty straightforward. Not entirely foolproof if you have very short files, but +1. Needs some extra logic when processing multiple files, btw – chtenb Mar 18 '16 at 10:10
  • Yes, this is not entirely foolproof and needs to be improved; it also depends on the content of the file. Please let us know if you come up with a better solution :^) – bufh Mar 18 '16 at 10:12
  • Check my last answer – chtenb Mar 18 '16 at 17:06

Here is another way to do this, which should be more foolproof than what @bufh posted, but which is also more complicated and slower.

$ grep -rIZl '' --include='*.pl'| xargs -0 cat | dos2unix | tr '\n\0' '\0\n' \
      | grep -Pao '}\x00\x00\w\w\b' | tr '\0\n' '\n\0' | od -a

The dos2unix step is only needed when the files have Windows line endings. The trick is to swap null bytes and newlines in the input, have grep match on null bytes instead of newlines, and then swap them back in the output, which yields:

0000000   }  nl  nl   m   y  nul   }  nl  nl   i   f  nul   }  nl  nl   m
0000020   y  nul   }  nl  nl   m   y  nul   }  nl  nl   i   f  nul   }  nl
0000040  nl   m   y  nul
0000044
chtenb