
I have a text file with inconsistent formatting, but the relevant sections look like:

     CDS             complement(99074..99808)
                     /note="important in cell to cell spread of the virus, a
                     tegument protein"
                     /codon_start=1

As part of an existing bash pipeline, I need to remove every /note="anything" qualifier, including values that wrap onto following lines, to get:

     CDS             complement(99074..99808)
                     /codon_start=1

I've tried several ways to invert the match, as grep -v does, but the closest only works when the match does not span multiple lines:

perl -ne '/\/\bnote\b\="[^"]+"/||print' file.txt

I can match the strings I wish to remove by checking with the following perl one-liner, but so far I cannot combine the two methods to invert the match and remove the strings that span multiple lines:

perl -0777 -ne 'print "$1\n" while ( /(\s+\/\bnote\b\="[^"]+")/sg )' file.txt

Running the first one-liner with -0777 produces no output.

1 Answer


The simple approach is to read the entire stream into memory. You do this by telling Perl to treat the whole file as a single line, using -0777 or, in Perl 5.36 and later, its alias -g.

perl -0777pe's{^\s*/note="[^"]*"\n}{}mg'

Doing it a line at a time is more complicated since it requires a flag to indicate whether we're in the string or not.

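perl -ne'
   $f ||= m{^\s*/note="};   # set the flag on the line that opens a /note value
   print if !$f;            # print only lines outside a /note value
   $f &&= !m{"$};           # clear the flag once a line ends with the closing quote
'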
ikegami