I have a text file that contains 2 lines of a sample DNA sequence. Using pcregrep, I want to find patterns matching "CCC", especially those that span multiple lines (see the end of line 1 and the beginning of line 2 in test.txt below).

test.txt:

AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCACCCGCCAGAGCAGAAAAUAUUGGGGCAGCGCC
CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUUCCCGACAUUGGCUGAAUA

Using Command:

pcregrep -M --color "C[\n]?C[\n]?C" test.txt

Returns:

AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCA**CCC**GCCAGAGCAGAAAAUAUUGGGGCAGCG**CC**

**C**CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUU**CCC**GACAUUGGCUGAAUA

It seems to correctly highlight the 2 C's at the end of line 1. However, it then highlights the first C in line 2 and proceeds to print out the second line entirely, giving me a duplicated C.

What am I doing wrong here and how can I avoid the duplication of 'C' in line 2?

  • Will this work for you `pcregrep -M --color "(?<!C)(C\RCC|CC\RC)(?!C)" test.txt` ? – Julio Aug 05 '20 at 16:59
  • I used lookbehind and lookahead assertions to make sure no extra Cs can be found before or after the 3 Cs, that is, you match exactly 3 Cs. If more than 3 Cs in a row is impossible in a DNA sequence (I don't know about that), then you may remove the lookahead and lookbehind assertions – Julio Aug 05 '20 at 17:03

1 Answer

Try with this:

pcregrep -M --color "(?<!C)(C\RCC|CC\RC)(?!C)" test.txt

I'm assuming that you want to find exactly 3 Cs and no more, and that runs of more than 3 Cs are possible. If they are not possible, or you don't care about matching inside a longer run of Cs, you may use this simpler regex instead:

pcregrep -M --color "C\RCC|CC\RC" test.txt

Explanation:

(?<!C)   # Negative lookbehind: Don't match if there's a C before the match
(              # One of these:
      C\RCC    #   C + any kind of new line + CC
    | CC\RC    #  CC + any kind of new line + C
)
(?!C)    # Negative lookahead: Don't match if there's a C after the match

See demo here.
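
If pcregrep is not at hand, the same idea can be checked with a small Python sketch. This is only an illustration under a few assumptions: Python's `re` module has no `\R`, so the hypothetical `NEWLINE` helper below spells out the common line endings instead, and the names `SPLIT_CCC`, `text` and `matches` are made up for the example. Like the pcregrep pattern above, it only reports runs of exactly three Cs that are split by a line break, not "CCC" inside a single line.

```python
import re

# Assumption for this sketch: Python's re has no \R, so the usual line
# endings are listed explicitly (pcregrep understands \R directly).
NEWLINE = r"(?:\r\n|\n|\r)"

# Same structure as the pcregrep pattern: exactly three Cs, split by one
# line break, with no further C directly before or after the match.
SPLIT_CCC = re.compile(rf"(?<!C)(?:C{NEWLINE}CC|CC{NEWLINE}C)(?!C)")

# The two lines of test.txt from the question.
text = (
    "AGAGUGGCAAUAUGCGUAUAACGAUUAUUCUGGUCGCACCCGCCAGAGCAGAAAAUAUUGGGGCAGCGCC\n"
    "CAUGCUGGGUCGCACAUGGAUCUGGUGAUAUUAUUGAUAAUAUUAAAGUUUUCCCGACAUUGGCUGAAUA\n"
)

# Collect (start offset, matched text) for every line-spanning CCC.
matches = [(m.start(), m.group()) for m in SPLIT_CCC.finditer(text)]

for start, matched in matches:
    # repr() makes the embedded newline visible in the printed match.
    print(start, repr(matched))
```

On the sample input this should report a single match, the "CC" at the end of line 1 joined with the "C" at the start of line 2. The "CCC" runs that sit entirely within one line are deliberately skipped.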

Julio