Input file:

>AMSF107-09|Perciformes|COI-5P|GU661092
TAGTA-
>AMSF114-09|Perciformes|COI-5P|GU661101
C-ACGC
>ANGBF3683-12|Haemulon_sp._B_JJT-2012|COI-5P|JQ741244
-GCAGTT-CA-

I want to replace the hyphens in TAGTA-, C-ACGC, and -GCAGTT-CA- with N's but leave the headers (the lines that start with >) intact. I'm looking for a regex that will match a hyphen next to an A, C, G, or T but skip any line that begins with the > character.

Desired output

>AMSF107-09|Perciformes|COI-5P|GU661092
TAGTAN
>AMSF114-09|Perciformes|COI-5P|GU661101
CNACGC
>ANGBF3683-12|Haemulon_sp._B_JJT-2012|COI-5P|JQ741244
NGCAGTTNCAN

EDIT: I know only the very basics of regex. So far I've tried (ACGT)?\-(ACGT)? but that matches every hyphen, including the ones in the headers.

cooldood3490

2 Answers

1

This matches a hyphen preceded by A, C, G or T: (?<=[ACGT])-
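A quick way to see what this pattern does and does not catch is to run it on the third record from the question. The sketch below uses Python's `re` module purely for illustration (the variable names are made up here, and Sublime's regex engine may differ in details):

```python
import re

# Pattern from this answer: a hyphen immediately preceded by A, C, G or T.
LOOKBEHIND = re.compile(r"(?<=[ACGT])-")

header = ">ANGBF3683-12|Haemulon_sp._B_JJT-2012|COI-5P|JQ741244"
sequence = "-GCAGTT-CA-"

masked_header = LOOKBEHIND.sub("N", header)
masked_sequence = LOOKBEHIND.sub("N", sequence)

print(masked_header)    # >ANGBF3683-12|Haemulon_sp._B_JJTN2012|COI-5P|JQ741244
print(masked_sequence)  # -GCAGTTNCAN
```

Two things show up: the hyphen after the T in `JJT-2012` is replaced even though it sits in a header, and the leading hyphen of the sequence is left alone because nothing precedes it.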

gribvirus74
  • That's close, but it also matches the hyphen after the T in the 3rd header: `>ANGBF3683-12|Haemulon_sp._B_JJT-2012|COI-5P|JQ741244` – cooldood3490 Aug 08 '17 at 18:04
  • What language are you writing in? You could add a simple method that filters the lines. – gribvirus74 Aug 08 '17 at 18:21
  • I'm editing a text file in Sublime using the Find & Replace function. I'm putting the regex in the Find section and `N` in the replace section. – cooldood3490 Aug 08 '17 at 18:23
  • Ahh ok... So there's a problem: in most regex engines a positive lookbehind must be fixed length, which means a lookbehind can't check how the line starts and then find the hyphen in one regex. You can write a simple script, for example in Python, that accomplishes this though. – gribvirus74 Aug 08 '17 at 18:36
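The script suggested in the last comment could look roughly like the sketch below. It assumes the file is FASTA-like, with every header line starting with `>` and every other line being sequence data; the function name `mask_gaps` is hypothetical and not from the thread.

```python
def mask_gaps(fasta_text: str) -> str:
    """Replace '-' with 'N' on sequence lines; keep '>' header lines as-is."""
    out = []
    # keepends=True preserves the original line endings, including a final newline.
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            out.append(line)                    # header: leave untouched
        else:
            out.append(line.replace("-", "N"))  # sequence: mask every hyphen
    return "".join(out)


sample = (
    ">AMSF107-09|Perciformes|COI-5P|GU661092\n"
    "TAGTA-\n"
    ">AMSF114-09|Perciformes|COI-5P|GU661101\n"
    "C-ACGC\n"
    ">ANGBF3683-12|Haemulon_sp._B_JJT-2012|COI-5P|JQ741244\n"
    "-GCAGTT-CA-\n"
)
print(mask_gaps(sample), end="")
```

Because headers are filtered out by a plain prefix check, no regex is needed at all, and leading hyphens such as the one in `-GCAGTT-CA-` are handled too.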
1

This doesn't match just the hyphens; instead it matches a whole run of A, C, G, T and - characters that extends to the end of the line and contains at least one -. Here is the regex:

(?=[ACGT-]+$)(?=(?:[^-]*[-])+).*

You may have to split each match off from the string, save it to a temporary variable, do a .replace('-', 'N') on it, and then concatenate it back onto your data string. Hope this helps!
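As a rough sketch of that match-then-replace idea in Python: `re.sub` accepts a callback, so the temporary-variable step can happen inside it. The names `SEQ_LINE` and `mask_with_regex` are made up for this example, and `re.MULTILINE` is an assumption added so that `$` matches at the end of every line rather than only at the end of the text.

```python
import re

# The regex from this answer; MULTILINE makes `$` match at each line end.
SEQ_LINE = re.compile(r"(?=[ACGT-]+$)(?=(?:[^-]*[-])+).*", re.MULTILINE)


def mask_with_regex(text: str) -> str:
    """Replace '-' with 'N' inside every span matched by SEQ_LINE."""
    return SEQ_LINE.sub(lambda m: m.group(0).replace("-", "N"), text)


sample = (
    ">AMSF107-09|Perciformes|COI-5P|GU661092\n"
    "TAGTA-\n"
    ">AMSF114-09|Perciformes|COI-5P|GU661101\n"
    "C-ACGC\n"
    ">ANGBF3683-12|Haemulon_sp._B_JJT-2012|COI-5P|JQ741244\n"
    "-GCAGTT-CA-\n"
)
print(mask_with_regex(sample), end="")
```

On the sample from the question this reproduces the desired output. One caveat: the pattern is not anchored to the start of a line, so a header whose tail consists only of A, C, G, T and - characters (for example one ending in `-A`) would also be touched.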

demogorgon
  • Hold on, it is not quite right. I thought I tested it but it still has a bug. I'll see if I can fix it, but it's close! The problem is that as long as any of your given characters already appears in the string, any other character or number is accepted as well. I'll keep working on it. – demogorgon Aug 08 '17 at 23:53
  • @cooldood3490 I accidentally pasted the wrong `regex`; it is updated now. – demogorgon Aug 09 '17 at 01:19
  • I appreciate your help – cooldood3490 Aug 09 '17 at 01:33