
I am trying to filter out insertions and deletions from an mpileup txt file. An example of an insertion or deletion would be +3ATG or -9AATCGTCTC.

In another post I found a solution using perl:

regular expression that reference a match from earlier part of expression

However, that script only captures each insertion or deletion in the special variable $&. I would like to end up with a new variable in which every insertion and deletion has been replaced with nothing. My attempt uses the same pattern, turned into a substitution with an empty replacement:

$row =~ s/(\d+)(??{"."*$1})//xg;

Does anyone have any idea why it won't work or an alternative solution?

I would also be happy to match anything that wasn't an insertion or deletion and make this a new variable.


Here is an example of the input:

$,...........................,,.................,,....,,g.,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,...............,,,.....,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....,,.....,,,,,,,,,,,......,,,,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,.............................,,.,.........,.,.,,....,..........,,......................,,,,,,...........................,,,,,,,,.....,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,,,,,,.+12GATGCTGTGTTT..,,,,,,,,.,,,,,,,,,,,,,,,,,,,,,,,.,,.,,-8tgatgctg,,,...,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,..

Here is an example of the output I would like:

$,...........................,,.................,,....,,g.,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,...............,,,.....,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....,,.....,,,,,,,,,,,......,,,,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,.............................,,.,.........,.,.,,....,..........,,......................,,,,,,...........................,,,,,,,,.....,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,,,,,,.+..,,,,,,,,.,,,,,,,,,,,,,,,,,,,,,,,.,,.,,-,,,...,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,..

Cheers,

Daniel

  • Can you give the expected input and output, and clearly demonstrate what you're trying to do? – fugu May 13 '16 at 11:29

2 Answers


Is this what you're after?

use feature qw(say);

my $DNA = ',...........,,....,,g.,,,,,,,,,,,.+12GATGCTGTGTTT..,,,,,.,,.,,-8tgatgctg,,,,,,,,..';

say $DNA;

$DNA =~ s/\d+[ATGCatgc]*//g;

say $DNA;

,...........,,....,,g.,,,,,,,,,,,.+12GATGCTGTGTTT..,,,,,.,,.,,-8tgatgctg,,,,,,,,..
,...........,,....,,g.,,,,,,,,,,,.+..,,,,,.,,.,,-,,,,,,,,..
fugu
  • I've had to put my file into an array and used your answer in a foreach loop but that has worked perfectly. Thanks a lot for your help with this! – Daniel Kelly May 13 '16 at 12:36
  • @DanielKelly - No worries! – fugu May 14 '16 at 09:56
  • @DanielKelly, I don't believe this works. Consider "+3ACGT" -- the greedy nature of * will consume more bases than the indel specifies. See my answer to [Python regex to match and remove the indels in pileup format](http://stackoverflow.com/questions/37491704/python-regex-to-match-and-remove-the-indels-in-pileup-format/40231190#40231190) for further discussion of this. – cdlane Oct 25 '16 at 04:12
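The concern in the last comment can be checked with a small test. The sketch below is only an illustration: the sub names `strip_greedy` and `strip_counted` and the fragment `.+3ACGT,` are made up for it, and the second pattern is written in the style of the length-aware answer, with both letter cases spelled out in the character class rather than relying on the `i` flag.

```perl
use strict;
use warnings;

# Greedy pattern from the answer above: removes the count and every
# base letter that follows it, however many the count announces.
sub strip_greedy {
    my ($bases) = @_;
    (my $out = $bases) =~ s/\d+[ATGCatgc]*//g;
    return $out;
}

# Length-aware pattern: removes the sign, the count, and exactly
# that many base letters.
sub strip_counted {
    my ($bases) = @_;
    (my $out = $bases) =~ s/[+-](\d+)(??{"[ACGTNacgtn]{$1}"})//g;
    return $out;
}

# Hypothetical fragment: a 3-base insertion (ACG) followed by a mismatch T.
my $fragment = '.+3ACGT,';

print strip_greedy($fragment),  "\n";   # .+,
print strip_counted($fragment), "\n";   # .T,
```

In this fragment the `T` after the insertion is a read base in its own right, so the greedy version silently drops it, while the counted version keeps it.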

A slight variation on the pattern you already have should work:

$pileup = '$,...........................,,.................,,....,,g.,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,...............,,,.....,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....,,.....,,,,,,,,,,,......,,,,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,.............................,,.,.........,.,.,,....,..........,,......................,,,,,,...........................,,,,,,,,.....,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,,,,,,.+12GATGCTGTGTTT..,,,,,,,,.,,,,,,,,,,,,,,,,,,,,,,,.,,.,,-8tgatgctg,,,...,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,..';

$pileup =~ s/[+-](\d+)(??{"[ACGTN]{$1}"})//gi;

print($pileup, "\n");

PRODUCES

$,...........................,,.................,,....,,g.,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,...............,,,.....,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.....,,.....,,,,,,,,,,,......,,,,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,.............................,,.,.........,.,.,,....,..........,,......................,,,,,,...........................,,,,,,,,.....,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,.,,,,,,,,,,,,,,,,,,,,...,,,,,,,,.,,,,,,,,,,,,,,,,,,,,,,,.,,.,,,,,...,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,..

Note that this is two characters shorter than your example output, because your expected output accidentally left in the + and - signs.
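As for why the attempt in the question fails: as far as I can tell, `"."*$1` is parsed as numeric multiplication, so the code block returns `0` and the postponed sub-pattern only ever matches a literal `0`. Perl's string repetition operator is `x`, not `*`. The sketch below shows that difference and then an alternative that avoids `(??{...})` entirely. The helper name `strip_indels` and the shortened sample row are assumptions made for illustration, and because the helper returns a copy, it also covers the wish to keep the result in a new variable.

```perl
use strict;
use warnings;

# "*" multiplies numbers: "." numifies to 0, so the product is 0.
my $product = do { no warnings 'numeric'; "." * 3 };   # 0
# "x" repeats a string.
my $repeat  = "." x 3;                                  # "..."

# Hypothetical helper (not from the original post): return a copy of the
# read-base string with each indel removed, without using (??{...}).
sub strip_indels {
    my ($bases) = @_;
    # Match the sign, the count, and every base letter that follows; the /e
    # callback then puts back whatever lies beyond the first $1 letters.
    (my $clean = $bases) =~ s{[+-](\d+)([ACGTNacgtn]*)}
                             { length($2) > $1 ? substr($2, $1) : '' }ge;
    return $clean;
}

my $row   = '.+12GATGCTGTGTTT..,,-8tgatgctg,,,';
my $clean = strip_indels($row);   # $row itself is left untouched

print "$clean\n";                 # ...,,,,,
```

The callback form trades the postponed sub-expression for a little arithmetic, which some readers may find easier to follow. It assumes indel sequences contain only the letters A, C, G, T and N in either case.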

cdlane