Obtain DNA substrings wrt their original orders

Question

I'd like to extract certain substrings from long DNA sequences.

For example, given:

1/ATXGAAATTXXGGAAGGGGTGG
2/AATXGAAGGAAGGAAGGGGATATTX
3/AAAAAATTXXGGAAGGGGXTTTA
4/AAAATTXXATAXXGGAAGGGGXTXG
5/ATTATTGTTXAXTATTT

the output is to be:

1/TXG    -  TTXX
2/TXG     -
3/       -  TTXX
4/TTXX  -   TXG
5/             -

I tried the following regex pattern:

(TXG|TTXX)

and it works: the matches are returned in a list. However, I don't know how to recover the order in which each match appeared in the original sequence, that is, whether TTXX comes first and TXG second, as in sequence 4, or the other way round, as in sequence 1. Results 2 and 3 are harder still, because the match-xx function call doesn't return the index at which the substring was found in the sequence in question. Thank you for your insights.

Is the hyphen and varying number of spaces in the desired output of some importance? Also, why aren't you supposed to match TTXX in sequence 5? — flesk, Nov 29 '11 at 12:29
@BabyDolphin What are these hyphens in the output? Just random? And the spaces? Other than this your regex seems trivial. — FailedDev, Nov 29 '11 at 12:40

score 3 · Answer 1 · answered Nov 29 '11 at 12:43

How about:

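#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump qw(dump);

# Map each sequence to the list of motifs found in it, in order of appearance.
my %res;
while (my $line = <DATA>) {
    chomp $line;
    # With /g in scalar context, each iteration resumes after the previous match.
    while ($line =~ /(TXG|TTXX)/g) {
        # $-[0] is the zero-based offset at which the current match starts.
        push @{ $res{$line} }, "found $1 at pos:$-[0]";
    }
}
dump %res;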

__DATA__
ATXGAAATTXXGGAAGGGGTGG
AATXGAAGGAAGGAAGGGGATATTX
AAAAAATTXXGGAAGGGGXTTTA
AAAATTXXATAXXGGAAGGGGXTXG
ATTATTGTTXXXTATTT

output:

(
  "ATTATTGTTXXXTATTT",
  ["found TTXX at pos:7"],
  "AATXGAAGGAAGGAAGGGGATATTX",
  ["found TXG at pos:2"],
  "AAAAAATTXXGGAAGGGGXTTTA",
  ["found TTXX at pos:6"],
  "AAAATTXXATAXXGGAAGGGGXTXG",
  ["found TTXX at pos:4", "found TXG at pos:22"],
  "ATXGAAATTXXGGAAGGGGTGG",
  ["found TXG at pos:1", "found TTXX at pos:7"],
)
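Because the results above are stored in a hash keyed by sequence, the sequences come back in arbitrary order. If the input order matters too, one option is to collect the matches in an array instead. The following is only a sketch under that assumption; `find_motifs` is a hypothetical helper name, and the offsets come from Perl's built-in `@-` array.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the motifs found in $seq, in order of
# appearance, each as a [motif, zero-based offset] pair.
sub find_motifs {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /(TXG|TTXX)/g) {
        push @hits, [ $1, $-[0] ];    # $-[0] is the start offset of the match
    }
    return @hits;
}

# Sample sequences taken from the question, kept in their original order.
my @sequences = qw(
    ATXGAAATTXXGGAAGGGGTGG
    AATXGAAGGAAGGAAGGGGATATTX
    AAAAAATTXXGGAAGGGGXTTTA
    AAAATTXXATAXXGGAAGGGGXTXG
);

my $n = 0;
for my $seq (@sequences) {
    my @hits = find_motifs($seq);
    printf "%d/ %s\n", ++$n, join(', ', map { "$_->[0] at $_->[1]" } @hits);
}
```

Each line of output lists the motifs left to right as they occur in the sequence, so "which came first" can be read directly from the list position.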

score 0 · Answer 2 · answered Nov 29 '11 at 12:37

What if you use two separate matches?

my $result = "";

# Note: this always lists TXG before TTXX, whatever their order in the input.
$result .= "TXG"  if /TXG/;
$result .= "TTXX" if /TTXX/;

print $result;
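The two-lookups idea can still recover the order if each lookup also reports where it matched. A minimal sketch, assuming only the first occurrence of each motif matters: it uses the built-in `index` function and sorts the motifs by offset. The helper name `ordered_motifs` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: look up each motif separately with index(),
# then return the motifs that were found, sorted by their offset.
sub ordered_motifs {
    my ($seq, @motifs) = @_;
    my %at;
    for my $m (@motifs) {
        my $i = index($seq, $m);    # index() returns -1 when the motif is absent
        $at{$m} = $i if $i >= 0;
    }
    return sort { $at{$a} <=> $at{$b} } keys %at;
}

# prints "TXG - TTXX"
print join(' - ', ordered_motifs('ATXGAAATTXXGGAAGGGGTGG', 'TXG', 'TTXX')), "\n";
# prints "TTXX - TXG"
print join(' - ', ordered_motifs('AAAATTXXATAXXGGAAGGGGXTXG', 'TXG', 'TTXX')), "\n";
```

Later repeats of a motif are ignored here; a regex loop with `/g` would be needed to see them.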

Shura

Hynek -Pichi- Vychodil · Answer 3 · 2011-11-29T14:16:23.017

perl -ne '($a) = /(TXG)/gc; ($b) = /\G.*(TTXX)/; ($a, $b) = ($1, $a) if $a and not $b and /(TTXX)/; m{^(\d/)}; printf "$1%5s -%5s\n", $a, $b'
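For readers who find the one-liner hard to follow, here is a longer script aiming at the same two-column layout as the question. It is a sketch, not a line-by-line translation: it assumes each motif occurs at most once per sequence (as in the sample data), and `format_line` is a hypothetical name. A lone TXG stays in the left column and a lone TTXX in the right, as in rows 2 and 3 of the desired output.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one "N/SEQUENCE" line into "N/left - right".
sub format_line {
    my ($line) = @_;
    my ($id, $seq) = $line =~ m{^(\d+/)(.*)$} or return;
    my @found = $seq =~ /(TXG|TTXX)/g;    # motifs in order of appearance
    my ($left, $right) = ('', '');
    if (@found >= 2) {
        # Both motifs present: columns follow the order in the sequence.
        ($left, $right) = @found[0, 1];
    }
    elsif (@found == 1) {
        # A single motif keeps its own column.
        if ($found[0] eq 'TXG') { $left  = 'TXG' }
        else                    { $right = 'TTXX' }
    }
    return sprintf "%s%-4s - %s", $id, $left, $right;
}

print format_line($_), "\n" for qw(
    1/ATXGAAATTXXGGAAGGGGTGG
    3/AAAAAATTXXGGAAGGGGXTTTA
    4/AAAATTXXATAXXGGAAGGGGXTXG
);
```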

edited Nov 29 '11 at 14:16

answered Nov 29 '11 at 13:43
