2

I'd like to get the substrings of long DNA sequences

For example, given:

1/ATXGAAATTXXGGAAGGGGTGG
2/AATXGAAGGAAGGAAGGGGATATTX
3/AAAAAATTXXGGAAGGGGXTTTA
4/AAAATTXXATAXXGGAAGGGGXTXG
5/ATTATTGTTXAXTATTT

the output is to be:

1/TXG    -  TTXX
2/TXG     -
3/       -  TTXX
4/TTXX  -   TXG
5/             -    

I tried the following regex pattern:

(TXG|TTXX) 

and it works, and the results are put in a list but I don't know how to retrieve the order of each result that has appeared in the original sequences. That is, whether TTXX and TXG appear first and second respectively as in sequence 4 but second and first as in sequence 1; and in 2nd and 3rd results, that is harder because match-xx function call doesn't offer an index of the substring which it took from the sequence in question. Thank you for your insights.

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
Baby Dolphin
  • 489
  • 4
  • 13

3 Answers3

3

How about:

#!/usr/bin/perl 
use strict;
use warnings;
use Data::Dump qw(dump);

my %res;
while(my $line = <DATA>) {
    chomp$line;
    while($line =~ /TXG|TTXX/g) {
        push @{$res{$line}}, "found $& at pos:".(pos($line)-length($&));
    }
}
dump%res;

__DATA__
ATXGAAATTXXGGAAGGGGTGG
AATXGAAGGAAGGAAGGGGATATTX
AAAAAATTXXGGAAGGGGXTTTA
AAAATTXXATAXXGGAAGGGGXTXG
ATTATTGTTXXXTATTT

output:

(
  "ATTATTGTTXXXTATTT",
  ["found TTXX at pos:7"],
  "AATXGAAGGAAGGAAGGGGATATTX",
  ["found TXG at pos:2"],
  "AAAAAATTXXGGAAGGGGXTTTA",
  ["found TTXX at pos:6"],
  "AAAATTXXATAXXGGAAGGGGXTXG",
  ["found TTXX at pos:4", "found TXG at pos:22"],
  "ATXGAAATTXXGGAAGGGGTGG",
  ["found TXG at pos:1", "found TTXX at pos:7"],
)
Toto
  • 89,455
  • 62
  • 89
  • 125
0

What if you put 2 matching functions?

my $result="";

$result.="TXG" if(/TXG/);
$result.="TTXX" if (/TTXX/);

print $result;
Shura
  • 108
  • 1
  • 8
0
perl -ne'($a)=/(TXG)/gc;($b)=/\G.*(TTXX)/;($a,$b)=($1,$a)if$a and not$b and/(TTXX)/;m{^(\d/)};printf"$1%5s -%5s\n",$a,$b'
Hynek -Pichi- Vychodil
  • 26,174
  • 5
  • 52
  • 73