Finding inverted repeats in DNA sequence

Question

I have a long string of DNA sequence, and I need to find regions consist of two palindromic sequences flanking a spacer sequence.

The input is:

cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa

This is my code:

use strict;
use warnings;

my $input= $ARGV[0];
chomp $input;
open (my $fh_in, "<", $input) or die "Cannot open file $input $!";
my $dna= <$fh_in>;
chomp $dna;

#######################################################################################

if ($dna=~ /[^ACGT]/gi) {
        print "This is not a valid DNA sequence, it has unknown base(s)\n";
}

$dna=~ tr/[acgt]/[ACGT]/;


######################################################################################

print "Minimum length of palindromic sequence?\n";
my $min= <STDIN>;
chomp $min;

print "Maximum length of palindromic sequence?\n";
my $max= <STDIN>;
chomp $max;

print "Minimum length of spacer region?\n";
my $min_spacer= <STDIN>;
chomp $min_spacer;

print "Maximum length of spacer region?\n";
my $max_spacer= <STDIN>;
chomp $max_spacer;

###################################################################################### 

my $dna_length= length($dna);
my ($length , $offset , $string_1 , $string_2);

for ($offset= 0 ; $offset <= $dna_length-$max-$max-$max_spacer ; $offset++) {
        for ($length= $min ; $length <= $max ; $length++) {
                $string_1= substr ($dna, $offset, $length);
                $string_2= reverse $string_1;
                $string_2=~ tr/[ACGT]/[TGCA]/;
                if ($dna=~ /(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/) {
                        print "IR starts at $offset => $2***$3***$4\n$1\n\n";
                }
        }
}

With parameters: $min = 6 , $max = 12 , $min_spacer = 4 , $max_spacer = 12 The output I get is:

IR starts at 26 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 27 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 118 => CGGATG***GCCAAGGC***CATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 118 => CGGATGG***CCAAGG***CCATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 118 => CGGATGGC***CAAG***GCCATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 119 => GGATGG***CCAAGG***CCATCC
GGATGGCCAAGGCCATCC

IR starts at 119 => GGATGGC***CAAG***GCCATCC
GGATGGCCAAGGCCATCC

IR starts at 120 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC

IR starts at 136 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 164 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 252 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 254 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 254 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT

IR starts at 255 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 256 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 258 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 274 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 276 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 304 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 304 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT

IR starts at 305 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 306 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 314 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC

However when I check the region of my first hit (highlighted in bold in input), the offset of this hit doesn't seem to be at position 26. Could anyone enlighten me what's wrong with my code? Thanks.

I tried using the other DNA sequences and the code seemed to work but except this one. — zebra, Dec 01 '14 at 06:05
Sorry, I can't figure out what your code is doing, and what exactly isn't working. Could you clarify a little? — Sobrique, Dec 01 '14 at 09:55
Sorry I didn't comment my code. I intend to identify a region with two flanking DNA palindromes. For example, GAATTC is a palindrome of its complementary strand when read in reverse direction. Example of region could be GAATTC-----GAATTC or GGGC----GCCC where - is a nucleotide in spacer region. — zebra, Dec 01 '14 at 10:53
A pairs with T and vice versa. C pairs with G and vice versa. — zebra, Dec 01 '14 at 10:57
And what's your expected output that you're not getting? I'm afraid I - like most readers of the site - find code easy, and DNA hard :) — Sobrique, Dec 01 '14 at 11:17
If you copy the first hit (TCGATCGATGCTAGCGGCGCGATCGA) and 2nd hit and find it in my input, you will see that it's not at position 26 as indicated in my output. But 3rd hit is correct. — zebra, Dec 01 '14 at 11:30
so if I understand correctly after churning through letters for a bit, the problem is: your regex match $2 is in position of the first match of $string1 in DNA ( 26 ), even though at that offset value, the other captures did not match, and you would rather have the offset at the time the others matched as well. Would you say that is correct? — bytepusher, Dec 01 '14 at 13:13
@bytepusher Not really, I think. The problem is the first two hits I got in my result (the other hits beside these two are correct, the offset position (26) is wrong. I figured they are wrong when I copy the first hit (TCGATCGATGCTAGCGGCGCGATCGA) and use ctrl-f to locate it in my input DNA sequence. I found the position is not supposed to be 26. I hope this is clear (I am not a native Eng speaker) — zebra, Dec 01 '14 at 13:36
well, the problem is not language, I think, but really, I find it hard to sort out the letters in my head, sorry. I thought I had figured out what happened, but am too tired to try again right now. — bytepusher, Dec 01 '14 at 14:16
Is this a common thing? is there nothing in the bioperl library that can help you? — Len Jaffe, Dec 01 '14 at 17:19
Why are you bioinformatics guys using the perl regex engine for your tasks? that seems horribly inefficient to me with your greatly reduced alphabet. — Patrick J. S., Dec 01 '14 at 23:32

score 1 · Accepted Answer · answered Dec 01 '14 at 17:15

1

Your problem is your regex is looking for a palindromes anywhere in the sequence, not just at the location of the offset. "ATCGATCG" occurs at offset 26, so it matches. You need to add some positional information to your regex. Try something like

/^[ACGT]{$offset}(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/

answered Dec 01 '14 at 17:15

heathobrien

1,027
7
11

Although note that this will not work for long DNA sequences, where the `$offset` is > 32766. A better way might be to only search within a window: `$search_window = substr($dna, $pos, 500);` `if ($search_window =~ /^[ACGT](($region)([ACGT]{$min_spacer,$max_spacer})($inverted_region))/) {` – fugu Oct 01 '17 at 08:37

Patrick J. S. · Answer 2 · 2014-12-03T06:08:12.690

Here is one solution, it uses the experimental (??{}) feature, which is said to change for a long time, but hasn't yet.

How it works: it calls the subroutine convert from within the regular expression and converts the first matching group into a desired regular expression of the outputstring. the rest (backtracking etc) is handled by the regex engine. Sadly interpolating variables as delimiting lengths doesn't sit well with the regex parsing, so I had to use a string to do that. Refrain from that if possible.

use warnings;
use strict;
use 5.01;
use re 'eval'; # needed, because of (??{})

my %c=
  (min_pali   =>    (shift) // 6,
   max_pali   =>    (shift) // 12,
   min_spacer =>    (shift) // 4,
   max_spacer =>    (shift) // 12,
  );

my $re1 = "(.{$c{min_pali},$c{max_pali}})(.{$c{min_spacer},$c{max_spacer}})(??{convert})";
while(<DATA>){
  chomp;
  $_ = uc $_;
  my $converted;
  sub convert {
    my $var = reverse $1;
    $var =~ tr{ACGT}{TGCA};
    $converted =  $var;
  }
  while (/$re1/g) {
    printf "%3d => %s**%s**%s\n", $-[0],$1,$2,$converted;
    pos = $-[0] + 1; # start next match one character after the last match start
  }
}

__DATA__
cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa

OUTPUT:

118 => CGGATG**GCCAAGGC**CATCCG
119 => GGATGG**CCAAGG**CCATCC
120 => GATGGC**CAAG**GCCATC
254 => ATCGATCG**ATGCTAGCGGCG**CGATCGAT
255 => TCGATCG**ATGCTAGCGGCG**CGATCGA
256 => CGATCG**ATGCTAGCGGCG**CGATCG
258 => ATCGAT**GCTAGCGGCGCG**ATCGAT
314 => GATGGC**CAAG**GCCATC

Also I'm not sure if that is a problem, but you could produce longer palidrome sequences that just get shifted into the spacer with this solution:

 Assuming length 2 – 4, spacer= 2 – 4 (X's are unintresting bits)
 ACACAXXTGTGT => ACAC**AXXT**GTGT

Finding inverted repeats in DNA sequence

2 Answers2