5

Suppose my long sequence looks like:

5’-AGGGTTTCCC**TGACCT**TCACTGC**AGGTCA**TGCA-3

The two italics subsequences (here within the two stars) in this long sequence are together called as inverted repeat pattern. The length and the combination of the four letters such as A,T,G,C in those two subsequences will be varying. But there is a relation between these two subsequence. Notice that, when you consider the first subsequence then its complementary subsequence is ACTGGA (according to A combines with T and G combine with C) and when you invert this complementary subsequence (i.e. last letter comes first) then it matches with the second subsequence.

There are large number of such patterns present in a FASTA sequence (contains 10 million ATGC letters) and I want to find such patterns and their start and end positions.

martineau
  • 119,623
  • 25
  • 170
  • 301
user1964587
  • 519
  • 2
  • 9
  • 12

4 Answers4

5

I'm new to both Python and bioinformatics, but I'm working my way through the rosalind.info web site to learn some of both. You do this with a suffix tree. A suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree) is the magical data structure that makes all things possible in bioinformatics. You quickly locate common substrings in multiple long sequences. Suffix trees only require linear time, so length 10,000,000 is feasible.

So first find the reverse complement of the sequence. Then put both into the suffix tree, and find the common substrings between them (of some minimum length).

The code below uses this suffix tree implementation: http://www.daimi.au.dk/~mailund/suffix_tree.html. It's written in C with Python bindings. It won't handle a large number of sequences, but two is no problem. However I can't say whether this will handle length 10,000,000.

from suffix_tree import GeneralisedSuffixTree

baseComplement = { 'A' : 'T', 'C' : 'G', 'G' : 'C', 'T' : 'A' }

def revc(seq):
    return "".join([baseComplement[base] for base in seq[::-1]])

data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
# revc  TGCATGACCTGCAGTGAAGGTCAGGGAAACCCT
#       012345678901234567890123456789012
#                 1         2         3
minlength = 6

n = len(data)
tree = GeneralisedSuffixTree([data, revc(data)])
for shared in tree.sharedSubstrings(minlength):
    #print shared
    _, start, stop = shared[0]
    seq = data[start:stop]
    _, rstart, rstop = shared[1]
    rseq = data[n-rstop:n-rstart]
    print "Match: {0} at [{1}:{2}] and {3} at [{4}:{5}]".format(seq, start, stop, rseq, n-rstop, n-rstart)

This produces output

Match: AGGTCA at [23:29] and TGACCT at [10:16]
Match: TGACCT at [10:16] and AGGTCA at [23:29]
Match: CTGCAG at [19:25] and CTGCAG at [19:25]

It finds each match twice, once from each end. And there's a reverse palindrome in there, too!

martineau
  • 119,623
  • 25
  • 170
  • 301
Carl Raymond
  • 4,429
  • 2
  • 25
  • 39
  • Thank you very much. I have put a comment below (a small version of the algorithm). Can you implement it? – user1964587 Jan 13 '13 at 23:07
  • @user1964587: If this answers the question, please accept the answer by clicking on the checkmark. As for implementing your idea, I'll leave that to you. :-) However, your proposed solution sounds like it will require at least O(n^2) running time, whereas a suffix tree is O(n), and so will be faster. But for 10,000,000-long sequences, you'll want lots of RAM. – Carl Raymond Jan 14 '13 at 00:43
  • Yes, I agree and your suffix tree method is good idea for it. But if you modify your program for a small section of DNA of the whole genome then it will be more efficient. Suppose first consider 20 letters and generate its reverse complement and using the suffix tree method find the shared sequence of it. If there two shared sequences are found then go for the next section and repeat the whole process(and not consider the previous section for the next calculation)otherwise increase the length of the previous sequence.I am trying but can't figure out how to do it.Can you please implement it. – user1964587 Jan 14 '13 at 03:01
  • That's a big request! Why don't you start working on it, piece by piece, and when you have a specific problem, post a new question on it. – Carl Raymond Jan 14 '13 at 17:08
  • I can't print the variable 'tree'. I want to see the structure of the tree variable. Simple 'print tree' is not working here. – user1964587 Jan 14 '13 at 20:28
  • The "tree" variable is an instance of a complex data structure. Look at the documentation for it on the web site linked in my reply for the properties and methods it has. You should be able to print tree.sequences, tree.sharedSubstrings(), and other properties of tree. – Carl Raymond Jan 14 '13 at 21:51
1

Here's a brute force implementation, although it's probably not very useful on super long sequences.

def substrings(s, lmin, lmax):
    for i in range(len(s)):
        for l in range(lmin, lmax+1):
            subst = s[i:i+l]
            if len(subst) == l:
                yield i, l, subst

def ivp(s, lmin, lmax):
    mapping = {'A': 'T', 'G': 'C', 'T': 'A', 'C': 'G'}
    for i, l, sub in substrings(s, lmin, lmax):
        try:
            from string import maketrans
        except ImportError: # we're on Python 3
            condition = sub.translate(
                       {ord(k): v for k, v in mapping.items()})[::-1] in s
        else: # Python 2
            condition = sub.translate(maketrans('ATGC', 'TACG'))[::-1] in s
        if condition:
            yield i, l, sub

Let's find "inverted repeat patterns" of length 6 (and their start positions and lengths):

>>> list(ivp('AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA', 6, 6))
[(10, 6, 'TGACCT'), (19, 6, 'CTGCAG'), (23, 6, 'AGGTCA')]

This doesn't check if the two patterns overlap, though. For instance, 'CTGCAG' matches itself.

Lev Levitsky
  • 63,701
  • 20
  • 147
  • 175
  • I have an idea for finding the inverted palindrome sequence of a long sequence. Consider a section of a DNA sequence of the whole sequence and generate its complement. Then reversing the section of this complement sequence. Then perform the dynamical alignment between these two sections and calculate the cost of it(one from – user1964587 Jan 13 '13 at 22:48
  • actual sequence and other from reverse complement sequence). The cost will be giving an idea which alignment is best. Now if the cost of the best alignment >=threshold cost then select that section and find the common region. The two common regions of this particular section will be one inverted repeat unit. Once find the unit then repeat it for the next section other wise increase the length of the sections continuously. Can anyone implement this algorithm. May be it will be a useful solution. – user1964587 Jan 13 '13 at 22:49
  • I am getting this error message when I was running this scriot. Where is my wrong? Traceback (most recent call last): File "irc.py", line 13, in print list(ivp('AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA', 6, 6)) File "irc.py", line 11, in ivp if sub.translate(mapping)[::-1] in s: TypeError: expected a character buffer object – user1964587 Jan 14 '13 at 04:49
  • @user1964587 Nothing you did wrong, I edited the code to work on both Python 2 and 3. – Lev Levitsky Jan 14 '13 at 06:03
  • but why is this error appearing? any clue. I am using 3.2 but tried with 2. still the error is there – user1964587 Jan 14 '13 at 06:23
  • @user1964587 Please try the edited code. The traceback is from the old code (or please show the new traceback) – Lev Levitsky Jan 14 '13 at 06:34
  • File "irc1.py", line 22 print list(ivp(fasta, 6, 6)) ^ SyntaxError: invalid syntax. this error was coming when I ran the script using python3 or python3.2....why it appears – user1964587 Jan 14 '13 at 07:53
  • @user1964587 Look at the previous line, maybe you forgot to close a bracket there or something. – Lev Levitsky Jan 14 '13 at 08:42
  • @user1964587 Wait a sec, you can't use `print` statement in Python3: [`print` is a function](http://docs.python.org/3.0/whatsnew/3.0.html#print-is-a-function). – Lev Levitsky Jan 14 '13 at 17:44
  • thanks it works. but when I varying lmin and lmax for a long sequence, the output is weird. – user1964587 Jan 14 '13 at 20:39
  • @user1964587 Can you perhaps be more specific? :) – Lev Levitsky Jan 14 '13 at 20:55
0

I have an idea for finding the inverted palindrome sequence of a long sequence. Consider a section of a DNA sequence of the whole sequence and generate its complement. Then reversing the section of this complement sequence. Then perform the dynamical alignment between these two sections and calculate the cost of it(one from actual sequence and other from reverse complement sequence). The cost will be giving an idea which alignment is best. Now if the cost of the best alignment >=threshold cost then select that section and find the common region. The two common regions of this particular section will be one inverted repeat unit. Once find the unit then repeat it for the next section other wise increase the length of the sections continuously. Can anyone implement this algorithm. May be it will be a useful solution.

user1964587
  • 519
  • 2
  • 9
  • 12
0

I took a shot at it using list comprehensions. I'm still new to python, but have been a C# dev for the last 5 or so years. This produces the output you desire, although it's definitely not optimized to be able to efficiently handle a 10 million character string.

Note: because we convert the list into a set in frequent_words, in order to remove duplicate entries, the results are not ordered.

def pattern_matching(text, pattern):
    """ Returns the start and end positions of each instance of the pattern  """
    return [[x, str(len(pattern) + x)] for x in range(len(text) - len(pattern) + 1) if
            pattern in text[x:len(pattern) + x]]


def frequent_words(text, k):
    """ Finds the most common k-mers of k """
    counts = [len(pattern_matching(text, text[x:x + k])) for x in range(len(text) - k)]
    return set([text[x:x + k] for x in range(len(text) - k) if counts[x] == max(counts)])


def reverse_complement(pattern):
    """ Formed by taking the complement of each nucleotide in Pattern """
    complements = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}
    return ''.join([complements.get(c, c) for c in pattern][::-1])


def find_invert_repeats(text, pattern_size):
    """ Returns the overlap for frequent words and its reverse """
    freqs = frequent_words(text, pattern_size)
    rev_freqs = frequent_words(reverse_complement(text), pattern_size)
    return [[x, pattern_matching(text, x)] for x in set(freqs).intersection(rev_freqs)]

print(find_invert_repeats("AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA", 6))

[['TGACCT', [[10, '16']]], ['AGGTCA', [[23, '29']]], ['CTGCAG', [[19, '25']]]]
Alexsh
  • 67
  • 7