0

I was wondering if there is an implemented way to get all the occurrences of a given substring and the suffix array. I was testing a function which I found in here: https://hg.python.org/cpython/file/2.7/Lib/bisect.py which some modifications. What I get is only one of the position of the occurences when there are other ones. For instance,

seq = "ATGTGCAAGAATGAGGCAAG$" #original string
array = [20, 17, 6, 9, 18, 7, 13, 10, 0, 16, 5, 19, 8, 12, 15, 4, 14, 2, 11, 3, 1] #suffix array
def bisect_left(array, query, seq, lo=0, hi=None):
    if lo < 0:
        raise ValueError('must be non-negative')
    if hi is None: #by default len(array)
        hi = len(array)
    while lo < hi: 
        mid = (lo+hi)//2 #set the middle to binary search
        if seq[array[mid]:] < query:
            lo = mid+1
        else:
            hi = mid
    if not seq[array[lo]:array[lo]+len(query)] == query:
        raise IndexError('there is not any index for the query')
    return array[lo]
print(bisect_left(array, 'ATG', seq))

By executing this the output is 10 when should be 0,10 What could be wrong?

Joe Smith
  • 83
  • 1
  • 6
  • 1
    How is that your output? I get list index out of range. – user3483203 Mar 26 '18 at 21:22
  • 1
    First, why are you modifying the stdlib `bisect_left` function? What are your modifications intended to do? It's hard to debug your changes without knowing that. – abarnert Mar 26 '18 at 21:23
  • 3
    Second, the whole point of `bisect`, and binary search in general, is that you can use it to search a sorted sequence. But your list isn't sorted. So how are you expecting bisecting to be helpful? – abarnert Mar 26 '18 at 21:24
  • Also, I don't see a suffix array, or anything like one, anywhere, despite your tag and your mention in the text. Does molecular genetics have some different meaning for the term than computer science that's relevant here? – abarnert Mar 26 '18 at 21:26
  • Are you opposed to `[m.start() for m in re.finditer('ATG', 'ATGTGCAAGAATGAGGCAAG$')]` Output is `[0,10]` – user3483203 Mar 26 '18 at 21:28
  • Anyway: The reason the output is `10` rather than `0,10` is simple: Your `bisect_left` function does `return array[lo]`, and since the elements of `array` are just numbers, your caller is just going to get a number back. And you `print` exactly what you get back. If you want two numbers back… I'm not sure what the other number is, but it would need something like `return lo, array[lo]`. (And then it'll print `(0, 10)`, not `0,10`; if you want that, you need to write a format string or `join` expression or soemthing.) – abarnert Mar 26 '18 at 21:28
  • I just edit the code. It was an error but it is fixed. The numbers which should be printed are 0 because there is an occurrcence there (in index 0 start the query and 10 for the same reason). The modifications were done with the goal to get all the positions not only one but obviously didn't work. Anyway thanks – Joe Smith Mar 26 '18 at 21:36
  • 1
    Purely to counteract the comments above: I think what you are trying to do makes perfect sense. The problem is that your binary search returns only a single element from your suffix array that matches the query. You could modify your method to walk out from that point to find all the other matching occurrences of the query. – myrtlecat Mar 26 '18 at 21:50

1 Answers1

1

The problem is that your search method returns only a single match, when there might be several. In order to catch all of them, I can think of two options:

  1. First find a single match. Then walk up and down the suffix array from that point to find any additional matches. If the number of matches is not too large then this solution should be adequate, but if there are a very large number of matches it may be quite slow.
  2. If you are expecting a large number of matches, a better strategy would be to perform two binary searches. The first binary search should look for the left-most match in your suffix array, and the second search should look for the right-most match.

Below I have implemented the first of these. I defined a helper method match_at just to avoid lots of ugly nested []s.

seq = "ATGTGCAAGAATGAGGCAAG$" #original string
array = [20, 17, 6, 9, 18, 7, 13, 10, 0, 16, 5, 19, 8, 12, 15, 4, 14, 2, 11, 3, 1] #suffix array

def bisect_left(array, query, seq, lo=0, hi=None):
    if lo < 0:
        raise ValueError('must be non-negative')
    if hi is None: #by default len(array)
        hi = len(array)
    while lo < hi: 
        mid = (lo+hi)//2 #set the middle to binary search
        if seq[array[mid]:] < query:
            lo = mid+1
        else:
            hi = mid

    def match_at(i):
        return seq[i: i + len(query)] == query

    if not match_at(array[lo]):
        raise IndexError('there is not any index for the query')

    # array[lo] is one match
    # now we walk backwards to find the first match
    first = lo
    while first > 0 and match_at(array[first - 1]):
        first -= 1

    # and walk forwards to find the last match
    last = lo
    while match_at(array[last]):
        last += 1

    return array[first:last]

print(bisect_left(array, 'ATG', seq))
myrtlecat
  • 2,156
  • 12
  • 16