
I need a function that returns the indices at which a list of strings best aligns to a larger string.

For example:

Given the string:

text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'

and the list of strings:

tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']

Can a function be created to yield:

indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]


Here is a script I created to illustrate the point:

from re import split
from numpy import vstack, zeros
import numpy as np

# I need a function which takes a string and the tokenized list 
# and returns the indices at which the tokens were split
def index_of_split(text_str, list_of_strings):
    #?????
    return indices

# The text string, string token list, and character binary annotations 
# are all given
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
# (This binary array labels the following terms ['Kir4.3', 'Dextran-sulfate', 'glucose'])
bin_ann = [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

# Here we would apply our function
indices = index_of_split(text, tok)
# This list is the desired output
#indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]

# We could now split the binary array based on these indices
bin_ann_toked = np.split(bin_ann, indices)
# and combine with the tokenized list
tokenized_strings = np.vstack((tok, bin_ann_toked)).T

# Then we can remove the trailing zeros, 
# which are likely caused by spaces 
# or other non-tokenized text
for i, el in enumerate(tokenized_strings):
    tokenized_strings[i][1] = el[1][:len(el[0])]
print(tokenized_strings)

This would provide the following output, given that the function worked as described:

[['Kir4.3' array([1, 1, 1, 1, 1, 1])]
 ['is' array([0, 0])]
 ['a' array([0])]
 ['inwardly-rectifying'
  array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
 ['potassium' array([0, 0, 0, 0, 0, 0, 0, 0, 0])]
 ['channel' array([0, 0, 0, 0, 0, 0, 0])]
 ['.' array([0])]
 ['Dextran-sulfate' array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]
 ['is' array([0, 0])]
 ['useful' array([0, 0, 0, 0, 0, 0])]
 ['in' array([0, 0])]
 ['glucose' array([1, 1, 1, 1, 1, 1, 1])]
 ['-' array([0])]
 ['mediated' array([0, 0, 0, 0, 0, 0, 0, 0])]
 ['channels' array([0, 0, 0, 0, 0, 0, 0, 0])]
 ['.' array([0])]]
chase
  • Why does `Dextran-sulfate` retain the hyphen when `'glucose','-', 'mediated'` does not? If we don't know when that occurs, then the function will have to be hard-coded. – Jacob G. Mar 03 '17 at 00:58

2 Answers

text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'

tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']


ind = [0]
for i, substring in enumerate(tok):
    ind.append(text.find(substring, ind[i], len(text)))

print(ind[2:])

results in

[7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
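One caveat, as the comment under the question hints: the loop above resumes each search at the *start* of the previous match, so a token that also occurs inside its predecessor (e.g. `'is'` inside `'this'`) could be matched too early. A hedged variant that tracks the *end* of the previous match instead (it assumes every token occurs in the text, in order):

```python
def index_of_split(text_str, list_of_strings):
    """Find each token starting from the end of the previous match,
    so a token that also occurs inside the preceding token is skipped."""
    indices = []
    start = 0
    for substring in list_of_strings:
        i = text_str.find(substring, start)
        indices.append(i)
        start = i + len(substring)
    return indices[1:]  # drop the first token's index (0); the rest are split points

text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel', '.',
       'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-', 'mediated', 'channels', '.']
print(index_of_split(text, tok))
# [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
```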
plasmon360
  • Thank you for your help, this is a very pythonic and simple solution to the problem. However, the solution using a brute force method is very interesting and worth looking into for anyone with less structure to their strings in their lists. – chase Mar 03 '17 at 18:45

Here is a brute-force numpy approach: it finds all matches for each word and then scores all combinations, penalising offsets.

import numpy as np
from scipy import signal

def pen(l, r):
    return (r-l)*(1-4*(l>r))

class template:
    def __init__(self, template):
        self.template = np.frombuffer(template.encode('utf32'), offset=4,
                                      dtype=np.int32)
        self.normalise = self.template*self.template
    def match(self, other):
        other = np.frombuffer(other.encode('utf32'), offset=4, dtype=np.int32)[::-1]
        m = signal.convolve(self.template, other, 'valid')
        t = signal.convolve(self.normalise, np.ones_like(other), 'valid')
        delta = np.absolute(m - t)
        md = min(delta)
        return np.where(delta == md)[0], md
    def brute(self, tok):
        ms, md = self.match(tok[0])
        matches = [[-md, (tok[0], s, s+len(tok[0]))] for s in ms]
        for t in tok[1:]:
            ms, md = self.match(t)
            matches = [[mo[0] - md - pen(mo[-1][-1], mn)] + mo[1:]
                       + [(t, mn, mn + len(t))] for mn in ms for mo in matches]
        return sorted(matches, key=lambda x: x[0])
#            for t in tok[1:]:
#                ms, md = self.match(t)
#                matches = [[mo[0] - md] + mo[1:]
#                           + [(t, mn, mn + len(t))] for mn in ms for mo in matches
#                           if mo[-1][-1] <=  mn]
#            return sorted(matches, key=lambda x: x[0])

text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
tx = template(text)
matches = tx.brute(tok)
print(matches[-1])

# [-11, ('Kir4.3', 0, 6), ('is', 7, 9), ('a', 10, 11), ('inwardly-rectifying', 12, 31), ('potassium', 32, 41), ('channel', 42, 49), ('.', 49, 50), ('Dextran-sulfate', 51, 66), ('is', 67, 69), ('useful', 70, 76), ('in', 77, 79), ('glucose', 80, 87), ('-', 87, 88), ('mediated', 88, 96), ('channels', 97, 105), ('.', 105, 106)]
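A side note on the `frombuffer`/`utf32` trick used above: encoding a string as UTF-32 and skipping the 4-byte byte-order mark with `offset=4` yields one 32-bit integer code point per character, which is what lets string matching be phrased as convolution. A minimal sketch of the matching step (same idea as `template.match`, run on a toy snippet of my own, not from the answer):

```python
import numpy as np
from scipy import signal

# One int32 code point per character (offset=4 skips the UTF-32 BOM).
word = np.frombuffer('glucose'.encode('utf32'), offset=4, dtype=np.int32)
snippet = np.frombuffer('in glucose-mediated'.encode('utf32'), offset=4, dtype=np.int32)

# Convolving with the reversed token correlates it against every window
# of the snippet; the normalisation term is each window's sum of squares,
# which equals the correlation exactly where the window equals the token.
m = signal.convolve(snippet, word[::-1], 'valid')
t = signal.convolve(snippet * snippet, np.ones_like(word), 'valid')
delta = np.absolute(m - t)
print(int(np.argmin(delta)))  # 'glucose' starts at offset 3 in the snippet
```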
Paul Panzer
  • This is very useful, but I don't fully understand what is going on quite yet. I noticed that the output of matches is extraordinarily large. I will be performing this task hundreds of thousands of times, so my hope is that it would be rather quick. Does it alleviate the problem knowing that these will be in the same order? In other words, the sequence should progress similarly, there are just a few characters missing between the joined text from `tok` and the `text` string. – chase Mar 03 '17 at 06:13
  • @chase Sure, I've added an optimistic version (commented out) that gives just the one result, you'll have to experiment how much flexibility you'll need for your real problem. – Paul Panzer Mar 03 '17 at 08:21
  • Thank you so much! This is really an amazing algorithm, I will likely use it in the future for understanding DNA alignment algorithms etc. However, for this certain project, the simple `text.find()` solution works and it is about 50-100x faster than the one-match solution. It looks like scoring was approximately as follows: brute force all matches: ~0.15 seconds; brute force one match: ~0.001 seconds; text.find: ~0.00001 seconds (~10,000x faster than brute force all matches) – chase Mar 03 '17 at 18:45
  • Interesting, I'm not surprised it's worse than the builtin specialist function but I wouldn't have expected it to be so much worse. Similarly, I don't expect it to be competitive with specialist alignment algorithms like [this](https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm) one. – Paul Panzer Mar 03 '17 at 19:44