
I am using the following class to generate overlapping subsequences of a string:

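import numpy as np

class SlidingKmerFragmenter:
    """
    Slide a window of random length over a single string.
    """
    def __init__(self, k_low, k_high):
        self.k_low = k_low
        self.k_high = k_high
        # Fixed seed so the fragment lengths are reproducible
        self.rng = np.random.RandomState(1234)

    def apply(self, seq):
        # One fragment per start index i; randint's upper bound is exclusive,
        # so each length falls in [k_low, k_high].
        return [
            seq[i: i + self.rng.randint(self.k_low, self.k_high + 1)]
            for i in range(len(seq) - self.k_high + 1)
        ]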

For example, I could do

generator = SlidingKmerFragmenter(1, 6)
generator.apply("DHDHDDHEBENRJ")

to get overlapping subsequences of this string, where each subsequence is between 1 and 6 characters long.

How can I go from the output of this function back to the original string? I want to concatenate the subsequences together but some portions are overlapping.
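One property of the fragmenter may help here: the fragment at list position i always starts at index i of the original string, so the positions are known and no overlap guessing is needed. Below is a minimal sketch of that idea, assuming the output list keeps its original order and k_low is at least 1 (so no fragment is empty). The name reconstruct_prefix and the fixed lengths list are hypothetical and stand in for the random draws; the sketch recovers only the part of the string that the fragments actually cover.

```python
def reconstruct_prefix(fragments):
    """Rebuild the part of the original string covered by the fragments.

    Assumes fragments[i] == seq[i:i + k_i] with k_i >= 1, in original order.
    """
    recovered = ""
    for start, fragment in enumerate(fragments):
        if not fragment:
            raise ValueError("empty fragment at position %d" % start)
        # Characters of this fragment that are already known
        known = recovered[start:start + len(fragment)]
        if fragment[:len(known)] != known:
            raise ValueError("fragment %d disagrees with earlier ones" % start)
        # Append only the characters that extend past what is known
        recovered += fragment[len(known):]
    return recovered


seq = "DHDHDDHEBENRJ"
lengths = [3, 1, 6, 2, 4, 1, 5, 2]  # hypothetical lengths, one per start index
fragments = [seq[i:i + k] for i, k in enumerate(lengths)]
print(reconstruct_prefix(fragments))  # DHDHDDHEBEN
```

With these lengths the trailing "RJ" is never covered by any fragment, so it cannot be recovered from the output alone.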

Thanks! Jack

EDIT: Just to clarify, the solution in "How can I merge overlapping strings in python?" does not seem to work for this particular example, because of the repeated Ds and Hs in my example string.

If I had a list of original sequences and wanted to match the overlapping subsequence list to the original sequence list, how could that be done?
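If the original sequences are available, one possible approach is to test each candidate directly: a candidate is compatible when every fragment equals the candidate's slice starting at that fragment's list position. The sketch below assumes the fragment list is in its original order; match_originals, the optional k_high check, and the sample candidates are hypothetical names and data chosen for illustration.

```python
def match_originals(fragments, originals, k_high=None):
    """Return the originals that could have produced the fragment list.

    Assumes fragments[i] was cut from index i of the original string.
    If k_high is given, also require the expected number of fragments.
    """
    matches = []
    for original in originals:
        if k_high is not None:
            expected = max(len(original) - k_high + 1, 0)
            if expected != len(fragments):
                continue
        # Slices past the end come back shorter, so they fail the comparison
        if all(original[i:i + len(f)] == f for i, f in enumerate(fragments)):
            matches.append(original)
    return matches


fragments = ["DHD", "H", "DHDDHE", "HD", "DDHE", "D", "HEBEN", "EB"]
originals = ["DHDHDDHEBENRJ", "DHDHDDHEBENXX", "DHDHEEEEBENRJ"]
print(match_originals(fragments, originals, k_high=6))
# ['DHDHDDHEBENRJ', 'DHDHDDHEBENXX']
```

Two candidates match here because they differ only in the tail that no fragment covers, so the result can be ambiguous when originals share a long prefix.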

Jack Arnestad
  • I don't think you can. I tried a couple of different strings with your fragmenter, and you aren't guaranteed to get all of the characters of the input string back in the output. – Patrick Haugh May 12 '18 at 19:02
  • @PatrickHaugh Would it be possible if we had access to the list of original sequences? My fragmenter generally cuts off the end of the sequence, but maybe matching the concatenated (not-complete) version of the subsequences with the original sequence might help? – Jack Arnestad May 12 '18 at 19:18
