I am using the following class to generate subsequences of a string:
import numpy as np


class SlidingKmerFragmenter:
    """
    Slide only a single string
    """

    def __init__(self, k_low, k_high):
        self.k_low = k_low
        self.k_high = k_high
        self.rng = np.random.RandomState(1234)

    def apply(self, seq):
        # One fragment per start position i; each length is drawn uniformly
        # from [k_low, k_high] (randint's upper bound is exclusive).
        return [seq[i: i + self.rng.randint(self.k_low, self.k_high + 1)]
                for i in range(len(seq) - self.k_high + 1)]
For example, I could do:
generator = SlidingKmerFragmenter(1, 6)
generator.apply("DHDHDDHEBENRJ")
to get overlapping subsequences of this string, where each subsequence can range from 1 to 6 characters in length.
How can I go from the output of this function back to the original string? I want to concatenate the subsequences together but some portions are overlapping.
Thanks! Jack
EDIT: Just to clarify, the solution in "How can I merge overlapping strings in python?" does not seem to work for this particular example because of the repeated Ds and Hs in my example string.
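
One possible way around the repeated-character ambiguity is to rely on position instead of overlap matching. This is only a sketch, and it assumes the fragments are still in the order `apply` returned them, so that fragment `i` starts at offset `i` of the original string, and that `k_low >= 1` so no fragment is empty. The helper name `reassemble` is hypothetical. Note that it can only recover the part of the string that some fragment actually covers; characters past the furthest-reaching fragment are not in the output of `apply` at all.

```python
def reassemble(fragments):
    """Rebuild the covered part of the original string.

    Assumption: fragments[i] starts at offset i of the original string
    (the order produced by SlidingKmerFragmenter.apply).
    """
    chars = []
    for offset, frag in enumerate(fragments):
        for j, ch in enumerate(frag):
            pos = offset + j
            if pos < len(chars):
                # Position already filled by an earlier fragment: must agree.
                if chars[pos] != ch:
                    raise ValueError(f"fragment {offset} conflicts at position {pos}")
            elif pos == len(chars):
                # First time this position is seen: extend the string.
                chars.append(ch)
            else:
                # Only reachable if an earlier fragment was empty.
                raise ValueError(f"no fragment covers position {len(chars)}")
    return "".join(chars)


# Hand-written fragments of "DHDHDDHEBENRJ" starting at offsets 0, 1, 2, 3.
print(reassemble(["DH", "HDHD", "D", "HDD"]))  # DHDHDD
```

Because every character is placed by its offset, the repeated Ds and Hs do not matter here; the trade-off is that the approach stops working if the list is shuffled or fragments are dropped.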
Also, if I had a list of original sequences, how could I match a list of overlapping subsequences back to the original sequence it came from?
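
For the matching part, a minimal sketch under the same assumption (fragment `i` starts at offset `i` of its source string) would be to test every candidate directly against the fragments. The function name `matching_originals` is hypothetical, and more than one candidate can match if several originals share the covered prefix.

```python
def matching_originals(fragments, originals):
    """Return the originals that are consistent with every fragment.

    Assumption: fragments[i] starts at offset i of its source string.
    """
    return [
        s for s in originals
        # A slice that runs off the end of s is shorter than f, so it fails.
        if all(s[i:i + len(f)] == f for i, f in enumerate(fragments))
    ]


originals = ["DHDHDDHEBENRJ", "DHDHDHHEBENRJ", "XYZ"]
fragments = ["DH", "HDHD", "D", "HDD"]
print(matching_originals(fragments, originals))  # ['DHDHDDHEBENRJ']
```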