
I need to find the substring of s that is closest to a string p by Hamming distance. The function should return a tuple containing the index of the closest substring, its Hamming distance to p, and the closest substring itself.

I have this code so far:

def ham_dist(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

But I am not sure how to approach this part of the assignment:

Your function should return (1,2,'bcef') because the closest substring is 'bcef', it begins at index 1 in s, and its Hamming distance to p is 2.

In your function, you should use your ham_dist function from part (a). If there is more than one substring with the same minimum distance to p, return any of them.

I feel like there is something missing in the description of the problem. It would help if you gave an example: input(s) and desired result. – FMc Mar 21 '19 at 00:33

Welcome to SO! Can you please give an example input and your expected output? This will help others answer your question. Edit: Sorry to step on your toes, @FMc! – Niayesh Isky Mar 21 '19 at 00:33

2 Answers


You can walk through the source string and, at each index, compute the Hamming distance between your search string and the substring of the same length starting at that index. Whenever that distance is smaller than the best one seen so far, save the index, the distance, and the substring. When the loop finishes, the saved values describe the closest substring.

source_string = "pGpEusuCSWEaPOJmamlFAnIBgAJGtcJaMPFTLfUfkQKXeymydQsdWCTyEFjFgbSmknAmKYFHopWceEyCSumTyAFwhrLqQXbWnXSn"
search_string = "tyraM"

def ham_dist(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

def search_min_dist(source, search):
    n = len(search)
    # Start from the first window; n is the largest possible distance.
    index = 0
    min_dist = n
    min_substring = source[:n]
    for i in range(len(source) - n + 1):
        candidate = source[i:i + n]
        d = ham_dist(search, candidate)
        if d < min_dist:
            # Strictly smaller, so the first of several ties is kept.
            index, min_dist, min_substring = i, d, candidate
    return (index, min_dist, min_substring)

print(search_min_dist(source_string, search_string))

Output

(28, 2, 'tcJaM')
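
As a quick sanity check, the same loop can be run on the example from the question. This is only a sketch: the inputs s = 'abcefgh' and p = 'cdef' are an assumption, since the question does not state them (they are the values used in the other answer, and they reproduce the expected result).

```python
def ham_dist(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

def search_min_dist(source, search):
    n = len(search)
    index, min_dist, min_substring = 0, n, source[:n]
    for i in range(len(source) - n + 1):
        candidate = source[i:i + n]
        d = ham_dist(search, candidate)
        if d < min_dist:
            index, min_dist, min_substring = i, d, candidate
    return (index, min_dist, min_substring)

# Assumed inputs; the question only gives the expected result.
s = 'abcefgh'
p = 'cdef'
print(search_min_dist(s, p))  # (1, 2, 'bcef')
```

The windows are 'abce', 'bcef', 'cefg' and 'efgh', with distances 4, 2, 3 and 4 to p, so the window at index 1 wins.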

The answer from Hugo Delahaye is a good one and does a better job of answering your question directly, but a different way to think about problems like this is to let Python's min() function figure out the answer. Under this type of data-centric programming (see Rule 5), your goal is to organize the data to make that possible.

s = 'abcefgh'
p = 'cdef'
N = len(p)

substrings = [
    s[i : i + N]
    for i in range(0, len(s) - N + 1)
]

result = min(
    (ham_dist(p, sub), sub, i)
    for i, sub in enumerate(substrings)
)

print(substrings)    # ['abce', 'bcef', 'cefg', 'efgh']
print(result)        # (2, 'bcef', 1)

FMc
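
Note that the min() version above yields (distance, substring, index), while the question asks for (index, distance, substring). One possible way to get that order, sketched here rather than taken from either answer, is to build the tuples in the requested order and pass min() a key that compares only the distance. The function name closest_substring and the length check are assumptions added for illustration.

```python
def ham_dist(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

def closest_substring(s, p):
    """Return (index, distance, substring) for the window of s closest to p."""
    n = len(p)
    if n > len(s):
        # Assumption: treat a pattern longer than s as an error.
        raise ValueError("p is longer than s")
    candidates = (
        (i, ham_dist(p, s[i:i + n]), s[i:i + n])
        for i in range(len(s) - n + 1)
    )
    # Compare by distance only; min() returns the first minimal item on ties.
    return min(candidates, key=lambda c: c[1])

print(closest_substring('abcefgh', 'cdef'))  # (1, 2, 'bcef')
```

Using a key also means ties are broken by position (the earliest window wins) instead of by comparing the substrings alphabetically, which is what happens when the whole tuple is compared.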