Interpreting Hamming Distance speed in python

Question

I've been working on making my python more pythonic and toying with runtimes of short snippets of code. My goal to improve the readability, but additionally, to speed execution.

This example conflicts with the best practices I've been reading about and I'm interested to find the where the flaw in my thought process is.

The problem is to compute the hamming distance on two equal length strings. For example the hamming distance of strings 'aaab' and 'aaaa' is 1.

The most straightforward implementation I could think of is as follows:

def hamming_distance_1(s_1, s_2):
    dist = 0
    for x in range(len(s_1)):
        if s_1[x] != s_2[x]:  dist += 1
    return dist

Next I wrote two "pythonic" implementations:

def hamming_distance_2(s_1, s_2): 
    return sum(i.imap(operator.countOf, s_1, s_2))

and

def hamming_distance_3(s_1, s_2): 
    return sum(i.imap(lambda s: int(s[0]!=s[1]), i.izip(s_1, s_2)))

In execution:

s_1 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
s_2 = (''.join(random.choice('ABCDEFG') for i in range(10000)))
print 'ham_1  ',  timeit.timeit('hamming_distance_1(s_1, s_2)',  "from __main__ import s_1,s_2, hamming_distance_1",number=1000)
print 'ham_2  ',  timeit.timeit('hamming_distance_2(s_1, s_2)',  "from __main__ import s_1,s_2, hamming_distance_2",number=1000)
print 'ham_3  ',  timeit.timeit('hamming_distance_3(s_1, s_2)',  "from __main__ import s_1,s_2, hamming_distance_3",number=1000)

returning:

ham_1   1.84980392456
ham_2   3.26420593262
ham_3   3.98718094826

I expected that ham_3 would run slower then ham_2, due to the fact that calling a lambda is treated as a function call, which is slower then calling the built in operator.countOf.

I was surprised I couldn't find a way to get a more pythonic version to run faster then ham_1 however. I have trouble believing that ham_1 is the lower bound for pure python.

Thoughts anyone?

Ultimately this is the solution with fastest time, sum(i.imap(operator.ne, s_1, s_2)) runs in 1.03. — Scrocco, Feb 05 '15 at 18:30

score 1 · Accepted Answer · answered Feb 04 '15 at 19:41

1

The key is making less method lookups and function calls:

def hamming_distance_4(s_1, s_2):
    return sum(i != j for i, j in i.izip(s_1, s_2))

runs at ham_4 1.10134792328 in my system.

ham_2 and ham_3 makes lookups inside the loops, so they are slower.

answered Feb 04 '15 at 19:41

utdemir

26,532
10
62
81

Yup, that's it. Thanks. ham_4 runs in roughly 1.67991399765 for comparison. – Scrocco Feb 04 '15 at 21:48
1

I was confused by the slowness of ham_2, after more digging I realized that operator.countOf runs a for loop over the strings (each of length 1) and the compiler doesn't optimize that out. Using operator.ne has no for loop and runs in 1/2 the time as your above example. – Scrocco Feb 05 '15 at 18:35

score -1 · Answer 2 · answered Feb 04 '15 at 19:45

-1

I wonder if this might be a bit more Pythonic, in some broader sense. What if you use http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html ... a module that already implements what you're looking for?

answered Feb 04 '15 at 19:45

Jim Dennis

17,054
13
68
116

The OP is asking for hamming distance calculation on arrays of strings, not int. Scipy spatial distance calculations only work for integers. – SummerEla Dec 27 '18 at 21:14

Interpreting Hamming Distance speed in python

2 Answers2