2

Since this must be quite a common scenario I'm wondering whether there are any pre-existing solutions to the following.

Lets say I have a set of N strings, and I am computing distances the distances between them. In this case it's a Hamming distance, but that's not really important.

If I wanted to make this as quick as possible, I would avoid self comparisons like so:

 def hamming_distance(string1, string2):
     """Return the Hamming distance between equal-length sequences"""
     if len(string1) != len(string2):
         raise ValueError("Undefined for sequences of unequal length")
     return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


ratios=[]
 for a, i in enumerate(string_list):
     for b, j in enumerate(string_list):
         if a == b: # Avoid self comparisons for speed
             break
     ratios.append(hamming_distance(string_list[i], string_list[j]))
 return ratios

But since this is 'symmetric', I could also throw away any reciprocal comparisons which would increase the speed if the strings were numerous and/or large.

Is there a generally accepted/elegant way of doing this in the above set up?

I'm also aware that in general it's advised to avoid nested loops as they can be slow - so if there is a better way of achieving this pairwise iteration over lists (maybe something in collections?) and incorporating the avoidance of self and reciprocal comparisons, I'm all ears.

Joe Healey
  • 1,232
  • 3
  • 15
  • 34

2 Answers2

3

You can limit the nested for to start from the next item to the current item in the outer loop. In that way, you only run through each unique once:

for i, s1 in enumerate(string_list):
   for s2 in string_list[i+1:]:
      ratios.append(hamming_distance(s1, s2))
return ratios

You could put this in a list comp.:

ratios = [(s1, s2, hamming_distance(s1, s2)) for i, s1 in enumerate(string_list) 
                                                     for s2 in string_list[i+1:]]

You could put the strings in a tuple as part of the result like I've done in the list comp.

Moses Koledoye
  • 77,341
  • 8
  • 133
  • 139
3

What you are looking for is implemented by itertools.combinations().

>>> import itertools
>>> a = [1,2,3]
>>> list(itertools.combinations(a, 2))
[(1, 2), (1, 3), (2, 3)]

So in your case it will look like:

for a, b in itertools.combinations(string_list, 2):
    ratios.append(hamming_distance(a, b))
Constantin S. Pan
  • 1,362
  • 11
  • 4