9

If you have n binary strings, each of length m, is there a faster way to determine the minimum Hamming distance between any pair than to compare all O(n^2) pairs and for each to compute their Hamming distance?

That is can it be done in less than O(n^2m) time?

Apart from anything else and as commented below, the Hamming distance is a proper distance function and so satisfies the triangle inequality, which makes me feel there should be a faster solution.

Simd
  • 19,447
  • 42
  • 136
  • 271
  • There are plenty of papers and other sources on the www. Try google! What have **you** tried so far? – MrSmith42 Apr 14 '18 at 09:32
  • @MrSmith42 I have tried a web search for "minimum hamming distance" but with no luck so far. – Simd Apr 14 '18 at 09:36
  • 1
    It's a distance so it verifies `d(a,c) ≤ d(a,b)+d(b,c)`, and that could be certainly used not to test every pair. – Jean-Baptiste Yunès Apr 14 '18 at 09:40
  • 1
    This may also helps you (you have to find a good ordering): https://stackoverflow.com/questions/38900004/algorithm-to-test-minimum-hamming-distance-against-a-set – Jean-Baptiste Yunès Apr 14 '18 at 09:43
  • 1
    @MrSmith42: `n` binary strings have `(n-1)*n/2` Hamming distances, as the Hamming distance is defined between two strings. The efficient nearest neighbor search in such a Hamming space is not trivial. My Google search at least has not been fruitful so far. – Axel Kemper Apr 14 '18 at 15:58
  • 2
    You could construct a [kd-tree](https://en.wikipedia.org/wiki/K-d_tree) and (for modest values of `m`) use a lookup-table to determine the distance between two strings. This leads to complexity `O(n log n)`. – Axel Kemper Apr 17 '18 at 06:53

2 Answers2

5

Consider using Locality Sensitive Hashing, which is a general technique that can be applied to certain distance metrics including Hamming distance. Excerpt from Wikipedia:

LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items).

In short, you can use LSH to obtain buckets, brute-force Hamming distances within each bucket, and output the smallest distance found. To obtain the right answer with higher probability, you can tweak parameters of the LSH algorithm and/or run LSH multiple times (to get different allocations of items to buckets). I believe you can get arbitrarily close to the correct (optimal) answer with a failure rate exponentially-decreasing in runtime. (You might have to binary search over the LSH parameters, if your Hamming distances are all very close, but you'll still avoid computing n^2 Hamming distances.)

The algorithm and analysis are pretty involved, so I don't think I can write a complete summary here at the moment (it's about 2-3 hours worth of lecture material). I recommend taking a look at the lecture notes/slides here, here, and here; they all cover LSH (in varying degrees of detail) with some mention of Hamming distance.

k_ssb
  • 6,024
  • 23
  • 47
-3

It's not possible to determine the true minimum without performing a full search with O(n^2m). All faster variants will yield only a maybe best minimum.

Kind of proof:

1. Assume there would be a faster solution.
2. Then for one or more combinations the hamming distance is not computed.
3. Omitting a combination means, that there is a criteria to decide
   the combination can't be better than the current best minimum.
4. There is no know criteria.

The triangle inequality is unfortunately helping only to shorten the computation of the true maximum:

  1. compute the distances Di0, sort them and pick a starting maximum.
  2. now omit all Dij where Di0 + D0j <= current maximum.
bebbo
  • 2,830
  • 1
  • 32
  • 37
  • Do you have justification for your first sentence? – k_ssb Apr 23 '18 at 22:36
  • I also wonder that, and it seems easy to find examples where you don't have to compare all strings. If the "first" two strings have a hamming distance ==1 you only need to verify that there are no duplicate strings, which should be O(n*m). I don't know if that can be generalized. – Hans Olsson Apr 24 '18 at 07:43
  • What if the last comparison yields the hamming distance == 1? An if-I-am-lucky example is not a general rule. Depending on further constraints faster solutions are possible. But without any constrain there is no way to deduce an lower limit for Dij. It's an upper limit and this only helps finding the maximum. – bebbo Apr 24 '18 at 10:58
  • Your "kind of proof" seems incredibly shaky -- I don't see how you come to the conclusion that there is no criteria that lets you safely skip a distance computation. – k_ssb Apr 25 '18 at 10:28
  • 3
    No, that's not how it works. You are claiming that _no such criteria_ exists. The burden of proof is on you to disprove the existence of such criteria. – k_ssb Apr 26 '18 at 00:45
  • First provide the proof that God either exists or not. Your choice. How do you prof that's impossible to compute the division x=a/b by only using the plus operator? – bebbo Apr 26 '18 at 07:19
  • I don't see how being pedantic over something the poster already qualified with "kind of" makes this answer incorrect. If someone has an actual `O(n^2m)` answer that isn't a "maybe best" then please, post it. – c z Apr 27 '18 at 16:19
  • @cz I have no problem with a "kind of" proof -- my issue with this answer is that the first sentence ("it's not possible...") is stated as a fact, which I think is likely false. – k_ssb Apr 27 '18 at 17:48
  • @pkpnd it is known to compute a division if you have sub or add+neg. But with only add.... – bebbo May 01 '18 at 10:22