3

I have n strings and I want to find the closest pair.

What's the fastest practical algorithm to find this pair?

Simd
  • 19,447
  • 42
  • 136
  • 271
  • So your dataset is 100k strings and you will have new queries arrive, or will you use as ONE query a string from the dataset? – gsamaras Sep 10 '17 at 06:10
  • @gsamaras For a given input of 100k strings, I want just one answer which is a pair that is closer to each other than all others. – Simd Sep 10 '17 at 06:11

1 Answers1

1

The paper "The Closest Pair Problem under the Hamming Metric", Min, Kao, Zhu seems to be what you are looking for, and it applies to finding a single closest pair.

For your case, where n0.294 < D < n, where D is the dimensionality of your data (1000) and n the size of your dataset, the algorithm will run in O(n1.843 D0.533).

gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • Looking at the abstract it seems the complexity they give for a practical algorithm is O(n^2 D/log^2 D). The other algorithm looks completely impractical to me. – Simd Sep 10 '17 at 06:19
  • Isn't the O(n^1.843 D^0.533) complexity algorithm one of the completely impractical ones from that paper? – Simd Sep 10 '17 at 06:31