
Using Levenshtein distance as the metric, I want to find the exact k-nearest neighbours of every element in a large set of strings, but I am not yet sure how large a value of k I will need. Is there an algorithm or data structure that lets me defer this choice and gradually increase k, without a significant efficiency cost compared with computing the larger k up front? If possible, I would also like the flexibility to use different values of k for different elements.

I have a number of data sets I could use, but I'd like to use one with 500,000 strings of roughly 100 characters each, which rules out methods that need close to O(N^2) calls to the distance function.

I have tried GNAT (Geometric Near-neighbor Access Tree), but found its k-NN queries a bit too slow (often approaching N distance-function calls per element).
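
To make concrete what I mean by deferring the choice of k, here is a rough sketch of the sort of interface I'm after (Python; the BK-tree is just a stand-in index and all names are my own, not a claim about how GNAT or any particular library works): a per-query generator that yields neighbours in non-decreasing distance order, so pulling one more item is the same as increasing k by one.

```python
import heapq
from itertools import count

def levenshtein(a, b):
    """Plain DP edit distance; in practice I'd swap in a C implementation."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Stand-in metric index; any structure that gives lower bounds would do."""
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})      # node = (word, {edge_distance: child})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return                  # duplicate string, nothing to add
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def neighbours(self, query):
        """Yield (distance, word) pairs in non-decreasing distance order.

        Best-first traversal: the heap mixes 'result' entries keyed by their
        exact distance with 'node' entries keyed by a lower bound on anything
        in their subtree, so a result only pops out once nothing unexplored
        can still be closer.  Consuming more of the generator is exactly
        "increasing k".
        """
        tie = count()                   # tie-breaker so heap tuples always compare
        heap = [(0, next(tie), 'node', self.root)]
        while heap:
            key, _, kind, payload = heapq.heappop(heap)
            if kind == 'result':
                yield key, payload
            else:
                word, children = payload
                d = levenshtein(query, word)
                heapq.heappush(heap, (d, next(tie), 'result', word))
                for edge, child in children.items():
                    # BK-tree invariant: every string under this child is at
                    # exactly `edge` from `word`, so by the triangle inequality
                    # it is at least |d - edge| from the query.
                    heapq.heappush(heap, (abs(d - edge), next(tie), 'node', child))

if __name__ == "__main__":
    tree = BKTree(["kitten", "sitting", "mitten", "biting", "bitten"])
    near = tree.neighbours("fitten")
    print(next(near))   # closest string first ...
    print(next(near))   # ... and I can keep pulling if k turns out too small
```

Because each query gets its own generator, different elements could in principle use different values of k. The question is whether something like this (or a structure better suited to edit distance at this scale) can keep the per-element work well below N distance computations.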

Aramdooli
  • What's the largest value of `k` that you expect you'll need? – user3386109 Jan 06 '17 at 19:30
  • As I hopefully implied, I'm not sure! As a stab in the dark, `5 <= k <= 20`? Maybe a lot more for some particular strings, but few enough to allow a more naive approach to cover those outliers separately, if necessary. – Aramdooli Jan 19 '17 at 15:33
  • Given a maximum value of `k=20`, I think your best approach is to calculate using the highest value of `k` in the first place. The data storage requirement is about 80 bytes per string, total of 40MB, to store the ID and distance for each of the 20 neighbors. – user3386109 Jan 19 '17 at 21:35
  • I'm not too concerned about space/memory (I wouldn't have used GNAT otherwise); it's the time complexity which is the issue. – Aramdooli Jan 27 '17 at 13:41

0 Answers