2

I have a large corpus and I'm trying to find the most similar n-grams in it. For that, I'm using difflib's `get_close_matches`.
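For context, here is a minimal example of how `get_close_matches` is typically used (the corpus and query strings below are made up for illustration):

```python
from difflib import get_close_matches

# hypothetical corpus of n-grams
corpus = ["new york", "new jersey", "york city", "newark"]

# returns up to n corpus entries with similarity ratio >= cutoff,
# best matches first
matches = get_close_matches("new yrok", corpus, n=3, cutoff=0.6)
print(matches)
```

Internally this computes a `SequenceMatcher` ratio against every corpus entry, which is why it scales poorly on a large corpus.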

The problem is that this procedure takes a lot of time. A friend suggested converting the n-grams to MD5 hashes first and then calculating the distance on those. I doubt that this will work. Is the distance between strings invariant under hashing? Can the distance be calculated efficiently on MD5 hash strings?

P.S.: What is the most efficient way to calculate the distance between strings (such as n-grams) in a large corpus?
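A quick experiment illustrates the doubt about the MD5 idea: a cryptographic hash is designed so that similar inputs produce unrelated outputs, so string similarity is destroyed by hashing. The sketch below compares two nearly identical strings directly and via their MD5 hex digests, using `SequenceMatcher` (the same similarity measure `get_close_matches` uses); the example strings are made up.

```python
import hashlib
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()

s1, s2 = "machine learning", "machine learnings"  # differ by one character
h1 = hashlib.md5(s1.encode()).hexdigest()
h2 = hashlib.md5(s2.encode()).hexdigest()

print(similarity(s1, s2))  # close to 1.0
print(similarity(h1, h2))  # much lower: the hashes look like unrelated random strings
```

So hashing does not preserve distances, and computing edit distance on the fixed-length hex digests measures nothing about the original strings.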

Yanirmr
  • 923
  • 8
  • 25
  • 1
    See this answer: https://stackoverflow.com/questions/21408760/better-fuzzy-matching-performance TLDR: use `fuzzyset` instead. It is a lot faster. – amdex Jun 09 '20 at 08:48

1 Answer

2

A promising approach would be metric embedding. In the paper Convolutional Embedding for Edit Distance, the researchers state that their algorithm can accelerate the search by orders of magnitude. After training the metric embedding, you can apply approximate nearest neighbor algorithms to find the k texts with the smallest distances.
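To make the two-step idea concrete (embed each string as a vector, then do a nearest-neighbor search over the vectors), here is a minimal pure-Python sketch. It uses simple character-trigram count vectors as a stand-in for the learned embedding, and brute-force cosine search as a stand-in for a real ANN index (e.g. Faiss or Annoy); the corpus strings are made up.

```python
import math
from collections import Counter

def trigram_vector(s):
    """Crude embedding: counts of character trigrams (stand-in for a learned model)."""
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = ["new york city", "new jersey", "york minster", "los angeles"]
vectors = [trigram_vector(s) for s in corpus]  # precomputed once, offline

# query time: embed the query and find the nearest corpus vector
query = trigram_vector("new york")
best = max(range(len(corpus)), key=lambda i: cosine(query, vectors[i]))
print(corpus[best])
```

The point of the paper's approach is that the expensive pairwise edit-distance computation is replaced by vector comparisons, which ANN libraries can answer in sublinear time once the index is built.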

HTH.

Lerner Zhang
  • 6,184
  • 2
  • 49
  • 66