I have a big corpus and I'm trying to find the most similar n-grams in it. For that, I'm using get_close_matches.
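For context, here is a minimal sketch of the kind of lookup described, assuming the function meant is `difflib.get_close_matches` from the Python standard library. The helper name `most_similar`, the sample n-grams, and the `n` and `cutoff` values are made up for illustration and are not taken from the real corpus.

```python
import difflib

def most_similar(ngram, candidates, n=3, cutoff=0.6):
    """Return up to n candidates that are closest to ngram (hypothetical helper)."""
    # Exclude the query itself so it does not match with a perfect score.
    others = [c for c in candidates if c != ngram]
    # get_close_matches scores the query against every candidate,
    # so each lookup is linear in the number of candidates.
    return difflib.get_close_matches(ngram, others, n=n, cutoff=cutoff)

# Illustrative data only.
corpus_ngrams = ["the quick brown", "the quick brow", "a lazy dog", "the quack brown"]
matches = most_similar("the quick brown", corpus_ngrams)
print(matches)
```

Running this for every n-gram against every other n-gram grows quadratically with the corpus size, which is likely where the time goes.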
The problem is that this procedure takes a lot of time. A friend suggested that I convert the n-grams to MD5 hashes and then calculate the distance between the hashes. I suspect that it will work. Is the distance invariant to hashing? Is distance calculation more efficient on MD5 hashes than on the original strings?
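One way to test the hashing idea empirically is to compare the similarity of two nearly identical n-grams with the similarity of their MD5 digests. This is only a sketch: the helper names `md5_hex` and `ratio` and the sample strings are hypothetical, and `difflib.SequenceMatcher` stands in for whatever distance is actually used.

```python
import difflib
import hashlib

def md5_hex(text):
    """Hex MD5 digest of a string (always 32 hex characters)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def ratio(a, b):
    """Similarity in [0, 1] as computed by difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Two n-grams that differ by a single character (illustrative data).
a, b = "the quick brown", "the quick brow"

raw_similarity = ratio(a, b)                       # on the original strings
hashed_similarity = ratio(md5_hex(a), md5_hex(b))  # on the MD5 digests

print(raw_similarity, hashed_similarity)
```

As far as I understand, MD5 is designed so that a small change in the input produces an unrelated digest, so the second number should say nothing about how close the original n-grams are. The digests are also 32 characters long, which is longer than many short n-grams, so the comparison itself would not obviously get cheaper.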
P.S. What is the most efficient way to calculate the distance between strings (such as n-grams) in a large corpus?