
I was searching for a few AI/ML and non-AI/ML solutions to the "near duplicate detection" problem (text, image, audio), and I found that there is a similar problem, "nearest neighbor search", which also seems to be handled in exactly the same way as near duplicate detection. I'm wondering whether there are any differences at all between these two problems, or between their solutions.

Murali Mopuru

1 Answer


The two problem names seem semantically the same from an English perspective.

In a nearest neighbor search you have a set of elements and, given a reference element, you want to search for an element in the set that is the closest to the reference with respect to a given metric.

In near duplicate detection you have a set of elements and, given a reference element, you want to search for an element in the set that is the closest to being a duplicate of the reference with respect to a given metric.

Having said that, in the literature I see people usually using the latter name when the elements in the set are textual documents. In that case, one example algorithm collects the set of all windows of size k of each document (its k-shingles) and compares two documents using the Jaccard similarity between their shingle sets (the number of shingles the two documents have in common divided by the number of distinct shingles overall). To avoid computing the Jaccard similarity explicitly, there is a useful theorem: if you hash all the k-shingles to 64-bit integers (for example) and pick a random permutation of the 64-bit integers, then, after applying the permutation to each document's set of hashed shingles, the probability that the two sets have the same minimum element equals the Jaccard similarity between the two documents. This is the idea behind the MinHash technique.
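A minimal sketch of the shingling idea, assuming character-level shingles (the function names are my own, and the salted CRC32 below merely stands in for a random permutation; a real MinHash implementation would pack many independent permutations into a fixed-size signature):

```python
import random
import zlib

def shingles(text, k=5):
    """Return the set of all length-k character windows (k-shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

def minhash_estimate(a, b, num_perms=200, seed=42):
    """Estimate the Jaccard similarity by min-hashing.

    Each "permutation" is simulated by hashing shingles with a different
    random salt; the fraction of salts for which the minimum hashed value
    agrees between the two sets estimates the Jaccard similarity.
    """
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_perms):
        salt = rng.getrandbits(32).to_bytes(4, "big")
        h = lambda s: zlib.crc32(salt + s.encode())
        if min(map(h, a)) == min(map(h, b)):
            matches += 1
    return matches / num_perms

doc1 = shingles("the quick brown fox jumps over the lazy dog")
doc2 = shingles("the quick brown fox leaps over the lazy dog")
print(jaccard(doc1, doc2))
print(minhash_estimate(doc1, doc2))
```

The two printed values should be close: the MinHash estimate converges to the exact Jaccard similarity as the number of simulated permutations grows.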

On the other hand, I see people usually using the former name when the set of elements is a subset of R^n (for example). In that case, many techniques exist; some useful data structures are octrees and k-d trees.
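As a concrete illustration of the R^n case, here is a toy k-d tree in pure Python (function names are my own; real code would typically use a library implementation such as SciPy's `cKDTree`):

```python
import math

def build_kdtree(points, depth=0):
    """Build a k-d tree, cycling through the splitting axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, depth=0, best=None):
    """Return the point in the tree closest to `target` (Euclidean metric)."""
    if node is None:
        return best
    point = node["point"]
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    axis = depth % len(target)
    diff = target[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, depth + 1, best)
    # Only descend the far side if the splitting plane is closer
    # than the best distance found so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(nearest(tree, (9, 2)))  # → (8, 1)
```

The pruning step is what makes the tree useful: whole subtrees are skipped whenever their splitting plane lies farther away than the best candidate found so far.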

Having said that, people also use vectorization techniques to convert other kinds of elements into subsets of R^n, for example signal2vec, word2vec, etc.
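For illustration, once elements are embedded as vectors, nearest neighbor search reduces to comparing vectors, e.g. by cosine similarity (the 3-dimensional "embeddings" below are made up for the example, not real word2vec output):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors in R^n."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d embeddings standing in for word2vec output.
embeddings = {
    "cat": (0.9, 0.1, 0.3),
    "dog": (0.8, 0.2, 0.4),
    "car": (0.1, 0.9, 0.7),
}

def nearest_word(query, vocab):
    """Nearest neighbor of `query`'s vector under cosine similarity."""
    return max(
        (w for w in vocab if w != query),
        key=lambda w: cosine_similarity(vocab[query], vocab[w]),
    )

print(nearest_word("cat", embeddings))  # → dog
```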

joaopfg