
I'm trying to do machine learning on a real-life dataset (hotel reviews). Unfortunately, it's plagued by spam, which comes in the form of almost identical reviews, complicating matters for me greatly.

I would like to remove "almost duplicates" from the dataset based on the edit distance or something similar, and since the dataset size is >100K, the algorithm has to be subquadratic in the size of the dataset. Right now I can only think of flagging individual sentences or phrases that are repeated too often and then removing all reviews that contain them, but it's easy to see how such a strategy could backfire. Is there a common algorithm that does better?
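
To make the strategy described above concrete, here is a minimal Python sketch of the "flag over-repeated sentences, then drop reviews that contain them" idea; the crude sentence splitting and the `threshold` value are illustrative assumptions, not part of the original question.

```python
from collections import Counter
import re

def drop_reviews_with_spammy_sentences(reviews, threshold=10):
    """Flag sentences repeated across too many reviews, then drop every
    review that contains a flagged sentence. Illustrative only: `threshold`
    and the naive sentence splitting are arbitrary assumptions."""
    def sentences(text):
        return [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]

    # Count how many reviews each sentence appears in.
    counts = Counter(s for review in reviews for s in set(sentences(review)))
    spammy = {s for s, c in counts.items() if c >= threshold}
    return [r for r in reviews if not any(s in spammy for s in sentences(r))]
```

As the question already notes, this can backfire: a legitimate stock phrase ("great location") can get genuine reviews thrown away.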

Alexei Averchenko
  • This smells like a bulk of nearest neighbor queries, albeit with an unusual distance metric (I don't think edit distance satisfies the triangle inequality). I suggest looking into the usual data structures for speeding up nearest neighbor searches. – Jan 10 '14 at 15:30
  • What about manually seeding a bayesian filter with some of the selected spam entries? – JimR Jan 10 '14 at 15:33
  • what is "subquadratic"? – Tomas Jan 10 '14 at 18:14
  • This is the "near duplicates" detection problem. One common technique is called shingling. If you search on these terms, you should find some useful algorithms. – Dave Jan 10 '14 at 18:20
  • @delnan Actually, edit distance *does* satisfy the axioms of a metric. – Alexei Averchenko Jan 11 '14 at 10:31
  • @AlexeiAverchenko You're right, my hunch yesterday turned out to be wrong. –  Jan 11 '14 at 10:36
  • @Tomas subquadratic means o(n^2) complexity class. I'm not trying to reinvent the wheel, I'm asking people to point the wheel out to me so that I could use it. – Alexei Averchenko Jan 12 '14 at 02:10
  • Alexei, o(n^2) is quadratic, not subquadratic. – Tomas Jan 12 '14 at 05:38
  • @Tomas it's little-o, not big-O. – Alexei Averchenko Jan 12 '14 at 08:32
  • @delnan "The act of continuing to date a girl you want to break up with while you start dating a new girl" I gotta remember that :) – Alexei Averchenko Jan 12 '14 at 08:36
  • **Locality-Sensitive Hashing (LSH)** seems an appealing heuristic, see for example [Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions](http://people.csail.mit.edu/indyk/p117-andoni.pdf). But I can only second [ElKamina](http://stackoverflow.com/a/21051579/341970) that "solving this problem in whole might involve writing a decent research paper." The basic idea is that with LSH, similar entries (w.r.t. a threshold) are likely to end up in the same bin, but entries sufficiently different end up in different bins (a rough sketch of shingling + LSH follows these comments). – Ali Jan 12 '14 at 17:13
  • To apply LSH, you need to turn all the reviews into vectors which is in itself quite tricky. In any case, +1, a very interesting question! – Ali Jan 12 '14 at 17:15
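
Following up on the shingling and LSH comments above, here is a rough, illustrative sketch of one common combination: MinHash signatures computed over word shingles, bucketed by LSH bands so that only reviews sharing a bucket are ever compared directly. The shingle size, number of hash functions, and band width below are arbitrary assumptions, and a production version would typically use a dedicated library (e.g. `datasketch`) rather than hand-rolled hashing.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """k-word shingles of a review (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash function; matching minima estimate
    the Jaccard similarity of the underlying shingle sets."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(num_hashes)
    ]

def candidate_pairs(reviews, num_hashes=64, bands=16):
    """LSH banding: reviews whose signatures agree on any whole band fall
    into the same bucket, and only bucket-mates become candidate pairs."""
    rows = num_hashes // bands
    signatures = [minhash_signature(shingles(r), num_hashes) for r in reviews]
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```

Near-duplicate reviews agree on at least one band with high probability, so the exact comparison (edit distance, Jaccard on shingles, etc.) only has to run on the candidate pairs, which keeps the total work well below n² on typical spam-heavy data.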

2 Answers


Obviously solving this problem in whole might involve writing a decent research paper. Here is my suggestion.

In bioinformatics we face this problem all the time. The most widely used algorithm is BLAST (http://en.wikipedia.org/wiki/BLAST). Go through the algorithm and you will get an idea of what is involved.
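
Not part of the original answer: BLAST itself is a full alignment tool, but its core idea, indexing short seed "words" (k-mers), using shared seeds to shortlist candidates, and only then running an expensive comparison, can be sketched roughly as below. The values of `k` and `min_shared` are arbitrary assumptions.

```python
from collections import defaultdict

def kmers(text, k=8):
    """Overlapping character k-mers, playing the role of BLAST's seed words."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def candidate_duplicates(reviews, k=8, min_shared=5):
    """Index reviews by k-mer, then keep only pairs sharing enough seeds;
    only those pairs would go on to an edit-distance / alignment check."""
    index = defaultdict(set)
    for idx, review in enumerate(reviews):
        for km in kmers(review, k):
            index[km].add(idx)
    shared = defaultdict(int)
    for idx, review in enumerate(reviews):
        for km in kmers(review, k):
            for other in index[km]:
                if other > idx:
                    shared[(idx, other)] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]
```

Very common k-mers (stop-word phrases) can pull this back toward all-pairs behaviour, so in practice one would drop k-mers that occur in too many reviews, much as BLAST filters low-complexity regions.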

ElKamina

A quick and dirty way of doing this is to find the keywords that occur in the reviews, store them in a universal dictionary, and then scan each document for those words, building a hash set of keywords for each document. Then compare all pairs of documents and count the keywords each pair shares; if the count is greater than a threshold, mark the pair as similar. You can use a fast union-find data structure to merge similar documents, so at the end you will have sets of similar documents.
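
A rough Python rendering of this idea, not from the original answer: per-document keyword sets, a pairwise overlap count, and a small union-find to merge similar documents into groups. The keyword set passed in and the `threshold` value are placeholder assumptions.

```python
def find(parent, x):
    """Find the group representative, with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_similar(reviews, keywords, threshold=5):
    """Keyword set per review, pairwise overlap count, union-find grouping.
    Note: the pairwise loop is the quadratic part mentioned in the answer."""
    keyword_sets = [set(r.lower().split()) & keywords for r in reviews]
    parent = list(range(len(reviews)))
    for i in range(len(reviews)):
        for j in range(i + 1, len(reviews)):
            if len(keyword_sets[i] & keyword_sets[j]) >= threshold:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    parent[rj] = ri  # union the two groups
    groups = {}
    for i in range(len(reviews)):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())
```

Calling `group_similar(reviews, keywords={"great", "location", "staff"}, threshold=2)` would return lists of review indices that ended up in the same similarity group.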

Note: I can't think of any way to make this subquadratic; it seems difficult because, in the worst case, you need to check all pairs of documents to determine whether any are similar.

Vikram Bhat