I'm trying to do machine learning on a real-life dataset (hotel reviews). Unfortunately, it's plagued by spam in the form of almost identical reviews, which complicates matters greatly.
I would like to remove these "almost duplicates" from the dataset based on edit distance or something similar. Since the dataset contains more than 100K reviews, the algorithm has to be subquadratic in the dataset size. Right now the only idea I have is to flag individual sentences or phrases that are repeated too often and then remove all reviews that contain them, but it's easy to see how such a strategy could backfire (a common phrase like "great location" might appear in many legitimate reviews). Is there a well-known algorithm that does better?
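For context, the naive strategy I described could be sketched like this (a toy Python sketch; the crude sentence splitter and the threshold value are arbitrary assumptions on my part):

```python
from collections import Counter
import re

def split_sentences(text):
    # Crude splitter: assumes '.', '!', '?' terminate sentences.
    return [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]

def remove_flagged_reviews(reviews, threshold=5):
    """Drop every review containing a sentence that occurs in more
    than `threshold` reviews across the whole dataset."""
    counts = Counter()
    for r in reviews:
        counts.update(set(split_sentences(r)))  # count each sentence once per review
    return [r for r in reviews
            if all(counts[s] <= threshold for s in split_sentences(r))]

reviews = [
    "Great hotel. Loved the pool.",
    "Great hotel. Loved the pool.",
    "Terrible service, never again.",
]
print(remove_flagged_reviews(reviews, threshold=1))
# only the third review survives; both copies of the duplicate are dropped
```

As the example shows, this removes every copy of a repeated review rather than keeping one representative, and any sufficiently common innocent phrase would wipe out legitimate reviews too, which is exactly the backfire risk I'm worried about.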