I have a large collection of short texts where I want to filter out texts that are very similar to each other (or exact duplicates). I'd like to achieve this using Apache Beam running on Google Cloud Dataflow.
I'm hoping to use the MinHash LSH algorithm to determine whether the similarity of two texts exceeds a certain threshold.
The MinHash LSH algorithm builds a form of hash table to (probabilistically) find similar texts. I'd expect this hash table to be around 1 GB for one million texts and to grow linearly with the number of texts.
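For concreteness, this is the kind of index I have in mind, using the datasketch library (my choice of library; the 128 permutations and the 0.8 threshold are assumptions on my part). Assuming 128 permutations with 8-byte hash values, each signature is on the order of 1 KB, which is roughly how I arrived at the ~1 GB figure for one million texts.

```python
# Sketch of the kind of MinHash LSH index I mean, using the datasketch
# library (my choice; any MinHash LSH implementation would work similarly).
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    """Build a MinHash signature from the text's tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

# The "hash table": texts whose estimated Jaccard similarity exceeds the
# threshold land in the same LSH buckets.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
lsh.insert("doc-1", minhash_of("the quick brown fox jumps over the lazy dog"))
lsh.insert("doc-2", minhash_of("an entirely unrelated sentence"))

# Querying returns the keys of previously inserted texts that (probably)
# exceed the similarity threshold with respect to the query text.
matches = lsh.query(minhash_of("the quick brown fox jumps over the lazy dog"))
print(matches)  # expect something like ['doc-1']
```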
The only way I see of mapping this use case onto Apache Beam's programming model is to use a Combine transform to build the hash table over all items (the accumulator would be the hash table itself; I'm able to implement "Merge Accumulators"), and then pass the resulting table as a side input to a ParDo that looks up each text in the hash table to check whether it collides with another text.
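To make that concrete, here is a minimal sketch of the pipeline I'm imagining (again assuming datasketch; the dict-of-signatures accumulator, the key scheme, and the way I handle the self-match are just the simplest things I could come up with, not necessarily the right way to do it):

```python
import apache_beam as beam
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # assumption; controls signature size and accuracy

def minhash_of(text):
    m = MinHash(num_perm=NUM_PERM)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

class BuildLshIndex(beam.CombineFn):
    """Accumulates (id, MinHash) pairs and emits one LSH index at the end."""

    def create_accumulator(self):
        return {}  # doc_id -> MinHash signature

    def add_input(self, acc, element):
        doc_id, text = element
        acc[doc_id] = minhash_of(text)
        return acc

    def merge_accumulators(self, accumulators):
        merged = {}
        for acc in accumulators:
            merged.update(acc)
        return merged

    def extract_output(self, acc):
        lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
        for doc_id, signature in acc.items():
            lsh.insert(doc_id, signature)
        return lsh

def keep_if_not_duplicate(element, lsh):
    doc_id, text = element
    matches = lsh.query(minhash_of(text))
    # Keep the text only if nothing other than itself collides with it.
    # (A real version would need a tie-break so one copy of each group survives.)
    if all(match == doc_id for match in matches):
        yield element

with beam.Pipeline() as p:
    texts = p | beam.Create([("doc-1", "the quick brown fox"),
                             ("doc-2", "the quick brown fox"),
                             ("doc-3", "an entirely different sentence")])
    lsh_index = texts | beam.CombineGlobally(BuildLshIndex())
    deduped = texts | beam.FlatMap(
        keep_if_not_duplicate, lsh=beam.pvalue.AsSingleton(lsh_index))
```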
Does this seem like a reasonable thing to do? Specifically, is it an issue that the accumulator could be several gigabytes in size?