
I have a large collection of short texts where I want to filter out texts that are very similar to each other (or exact duplicates). I'd like to achieve this using Apache Beam running on Google Cloud Dataflow.

I'm hoping to use the MinHash LSH algorithm to determine whether the similarity of two texts exceeds a certain threshold.

The MinHash LSH algorithm builds a form of hash table in which similar texts (probabilistically) collide. I'd expect this hash table to be around 1 GB for one million texts and to grow linearly with the number of texts.
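For concreteness, here is roughly what I mean by the hash table, as a minimal pure-Python sketch (the shingle size, 128 permutations and 32 bands are arbitrary illustrative choices; a library such as datasketch provides the same building blocks):

    import hashlib
    import random

    NUM_PERM = 128            # hash permutations per signature
    NUM_BANDS = 32            # LSH bands; 128 / 32 = 4 rows per band
    ROWS = NUM_PERM // NUM_BANDS
    PRIME = (1 << 61) - 1     # modulus for the universal hash family

    # Fixed seed so every worker draws the same permutation parameters.
    random.seed(42)
    PARAMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
              for _ in range(NUM_PERM)]

    def minhash_signature(text, k=3):
        """MinHash signature over character k-shingles of a non-empty text."""
        shingle_hashes = {
            int.from_bytes(hashlib.md5(text[i:i + k].encode()).digest()[:8], 'big')
            for i in range(max(1, len(text) - k + 1))
        }
        return [min((a * h + b) % PRIME for h in shingle_hashes)
                for a, b in PARAMS]

    def band_keys(signature):
        """LSH banding: texts sharing any band key are candidate duplicates."""
        for band in range(NUM_BANDS):
            yield (band, tuple(signature[band * ROWS:(band + 1) * ROWS]))

The hash table is then just a dict mapping each band key to the set of text ids that produced it, which is what grows linearly with the number of texts.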

The only way I can see of mapping this use case to Apache Beam's programming model is to use a Combine transform to build the hash table over all items (the accumulator would be the hash table itself; implementing "merge accumulators" is feasible), and then use the result as a side input for a ParDo in which I look up each text in the hash table to see whether it collides with another text.
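Here is a rough sketch of that pipeline with the Beam Python SDK, reusing the minhash_signature and band_keys helpers from above (BuildLshIndex and keep_if_unique are my own placeholder names):

    import apache_beam as beam

    class BuildLshIndex(beam.CombineFn):
        """Accumulator is the LSH hash table: band key -> set of text ids."""

        def create_accumulator(self):
            return {}

        def add_input(self, index, element):
            text_id, signature = element
            for key in band_keys(signature):
                index.setdefault(key, set()).add(text_id)
            return index

        def merge_accumulators(self, accumulators):
            merged = {}
            for index in accumulators:
                for key, ids in index.items():
                    merged.setdefault(key, set()).update(ids)
            return merged

        def extract_output(self, index):
            return index

    def keep_if_unique(element, index):
        """Emit a text only if none of its band keys collide with another id."""
        text_id, signature = element
        for key in band_keys(signature):
            if index.get(key, set()) - {text_id}:
                return  # probable near-duplicate; drop it
        yield element

    with beam.Pipeline() as p:
        signatures = (
            p
            | beam.Create([('id1', 'some short text'), ('id2', 'some short text!')])
            | beam.Map(lambda kv: (kv[0], minhash_signature(kv[1]))))

        index = signatures | beam.CombineGlobally(BuildLshIndex())

        unique = signatures | beam.FlatMap(
            keep_if_unique, index=beam.pvalue.AsSingleton(index))

Note that this variant drops every member of a colliding group rather than keeping one representative, and it materializes the entire table as a single side input, which is exactly where the multi-gigabyte accumulator worries me.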

Does this seem like a reasonable approach? Specifically, is it a problem that the accumulator could be several gigabytes in size?

ehrencrona
  • I'm doing something very similar. Any feedback on this? – Scharron Aug 09 '18 at 10:08
  • 1
    personally, i gave up on using beam for this. the local runner is single-threaded which makes it slow. running it on google cloud dataflow somehow allocated lots and lots of machines while still running slower than my single-threaded local instance. it's possible that i did something wrong, but it seemed like a lot of effort for something i could just hack without beam. – ehrencrona Aug 10 '18 at 13:18

0 Answers