I am trying to implement the MinHash algorithm as described in chapter 3, as simply as possible, in Spark. I have searched a lot everywhere. In the end I decided to follow the implementation from this blog, as Bill Dim proposes: https://blog.cluster-text.com/tag/minhash/ I just feel that something is wrong with my implementation, or that I have misunderstood something. What I have done so far is:
- document => n-grams (I use 9-grams of letters, as suggested in the book, but this can be changed to 5-grams of words, as proposed by Bill Dim)
- n-grams => MurmurHash3 (so that gives the hashed n-grams for every document)
- HashedNGramsRDD => find min(hashed n-gram) for every document
- HashedNGramsRDD ^ (199 random numbers), taking the min for each random number => 199 minimums of the XORed MurmurHash3 n-gram hashes
- So I have 200 minimums in total, and that is my MinHash signature.

Is this correct? Please help! Thanks in advance.
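
The steps listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the actual Spark code: all function names are hypothetical, `zlib.crc32` stands in for MurmurHash3 (which is not in the Python standard library), and the first XOR mask is set to 0 so that the first slot is the plain minimum and the other 199 slots come from the random numbers.

```python
import random
import zlib


def char_ngrams(text, n=9):
    """Return the set of character n-grams (shingles) of a document."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def hash32(shingle):
    # Assumption: CRC32 is used here as a stand-in for MurmurHash3.
    # It returns a deterministic unsigned 32-bit integer.
    return zlib.crc32(shingle.encode("utf-8"))


def make_masks(num_hashes=200, seed=42):
    # Mask 0 leaves the hashes unchanged (the plain minimum);
    # the remaining num_hashes - 1 masks are random 32-bit numbers.
    rng = random.Random(seed)
    return [0] + [rng.getrandbits(32) for _ in range(num_hashes - 1)]


def minhash_signature(text, masks, n=9):
    """For every mask, XOR it into each hashed n-gram and keep the minimum."""
    hashed = [hash32(s) for s in char_ngrams(text, n)]
    return [min(h ^ m for h in hashed) for m in masks]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

The same `masks` list has to be used for every document, otherwise the signatures are not comparable. In Spark, a function like `minhash_signature` would presumably be applied to each document with a `map` over the RDD, and two signatures would then be compared position by position as in `estimate_jaccard`.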