MinHash Implementation Spark

Asked Oct 10 '16 at 10:10

Active May 03 '17 at 15:25

I am trying to implement the MinHash algorithm, as described in chapter 3, as simply as possible in Spark. I have searched everywhere, and I decided to follow the implementation from this blog, as Bill Dim proposes: https://blog.cluster-text.com/tag/minhash/ I just feel that something is wrong with my implementation, or that I have misunderstood something. What I have done so far is:

document => n-grams (I use 9-grams of letters, as the book says, but this can be changed to 5-word n-grams, as Bill Dim proposes)
n-grams => MurmurHash3 (so that gives the hashed n-grams for every document)
HashedNGramsRDD => find Min(NGram) for every document
HashedNGramsRDD ^ (199 random numbers) and take the min = 199 minimums of the XORed MurmurHash3-hashed n-grams.
So I have 200 minimums in total, and that is my MinHash signature. Is this correct? Please help! Thanks in advance.
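
The steps above can be sketched in plain Python as follows. This is only a minimal illustration of the idea, not a Spark implementation and not necessarily what the blog does: `hashlib.blake2b` is used as a stand-in for MurmurHash3 (which is not in the standard library), and the names `char_ngrams`, `hash64`, `make_masks`, `minhash_signature` and `estimate_jaccard` are hypothetical helpers made up for this example.

```python
import hashlib
import random


def char_ngrams(text, n=9):
    """Return the set of character n-grams (shingles) of a document."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def hash64(shingle):
    """Stable 64-bit hash of a shingle (assumed stand-in for MurmurHash3)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_masks(num_hashes=200, seed=42):
    """One XOR mask per signature slot.

    Mask 0 is zero, so slot 0 is the plain minimum of the hashed n-grams;
    the other 199 masks are random 64-bit numbers.
    """
    rng = random.Random(seed)
    return [0] + [rng.getrandbits(64) for _ in range(num_hashes - 1)]


def minhash_signature(text, masks, n=9):
    """For every mask, XOR it into each hashed n-gram and keep the minimum."""
    hashed = [hash64(s) for s in char_ngrams(text, n)]
    return [min(h ^ m for h in hashed) for m in masks]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots on which two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The same masks have to be used for every document, otherwise the signatures are not comparable. In Spark, one possible arrangement (again an assumption) would be to broadcast the masks and apply `minhash_signature` per document with something like `mapValues` on a pair RDD of (docId, text).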

edited Oct 10 '16 at 11:47

Jagat Dave

asked Oct 10 '16 at 10:10

Spar

Is this for fun or an actual problem that you are trying to solve (my answer might vary)? – Marsellus Wallace Oct 12 '16 at 16:02
You are welcome! Now, the follow-up is: is your task 'implementing MinHash' or 'finding similar documents' (possibly using MinHash/LSH and external libraries)? – Marsellus Wallace Oct 12 '16 at 18:41
@Gevorg my task involves implementing MinHash and LSH. If you **read** my initial post, I am trying to implement MinHash as described in that blog, so that I can continue with LSH. **IF you read...** – Spar Oct 13 '16 at 07:08
