
I am trying to implement the MinHash algorithm as described in chapter 3, as simply as possible, in Spark. I have searched everywhere and decided to follow the implementation from this blog, as Bill Dim proposes: https://blog.cluster-text.com/tag/minhash/ I feel that something is wrong with my implementation, or that I have misunderstood something. What I have done so far is:

  • document => n-grams (I use 9-grams of letters, as suggested in the book, but this can be changed to 5-word n-grams, as proposed by Bill Dim)
  • n-grams => MurmurHash3 (so that gives the hashed n-grams for every document)
  • HashedNGramsRDD => find min(hashed n-gram) for every document
  • HashedNGramsRDD ^ (199 random numbers), taking the min for each random number = 199 minimums of the XORed MurmurHash3 n-gram hashes
  • So I have 200 minimums in total, and that is my MinHash signature. Is this correct? Please help! Thanks in advance.
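To make the steps above concrete, here is a minimal sketch of them in plain Python, without Spark. Everything named here is an assumption for illustration: the helper names (`char_ngrams`, `hash32`, `make_masks`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and `hash32` uses a truncated BLAKE2b digest only as a stand-in for MurmurHash3, which is not in the Python standard library. The one point the sketch relies on is that the same 199 random numbers are reused for every document, since signatures built with different random numbers cannot be compared.

```python
import hashlib
import random


def char_ngrams(text, n=9):
    """Return the set of overlapping character n-grams of `text`."""
    if len(text) < n:
        return {text}  # a short document becomes a single shingle
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def hash32(s):
    """32-bit hash of a string (stand-in for MurmurHash3, an assumption)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def make_masks(k=199, seed=42):
    """Generate k random 32-bit numbers, shared by all documents."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(k)]


def minhash_signature(text, masks, n=9):
    """Min of the plain hashes, plus one min per XOR mask (1 + len(masks) values)."""
    hashes = [hash32(g) for g in char_ngrams(text, n)]
    signature = [min(hashes)]
    signature.extend(min(h ^ m for h in hashes) for m in masks)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

With these helpers, `masks = make_masks()` would be created once, and `minhash_signature(doc, masks)` would be called for each document; in Spark the same function could presumably be applied per document with `rdd.map`. The fraction of matching positions between two signatures is then the estimate of the Jaccard similarity of the two n-gram sets.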
Jagat Dave
Spar
  • Is this for fun or an actual problem that you are trying to solve (my answer might vary)? – Marsellus Wallace Oct 12 '16 at 16:02
  • You are welcome! Now, the follow up is: Is your task 'implementing minhash' or 'finding similar documents' (possibly using minhash/lsh and external libraries)? – Marsellus Wallace Oct 12 '16 at 18:41
  • @Gevorg my task involves implementing MinHash and LSH. If you **read** my initial post, I am trying to implement MinHash as described in that blog, so to continue with LSH. **IF you read...** – Spar Oct 13 '16 at 07:08

0 Answers