Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash is also known as the min-wise independent permutations locality-sensitive hashing scheme. The similarity it estimates is the Jaccard similarity: the size of the intersection of the two sets divided by the size of their union.

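Comparing the minimum hash values of two sets under a single hash function gives an unbiased but high-variance estimate of their Jaccard similarity. The idea of the MinHash scheme is to reduce that variance by averaging together several such estimates constructed in the same way, each using a different hash function.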
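
The averaging step can be illustrated with a minimal Python sketch. It is not taken from any particular library: the function names (`minhash_signature`, `estimate_jaccard`, `exact_jaccard`) and the choice of seeded BLAKE2b as the hash family are assumptions made for illustration only.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of `item`, salted with `seed`.

    Seeding BLAKE2b is an illustrative way to get a family of hash
    functions; real implementations often use cheaper universal hashes.
    """
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """Return the minimum hash of `items` under each of `num_hashes` hash functions."""
    items = set(items)
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    set_a = {f"w{i}" for i in range(0, 100)}
    set_b = {f"w{i}" for i in range(50, 150)}
    sig_a = minhash_signature(set_a)
    sig_b = minhash_signature(set_b)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(set_a, set_b))
```

Each signature position matches with probability equal to the Jaccard similarity, so the fraction of matching positions is an average of independent estimates; using more hash functions lowers the variance at the cost of a longer signature.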

81 questions
0
votes
1 answer

Set distance as similarity metric for MinHashing algorithm

I am currently working on document clustering using the MinHashing technique. However, I am not getting the desired results, as MinHash is a rough estimation of Jaccard similarity and it doesn't suit my requirement. This is my scenario: I have a huge set…
Maggie
  • 5,923
  • 8
  • 41
  • 56
0
votes
1 answer

Mahout minhash org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

I am using hadoop-1.2.1 and mahout-distribution-0.8. When I try to run the HASHMIN method with the following command: $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.minhash.MinHashDriver -i tce-data/cv.vec -o tce-data/out/cv/minHashDriver/ -ow I get…
Osy
  • 1,613
  • 5
  • 21
  • 35
0
votes
4 answers

Proof of calculating Minhash

I'm reading about the MinHash technique to estimate the similarity between 2 sets: Given sets A and B, h is the hash function and hmin(S) is the minimum hash of set S, i.e. hmin(S)=min(h(s)) for s in S. We have the equation: p(hmin(A)=hmin(B))=|A∩B| /…
Long Thai
  • 807
  • 3
  • 12
  • 34
-1
votes
1 answer

Questions about LSH (Locality-sensitive hashing) and minhashing implementation

I'm trying to implement the paper "Browser Fingerprint Coding Methods Increasing the Effectiveness of User Identification in the Web Traffic". I have a couple of questions about the LSH algorithm in general and the proposed implementation: The LSH…
ianux22
  • 405
  • 4
  • 16
-1
votes
1 answer

Increasing the number of hash tables in MinHashLSH decreases accuracy and F1

I have used MinHashLSH with approximateSimilarityJoin in Scala and Spark 2.4 to find edges in a network (link prediction based on document similarity). My problem is that as I increase the number of hash tables in MinHashLSH, my accuracy and…
atheodos
  • 131
  • 12
-4
votes
1 answer

LSH implementation for finding clusters

Hi guys. I am very new to Stack Exchange and I am currently doing research on graph theory. The set of questions I'm going to ask are very introductory since I'm a beginner-level programmer (not acquainted with hashing, buckets, vectors etc data…
Samarth Shah
  • 878
  • 8
  • 14