Questions tagged [minhash]

MinHash (or the min-wise independent permutations locality-sensitive hashing scheme) is a probabilistic hashing technique for quickly estimating how similar two sets are.

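A single hash function gives only a high-variance estimate of the Jaccard similarity of two sets. The idea of the MinHash scheme is to reduce that variance by averaging together several estimators constructed in the same way, one per hash function.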
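As a rough illustration of the description above, here is a minimal sketch in plain Python. It is not taken from any library: the function names (`make_hash_params`, `base_hash`, `minhash_signature`, `estimate_jaccard`, `exact_jaccard`), the modulus `PRIME`, and the use of a linear hash family `(a*x + b) mod p` as a stand-in for true min-wise independent permutations are all assumptions made for this example.

```python
import hashlib
import random

# Mersenne prime used as the modulus of the hash family (an illustrative choice).
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficients for num_hashes hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def base_hash(item):
    """Map any item to a stable 64-bit integer (unlike hash(), not randomized per process)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, params):
    """Return one minimum per hash function; the list of minima is the signature."""
    hashed = [base_hash(item) for item in items]
    if not hashed:
        raise ValueError("cannot MinHash an empty set")
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Average the per-hash-function indicators 'the two minima agree'."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    params = make_hash_params(256, seed=42)
    a = set(range(0, 1000))
    b = set(range(500, 1500))
    estimate = estimate_jaccard(minhash_signature(a, params),
                                minhash_signature(b, params))
    print("estimated:", estimate, "exact:", exact_jaccard(a, b))
```

Each position of the signature is one estimator that agrees between the two sets with probability roughly equal to their Jaccard similarity, so using more hash functions tightens the averaged estimate at the cost of longer signatures.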

81 questions
3 votes, 1 answer

MinHashing vs SimHashing

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here: https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/ could yield three clusters ({A}, {B,C,D} and {E}), for instance, if…
cjauvin • 3,433 • 4 • 29 • 38

3 votes, 0 answers

What is the number of Hash functions needed in Bundle Min Hashing for Logo Recognition?

In reference to the paper Bundle Min Hashing for Logo Recognition: Suppose we have bundles {2,5,18,444,678} and {2,5,79,368,841} and the vocabulary size is 1M. If we have just 1 sketch per bundle then do we need just 1 hash function which hashes 1M…
3 votes, 2 answers

Generating Random Hash Functions for LSH Minhash Algorithm

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment). In order to do that, I've been…
user3246779 • 125 • 3 • 12

2 votes, 0 answers

One-hot encoding minHashed genomes

I have an algorithm to one-hot encode minHashed genomes and I am seeking opinions on whether I have constructed it correctly based on the nature of minHashing. There's some disagreement between myself and a collaborator and we are trying to find the…
C. John • 144 • 1 • 15

2 votes, 2 answers

how to set the seed value for ruby murmur hash

Is there a way to set the seed value for using the ruby hash function (i.e. murmur hash in 1.9, don't know JRuby?) so that I can get the same hash code every time I run the script (i.e. in parallel on multiple processes or on different nodes) so…
Charles • 495 • 1 • 5 • 12

2 votes, 4 answers

All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster

I run into problems when calling approxSimilarityJoin from Spark's MinHashLSH on a dataframe of (name_id, name) combinations. A summary of the problem I am trying to solve: I have a dataframe of around 30 million unique (name_id, name) combinations for…

2 votes, 1 answer

ufunc 'bitwise_and' not supported for the input types Minhash

I am using Python 3.7.1 to MinHash a list of strings. The code is as follows:
import mmh3
import random
import string
import itertools
from datasketch import MinHash
def grouper(iterable,n=2):
    return ["".join(x) for x in…
Nithin Varghese • 893 • 1 • 6 • 28

2 votes, 2 answers

Node.js / javascript minhash module that outputs a similar hashstring for similar text

I am looking for a node.js / Javascript module that applies the minhash algorithm to a string or bigger text, and returns me an "identifying" or "characteristic" Bytestring or Hexstring for that text. If I apply the algorithm to another similar text…
MMMM • 3,320 • 8 • 43 • 80

2 votes, 1 answer

Faster implementation of LSH (AND-OR)

I have a data set of size (160000,3200) in which every element is either zero or one. I want to find similar candidates. I have hashed it to (160000,200) with MinHash using a single for-loop; this took about two minutes, which I am happy with. I…
Ramki • 43 • 1 • 8

2 votes, 0 answers

how do I calculate the Minhash Signature of a given characteristic matrix using Spark

I have a DataSet as follows:

+----+---------+-------------------------+
|key |value    |vector                   |
+----+---------+-------------------------+
|key0|[a, d]   |(5,[0,2],[1.0,1.0])      |
|key1|[c]      |(5,[1],[1.0])            |
…

lee • 234 • 2 • 16

2 votes, 0 answers

Using the Minhash Token Filter in elasticsearch

What does the bucket_count setting correspond to? Does this mean that the minhashes are further hashed to values between 1 and bucket_count-1? Would generating minhashes in the following scenario result in any speedup? Case: Index 10 million…
Cygorger • 772 • 7 • 15

2 votes, 2 answers

Using minHash to compare more than 2 sets

I have a class called FindSimilar which uses minHash to find similarities between 2 sets (and for this goal, it works great). My problem is that I need to compare more than 2 sets, more specifically, I need to compare a given set1 with an unknown…
Lazy Wolf • 97 • 3 • 10

2 votes, 3 answers

Memory efficient map<pair<int,int>, set<int>> alternative

I have a huge amount (1500 Million) of Integer pairs where each one is associated with a document-ID. My goal now is to search for documents which have the same pair. My first idea was to use a hash-map (std::map) using the pair values as keys and…
Mad A. • 401 • 4 • 11

2 votes, 1 answer

Clarification needed about min/sim hashing + LSH

I have a reasonable understanding of a technique for detecting similar documents that consists of first computing their minhash signatures (from their shingles, or n-grams) and then using an LSH-based algorithm to cluster them efficiently (i.e. avoid the…

1 vote, 1 answer

Optimal way for calculating Weighted Jaccard index in Python

I have a dataset constructed as a sparse weighted matrix for which I want to calculate the weighted Jaccard index for downstream grouping/clustering, inspired by the article below:…
Charmander_ • 55 • 1 • 9