Highest Voted 'minhash' Questions

3

votes

1 answer

MinHashing vs SimHashing

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here: https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/ could yield three clusters ({A}, {B,C,D} and {E}), for instance, if…

asked Jun 12 '15 at 14:50

cjauvin

3,433
4
29
38

3

votes

0 answers

What is the number of Hash functions needed in Bundle Min Hashing for Logo Recognition?

In reference to the paper Bundle Min Hashing for Logo Recognition: Suppose we have bundles {2,5,18,444,678} and {2,5,79,368,841} and the vocabulary size is 1M. If we have just 1 sketch per bundle then do we need just 1 hash function which hashes 1M…

computer-vision object-recognition hash-function minhash

asked Oct 28 '14 at 06:39

Ankit Nayan

363
7
18

3

votes

2 answers

Generating Random Hash Functions for LSH Minhash Algorithm

I'm programming a minhashing algorithm in Java that requires me to generate an arbitrary number of random hash functions (240 hash functions in my case), and run any number of integers through it (2000 at the moment). In order to do that, I've been…

java algorithm hash locality-sensitive-hash minhash

asked Jul 10 '14 at 12:11

user3246779

125
3
12
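
One common way to obtain many random hash functions for MinHash is the family h(x) = (a*x + b) mod p, with random a and b and a prime p larger than every input. The following is a minimal sketch in Python (the question itself targets Java). The names `make_hash_functions` and `signature` and the choice of p are illustrative assumptions, not the asker's code.

```python
import random

# Assumption: all inputs are non-negative integers smaller than this prime.
PRIME = 2_147_483_647  # 2**31 - 1, a Mersenne prime


def make_hash_functions(count, seed=0):
    """Return `count` functions of the form h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(count)]
    # Default arguments bind a and b per function (avoids late binding).
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in coeffs]


def signature(values, hash_functions):
    """MinHash signature: the minimum hash value per function."""
    return [min(h(v) for v in values) for h in hash_functions]


if __name__ == "__main__":
    funcs = make_hash_functions(240, seed=42)  # 240 functions, as in the question
    sig = signature(range(2000), funcs)        # 2000 integers, as in the question
    print(len(sig))
```

Seeding the generator keeps the same functions across runs, which matters when signatures computed at different times have to be compared.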

2

votes

0 answers

One-hot encoding minHashed genomes

I have an algorithm to one-hot encode minHashed genomes and I am seeking opinions on whether I have constructed it correctly based on the nature of minHashing. There's some disagreement between myself and a collaborator and we are trying to find the…

hash computer-science bioinformatics minhash

asked May 04 '22 at 16:38

C. John

144
1
15

2

votes

2 answers

How to set the seed value for Ruby murmur hash

Is there a way to set the seed value for using the ruby hash function (i.e. murmur hash in 1.9, don't know JRuby?) so that I can get the same hash code every time I run the script (i.e. in parallel on multiple processes or on different nodes) so…

ruby jruby hashcode murmurhash minhash

asked Jul 08 '11 at 01:24

Charles

495
1
5
12

2

votes

4 answers

All executors dead MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster

I run into problems when calling Spark's MinHashLSH approxSimilarityJoin on a dataframe of (name_id, name) combinations. A summary of the problem I am trying to solve: I have a dataframe of around 30 million unique (name_id, name) combinations for…

pyspark apache-spark-sql garbage-collection amazon-emr minhash

asked May 28 '20 at 13:11

thijsvdp

404
3
16

2

votes

1 answer

ufunc 'bitwise_and' not supported for the input types Minhash

I am using Python 3.7.1 to compute the MinHash of a list of strings. The code is as follows.

import mmh3
import random
import string
import itertools
from datasketch import MinHash

def grouper(iterable,n=2):
    return ["".join(x) for x in…

python-3.x numpy hash minhash

asked Mar 20 '19 at 17:54

Nithin Varghese

893
1
6
28

2

votes

2 answers

Node.js / javascript minhash module that outputs a similar hashstring for similar text

I am looking for a node.js / Javascript module that applies the minhash algorithm to a string or bigger text, and returns me an "identifying" or "characteristic" Bytestring or Hexstring for that text. If I apply the algorithm to another similar text…

javascript node.js minhash

asked Mar 19 '19 at 23:00

MMMM

3,320
8
43
80

2

votes

1 answer

Faster implementation of LSH (AND-OR)

I have a data set of size (160000,3200), in which all the elements are either zero or one. I want to find similar candidates. I have reduced it to (160000,200) with MinHash in a single for-loop, which took about two minutes, and I am happy with that. I…

python locality-sensitive-hash minhash

asked Nov 17 '18 at 06:27

Ramki

43
1
8

2

votes

0 answers

How do I calculate the MinHash signature of a given characteristic matrix using Spark?

I have a DataSet as follows:

+----+---------+-------------------------+
|key |value    |vector                   |
+----+---------+-------------------------+
|key0|[a, d]   |(5,[0,2],[1.0,1.0])      |
|key1|[c]      |(5,[1],[1.0])            |
…

apache-spark minhash

asked Feb 03 '18 at 23:18

lee

234
2
16

2

votes

0 answers

Using the Minhash Token Filter in elasticsearch

What does the bucket_count setting correspond to? Does this mean that the minhashes are further hashed to values between 1 and bucket_count-1? Would generating minhashes in the following scenario result in any speedup? Case: Index 10 million…

elasticsearch minhash

asked Mar 15 '17 at 19:56

Cygorger

772
7
15

2

votes

2 answers

Using minHash to compare more than 2 sets

I have a class called FindSimilar which uses minHash to find similarities between 2 sets (and for this goal, it works great). My problem is that I need to compare more than 2 sets, more specifically, I need to compare a given set1 with an unknown…

set similarity minhash

asked Nov 28 '16 at 15:29

Lazy Wolf

97
3
10

2

votes

3 answers

Memory efficient map<pair<int,int>, set<int>> alternative

I have a huge number (1,500 million) of integer pairs, each of which is associated with a document ID. My goal now is to search for documents which have the same pair. My first idea was to use a hash-map (std::map) using the pair values as keys and…

c++ dictionary hashmap minhash

asked Jun 14 '16 at 14:58

Mad A.

401
4
11

2

votes

1 answer

Clarification needed about min/sim hashing + LSH

I have a reasonable understanding of a technique for detecting similar documents that consists of first computing their minhash signatures (from their shingles, or n-grams), and then using an LSH-based algorithm to cluster them efficiently (i.e. avoid the…

data-mining cluster-analysis locality-sensitive-hash minhash simhash

asked Jan 11 '14 at 00:02

cjauvin

3,433
4
29
38
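
The pipeline described in this question (shingles, then minhash signatures, then LSH) can be sketched roughly as follows. This is only an illustration under stated assumptions: character 3-gram shingles, BLAKE2b as the seeded hash, 64 hash functions split into 16 bands, and hypothetical helper names throughout. It is not the method of any particular library.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of character k-grams of `text` (assumes len(text) >= k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, token):
    # Deterministic 64-bit hash of (seed, token); BLAKE2b is an assumption.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash function over a non-empty set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def lsh_candidates(signatures, bands=16):
    """Banding: documents that agree on every row of some band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(name)
        for names in buckets.values():
            for i in range(len(names)):
                for j in range(i + 1, len(names)):
                    candidates.add(tuple(sorted((names[i], names[j]))))
    return candidates


if __name__ == "__main__":
    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "the quick brown fox jumps over the lazy dog",
        "d3": "an entirely different sentence about hashing",
    }
    sigs = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
    print(lsh_candidates(sigs))
```

Only the candidate pairs returned by the banding step need an exact similarity check, which is how the all-pairs comparison is avoided.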

1

vote

1 answer

Optimal way for calculating Weighted Jaccard index in Python

I have a dataset constructed as a sparse weighted matrix for which I want to calculate weighted Jaccard index for downstream grouping/clustering, with inspiration from below article:…

python numpy distance minhash

asked Feb 26 '22 at 11:07

Charmander_

55
1
9
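
For reference, the weighted Jaccard index of two non-negative weight vectors is the sum of element-wise minima divided by the sum of element-wise maxima. A minimal pure-Python sketch follows, assuming each sparse row is stored as a dict from feature to weight. The function name and the convention of returning 1.0 for two empty rows are assumptions made here, not taken from the question.

```python
def weighted_jaccard(x, y):
    """Weighted Jaccard of two sparse rows given as {feature: weight} dicts.

    Assumes all weights are non-negative; missing features count as 0.
    """
    keys = x.keys() | y.keys()
    numerator = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    denominator = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    # Assumed convention: two all-zero rows are treated as identical.
    return numerator / denominator if denominator else 1.0


if __name__ == "__main__":
    row_a = {"a": 2.0, "b": 1.0}
    row_b = {"a": 1.0, "c": 3.0}
    print(weighted_jaccard(row_a, row_b))
```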
