Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.

The idea of the MinHash scheme is to reduce the variance of the similarity estimate by averaging together several estimators constructed in the same way, one per random hash function.
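For orientation, here is a minimal pure-Python sketch of that estimator (all names and parameters below are illustrative, not a reference implementation): each of k random hash functions contributes the minimum hash value over a set's elements, and the fraction of positions where two signatures agree estimates the Jaccard similarity.

```python
import random

PRIME = 2_147_483_647  # a large Mersenne prime used as the hash modulus

def make_hash_functions(k, seed=42):
    """Draw k random (a, b) pairs defining h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def minhash_signature(items, hash_funcs):
    """One minimum per hash function over the set's elements (hashed with built-in hash)."""
    return [min((a * hash(x) + b) % PRIME for x in items) for a, b in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

funcs = make_hash_functions(k=256)
a = {"apple", "banana", "cherry", "date"}
b = {"apple", "banana", "cherry", "elderberry"}
print(estimate_jaccard(minhash_signature(a, funcs),
                       minhash_signature(b, funcs)))  # close to the true Jaccard of 3/5
```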

81 questions
1 vote • 1 answer

Compare list to every element in a pyspark column

I have a list minhash_sig = ['112', '223'], and I would like to find the Jaccard similarity between this list and every element in a PySpark DataFrame's column. Unfortunately I'm not able to do so. I've tried using array_intersect, as well as…
coderboi
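One way to approach this, sketched below with made-up data and column names (an assumption, not the asker's setup): capture the fixed list as a Python set inside a UDF and compute the Jaccard similarity against an array column.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

minhash_sig = ['112', '223']
sig = set(minhash_sig)

# Toy DataFrame standing in for the real one; 'tokens' is an assumed column name.
df = spark.createDataFrame(
    [(1, ['112', '223', '334']), (2, ['445', '556'])],
    ['id', 'tokens'])

@F.udf(DoubleType())
def jaccard_with_sig(tokens):
    other = set(tokens or [])
    union = sig | other
    return float(len(sig & other)) / len(union) if union else 0.0

df.withColumn('jaccard', jaccard_with_sig('tokens')).show()
```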
1 vote • 2 answers

Number of pairs in calculating Jaccard distance using PySpark are less than they should be

I am trying to calculate Jaccard distance between certain ids with their attributes in the form of SparseVectors. from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col from…
secretive
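A hedged sketch of the usual pipeline with toy data (vector sizes and the threshold are illustrative): binary SparseVectors per id, MinHashLSH, then approxSimilarityJoin. Note that approxSimilarityJoin only compares pairs that collide in at least one hash table and then drops pairs whose distance exceeds the threshold, which are two common reasons for getting fewer pairs than expected.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Binary sparse vectors: indices mark which attributes an id has.
df = spark.createDataFrame([
    (0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
    (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0])),
    (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0])),
], ['id', 'features'])

model = MinHashLSH(inputCol='features', outputCol='hashes',
                   numHashTables=5, seed=1).fit(df)

# Self-join; a generous threshold keeps more candidate pairs in the output.
pairs = model.approxSimilarityJoin(df, df, threshold=0.9, distCol='jaccard_dist')
(pairs
 .select(col('datasetA.id').alias('idA'),
         col('datasetB.id').alias('idB'),
         'jaccard_dist')
 .filter('idA < idB')
 .show())
```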
1 vote • 1 answer

Is the number of rows always 1 in each band in the Spark implementation of MinHashLSH?

I'm trying to understand the MinHash LSH implementation in Spark, org.apache.spark.ml.feature.MinHashLSH. These two files seem the most relevant: MinHashLSH.scala and LSH.scala. To use MinHashLSH, the doc says it needs a numHashTables parameter,…
zyxue
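For context on the banding question above: if each hash table indeed acts as a band with a single row, then the probability that a pair with Jaccard similarity s becomes a candidate under OR-amplification is 1 − (1 − s^r)^b with r = 1. A small sketch of that curve (purely illustrative arithmetic):

```python
# Probability that a pair with Jaccard similarity s becomes a candidate when
# signatures are split into b bands of r rows (OR across bands, AND within a band).
def candidate_probability(s, b, r=1):
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.5, 0.8):
    print(s,
          round(candidate_probability(s, b=5, r=1), 3),   # one row per band
          round(candidate_probability(s, b=5, r=4), 3))   # four rows per band
```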
1 vote • 1 answer

Why does the textreuse package in R make LSH buckets way larger than the original minhashes?

As far as I understand, one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates.…
retrography
1 vote • 1 answer

Why is JaccardDistance always 0 for different docs from Spark MinHashLSHModel approxSimilarityJoin?

I am new to Spark ML. Spark ML has a MinHash implementation for Jaccard distance. Please see the doc https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance. In the sample code, the input data for comparison come from vectors. I have…
steve99
1 vote • 0 answers

Technique For Comparing Items in a Set with Varying Numbers of Attributes Possibly Using LSH

I have a data set containing millions of items collected from many disparate sources. Each item contains a list of anywhere from fifty to a thousand attributes. The specific attributes available vary greatly from item to item. I am looking for the…
1 vote • 1 answer

How to evaluate MinHashLSH in Spark with Scala?

I have a dataset of academic papers with 27770 papers (nodes) and another file (graph file) with the original edges, 352807 entries long. I want to calculate MinHashLSH to find similar documents and predict links between two nodes! Below…
atheodos
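One hedged way to evaluate such a result, sketched with toy DataFrames (schemas and values are assumptions): treat the pairs returned by approxSimilarityJoin as predicted links, normalise both them and the ground-truth edges to unordered pairs, and compute precision and recall.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import least, greatest

spark = SparkSession.builder.getOrCreate()

# Candidate pairs, e.g. collected from model.approxSimilarityJoin(...) output.
candidates = spark.createDataFrame([(1, 2), (2, 3), (1, 4)], ['a', 'b'])

# Ground-truth edges from the graph file.
edges = spark.createDataFrame([(2, 1), (3, 2), (5, 6)], ['src', 'dst'])

# Normalise the true edges to unordered (a < b) pairs before joining.
truth = edges.select(least('src', 'dst').alias('a'),
                     greatest('src', 'dst').alias('b')).distinct()

hits = candidates.join(truth, ['a', 'b']).count()   # true positives
print('precision:', hits / candidates.count())
print('recall:', hits / truth.count())
```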
1 vote • 1 answer

UDF to check for a non-zero vector not working after CountVectorizer through spark-submit

As per this question, I am applying a UDF to filter empty vectors after CountVectorizer. val tokenizer = new RegexTokenizer().setPattern("\\|").setInputCol("dataString").setOutputCol("dataStringWords") val vectorizer = new…
Sheel
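A minimal PySpark sketch of the filtering step (data and column names are made up): vectors produced by CountVectorizer expose numNonzeros(), so a boolean UDF can drop the all-zero rows before handing the data to MinHashLSH.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from pyspark.ml.feature import CountVectorizer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(0, ['a', 'b']), (1, [])], ['id', 'words'])

cv = CountVectorizer(inputCol='words', outputCol='features').fit(df)
vectorized = cv.transform(df)

# Keep only rows whose feature vector has at least one non-zero entry.
is_non_zero = udf(lambda v: v is not None and v.numNonzeros() > 0, BooleanType())
vectorized.filter(is_non_zero('features')).show()
```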
1 vote • 1 answer

String similarity with OR condition in MinHash Spark ML

I have two datasets: the first is a large reference dataset, and for each record in the second dataset I want to find the best match in the first through the MinHash algorithm. val dataset1 = +-------------+----------+------+------+-----------------------+ | x'| …
Sheel
1 vote • 1 answer

Minhashing on k-length strings

I have an application where I should implement Bloom filters and minhashing to find similar items. I have the Bloom filter implemented, but I need to make sure I understand the minhashing part to do it: the application generates a number of k-length…
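One common reading of this kind of setup, sketched below with illustrative parameters: treat each string's character k-shingles as the set to be minhashed; the signature step is the same per-hash minimum described at the top of this page.

```python
import random

K = 5            # shingle length (illustrative)
NUM_HASHES = 128
PRIME = 2_147_483_647

rng = random.Random(0)
HASHES = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(NUM_HASHES)]

def shingles(s, k=K):
    """All overlapping k-character substrings of s, as a set."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def signature(s):
    sh = shingles(s)
    return [min((a * hash(x) + b) % PRIME for x in sh) for a, b in HASHES]

sig1 = signature("the quick brown fox jumps over the lazy dog")
sig2 = signature("the quick brown fox jumped over a lazy dog")
print(sum(x == y for x, y in zip(sig1, sig2)) / NUM_HASHES)  # estimated Jaccard
```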
1 vote • 1 answer

Optimum number of permutations to use for estimating set similarity using min hash

Let's say I have to estimate the Jaccard similarity between documents A and B, and I use k random permutations of the union of these sets/documents to determine the documents' signatures. How should I set my k value? Since setting it to a…
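For sizing k, a standard rule of thumb (stated here as a sketch, not a definitive answer): the MinHash estimate of a Jaccard similarity J from k independent hash functions has a standard error of roughly sqrt(J(1 − J)/k), which is worst at J = 0.5, so k can be solved from a target error.

```python
import math

def k_for_error(target_std_err, jaccard=0.5):
    """Smallest k whose estimator standard error is at most the target (worst case J = 0.5)."""
    return math.ceil(jaccard * (1 - jaccard) / target_std_err ** 2)

for err in (0.1, 0.05, 0.01):
    print(err, k_for_error(err))   # 25, 100, 2500 permutations respectively
```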
1 vote • 1 answer

How to get the Intersection and Union of two Series in Pandas with non-unique values?

If I have 2 Series objects, like so: [0,0,1] [1,0,0] How would I get the intersection and union of the two? They only contain booleans, which means the values are non-unique. I have a large Boolean matrix. I've minhashed it and now I'm trying to…
user3927312
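A short pandas sketch of one interpretation (values as in the question): with boolean Series, element-wise AND gives the intersection count and OR gives the union count, from which a Jaccard value follows.

```python
import pandas as pd

a = pd.Series([0, 0, 1], dtype=bool)
b = pd.Series([1, 0, 0], dtype=bool)

intersection = (a & b).sum()   # rows where both are True
union = (a | b).sum()          # rows where either is True
jaccard = intersection / union if union else 0.0
print(intersection, union, jaccard)
```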
1 vote • 0 answers

Spark MinHashLSH Never Progresses

I am new to Spark, but I am attempting to produce network clusters using user-supplied tags or attributes. First I am using the Jaccard MinHash algorithm to produce similarity scores, then running them through the power iteration clustering algorithm, but…
1 vote • 0 answers

Using a Probabilistic Data Structure to Do Text Matching (Python)

I have a list of 10,000,000 strings, each the name of an item (3 to 5 words, up to 80 characters). Then I have a list of 5,000 strings to match on. Meaning, for each of the 5,000 potential match rules, I need to identify how many of the 10,000,000…
jrjames83
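One common approach at this scale, sketched with the third-party datasketch library (an assumption about tooling, not the asker's setup): index a MinHash of each item name in an LSH structure, then query it once per match string.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    """MinHash sketch over the lower-cased word tokens of a string."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode('utf8'))
    return m

items = ["stainless steel water bottle", "steel water flask", "ceramic coffee mug"]
rules = ["steel water bottle"]

lsh = MinHashLSH(threshold=0.4, num_perm=128)
for i, name in enumerate(items):
    lsh.insert(f"item-{i}", minhash_of(name))

for rule in rules:
    print(rule, lsh.query(minhash_of(rule)))   # keys of candidate matches
```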
1 vote • 1 answer

Creating different hash functions for integers in Python?

For my implementation of the minhashing algorithm I need to make many random permutations of integers, which will be simulated by using random hash functions (as many as possible). Currently I use hash functions of the form: h(x) = (a*x + b) %…
Keyb0ardwarri0r
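The usual way to get many distinct functions of this form is to draw a fresh random (a, b) pair per function, with a prime modulus at least as large as the integer universe; a minimal sketch (constants are illustrative):

```python
import random

P = 2_147_483_647   # a large Mersenne prime; must exceed the integer universe
M = 2 ** 20         # desired output range (illustrative)

def make_hash_family(k, seed=0):
    """k functions h(x) = ((a*x + b) % P) % M with fresh random (a, b) for each."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]
    # Default arguments pin each (a, b) pair to its own lambda.
    return [lambda x, a=a, b=b: ((a * x + b) % P) % M for a, b in params]

family = make_hash_family(100)
print([h(12345) for h in family[:5]])   # five different hash values for one input
```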