Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.

The idea of the MinHash scheme is to reduce the variance of the similarity estimate by averaging together several estimators constructed in the same way, one per random hash function.
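For orientation, here is a minimal pure-Python sketch of that estimator (all names and parameters below are illustrative, not a reference implementation): each of k random hash functions contributes the minimum hash value over a set's elements, and the fraction of positions where two signatures agree estimates the Jaccard similarity.

```python
import random

PRIME = 2_147_483_647  # a large Mersenne prime used as the hash modulus

def make_hash_functions(k, seed=42):
    """Draw k random (a, b) pairs defining h(x) = (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def minhash_signature(items, hash_funcs):
    """One minimum per hash function over the set's elements (hashed with built-in hash)."""
    return [min((a * hash(x) + b) % PRIME for x in items) for a, b in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

funcs = make_hash_functions(k=256)
a = {"apple", "banana", "cherry", "date"}
b = {"apple", "banana", "cherry", "elderberry"}
print(estimate_jaccard(minhash_signature(a, funcs),
                       minhash_signature(b, funcs)))  # close to the true Jaccard of 3/5
```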

81 questions
1 vote • 1 answer

Compare list to every element in a pyspark column

I have a list minhash_sig = ['112', '223'], and I would like to find the Jaccard similarity between this list and every element in a PySpark DataFrame's column. Unfortunately I'm not able to do so. I've tried using array_intersect, as well as…
coderboi
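One way to approach this, sketched below with made-up data and column names (an assumption, not the asker's setup): capture the fixed list as a Python set inside a UDF and compute the Jaccard similarity against an array column.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

minhash_sig = ['112', '223']
sig = set(minhash_sig)

# Toy DataFrame standing in for the real one; 'tokens' is an assumed column name.
df = spark.createDataFrame(
    [(1, ['112', '223', '334']), (2, ['445', '556'])],
    ['id', 'tokens'])

@F.udf(DoubleType())
def jaccard_with_sig(tokens):
    other = set(tokens or [])
    union = sig | other
    return float(len(sig & other)) / len(union) if union else 0.0

df.withColumn('jaccard', jaccard_with_sig('tokens')).show()
```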
1 vote • 2 answers

Number of pairs in calculating Jaccard distance using PySpark are less than they should be

I am trying to calculate Jaccard distance between certain ids with their attributes in the form of SparseVectors. from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col from…
secretive
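A hedged sketch of the usual pipeline with toy data (vector sizes and the threshold are illustrative): binary SparseVectors per id, MinHashLSH, then approxSimilarityJoin. Note that approxSimilarityJoin only compares pairs that collide in at least one hash table and then drops pairs whose distance exceeds the threshold, which are two common reasons for getting fewer pairs than expected.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Binary sparse vectors: indices mark which attributes an id has.
df = spark.createDataFrame([
    (0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
    (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0])),
    (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0])),
], ['id', 'features'])

model = MinHashLSH(inputCol='features', outputCol='hashes',
                   numHashTables=5, seed=1).fit(df)

# Self-join; a generous threshold keeps more candidate pairs in the output.
pairs = model.approxSimilarityJoin(df, df, threshold=0.9, distCol='jaccard_dist')
(pairs
 .select(col('datasetA.id').alias('idA'),
         col('datasetB.id').alias('idB'),
         'jaccard_dist')
 .filter('idA < idB')
 .show())
```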
1 vote • 1 answer

Is the number of rows always 1 in each band in the Spark implementation of MinHashLSH?

I'm trying to understand the MinHash LSH implementation in Spark, org.apache.spark.ml.feature.MinHashLSH. These two files seem the most relevant: MinHashLSH.scala and LSH.scala. To use MinHashLSH, the doc says it needs a numHashTables parameter,…
zyxue
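For context on the banding question above: if each hash table indeed acts as a band with a single row, then the probability that a pair with Jaccard similarity s becomes a candidate under OR-amplification is 1 − (1 − s^r)^b with r = 1. A small sketch of that curve (purely illustrative arithmetic):

```python
# Probability that a pair with Jaccard similarity s becomes a candidate when
# signatures are split into b bands of r rows (OR across bands, AND within a band).
def candidate_probability(s, b, r=1):
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.5, 0.8):
    print(s,
          round(candidate_probability(s, b=5, r=1), 3),   # one row per band
          round(candidate_probability(s, b=5, r=4), 3))   # four rows per band
```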
1 vote • 1 answer

Why does the textreuse package in R make LSH buckets way larger than the original minhashes?

As far as I understand, one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates.…
retrography
1 vote • 1 answer

Why is JaccardDistance always 0 for different docs from Spark MinHashLSHModel approxSimilarityJoin?

I am new to Spark ML. Spark ML has a MinHash implementation for Jaccard distance. Please see the doc https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance. In the sample code, the input data for comparison come from vectors. I have…
steve99
1 vote • 0 answers

Technique For Comparing Items in a Set with Varying Numbers of Attributes Possibly Using LSH

I have a data set containing millions of items collected from many disparate sources. Each item contains a list of anywhere from fifty to a thousand attributes. The specific attributes available vary greatly from item to item. I am looking for the…
1 vote • 1 answer

How to evaluate MinHashLSH in Spark with Scala?

I have a dataset of academic papers with 27770 papers (nodes) and another file (graph file) with the original edges, 352807 entries long. I want to calculate MinHashLSH to find similar documents and predict links between two nodes! Below…
atheodos
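One hedged way to evaluate such a result, sketched with toy DataFrames (schemas and values are assumptions): treat the pairs returned by approxSimilarityJoin as predicted links, normalise both them and the ground-truth edges to unordered pairs, and compute precision and recall.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import least, greatest

spark = SparkSession.builder.getOrCreate()

# Candidate pairs, e.g. collected from model.approxSimilarityJoin(...) output.
candidates = spark.createDataFrame([(1, 2), (2, 3), (1, 4)], ['a', 'b'])

# Ground-truth edges from the graph file.
edges = spark.createDataFrame([(2, 1), (3, 2), (5, 6)], ['src', 'dst'])

# Normalise the true edges to unordered (a < b) pairs before joining.
truth = edges.select(least('src', 'dst').alias('a'),
                     greatest('src', 'dst').alias('b')).distinct()

hits = candidates.join(truth, ['a', 'b']).count()   # true positives
print('precision:', hits / candidates.count())
print('recall:', hits / truth.count())
```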
1 vote • 1 answer

UDF to check for a non-zero vector not working after CountVectorizer through spark-submit

As per this question, I am applying a UDF to filter empty vectors after CountVectorizer. val tokenizer = new RegexTokenizer().setPattern("\\|").setInputCol("dataString").setOutputCol("dataStringWords") val vectorizer = new…
Sheel
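A minimal PySpark sketch of the filtering step (data and column names are made up): vectors produced by CountVectorizer expose numNonzeros(), so a boolean UDF can drop the all-zero rows before handing the data to MinHashLSH.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from pyspark.ml.feature import CountVectorizer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(0, ['a', 'b']), (1, [])], ['id', 'words'])

cv = CountVectorizer(inputCol='words', outputCol='features').fit(df)
vectorized = cv.transform(df)

# Keep only rows whose feature vector has at least one non-zero entry.
is_non_zero = udf(lambda v: v is not None and v.numNonzeros() > 0, BooleanType())
vectorized.filter(is_non_zero('features')).show()
```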
1 vote • 1 answer

String similarity with OR condition in MinHash Spark ML

I have two datasets: the first is a large reference dataset, and for each record in the second dataset I want to find the best match in the first through the MinHash algorithm. val dataset1 = +-------------+----------+------+------+-----------------------+ | x'| …
Sheel
1 vote • 1 answer

Minhashing on k-length strings

I have an application where I should implement Bloom filters and minhashing to find similar items. I have the Bloom filter implemented, but I need to make sure I understand the minhashing part to do it: the application generates a number of k-length…
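One common reading of this kind of setup, sketched below with illustrative parameters: treat each string's character k-shingles as the set to be minhashed; the signature step is the same per-hash minimum described at the top of this page.

```python
import random

K = 5            # shingle length (illustrative)
NUM_HASHES = 128
PRIME = 2_147_483_647

rng = random.Random(0)
HASHES = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(NUM_HASHES)]

def shingles(s, k=K):
    """All overlapping k-character substrings of s, as a set."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def signature(s):
    sh = shingles(s)
    return [min((a * hash(x) + b) % PRIME for x in sh) for a, b in HASHES]

sig1 = signature("the quick brown fox jumps over the lazy dog")
sig2 = signature("the quick brown fox jumped over a lazy dog")
print(sum(x == y for x, y in zip(sig1, sig2)) / NUM_HASHES)  # estimated Jaccard
```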
1 vote • 1 answer

Optimum number of permutations to use for estimating set similarity using min hash

Let's say I have to estimate the Jaccard similarity between documents A and B, and I use k random permutations of the union of these sets/documents to determine the documents' signatures. How should I set my k value? Since setting it to a…
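For sizing k, a standard rule of thumb (stated here as a sketch, not a definitive answer): the MinHash estimate of a Jaccard similarity J from k independent hash functions has a standard error of roughly sqrt(J(1 − J)/k), which is worst at J = 0.5, so k can be solved from a target error.

```python
import math

def k_for_error(target_std_err, jaccard=0.5):
    """Smallest k whose estimator standard error is at most the target (worst case J = 0.5)."""
    return math.ceil(jaccard * (1 - jaccard) / target_std_err ** 2)

for err in (0.1, 0.05, 0.01):
    print(err, k_for_error(err))   # 25, 100, 2500 permutations respectively
```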
1 vote • 1 answer

How to get the Intersection and Union of two Series in Pandas with non-unique values?

If I have 2 Series objects, like so: [0,0,1] [1,0,0] How would I get the intersection and union of the two? They only contain booleans, which means the values are non-unique. I have a large Boolean matrix. I've minhashed it and now I'm trying to…
user3927312
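A short pandas sketch of one interpretation (values as in the question): with boolean Series, element-wise AND gives the intersection count and OR gives the union count, from which a Jaccard value follows.

```python
import pandas as pd

a = pd.Series([0, 0, 1], dtype=bool)
b = pd.Series([1, 0, 0], dtype=bool)

intersection = (a & b).sum()   # rows where both are True
union = (a | b).sum()          # rows where either is True
jaccard = intersection / union if union else 0.0
print(intersection, union, jaccard)
```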
1 vote • 0 answers

Spark MinHashLSH Never Progresses

I am new to Spark, but I am attempting to produce network clusters using user-supplied tags or attributes. First I am using the Jaccard MinHash algorithm to produce similarity scores, then running them through the power iteration clustering algorithm, but…
1 vote • 0 answers

Using a Probabilistic Data Structure to Do Text Matching (Python)

I have a list of 10,000,000 strings, each the name of an item (3 to 5 words, up to 80 characters). Then I have a list of 5,000 strings to match on. Meaning, for each of the 5,000 potential match rules, I need to identify how many of the 10,000,000…
jrjames83
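One common approach at this scale, sketched with the third-party datasketch library (an assumption about tooling, not the asker's setup): index a MinHash of each item name in an LSH structure, then query it once per match string.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    """MinHash sketch over the lower-cased word tokens of a string."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode('utf8'))
    return m

items = ["stainless steel water bottle", "steel water flask", "ceramic coffee mug"]
rules = ["steel water bottle"]

lsh = MinHashLSH(threshold=0.4, num_perm=128)
for i, name in enumerate(items):
    lsh.insert(f"item-{i}", minhash_of(name))

for rule in rules:
    print(rule, lsh.query(minhash_of(rule)))   # keys of candidate matches
```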
1 vote • 1 answer

Creating different hash functions for integers in Python?

For my implementation of the minhashing algorithm I need to make many random permutations of integers, which will be simulated by using random hash functions (as many as possible). Currently I use hash functions of the form: h(x) = (a*x + b) %…
Keyb0ardwarri0r
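The usual way to get many distinct functions of this form is to draw a fresh random (a, b) pair per function, with a prime modulus at least as large as the integer universe; a minimal sketch (constants are illustrative):

```python
import random

P = 2_147_483_647   # a large Mersenne prime; must exceed the integer universe
M = 2 ** 20         # desired output range (illustrative)

def make_hash_family(k, seed=0):
    """k functions h(x) = ((a*x + b) % P) % M with fresh random (a, b) for each."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]
    # Default arguments pin each (a, b) pair to its own lambda.
    return [lambda x, a=a, b=b: ((a * x + b) % P) % M for a, b in params]

family = make_hash_family(100)
print([h(12345) for h in family[:5]])   # five different hash values for one input
```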