Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash (the min-wise independent permutations locality-sensitive hashing scheme) estimates the Jaccard similarity of two sets from short signatures instead of comparing the sets element by element.

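With a single hash function, the two sets' minimum hash values either match or they do not, which gives an unbiased but high-variance estimate of the Jaccard similarity. The idea of the MinHash scheme is to reduce that variance by averaging together several such variables, each constructed in the same way from a different hash function.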
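
A minimal sketch of that averaging idea, not a reference implementation: each seeded hash function yields a match-or-no-match observation, and the fraction of matches over many functions estimates the Jaccard similarity. The helper names (`seeded_hash`, `minhash_signature`, `estimate_jaccard`), the use of SHA-1 as a stand-in for random permutations, and the example sets are all assumptions made for illustration.

```python
import hashlib


def seeded_hash(seed, item):
    # Hypothetical helper: a deterministic 64-bit hash that varies with the seed.
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=256):
    # One minimum per hash function; seeds 0..num_hashes-1 play the role of permutations.
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    # Average of the per-hash-function indicators "the two minima agree".
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = {"apple", "banana", "cherry", "date"}
b = {"banana", "cherry", "date", "elderberry"}

exact = len(a & b) / len(a | b)  # 3 shared items out of 5 distinct items
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(exact, estimate)
```

Using more hash functions tightens the estimate at the cost of longer signatures; the value 256 here is arbitrary.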

81 questions

2 answers

Which algorithm should I use to match patterns or find the intersection between datasets?

I have personID and VaccinationsID plotted on the x and y axes. I want to group the personIDs that have the most similar selection of vaccinations. I am trying to use a clustering machine learning algorithm, but I am not sure whether I should use this…

asked Oct 09 '19 at 15:00

P H

1 answer

LSH Binning On-The-Fly

I want to use MinHash LSH to bin a large number of documents into buckets of similar documents (Jaccard similarity). The question: Is it possible to compute the bucket of a MinHash without knowing about the MinHash of the other documents? As far as…

python minhash lsh

asked Jun 01 '19 at 08:25

Raphael
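
For the binning question above, a hedged note: in the standard banding construction, a document's bucket keys depend only on its own signature, so they can be computed one document at a time. The sketch below assumes a signature is a plain list of integers; `band_keys` is a hypothetical helper name, not part of any library.

```python
def band_keys(signature, bands):
    """Split a signature into `bands` equal slices and return one bucket key per band."""
    if len(signature) % bands != 0:
        raise ValueError("signature length must be divisible by the number of bands")
    rows = len(signature) // bands
    # Each key pairs the band index with that band's slice of the signature.
    return [(i, tuple(signature[i * rows:(i + 1) * rows])) for i in range(bands)]


# Made-up signatures for two documents, processed independently of each other.
sig_a = [3, 7, 1, 9, 4, 4]
sig_b = [3, 7, 2, 8, 4, 4]

keys_a = band_keys(sig_a, bands=3)
keys_b = band_keys(sig_b, bands=3)

# Documents become candidate pairs if they share at least one bucket key.
shared = set(keys_a) & set(keys_b)
print(shared)
```

Because the keys are pure functions of one signature, they can be stored or indexed as each document arrives.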

1 answer

Elasticsearch minhash prefix query with wildcards?

I have a minhash field generated for some text (based on the MinHash algorithm). Is it possible to somehow complement the prefix query with wildcards? The problem is that the hashed string values are based on the content…

elasticsearch wildcard prefix minhash

asked Mar 28 '19 at 11:35

MMMM

0 answers

How to compare millions of minhashed documents on elasticsearch?

I have lots of documents with a minhashed field (based on content similarity) stored in Elasticsearch. Now, I would either compare all of them with each other to get similar (hash) documents, with the Elasticsearch API, but I can't do a fuzzy query…

elasticsearch string-comparison fuzzy-search minhash

asked Mar 25 '19 at 08:34

MMMM

0 answers

Filtering of similar texts using Apache Beam

I have a large collection of short texts where I want to filter out texts that are very similar to each other (or exact duplicates). I'd like to achieve this using Apache Beam running on Google Cloud Dataflow. I'm hoping to use the MinHash LSH…

apache-beam plagiarism-detection minhash

asked Mar 18 '18 at 12:19

ehrencrona

1 answer

Implement minhash LSH using Spark (Java)

This is quite long, and I am sorry about that. I have been trying to implement the MinHash LSH algorithm discussed in chapter 3 by using Spark (Java). I am using a toy problem like this: +--------+------+------+------+------+ |element | doc0 | doc1…

java apache-spark minhash

asked Feb 05 '18 at 02:35

lee

1 answer

How can I get the similarity matrix from minhash LSH?

I have read many tutorials and tried a number of MinHash LSH implementations, but they do not generate the similarity matrix; instead they return only the similar data that exceeds the threshold. How can I generate it? My intention is to use the LSH results for…

cluster-analysis locality-sensitive-hash minhash

asked Jan 04 '18 at 14:01

z3r0

1 answer

How to determine upper bound of c when estimating jaccard similarity between documents?

Let's say I have a million documents that I preprocessed (calculated signatures for using minhash) in O(D*sqrt(D)) time, where D is the number of documents. When I'm given a query document, I have to return the first of the million preprocessed documents…

bigdata similarity locality-sensitive-hash minhash

asked Nov 23 '17 at 04:24

theitpushover

0 answers

How to measure similarity between 2 timestamp series of events?

Suppose I have two timestamp series of events: T1 = ['2017-03-22 15:16:45', '2017-03-22 15:16:50', '2017-03-22 15:17:55', ...] T2 = ['2017-03-22 15:16:47', '2017-03-22 15:16:52', '2017-03-22 15:17:57', ...] Each timestamp means the time it…

python time-series minhash

asked Oct 18 '17 at 02:00

FriedRice

2 answers

How to calculate the similarity of two texts using the Jaccard similarity of two bags via MinHash?

I have the following two texts: text0 = "AAAAAAAAAAAA"; text1 = "AAAAABAAAAAA"; I use 4-shingles. Thus, text0 = {AAAA}, text1 = {AAAA, AAAB, AABA, ABAA, BAAA}. Then, the Jaccard similarity is sim = 1/5 = 0.2. I do not want this result, because the…

similarity minhash

asked Aug 31 '17 at 05:13

Yuansheng liu
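
A small sketch that reproduces the set-based figure from the shingling question above and contrasts it with a bag (multiset) variant. The bag definition used here, the sum of minimum counts over the sum of maximum counts, is one common choice and an assumption; the truncated excerpt does not say which definition the asker intends. The function names are hypothetical.

```python
from collections import Counter


def shingles(text, k=4):
    # All overlapping substrings of length k, duplicates kept.
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def set_jaccard(a, b):
    # Classic Jaccard: duplicates are ignored.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def bag_jaccard(a, b):
    # Assumed multiset variant: Counter & takes min counts, Counter | takes max counts.
    ca, cb = Counter(a), Counter(b)
    return sum((ca & cb).values()) / sum((ca | cb).values())


text0 = "AAAAAAAAAAAA"
text1 = "AAAAABAAAAAA"
s0, s1 = shingles(text0), shingles(text1)

print(set_jaccard(s0, s1))  # 1 shared shingle out of 5 distinct shingles
print(bag_jaccard(s0, s1))  # 5 shared occurrences out of 13
```

Under this assumed definition the repeated AAAA shingles count, so the single changed character lowers the similarity less sharply than in the set-based version.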

0 answers

MinHash Implementation Spark

I am trying to implement the MinHash algorithm as described in chapter 3 as simply as possible in Spark. I have searched a lot everywhere. I decided to follow an implementation from this blog as Bill Dim proposes: https: …

apache-spark minhash

asked Oct 10 '16 at 10:10

Spar

1 answer

Find similar images using Geometric Min Hash: How to calculate theoretical matching probabilities?

I'm trying to match images based on visual words (labeled key points within images). When comparing the simulated results to my theoretical results I get significant deviations, so I guess there must be a mistake in my theoretical probability…

image-processing computer-vision probability minhash

asked May 24 '16 at 14:41

Mad A.

1 answer

How to cluster sets (users/documents) with distributed MinHash using the banding technique?

I have a big doubt about the way I should cluster sets using MinHash together with the banding technique. I assume everyone reading has a good knowledge of MinHash so I won't define most of the terms I'm using. My goal is to use MinHash to cluster…

scala cluster-analysis apache-flink minhash

asked May 24 '16 at 13:51

Chobeat

1 answer

Should we consider two sets to be similar if their rows contain the same hashes but in different order?

Suppose we have minhash signatures for two sets and we want to calculate the Jaccard similarity of the two sets. We have: -> S1 S2 h1 0 1 h2 1 2 h3 2 0 h4 3 3 S1 and S2 have the same signatures in different orders. Is the…

machine-learning minhash

asked Feb 20 '16 at 13:18

haky_nash
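
For the signatures in the question above, the usual MinHash estimate compares values row by row (the same hash function against the same hash function), so the order of the values matters. A small sketch using the S1 and S2 columns from the excerpt; `signature_similarity` is a hypothetical helper name.

```python
def signature_similarity(sig_a, sig_b):
    # Fraction of hash functions (rows) on which the two signatures agree.
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Rows h1..h4 from the excerpt.
s1 = [0, 1, 2, 3]
s2 = [1, 2, 0, 3]

positional = signature_similarity(s1, s2)  # only h4 agrees
as_sets = len(set(s1) & set(s2)) / len(set(s1) | set(s2))  # ignores row order

print(positional, as_sets)
```

The row-by-row comparison gives 0.25 here, whereas treating the two signatures as unordered sets would give 1.0, which is not what the MinHash estimator measures.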

1 answer

How to detect similar text in big data?

As far as I know, simhash and minhash are suitable for this task. But these algorithms have to traverse the whole text database, which would be quite awful. Is there any optimization or other algorithm that can accelerate the task? All I come up…

text similarity minhash simhash

asked Nov 18 '15 at 16:05

Leo Zhao
