Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are.

The idea of the MinHash scheme is to reduce the variance of the similarity estimate by averaging together several variables constructed in the same way, each from a different hash function.
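As a rough illustration of the averaging idea described above, here is a minimal Python sketch. It is not taken from any particular library: the function names (`minhash_signature`, `estimate_jaccard`, `exact_jaccard`), the use of seeded BLAKE2b as a stand-in for a family of independent hash functions, and the default of 128 hash functions are all assumptions made for the example. Each hash function contributes one 0/1 variable (do the two minimum hash values agree?), and the estimate is the average of those variables.

```python
import hashlib


def _seeded_hash(seed: int, item: str) -> int:
    """Hypothetical helper: a 64-bit hash of `item`, varied by `seed`.

    Seeded BLAKE2b stands in for a family of independent hash functions.
    """
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Return the minimum hash value of the set under each hash function.

    Assumes `items` is a non-empty collection of strings.
    """
    items = list(items)
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Average the per-hash-function indicators "the two minima are equal".

    Signatures are compared position by position, so both must have been
    built with the same hash functions in the same order.
    """
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    set_a = {str(i) for i in range(100)}
    set_b = {str(i) for i in range(50, 150)}
    sig_a = minhash_signature(set_a)
    sig_b = minhash_signature(set_b)
    print("exact:   ", exact_jaccard(set_a, set_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Using more hash functions makes the estimate less noisy, at the cost of longer signatures; a single hash function would only ever give an estimate of 0 or 1.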

81 questions
0
votes
2 answers

Which algorithm should I use to match patterns or find the intersection between datasets?

I have personID and VaccinationsID plotted on the x and y axes. I want to group those personIDs who have the most similar selection of vaccinations. I am trying to use a clustering machine learning algorithm. But I am not sure whether I should use this…
P H
  • 294
  • 1
  • 3
  • 16
0
votes
1 answer

LSH Binning On-The-Fly

I want to use MinHash LSH to bin a large number of documents into buckets of similar documents (Jaccard similarity). The question: Is it possible to compute the bucket of a MinHash without knowing about the MinHash of the other documents? As far as…
Raphael
  • 1,731
  • 2
  • 7
  • 23
0
votes
1 answer

Elasticsearch minhash prefix query with wildcards?

I have a minhash field generated for some text (based on the minhash algorithm). My question is: is it possible to somehow complement or combine the prefix query with wildcards? Because the problem is, the hashed string values are based on the content…
MMMM
  • 3,320
  • 8
  • 43
  • 80
0
votes
0 answers

How to compare millions of minhashed documents on elasticsearch?

I have lots of documents with a minhashed field (based on content similarity) stored in Elasticsearch. Now, I would either compare all of them with each other to get similar (hash) documents, with the Elasticsearch API, but I can't do a fuzzy query…
MMMM
  • 3,320
  • 8
  • 43
  • 80
0
votes
0 answers

Filtering of similar texts using Apache Beam

I have a large collection of short texts where I want to filter out texts that are very similar to each other (or exact duplicates). I'd like to achieve this using Apache Beam running on Google Cloud Dataflow. I'm hoping to use the MinHash LSH…
ehrencrona
  • 6,102
  • 1
  • 18
  • 24
0
votes
1 answer

Implement minhash LSH using Spark (Java)

This is quite long, and I am sorry about that. I have been trying to implement the Minhash LSH algorithm discussed in chapter 3 by using Spark (Java). I am using a toy problem like this: +--------+------+------+------+------+ |element | doc0 | doc1…
lee
  • 234
  • 2
  • 16
0
votes
1 answer

How can I get the similarity matrix from minhash LSH?

I have read many tutorials and tried a number of minhash LSH implementations, but they cannot generate the similarity matrix; instead they return just the similar data which exceeds the threshold. How can I generate it? My intention is to use the LSH results for…
z3r0
  • 3
  • 5
0
votes
1 answer

How to determine the upper bound of c when estimating Jaccard similarity between documents?

Let's say I have a million documents that I preprocessed (calculated signatures for using minhash) in O(D*sqrt(D)) time where D is the number of documents. When I'm given a query document, I have to return the first of the million preprocessed documents…
0
votes
0 answers

How to measure similarity between 2 timestamp series of events?

Suppose I have two timestamp series of events: T1 = ['2017-03-22 15:16:45', '2017-03-22 15:16:50', '2017-03-22 15:17:55', ...] T2 = ['2017-03-22 15:16:47', '2017-03-22 15:16:52', '2017-03-22 15:17:57', ...] Each timestamp means the time it…
FriedRice
  • 9
  • 2
0
votes
2 answers

How to calculate similarity of two texts with Jaccard similarity of two bags via MinHash?

I have the following two texts: text0 = "AAAAAAAAAAAA"; text1 = "AAAAABAAAAAA"; I use 4-shingles. Thus, text0 = {AAAA}, text1 = {AAAA, AAAB, AABA, ABAA, BAAA}. Then, the Jaccard similarity is sim = 1/5 = 0.2. I do not want this result. Because the…
Yuansheng liu
  • 165
  • 1
  • 2
  • 10
0
votes
0 answers

MinHash Implementation Spark

I am trying to implement the MinHash algorithm as described in chapter 3 as simply as possible in Spark. I have searched a lot everywhere. Well, I decided to follow an implementation from this blog as Bill Dim proposes: https: …
Spar
  • 463
  • 1
  • 5
  • 23
0
votes
1 answer

Find similar images using Geometric Min Hash: How to calculate theoretical matching probabilities?

I'm trying to match images based on visual words (labeled key points within images). When comparing the simulated results to my theoretical results I get significant deviations, therefore I guess there must be a mistake in my theoretical probability…
Mad A.
  • 401
  • 4
  • 11
0
votes
1 answer

How to cluster sets (users/documents) with distributed MinHash using the banding technique?

I have a big doubt about the way I should cluster sets using MinHash together with the banding technique. I assume everyone reading has a good knowledge of MinHash so I won't define most of the terms I'm using. My goal is to use MinHash to cluster…
Chobeat
  • 3,445
  • 6
  • 41
  • 59
0
votes
1 answer

Should we consider two sets to be similar if their rows contain the same hashes but in different order?

Suppose we have minhash signatures for two sets and we want to calculate the Jaccard similarity of the two sets. We have:

     S1  S2
h1   0   1
h2   1   2
h3   2   0
h4   3   3

S1 and S2 have the same signatures in different orders. Is the…
haky_nash
  • 1,040
  • 1
  • 10
  • 15
0
votes
1 answer

How to detect the similar text on big data?

As far as I know, simhash and minhash are available for this task. But all those algorithms have to traverse the whole text database, which will be quite awful. Is there any optimization or other algorithm that can accelerate the task? All I come up…
Leo Zhao
  • 77
  • 1
  • 12