Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash (the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating the Jaccard similarity of two sets, that is, the size of their intersection divided by the size of their union.

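Under a random hash function, the probability that two sets have the same minimum hash value equals their Jaccard similarity, so a single comparison gives an unbiased but very noisy 0/1 estimate. The idea of the MinHash scheme is to reduce the variance by averaging together several such variables constructed in the same way, for example one per hash function.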
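
The averaging idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`seeded_hash`, `minhash_signature`, `estimate_jaccard`), the use of seeded BLAKE2b as a stand-in for random permutations, and the choice of 256 hash functions are all assumptions made for the example. It also assumes non-empty sets of strings.

```python
import hashlib
import random


def seeded_hash(seed: int, item: str) -> int:
    """Hypothetical helper: a 64-bit hash of `item` that depends on `seed`."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, seeds):
    """For each seed, keep the minimum hash value over all items (items must be non-empty)."""
    return [min(seeded_hash(seed, x) for x in items) for seed in seeds]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree: the average of the 0/1 variables."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


# Assumed example data: two sets of string tokens that overlap in 50 of 150 elements.
rng = random.Random(0)
seeds = [rng.getrandbits(32) for _ in range(256)]  # one seed per hash function
set_a = {str(i) for i in range(0, 100)}
set_b = {str(i) for i in range(50, 150)}

sig_a = minhash_signature(set_a, seeds)
sig_b = minhash_signature(set_b, seeds)

print("exact:   ", exact_jaccard(set_a, set_b))
print("estimate:", estimate_jaccard(sig_a, sig_b))
```

With k hash functions the standard error of the estimate is roughly sqrt(J(1 - J) / k), so using more hash functions trades memory and time for accuracy.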

81 questions
1
vote
2 answers

Storing the result of Minhash

The result is a fixed number of arrays, let's say lists (all of the same length) in Python. One could see it as a matrix too, so in C I would use an array where every cell points to another array. How do I do it in Python? A list where every…
gsamaras
1
vote
1 answer

Create a set of shingles from a text file (Octave)

I'm creating MinHash and LSH in Octave/Matlab. I'm trying to get a set (cell array or array) of shingles of size k from a given document, and I don't know how to do it. What I have right now is this simple code: doc = fopen(document); i =…
nkt09
1
vote
2 answers

Fast and scalable similarity detection

I have a large PostgreSQL database containing documents. Every document is represented as a row in the table. When a new document is added to the database, I need to check for duplicates. But I can't just use a SELECT to find an exact match. Two documents can vary…
Evgeny Lazin
0
votes
0 answers

MinHash Query Parser for Solr: "sim" param not working as expected & How to normalize "hash_score" result?

The "sim" param asks what minimum similarity score I want, and I input that. But it seems to fully ignore the minimum score and returns any document that has at least one word matching the query string. Secondly, how do I normalize…
0
votes
0 answers

python: minH - LSH

I am trying to find document similarity on a big database (I want to compare 10 000 job descriptions to 1 000 000 existing ones). I am trying to use the minH-LSH algorithm, but I get very bad results. I think I might be doing something wrong. I made a simple…
Anneso
0
votes
0 answers

Using DataSketch to find similarity between 3 audios using MFCCs

So I am using the datasketch library to find out whether audio 2 and audio 3 are similar to audio 1. However, even at threshold=1, where it should only output audios that are 100% the same, it outputs the other 2 audios as well, which are really…
0
votes
1 answer

How to use Solr MinHashQParser

Currently I'm trying to integrate Jaccard similarity search using MinHash, and I stumbled upon the MinHash Query Parser in Solr 8.11. The docs say: "The queries measure Jaccard similarity between the query string and MinHash fields". How to…
0
votes
1 answer

Generate sparse vector for all the column values in spark dataframe

column1 column2
1       1
1       0
1       0
0       0

Now I want to calculate the hash or sparse vector of all the values in column1 and column2
0
votes
1 answer

How to choose Elastiknn LSH Jaccard similarity index parameters L and k? In my case I have minhash size = 100 and Jaccard similarity = 0.8

I am trying to detect near-duplicates using the Elastiknn plugin. I have created minhashes of text documents, with minhash set size = 100. I want to apply LSH with Jaccard similarity using the Elastiknn plugin (because it has this type of index…
pratik
0
votes
1 answer

Transform a dataframe for the minHashLSH in spark

I have this data frame:

    val df = spark
      .createDataFrame(Seq((1L, 2L), (1L, 5L), (1L, 8L), (2L, 4L), (2L, 6L), (2L, 8L)))
      .toDF("A", "B")
      .groupBy("A")
      .agg(collect_list("B").alias("B"))

And I would like to transform…
0
votes
1 answer

Why does my query using a MinHash analyzer fail to retrieve duplicates?

I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation. I use the Python client running in containers to index and perform the search. My corpus is a JSONL file a bit like this: {"id":1, "text":"I'd just…
Davide Fiocco
0
votes
1 answer

Does LSH work for zip, jar, wim, iso, or any other kind of compressed file?

I want to know whether LSH (locality-sensitive hashing) works for any kind of file when finding nearest neighbors. Everywhere I have looked, only text files are used, but I want to find neighbors for wim, iso and zip files. So will it work for the wim, iso and zip…
0
votes
1 answer

Pairwise Jaccard similarity using the MinHash algorithm

I am working with 200k sentences and I want to find their Jaccard similarity using the MinHash algorithm, but it becomes really slow because of two for loops. Could someone suggest a good implementation? Below is my current code: from datasketch.minhash…
Sanket Badhe
0
votes
0 answers

NameError: name 'min_hash' is not defined

My Spark code throws an error with min_hash, and I don't know what to do. Do I have to import some package? I'm a beginner, so I have to start with small steps...
Juggis
0
votes
0 answers

Making an LSH implementation faster in C++11

I am implementing minhash and LSH for similarity search for some string elements in C++11. The minhash sketch for my implementation is a vector of 200 64-bit integers i.e. vector MinHashSketch. I have more than 2 million entries and the…
SBDK8219