Questions tagged [minhash]

MinHash is a probabilistic hashing technique for quickly estimating how similar two sets are.

MinHash (or the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating how similar two sets are, measured by their Jaccard similarity.

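A single hash function gives an estimate that is either 0 or 1 for a pair of sets. The idea of the MinHash scheme is to reduce this variance by averaging together many such variables, each constructed in the same way from a different hash function.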
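As a rough illustration of the averaging idea, here is a minimal sketch in Python using only the standard library. The function names (`minhash_signature`, `estimate_jaccard`, `exact_jaccard`) and the choice of seeded BLAKE2b as the hash family are assumptions made for this example; they are not taken from any particular library.

```python
import hashlib


def _hash(item, seed):
    # Hypothetical hash family: a 64-bit BLAKE2b digest, salted with the seed.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Return one minimum hash value per seeded hash function."""
    items = set(items)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Each position in the signature contributes a 0-or-1 agreement, and the estimate is their average, so a longer signature should give a lower-variance estimate at the cost of more hashing work.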

81 questions

2 answers

Storing the result of Minhash

The result is a fixed number of arrays, let's say lists (all of the same length) in Python. One could see it as a matrix too, so in C I would use an array where every cell points to another array. How to do it in Python? A list where every…

asked May 05 '16 at 23:50

gsamaras

1 answer

Create set of shingles from a text file (octave)

I'm implementing MinHash and LSH in Octave/Matlab. I'm trying to get a set (cell array or array) of shingles of size k from a given document, and I don't know how to do it. What I have right now is this simple code: doc = fopen(document); i =…

matlab octave minhash

asked Dec 13 '15 at 02:20

nkt09

2 answers

Fast and scalable similarity detection

I have a large PostgreSQL database containing documents. Every document is represented as a row in the table. When a new document is added to the database, I need to check for duplicates. But I can't just use a select to find an exact match. Two documents can vary…

data-mining inverted-index minhash

asked Dec 04 '12 at 11:13

Evgeny Lazin

0 answers

MinHash Query Parser for Solr: "sim" param not working as expected & How to normalize "hash_score" result?

The "sim" param asks what minimum similarity score I want, and I supply it. But it seems to ignore that minimum score entirely and returns any document that has at least one word matching the query string. Secondly, how do I normalize…

solrj solrcloud solr-query-syntax minhash lsh

asked Jul 19 '23 at 21:46

jasonbored

0 answers

python: minH - LSH

I am trying to find document similarity on a big database (I want to compare 10 000 job descriptions to 1 000 000 existing ones). I am trying to use the MinHash-LSH algorithm, but I get very bad results. I think I might be doing something wrong. I made a simple…

python minhash lsh

asked Mar 29 '23 at 12:22

Anneso

0 answers

Using DataSketch to find similarity between 3 audios using mfccs

I am using the datasketch library to find out whether audio 2 and audio 3 are similar to audio 1. However, even at threshold=1, where it should only output audios that are 100% the same, it outputs the other 2 audios as well, which are really…

python audio librosa mfcc minhash

asked Feb 13 '23 at 18:24

Faizan Ul Haq

1 answer

How to use Solr MinHashQParser

Currently I'm trying to integrate Jaccard similarity search using MinHash, and I stumbled upon the MinHash Query Parser in Solr 8.11. It says in the docs: "The queries measure Jaccard similarity between the query string and MinHash fields". How to…

solr similarity minhash

asked Oct 21 '22 at 12:35

Kipras Bielinskas

1 answer

Generate sparse vector for all the column values in spark dataframe

column1 column2
1       1
1       0
1       0
0       0

Now I want to calculate the hash or sparse vector of all the values in column1 and column2.

apache-spark pyspark apache-spark-mllib minhash

asked Mar 10 '22 at 15:13

Tanmay Sinha

1 answer

How to choose Elastiknn LSH Jaccard similarity index parameters L and k ? In my case I have minhash size = 100, and jaccard Similarity = 0.8

I am trying to detect near-duplicates using the Elastiknn plugin. I have created MinHashes of text documents, with MinHash set size = 100. I want to apply LSH with Jaccard similarity using the Elastiknn plugin (because it has this type of index…

elasticsearch duplicates minhash lsh

asked Oct 12 '21 at 12:38

pratik

1 answer

Transform a dataframe for the minHashLSH in spark

I have this data frame: val df = ( spark .createDataFrame( Seq((1L, 2L), (1L, 5L), (1L,8L), (2L,4L), (2L,6L), (2L,8L)) ) .toDF("A","B") .groupBy("A") .agg(collect_list("B").alias("B")) ) And I would like to transform…

scala apache-spark user-defined-functions apache-spark-ml minhash

asked Feb 01 '21 at 18:28

Galuoises

1 answer

Why does my query using a MinHash analyzer fail to retrieve duplicates?

I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation. I use the Python client running in containers to index and perform the search. My corpus is a JSONL file a bit like this: {"id":1, "text":"I'd just…

python elasticsearch duplicates elasticsearch-py minhash

asked Aug 02 '20 at 22:47

Davide Fiocco

1 answer

Does LSH work for zip, jar, wim, iso or any kind of compressed files?

I want to know whether LSH (locality-sensitive hashing) will work for any kind of file to find nearest neighbors. Everywhere I look, it is used with text files only, but I want to use it for wim, iso and zip files. So will it work for the wim, iso and zip…

file duplicates data-science minhash lsh

asked Jul 10 '20 at 08:05

Mohammad Wasim Khan

1 answer

Pairwise Jaccard similarity using the MinHash algorithm

I am working with 200k sentences and I want to find their Jaccard similarity using the MinHash algorithm, but it becomes really slow because of two for loops. Could someone suggest a good implementation? Below is my current code from datasketch.minhash…

performance text similarity minhash

asked Jun 19 '20 at 06:02

Sanket Badhe

0 answers

NameError: name 'min_hash' is not defined

My Spark code throws an error with min_hash, and I don't know what to do. Do I have to import some package? I'm a beginner, so I have to start with small steps...

apache-spark pyspark data-mining minhash

asked May 09 '20 at 20:40

Juggis

0 answers

making LSH implementation faster in C++11

I am implementing minhash and LSH for similarity search for some string elements in C++11. The minhash sketch for my implementation is a vector of 200 64-bit integers i.e. vector MinHashSketch. I have more than 2 million entries and the…

c++ c++11 minhash lsh

asked Dec 18 '19 at 22:14

SBDK8219
