Questions tagged [lsh]

Locality-sensitive hashing

Locality-sensitive hashing reduces the dimensionality of high-dimensional data. LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items). LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a “collision” for similar items. Locality-sensitive hashing has much in common with data clustering and nearest neighbor search.
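
As a rough illustration of the "similar items map to the same buckets" idea, here is a minimal MinHash sketch with banding. It is one common LSH scheme for Jaccard similarity, not the only one. The function names (`shingles`, `minhash_signature`, `band_buckets`, `is_candidate_pair`) and the parameter values (3-character shingles, 32 hashes, 8 bands) are assumptions chosen for this example and do not come from any particular library.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed (illustrative helper)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=32):
    """MinHash signature: the minimum hash value under each seeded hash function.

    Assumes `items` is a non-empty set.
    """
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def band_buckets(signature, bands=8):
    """Split the signature into bands; each (band index, band values) pair is a bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


def is_candidate_pair(sig_a, sig_b, bands=8):
    """Two items are candidates if they share a bucket in at least one band."""
    return bool(set(band_buckets(sig_a, bands)) & set(band_buckets(sig_b, bands)))


sig_a = minhash_signature(shingles("locality sensitive hashing"))
sig_b = minhash_signature(shingles("locality-sensitive hashing"))
print(is_candidate_pair(sig_a, sig_b))
```

Each bucket key depends only on the item's own signature, so an item can be assigned to its buckets without looking at any other item. More bands with fewer rows per band make collisions more likely, which raises recall at the cost of more false candidates.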

48 questions
0
votes
1 answer

LSHModel on Spark Structured Streaming

Apparently, the LSHModel of MLlib from Spark 2.4 supports Spark Structured Streaming (https://issues.apache.org/jira/browse/SPARK-24465). However, it's not clear to me how. For instance, an approxSimilarityJoin from MinHashLSH transformation…
Galuoises
  • 2,630
  • 24
  • 30
0
votes
1 answer

How to hash a signature matrix to buckets in Locality-sensitive hashing (LSH)

I understand the algorithm behind creating a signature matrix from shingles by applying hash functions. However, I don't understand how to hash a specific band in a signature matrix to buckets. Assume in matrix M, band b1, we have the following values for…
Mina
  • 45
  • 2
  • 6
0
votes
1 answer

Does LSH work for zip, jar, wim, iso, or any kind of compressed file?

I want to know whether LSH (locality-sensitive hashing) will work for any kind of file to find nearest neighbors. Everywhere I have looked, it is used only with text files, but I want to find neighbors for wim, iso, and zip files. So will it work for the wim, iso and zip…
0
votes
1 answer

Jaccard similarity using cartesian

I have this piece of code: StructType schema = new StructType( new StructField[] { DataTypes.createStructField("file_path", DataTypes.StringType, false), DataTypes.createStructField("file_content", …
notsure
  • 15
  • 4
0
votes
1 answer

Locality Sensitive Hashing to find nearest neighbours in Python

I am using this link to solve my problem. I have a situation where I am using locality-sensitive hashing to find the 3 nearest neighbours. My dataset has 22 columns, both categorical and continuous, and about 5000 rows. I am…
Django0602
  • 797
  • 7
  • 26
0
votes
1 answer

LSH on Android Studio

Hi, I am trying to make an Android app for determining image similarity, and my model uses LSH. How can I implement this using Java in Android Studio?
0
votes
0 answers

making LSH implementation faster in C++11

I am implementing minhash and LSH for similarity search for some string elements in C++11. The minhash sketch for my implementation is a vector of 200 64-bit integers i.e. vector MinHashSketch. I have more than 2 million entries and the…
SBDK8219
  • 661
  • 4
  • 11
0
votes
0 answers

How does semantic textual similarity search based on techniques like LSH contrast with embedding techniques based on distributional semantics?

On the surface, both look like we generate a low-dimensional representation of texts by hashing or vectorizing them, where similar vectors will lie close in the vector space if embedded (in the embedding case) and similar hashes will be in the same…
0
votes
0 answers

Fast way to compare a vector against other vectors using cosine similarity in Python? Pre-computed matrix? LSH hashing?

I am working on a problem that needs similarity metrics to extract a subset of data from a larger set for further analysis. The way I am extracting the subset is by using cosine similarity above a certain threshold. The toy set below describes the…
Luis Miguel
  • 5,057
  • 8
  • 42
  • 75
0
votes
0 answers

PySpark ApproxSimilarityJoin Missing Results

I am trying to do a similarity join between two dataframes by applying MinHashLSH on the bigrams of metaphone representations of names. This works well in most cases but does not appear to handle short substring cases. For example, I want to look…
0
votes
1 answer

LSH Binning On-The-Fly

I want to use MinHash LSH to bin a large number of documents into buckets of similar documents (Jaccard similarity). The question: Is it possible to compute the bucket of a MinHash without knowing about the MinHash of the other documents? As far as…
Raphael
  • 1,731
  • 2
  • 7
  • 23
0
votes
1 answer

LSH - Binary matrix representation from shingles

I have a large dataset of news articles, 48000 to be precise. I have made n-grams of each article, where n = 3. My n-grams look like this: [[(tikro, enters, into), (enter, into, research), (into, research, and),...]] Now I need to make a binary…
Samiul
  • 45
  • 7
0
votes
1 answer

Is it possible to store a custom class object in a Spark DataFrame as a column value?

I am working on a duplicate-document detection problem using the LSH algorithm. To handle large-scale data, we are using Spark. I have around 300K documents with at least 100-200 words per document. On the Spark cluster, these are the steps we are performing…
user2058320
  • 97
  • 4
  • 12
0
votes
1 answer

Function returning same variable separated by a comma

I don't understand the point of this function returning two variables, which are the same: def construct_shingles(doc,k,h): #print 'antes -> ',doc,len(doc) doc = doc.lower() doc = ''.join(doc.split(' ')) #print 'depois ->…
spacedustpi
  • 351
  • 5
  • 18
0
votes
1 answer

How can I union all the DataFrames in an RDD[DataFrame] into a single DataFrame without a for loop, using Scala in Spark?

val result is a Spark DataFrame and its columns are [uid: Int, vector: Vector]. But the type of recomRes is RDD[DataFrame]; how can I union all the results in recomRes into a DataFrame? val recomRes = result.rdd.map(row => { val uid =…
许传华
  • 83
  • 1
  • 2