Questions tagged [locality-sensitive-hash]

Locality-sensitive hashing (LSH) is a method of probabilistic dimension reduction.

Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).
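As a rough illustration of the bucketing idea, here is a minimal sketch using random hyperplanes, which is only one possible hash family (the tag description does not name one). The helper names `make_hyperplanes`, `signature`, and `build_index` are hypothetical and not taken from any library.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        int(sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0.0)
        for plane in planes
    )


def build_index(items, planes):
    """Group item keys by signature; each distinct signature is one bucket."""
    buckets = defaultdict(list)
    for key, vec in items.items():
        buckets[signature(vec, planes)].append(key)
    return buckets


planes = make_hyperplanes(num_bits=8, dim=3, seed=1)
items = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],   # same direction as "a"
    "a_neg": [-1.0, -2.0, -3.0],   # opposite direction
}
index = build_index(items, planes)
# Candidates for a query are whatever shares its bucket.
print(index[signature([1.0, 2.0, 3.0], planes)])
```

Vectors pointing in the same direction always share a bucket here, while vectors at a small angle share one only with high probability; using fewer bits makes collisions more likely, at the cost of more false candidates.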

97 questions
1
vote
1 answer

Confusion in hashing used by LSH

Matrix M is the signature matrix, produced by minhashing the actual data; it has documents as columns and words as rows, so each column represents a document. It then says that every stripe (b in number, r in length) has its columns hashed,…
gsamaras
1
vote
1 answer

Order preserving mapping from utf8 to an array of bytes

I'm working with an algorithm that indexes arbitrarily large unsigned integers of a known, fixed size (e.g. 64 bits or 128 bits). I'd like to be able to apply it to utf-8 strings as well, but in order to do so I need to have a reliable way to map a…
zslayton
1
vote
1 answer

Applying LSH approach by using sparse matrix instead of dense matrix

I am trying to apply LSH (https://github.com/soundcloud/cosine-lsh-join-spark) to compute cosine similarity between vectors. My real data has 2M rows (documents) and 30K features, and the matrix is highly sparse. To give…
mlee_jordan
1
vote
1 answer

Use Locality Sensitive Hashing on dynamic data set

I am using LSH for database records, creating an index (not a database index, just a simple hashmap) in which similar records are blocked into the same bucket. The database may contain several million records. My question concerns the…
1
vote
1 answer

Troubleshooting locality sensitive hash

I am using caffe, a deep neural network library, to generate image features for image-based retrieval. The particular network I am using generates a 4096-dimensional feature. I am using LSHash to generate hash buckets from the features. When I do a…
freakTheMighty
1
vote
1 answer

Example from LSHForest: results not convincing

The library and its documentation are here (yes, I have read everything and can run it on my own code): http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html But the results do not really make sense to…
user381509
1
vote
1 answer

Random projection in Python Pandas using a dataframe containing NaN values

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality sensitive hashing using random projections to reduce the dimension to 25 components, specifically with…
1
vote
0 answers

locality sensitive hashing for infinite feature space

I'm trying to wrap my head around locality-sensitive hashing in the case when you cannot enumerate all possible features (e.g. Facebook likes when comparing users). Are there solutions addressing this problem? The Locality-sensitive hashing…
1
vote
1 answer

LSH: practice of solving nearest neighbors search

"LSH has some huge advantages. First, it is simple. You just compute the hash for all points in your database, then make a hash table from them. To query, just compute the hash of the query point, then retrieve all points in the same bin…
Alexei Vinogradov
1
vote
1 answer

Can the cosine similarity when using Locality Sensitive Hashing be -1?

I was reading this question: How to understand Locality Sensitive Hashing? But then I found that the equation to calculate the cosine similarity is as follows: Cos(v1, v2) = Cos(theta) = (hamming distance/signature length) * pi = ((h/b) * pi…
Jack Twain
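For reference, under the usual random-hyperplane reading the quantity (h/b) * pi estimates the angle theta between the two vectors, and the cosine similarity estimate is the cosine of that angle. The sketch below assumes equal-length bit signatures; `estimated_cosine` is a hypothetical helper name.

```python
import math


def estimated_cosine(sig_a, sig_b):
    """Estimate cos(theta) from two equal-length bit signatures.

    The fraction of differing bits h/b estimates theta/pi, so
    theta is approximately pi * h / b.
    """
    if len(sig_a) != len(sig_b) or len(sig_a) == 0:
        raise ValueError("signatures must be non-empty and of equal length")
    hamming = sum(x != y for x, y in zip(sig_a, sig_b))
    theta = math.pi * hamming / len(sig_a)
    return math.cos(theta)


same = estimated_cosine([1, 0, 1, 1], [1, 0, 1, 1])      # h = 0, theta = 0
opposite = estimated_cosine([1, 0, 1, 1], [0, 1, 0, 0])  # h = b, theta = pi
half = estimated_cosine([1, 0, 1, 1], [1, 0, 0, 0])      # h = b/2, theta = pi/2
print(same, opposite, half)
```

So the estimate ranges over [-1, 1]: it reaches -1 exactly when every bit differs, which corresponds to vectors pointing in opposite directions.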
0
votes
0 answers

Preprocessing audio for the Locality-Sensitive Hashing (LSH) algorithm

I am designing an LSH algorithm for similarity detection in audio. I am using the librosa module to extract MFCCs from the audio, which returns a multi-dimensional list (20 rows x n columns). Currently, I normalize each value in…
0
votes
1 answer

How to share sensitive data among programs while keeping the possibility of comparing them with other local data?

Context: As part of my studies, I am creating a bot capable of detecting scam messages, in Python 3. One of the problems I am facing is the detection of fraudulent websites. Currently, I have a list of domain names saved in a CSV file, containing…
Z_runner
0
votes
1 answer

How to hash a signature matrix to buckets in Locality-sensitive hashing (LSH)

I understand the algorithm behind creating the signature matrix from shingles by applying hash functions. However, I don't understand how to hash a specific band in a signature matrix to buckets. Assume in matrix M, band b1 we have the following values for…
Mina
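One common way to do the banding step can be sketched as follows: split each document's signature into `bands` chunks of `rows` values, use each chunk as a bucket key within its own band, and treat documents that share a bucket in at least one band as candidate pairs. This is a sketch under those assumptions, not the method of any particular library; `candidate_pairs` and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands, rows):
    """Return pairs of doc ids whose signatures agree on at least one band."""
    pairs = set()
    for band in range(bands):
        # Separate bucket table per band, so band 0 never collides with band 1.
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            if len(sig) != bands * rows:
                raise ValueError("signature length must equal bands * rows")
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            # The dict hashes the tuple, so identical chunks share a bucket.
            buckets[chunk].append(doc_id)
        for docs in buckets.values():
            pairs.update(combinations(sorted(docs), 2))
    return pairs


signatures = {
    "d1": [1, 2, 3, 4],
    "d2": [1, 2, 9, 9],  # matches d1 on band 0 only
    "d3": [7, 8, 9, 0],  # matches nothing on either band
}
print(candidate_pairs(signatures, bands=2, rows=2))
```

Using the band's tuple directly as a dictionary key means only identical bands collide; hashing into a fixed number of buckets instead would save memory but admit occasional false candidates.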
0
votes
1 answer

How can I get the similarity matrix from minhash LSH?

I have read many tutorials and tried a number of minhash LSH implementations, but they do not generate the similarity matrix; instead they return only the similar items that exceed the threshold. How can I generate it? My intention is to use the LSH results for…
z3r0
0
votes
1 answer

How to determine upper bound of c when estimating jaccard similarity between documents?

Let's say I have a million documents that I preprocessed (computing minhash signatures for them) in O(D*sqrt(D)) time, where D is the number of documents. When I'm given a query document, I have to return the first of the million preprocessed documents…