Questions tagged [cosine-similarity]

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. It is a popular similarity measure between two vectors because it is calculated as a normalized dot product between the two vectors, which can be calculated with simple mathematical operations.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (the Euclidean norm being the square root of the sum of the squared terms). For instance, vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12/(5*5) = 0.48.
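The worked example above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `cosine_similarity` and the choice to raise an error for zero vectors are illustrative assumptions, not part of the tag description.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors.

    Illustrative helper (name and error handling are assumptions):
    the dot product divided by the product of the Euclidean norms.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle to a zero vector is undefined, so refuse to guess.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# The example from the text: dot product 12, both norms 5.
print(cosine_similarity((0, 3, 4), (-3, 4, 0)))  # 0.48

# Orientation, not magnitude: same direction, orthogonal, opposite.
print(cosine_similarity((1, 0), (2, 0)))   # 1.0
print(cosine_similarity((1, 0), (0, 5)))   # 0.0
print(cosine_similarity((1, 0), (-3, 0)))  # -1.0
```

For large arrays, a vectorized library routine would normally be preferable to this pure-Python loop; the sketch is only meant to make the arithmetic explicit.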

1004 questions
5
votes
0 answers

Categorical Features in Distance Matrix

I'm calculating the cosine similarity between two feature vectors and wondering if someone might have a neat solution to the problem below around categorical features. Currently I have (example): # define the similarity function cosineSim <-…
4
votes
1 answer

Return the most similar document compared to a query document by using Cosine similarity in python

I have a set of files and a query doc. My purpose is to return the most similar documents by comparing each document with the query doc. To use cosine similarity, I first have to map the document strings to vectors. Also, I have already created a…
Barbaros26
4
votes
3 answers

Get RequestError(400, 'search_phase_execution_exception', 'runtime error') for cossimilarity

I am trying to do semantic search with Elasticsearch using tensorflow_hub, but I get RequestError: RequestError(400, 'search_phase_execution_exception', 'runtime error'). From search_phase_execution_exception I suppose that with corrupted data (from…
Armen Sanoyan
4
votes
2 answers

Gensim Doc2Vec visualization issue when using t-SNE and/or PCA

I am trying to familiarize myself with Doc2Vec results by using a public dataset of movie reviews. I have cleaned the data and run the model. There are, as you can see below, 6 tags/genres. Each is a document with its vector representation. doc_tags =…
4
votes
2 answers

How to use word embeddings (i.e., Word2vec, GloVe or BERT) to calculate the most word similarity in a set of N words?

I am trying to calculate semantic similarity by inputting a word list and outputting the word that has the highest similarity in the list. E.g. If I pass in a list of words words = ['portugal', 'spain', 'belgium', 'country', 'netherlands',…
H M
4
votes
4 answers

Pairwise similarity matrix between a set of vectors in PyTorch

Let's suppose that we have a 3D PyTorch tensor, where the first dimension represents the batch_size, as follows: import torch import torch.nn as nn x = torch.randn(32, 100, 25) That is, for each i, x[i] is a set of 100 25-dimensional vectors. I…
4
votes
4 answers

How to find outliers in document classification with million documents?

I have a million documents which belong to different classes (100 classes). I want to find outlier documents in each class (which don't belong to that class but were wrongly classified) and filter them. I can do document similarity using cosine…
4
votes
1 answer

A vector and matrix rows cosine similarity in pytorch

In PyTorch, I have multiple (on the scale of a hundred thousand) 300-dim vectors (which I think I should upload in a matrix). I want to sort them by their cosine similarity with another vector and extract the top-1000. I want to avoid a for loop as it is time…
user3531835
4
votes
2 answers

how to compare two text document with tfidf vectorizer?

I have two different texts which I want to compare using tfidf vectorization. What I am doing is: (1) tokenizing each document, (2) vectorizing using TFIDFVectorizer.fit_transform(tokens_list). Now the vectors that I get after step 2 are of different…
akshit bhatia
4
votes
3 answers

Efficient way to compute cosine similarity between 1D array and all rows in a 2D array

I have one 1D array of shape (300, ) and a 2D array of shape (400, 300). Now, I want to compute the cosine similarity between each of the rows in this 2D array and the 1D array. Thus, my result should be of shape (400, ) which represents how similar…
kmario23
4
votes
1 answer

Not able to understand python function of cosine similarity

I am working through the example in the blog to understand the collaborative filtering method used in recommendation systems. I came across cosine similarity expressed as a formula. In Python using numpy it's written as def similarity(ratings, kind='user',…
Nithin Varghese
4
votes
3 answers

How to speed up computation of cosine similarity between set of vectors

I have a set of vectors (~30k), each of which consists of 300 elements generated by fasttext. Each vector represents the meaning of an entity. I want to calculate the similarity between all entities, so I iterate over the vectors in a nested…
4
votes
1 answer

cosine similarity LSH and random hyperplane

I read a few solutions about nearest neighbor search in high dimensions using random hyperplanes, but I am still confused about how the buckets work. I have 100 million documents in the form of 100-dimensional vectors and 1 million queries. For each…
4
votes
1 answer

cosine similarity between documents (rows) - spark

I have a Spark job to compute the similarity between text documents: RowMatrix rowMatrix = new RowMatrix(vectorsRDD.rdd()); CoordinateMatrix rowsimilarity = rowMatrix.columnSimilarities(0.5); JavaRDD entries =…
4
votes
0 answers

python scikit-learn cosine similarity value error: could not convert integer scalar

I am trying to produce a cosine similarity matrix using text descriptions of apps. The script below first reads in a csv data file (I can provide the data file if needed) which contains two columns, one with two app categories and the other with…