Questions tagged [cosine-similarity]

Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them. It is popular because it is simply a normalized dot product of the two vectors, which can be computed with basic arithmetic operations.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (each norm being the square root of the sum of the squared terms). For instance, vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12 / (5 × 5) = 0.48.
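
The worked example above can be checked with a few lines of plain Python. This is only a minimal sketch, not taken from Wikipedia or from any library: the function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors.

    Hypothetical helper for illustration; raising on zero vectors is an
    assumption, since the cosine is undefined when either norm is 0.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Dot product: sum of element-wise products.
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms: square root of the sum of squared terms.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors from the example: dot product 12, both norms 5.
print(cosine_similarity((0, 3, 4), (-3, 4, 0)))  # 0.48
```

Because the result depends only on orientation, scaling either vector by a positive constant leaves the value unchanged.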

1004 questions
10
votes
4 answers

Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?

My goal is to input 3 queries and find out which query is most similar to a set of 5 documents. So far I have calculated the tf-idf of the documents doing the following: from sklearn.feature_extraction.text import TfidfVectorizer def…
OultimoCoder
  • 244
  • 2
  • 7
  • 24
10
votes
3 answers

Cosine similarity TSNE in sklearn.manifold

I have a small problem performing TSNE on my dataset using cosine similarity. I have calculated the cosine similarity of all of my vectors, so I have a square matrix which contains my cosine similarities: A = [[ 1 0.7 0.5 0.6 ] [ …
HugoLasticot
  • 203
  • 3
  • 12
10
votes
1 answer

word2vec, sum or average word embeddings?

I'm using word2vec to represent a small phrase (3 to 4 words) as a unique vector, either by adding each individual word embedding or by calculating the average of word embeddings. From the experiments I've done I always get the same cosine…
David Batista
  • 3,029
  • 2
  • 23
  • 42
10
votes
2 answers

Finding the best cosine similarity in a set of vectors

I have n vectors, each with m elements (real numbers). I want to find the pair whose cosine similarity is maximum among all pairs. The straightforward solution would require O(n²m) time. Is there any better solution? update Cosine similarity /…
hs3180
  • 188
  • 1
  • 8
10
votes
3 answers

clustering with cosine similarity

I have a large data set that I would like to cluster. My trial run set size is 2,500 objects; when I run it on the 'real deal' I will need to handle at least 20k objects. These objects have a cosine similarity between them. This cosine similarity…
9
votes
1 answer

Why use cosine similarity in Word2Vec when it's trained using dot-product similarity

According to several posts I found on Stack Overflow (for instance, "Why does word2Vec use cosine similarity?"), it's common practice to calculate the cosine similarity between two word vectors after we have trained a word2vec (either CBOW or…
9
votes
2 answers

cosine similarity built-in function in matlab

I want to calculate cosine similarity between different rows of a matrix in matlab. I wrote the following code in matlab: for i = 1:n_row for j = i:n_row S2(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j)); S2(j,i) =…
Mehdi
  • 117
  • 1
  • 6
9
votes
1 answer

SQL Computation of Cosine Similarity

Suppose you have a table in a database constructed as follows: create table data (v int, base int, w_td float); insert into data values (99,1,4); insert into data values (99,2,3); insert into data values (99,3,4); insert into data values…
tipanverella
  • 3,477
  • 3
  • 25
  • 41
8
votes
2 answers

Cosine similarity between 0 and 1

I am interested in calculating similarity between vectors, however this similarity has to be a number between 0 and 1. There are many questions concerning tf-idf and cosine similarity, all indicating that the value lies between 0 and 1. From…
Bram Vanroy
  • 27,032
  • 24
  • 137
  • 239
8
votes
2 answers

Python: Cosine similarity between two large numpy arrays

I have two numpy arrays: Array 1: 500,000 rows x 100 cols Array 2: 160,000 rows x 100 cols I would like to find the largest cosine similarity between each row in Array 1 and Array 2. In other words, I compute the cosine similarities between the…
Alex
  • 4,030
  • 8
  • 40
  • 62
8
votes
2 answers

Python tf-idf: fast way to update the tf-idf matrix

I have a dataset of several thousand rows of text. My goal is to calculate the tf-idf scores and then the cosine similarity between documents. This is what I did using gensim in Python, following the tutorial: dictionary = corpora.Dictionary(dat) corpus =…
snowneji
  • 1,086
  • 1
  • 11
  • 25
8
votes
3 answers

create cosine similarity matrix numpy

Suppose I have a numpy matrix like the following: array([array([ 0.0072427 , 0.00669255, 0.00785213, 0.00845336, 0.01042869]), array([ 0.00710799, 0.00668831, 0.00772334, 0.00777796, 0.01049965]), array([ 0.00741872, 0.00650899, …
Sal
  • 277
  • 2
  • 3
  • 9
8
votes
1 answer

Pairwise Operations between Rows of Spark Dataframe (Pyspark)

I have a Spark Dataframe with two columns: id and hash_vector. The id is the id for a document and hash_vector is a SparseVector of word counts corresponding to the document (and has size 30000). There are ~100000 rows (one for each document) in…
8
votes
3 answers

Vectorized cosine similarity calculation in Python

I have two large sets of vectors, A and B. Each element of A is a 1-dimensional vector of length 400, with float values between -10 and 10. For each vector in A, I'm trying to calculate the cosine similarities to all vectors in B in order to find…
BoltzmannBrain
  • 5,082
  • 11
  • 46
  • 79
8
votes
2 answers

Right way to compute cosine similarity between two arrays?

I am working on a project that detects some features of two input images (handwritten signatures) and compares those features using cosine similarity. By two input images, I mean that one is an original image and the other is a duplicate image. Say…
Shruthi Kodi
  • 107
  • 1
  • 3
  • 10