Questions tagged [cosine-similarity]

Cosine similarity measures how similar two non-zero vectors of an inner product space are by taking the cosine of the angle between them. It is popular because it is simply a normalized dot product, which can be computed with basic arithmetic operations.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (the Euclidean norm is the square root of the sum of the squared terms). For instance, vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12/(5*5) = 0.48.
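The arithmetic above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Dot product: sum of element-wise products.
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms: square root of the sum of squared terms.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: treat the zero vector as an error,
        # since the angle is undefined for it.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# The worked example from the text: dot product 12, both norms 5.
example = cosine_similarity((0, 3, 4), (-3, 4, 0))
print(example)
```

Because both norms are divided out, scaling either vector by a positive constant leaves the result unchanged, which is why the measure reflects orientation rather than magnitude.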

1004 questions
8
votes
2 answers

How to efficiently compute the cosine similarity between millions of strings

I need to compute the cosine similarity between strings in a list. For example, I have a list of over 10 million strings, and each string must be compared for similarity with every other string in the list. What is the best algorithm I can use…
Kennedy
  • 2,146
  • 6
  • 31
  • 44
8
votes
4 answers

Combining TF-IDF (cosine similarity) with pagerank?

Given a query, I have a cosine score for a document. I also have the document's PageRank. Is there a standard, good way of combining the two? I was thinking of multiplying them: Total_Score = cosine-score * pagerank, because if you get too low on either…
user1506145
  • 5,176
  • 11
  • 46
  • 75
7
votes
1 answer

Scipy cosine similarity vs sklearn cosine similarity

I noticed that both scipy and sklearn have cosine similarity/cosine distance functions. I wanted to test the speed of each on pairs of vectors: setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 =…
Jay Mody
  • 3,727
  • 1
  • 11
  • 27
7
votes
3 answers

Cosine similarity for very large dataset

I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError on my 16 GB machine. Each array fits perfectly in my memory…
Saurabh Gokhale
  • 53,625
  • 36
  • 139
  • 164
7
votes
0 answers

Maximal optimization for cosine similarity search

I have a pre-made database full of 512-dimensional vectors and want to implement an efficient search algorithm over them. Research Cosine similarity: The best algorithm in this case would consist of cosine similarity measure, which is basically a…
ShellRox
  • 2,532
  • 6
  • 42
  • 90
7
votes
2 answers

Is there any reason to (not) L2-normalize vectors before using cosine similarity?

I was reading the paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say: Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e.…
user3554004
  • 1,044
  • 9
  • 24
7
votes
1 answer

Spark cosine distance between rows using Dataframe

I have to compute the cosine distance between each pair of rows, but I have no idea how to do it elegantly using Spark API DataFrames. The idea is to compute similarities for each row (item) and take the top 10 similarities by comparing their similarities between…
Ivan Shelonik
  • 1,958
  • 5
  • 25
  • 49
7
votes
5 answers

how to convert from Object[] to int[]

I want to pass myVector to another class (Case.java) but I get this kind of error message. Type mismatch: cannot convert from Object[] to int[]. Can anybody tell me how to solve this? User.java JButton btnNewButton = new…
John Joe
  • 12,412
  • 16
  • 70
  • 135
7
votes
3 answers

Cosine similarity calculation between two matrices

I have code to calculate cosine similarity between two matrices: def cos_cdist_1(matrix, vector): v = vector.reshape(1, -1) return sp.distance.cdist(matrix, v, 'cosine').reshape(-1) def cos_cdist_2(matrix1, matrix2): return…
gladys0313
  • 2,569
  • 6
  • 27
  • 51
7
votes
2 answers

python: How to calculate the cosine similarity of two word lists?

I want to calculate the cosine similarity of two lists like the following: A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory'] B = [u'home (private)', u'school', u'bank', u'shopping mall'] I know the cosine similarity of…
gladys0313
  • 2,569
  • 6
  • 27
  • 51
7
votes
3 answers

Python pandas: Finding cosine similarity of two columns

Suppose I have two columns in a python pandas.DataFrame: col1 col2 item_1 158 173 item_2 25 191 item_3 180 33 item_4 152 165 item_5 96 108 What's the best way to take the cosine similarity of these two columns?
hlin117
  • 20,764
  • 31
  • 72
  • 93
7
votes
1 answer

Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?

Data Structure: User has many Profiles (Limit - no more than one of each profile type per user, no duplicates) Profiles has many Attribute Values (A user can have as many or few attribute values as they like) Attributes belong to a category …
StringsOnFire
  • 2,726
  • 5
  • 28
  • 50
7
votes
1 answer

How to efficiently retrieve top K-similar vectors by cosine similarity using R?

I'm working on a high-dimensional problem (~4k terms) and would like to retrieve the top k most similar vectors (by cosine similarity), but can't afford to do a pairwise calculation. My training set is a 6 million x 4k matrix and I would like to make predictions for…
user1509107
6
votes
1 answer

Calculating similarities of text embeddings using CLIP

I am trying to use CLIP to calculate the similarities between strings. (I know that CLIP is usually used with text and images but it should work with only strings as well.) I provide a list of simple text prompts and calculate the similarity between…
Adi Eyal
  • 183
  • 9
6
votes
1 answer

Universal sentence encoder for big document similarity

I need to create a 'search engine' experience: from a short query (a few words), I need to find the relevant documents in a corpus of thousands of documents. After analyzing a few approaches, I got very good results with the Universal Sentence Encoder…