Questions tagged [cosine-similarity]

Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them. It is popular because it is simply a normalized dot product of the two vectors, which can be computed with basic arithmetic operations.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (each norm being the square root of the sum of the squared terms). For instance, vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12/(5*5) = 0.48.
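
The arithmetic in the example above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `cosine_similarity` is chosen here for illustration (it is not scikit-learn's function of the same name), and raising an error for zero vectors is an assumption about how to handle the undefined case.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat the undefined zero-vector case as an error.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

a = (0, 3, 4)
b = (-3, 4, 0)

print(cosine_similarity(a, b))          # 12 / (5 * 5) = 0.48
print(cosine_similarity((0, 6, 8), b))  # a scaled by 2: still 0.48
```

Scaling either vector leaves the result unchanged, which is the "orientation, not magnitude" property described in the excerpt.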

1004 questions
3 votes, 1 answer

Fastest way to compute cosine similarity in a GPU

I have a huge TF-IDF matrix with more than a million records, and I would like to find the cosine similarity of this matrix with itself. I am using Colab to run the code, but I am not sure how to best make use of the GPU provided by…
3 votes, 2 answers

How can I get the cosine similarity of all elements of an array with all the other elements in the same array using Tensorflow

Given an array of sentence embeddings (arrays of 512) with a shape of (1000000, 512) how do I calculate the cosine similarity of every one of the 1 million sentence embeddings of the array against every other sentence embedding of the array, ideally…
jdoig
3 votes, 1 answer

How to measure how distinct a document is based on predefined linguistic categories?

I have 3 categories of words that correspond to different types of psychological drives (need-for-power, need-for-achievement, and need-for-affiliation). Currently, for every document in my sample (n=100,000), I am using a tool to count the number…
3 votes, 1 answer

cosine_similarity between 2 pandas df column to get cosine distance

I have a dataframe as shown below: vector_a vector_b [1,2,3] [2,5,6] [0,2,1] [2,9,1] [4,7,1] [1,7,4] I would like to do sklearn's cosine_similarity between the columns vector_a and vector_b to get a…
atjw94
3 votes, 0 answers

Getting a 1 x N similarity matrix instead of N x N one using Count Vectorizer

I'm trying to create a similarity matrix for a huge dataset. Its dimensions come to 60000 x 60000, which cannot be stored even in 25 GB of RAM, so I want to compute the similarity scores separately with dimension 1 x 60000, where I get…
Yaboku
3 votes, 2 answers

Bert fine-tuned for semantic similarity

I would like to fine-tune BERT to calculate semantic similarity between sentences. I searched a lot of websites but found almost nothing about this downstream task. I only found the STS benchmark. I wonder if I can use the STS benchmark dataset to train a…
3 votes, 0 answers

Which pyspark abstraction is appropriate for my large matrix multiplication?

I want to perform a large matrix multiplication C = A * B.T and then filter C by applying a stringent threshold, collecting a list of the form (row index, column index, value). A and B are sparse, with mostly zero entries. They are initially…
brch
3 votes, 1 answer

How cosine similarity differs from Okapi BM25?

I'm conducting research using Elasticsearch. I was planning to use cosine similarity, but I noticed that it is unavailable and that BM25 is the default scoring function instead. Is there a reason for that? Is cosine similarity improper for querying…
3 votes, 2 answers

Cosine Similarity between Lists of Sentences using Doc2Vec

I'm new to NLP but I'm trying to match a list of sentences to another list of sentences in Python based on their semantic similarity. For example, list1 = ['what they ate for lunch', 'height in inches', 'subjectid'] list2 = ['food eaten two days…
m13op22
3 votes, 2 answers

How to find pairs of values greater than a certain cosine distance value?

I have an array: [[ 0.32730174 -0.1436172 -0.3355202 -0.2982458 ] [ 0.50490916 -0.33826587 0.4315952 0.4850834 ] [-0.18594801 -0.06028342 -0.24817085 -0.41029227] [-0.22551994 0.47151482 -0.39798814 -0.14978702] [-0.3315491 0.05832376…
M. ahmed
3 votes, 1 answer

Does Euclidean Distance measure the semantic similarity?

I want to measure the similarity between sentences. Can I use sklearn and Euclidean distance to measure the semantic similarity between sentences? I have also read about cosine similarity. Can someone explain the difference between those two measures and what…
3 votes, 1 answer

Add exception in Spacy tokenizer to not break the tokens with whitespaces?

I am trying to find word similarity between a list of 5 words and a list of 3500 words. The problem that I am facing: The List of 5 words I have are as below …
venkatttaknev
3 votes, 1 answer

Python speed up document similarity calculation of corpus

My input is a string in this (spintax) format, "The {PC|Personal Computer|Desktop} is in {good|great|fine|excellent} condition" Then using itertools, I generate all possible combinations. e.g. "The PC is in good condition" "The PC is in great…
Mujeeb
3 votes, 2 answers

PostgreSQL: perform cosine similarity search over pre-vectorized database

I'm trying to implement the cosine similarity search on pre-vectorized database table (like trigram similarity), having objects in this structure: from django.contrib.postgres.fields import ArrayField from django.db import models class…
ShellRox
3 votes, 1 answer

fastest way to perform cosine similarity for 10 million pairs of 1x20 vectors

I have a pandas df of 2 columns, each containing 2.7 million rows of normalized vectors of length 20. I want to take the cosine sim of column1 - row1 vs column2 - row1, column1 - row2 vs column2 - row2, and so on up to row 2.7 million. I have…