Questions tagged [cosine-similarity]

Cosine similarity measures how similar two vectors of an inner product space are by taking the cosine of the angle between them. It is popular because it is simply a normalized dot product of the two vectors, which needs only basic arithmetic to compute.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (each norm being the square root of the sum of the squared terms). For instance, the vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12/(5 × 5) = 0.48.
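The arithmetic above can be checked with a minimal sketch in plain Python. The function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for illustration; they are not taken from the tag description or from any particular library.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption for this sketch: treat a zero vector as an error,
    # because the angle to a zero vector is undefined.
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# The worked example from the text: dot product 12, both norms 5.
similarity = cosine_similarity((0, 3, 4), (-3, 4, 0))
print(round(similarity, 2))  # 0.48
```

Because both norms appear in the denominator, scaling either vector by a positive constant leaves the result unchanged, which is the "orientation and not magnitude" property described above.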

1004 questions
3 votes, 2 answers

What is the most efficient way to identify text similarity between items in large lists of strings in Python?

The following piece of code achieves the results I'm trying to achieve. There is a list of strings called 'lemmas' that contains the accepted forms of a specific class of words. The other list, called 'forms' contains a lot of spelling variations of…
jfontana
  • 141
  • 1
  • 6
3 votes, 0 answers

How to get the most similar match using BERT from a pandas column to an input string?

I am trying to find the most similar match in a column of a pandas dataframe to an input string that is not in English (Swedish). This is what I have tried. I have encoded both my input string and the texts in the pandas' column and then I tried to…
Vai
  • 149
  • 1
  • 10
3 votes, 1 answer

Calculate Distance Metric between Homomorphic Encrypted Vectors

Is there a way to calculate a distance metric (euclidean or cosine similarity or manhattan) between two homomorphically encrypted vectors? Specifically, I'm looking to generate embeddings of documents (using a transformer), homomorphically…
3 votes, 3 answers

Huggingface Transformers FAISS index scores

Huggingface transformers library has a pretty awesome feature: it can create a FAISS index on embeddings dataset which allows searching for the nearest neighbors. train_ds['train'].add_faiss_index("embedding") scores, sample =…
Nik
  • 161
  • 1
  • 13
3 votes, 1 answer

Getting similarity score with spacy and a transformer model

I've been using the spacy en_core_web_lg and wanted to try out en_core_web_trf (transformer model) but am having some trouble wrapping my head around the difference in the model/pipeline usage. My use case looks like the following: import spacy from…
Connor
  • 393
  • 2
  • 9
3 votes, 1 answer

What is the equivalent of python's faiss.normalize_L2() in C++?

I want to perform similarity search using FAISS for 100k facial embeddings in C++. For the distance calculator I would like to use cosine similarity. For this purpose, I chose faiss::IndexFlatIP. But according to the documentation we need to…
Sabbir Talukdar
  • 115
  • 2
  • 11
3 votes, 1 answer

Python compute cosine similarity on two directories of files

I have two directories of files. One contains human-transcribed files and the other contains IBM Watson transcribed files. Both directories have the same number of files, and both were transcribed from the same telephony recordings. I'm computing…
jtoepp
  • 83
  • 7
3 votes, 1 answer

Cosine distance more than 1

I'm using the distance.cosine function from the scipy.spatial Python package. The problem is that my code returns some values that are greater than one. How is that possible? My code is very simple but that's it: for i in…
Barbamento
  • 33
  • 1
  • 5
3 votes, 2 answers

Top N Values of Cosine Similarity Matrix in R

How do I get the top pairs of a cosine similarity matrix like below: southpark_matrix <- structure(c(0, 0.165272735625452, 0.386480286121192, 0.170696960480773, 0.0869562860988618, 0.165272735625452, 0, 0.251690602341816, 0.472701602991984,…
nak5120
  • 4,089
  • 4
  • 35
  • 94
3 votes, 3 answers

Calculating words similarity score in python

I'm trying to calculate books similarity by comparing the topics lists. Need to get similarity score from the 2 lists between 0-1. Example: book1_topics = ["god", "bible", "book", "holy", "religion", "Christian"] book2_topics = ["god", "Christ",…
Sapir
  • 31
  • 1
  • 2
3 votes, 2 answers

Computing Cosine Distance with Differently shaped tensors

I have the following tensor representing a word vector A = (2, 500) Where the first dimension is the BATCH dimension (i.e. A contains two word vectors each with 500 elements) I also have the following tensor B = (10, 500) I want to compute the…
Joe
  • 175
  • 3
  • 10
3 votes, 4 answers

How to find most optimal number of clusters with K-Means clustering in Python

I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. All the users rated at least one movie. A value of 1 for good, 0 for bad and blank if the annotator has no choice. I want to cluster similar…
3 votes, 2 answers

create a function to compute all pairwise cosine similarity of the row vectors in a 2-D matrix using only numpy

For example, given matrix array([[ 0, 1, 2, 3, 4], [ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14]]) it should return array([[1. , 0.91465912, 0.87845859], [0.91465912, 1. , 0.99663684], [0.87845859,…
RRR
  • 63
  • 2
  • 8
3 votes, 0 answers

Text similarity as probability (between 0 and 1)

I have been trying to compute text similarity such that it'd be between 0 and 1, seen as a probability. The two texts are encoded as two vectors whose components are numbers in [-1, 1]. So as two vectors are given, it seems plausible to use…
inverted_index
  • 2,329
  • 21
  • 40
3 votes, 2 answers

About cosine similarity, how to choose the loss function and the network (I have two plans)

Sorry, I have no clue and I don't know where to find a solution. I'm using two networks to construct two embeddings, and I have a binary target to indicate whether embeddingA and embeddingB "match" or not (1 or -1). The dataset looks like this: embA0 embB0 1.0 embA1…