Questions tagged [cosine-similarity]

Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them. It is popular because it is simply a normalized dot product of the two vectors, which can be computed with basic arithmetic operations.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (the Euclidean norm of a vector is the square root of the sum of its squared components). For instance, the vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12/(5 × 5) = 0.48.
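As a minimal sketch of the computation described above, the following pure-Python function divides the dot product by the product of the two Euclidean norms and reproduces the worked example. The function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for illustration, not part of the Wikipedia text.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Dot product: sum of the element-wise products.
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norm: square root of the sum of the squared components.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption for this sketch: a zero vector has no direction, so reject it.
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


similarity = cosine_similarity((0, 3, 4), (-3, 4, 0))
print(round(similarity, 2))  # 0.48
```

Because only the direction of each vector matters, scaling either input by a positive constant leaves the result unchanged. For large or sparse data, a vectorized library routine would usually be preferable to this loop-based version.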

1004 questions
6
votes
3 answers

Pytorch RuntimeError: [enforce fail at CPUAllocator.cpp:56] posix_memalign(&data, gAlignment, nbytes) == 0. 12 vs 0

I'm building a simple content-based recommendation system. To compute the cosine similarity in a GPU-accelerated way, I'm using PyTorch. When creating the TF-IDF vocabulary tensor from a csr_matrix, it prompts the following…
6
votes
1 answer

String Matching Using TF-IDF, NGrams and Cosine Similarity in Python

I am working on my first major data science project. I am attempting to match names between a large list of data from one source, to a cleansed dictionary in another. I am using this string matching blog as a guide. I am attempting to use two…
HMan06
  • 755
  • 2
  • 9
  • 23
6
votes
2 answers

Create random vector given cosine similarity

Basically, given some vector v, I want to get another random vector w with a given cosine similarity between v and w. Is there any way to do this in Python? Example: for simplicity, take the 2D vector v = [3, -4]. I want to get a random vector w…
eugen
  • 1,249
  • 9
  • 15
6
votes
1 answer

When using the linear_kernel or the cosine_similarity for TfIdfVectorizer I get the error "Kernel died, restarting"

When using the linear_kernel or the cosine_similarity for TfIdfVectorizer, I get the error "Kernel died, restarting". I am running scikit-learn's TfIdfVectorizer and its fit_transform on some text data like the example below, but…
ana
  • 61
  • 1
  • 4
6
votes
1 answer

TfIdfVectorizer: How does the vectorizer with fixed vocab deal with new words?

I'm working on a corpus of ~100k research papers. I'm considering three fields: plaintext, title, and abstract. I used the TfIdfVectorizer to get a TF-IDF representation of the plaintext field and feed the resulting vocab back into the…
nadre
  • 507
  • 1
  • 4
  • 17
6
votes
1 answer

How to efficiently retrieve top K-similar document by cosine similarity using python?

I am handling one hundred thousand (100,000) documents (mean document length is about 500 terms). For each document, I want to get the top k (e.g. k = 5) similar documents by cosine similarity. How can I do this efficiently in Python? Here is what I…
user1024
  • 982
  • 4
  • 13
  • 26
6
votes
2 answers

DBSCAN error with cosine metric in python

I was trying to use the DBSCAN algorithm from the scikit-learn library with the cosine metric but got stuck on an error. The line of code is db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X) where X is a csr_matrix. The error is the following:…
6
votes
0 answers

Python: check cosine similarity between mongoDB database documents

I am using Python. I have a MongoDB collection in which all documents have the following format: {"_id":ObjectId("53590a43dc17421e9db46a31"), "latlng": {"type" : "Polygon", "coordinates":[[[....],[....],[....],[....],[.....]]]} …
gladys0313
  • 2,569
  • 6
  • 27
  • 51
6
votes
1 answer

Why doesn't scikit-learn's Nearest Neighbor seem to return proper cosine similarity distances?

I am trying to use scikit's Nearest Neighbor implementation to find the closest column vectors to a given column vector, out of a matrix of random values. This code is supposed to find the nearest neighbors of column 21 then check the actual cosine…
6
votes
2 answers

How to compute cosine similarity using two matrices

I have two matrices, A (dimensions M x N) and B (N x P). In fact, they are collections of vectors - row vectors in A, column vectors in B. I want to get cosine similarity scores for every pair a and b, where a is a vector (row) from matrix A and b…
John Manak
  • 13,328
  • 29
  • 78
  • 119
6
votes
1 answer

How to efficiently compute similarity between documents in a stream of documents

I gather Text documents (in Node.js) where one document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents are coming as a sort of stream of…
5
votes
1 answer

Is it possible to model cosine similarity in Solr/Lucene?

I'm interested in possible ways to model the cosine similarity algorithm using Solr. I have items which are assigned a vector, for example: items = [ { id: 1, vector: [0,0,0,2,3,0,0] }, { id: 2, vector: [0,1,0,1,5,0,0] }, { id: 3, vector:…
JC Grubbs
  • 39,191
  • 28
  • 66
  • 75
5
votes
0 answers

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

import asyncio import torch import os import pandas as pd from flair.data import Sentence from flair.embeddings import FlairEmbeddings, DocumentPoolEmbeddings, WordEmbeddings device = torch.device("cpu") print(device) # first, declare how you…
Karthick Aravindan
  • 1,042
  • 3
  • 12
  • 19
5
votes
2 answers

How to find cosine similarity of one vector vs matrix

I have a TF-IDF matrix of shape (149,1001). What I want is to compute the cosine similarity of the last column with all columns. Here is what I did: from numpy import dot from numpy.linalg import norm for i in range(mat.shape[1]-1): cos_sim =…
Talha Anwar
  • 2,699
  • 4
  • 23
  • 62
5
votes
1 answer

How to do Sentence Similarity with XLNet?

I want to perform a sentence similarity task and tried the following: from transformers import XLNetTokenizer, XLNetModel import torch import scipy import torch.nn as nn import torch.nn.functional as F tokenizer =…
spadel
  • 998
  • 2
  • 16
  • 40