Questions tagged [cosine-similarity]

Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them. It is popular because it is simply a normalized dot product of the two vectors, which takes only basic arithmetic to compute.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (the Euclidean norm is the square root of the sum of the squared terms). For instance, vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12/(5·5) = 0.48.
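The arithmetic in that example can be checked with a few lines of plain Python. This is only a minimal sketch: the function name `cosine_similarity` and the choice to raise an error on zero vectors are assumptions made for illustration, not part of the Wikipedia text or of any particular library.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat the zero vector as an error, since the angle is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Worked example from the text: dot product 12, both norms 5.
print(cosine_similarity((0, 3, 4), (-3, 4, 0)))
```

Scaling either vector by a positive constant leaves the result unchanged, which is the sense in which the measure depends on orientation and not magnitude.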

1004 questions

votes

2 answers

Finding most similar sentence match

I have a large dataset containing a mix of words and short phrases, such as: dataset = [ "car", "red-car", "lorry", "broken lorry", "truck owner", "train", ... ] I am trying to find a way to determine the most similar…

asked Jun 20 '18 at 15:28

user9966656

votes

1 answer

How to calculate weighted similarity with scipy.spatial.distance.cosine?

From the function definition: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html scipy.spatial.distance.cosine(u, v, w=None) but my code raises some errors: from scipy import spatial d1 = [3,5,5,3,3,2] d2 =…

python math machine-learning euclidean-distance cosine-similarity

asked Jun 19 '18 at 11:09

DataHolic

votes

1 answer

Cosine similarity between matching rows in numpy ndarrays

I have 2 ndarrays of shape (n_samples, n_dimensions) and I want the cosine similarity for each pair of corresponding rows, so the output would have shape (n_samples,). Using sklearn's implementation I get an (n_samples, n_samples) result - which obviously makes a lot of irrelevant…

python arrays numpy distance cosine-similarity

asked Mar 11 '18 at 08:53

bluesummers

votes

0 answers

Identifying Duplicate Customers Based on Similarity (Spark Dataframe)

I have a Spark DataFrame that contains customer information. Some clients are duplicates, but it's hard for the computer to determine that without some form of fuzzy matching such as Levenshtein distance. In the example below, John Smith and Johnny…

scala apache-spark-sql cosine-similarity fuzzy-logic fuzzy-comparison

asked Aug 28 '17 at 21:47

Steve

votes

0 answers

Normalising Data to use Cosine Distance in Kmeans (Python)

I am currently solving a problem where I have to use Cosine distance as the similarity measure for Kmeans clustering. However, the standard Kmeans clustering package (from Sklearn package) uses Euclidean distance as standard, and does not allow you…

python k-means euclidean-distance cosine-similarity normalize

asked Aug 20 '17 at 07:45

MSalty

votes

2 answers

Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search…

python pandas scikit-learn tf-idf cosine-similarity

asked Mar 23 '17 at 00:37

Bango

votes

1 answer

Cosine Similarity

I was reading and came across this formula: The formula is for cosine similarity. I thought this looked interesting and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix: M =…

python numpy scikit-learn similarity cosine-similarity

asked Mar 17 '17 at 20:01

Mike El Jackson

votes

2 answers

Efficiently calculate cosine similarity using scikit-learn

After preprocessing and transforming (BOW, TF-IDF) data I need to calculate its cosine similarity with each other element of the dataset. Currently, I do this: cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title] cs_abstract =…

python performance optimization scikit-learn cosine-similarity

asked Feb 04 '17 at 19:40

user7347576

votes

1 answer

Understanding Spark CosineSimilarity output

I am using the Spark 1.6 cosine similarity (DIMSUM) algorithm. Referring to: https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala Here is what I am doing. Input: 50k documents' text with…

algorithm scala apache-spark cosine-similarity

asked Jan 31 '17 at 19:02

MasterGoGo

votes

2 answers

Right way to calculate the cosine similarity of two word-frequency-dictionaries in python?

I'm trying to iterate through a file containing text and calculate the cosine similarity between the current line and a query the user raised. I have already tokenized the query and the line and saved the union of their words into a…

python python-3.x nlp nltk cosine-similarity

asked Jan 24 '17 at 12:13

lvcasco

votes

2 answers

Alternatives to TF-IDF and Cosine Similarity (comparing documents with different formats)

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows: 1)…

nlp data-mining tf-idf cosine-similarity

asked Jan 02 '17 at 18:19

Richard Knoche

votes

1 answer

Calculate cosine similarity between words

If we have two lists of strings: A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split() B = "bank, weather, sun, moon, fun, hi".split(",") The words in list A constitute my word vector basis. How can I calculate the…

python cosine-similarity

asked Nov 05 '16 at 11:48

JohnD

votes

1 answer

Pandas: Apply function over each pair of columns under constraints

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form: Code | 14 | 17 | 19 | ... w1 | 0 | 5 | 3 | ... w2 | 2 …

python pandas cosine-similarity

asked Jul 19 '16 at 10:00

Gonzalo Donoso

votes

1 answer

Pairwise distance python (one base vector against many others)

I have a base vector (consisting of 1's and 0's) and I want to find the cosine distance to 50,000 other vectors (also consisting of 1's and 0's). I found many ways to calculate an entire matrix of pairwise distance, but I'm not interested in that.…

python cosine-similarity

asked Jul 08 '16 at 19:46

Green

votes

1 answer

How to handle negative values of cosine similarities

I computed the tf-idf of my documents based on terms. Then, I applied LSA to reduce the dimensionality of the terms. 'similarity_dist' contains values which are negative (see table below). How can I compute cosine distance with the range…

python scikit-learn svd cosine-similarity lsa

asked May 26 '16 at 07:53

kitchenprinzessin
