Questions tagged [cosine-similarity]

Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them. It is popular because it is simply the normalized dot product of the two vectors, which needs only basic arithmetic to compute.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (the Euclidean norm is the square root of the sum of the squared terms). For instance, vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12 / (5 × 5) = 0.48.
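
The worked example above can be checked with a short Python sketch. The function name `cosine_similarity` and the zero-vector handling are illustrative choices made here, not part of the quoted text; libraries such as SciPy and scikit-learn ship their own implementations.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Dot product: sum of the element-wise products.
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norm: square root of the sum of the squared terms.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption of this sketch: reject zero vectors, since the angle is undefined.
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


similarity = cosine_similarity((0, 3, 4), (-3, 4, 0))
print(round(similarity, 2))  # 0.48
```

Scaling either vector by a positive constant leaves the result unchanged, which is the sense in which the measure reflects orientation rather than magnitude.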

1004 questions
3 votes, 2 answers

Finding most similar sentence match

I have a large dataset containing a mix of words and short phrases, such as: dataset = [ "car", "red-car", "lorry", "broken lorry", "truck owner", "train", ... ] I am trying to find a way to determine the most similar…
user9966656
3 votes, 1 answer

How to calculate weighted similarity with scipy.spatial.distance.cosine?

From the function definition: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html scipy.spatial.distance.cosine(u, v, w=None) but my code raises errors: from scipy import spatial d1 = [3,5,5,3,3,2] d2 =…
3 votes, 1 answer

Cosine similarity between matching rows in numpy ndarrays

I have 2 ndarrays of shape (n_samples, n_dimensions) and I want the cosine similarity for each pair of corresponding rows, so the output would have shape (n_samples, ). Using sklearn's implementation I get an (n_samples, n_samples) result, which obviously makes a lot of irrelevant…
bluesummers • 11,365 • 8 • 72 • 108
3 votes, 0 answers

Identifying Duplicate Customers Based on Similarity (Spark Dataframe)

I have a Spark dataframe that contains customer information. Some clients are duplicates, but it's hard for the computer to determine that without some form of fuzzy matching such as Levenshtein distance. In the example below, John Smith and Johnny…
Steve • 11,831 • 14 • 51 • 63
3 votes, 0 answers

Normalising Data to use Cosine Distance in Kmeans (Python)

I am currently solving a problem where I have to use Cosine distance as the similarity measure for Kmeans clustering. However, the standard Kmeans clustering package (from Sklearn package) uses Euclidean distance as standard, and does not allow you…
MSalty • 4,086 • 2 • 12 • 16
3 votes, 2 answers

Python: MemoryError when computing tf-idf cosine similarity between two columns in Pandas

I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The cosine similarity value is intended to be a "feature" for a search…
Bango • 155 • 1 • 9
3 votes, 1 answer

Cosine Similarity

I was reading and came across this formula: The formula is for cosine similarity. I thought this looked interesting and I created a numpy array that has user_id as row and item_id as column. For instance, let M be this matrix: M =…
Mike El Jackson • 771 • 3 • 14 • 23
3 votes, 2 answers

Efficiently calculate cosine similarity using scikit-learn

After preprocessing and transforming (BOW, TF-IDF) data I need to calculate its cosine similarity with each other element of the dataset. Currently, I do this: cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title] cs_abstract =…
3 votes, 1 answer

Understanding Spark CosineSimilarity output

I am using spark 1.6 cosine similarity (DIMSUM) algorithm. Referring: https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala Here is what I am doing. Input: 50k documents' text with…
MasterGoGo • 98 • 6
3 votes, 2 answers

Right way to calculate the cosine similarity of two word-frequency-dictionaries in python?

I'm trying to iterate through a file containing text and calculate the cosine similarity between the current line and a query the user raised. I have already tokenized the query and the line and saved the union of their words into a…
lvcasco • 45 • 1 • 8
3 votes, 2 answers

Alternatives to TF-IDF and Cosine Similarity (comparing documents with different formats)

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows: 1)…
3 votes, 1 answer

Calculate cosine similarity between words

If we have two lists of strings: A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split() B = "bank, weather, sun, moon, fun, hi".split(",") The words in list A constitute my word vector basis. How can I calculate the…
JohnD • 201 • 2 • 9
3 votes, 1 answer

Pandas: Apply function over each pair of columns under constraints

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form: Code | 14 | 17 | 19 | ... w1 | 0 | 5 | 3 | ... w2 | 2 …
Gonzalo Donoso • 657 • 1 • 6 • 17
3 votes, 1 answer

Pairwise distance python (one base vector against many others)

I have a base vector (consisting of 1's and 0's) and I want to find the cosine distance to 50,000 other vectors (also consisting of 1's and 0's). I found many ways to calculate an entire matrix of pairwise distance, but I'm not interested in that.…
Green • 393 • 1 • 14
3 votes, 1 answer

How to handle negative values of cosine similarities

I computed the tf-idf of my documents based on terms. Then, I applied LSA to reduce the dimensionality of the terms. 'similarity_dist' contains values which are negative (see table below). How can I compute cosine distance with the range…
kitchenprinzessin • 1,023 • 3 • 14 • 30