Questions tagged [cosine-similarity]

Cosine similarity measures how similar two vectors of an inner product space are by taking the cosine of the angle between them. It is popular because it is simply a normalized dot product of the two vectors, which takes only a few basic arithmetic operations to compute.

From Wikipedia:

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90 degrees have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is a popular similarity measure between two vectors a and b because it can be computed efficiently by dividing the dot product of the two vectors by the product of their Euclidean norms (each norm being the square root of the sum of the squared terms). For instance, the vectors (0, 3, 4) and (-3, 4, 0) have dot product 12 and each has norm 5, so their cosine similarity is 12/(5*5) = 0.48.
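The calculation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not taken from the Wikipedia text: the function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors.

    Hypothetical helper for illustration; raising ValueError for a zero
    vector is an assumption, since the measure is undefined in that case.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Dot product: sum of the element-wise products.
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms: square root of the sum of the squared terms.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# The worked example from the text: dot product 12, both norms 5.
print(cosine_similarity((0, 3, 4), (-3, 4, 0)))
```

For large collections of vectors, a vectorized library routine would normally be preferable to a pure-Python loop like this one.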

1004 questions
0 votes
1 answer

Cosine similarity for already known pairs of duplicates

I have a list of duplicate document pairs saved in a CSV file. Each ID in column 1 is a duplicate of the corresponding ID in column 2. The file looks something like this: Document_ID1 Document_ID2 12345 87565 34546 …
Minu
0 votes
1 answer

Why is the cosine_similarity of a pretrained fastText model high between two sentences that are not related at all?

I am wondering why the pre-trained fastText model for Korean Wikipedia does not seem to work well! :( model = fasttext.load_model("./fasttext/wiki.ko.bin") model.cosine_similarity("테스트 테스트 이건 테스트 문장", "지금 아무 관계 없는 글 정말로 정말로") (in…
DSDS
0 votes
0 answers

Spark MLlib Scala - Creating a RowMatrix from a MovieLens-like dataset

I am trying to implement cosine similarity to calculate item-item similarity using an input dataset that looks like this: UserID, ProductID, Transactions, where UserID and ProductID are Long values and Transactions is an Integer. I am following this…
0 votes
0 answers

Cosine similarity robust to shifts

Is there a generalization of cosine similarity that is robust to shifts across the compared vectors? E.g. a metric assigning high similarity to the following vectors: [0,1,1,1,2,2,0,0] [1,1,1,2,2,0,0,0]
Dion
0 votes
0 answers

Similar Users in MovieLens Data

I am trying to find similar users in the MovieLens data using NumPy in Python so that all calculations are fast. However, I am not able to write the final code that finds the similarity using matrix multiplications etc. Here is the code: import pandas as…
Manish Kumar
0 votes
1 answer

Cosine similarity between any two sentences is always 0.99

I downloaded the Stack Overflow dump (a 10 GB file) and ran word2vec on it to get vector representations for programming terms (I need them for a project I'm working on). The code follows: from gensim.models import…
morghulis
0 votes
0 answers

How to compute cosine similarity between words for a large DocumentTermMatrix

I have a large TDM for which I need the cosine similarity of every term with every other term. Standard procedures are not helping, as I am getting the following error. Error: cannot allocate vector of size 1162.4 Gb Since I am a novice with…
NinjaR
0 votes
1 answer

Write a custom kernel for svm in R

I'm looking to use the svm() function of the e1071 package in R. I am new to this package and I was wondering if it is possible to write your own custom kernel callable in svm(). I see that there are several kernels pre-loaded, but I don't see a…
user162381
0 votes
1 answer

How to apply content-based filtering in Neo4j

I have data in the format below, where the first column represents the product nodes and all the following columns represent properties of the products. I want to apply a content-based filtering algorithm using cosine similarity in Neo4j. For that, I believe, I need…
Amar jaiswal
0 votes
1 answer

Similarity Metrics

I am researching different metrics and have found many similarity metrics: Euclidean distance, Dynamic Time Warping, Edit Distance with Real Penalty, DISSIM, Sequence Weighted Alignment model, Spatial Assembling Distance. However I had a…
0 votes
1 answer

Calculating cosine similarity from file vectors in Python

I would like to calculate the cosine similarity between two vectors stored in a file in the following format: first_vector 1 2 3 second_vector 1 3 5 ... simply the name of the vector and then its elements, separated by a single space. I have defined a…
Programmer
0 votes
1 answer

How is cosine similarity used with the K-means algorithm?

Given three text document vectors of different lengths in the vector space model (VSM), where the entries are the tf-idf weights of terms: Q1: How does k-means use cosine similarity, and how are the clusters then constructed? Q2: When I use the TF-IDF algorithm, it produces a…
0 votes
1 answer

Long-running spark-submit job

I am trying to run a script using spark-submit like this: spark-submit -v \ --master yarn \ --num-executors 80 \ --driver-memory 10g \ --executor-memory 10g \ --executor-cores 5 \ --class cosineSimillarity jobs-1.0.jar This script is implementing…
0 votes
1 answer

Computing cosine similarity using Python

I have written the following code to compute the cosine similarity between a number of preprocessed documents (stop word removal, stemming and term frequency-inverse document frequency). print(X.shape) similarity = [] for each in X: …
user7347576
0 votes
1 answer

Tf-Idf calculation for two corpuses

I have two corpuses (Corpus 1 & Corpus 2); documents in Corpus 1 contain sentences plagiarized from Corpus 2. I'm using the TF-IDF approach to measure the similarity between documents in Corpus 1 and documents in Corpus 2. An inverted index for terms in…
Minions