3

I have two vectors of dimension 6 and I would like to get a single number between 0 and 1.

a=c("HDa","2Pb","2","BxU","BuQ","Bve")

b=c("HCK","2Pb","2","09","F","G")

Can anyone explain what I should do?

B--rian
Ozgur Alptekın
  • In this case, will `0.667 0.00 0.00 1.00 1.00 1.00` be what you want or is it `0.333 1.00 1.00 0.00 0.00 0.00` ? – etienne Dec 02 '15 at 15:16
  • I just want to see one single probability between 0 and 1. If the relation between the a and b vectors is strong it should be close to 1, and vice versa – Ozgur Alptekın Dec 02 '15 at 15:20

4 Answers

5

Using the `lsa` package, following the example in the package manual:

# create some files
library('lsa')
td = tempfile()
dir.create(td)
write( c("HDa","2Pb","2","BxU","BuQ","Bve"), file=paste(td, "D1", sep="/"))
write( c("HCK","2Pb","2","09","F","G"), file=paste(td, "D2", sep="/"))

# read files into a document-term matrix
myMatrix = textmatrix(td, minWordLength=1)

EDIT: this is what the `myMatrix` object looks like:

myMatrix
#       docs
#  terms D1 D2
#    2    1  1
#    2pb  1  1
#    buq  1  0
#    bve  1  0
#    bxu  1  0
#    hda  1  0
#    09   0  1
#    f    0  1
#    g    0  1
#    hck  0  1

# Calculate cosine similarity
res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
res
#0.3333
user2380782
  • Could you explain your code please? When you compare "HDa" and "HCK", it is not important that both contain the letter "H" – they are completely different. Does your code work like that? – Ozgur Alptekın Dec 02 '15 at 16:10
  • The code is going to create a `textmatrix` document using your input vectors. When you create the `textmatrix`, an index is assigned to each word, i.e., `HDa` is going to be different from `HCK`, see my edit. Then the `cosine` function calculates the cosine similarity between both documents (`a` and `b` in your example) – user2380782 Dec 02 '15 at 16:19
  • 1
    One more thing: I have a matrix with 266 rows and 7 columns. I want to write my own function that takes two inputs, a vector and a product id. As a result I want to see true or false (1 or 0) among the top 8 products most similar to the product id's vector. If you could answer my question I'll be grateful. Many thanks in advance – Ozgur Alptekın Dec 02 '15 at 17:35
  • here is the link http://stackoverflow.com/questions/34062909/how-can-i-built-a-function-that-calculate-cosine-similarity-in-language-r @user2380782 – Ozgur Alptekın Dec 04 '15 at 16:35
1

You need a dictionary of possible terms first, and then convert your vectors to binary vectors, with a 1 in the positions of the corresponding terms and 0 elsewhere. If you name the new vectors `a2` and `b2`, you can calculate the cosine similarity with `cor(a2, b2)`, but notice that the cosine similarity lies between -1 and 1. You could map it to [0,1] with something like: `0.5*cor(a2, b2) + 0.5`
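A minimal base-R sketch of that recipe (variable names are my own), computing the cosine directly instead of through `cor`:

```r
# 1. build a dictionary of all terms appearing in either vector
# 2. encode each vector as a 0/1 indicator over that dictionary
# 3. compute the cosine similarity by hand
a <- c("HDa", "2Pb", "2", "BxU", "BuQ", "Bve")
b <- c("HCK", "2Pb", "2", "09", "F", "G")

dict <- union(a, b)            # all distinct terms
a2 <- as.numeric(dict %in% a)  # 1 where the term occurs in a, else 0
b2 <- as.numeric(dict %in% b)

cosine_sim <- sum(a2 * b2) / (sqrt(sum(a2^2)) * sqrt(sum(b2^2)))
cosine_sim
# 0.3333333
```

Because the indicator vectors are non-negative, this cosine is already in [0, 1], so no rescaling is needed in this particular case.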

Felipe Gerard
  • After having the dictionary created you can use package `lsa` and run `cos` function such as, `cos(a2, b2)` – user2380782 Dec 02 '15 at 15:25
  • This is the right way, but this looks more like a comment rather than an answer because it shows the general way and not a specific solution. Also, `cor(a2, b2, method='pearson')` is (almost) identical to cosine similarity. – LyzandeR Dec 02 '15 at 15:32
  • @user2380782 I think the function is `lsa::cosine` – LyzandeR Dec 02 '15 at 15:32
  • True. I was just hoping to clarify to OP what he really wanted. – Felipe Gerard Dec 02 '15 at 15:47
1
library(tm)
library(lsa)

CSString_vector <- c("Hi Hello", "Hello")
corp <- VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE,
                         wordLengths = c(1, Inf),
                         weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
res <- cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])

This may be the better option for larger data sets.

Chetan
0

A more advanced form of embedding might give you better output. Please check the following code. It uses the Universal Sentence Encoder model, which generates sentence embeddings with a transformer-based architecture.

import numpy as np
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(text):
    return model([text])

paragraph = [
    "Universal Sentence Encoder embeddings also support short paragraphs. ",
    "Universal Sentence Encoder support paragraphs"]

print(np.inner(embed(paragraph[0]), embed(paragraph[1])))
Chetan