
How can I calculate the cosine semantic similarity between pairs of word documents in R?

Specifically, I have the plot (i.e., descriptions) of movie sequels and their original films and want to see how similar the plot of the sequel is with the original film.

bzh
  • Seems both overly-broad and quite vague. How are you representing plots? What do you mean by two plots being similar? Seems like more of an AI problem than something for which you can get a ready numerical score. In any event, the blog post [Using cosine similarity to build a movie recommendation system](https://towardsdatascience.com/using-cosine-similarity-to-build-a-movie-recommendation-system-ae7f20842599) might give you some ideas. – John Coleman Dec 01 '21 at 18:33
  • Plots are in text form. I simply want to compare the text of the sequel to the corresponding text of the original film. – bzh Dec 01 '21 at 18:47

1 Answer


As a baseline, I would use a bag-of-words approach, first unweighted and then with tf-idf weighting. Once you have your vectors, calculate the cosine similarity between each pair. Here is an sklearn implementation taken from this answer.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from scipy import spatial
import pandas as pd

# Fit the vocabulary on both columns so original and sequel share one feature space.
clf = CountVectorizer(ngram_range=(1, 1))
clf.fit(pd.concat([df.originalplot, df.sequelplot]))

# toarray() gives plain 2-D numpy arrays, so each indexed row is the 1-D vector
# that scipy's cosine distance expects.
originalplot = clf.transform(df.originalplot).toarray()
sequelplot = clf.transform(df.sequelplot).toarray()

# Cosine similarity = 1 - cosine distance, computed row by row.
similarities = [1 - spatial.distance.cosine(originalplot[x], sequelplot[x]) for x in range(len(sequelplot))]
similarities
# Use 'clf = TfidfVectorizer(ngram_range=(1, 1))' at the top for a tf-idf weighted score.
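
For the tf-idf weighted version mentioned in the comment above, here is a minimal sketch (assuming the same df with originalplot and sequelplot columns) that also uses sklearn's vectorized cosine_similarity instead of a Python loop:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Fit tf-idf weights on both columns so they share one vocabulary and one set of idf values.
tfidf = TfidfVectorizer(ngram_range=(1, 1))
tfidf.fit(pd.concat([df.originalplot, df.sequelplot]))
original_tfidf = tfidf.transform(df.originalplot)
sequel_tfidf = tfidf.transform(df.sequelplot)

# cosine_similarity returns the full pairwise matrix; the diagonal pairs
# each sequel with its own original film.
tfidf_similarities = cosine_similarity(original_tfidf, sequel_tfidf).diagonal()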

As a more advanced technique, you can use word embeddings to capture not just exact vocabulary matches but also semantically similar words. There are off-the-shelf word embeddings trained on large corpora; alternatively, you could train embeddings on your own corpus. A sample off-the-shelf implementation in spaCy, again measuring cosine similarity of the vectors:

import spacy

# The medium English model ships with word vectors; Doc.similarity returns the
# cosine similarity of the averaged token vectors.
nlp = spacy.load("en_core_web_md")
df["original_spacy"] = df.originalplot.apply(nlp)
df["sequel_spacy"] = df.sequelplot.apply(nlp)
df["similarity"] = df.apply(lambda row: row.sequel_spacy.similarity(row.original_spacy), axis=1)

Note that all the above code is a starting point (and could be optimized if you care about speed). You will likely want to refine it and add or remove transformations (stop-word removal, stemming, lemmatization) as you play around with your data. Check out this Paul Minogue blog post for a more in-depth explanation of these two approaches. If you want to use R, text2vec should have implementations of all of the above concepts.
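
As a rough sketch of what that kind of preprocessing could look like with spaCy (stop-word removal plus lemmatization, assuming the same df columns as above; the cleaned columns can then be fed into either vectorizer):

import spacy

nlp = spacy.load("en_core_web_md")

def preprocess(text):
    doc = nlp(text)
    # Keep lowercased lemmas of alphabetic, non-stop-word tokens.
    return " ".join(tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop)

df["originalplot_clean"] = df.originalplot.apply(preprocess)
df["sequelplot_clean"] = df.sequelplot.apply(preprocess)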

tbrk
  • This definitely helps. However, I am having some trouble applying my pre-processed tokenized words to the code you provided. After cleaning and tokenizing the two columns of text, I have a dataframe with the columns "data['stem_plot']" and "data['stem_prev']" in token form. How can I vectorize these so I can plug them into your cosine similarity code? – bzh Dec 10 '21 at 15:38
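
One possible sketch of how pre-tokenized columns like those in the comment above (the data['stem_plot'] and data['stem_prev'] names are taken from that comment) could be fed into the CountVectorizer code: either join the tokens back into whitespace-separated strings, or bypass sklearn's own tokenization with an identity analyzer:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Option 1: rebuild plain strings and reuse the earlier code unchanged.
data["stem_plot_text"] = data["stem_plot"].apply(" ".join)
data["stem_prev_text"] = data["stem_prev"].apply(" ".join)

# Option 2: pass the token lists straight through, skipping sklearn's tokenizer.
clf = CountVectorizer(analyzer=lambda tokens: tokens)
clf.fit(pd.concat([data["stem_plot"], data["stem_prev"]]))
stem_plot_vec = clf.transform(data["stem_plot"]).toarray()
stem_prev_vec = clf.transform(data["stem_prev"]).toarray()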