In addition to spaCy, I'd suggest the Jaccard similarity index if all you're looking for is lexical overlap/similarity.
You're going to need to install NLTK first (pip install nltk).
from nltk.util import ngrams

def jaccard_similarity(str1, str2, n):
    # Build the set of character n-grams for each string
    # (ngrams over a string yields tuples of characters)
    str1_ngrams = set(ngrams(str1, n))
    str2_ngrams = set(ngrams(str2, n))
    intersection = len(str1_ngrams.intersection(str2_ngrams))
    union = len(str1_ngrams) + len(str2_ngrams) - intersection
    # Avoid division by zero when both strings are shorter than n
    if union == 0:
        return 0.0
    return intersection / union
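As a quick sanity check, here's what the function gives for a couple of pairs from your lists with n=2 (an exact match scores 1.0, and pairs with no shared n-grams score 0.0):

>>> jaccard_similarity("Christian", "Christ", 2)
0.625
>>> jaccard_similarity("god", "book", 2)
0.0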
In the above function you can choose n (the "n" in n-gram) to be whatever you want. One thing to be aware of: passing a string to ngrams gives you character n-grams, so this measures overlap in spelling rather than overlap in word sequences. I usually use n=2, i.e. bigram Jaccard similarity, but it's up to you.
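To see what's actually being compared, here's what the character n-grams of one of your topic words look like for n=2 and n=3:

>>> from nltk.util import ngrams
>>> list(ngrams("bible", 2))
[('b', 'i'), ('i', 'b'), ('b', 'l'), ('l', 'e')]
>>> list(ngrams("bible", 3))
[('b', 'i', 'b'), ('i', 'b', 'l'), ('b', 'l', 'e')]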
Now, to apply that to your example, I'd personally calculate the bigram Jaccard similarity for every pair of words across the two lists and average those values (assuming you have the jaccard_similarity function defined above):
>>> from itertools import product
>>> book1_topics = ["god", "bible", "book", "holy", "religion", "Christian"]
>>> book2_topics = ["god", "Christ", "idol", "Jesus"]
>>> pairs = list(product(book1_topics, book2_topics))
>>> similarities = [jaccard_similarity(str1, str2, 2) for str1, str2 in pairs]
>>> avg_similarity = sum(similarities) / len(similarities)
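If you run this, most of the 24 pairs share no bigrams at all, so the average should come out low (roughly 0.076); it's driven by the exact match "god"/"god" (1.0), the near-match "Christian"/"Christ" (0.625), and the partial overlap "holy"/"idol" (0.2, via the shared bigram "ol"). The closer the average is to 1, the more lexical overlap there is between the two topic lists.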