
I'm trying to calculate book similarity by comparing their topic lists.

I need to get a similarity score between 0 and 1 for the two lists.

Example:

book1_topics = ["god", "bible", "book", "holy", "religion", "Christian"]

book2_topics = ["god", "Christ", "idol", "Jesus"]

I tried using WordNet but I'm not sure how to calculate the score.

Any suggestions?

– Sapir
  • I suggest you look at [this](https://stackoverflow.com/questions/52113939/spacy-strange-similarity-between-two-sentences) discussion – SilentCloud Apr 02 '21 at 12:46
  • It would be nice if, in your question, you told us how you are comparing them. What makes them similar? – program.exe Apr 02 '21 at 12:49
  • To complete my previous comment: I see now that you want to calculate similarity by topics and not by words, so maybe the discussion I suggested is not on point, my bad – SilentCloud Apr 02 '21 at 12:52

3 Answers


I would suggest using spaCy, a Python NLP library:

import spacy

book1_topics = ['god', 'bible', 'book', 'holy', 'religion', 'Christian']
book2_topics = ['god', 'Christ', 'idol', 'Jesus']

# en_core_web_md ships with word vectors, which similarity() relies on
nlp = spacy.load('en_core_web_md')
doc1 = nlp(' '.join(book1_topics))
doc2 = nlp(' '.join(book2_topics))

# cosine similarity between the two docs' averaged word vectors
print(doc1.similarity(doc2))

Output:

0.822639616995468
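
If you would rather compare the topic words individually than the two joined strings, a variation on the same API (a sketch reusing doc1 and doc2 from above) is to average the pairwise token similarities:

# average pairwise similarity between individual topic tokens
sims = [t1.similarity(t2) for t1 in doc1 for t2 in doc2]
print(sum(sims) / len(sims))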

Note

You might need to install spaCy first:

pip3 install spacy

and the model:

python3 -m spacy download en_core_web_md
– Rostan

In addition to spaCy, I would also suggest the Jaccard similarity index if all you're looking for is lexical overlap/similarity.

You're going to need to install NLTK.
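
If you don't have it yet, it installs the same way as spaCy above:

pip3 install nltk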

from nltk.util import ngrams

def jaccard_similarity(str1, str2, n):
    # character n-grams of each string
    str1_ngrams = set(ngrams(str1, n))
    str2_ngrams = set(ngrams(str2, n))

    # |intersection| / |union|: 1.0 for identical sets, 0.0 for disjoint ones
    intersection = len(str1_ngrams & str2_ngrams)
    union = len(str1_ngrams | str2_ngrams)

    return intersection / union

In the above function you can choose n (the "n" in n-gram) to be whatever you want. I usually use n=2 for bigram Jaccard similarity, but it's up to you.
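
For instance, with n=2, "god" and "good" share the character bigrams "go" and "od" out of three distinct bigrams in total, so the score is 2/3:

>>> jaccard_similarity('god', 'good', 2)
0.6666666666666666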

Now to apply that to your example, I'd personally calculate the bigram Jaccard similarity for each pair of words across the two lists and average those values (assuming you have the jaccard_similarity function defined above):

>>> from itertools import product
>>> book1_topics = ["god", "bible", "book", "holy", "religion", "Christian"]
>>> book2_topics = ["god", "Christ", "idol", "Jesus"]
>>> pairs = list(product(book1_topics, book2_topics))
>>> similarities = [jaccard_similarity(str1, str2, 2) for str1, str2 in pairs]
>>> avg_similarity = sum(similarities) / len(similarities)
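
If you want this as a reusable scorer, the same logic fits in a small wrapper (topic_similarity is a hypothetical name, not part of NLTK):

def topic_similarity(topics1, topics2, n=2):
    # average pairwise n-gram Jaccard similarity over all topic pairs
    pairs = list(product(topics1, topics2))
    return sum(jaccard_similarity(a, b, n) for a, b in pairs) / len(pairs)
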
– Sean

This might be a good approximation if the set of topics is not big. Otherwise I would look at models like Word2Vec and its successors.
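
For example, a minimal sketch with gensim's pretrained GloVe vectors (assuming gensim is installed and every topic, lowercased, is in the model's vocabulary):

import gensim.downloader as api

# downloads the 50-dimensional GloVe vectors on first use
model = api.load('glove-wiki-gigaword-50')

# GloVe's vocabulary is lowercase, so lowercase the topics first
book1_topics = ['god', 'bible', 'book', 'holy', 'religion', 'christian']
book2_topics = ['god', 'christ', 'idol', 'jesus']

# cosine similarity between the mean vectors of the two topic lists
print(model.n_similarity(book1_topics, book2_topics))

Note that n_similarity returns a cosine score that can in principle be slightly negative, so clip it at 0 if you need a strict 0-1 range.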

– arstep