For my project I want to compare two sets of keywords that are stored in lists and obtain a similarity index.

An example would look like the following:

db_1: list of 5 keywords
db_2: list of 10 keywords

The data was obtained mostly through web scraping and keyword extraction with rake_nltk, so the keywords don't match exactly. The wording differs even where the keywords have the same meaning.

Is there any way to get a more or less reliable similarity index that shows how similar the entries of db_1 and db_2 are?

Please find an example here: [image]

I tried to calculate the similarity with the spacy library, but I can't import the module because my environment is not compatible with any of the versions I tried to install.

Do you know any alternatives?

  • Do you want to compare the meaning of the keywords in the two lists to obtain a similarity measure, or only the words themselves? – GregoirePelegrin Dec 07 '22 at 14:54
  • I want to compare the meaning, so for example movie and film would get a high similarity index. – codexxblack Dec 07 '22 at 15:16
  • You may want to look into [Gensim](https://radimrehurek.com/gensim/) and especially the [pre-trained models](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html), in particular the `Word2Vec` ones. They are trained on large corpora to try to capture the meaning of many words. If all of your keywords are common words, that may be enough for what you need; if not, you may need to train your own model, which will be significantly harder in my opinion. – GregoirePelegrin Dec 07 '22 at 15:34
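
To make the word-vector suggestion above concrete, here is a minimal sketch of one way to turn per-word vectors into a list-to-list similarity index. It is not a tested recipe for this dataset. The vectors in `TOY_VECTORS` are invented for illustration, and the function names (`phrase_vector`, `directed_score`, `list_similarity`) are hypothetical. The scoring rule (average of each keyword's best cosine match, taken in both directions) is one reasonable choice among several. With a real pre-trained model, the toy dictionary would be replaced by a word-to-vector lookup; Gensim's `KeyedVectors` should fit that role, but that is an assumption I have not verified here.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# In practice these would come from a pre-trained embedding model.
TOY_VECTORS = {
    "movie": [0.90, 0.10, 0.00],
    "film": [0.85, 0.15, 0.05],
    "actor": [0.70, 0.30, 0.10],
    "bank": [0.00, 0.20, 0.90],
    "loan": [0.10, 0.10, 0.95],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def phrase_vector(phrase, vectors):
    """Average the vectors of the known words in a (possibly multi-word) keyword.

    Returns None when no word of the phrase is in the vocabulary.
    """
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def directed_score(source, target, vectors):
    """Mean, over the source keywords, of the best cosine match found in target."""
    target_vecs = [phrase_vector(k, vectors) for k in target]
    target_vecs = [v for v in target_vecs if v is not None]
    scores = []
    for keyword in source:
        vec = phrase_vector(keyword, vectors)
        if vec is None or not target_vecs:
            scores.append(0.0)  # out-of-vocabulary keywords count as "no match"
        else:
            scores.append(max(cosine(vec, t) for t in target_vecs))
    return sum(scores) / len(scores) if scores else 0.0


def list_similarity(list_a, list_b, vectors):
    """Symmetric similarity index: average of the two directed scores."""
    return (directed_score(list_a, list_b, vectors)
            + directed_score(list_b, list_a, vectors)) / 2


if __name__ == "__main__":
    db_1 = ["movie", "actor"]
    db_2 = ["film", "bank loan"]
    print(list_similarity(db_1, db_2, TOY_VECTORS))
```

Averaging in both directions keeps the index from being inflated when a short list happens to be a subset of a much longer one, which matters when the lists have different lengths (5 versus 10 keywords in the question). Keywords with no known words score 0, so a large share of unusual terms will pull the index down; that would be a sign that a general pre-trained model is not enough.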
