0

How can I calculate the semantic similarity (similarity of meaning) between two strings?

For example, given two strings like "Display" and "Screen", the similarity should be close to 100%.

Given "Display" and "Color", the similarity should be close to 0%.

I'm writing my script in Python. Is there a library or framework that does this kind of thing? Alternatively, can someone suggest a good approach?

Usi
  • @DTing: a distance is the opposite of a similarity, and furthermore this is about semantic similarity. WordNet is probably a good place to start. – Willem Van Onsem Jul 29 '15 at 02:36
  • I see. Misread the question. – dting Jul 29 '15 at 02:39
  • You would need a semantic net or database correlating the meanings of a relatively large number of words. It could then be queried to find the similarity of two input words. Its operation would use the transitivity of similarity to compute the similarity of pairs that are not yet stored. –  Jul 29 '15 at 03:42

6 Answers

3

Based on your examples, I think you are looking for semantic similarity. You can compute this with WordNet, for instance, but you will have to specify that you are working with nouns and possibly iterate over the different meanings of each word. The link shows two examples that calculate the similarity according to various implementations.
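To make the idea concrete, here is a minimal sketch of a path-based similarity over an is-a hierarchy, which is the kind of measure WordNet-based tools offer (NLTK's `path_similarity`, for example, scores two senses by the shortest path between them). The tiny taxonomy below is hand-written and hypothetical; it is not real WordNet data, and the concept names are assumptions chosen only for illustration.

```python
# Toy is-a taxonomy (hypothetical, hand-written; real WordNet is far larger).
# Each concept maps to its parent (hypernym); the root maps to None.
PARENT = {
    "display": "output_device",
    "screen": "output_device",
    "output_device": "device",
    "device": "entity",
    "color": "visual_property",
    "visual_property": "property",
    "property": "entity",
    "entity": None,
}


def path_to_root(concept):
    """Return the list of concepts from `concept` up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def path_similarity(a, b):
    """Return 1 / (1 + number of edges on the shortest path between a and b)."""
    # Number of upward steps from `a` to each of its ancestors (and itself).
    steps_from_a = {node: i for i, node in enumerate(path_to_root(a))}
    for steps_from_b, node in enumerate(path_to_root(b)):
        if node in steps_from_a:  # first (lowest) common ancestor
            return 1.0 / (1 + steps_from_a[node] + steps_from_b)
    return 0.0  # no common ancestor


print(path_similarity("display", "screen"))  # siblings: short path, higher score
print(path_similarity("display", "color"))   # only meet at the root: lower score
```

With real WordNet data a word usually has several senses, so you would typically compute this for every pair of noun senses and keep the best score.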

Most implementations are however computationally expensive: they make use of a large amount of text to calculate how often two words are close to each other, etc.

Willem Van Onsem
2

What you're trying to solve is an NLP problem, which can be a hassle if you're not familiar with the field. The most popular library out there is NLTK, which has a lot of AI tools. A quick Google search for what you're looking for turns up its chapter on the logic of semantics: http://www.nltk.org/book/ch10.html

This is a computationally heavy process, since it involves loading a dictionary of the entire English language. If you have a small subset of examples, you might be better off creating a mapping yourself.
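As a rough sketch of the do-it-yourself mapping, assuming your vocabulary is small enough to group by hand: the word groups and the function name `similarity` below are hypothetical, and the all-or-nothing score is a simplification.

```python
# Hypothetical hand-written groups of words treated as meaning the same thing.
SYNONYM_GROUPS = [
    {"display", "screen", "monitor"},
    {"color", "colour", "hue"},
]

# Map each word to the index of its group for fast lookup.
WORD_TO_GROUP = {
    word: index
    for index, group in enumerate(SYNONYM_GROUPS)
    for word in group
}


def similarity(a, b):
    """Return 1.0 if both words belong to the same group, else 0.0."""
    group_a = WORD_TO_GROUP.get(a.lower())
    group_b = WORD_TO_GROUP.get(b.lower())
    if group_a is None or group_b is None:
        return 0.0  # unknown words are treated as unrelated
    return 1.0 if group_a == group_b else 0.0


print(similarity("Display", "Screen"))  # 1.0
print(similarity("Display", "Color"))   # 0.0
```

This avoids loading any large resource, at the cost of knowing nothing about words you did not list.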

Ivan Peng
1

I am not good at NLP, but I think the Levenshtein distance algorithm can help you solve this problem, because I use it to calculate the similarity between two strings, and the performance is not bad. My C++ code is at the link; maybe you can port it to Python. I will post the Python code later. If you understand dynamic programming, I think you can understand it. enter link description here
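For reference, a minimal Python sketch of the standard dynamic-programming Levenshtein distance follows. It is not a port of the linked C++ code; the helper `edit_similarity`, which normalizes the distance to a 0..1 score, is an assumption added for illustration. Note that it measures how close two spellings are, not how close two meanings are.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Hypothetical helper: scale the distance to a score between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```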

GoingMyWay
  • I think the OP is looking for *semantic* similarity, not *syntactic* similarity. The Levenshtein distance only calculates the minimum number of insertions, deletions and substitutions needed to turn one string into another. – Willem Van Onsem Jul 29 '15 at 13:23
0

Have a look at the following libraries:

NIlesh Sharma
0

Check out word2vec as implemented in the Gensim library. One of its features is to compute word similarity.

https://radimrehurek.com/gensim/models/word2vec.html

More details and demos can be found here.

I believe this is the state of the art right now.

lightalchemist
0

As another user suggested, the Gensim library can do this using the word2vec technique. The example below is modified from this blog post. You can run it in Google Colab.

Google Colab comes with the Gensim package installed. We can import the part of it we require:

from gensim.models import KeyedVectors

We will download word vectors pretrained on Google News and load them:

!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)

This gives us a measure of similarity between any two words. From your examples:

word_vectors.similarity('display', 'color')
# 0.3068566

word_vectors.similarity('display', 'screen')
# 0.32314363

Comparing the two scores shows that the words display and screen are slightly more similar than display and color are.
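The score returned by `similarity` is the cosine similarity between the two word vectors. As a minimal sketch of that calculation, here it is in plain Python; the three-dimensional vectors are made-up toy values (the real Google News vectors have 300 dimensions), so the printed numbers will not match the ones above.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen only to illustrate the calculation.
toy_vectors = {
    "display": [0.9, 0.1, 0.3],
    "screen": [0.8, 0.2, 0.4],
    "color": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["display"], toy_vectors["screen"]))
print(cosine_similarity(toy_vectors["display"], toy_vectors["color"]))
```

Because the score depends on the angle between the vectors rather than their length, it ranges from -1 to 1, and only relative comparisons between pairs are really meaningful.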

John Skiles Skinner