Building a word thesaurus in Python

Question

I have a list of words entered by my users. After some cleanup (to correct spelling mistakes), I have the following list, where each row is a string and the number of times that string was entered:

Pepsi        500
Coke         358
Dr. pepper   254
Sprite       204
Coca cola    159
7 up         140
Mountain dew 137
Diet coke    58
Mtn. dew     50

Now I would like a script that goes over this list and groups similar words, for example merging Coke, Coca cola and Diet coke into one group (because they are synonyms of Coca cola).

I saw that NLTK's WordNet interface has some similarity functions. Can I use them, or is there a "better" way of approaching this problem?

It's not a simple thing to do. First of all, you need to tokenize bigrams properly, e.g. "diet coke" must be treated as a single token, not as two separate words. Then you will probably need to train a skip-gram model (e.g. word2vec) on a large corpus to get a measure of semantic similarity for your tokens. And your question is clearly off-topic and/or too broad for this site. – Eli Korvigo, Dec 15 '16 at 15:56
Your best bet would be to either hardcode the similarities or implement a simple neural network that you train on examples of similar words. – Jacob G., Dec 15 '16 at 16:01
Looks like the multi-word terms are already available as such, so nothing to worry about there. But yeah, there are various measures of word similarity, and it all depends on your purposes. Since all your examples are soft drinks, I don't see how you expect any distributional algorithm to distinguish between them. You'd need a product information database for this. (Also I wouldn't call "Diet coke" a synonym of "Coca cola".) – alexis, Dec 15 '16 at 20:37
There might be false friends too. The word "coke" is ambiguous by itself, and "sprite" could also refer to the fairy of fantasy fiction. You may also run into a single sense/referent with many surface forms, e.g. "Mtn dew" vs "mountain dew", "pepsi" vs "pepsi cola". – alvas, Dec 16 '16 at 06:43
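
As a rough illustration of the "hardcode the similarities" suggestion from the comments, here is a minimal sketch that merges the counts from the question through a hand-maintained alias table. The table contents, the `normalize` helper and the `group_counts` function are hypothetical names made up for this example; they do not come from NLTK or any other library. Whether "Diet coke" belongs under "Coca cola" is a judgment call (one commenter disagrees), and the alias table is where that decision would be made.

```python
from collections import defaultdict

# Counts copied from the question.
counts = {
    "Pepsi": 500,
    "Coke": 358,
    "Dr. pepper": 254,
    "Sprite": 204,
    "Coca cola": 159,
    "7 up": 140,
    "Mountain dew": 137,
    "Diet coke": 58,
    "Mtn. dew": 50,
}

# Hypothetical, hand-maintained alias table:
# normalized variant -> normalized canonical name.
ALIASES = {
    "coke": "coca cola",
    "diet coke": "coca cola",  # debatable; drop this entry to keep it separate
    "mtn dew": "mountain dew",
}


def normalize(term):
    """Lowercase, replace periods with spaces, and collapse whitespace."""
    return " ".join(term.lower().replace(".", " ").split())


def group_counts(counts, aliases):
    """Sum counts per canonical name and remember the original spellings."""
    groups = defaultdict(lambda: {"total": 0, "members": []})
    for term, n in counts.items():
        key = normalize(term)
        canonical = aliases.get(key, key)  # unknown terms form their own group
        groups[canonical]["total"] += n
        groups[canonical]["members"].append(term)
    return dict(groups)


groups = group_counts(counts, ALIASES)

# Print the groups, largest total first.
for name, info in sorted(groups.items(), key=lambda kv: -kv[1]["total"]):
    print(name, info["total"], info["members"])
```

This only covers variants that someone has listed by hand. It does not attempt the semantic similarity (word2vec, WordNet) discussed above, and it does nothing about ambiguous inputs such as "coke" or "sprite".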

0 Answers