I need to develop clusters for similar words based on the meaning of the word. For example, I want "apple" in the same cluster as "fruit," "banana," "honeycrisp."

Is there some lexicon package that has something like this in place, or would it be up to me to create my own clusters?

albin45

2 Answers

Google's Natural Language Processing API might help. Here is a link: https://cloud.google.com/natural-language/. There is an option to demo it right on the site, so you can see whether it is what you are looking for.

marsnebulasoup

Several pretrained models are available for download that provide vector representations of words (word embeddings). A popular choice is Google's pretrained 300-dimensional Word2Vec model, which can be downloaded from:

https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

and loaded (after unzipping) with:

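from gensim.models import KeyedVectors
# Recent gensim versions load word2vec-format files through KeyedVectors, not Word2Vec.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)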

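This model is quite large, but it gives you what you need: words with similar meanings have nearby vectors, so you can group words by the distance (for example, cosine similarity) between their vectors. The model does not ship with ready-made clusters, so the grouping step is up to you. If you are interested in only a subset of words, I suggest that you extract only those from the model and store their coordinates in a DataFrame for later (and faster) use.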
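To make the subset idea concrete, here is a minimal sketch in plain Python. It is not part of gensim: `extract_vectors`, `cosine`, and `cluster_words` are hypothetical helper names, the 0.7 threshold is an arbitrary assumption, and the toy 3-dimensional vectors are made-up stand-ins for the real 300-dimensional ones. The greedy grouping is just one simple option; k-means or hierarchical clustering would be reasonable alternatives.

```python
import math

def extract_vectors(model, words):
    """Return {word: vector} for the words found in `model`, skipping the rest.

    `model` can be any mapping that supports `word in model` and `model[word]`.
    """
    return {w: list(model[w]) for w in words if w in model}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_words(vectors, threshold=0.7):
    """Greedy single-pass clustering (an illustrative choice, not the only one).

    Each word joins the first cluster whose seed (first member) has cosine
    similarity >= threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Made-up toy vectors standing in for the real pretrained model.
toy_model = {
    "apple": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "fruit": [0.85, 0.15, 0.05],
    "car": [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.95],
}

wanted = ["apple", "banana", "fruit", "car", "truck", "honeycrisp"]
subset = extract_vectors(toy_model, wanted)  # "honeycrisp" is absent, so it is skipped
clusters = cluster_words(subset)
print(clusters)
```

With the real model loaded, passing `model` in place of `toy_model` should work the same way, since gensim's KeyedVectors supports both `word in model` and `model[word]`. The resulting dictionary can be turned into a DataFrame with `pandas.DataFrame.from_dict(subset, orient="index")` and saved for reuse.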

For other models see the following:

  1. https://fasttext.cc/docs/en/crawl-vectors.html
  2. https://nlp.stanford.edu/projects/glove/
  3. https://fasttext.cc/
Joe B