I'm very new to NLP, so I have a theoretical question.
Say I have the following Spark DataFrame:

+--+------------------------------------------+
|id|                                 word_list|
+--+------------------------------------------+
| 1| apple, banana, lime, juice, cherry, peach|
| 2| sauce, cabbage, cucumber, tomatoes, pesto|
| 3|  cocoa, coffee, bottle, tea, water, juice|
+--+------------------------------------------+

For each id, I need to extract one generic word that describes the predominant group of semantically similar words in the word_list column. Desired output:

+--+------------------------------------------+----------+
|id|                                 word_list|  category|
+--+------------------------------------------+----------+
| 1| apple, banana, lime, juice, cherry, peach|     fruit|
| 2| sauce, cabbage, cucumber, tomatoes, pesto|vegetables|
| 3|  cocoa, coffee, bottle, tea, water, juice| beverages|
+--+------------------------------------------+----------+  

Is there an unsupervised NLP algorithm that can produce the desired output?

Hilary
    This is more of a design question, so it would be better to ask on https://datascience.stackexchange.com/. There's no standard algorithm that does this, but it can be done using some measure of semantic similarity. Based on the example, [WordNet](https://wordnet.princeton.edu/) would be a good candidate. Similarity with pretrained word embeddings could also work. – Erwan Jun 06 '22 at 09:52
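
To make the WordNet suggestion in the comment concrete, here is a minimal sketch of the underlying idea: walk each word up a hypernym ("is a kind of") hierarchy and pick the most specific ancestor shared by more than half of the list. Everything named here is an assumption for illustration: the `HYPERNYMS` dictionary is a hand-written toy taxonomy standing in for real WordNet data, and `ancestors`, `predominant_category`, and `min_share` are hypothetical names, not part of any library.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent) standing in for WordNet
# hypernym links. These entries are illustrative assumptions, not real
# WordNet data.
HYPERNYMS = {
    "apple": "fruit", "banana": "fruit", "lime": "fruit",
    "cherry": "fruit", "peach": "fruit",
    "cabbage": "vegetables", "cucumber": "vegetables", "tomatoes": "vegetables",
    "pesto": "sauce", "sauce": "condiment",
    "cocoa": "beverages", "coffee": "beverages", "tea": "beverages",
    "water": "beverages", "juice": "beverages",
    "bottle": "container",
    "fruit": "food", "vegetables": "food",
    "condiment": "food", "beverages": "food",
}


def ancestors(term):
    """Return every term above `term` in the toy taxonomy, nearest first."""
    chain = []
    while term in HYPERNYMS:
        term = HYPERNYMS[term]
        chain.append(term)
    return chain


def predominant_category(word_list, min_share=0.5):
    """Pick the most specific ancestor covering more than `min_share` of the words.

    `word_list` is a comma-separated string, as in the question's column.
    Returns None when no ancestor covers enough of the words.
    """
    words = [w.strip().lower() for w in word_list.split(",") if w.strip()]
    coverage = Counter()
    for word in words:
        # Count each ancestor once per word.
        coverage.update(set(ancestors(word)))
    candidates = [t for t, n in coverage.items() if n > min_share * len(words)]
    if not candidates:
        return None
    # Prefer the deepest (most specific) candidate, then the widest coverage.
    return max(candidates, key=lambda t: (len(ancestors(t)), coverage[t]))


rows = {
    1: "apple, banana, lime, juice, cherry, peach",
    2: "sauce, cabbage, cucumber, tomatoes, pesto",
    3: "cocoa, coffee, bottle, tea, water, juice",
}
for row_id, word_list in rows.items():
    print(row_id, predominant_category(word_list))
```

With real data, the toy dictionary would presumably be replaced by lookups against WordNet (for example through NLTK's `wordnet` corpus reader, whose synsets expose `hypernyms()`), which also means dealing with words that have several senses. On a Spark DataFrame, a function like this could be wrapped with `pyspark.sql.functions.udf` and applied to the word_list column.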

0 Answers