I'm very new to NLP, so I have a theoretical question.
Let's say I have the following Spark dataframe:
+--+------------------------------------------+
|id|                                 word_list|
+--+------------------------------------------+
| 1| apple, banana, lime, juice, cherry, peach|
| 2| sauce, cabbage, cucumber, tomatoes, pesto|
| 3|  cocoa, coffee, bottle, tea, water, juice|
+--+------------------------------------------+
For each id, I need to extract a generic word that describes the predominant set of semantically similar words in the word_list column. Desired output:
+--+------------------------------------------+----------+
|id|                                 word_list|  category|
+--+------------------------------------------+----------+
| 1| apple, banana, lime, juice, cherry, peach|     fruit|
| 2| sauce, cabbage, cucumber, tomatoes, pesto|vegetables|
| 3|  cocoa, coffee, bottle, tea, water, juice| beverages|
+--+------------------------------------------+----------+
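To make "predominant" concrete, here is a minimal plain-Python sketch of the behaviour I'm after. The `TOY_LEXICON` mapping and the `predominant_category` function are hypothetical names I made up for illustration; the hand-written lexicon is exactly the part I would like an unsupervised method to replace. The idea is only that each known word votes for a category, words with no category (such as "sauce" or "bottle") are ignored, and the category with the most votes wins.

```python
from collections import Counter

# Hypothetical, hand-made word -> category lookup (illustration only).
# In the real problem this knowledge should come from an unsupervised method.
TOY_LEXICON = {
    "apple": "fruit", "banana": "fruit", "lime": "fruit",
    "cherry": "fruit", "peach": "fruit",
    "cabbage": "vegetables", "cucumber": "vegetables", "tomatoes": "vegetables",
    "cocoa": "beverages", "coffee": "beverages", "tea": "beverages",
    "water": "beverages", "juice": "beverages",
}


def predominant_category(word_list):
    """Return the category that most words in a comma-separated string vote for."""
    words = [w.strip() for w in word_list.split(",")]
    # Words missing from the lexicon simply do not vote.
    votes = Counter(TOY_LEXICON[w] for w in words if w in TOY_LEXICON)
    return votes.most_common(1)[0][0] if votes else None


rows = {
    1: "apple, banana, lime, juice, cherry, peach",
    2: "sauce, cabbage, cucumber, tomatoes, pesto",
    3: "cocoa, coffee, bottle, tea, water, juice",
}
for row_id, word_list in rows.items():
    print(row_id, predominant_category(word_list))
```

Note that row 1 still comes out as "fruit" even though "juice" votes for "beverages": the minority words are outvoted, which is what I mean by the predominant set.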
Is there any unsupervised NLP algorithm that can be used to get the desired output?