Below is my input DataFrame:
id description
1 **must watch avoid** **good acting**
2 average movie bad acting
3 good movie **acting good**
4 pathetic avoid
5 **avoid watch must**
I want to extract n-grams (bigrams, trigrams, and 4-grams) from the frequently used words in the phrases. After tokenizing each phrase into words, can n-grams be found even when the frequently used words appear in a different order? For example, if one phrase contains "good movie" and another contains "movie good", can both be counted as the bigram "good movie"? A sample of what I'm expecting is shown below:
ngram frequency
must watch 2
acting good 2
must watch avoid 2
average 1
As you can see, the first phrase contains "must watch" and the last phrase contains "watch must", i.e. the same frequent words in a different order. Both should count toward the bigram "must watch", giving it a frequency of 2.
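To make the requirement concrete, here is a rough sketch of the counting I have in mind, not a finished solution. It assumes that an n-gram is a contiguous window of words, that word order inside the window is ignored (by sorting the words to form a key), and that no filtering for "frequently used" words is applied yet. The names `count_unordered_ngrams`, `sizes`, and `first_seen` are made up for illustration.

```python
from collections import Counter

def count_unordered_ngrams(phrases, sizes=(2, 3, 4)):
    """Count contiguous n-grams, ignoring word order inside each n-gram."""
    counts = Counter()
    first_seen = {}  # sorted-word key -> word order in which it first appeared
    for phrase in phrases:
        tokens = phrase.lower().split()
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                window = tokens[i:i + n]
                # "must watch" and "watch must" map to the same key
                key = tuple(sorted(window))
                first_seen.setdefault(key, " ".join(window))
                counts[key] += 1
    # Report each n-gram under the word order it was first seen in
    return {first_seen[key]: freq for key, freq in counts.items()}

# Same phrases as the description column above, emphasis markers removed
phrases = [
    "must watch avoid good acting",
    "average movie bad acting",
    "good movie acting good",
    "pathetic avoid",
    "avoid watch must",
]

result = count_unordered_ngrams(phrases)
for ngram, freq in sorted(result.items(), key=lambda kv: -kv[1]):
    print(ngram, freq)
```

With this sketch, "must watch" and "must watch avoid" each come out with a frequency of 2. Passing `sizes=(1, 2, 3, 4)` would also count single words such as "average". With a DataFrame, I assume the description column could be passed in directly, since the function only needs an iterable of strings.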
In short, I need to extract n-grams (bigrams and longer) from the frequently used words in the phrases, regardless of word order.
How can I implement this with a DataFrame in Python? Any help is greatly appreciated.
Thanks!