
Below is my input DataFrame.

id  description
1   **must watch avoid** **good acting**
2   average movie bad acting
3   good movie **acting good**
4   pathetic avoid
5   **avoid watch must**

I want to extract n-grams (bigrams, trigrams and 4-grams) built from the frequently used words in these phrases. After tokenizing the phrases into words, can n-grams be found even when the frequently used words appear in a different order? For example, if the first phrase contains "good movie" and the second phrase contains "movie good", can both be counted as the bigram "good movie"? A sample of the output I'm expecting is shown below:

ngram              frequency
must watch            2
acting good           2
must watch avoid      2
average               1

As we can see, the first sentence contains "must watch" and the last sentence contains "watch must", i.e. the same frequent words in a different order. Both should be counted as the bigram "must watch", giving it a frequency of 2.

How can I implement this in Python with a pandas DataFrame? Any help is greatly appreciated.

Thanks!

Ash
  • when you say `dataframe` do you mean a `string`? – Binyamin Even Jan 18 '18 at 22:28
  • @Binyamin Even: I have a DataFrame with mixed dtypes: the first column, id, is of int dtype, and the second column, description, is of object dtype. The description values are a mixture of numbers and strings, and I need to extract the n-grams from them. – Ash Jan 18 '18 at 22:32
  • Try formatting your dataframe as code in the question to make it readable. – Kyle Jan 18 '18 at 22:45

1 Answer

import pandas as pd
from collections import Counter
from itertools import chain

data = [
    {"sentence": "Run with dogs, or shoes, or dogs and shoes"},
    {"sentence": "Run without dogs, or without shoes, or without dogs or shoes"},
    {"sentence": "Hold this while I finish writing the python script"},
    {"sentence": "Is this python script written yet, hey, hold this"},
    {"sentence": "Can dogs write python, or a python script?"},
]

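def find_ngrams(input_list, n):
    # Zip n staggered copies of the list so each tuple holds n consecutive tokens.
    return list(zip(*[input_list[i:] for i in range(n)]))

df = pd.DataFrame.from_records(data)
# Splitting on single spaces keeps punctuation attached to tokens (e.g. "dogs,").
df['bigrams'] = df['sentence'].map(lambda x: find_ngrams(x.split(" "), 2))
df.head()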
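
As a quick illustration of what the helper returns, here is a minimal sketch that runs `find_ngrams` on one tokenized phrase from the question. Changing `n` gives trigrams and 4-grams from the same function. The variable names `tokens`, `bigrams`, `trigrams` and `fourgrams` are only for this example.

```python
def find_ngrams(input_list, n):
    # Zip n staggered copies of the list so each tuple holds n consecutive tokens.
    return list(zip(*[input_list[i:] for i in range(n)]))

# One phrase from the question, split on whitespace.
tokens = "must watch avoid good acting".split()

bigrams = find_ngrams(tokens, 2)    # pairs of adjacent words
trigrams = find_ngrams(tokens, 3)   # triples of adjacent words
fourgrams = find_ngrams(tokens, 4)  # runs of four adjacent words

print(bigrams)
print(trigrams)
print(fourgrams)
```

A list of 5 tokens yields 4 bigrams, 3 trigrams and 2 four-grams, since zip stops at the shortest of the staggered copies.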

Now onto the frequency counts:

# Bigram Frequency Counts
bigrams = df['bigrams'].tolist()
bigrams = list(chain(*bigrams))
bigrams = [(x.lower(), y.lower()) for x,y in bigrams]

bigram_counts = Counter(bigrams)
bigram_counts.most_common(10)

 [(('dogs,', 'or'), 2),
 (('shoes,', 'or'), 2),
 (('or', 'without'), 2),
 (('hold', 'this'), 2),
 (('python', 'script'), 2),
 (('run', 'with'), 1),
 (('with', 'dogs,'), 1),
 (('or', 'shoes,'), 1),
 (('or', 'dogs'), 1),
 (('dogs', 'and'), 1)]
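
The counts above treat ("hold", "this") and ("this", "hold") as different bigrams. One possible way to ignore word order, as the question asks, is to sort the words inside each n-gram before counting. The sketch below is an assumption about what "same n-gram" should mean, not part of the code above: it uses the question's sample descriptions as a plain list, and the helper name `unordered_ngram_counts` is made up for this example. The same function could be applied to a DataFrame column by passing `df['description']` as `texts`.

```python
from collections import Counter

# Sample descriptions from the question (bold markers removed).
descriptions = [
    "must watch avoid good acting",
    "average movie bad acting",
    "good movie acting good",
    "pathetic avoid",
    "avoid watch must",
]

def find_ngrams(tokens, n):
    # Zip n staggered copies of the list so each tuple holds n consecutive tokens.
    return list(zip(*[tokens[i:] for i in range(n)]))

def unordered_ngram_counts(texts, n):
    # Hypothetical helper: counts n-grams while ignoring word order
    # by using the alphabetically sorted tuple as the key.
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for gram in find_ngrams(tokens, n):
            counts[tuple(sorted(gram))] += 1
    return counts

bigram_counts = unordered_ngram_counts(descriptions, 2)
trigram_counts = unordered_ngram_counts(descriptions, 3)

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(1))
```

Because the key is the sorted tuple, "must watch avoid" and "avoid watch must" are both stored as ('avoid', 'must', 'watch'), so the displayed word order is alphabetical rather than the order of first appearance. This still only looks at adjacent words; words that co-occur in a phrase but are not next to each other are not grouped.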
jrjames83