How to extract all the ngrams from a text dataframe column in different order in a pandas dataframe?

Question

Below is the input Dataframe I have.

id  description
1   **must watch avoid** **good acting**
2   average movie bad acting
3   good movie **acting good**
4   pathetic avoid
5   **avoid watch must**

I want to extract n-grams (bigrams, trigrams, and 4-grams) built from the frequently used words in these phrases. After tokenizing each phrase into words, is it possible to find n-grams even when the frequently used words appear in a different order? For example, if one phrase contains "good movie" and another contains "movie good", both should count toward the bigram "good movie". A sample of the output I'm expecting is shown below:

ngram              frequency
must watch            2
acting good           2
must watch avoid      2
average               1

As we can see, the first phrase contains "must watch" and the last phrase contains "watch must", i.e. the same frequent words in a different order. Both should be counted under the bigram "must watch", giving a frequency of 2.

I need to extract these n-grams from the frequently used words in the phrases.

How can I implement this with a pandas DataFrame in Python? Any help is greatly appreciated.

Thanks!

@Binyamin Even: I have a dataframe with mixed types: id (int dtype) is one column and description (object dtype) is the second. The description column is a mixture of numbers and strings from which I need to extract the n-grams. – Ash, Jan 18 '18 at 22:32
Try formatting your dataframe as code in the question to make it readable. – Kyle, Jan 18 '18 at 22:45

jrjames83 · Answer 1 · 2019-07-22T14:19:15.863

7

import pandas as pd
from collections import Counter
from itertools import chain

data = [
    {"sentence": "Run with dogs, or shoes, or dogs and shoes"},
    {"sentence": "Run without dogs, or without shoes, or without dogs or shoes"},
    {"sentence": "Hold this while I finish writing the python script"},
    {"sentence": "Is this python script written yet, hey, hold this"},
    {"sentence": "Can dogs write python, or a python script?"},
]

def find_ngrams(input_list, n):
    # Zip n staggered copies of the token list to get contiguous n-grams as tuples.
    return list(zip(*[input_list[i:] for i in range(n)]))

df = pd.DataFrame.from_records(data)
df['bigrams'] = df['sentence'].map(lambda x: find_ngrams(x.split(" "), 2))
df.head()

Now on to the frequency counts

# Bigram Frequency Counts
bigrams = df['bigrams'].tolist()
bigrams = list(chain(*bigrams))
bigrams = [(x.lower(), y.lower()) for x,y in bigrams]

bigram_counts = Counter(bigrams)
bigram_counts.most_common(10)

[(('dogs,', 'or'), 2),
 (('shoes,', 'or'), 2),
 (('or', 'without'), 2),
 (('hold', 'this'), 2),
 (('python', 'script'), 2),
 (('run', 'with'), 1),
 (('with', 'dogs,'), 1),
 (('or', 'shoes,'), 1),
 (('or', 'dogs'), 1),
 (('dogs', 'and'), 1)]
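
The counts above treat ('must', 'watch') and ('watch', 'must') as two different bigrams, while the question asks for them to be merged. The following is a minimal sketch of one way to do that, not part of the original answer: it assumes that sorting the words inside each n-gram is an acceptable canonical form, and it still only looks at contiguous words. The helper names `unordered_ngrams` and `unordered_ngram_counts` are hypothetical, and the data is the question's sample with the bold markers removed.

```python
from collections import Counter

import pandas as pd

# The question's sample data (bold markers removed).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "description": [
        "must watch avoid good acting",
        "average movie bad acting",
        "good movie acting good",
        "pathetic avoid",
        "avoid watch must",
    ],
})


def unordered_ngrams(tokens, n):
    """Return contiguous n-grams, with the words inside each one sorted.

    Sorting makes ("watch", "must") and ("must", "watch") the same key.
    """
    return [tuple(sorted(gram)) for gram in zip(*[tokens[i:] for i in range(n)])]


def unordered_ngram_counts(texts, sizes=(2, 3, 4)):
    """Count order-insensitive n-grams of the given sizes over all texts."""
    counts = Counter()
    for text in texts:
        # str() guards against non-string values in an object column.
        tokens = str(text).lower().split()
        for n in sizes:
            counts.update(unordered_ngrams(tokens, n))
    return counts


counts = unordered_ngram_counts(df["description"])

# Turn the counter into an (ngram, frequency) table, most frequent first.
result = pd.DataFrame(
    [(" ".join(gram), freq) for gram, freq in counts.most_common()],
    columns=["ngram", "frequency"],
)
print(result.head(10))
```

With this data, the keys ('must', 'watch'), ('acting', 'good') and ('avoid', 'must', 'watch') each get a frequency of 2. Because the words are sorted, the label is alphabetical ("avoid must watch") rather than in the order of first appearance; keeping the first-seen spelling would need an extra lookup from the sorted key to an original n-gram.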

edited Jul 22 '19 at 14:19

answered Jan 18 '18 at 22:36


I'm looking for n-grams built from the frequently used words in the phrases. In your example, no words are repeated (i.e. frequently used). – Ash Jan 19 '18 at 14:56
Your question makes more sense now. Give me a second – jrjames83 Jan 19 '18 at 15:14
Were you able to solve this? I'm trying the same with my data; should I delete the stopwords first? – VMEscoli Mar 12 '18 at 23:02
How would you get the n-gram frequency counts? – Superdooperhero Jul 20 '19 at 16:11
@Superdooperhero - df['col'].tolist(), then flatten it (it will be a list of lists), then pass the flattened list to the Counter class from the collections module. – jrjames83 Jul 21 '19 at 17:36
Thanks! Any chance you can update your answer to show this? I'm definitely not as experienced as you. – Superdooperhero Jul 22 '19 at 12:47
@Superdooperhero - yeah, note the edited answer. I didn't handle punctuation in the new example, but I think you'll get the idea! – jrjames83 Jul 22 '19 at 14:19
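
The answer's bigram output keeps punctuation attached to tokens (for example 'dogs,'), as the last comment notes. A minimal sketch of one way to handle that, assuming a simple regular-expression tokenizer is good enough for the data; the `tokenize` helper is hypothetical and not part of the original answer:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_ngrams(input_list, n):
    # Same helper as in the answer: contiguous n-grams as tuples.
    return list(zip(*[input_list[i:] for i in range(n)]))


sentences = [
    "Run with dogs, or shoes, or dogs and shoes",
    "Run without dogs, or without shoes, or without dogs or shoes",
    "Hold this while I finish writing the python script",
    "Is this python script written yet, hey, hold this",
    "Can dogs write python, or a python script?",
]

bigram_counts = Counter()
for sentence in sentences:
    bigram_counts.update(find_ngrams(tokenize(sentence), 2))

print(bigram_counts.most_common(5))
```

With punctuation stripped, "python script?" in the last sentence is counted together with the other two occurrences of ('python', 'script'), and 'dogs,' and 'dogs' collapse into a single token.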
