
This is what my sample dataset looks like:

[sample dataset image: search queries with their impression counts]

My goal is to understand how many impressions are associated with one-word, two-word, three-word, four-word, five-word, and six-word phrases. I have been running an n-gram routine, but it only returns counts. This is my current n-gram code.

def find_ngrams(text, n):
    word_vectorizer = CountVectorizer(ngram_range=(n, n), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(text)
    frequencies = sum(sparse_matrix).toarray()[0]
    ngram = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(),
                         columns=['frequency'])
    ngram = ngram.sort_values(by=['frequency'], ascending=[False])
    return ngram

one = find_ngrams(df['query'],1)
bi = find_ngrams(df['query'],2)
tri = find_ngrams(df['query'],3)
quad = find_ngrams(df['query'],4)
pent = find_ngrams(df['query'],5)
hexx = find_ngrams(df['query'],6)

I figure what I need to do is:

1. Split the queries into one to six words.
2. Attach the impressions to the split words.
3. Regroup all the split words and sum the impressions.

Take the second query "dog common diseases and how to treat them" as an example. It should be split as:

(1) 1-gram: dog, common, diseases, and, how, to, treat, them;
(2) 2-gram: dog common, common diseases, diseases and, and how, how to, to treat, treat them;
(3) 3-gram: dog common diseases, common diseases and, diseases and how, and how to, how to treat, to treat them;
(4) 4-gram: dog common diseases and, common diseases and how, diseases and how to, and how to treat, how to treat them;
(5) 5-gram: dog common diseases and how, common diseases and how to, diseases and how to treat, and how to treat them;
(6) 6-gram: dog common diseases and how to, common diseases and how to treat, diseases and how to treat them;
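The three steps above might be sketched like this (the column names `query` and `impressions` are taken from the description; the sample rows are hypothetical stand-ins for the dataset image):

```python
import pandas as pd

# Hypothetical rows standing in for the dataset shown in the image.
df = pd.DataFrame({
    "query": ["dog common diseases and how to treat them", "dog food"],
    "impressions": [100, 50],
})

def ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized string."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Steps 1 and 2: split each query into 1- to 6-grams, attaching its impressions.
rows = [
    (gram, imp)
    for query, imp in zip(df["query"], df["impressions"])
    for n in range(1, 7)
    for gram in ngrams(query, n)
]

# Step 3: regroup identical n-grams and sum their impressions.
totals = (
    pd.DataFrame(rows, columns=["ngram", "impressions"])
    .groupby("ngram")["impressions"]
    .sum()
)
```

With these sample rows, `totals["dog"]` would be 150 (both queries contain "dog") and `totals["dog common"]` would be 100.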
  • I'm sorry, what exactly is your question? Are you asking how to generate n-grams? And what do you mean "I used to run the N-gram algorithm, but it only returns count"? – juanpa.arrivillaga Apr 20 '17 at 19:48
  • I need to find out the impression associated with the n-gram. N-gram gives the frequency of the terms appear, but I need to understand how many impressions are associated. – Ran Tao Apr 20 '17 at 19:50
  • *What count*? It would help if you provided your code. – juanpa.arrivillaga Apr 20 '17 at 19:50

2 Answers


Here is a method! Not the most efficient, but let's not optimize prematurely. The idea is to use apply to get a new pd.DataFrame with new columns for all n-grams, join this with the old dataframe, and do some stacking and grouping.

import pandas as pd

df = pd.DataFrame({
    "squery": ["how to feed a dog", "dog habits", "to cat or not to cat", "dog owners"],
    "count": [1000, 200, 100, 150]
})

def n_grams(txt):
    grams = list()
    words = txt.split(' ')
    for i in range(len(words)):
        for k in range(1, len(words) - i + 1):
            grams.append(" ".join(words[i:i+k]))
    return pd.Series(grams)

counts = df.squery.apply(n_grams).join(df)

# Stack the n-gram columns into one long Series, then sum impressions per n-gram.
counts.drop("squery", axis=1).set_index("count").unstack()\
    .rename("ngram").dropna().reset_index()\
    .drop("level_0", axis=1).groupby("ngram")["count"].sum()

This last expression will return a pd.Series like below.

    ngram
a                       1000
a dog                   1000
cat                      200
cat or                   100
cat or not               100
cat or not to            100
cat or not to cat        100
dog                     1350
dog habits               200
dog owners               150
feed                    1000
feed a                  1000
feed a dog              1000
habits                   200
how                     1000
how to                  1000
how to feed             1000
how to feed a           1000
how to feed a dog       1000
not                      100
not to                   100
not to cat               100
or                       100
or not                   100
or not to                100
or not to cat            100
owners                   150
to                      1200
to cat                   200
to cat or                100
to cat or not            100
to cat or not to         100
to cat or not to cat     100
to feed                 1000
to feed a               1000
to feed a dog           1000

Spiffy method

This one is probably a bit more efficient, but it still materializes the dense n-gram matrix from CountVectorizer. It scales each query's row by its number of impressions, then sums over the queries to get a total number of impressions per n-gram. It gives the same result as above. One thing to note is that a query containing a repeated n-gram also counts it twice.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 5))
ngrams = cv.fit_transform(df.squery)
mask = np.repeat(df['count'].values.reshape(-1, 1), repeats = len(cv.vocabulary_), axis = 1)
index = list(map(lambda x: x[0], sorted(cv.vocabulary_.items(), key = lambda x: x[1])))
pd.Series(np.multiply(mask, ngrams.toarray()).sum(axis = 0), name = "counts", index = index)
  • I like this way! We only need to group all the same queries and sum their impressions, and then separate them based on the number of words. – Ran Tao Apr 20 '17 at 21:27
  • Can you tell me how to store the pd.Series result as a data frame – Ran Tao Apr 20 '17 at 22:02
  • If `s` is a Series, then `s.to_frame()` makes a dataframe with one column out of it. – Gijs Apr 21 '17 at 12:23
  • Thanks. I renamed "squery" as "query", and changed: ngrams = cv.fit_transform(df.query). But the error returned: 'method' object is not iterable. Do you know how to fix this issue. – Ran Tao Apr 21 '17 at 14:55
  • Sounds like you forgot parentheses somewhere. Look at which line gives the error. Somewhere in that line you are looping over a thing that is not actually an iterable, i.e. a loopable thing, but a method. – Gijs Apr 26 '17 at 09:27

How about something like this:

from collections import defaultdict

def find_ngrams(input, n):
    # from http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/
    return zip(*[input[i:] for i in range(n)])

def impressions_by_ngrams(data, ngram_max):
    result = [defaultdict(int) for n in range(ngram_max)]
    for query, impressions in data:
        words = query.split()
        for n in range(ngram_max):
            for ngram in find_ngrams(words, n + 1):
                result[n][ngram] += impressions
    return result

Example:

>>> data = [('how to feed a dog', 10000),
...         ('see a dog run',     20000)]
>>> ngrams = impressions_by_ngrams(data, 3)
>>> ngrams[0]   # unigrams
defaultdict(<type 'int'>, {('a',): 30000, ('how',): 10000, ('run',): 20000, ('feed',): 10000, ('to',): 10000, ('see',): 20000, ('dog',): 30000})
>>> ngrams[1][('a', 'dog')]  # impressions for bigram 'a dog'
30000
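If a DataFrame is preferred (as in the question's pandas workflow), each defaultdict's word-tuple keys can be joined into strings and wrapped in a Series. A sketch, using a hypothetical dict in the same shape as `ngrams[0]` above:

```python
import pandas as pd
from collections import defaultdict

# Hypothetical unigram result in the same shape as ngrams[0] above.
unigrams = defaultdict(int, {('a',): 30000, ('how',): 10000, ('run',): 20000,
                             ('feed',): 10000, ('to',): 10000,
                             ('see',): 20000, ('dog',): 30000})

# Join the word tuples into strings, sort by impressions, and convert
# the Series into a one-column DataFrame.
frame = (
    pd.Series({" ".join(gram): imp for gram, imp in unigrams.items()},
              name="impressions")
    .sort_values(ascending=False)
    .to_frame()
)
```

`frame.loc["dog", "impressions"]` would then give 30000.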