This is what my sample dataset looks like:
My goal is to understand how many impressions are associated with one-word, two-word, three-word, four-word, five-word, and six-word phrases. I used to run an N-gram algorithm, but it only returns counts. This is my current n-gram code:
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    def find_ngrams(text, n):
        word_vectorizer = CountVectorizer(ngram_range=(n, n), analyzer='word')
        sparse_matrix = word_vectorizer.fit_transform(text)
        frequencies = sum(sparse_matrix).toarray()[0]
        ngram = pd.DataFrame(frequencies,
                             index=word_vectorizer.get_feature_names(),
                             columns=['frequency'])
        ngram = ngram.sort_values(by='frequency', ascending=False)
        return ngram
    one = find_ngrams(df['query'], 1)
    bi = find_ngrams(df['query'], 2)
    tri = find_ngrams(df['query'], 3)
    quad = find_ngrams(df['query'], 4)
    pent = find_ngrams(df['query'], 5)
    hexx = find_ngrams(df['query'], 6)
I figure what I need to do is: 1. split each query into its one-word to six-word n-grams; 2. attach the query's impressions to each of those n-grams; 3. regroup all the n-grams and sum the impressions.
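The three steps above can be sketched as follows. This is a minimal sketch, not a tested solution; it assumes the DataFrame has columns named `query` and `impressions` (the column names are my assumption), and `ngrams` / `impressions_per_ngram` are hypothetical helper names:

    import pandas as pd

    def ngrams(tokens, n):
        # step 1: all consecutive n-word windows over the token list
        return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def impressions_per_ngram(df, n):
        rows = []
        for query, imp in zip(df['query'], df['impressions']):
            for gram in ngrams(query.split(), n):
                rows.append((gram, imp))  # step 2: attach impressions to each n-gram
        out = pd.DataFrame(rows, columns=['ngram', 'impressions'])
        # step 3: regroup the n-grams and sum the impressions
        return out.groupby('ngram', as_index=False)['impressions'].sum()

Calling `impressions_per_ngram(df, n)` for `n` in 1..6 would then give one impression total per n-gram, in place of the frequency counts from `CountVectorizer`.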
Take the second query, "dog common diseases and how to treat them", as an example. It should be split as:
(1) 1-gram: dog, common, diseases, and, how, to, treat, them;
(2) 2-gram: dog common, common diseases, diseases and, and how, how to, to treat, treat them;
(3) 3-gram: dog common diseases, common diseases and, diseases and how, and how to, how to treat, to treat them;
(4) 4-gram: dog common diseases and, common diseases and how, diseases and how to, and how to treat, how to treat them;
(5) 5-gram: dog common diseases and how, common diseases and how to, diseases and how to treat, and how to treat them;
(6) 6-gram: dog common diseases and how to, common diseases and how to treat, diseases and how to treat them;
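For reference, the enumeration above can be reproduced with a plain sliding window over the whitespace-split tokens (a sketch; `ngrams` is a hypothetical helper, not a scikit-learn function):

    def ngrams(tokens, n):
        # all consecutive n-word windows over the token list
        return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    query = "dog common diseases and how to treat them"
    for n in range(1, 7):
        print(f"{n}-gram:", ngrams(query.split(), n))

For this 8-word query, the number of n-grams is 8 - n + 1, which matches the lists above (8 unigrams, 7 bigrams, ..., 3 six-grams).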