I have a dataframe A with the columns docid (document ID), title (title of the article), lineid (line ID, i.e. the position of the paragraph), text, and tokencount (word count, including whitespace):
   docid  title  lineid  text                                          tokencount
0  0      A      0       shopping and orders have become more com...   66
1  0      A      1       people wrote to the postal service online...  67
2  0      A      2       text updates really from the U.S. Postal...   43
...
I want to create a new dataframe based on A with the columns title, lineid, count, and query.
query is a text string containing one or more words, like "data analysis", "text message", or "shopping and orders".
count is the number of times each word of the query occurs in the text.
The new dataframe should look like this:
title  lemma   count  lineid
A      "data"  0      0
A      "data"  1      1
A      "data"  4      2
A      "shop"  2      0
A      "shop"  1      1
A      "shop"  2      2
B      "data"  4      0
B      "data"  0      1
B      "data"  2      2
B      "shop"  9      0
B      "shop"  3      1
B      "shop"  1      2
...
How can I write a function that generates this new dataframe?
So far I have created a new dataframe df from A with an extra column count.
df = A[['title', 'lineid']].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
df['count'] = 0
df.set_index(['title', 'lineid'], inplace=True)
I have also written a function that counts how often the words of a query occur in a string.
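from collections import Counter

def occurrence_counter(target_string, query):
    # Count every whitespace-separated token in the target string
    data = Counter(target_string.split())
    count = 0
    # Split the query into words; iterating over the string itself would yield single characters
    for key in query.split():
        count += data[key]  # a Counter returns 0 for words it has not seen
    return count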
But how can I combine the two into a function that produces the new dataframe?
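To make the target concrete, here is a minimal sketch of the kind of function I am after. It is only an illustration under assumptions: the name count_query_words and the toy dataframe are made up, the query is split on whitespace, and words are matched as exact tokens with no lemmatization (so "shopping" would not count toward "shop").

```python
from collections import Counter

import pandas as pd


def count_query_words(A, query):
    """Return one row per (title, query word, lineid) with that word's count.

    Hypothetical helper: exact whitespace-token matching, no lemmatization.
    """
    words = query.split()
    rows = []
    for title, lineid, text in zip(A["title"], A["lineid"], A["text"]):
        counts = Counter(text.split())  # token counts for this line
        for word in words:
            rows.append(
                {"title": title, "lemma": word, "count": counts[word], "lineid": lineid}
            )
    out = pd.DataFrame(rows, columns=["title", "lemma", "count", "lineid"])
    # Order rows like the desired output: by title, then word, then line
    return out.sort_values(["title", "lemma", "lineid"]).reset_index(drop=True)


# Toy data standing in for A (not the real dataframe)
A = pd.DataFrame({
    "docid": [0, 0, 1],
    "title": ["A", "A", "B"],
    "lineid": [0, 1, 0],
    "text": ["data and more data", "shop for data", "shop shop shop"],
    "tokencount": [4, 3, 3],
})
result = count_query_words(A, "data shop")
print(result)
```

Building a list of row dictionaries and constructing the dataframe once at the end avoids repeatedly appending to a dataframe, which tends to be slow. Matching "shopping" to "shop" would need a separate stemming or lemmatization step, which this sketch leaves out.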