Pre-process text string with NLTK

Question

I have a dataframe A containing docid(document ID), title(title of the article), lineid(line ID, aka the location of the paragraph), text, and tokencount(counts of words including white spaces):

  docid   title  lineid                                         text        tokencount
0     0     A        0   shopping and orders have become more com...                66
1     0     A        1  people wrote to the postal service online...                67
2     0     A        2   text updates really from the U.S. Postal...                43
...

I want to create a new dataframe based on A including title, lineid, count, and query.

query is the text string containing one or more words like "data analysis", "text message", or "shopping and orders".

count is the counts of each word of the query.

The new dataframe should look like this:

title  lemma   count   lineid
  A    "data"    0        0
  A    "data"    1        1
  A    "data"    4        2
  A    "shop"    2        0
  A    "shop"    1        1
  A    "shop"    2        2
  B    "data"    4        0
  B    "data"    0        1
  B    "data"    2        2
  B    "shop"    9        0
  B    "shop"    3        1
  B    "shop"    1        2
...

How to make a function to generate this new dataframe?

I have created a new dataframe df from A with a column count.

df = A[['title','lineid']]
df['count'] = 0
df.set_index(['title','lineid'], inplace=True)

Also, I have created a function to count word of query.

from collections import Counter

def occurrence_counter(target_string, query):
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count

But, how can I use both of them to generate a function of a new dataframe?

score 0 · Answer 1 · answered Apr 25 '21 at 02:16

This should handle your scenario:

import pandas as pd
from collections import Counter

query = "data analysis"
wordlist = query.split(" ")
#print(wordlist)

# row wise frequency count
df['text_new']  = df.text.str.split().apply(lambda x: Counter(x))

output = pd.DataFrame()
# iterate row by row
for index, row in df.iterrows():
    temp = dict()
    for word in wordlist:
        temp['title']  = row['title']
        temp['lemma']  = word
        temp['count']  = row['text_new'][word]
        temp['lineid'] = row['lineid']
    
    output = output.append(temp, ignore_index=True)
#print(output)

tdy · Accepted Answer · 2021-04-25T02:36:57.573

If I understand correctly, you can do this with built-in pandas functions: Series.str.count() to count the queries; melt() to reshape into the final column structure.

Given the sample df:

df = pd.DataFrame({'docid': {0: 0, 1: 0, 2: 0}, 'title': {0: 'A', 1: 'A', 2: 'A'}, 'lineid': {0: 0, 1: 1, 2: 2}, 'text': {0: 'shopping and orders have become more com...',  1: 'people wrote to the postal service online...',  2: 'text updates really from the U.S. Postal...'}, 'tokencount': {0: 66, 1: 67, 2: 43}})

#   docid  title  lineid                                          text
# 0     0      A       0   shopping and orders have become more com...
# 1     0      A       1  people wrote to the postal service online...
# 2     0      A       2   text updates really from the U.S. Postal...

First count() the queries:

queries = ['order', 'shop', 'text']
df = df.assign(**{f'query_{query}': df.text.str.count(query) for query in queries})

#   docid  title  lineid                                          text  tokencount  query_order  query_shop  query_text
# 0     0      A       0   shopping and orders have become more com...          66            1           1           0
# 1     0      A       1  people wrote to the postal service online...          67            0           0           0
# 2     0      A       2   text updates really from the U.S. Postal...          43            0           0           1

Then melt() into the final column structure:

df.melt(
    id_vars=['title', 'lineid'],
    value_vars=[f'query_{query}' for query in queries],
    var_name='lemma',
    value_name='count',
).replace(r'^query_', '', regex=True)

#   title  lineid  lemma  count
# 0     A       0  order      1
# 1     A       1  order      0
# 2     A       2  order      0
# 3     A       0   shop      1
# 4     A       1   shop      0
# 5     A       2   shop      0
# 6     A       0   text      0
# 7     A       1   text      0
# 8     A       2   text      1

Pre-process text string with NLTK

2 Answers2