I'm trying to write a function that gets the frequency of specific words from a dataframe. I'm using Pandas to read the CSV file into a dataframe and NLTK to tokenize the text. I'm able to get the counts for the entire column, but I'm having difficulty getting the frequency for each row. Below is what I have done so far.
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from collections import defaultdict
words = [
    "robot",
    "automation",
    "collaborative",
    "Artificial Intelligence",
    "technology",
    "Computing",
    "autonomous",
    "automobile",
    "cobots",
    "AI",
    "Integration",
    "robotics",
    "machine learning",
    "machine",
    "vision systems",
    "systems",
    "computerized",
    "programmed",
    "neural network",
    "tech",
]
count = defaultdict(int)  # module-level so it can be printed after the call

def analze(file):
    df = pd.read_csv(file)
    for text in df["Text"]:
        tokenize_text = word_tokenize(text)
        for w in tokenize_text:
            if w in words:
                count[w] += 1

analze("Articles/AppleFilter.csv")
print(count)
Output:
defaultdict(<class 'int'>, {'automation': 283, 'robot': 372, 'robotics': 194, 'machine': 220, 'tech': 41, 'systems': 187, 'technology': 246, 'autonomous': 60, 'collaborative': 18, 'automobile': 6, 'AI': 158, 'programmed': 12, 'cobots': 2, 'computerized': 3, 'Computing': 1})
Goal: get the frequency for each row, e.g.:
{'automation': 5, 'robot': 1, 'robotics': 1, ...}
{'automobile': 1, 'systems': 1, 'technology': 1, ...}
{'AI': 1, 'cobots': 1, 'computerized': 3, ...}
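To make that concrete, something along these lines is roughly the shape I have in mind (just a sketch using collections.Counter; analyze_per_row is a made-up name, and I'm not sure this is the right pandas approach):
from collections import Counter

def analyze_per_row(file):
    # build one Counter per row instead of a single running total
    df = pd.read_csv(file)
    per_row = []
    for text in df["Text"]:
        tokens = word_tokenize(text)
        per_row.append(Counter(t for t in tokens if t in words))
    return per_row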
CSV file format:
Title | Text | URL
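(A made-up row, just to illustrate the layout:)
"Apple automation article" | "Apple is expanding its use of robots and automation in ..." | https://example.com/apple-robots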
What I have tried:
count = defaultdict(int)
df = pd.read_csv("AppleFilterTest01.csv")
for text in df["Text"].iteritems():
    for row in text:
        print(row)
        if row in words:
            count[row] += 1
print(count)
Output:
defaultdict(<class 'int'>, {})
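I think the problem is that iteritems() yields (index, value) pairs, so row is first the index and then the whole article text, never a single word. This small check (re-reading the same CSV, just for debugging) is what led me to that:
df = pd.read_csv("AppleFilterTest01.csv")
for index, text in df["Text"].iteritems():
    # iteritems() gives (index, value) pairs: the row index and the full article text
    print(index)        # 0, 1, 2, ...
    print(type(text))   # <class 'str'>, the whole text of that row, not individual words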
If anyone can offer any guidance, tips, or help, I would appreciate it so much. Thank you.