I'm trying to create a function that gets the frequency of specific words from a dataframe. I'm using Pandas to read the CSV file into a dataframe and NLTK to tokenize the text. I can get the counts for the entire column, but I'm having difficulty getting the frequency for each row. Below is what I have done so far.

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from collections import defaultdict

words = [
    "robot",
    "automation",
    "collaborative",
    "Artificial Intelligence",
    "technology",
    "Computing",
    "autonomous",
    "automobile",
    "cobots",
    "AI",
    "Integration",
    "robotics",
    "machine learning",
    "machine",
    "vision systems",
    "systems",
    "computerized",
    "programmed",
    "neural network",
    "tech",
]

def analze(file):
    # Count every occurrence of a target word across the whole "Text" column.
    count = defaultdict(int)
    df = pd.read_csv(file)
    for text in df["Text"]:
        tokenize_text = word_tokenize(text)
        for w in tokenize_text:
            if w in words:
                count[w] += 1
    return count


count = analze("Articles/AppleFilter.csv")
print(count)

Output:

defaultdict(<class 'int'>, {'automation': 283, 'robot': 372, 'robotics': 194, 'machine': 220, 'tech': 41, 'systems': 187, 'technology': 246, 'autonomous': 60, 'collaborative': 18, 'automobile': 6, 'AI': 158, 'programmed': 12, 'cobots': 2, 'computerized': 3, 'Computing': 1})
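A side note on the output above: none of the multi-word entries from the list ("machine learning", "neural network", and so on) appear in it. That is expected with this approach, because word_tokenize yields single tokens and a single token can never equal a two-word string. A minimal sketch of one possible workaround, counting phrases with a regular expression instead of tokens, is shown here. The phrases list, the sample text, and the count_phrase helper are made up for illustration.

```python
import re

# Hypothetical subset of the word list and a made-up sample text.
phrases = ["machine learning", "neural network", "machine"]
text = "machine learning beats a machine; machine learning uses a neural network"


def count_phrase(phrase, text):
    # Match the phrase as a whole, bounded by word boundaries (case-sensitive).
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text))


counts = {p: count_phrase(p, text) for p in phrases}
print(counts)
```

Note that "machine" is also counted inside every "machine learning" match, so overlapping entries would need to be handled separately if that matters.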

Goal: get the frequency for each row

{'automation': 5, 'robot': 1, 'robotics': 1, ...
{'automobile': 1, 'systems': 1, 'technology': 1,...
{'AI': 1, 'cobots': 1, 'computerized': 3, ...

CSV file format:

Title | Text | URL

What I have tried:

count = defaultdict(int)
df = pd.read_csv("AppleFilterTest01.csv")
for text in df["Text"].iteritems():
    for row in text:
        print(row)
        if row in words:
            count[w] += 1
print(count)

Output:

defaultdict(<class 'int'>, {})
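For what it's worth, the empty result looks consistent with how iteritems() behaves: it yields (index, value) tuples, so the inner loop sees the row index and then the whole text string, and neither is ever a single entry in words. A small sketch of that behaviour, using a made-up two-row dataframe and a shortened word list (both are assumptions for illustration), and items(), which is the current name for iteritems():

```python
import pandas as pd

# Hypothetical, shortened word list and sample data for illustration.
words = ["robot", "automation", "AI"]
df = pd.DataFrame({"Text": ["robot automation robot", "AI and automation"]})

# Series.items() (iteritems() in older pandas) yields (index, value) tuples.
pairs = list(df["Text"].items())
print(pairs[0])

# Looping over each tuple gives the index, then the full text string;
# neither of those is ever equal to a single word in `words`.
matches = [part for pair in pairs for part in pair if part in words]
print(matches)

# Unpacking the tuple and splitting the text gives per-row tokens instead.
row_tokens = [text.split() for _, text in pairs]
print(row_tokens[0])
```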

If anyone can offer any guidance, tips, or help, I would appreciate it so much. Thank you.

Bore
  • can u share a sample of ur tokenized dataset – sammywemmy Mar 18 '20 at 03:43
  • I believe answer # 2 of [this](https://stackoverflow.com/questions/46786211/counting-the-frequency-of-words-in-a-pandas-data-frame) post handles your problem. – Ukrainian-serge Mar 18 '20 at 04:07
  • Does this answer your question? [Counting the Frequency of words in a pandas data frame](https://stackoverflow.com/questions/46786211/counting-the-frequency-of-words-in-a-pandas-data-frame) – Ukrainian-serge Mar 18 '20 at 04:07

1 Answer

Here is a simple solution that uses collections.Counter:

Sample to copy/paste:

0,review_body
1,this is the first 8 issues of the series. this is the first 8 issues of the series.
2,I've always been partial to immutable laws. I've always been partial to immutable laws.
3,This is a book about first contact with aliens. This is a book about first contact with aliens.
4,This is quite possibly *the* funniest book. This is quite possibly *the* funniest book.
5,The story behind the book is almost better than your mom. The story behind the book is almost better than your mom.

Import necessities:

import pandas as pd
from collections import Counter

df = pd.read_clipboard(header=0, index_col=0, sep=',')

Use .str.split() then apply() the Counter:

df1 = df.review_body.str.split().apply(Counter)

print(df1)

0
1    {'this': 2, 'is': 2, 'the': 4, 'first': 2, '8'...
2    {'I've': 2, 'always': 2, 'been': 2, 'partial':...
3    {'This': 2, 'is': 2, 'a': 2, 'book': 2, 'about...
4    {'This': 2, 'is': 2, 'quite': 2, 'possibly': 2...
5    {'The': 2, 'story': 2, 'behind': 2, 'the': 2, ...

To get the output format you need, you can wrap the result as dict(Counter(x)) inside apply(), call .to_dict() on the resulting Series, and so on.
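As a rough sketch of how this could be adapted to the question, the snippet below keeps only tokens from a target word list and returns a plain dict per row. The words set is a shortened stand-in for the list in the question, and the sample dataframe and the count_targets helper are made up for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical subset of the question's word list; a set gives fast lookups.
words = {"robot", "automation", "AI", "technology"}

# Made-up sample data standing in for the "Text" column of the CSV.
df = pd.DataFrame({"Text": [
    "robot automation robot technology",
    "AI helps automation",
    "nothing relevant here",
]})


def count_targets(text):
    # Count only the whitespace-separated tokens that appear in `words`.
    return dict(Counter(tok for tok in text.split() if tok in words))


per_row = df["Text"].apply(count_targets)  # one dict per row
as_dict = per_row.to_dict()                # {row index: {word: count}}
print(per_row)
```

str.split() does not separate punctuation from words, so word_tokenize from the question could be swapped in for text.split() if that matters for your data.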


Hope that's helpful.

Ukrainian-serge