How do I count the total number of words in a Pandas dataframe cell and add those to a new column?

Question

A common task in sentiment analysis is to obtain the count of words within a Pandas data frame cell and create a new column based on that count. How do I do this?

altabq · Answer 1 · 2021-08-01T10:47:56.240

9

Assuming that a sentence with n words has n-1 spaces in it, there's another solution:

df['new_column'] = df['count_column'].str.count(' ') + 1

This is probably faster than a split-based approach, because it does not split each string into a list.

If count_column contains empty strings, the result needs to be adjusted (see comment below):

import numpy as np
df['new_column'] = np.where(df['count_column'] == '', 0, df['new_column'])
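As a minimal sketch of the two steps above on a small, made-up dataframe (the sample strings are assumptions for illustration; only the column names come from the answer):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 'count_column' follows the naming used in the answer.
df = pd.DataFrame({'count_column': ['the quick brown fox', 'car', '']})

# n words separated by single spaces contain n - 1 spaces.
df['new_column'] = df['count_column'].str.count(' ') + 1

# An empty string has no spaces either, so it would be counted as 1 word; reset it to 0.
df['new_column'] = np.where(df['count_column'] == '', 0, df['new_column'])

print(df['new_column'].tolist())  # [4, 1, 0]
```

Note that this relies on words being separated by exactly one space; leading, trailing, or repeated spaces would inflate the count.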

edited Aug 01 '21 at 10:47

answered Jul 13 '18 at 12:54

altabq


I don't have enough reputation to downvote, but the reason I wanted to do so, is because 1 word and 0 words both have no spaces, hence those cases will be treated the same. I'd rather use split() – goidelg Jul 30 '21 at 14:15

`split(' ')` shows exactly the same result: `[len(c.split(' ')) for c in ['', 'car']] == [c.count(' ')+1 for c in ['', 'car']]` – altabq Aug 01 '21 at 10:44
And that's why I don't have that reputation :-) – goidelg Aug 01 '21 at 14:35
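The point settled in these comments can be checked in plain Python. The sketch below uses made-up sample strings and also shows, for comparison, what `split()` with no argument returns; that last variant is not what either commenter proposed, it is included only to show where the approaches differ.

```python
# Hypothetical samples; the last one contains a double space.
samples = ['', 'car', 'red car', 'red  car']

# Splitting on a single space and counting spaces always agree.
by_split_space = [len(s.split(' ')) for s in samples]
by_count = [s.count(' ') + 1 for s in samples]

# split() with no argument splits on runs of whitespace and drops empty pieces.
by_split = [len(s.split()) for s in samples]

print(by_split_space)  # [1, 1, 2, 3]
print(by_count)        # [1, 1, 2, 3]
print(by_split)        # [0, 1, 2, 2]
```

So an empty cell counts as 1 with both `split(' ')` and `count(' ') + 1`, and as 0 only with the argument-free `split()`.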

score 6 · Accepted Answer · answered Sep 26 '17 at 14:22

Let's say you have a dataframe df that you've generated using

df = pandas.read_csv('dataset.csv')

You would then add a new column with the word count by doing the following:

df['new_column'] = df.columnToCount.apply(lambda x: len(str(x).split(' ')))

Keep in mind that the split is on a single space, so only spaces mark word boundaries, and consecutive spaces produce empty strings that are counted as well. You may also want to remove punctuation and numbers and convert everything to lowercase before counting:

df = df.apply(lambda x: x.astype(str).str.lower())
df = df.replace(r'\d+', '', regex=True)
df = df.replace(r'[^\w\s\+]', '', regex=True)
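A minimal sketch of the cleanup and count combined on a single column, assuming made-up sample data. It swaps in `Series.str.split()` with no argument instead of `split(' ')`, so repeated spaces left behind by the cleanup do not add phantom words:

```python
import pandas as pd

# Hypothetical data; the column name mirrors the snippet in the answer.
df = pd.DataFrame(
    {'columnToCount': ['Hello, World!', 'I have 2 cats.', 'One  more   time']}
)

cleaned = (
    df['columnToCount']
    .astype(str)
    .str.lower()
    .str.replace(r'\d+', '', regex=True)       # drop digits
    .str.replace(r'[^\w\s+]', '', regex=True)  # drop punctuation, keep '+'
)

# split() without a separator splits on any run of whitespace.
df['new_column'] = cleaned.str.split().str.len()

print(df['new_column'].tolist())  # [2, 3, 3]
```

Cleaning only the text column, rather than the whole dataframe, leaves any numeric columns untouched.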

Why not use nltk word tokenizer? – Bharath M Shetty Sep 26 '17 at 14:47

score 2 · Answer 3 · answered Jan 16 '21 at 17:17

For a dataframe df, remove punctuation from the selected column:

import string
string_text = df['reviews'].str
df['reviews'] = string_text.translate(str.maketrans('', '', string.punctuation))

Get the word count:

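from nltk.tokenize import word_tokenize
# Tokenize each review into a list of words, then store the list length
df['review_word_count'] = df['reviews'].apply(word_tokenize).tolist()
df['review_word_count'] = df['review_word_count'].apply(len)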

Write to a CSV with new column:

df.to_csv('./data/dataset.csv')
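The three steps can be sketched end to end without NLTK installed. This is an approximation under stated assumptions: the sample reviews are made up, whitespace splitting stands in for `word_tokenize` (the two can disagree, for example on contractions), and the CSV is written to an in-memory buffer rather than to `./data/dataset.csv`.

```python
import io
import string

import pandas as pd

# Hypothetical reviews; the column names follow the answer.
df = pd.DataFrame(
    {'reviews': ['Great product, would buy again!', 'Terrible... do not buy.']}
)

# Remove every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)
df['reviews'] = df['reviews'].str.translate(table)

# Whitespace tokenization is a stand-in for nltk's word_tokenize here.
df['review_word_count'] = df['reviews'].str.split().str.len()

# Write to an in-memory buffer instead of a file, to keep the sketch self-contained.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

print(df['review_word_count'].tolist())  # [5, 4]
```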

score 0 · Answer 4 · answered Sep 26 '17 at 14:24


from collections import Counter

df['new_column'] = df['count_column'].apply(lambda x: Counter(" ".join(x).split(" ")).items())


A.Kot


This requires you to split each text cell in `count_column` into a list of words. (If each cell in `count_column` holds a single string, this counts characters.) Also, sorry if I'm missing something obvious, but why `Counter(' '.join(x).split(' '))`? Doesn't `Counter(x)` achieve the same result? **EDIT:** one reason to join and then split is to ensure you break up any strings in the list that contain multiple space-separated words. – Peter Leimbigler Sep 26 '17 at 15:05
@PeterLeimbigler How would you count characters if you split by a space? – A.Kot Sep 26 '17 at 15:09
running `' '.join(a_string_variable)` on a string inserts a space between each character in the string. – Peter Leimbigler Sep 26 '17 at 15:23
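The behaviour discussed in these comments can be reproduced in plain Python. The sample values below are made up for illustration:

```python
from collections import Counter

# Case 1: the cell holds a single string.
text = 'car'
spaced = ' '.join(text)                   # a space is inserted between characters
char_counts = Counter(spaced.split(' '))  # so this counts characters, not words

# Case 2: the cell holds a list of strings, some with several words.
phrases = ['to be', 'or not', 'to be']
word_counts = Counter(' '.join(phrases).split(' '))  # counts individual words
phrase_counts = Counter(phrases)                     # counts whole list items

print(spaced)              # c a r
print(dict(char_counts))   # {'c': 1, 'a': 1, 'r': 1}
print(dict(word_counts))   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(dict(phrase_counts)) # {'to be': 2, 'or not': 1}
```

Either way, the result is a per-word (or per-character) frequency table rather than a single total word count per cell.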
