A common task in sentiment analysis is to obtain the count of words within a Pandas data frame cell and create a new column based on that count. How do I do this?
4 Answers
9
Assuming that a sentence with n words has n-1 spaces in it, there's another solution:
df['new_column'] = df['count_column'].str.count(' ') + 1
This solution is probably faster, because it does not split each string into a list.
If count_column contains empty strings, the result needs to be adjusted (see comment below):
import numpy as np
df['new_column'] = np.where(df['count_column'] == '', 0, df['new_column'])
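As a minimal sketch of the two steps above, assuming count_column holds plain strings whose words are separated by single spaces (the sample data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names follow the answer above.
df = pd.DataFrame({'count_column': ['the car is red', 'car', '']})

# n words separated by single spaces contain n - 1 spaces.
df['new_column'] = df['count_column'].str.count(' ') + 1

# An empty string has no spaces either, so reset its count to 0.
df['new_column'] = np.where(df['count_column'] == '', 0, df['new_column'])

word_counts = df['new_column'].tolist()
print(word_counts)
```

Leading, trailing, or repeated spaces would inflate this count. If that matters for your data, df['count_column'].str.split().str.len() is one alternative, at the cost of building a list per cell.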

altabq
-
I don't have enough reputation to downvote, but the reason I wanted to do so, is because 1 word and 0 words both have no spaces, hence those cases will be treated the same. I'd rather use split() – goidelg Jul 30 '21 at 14:15
-
`split(' ')` shows exactly the same result: `[len(c.split(' ')) for c in ['', 'car']] == [c.count(' ')+1 for c in ['', 'car']]` – altabq Aug 01 '21 at 10:44
-
And that's why I don't have that reputation :-) – goidelg Aug 01 '21 at 14:35
6
Let's say you have a dataframe df that you've generated using
df = pandas.read_csv('dataset.csv')
You would then add a new column with the word count by doing the following:
df['new_column'] = df.columnToCount.apply(lambda x: len(str(x).split(' ')))
Keep in mind that the space passed to split is the delimiter, so every space-separated token is counted as a word. You may also want to remove punctuation and numbers and convert the text to lowercase before counting:
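# Convert every column to lowercase strings
df = df.apply(lambda x: x.astype(str).str.lower())
# Remove digits (raw strings avoid invalid-escape warnings)
df = df.replace(r'\d+', '', regex=True)
# Remove everything except word characters, whitespace and '+'
df = df.replace(r'[^\w\s\+]', '', regex=True)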
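Putting the cleaning and counting together in the right order might look like the sketch below. It is only an illustration: the sample rows are made up, the cleaning is applied to one column instead of the whole frame, and split() with no argument is swapped in for split(' ') so that the gap left by a removed number does not count as an empty word.

```python
import pandas as pd

# Hypothetical sample data standing in for dataset.csv.
df = pd.DataFrame({'columnToCount': ['The 2 cars are RED!', 'Hello, world']})

# Lowercase, then strip digits and punctuation from the text column only.
cleaned = df['columnToCount'].astype(str).str.lower()
cleaned = cleaned.str.replace(r'\d+', '', regex=True)
cleaned = cleaned.str.replace(r'[^\w\s]', '', regex=True)

# split() with no argument splits on any run of whitespace.
df['new_column'] = cleaned.str.split().str.len()

word_counts = df['new_column'].tolist()
print(word_counts)
```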

muninn
2
For a dataframe df, remove punctuation from the selected column:
import string
string_text = df['reviews'].str
df['reviews'] = string_text.translate(str.maketrans('', '', string.punctuation))
Get the word count:
from nltk.tokenize import word_tokenize  # requires NLTK's tokenizer data to be downloaded
df['review_word_count'] = df['reviews'].apply(word_tokenize).tolist()
df['review_word_count'] = df['review_word_count'].apply(len)
Write to a CSV with new column:
df.to_csv('./data/dataset.csv')

Isurie
0
from collections import Counter
df['new_column'] = df['count_column'].apply(lambda x: Counter(" ".join(x).split(" ")).items())

A.Kot
-
This requires you to split each text cell in `count_column` into a list of words. (If each cell in `count_column` holds a single string, this counts characters.) Also, sorry if I'm missing something obvious, but why `Counter(' '.join(x).split(' '))`? Doesn't `Counter(x)` achieve the same result? **EDIT:** one reason to join and then split is to ensure you break up any strings in the list that contain multiple space-separated words. – Peter Leimbigler Sep 26 '17 at 15:05
-
@PeterLeimbigler How would you count characters if you split by a space? – A.Kot Sep 26 '17 at 15:09
-
running `' '.join(a_string_variable)` on a string inserts a space between each character in the string. – Peter Leimbigler Sep 26 '17 at 15:23
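The difference discussed in these comments can be checked with a short sketch. The values a_string and a_list are made up for illustration and are not from the answer:

```python
from collections import Counter

a_string = 'red car'            # a cell holding a single string
a_list = ['red car', 'car']     # a cell holding a list of strings

# Joining a string puts a space between every character...
joined_string = ' '.join(a_string)
# ...so splitting on a space afterwards yields characters, not words.
char_counts = Counter(joined_string.split(' '))

# Joining a list of strings and then splitting yields individual words,
# including words from list items that contain several words.
word_counts = Counter(' '.join(a_list).split(' '))

print(char_counts)
print(word_counts)
```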