Is there any function in pandas or sklearn, like graphlab-create's graphlab.text_analytics.count_words, to count the words of every row and create a new word-count column in a CSV data sheet?
Of course you can do it. The easiest solution is to use Counter:
from collections import Counter
import pandas as pd

data = {
    "Sentence": ["Hello World", "The world is mine", "World is big", "Hello you", "foo_bar bar", "temp"],
    "Foo": ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)  # create a fake dataframe

# Create a counter for every word
counter = Counter()

# Update the counter with every row of your dataframe
df["Sentence"].str.split(" ").apply(counter.update)

# You can check the result as a dict with counter.most_common(), but if you want a dataframe you can do
pd.DataFrame(counter.most_common(), columns=["Word", "freq"])
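If you rather want the count per row stored in a new column (closer, if I remember correctly, to what graphlab.text_analytics.count_words gives you), a minimal sketch could be the following; the column names word_counts and n_words are just examples:

df["word_counts"] = df["Sentence"].str.split(" ").apply(Counter)  # one Counter (word -> count) per row
df["n_words"] = df["Sentence"].str.split(" ").str.len()  # total number of words per row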
Note that you may have to pre-process the text upfront (convert it to lowercase, apply a stemmer, ...). For example, with my test dataframe you get:
{'Hello': 2, 'The': 1, 'World': 2, 'bar': 1, 'big': 1, 'foo_bar': 1, 'is': 2, 'mine': 1, 'temp': 1, 'world': 1, 'you': 1}
and you can see that "World" = 2 while "world" = 1, because I didn't normalize the case of the text.
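For instance, a minimal sketch that lowercases the text before counting (reusing the dataframe from above) merges those two entries:

counter_lower = Counter()
df["Sentence"].str.lower().str.split(" ").apply(counter_lower.update)
counter_lower.most_common()  # "world" now appears with a count of 3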
You can also look at other solutions like scikit-learn's CountVectorizer or the TF-IDF vectorizer (TfidfVectorizer).
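For the CountVectorizer route, a rough sketch on the same dataframe could look like this (it lowercases by default; get_feature_names_out requires a fairly recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(df["Sentence"])  # sparse document-term matrix, one row per sentence
word_freq = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
word_freq.sum()  # corpus-wide frequency of every word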
I hope this helps,
Nicolas
