
Is there any function in pandas or sklearn, like `graphlab.text_analytics.count_words` in GraphLab Create, that counts the words in every row and adds a new word-count column to a CSV data sheet?

Andreas Rossberg


Of course you can do it. The easiest solution is to use Counter:

from collections import Counter

import pandas as pd

data = {
    "Sentence": ["Hello World", "The world is mine", "World is big", "Hello you", "foo_bar bar", "temp"],
    "Foo": ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)  # create a fake dataframe

# Create a counter for all words
counter = Counter()

# Update the counter with every row of your dataframe
df["Sentence"].str.split(" ").apply(counter.update)

# You can inspect the result as a list of pairs with counter.most_common(),
# but if you want a dataframe you can do:
pd.DataFrame(counter.most_common(), columns=["Word", "freq"])
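Note that the question also asks for a word count per row, as a new column. A minimal sketch of that with the same toy dataframe, using pandas string methods (the `word_count` column name is my own choice):

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence": ["Hello World", "The world is mine", "World is big",
                 "Hello you", "foo_bar bar", "temp"],
})

# Split each sentence on spaces and count the resulting tokens per row
df["word_count"] = df["Sentence"].str.split(" ").str.len()
print(df["word_count"].tolist())  # [2, 4, 3, 2, 2, 1]
```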

Pay attention that you may need to pre-process the text upfront (convert to lower case, apply a stemmer, ...). For example, with my test dataframe you get:

{'Hello': 2, 'The': 1, 'World': 2, 'bar': 1, 'big': 1, 'foo_bar': 1, 'is': 2, 'mine': 1, 'temp': 1, 'world': 1, 'you': 1}

and you can see that you get "World" = 2 and "world" = 1 because I didn't lower-case the text.
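A minimal sketch of that pre-processing step, lower-casing the sentences before counting (same toy dataframe as above):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "Sentence": ["Hello World", "The world is mine", "World is big",
                 "Hello you", "foo_bar bar", "temp"],
})

counter = Counter()
# Lower-case first so "World" and "world" are counted as the same word
df["Sentence"].str.lower().str.split(" ").apply(counter.update)

print(counter["world"])  # 3, since the two spellings are now merged
```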

You can also look at other solutions like the CountVectorizer (link) or the TF-IDF vectorizer (link).

I hope it helps,

Nicolas

Nicolas M.