
I'm trying to count the individual words in a column of my data frame. It looks like this. In reality the texts are Tweets.

text
this is some text that I want to count
That's all I wan't
It is unicode text

So what I found from other stackoverflow questions is that I could use the following:

Count most frequent 100 words from sentences in Dataframe Pandas

Count distinct words from a Pandas Data Frame

My df is called result and this is my code:

from collections import Counter
result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
result2

I get the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-2f018a9f912d> in <module>()
      1 from collections import Counter
----> 2 result2 = Counter(" ".join(result['text'].values.tolist()).split(" ")).items()
      3 result2
TypeError: sequence item 25831: expected str instance, float found

The dtype of text is object, which from what I understand is correct for unicode text data.

Lam
  • Apparently there are float values in your dataframe. What do you want to do with them? Do you want to count them as well? – Anand S Kumar Oct 20 '15 at 16:21
  • Since these texts are supposed to be all Tweets I want to count them as well. If this column also contains float values does that mean that I collected tweets that are just numbers? (makes me curious which ones are float) – Lam Oct 20 '15 at 16:26
  • Yes, that is possible. – Anand S Kumar Oct 20 '15 at 16:27
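
To see which rows are not strings, one option is to test each cell's type. This is a minimal sketch, assuming a made-up stand-in for result['text']; the names texts, is_str and offenders are hypothetical, and the NaN entry is included only because missing values are a common source of floats in a text column, not because the question confirms it.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for result['text'] (assumption: not the asker's real data).
texts = pd.Series(
    ["this is some text", np.nan, "It is unicode text", 42.0],
    dtype=object,
)

# True where the cell holds a Python str, False for floats (including NaN).
is_str = texts.map(lambda v: isinstance(v, str)).astype(bool)

# The rows that would make " ".join(...) raise TypeError.
offenders = texts[~is_str]
print(offenders)

# How many of the offending rows are simply missing values?
n_missing = int(texts.isna().sum())
print(n_missing)
```

If most of the offenders turn out to be NaN, the column probably contains empty tweets rather than tweets that are just numbers.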

2 Answers

8

The error occurs because some of the values in your series (result['text']) are of type float, and str.join() accepts only strings. If you want to include them in ' '.join() as well, you need to convert the floats to strings before passing them to str.join().

You can use Series.astype() to convert all the values to strings. You also do not need .tolist(); you can pass the series directly to str.join(). Example -

result2 = Counter(" ".join(result['text'].astype(str)).split(" ")).items()

Demo -

In [60]: df = pd.DataFrame([['blah'],['asd'],[10.1]],columns=['A'])

In [61]: df
Out[61]:
      A
0  blah
1   asd
2  10.1

In [62]: ' '.join(df['A'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-62-77e78c2ee142> in <module>()
----> 1 ' '.join(df['A'])

TypeError: sequence item 2: expected str instance, float found

In [63]: ' '.join(df['A'].astype(str))
Out[63]: 'blah asd 10.1'
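
One caveat, stated as an assumption about the data rather than something the question confirms: if the floats are missing values (NaN), converting them to text yields the literal string "nan", which would then be counted as a word. A minimal sketch that drops missing rows before joining (the series texts is a made-up stand-in for result['text']):

```python
from collections import Counter

import numpy as np
import pandas as pd

# Made-up stand-in for result['text']; the NaN plays the role of a missing tweet.
texts = pd.Series(["blah asd", np.nan, "asd 10.1"], dtype=object)

# str() of a float NaN is the literal string "nan".
print(str(float("nan")))

# Drop missing rows first, then convert any remaining non-strings and join.
cleaned = texts.dropna().astype(str)
counts = Counter(" ".join(cleaned).split())
print(counts.most_common(3))
```

Whether to drop the missing rows or keep them as "nan" depends on what the counts are for; dropping is usually the safer default for word frequencies.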
Anand S Kumar
  • Thanks, that seems to work. Now the output is in a dict. Would it be logical to move it back to a pandas data frame, or to somehow keep working within a df? – Lam Oct 20 '15 at 16:40
  • It depends on what you intend to do, but my guess is that a dataframe would be faster if you intend to do some kind of analysis. – Anand S Kumar Oct 20 '15 at 16:43
  • Generic answer to generic question :D When I have a specific question I will make a new question. Thanks for the help! – Lam Oct 20 '15 at 16:45
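
If the counts do need to go back into pandas, one way is to build a frame from Counter.most_common(). A minimal sketch, with a made-up sentence in place of the joined tweets; the column names word and count are arbitrary choices, not anything from the question:

```python
from collections import Counter

import pandas as pd

# Made-up text standing in for the joined tweets.
counts = Counter("this is some text and this is more text".split())

# most_common() returns (word, count) pairs sorted by count, highest first.
freq = pd.DataFrame(counts.most_common(), columns=["word", "count"])
print(freq.head())
```

From there the usual dataframe operations (filtering, sorting, merging) apply to the word frequencies.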
2

In the end I went with the following code:

pd.set_option('display.max_rows', 100)
words = pd.Series(' '.join(result['text'].astype(str)).lower().split(" ")).value_counts()[:100]
words

The underlying problem, however, was solved by Anand S Kumar's answer.
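
One detail that may matter for tweets: split(" ") produces empty strings wherever two spaces occur in a row, and those end up in the counts. Calling split() with no argument splits on any run of whitespace instead. A minimal sketch with a made-up sentence (the names naive and robust are hypothetical):

```python
import pandas as pd

# Made-up text with doubled spaces, as can happen in scraped tweets.
text = "Some  text with  some double spaces"

# split(" ") keeps an empty string for every extra space.
naive = pd.Series(text.lower().split(" ")).value_counts()

# split() with no argument collapses runs of whitespace.
robust = pd.Series(text.lower().split()).value_counts()

print(naive)
print(robust)
```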

Lam