
I have a table like below:

      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd

I want to count the frequency of all the words in the Firm_Name column, to get an output like below:

[image: desired output, a table listing each unique word with its frequency]

I have tried the following code:

import pandas as pd
import nltk

data = pd.read_csv(r"X:\Firm_Data.csv")  # raw string so the backslash is not treated as an escape

top_N = 20
word_dist = nltk.FreqDist(data['Firm_Name'])

print('All frequencies')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

However, this code does not produce a count of the unique words.


4 Answers


If I understand it correctly, use `value_counts()`:

In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society       3
Ltd           2
James's       1
R.X.          1
Yah           1
Associates    1
St            1
Kensington    1
MMV           1
Big           1
&             1
The           1
Co            1
Oil           1
Building      1
dtype: int64
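
To see why this works: `expand=True` spreads each row's words across columns (padding shorter rows with `None`), and `stack()` then flattens those columns into a single Series, dropping the padding. A minimal sketch of the intermediate steps, using the `df` above:

wide = df.Firm_Name.str.split(expand=True)  # DataFrame: one word per column, short rows padded with None
flat = wide.stack()                         # Series: one word per row, padding dropped
print(flat.value_counts())                  # same result as the one-liner above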

Or, using numpy (this assumes `import numpy as np`):

pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()

Or,

pd.Series(' '.join(df.Firm_Name).split()).value_counts()

For the top N words, for example N = 3:

In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society    3
Ltd        2
James's    1
dtype: int64

Details

In [3380]: df
Out[3380]:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd
Zero
  • I’ll certainly be accepting your answer. I’m waiting in order to milk the open answer aspect. – piRSquared Oct 17 '17 at 16:17
  • `.split(expand=True).stack()` is a really clever option on small data, but it quickly runs out of memory on data of any size: since it expands a matrix column for every word position in `Firm_Name`, the columns explode while most cells hold no observation. – data_steve Feb 01 '18 at 21:39
  • @Zero After getting the top N using `pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]`, how can I delete these words from my original dataframe column? – Jeeth Oct 08 '18 at 18:13
  • This is a great solution, thanks! @data_steve: you say it will run out of memory on "data of any size"; what sort of ball-park are we talking (if running on e.g. a recent MacBook Pro)? – Chris Aug 27 '19 at 10:47
  • Thank you. What if I only want the first occurrence of each word per row? How can I eliminate repeats of the same word in the same row? – Nithin Nampoothiry Nov 29 '19 at 10:41
  • As has been noted, `.split(expand=True).stack().value_counts()` takes a TON of extra memory if your string data is variable length. Try `.str.split().explode().value_counts()` instead; it does the exact same thing without allocating any extra memory. – William Gerecke Jul 30 '20 at 02:45
  • I tested the speed of all three of these methods, along with the fourth option (`.explode()`) from @WilliamGerecke's comment. My df had 500k rows and each row contained between 0 and 400 words. The results, rounded to the nearest second: `df.Firm_Name.str.split(expand=True).stack().value_counts()` – 77 seconds; `pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()` – 24; `pd.Series(' '.join(df.Firm_Name).split()).value_counts()` – 4; `df.Firm_Name.str.split().explode().value_counts()` – 6. – Nicko Aug 14 '20 at 22:10
  • Why do you need `stack` in the first snippet? – Marc Mar 30 '22 at 18:12
  • It returns a false result because it is case sensitive: this counts "Foo" and "foo" separately. A case-insensitive variant is sketched below. – zoltron May 27 '23 at 17:51
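
Regarding zoltron's case-sensitivity comment, one possible case-insensitive variant (a sketch, not part of the original answer) is to lowercase the joined string before splitting:

pd.Series(' '.join(df.Firm_Name).lower().split()).value_counts()

This folds "Foo" and "foo" into a single count; the same idea works for the other snippets via `df.Firm_Name.str.lower()`.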

You need `str.cat` with `lower` first to concatenate all values into one string, then `word_tokenize`, and finally your original solution:

top_N = 4
# lowercase everything first; skip str.lower() if case matters
a = data['Firm_Name'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
print(word_dist)
<FreqDist with 17 samples and 20 outcomes>

rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
      Word  Frequency
0  society          3
1      ltd          2
2      the          1
3       co          1

It is also possible to remove `lower` if you need case-sensitive counts:

top_N = 4
a = data['Firm_Name'].str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(a)
word_dist = nltk.FreqDist(words)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
         Word  Frequency
0     Society          3
1         Ltd          2
2         MMV          1
3  Kensington          1
jezrael
  • Thanks so much, great solution. – J Reza Nov 16 '17 at 17:17
  • Yes, it is up to you which answer is accepted. Both solutions work: Zero's solution is pure pandas and mine is pure nltk. There should be some small differences, since `nltk` changes the data a bit in `word_tokenize`; see the sketch below. – jezrael Nov 22 '17 at 14:45
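
To illustrate that difference (a small sketch, not from the original answer; it assumes the NLTK punkt tokenizer data is downloaded): `word_tokenize` splits off possessives and some punctuation, while a plain `split` only breaks on whitespace.

import nltk

s = "St James's Society"
print(s.split())                        # ['St', "James's", 'Society']
print(nltk.tokenize.word_tokenize(s))   # ['St', 'James', "'s", 'Society']

Token splits like these are why the FreqDist above reports more outcomes (20) than there are whitespace-separated words (18).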

This is faster:

df.Firm_Name.str.split().explode().value_counts()
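
To turn this into the Word/Frequency table the question asks for, one possible follow-up (a sketch reusing the question's `top_N`):

top_N = 20
counts = df.Firm_Name.str.split().explode().value_counts()
rslt = counts.head(top_N).rename_axis('Word').reset_index(name='Frequency')  # Series -> two-column DataFrame
print(rslt)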

miguelfg
  • It returns a false result because it is case sensitive: this counts "Foo" and "foo" separately. – zoltron May 27 '23 at 17:52
  • @zoltron the question doesn't specify that it has to be case insensitive. But if needed, convert it all to lowercase before exploding, like `df.Firm_Name.str.split().str.lower().explode().value_counts()`. – miguelfg May 31 '23 at 08:37
  • `freq = df.Firm_Name.str.lower().str.split().explode().value_counts()`, then `print(freq)`. (Note that `str.lower` has to come before `str.split`, since `str.lower` does not operate on lists of words.) – zoltron Jun 01 '23 at 16:02

The approach from Count distinct words from a Pandas Data Frame can also be used. It builds a `collections.Counter` and updates it with the words from each row:

import pandas as pd
from collections import Counter

c = Counter()
df = pd.DataFrame(
    [[104472, "R.X. Yah & Co"],
     [104873, "Big Building Society"],
     [109986, "St James's Society"],
     [114058, "The Kensington Society Ltd"],
     [113438, "MMV Oil Associates Ltd"]],
    columns=["URN", "Firm_Name"])
df.Firm_Name.str.split().apply(c.update)  # update the Counter with each row's word list
print(c)

Counter({'R.X.': 1,
         'Yah': 1,
         '&': 1,
         'Co': 1,
         'Big': 1,
         'Building': 1,
         'Society': 3,
         'St': 1,
         "James's": 1,
         'The': 1,
         'Kensington': 1,
         'Ltd': 2,
         'MMV': 1,
         'Oil': 1,
         'Associates': 1})
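
From here, `Counter.most_common` gives the top N words directly, so the question's desired output table can be built the same way as in the original attempt (a sketch):

top_N = 20
rslt = pd.DataFrame(c.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)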
biophetik