
I have a pandas DataFrame with approximately 1.5 million rows. I want to count the occurrences of specific, selected words (all of which are known) in a certain column. The following works for a single word:

d = df["Content"].str.contains("word").value_counts()

But I want to count the occurrences of multiple known words, like "word1" and "word2", from a list. In addition, word2 may appear as either word2 or wordtwo, and the two variants should be counted together, like so:

word1           40
word2/wordtwo   120

How do I accomplish this?

dnclem

2 Answers


IMO one of the most efficient approaches would be to use sklearn.feature_extraction.text.CountVectorizer, passing it a vocabulary (the list of words you want to count).

Demo:

In [20]: import numpy as np
    ...: import pandas as pd

In [21]: text = """
    ...: I have a pandas data frame with approximately 1.5 million rows. I want to find the number of occurrences of
    ...: specific, selected words in a certain column. This works for a single word. But I want to find out the
    ...: occurrences of multiple, known words like "word1", "word2" from a list. Also word2 could
    ...: be word2 or wordtwo, like so"""

In [22]: df = pd.DataFrame(text.split('. '), columns=['Content'])

In [23]: df
Out[23]:
                                             Content
0  \nI have a pandas data frame with approximatel...
1  I want to find the number of occurrences of sp...
2                       This works for a single word
3  But I want to find out the occurrences of mult...
4      Also word2 could be word2 or wordtwo, like so

In [24]: from sklearn.feature_extraction.text import CountVectorizer

In [25]: vocab = ['word', 'words', 'word1', 'word2', 'wordtwo']

In [26]: vect = CountVectorizer(vocabulary=vocab)

In [27]: res = pd.Series(np.ravel(vect.fit_transform(df['Content']).sum(axis=0)),
    ...:                 index=vect.get_feature_names())

In [28]: res
Out[28]:
word       1
words      2
word1      1
word2      3
wordtwo    1
dtype: int64
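
If you need word2 and wordtwo reported under a single key, as in the desired output, you can merge their counts after the fact. A minimal sketch continuing from the res Series above (the combined count of 4 simply adds the two rows of Out[28]):

In [29]: res['word2/wordtwo'] = res['word2'] + res['wordtwo']

In [30]: res.drop(['word2', 'wordtwo'])
Out[30]:
word             1
words            2
word1            1
word2/wordtwo    4
dtype: int64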
MaxU - stand with Ukraine
  • Very nice use of `CountVectorizer` - didn't think of that! – Ami Tavory May 05 '18 at 09:39
  • Accepting this one as the answer even though the other one is simpler. A great approach, because I am dealing with millions of rows, so this will come in handy. Thanks. – dnclem May 05 '18 at 09:50
  • @rayanisran, I've slightly corrected my solution, removing `.A` (which converts the sparse matrix to a dense one), so it should be more memory-efficient now... :-) – MaxU - stand with Ukraine May 05 '18 at 09:54

You could create a dictionary like this:

{w: df["Content"].str.contains(w).sum() for w in words}

where words is the list of words you want to count.
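
For example, a quick sketch on the toy df from the other answer: since str.contains interprets its pattern as a regular expression by default, the word2/wordtwo variants can be folded into a single entry with an alternation pattern:

words = ['word1', 'word2|wordtwo']

counts = pd.Series({w: df['Content'].str.contains(w).sum() for w in words})

On that toy data this gives word1 -> 1 and word2|wordtwo -> 2. Two caveats: str.contains(w).sum() counts rows containing the word at least once rather than total occurrences, and words containing regex metacharacters should be escaped with re.escape (or matched with regex=False).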

Ami Tavory