I have a large DataFrame (~4M rows) with one column of strings, each containing a sentence:

sentence
"john went for a ride with his new car"
"miranda took her dog out for a walk"
"my dog hates car rides, he feels sick"

I want to filter out rows that contain only common words. In other words, if a sentence contains a previously unseen word (or a word that has been seen fewer than X times across all the rows above), I would like to keep the row; otherwise I drop it.

Since this is sequential (for each row I have to maintain a dictionary of how many times each word has been seen so far, and base the decision on that), I guess my only option is to loop over my DataFrame.

Have I missed any possibility of avoiding looping?
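For reference, the row-by-row loop I am hoping to avoid would look something like this (a minimal sketch; the threshold x and the helper name filter_rows_loop are made up for illustration):

from collections import Counter

def filter_rows_loop(df, x=3):
    # Keep a row if it contains at least one word that has been seen
    # fewer than x times in the rows above it.
    seen = Counter()
    keep = []
    for sentence in df['sentence']:
        words = sentence.split()
        keep.append(any(seen[w] < x for w in words))
        seen.update(words)
    return df[keep]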

EDIT: Billy's solution below (the accepted one) is a great approach. However, the .toarray() call did not work for me since my matrix is gigantic. With the help of this thread, I solved it for the sparse format. The resulting code is here:

import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer


def sparseCumsum(matrix):
    # CSC stores each column's nonzero values contiguously in .data,
    # so a per-column cumulative sum is just a cumsum over each slice.
    a = scipy.sparse.csc_matrix(matrix)
    indptr = a.indptr
    data = a.data
    for i in range(a.shape[1]):
        st = indptr[i]
        en = indptr[i + 1]
        np.cumsum(data[st:en], out=data[st:en])
    # The conversion above copies the data of a CSR input, so the caller
    # must use the returned matrix rather than the one passed in.
    return a


def reduceSentences(df):
    vectorizer = CountVectorizer(min_df=0, analyzer='word', ngram_range=(1, 1))
    countMatrix = vectorizer.fit_transform(df['sentence'])

    countMatrix = sparseCumsum(countMatrix)
    # .max(axis=1) on a sparse matrix returns a sparse (n, 1) column vector
    df['max_freq'] = countMatrix.max(axis=1).toarray().ravel()
    return df.loc[df['max_freq'] < 3]
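
A quick sanity check on the toy data from the question (a hypothetical driver, assuming the functions above plus import pandas as pd):

import pandas as pd

df = pd.DataFrame({'sentence': [
    "john went for a ride with his new car",
    "miranda took her dog out for a walk",
    "my dog hates car rides, he feels sick",
]})
print(reduceSentences(df))  # all three rows pass the < 3 cutoff on this sample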
  • Previously unseen word based on what? Do you already have a list of words? – Billy Bonaros Jan 14 '21 at 15:38
  • @BillyBonaros Ah sorry, that might have been a bit unclear. By unseen I mean based on the rows above. I have made it clearer in the question now. – Becczor Jan 14 '21 at 15:40

1 Answer


A solution is to use CountVectorizer to get the table of word frequencies and then take its cumulative sum down the rows. Then, by taking the max of each row, you can see the maximum number of times any word in that sentence has been used so far. For example, if the number is 1, then all words are new.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df1 = pd.DataFrame({'sentence':
    ["john went for a ride with his new car",
     "miranda took her dog out for a walk",
     "my dog hates car rides, he feels sick"]})

v = CountVectorizer(min_df=0, analyzer='word', ngram_range=(1, 1))
x = v.fit_transform(df1['sentence'])
f = pd.DataFrame(x.toarray(), columns=v.get_feature_names())

# cumulative count of each word down the rows; the row-wise max is the
# number of times the most frequent word of that row has been seen so far
df1['max_freq'] = f.reset_index(drop=True).cumsum().max(axis=1)

print(df1)


                                sentence  max_freq
0  john went for a ride with his new car         1
1    miranda took her dog out for a walk         2
2  my dog hates car rides, he feels sick         2

Now you can filter your sentences using max_freq. For example, to keep only the sentences whose most frequent word has been seen at most 2 times so far:

df1[df1['max_freq'] <= 2]
  • I like the approach! But I'm not sure this gives the intended result. The max_freq for the last row should be 2, coming from both "dog" and "car". – Becczor Jan 14 '21 at 16:15
  • I found the error: it was the old index that was not dropped when resetting. Now it works fine! Thank you Billy! – Becczor Jan 14 '21 at 16:26
  • Great! Have fun! – Billy Bonaros Jan 14 '21 at 16:44
  • Realized that the x.toarray() did not work, since my matrix was gigantic. Had to do some research to make cumsum work for a sparse matrix. I will add my final code to the question. – Becczor Jan 18 '21 at 10:55