I have a large DataFrame (~4M rows) with a single column of strings, each one a sentence:
sentence
"john went for a ride with his new car"
"miranda took her dog out for a walk"
"my dog hates car rides, he feels sick"
I want to filter out rows that only contain common words. In other words, if a sentence contains a previously unseen word (or a word that has been seen fewer than X times) across all the rows above it, I would like to keep the row; otherwise I want to drop it.
Since this is inherently sequential (for each row I have to maintain a dictionary of words and how many times each has been seen so far, and base the decision on that), I guess my only option is to loop over the DataFrame.
Have I missed any possibility of avoiding looping?
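For concreteness, this is the kind of row-by-row loop I am trying to avoid (a minimal sketch; the name filter_loop, the threshold default of 3, and the whitespace tokenization are just for illustration):

from collections import Counter

def filter_loop(df, threshold=3):
    # Keep a row if it contains at least one word that has been seen
    # fewer than `threshold` times in the rows above it.
    seen = Counter()
    keep = []
    for sentence in df['sentence']:
        words = sentence.split()
        keep.append(any(seen[w] < threshold for w in words))
        seen.update(words)  # update counts only after deciding on this row
    return df[keep]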
EDIT: Billy's solution below (the accepted one) is a great approach. However, the .toarray() call did not work for me, since densifying my gigantic matrix blew up memory. With the help of this thread, I solved it in a sparse format. The resulting code is:
import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

def sparseCumsum(matrix):
    # CSC format stores each column (word) contiguously in .data,
    # so a running count down the rows can be taken per column.
    a = scipy.sparse.csc_matrix(matrix)
    indptr = a.indptr
    data = a.data
    for i in range(a.shape[1]):
        st = indptr[i]
        en = indptr[i + 1]
        # In-place cumulative sum: each entry becomes the number of
        # times this word has been seen up to and including that row.
        np.cumsum(data[st:en], out=data[st:en])
    return a  # the CSC conversion copies, so hand back the modified matrix

def reduceSentences(df):
    # min_df=1 is the default; newer scikit-learn rejects min_df=0,
    # which is equivalent here anyway
    vectorizer = CountVectorizer(min_df=1, analyzer='word', ngram_range=(1, 1))
    countMatrix = vectorizer.fit_transform(df['sentence'])
    countMatrix = sparseCumsum(countMatrix)
    # Highest running count among the words of each row; ravel() flattens
    # the (n_rows, 1) result into a 1-D column.
    df['max_freq'] = countMatrix.max(axis=1).toarray().ravel()
    return df.loc[df['max_freq'] < 3]
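For reference, a quick sketch of calling it on the example data from the question (assuming the imports and functions above are in scope):

import pandas as pd

df = pd.DataFrame({'sentence': [
    "john went for a ride with his new car",
    "miranda took her dog out for a walk",
    "my dog hates car rides, he feels sick",
]})

# Keeps the rows in which no word has yet reached a running count of 3
print(reduceSentences(df))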