In R, I can extract the rows (documents) that contain a particular term, say 'toyota', by intersecting the column names of a document-term matrix (dtm) with the required term, like so:
dtm <- DocumentTermMatrix(mycorpus, control = list(tokenize = TrigramTokenizer))
x.df<-as.matrix(dtm[1:ncorpus, intersect(colnames(dtm), "toyota"),drop=FALSE])
The problem is that I can't find an equivalent method in Python's sklearn package, so I go about it in a roundabout way:
- First, I get the index values of the rows where the relevant column ("toyota") in the tfidf frame is not null; the column names are the feature names.
- Then I slice the main pandas dataframe on the identified row indices.
- Now I have a dataframe in which each row contains "toyota".
Minimal example here:
rows_to_keep=tfidf_df[tfidf_df.toyota.notnull()].index
data=my_df.loc[rows_to_keep,:]
print(data.shape)
This works. The problem is: how do I use a loop variable as the column name in this statement?
car_make=['toyota','ford','nissan','gmotor','honda','suzuki']
Then
for zentity in car_make:
    rows_to_keep=tfidf_df[tfidf_df.zentity.notnull()].index
does not work:
AttributeError: 'SparseDataFrame' object has no attribute 'zentity'
I purposely chose the name zentity so that it would not collide with any column name in the tfidf frame.
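My guess is that tfidf_df.zentity looks for a column literally named "zentity" rather than using the value held in the variable, which would explain the AttributeError. Bracket indexing, tfidf_df[zentity], looks the column up by the variable's value instead. Below is a minimal sketch of that idea. It uses small, made-up plain DataFrames as stand-ins for my real tfidf_df and my_df (the values and the "text" column are assumptions for illustration only), and rows_by_make is a hypothetical name for collecting the results.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real tfidf_df and my_df (assumed data).
tfidf_df = pd.DataFrame(
    {"toyota": [0.5, np.nan, 0.2], "ford": [np.nan, 0.7, np.nan]}
)
my_df = pd.DataFrame({"text": ["toyota corolla", "ford focus", "toyota hilux"]})

car_make = ["toyota", "ford", "nissan"]

rows_by_make = {}  # hypothetical container: one sliced frame per make
for zentity in car_make:
    # Skip makes that are not feature names, to avoid a KeyError.
    if zentity not in tfidf_df.columns:
        continue
    # Bracket indexing uses the *value* of zentity as the column name.
    rows_to_keep = tfidf_df[tfidf_df[zentity].notnull()].index
    rows_by_make[zentity] = my_df.loc[rows_to_keep, :]

for make, frame in rows_by_make.items():
    print(make, frame.shape)
```

If attribute-style access were really needed, getattr(tfidf_df, zentity) should behave the same way for valid column names, but bracket indexing seems the more conventional choice.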
Is there a clean way to make the intersection and extract only the rows where the column is not null (NaN)? Any help will be appreciated.
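To make concrete what I mean by "intersection", here is a rough sketch of the behaviour I am after, mirroring R's intersect(colnames(dtm), ...). It assumes a plain pandas DataFrame with NaN for absent terms; the sample data, the "text" column, and the names present and mask are made up for illustration. It keeps every row in which at least one of the requested makes is not null.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real tfidf_df and my_df (assumed data).
tfidf_df = pd.DataFrame(
    {
        "toyota": [0.5, np.nan, np.nan, 0.2],
        "ford": [np.nan, 0.7, np.nan, np.nan],
        "engine": [0.1, 0.1, 0.3, np.nan],
    }
)
my_df = pd.DataFrame(
    {"text": ["toyota engine", "ford engine", "engine only", "toyota hilux"]}
)

car_make = ["toyota", "ford", "nissan", "gmotor", "honda", "suzuki"]

# Intersect the feature names with the requested terms,
# similar in spirit to intersect(colnames(dtm), car_make) in R.
present = tfidf_df.columns.intersection(car_make)

# True for rows where any of the matched columns is not null.
mask = tfidf_df[present].notnull().any(axis=1)

# Slice the main frame on the row labels that passed the mask.
data = my_df.loc[mask.index[mask], :]
print(data.shape)
```

Swapping any(axis=1) for all(axis=1) would instead keep only the rows that mention every matched make.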