
I have an NLP project where I would like to remove the words that appear only once across the keywords. That is to say, for each row I have a list of keywords and their frequencies.

I would like something like

if the frequency of a word across the whole ['keywords'] column == 1, then replace it with "".

I cannot test word by word. My idea was to build a list of all the words, remove the duplicates, then for each word in this list sum its counts and delete it if the total is 1. But I have no idea how to do that. Any ideas? Thanks!

Here is what my data looks like:

sample.head(4)

    ID  keywords                                            age sex
0   1   fibre:16;quoi:1;dangers:1;combien:1;hightech:1...   62  F
1   2   restaurant:1;marrakech.shtml:1  35  M
2   3   payer:1;faq:1;taxe:1;habitation:1;macron:1;qui...   45  F
3   4   rigaud:3;laurent:3;photo:11;profile:8;photopro...   46  F
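The idea described above (collect all words, sum their counts over the whole column, then drop the singletons) can be sketched directly in pandas. This is a minimal sketch, assuming the keywords column uses the `word:count;word:count` format shown, with a small hypothetical sample in place of the real data:

```python
import pandas as pd
from collections import Counter

# Hypothetical sample in the same "word:count;word:count" format as above
sample = pd.DataFrame({
    'keywords': ['fibre:16;quoi:1;dangers:1',
                 'restaurant:1;quoi:1',
                 'payer:1;fibre:2']
})

def parse(cell):
    # split a cell into {word: count} pairs
    return {w: int(c) for w, c in (p.split(':') for p in cell.split(';'))}

# total frequency of each word over the whole column
totals = Counter()
for cell in sample['keywords']:
    totals.update(parse(cell))

def drop_singletons(cell):
    # rebuild the cell, keeping only words whose column-wide total is > 1
    kept = {w: c for w, c in parse(cell).items() if totals[w] > 1}
    return ';'.join(f'{w}:{c}' for w, c in kept.items())

sample['keywords'] = sample['keywords'].apply(drop_singletons)
```

With the toy sample, `dangers`, `restaurant`, and `payer` each total 1 and are removed, while `fibre` and `quoi` survive.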
Me.Ch

2 Answers


To add to what @jpl mentioned about scikit-learn's CountVectorizer: it has an option, min_df, that does exactly what you want, provided you can get your data into the right format. Here's an example:

from sklearn.feature_extraction.text import CountVectorizer
# assuming you want the token to appear in >= 2 documents
vectorizer = CountVectorizer(min_df=2)
documents = ['hello there', 'hello']
X = vectorizer.fit_transform(documents)

This gives you:

# Notice the dimensions of our array – 2 documents by 1 token
>>> X.shape
(2, 1)
# Here is a count of how many times the tokens meeting the inclusion
# criteria are observed in each document (as you see, "hello" is seen once
# in each document)
>>> X.toarray()
array([[1],
       [1]])
# this is the entire vocabulary our vectorizer knows – see how "there" is excluded?
>>> vectorizer.vocabulary_
{'hello': 0}
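To apply this to the semicolon-separated keyword format in the question, one option is to pass CountVectorizer a custom tokenizer that strips the counts. This is a sketch with hypothetical data; note that min_df filters by document frequency (the number of rows a word appears in), not by total count, so it matches the question exactly only when each word appears at most once per row:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cells in the "word:count;word:count" format
docs = ['fibre:16;quoi:1;dangers:1', 'restaurant:1;quoi:1', 'payer:1;fibre:2']

def keyword_tokenizer(cell):
    # keep only the word part of each "word:count" pair
    return [pair.split(':')[0] for pair in cell.split(';')]

# min_df=2 keeps only tokens appearing in at least two rows
vec = CountVectorizer(tokenizer=keyword_tokenizer, min_df=2, lowercase=False)
X = vec.fit_transform(docs)
```

Here only `fibre` and `quoi` occur in two or more rows, so the vocabulary shrinks to those two tokens.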
blacksite
    yes! I indeed used a vectorizer to train my classification model anyway, but I didn't know about min_df. Thanks for vectorizer.vocabulary_ too, which allows me to explore the tokens further! – Me.Ch Apr 13 '20 at 16:28

Your representation makes that difficult. You should build a dataframe where each column is a word; then you can easily use pandas operations like sum to do whatever you want.
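That wide representation can be sketched as follows; the per-row word counts here are hypothetical stand-ins for the parsed keywords:

```python
import pandas as pd

# Hypothetical wide representation: one column per word, one row per
# document, values are counts (missing words become 0)
rows = [{'fibre': 16, 'quoi': 1}, {'quoi': 1}, {'payer': 1}]
wide = pd.DataFrame(rows).fillna(0)

# column-wise sum gives each word's frequency over the whole corpus
totals = wide.sum()

# keep only the columns (words) whose total count is greater than 1
wide = wide[totals[totals > 1].index]
```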

However, this will lead to a very sparse dataframe, which is rarely desirable.

Many libraries, e.g. scikit-learn's CountVectorizer, let you do this efficiently.

jpl