I am trying to remove stopwords from my data, and I used this statement to load the stopword list:

stop = set(stopwords.words('english'))

This list includes the single character 'd' as a stopword, so when I apply it in my function, 'd' is removed from my text. Please see the attached picture for reference and guide me on how to fix this.

1 Answer

I checked the code and noticed that you are applying the rem_stopwords function to the clean_text column, when you should apply it to the Tweet column.

Beyond that, the stopword filter removes d, I, and other single characters only when they are independent tokens. A token here is a word you get after splitting on spaces. So if you have i'd, neither d nor I is removed, because they are combined into one word. However, if you have 'I like Football', I is removed, because it is an independent token.
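
To make the token behaviour concrete, here is a minimal sketch. It assumes a tiny hand-written stopword set (`demo_stop`) in place of NLTK's much longer English list, and a hypothetical helper `remove_stopwords`; neither comes from your code. The last call shows what can happen if the text was already cleaned so that the apostrophe became a space.

```python
# Hypothetical stand-in for NLTK's English stopword list (illustration only).
demo_stop = {"i", "d", "a", "the"}


def remove_stopwords(text, stop_words):
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


# "I" stands alone as a token, so it is removed.
print(remove_stopwords("I like Football", demo_stop))

# "I'd" is one token ("i'd" is not in demo_stop), so it is kept whole.
print(remove_stopwords("I'd like a drink", demo_stop))

# If earlier cleaning turned the apostrophe into a space, "i" and "d"
# become separate tokens and both are removed.
print(remove_stopwords("i d like a drink", demo_stop))
```

This is why the column you filter matters: the same filter gives different results on raw text and on text whose punctuation was already stripped.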

You can try this code; it should solve your problem:

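import nltk
import pandas as pd
from nltk.corpus import stopwords

# Download the stopword list once, then load the English set.
nltk.download('stopwords')
stop = set(stopwords.words('english'))

# df is your existing DataFrame. Filter the raw Tweet column token by token,
# comparing lowercase forms so that "I" matches the stopword "i".
df['clean_text'] = df['Tweet'].apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stop))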
Farid
  • I changed the code as you said, but now I am getting a blank clean_text column. – Farheen Fatima Oct 03 '22 at 01:56
  • Can you try this line of code instead of the `rem_stopwords` function? `df['clean_text'] = df['Tweet'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in (stop)]))` – Farid Oct 03 '22 at 07:10