
While cleaning the texts, we remove words like 'the', 'this', and even 'not'.

From my analysis, the y_pred vector has a 1 for the second review because it ignored 'not'. If 'not' had been kept, the accuracy might have improved.

Is there a way, or a more advanced method, to keep such important words?

Nitin1901
    Do not remove them, use the word embeddings since many times negation plays an important role in NLP. – Bharath M Shetty Mar 06 '20 at 05:30
  • "While cleaning the texts, **we remove** the words like 'the', 'this' and even 'not'." The most advanced technique, then, is to stop doing that. – Jongware Mar 06 '20 at 08:48
  • @usr2564301 If we stop cleaning entirely, how many columns will we end up with? And how will we handle them? – Nitin1901 Mar 06 '20 at 13:45

1 Answer


Removing stopwords - simple functional words like "the" and "is" - from texts is a cleaning technique useful for some kinds of text analysis; if you're looking at frequency of words, for example, then getting rid of boring-but-common words is a good idea.
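For example, a word-frequency count becomes much more informative once function words are dropped. A minimal sketch, using a small hand-picked stopword set for illustration (a real project would use a fuller list such as NLTK's):

```python
from collections import Counter

# Hand-picked stopword set for illustration only
stops = {"the", "is", "a", "and"}

text = "the movie is great and the acting is great"
counts = Counter(w for w in text.split() if w not in stops)
print(counts.most_common(2))  # [('great', 2), ('movie', 1)]
```

Without the filter, 'the' and 'is' would dominate the top of the list and tell you nothing about the text.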

However, it's not appropriate for all (or even most) types of NLP; it's not helpful in translation, for example, and lots of models are capable of dealing with grammatical noise words on their own.

Even when you are doing a task which requires removing stopwords, you can always tailor your stopwords to the task at hand.

If you're filtering out stopwords using NLTK's stopwords lists, for example, deciding to keep the word "not" is just a question of removing it from the stopwords list before filtering:

from nltk.corpus import stopwords

# Load NLTK's English stopword list, then drop "not"
# so that negation survives the cleaning step
stops = stopwords.words("english")
stops.remove("not")
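With 'not' removed from the stopword list, a filtering pass keeps it in the cleaned text. A minimal sketch, using a small hand-written list in place of NLTK's so it is self-contained:

```python
# Stand-in for nltk.corpus.stopwords.words("english")
stops = ["the", "this", "is", "a", "not"]
stops.remove("not")  # keep negation

review = "this is not a good movie"
cleaned = [w for w in review.split() if w not in stops]
print(cleaned)  # ['not', 'good', 'movie']
```

The negation now reaches the model, so a review like this one is no longer cleaned into the opposite sentiment.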
Peritract