Do you have to clean your test data before feeding into an NLP model?

Question

This is a natural language processing related question.

Suppose I have a labelled train set and an unlabelled test set. After cleaning my train data (stopword removal, stemming, punctuation removal, etc.), I use the cleaned data to build my model.

When applying the model to my test data, do I also have to clean the test text in the same manner as I did with my train set, or should I not touch the test data at all?

Thanks!

Data cleaning pipeline is generally the same for both train and test data — Abishek Bashyal, Feb 21 '21 at 10:32

score 0 · Accepted Answer · answered Feb 21 '21 at 13:04

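Yes, you should apply the exact same preprocessing to both your training and testing datasets. The model learned its features from cleaned text, so the test text must be cleaned the same way for those features to match.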
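As a minimal sketch of what "the same preprocessing" can look like in practice, one option is to put the cleaning steps in a single function and call it on both sets. The stopword list, the `clean_text` name, and the sample sentences below are made up for illustration; your own steps (stemming, etc.) would go in the same function.

```python
import string

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "this", "of", "to", "and"}

def clean_text(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

# Made-up example data.
train_texts = ["The movie was great!", "This is a terrible plot."]
test_texts = ["A GREAT plot, and the movie..."]

# The same function is applied to both sets, so the tokens are comparable.
train_tokens = [clean_text(t) for t in train_texts]
test_tokens = [clean_text(t) for t in test_texts]

print(train_tokens)
print(test_tokens)
```

Keeping the steps in one function means a change to the cleaning logic automatically applies to train and test alike.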

Kevin Yobeth

score 0 · Answer 2 · answered Aug 13 '21 at 17:03

Yes, data cleaning is an essential step in most machine learning and NLP problems. Always clean your data first, and only then feed it to the model.

Regarding test and train data cleaning: you can clean both; there is no harm in doing so.
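To illustrate why cleaning the test data matters, here is a rough sketch that measures how many test tokens a vocabulary built from cleaned training text recognises, with and without cleaning the test text. The `clean_text` and `known_fraction` helpers and the sample sentences are hypothetical, and the cleaning is deliberately simplified to lowercasing and punctuation removal.

```python
import string

def clean_text(text):
    # Hypothetical cleaning step: lowercase and strip punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

# Made-up training data; the vocabulary is learned from cleaned train text only.
train_texts = ["Great movie!", "Terrible plot."]
vocab = {tok for text in train_texts for tok in clean_text(text)}

def known_fraction(tokens):
    """Share of tokens that appear in the training vocabulary."""
    return sum(tok in vocab for tok in tokens) / len(tokens)

test_text = "GREAT plot, terrible movie!"
raw = known_fraction(test_text.split())          # test text left uncleaned
cleaned = known_fraction(clean_text(test_text))  # test text cleaned like train

print(raw, cleaned)
```

In this toy case most raw test tokens ("GREAT", "plot,", "movie!") fail to match the vocabulary, while the cleaned tokens all match, so the model would see far more of the features it was trained on.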
