
This is a natural language processing (NLP) question.

Suppose I have a labelled train set and an unlabelled test set. After cleaning my train data (stopword removal, stemming, punctuation removal, etc.), I use the cleaned data to build my model.

When applying the model to my test data, do I also have to clean the test text in the same manner as I did with my train set? Or should I not touch the test data at all?

Thanks!

graphboy

2 Answers


Yes, you should apply exactly the same preprocessing to both your training and test datasets.
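
One way to guarantee that is to put every cleaning step in a single function and call it on both sets. The sketch below is only an illustration using the standard library: the stopword list, the `naive_stem` helper, and the sample texts are all made up for this example, and in practice you would likely swap in a real stemmer and stopword list.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "was", "this", "and", "of"}

def naive_stem(token):
    """Toy suffix stripper; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean_text(text):
    """Lowercase, strip punctuation, drop stopwords, then stem."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(naive_stem(t) for t in tokens)

# Made-up sample data.
train_texts = ["The movie was AMAZING!", "Boring plot, wasted evening."]
test_texts = ["This movie is boring..."]

# The same function is applied to both sets, so the model sees
# test tokens in the same form as the tokens it was trained on.
train_clean = [clean_text(t) for t in train_texts]
test_clean = [clean_text(t) for t in test_texts]

print(train_clean)
print(test_clean)
```

Because both sets go through `clean_text`, "Boring" in the train set and "boring..." in the test set end up as the same token.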

Kevin Yobeth

Yes. Data cleaning is an essential step in most machine learning and NLP problems, so you should always clean your data first and only then feed it to the model.

Regarding cleaning the test and train data: you can clean both; there is no harm in doing so.
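
To see why cleaning both sets matters, here is a rough sketch that measures how many test tokens are found in a vocabulary built from cleaned training text. The `clean` and `coverage` helpers and the sample sentences are hypothetical, and the cleaning is reduced to lowercasing and punctuation removal to keep it short.

```python
import string

def clean(text):
    """Hypothetical minimal cleaner: lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

# Made-up sample data.
train_texts = ["Great phone, great battery!", "Terrible screen."]
test_text = "Great battery, terrible phone!"

# Vocabulary learned from the *cleaned* training text.
vocab = {tok for t in train_texts for tok in clean(t).split()}

def coverage(text):
    """Fraction of whitespace tokens that appear in the training vocabulary."""
    tokens = text.split()
    return sum(tok in vocab for tok in tokens) / len(tokens)

raw_coverage = coverage(test_text)            # test text left untouched
clean_coverage = coverage(clean(test_text))   # test text cleaned like train

print(raw_coverage, clean_coverage)
```

With the raw test text, tokens such as "Great" and "phone!" do not match the cleaned training vocabulary, so the model would treat them as unknown words even though it has effectively seen them.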