I am trying to train a model which takes a mixture of numerical, categorical and text features. My question is which one of the following should I do for vectorizing my text and categorical features?

  1. Split my data into train, cv and test before vectorizing, i.e. call `vectorizer.fit(train)` and then `vectorizer.transform(cv)` and `vectorizer.transform(test)`
  2. Call `vectorizer.fit_transform` on the entire data

My goal is to `hstack` all of the above features and apply Naive Bayes. I think I should split my data into train and test sets before this point, in order to find the optimal hyperparameter for NB.
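For concreteness, option 1 combined with `hstack` and Naive Bayes could be sketched roughly as below. This is only an illustration with invented toy data: the arrays `texts`, `cats`, `nums`, the helper `featurize`, and the candidate `alpha` values are all assumptions, and `MultinomialNB` is used on the assumption that every feature is non-negative.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only.
texts = np.array(["good product", "bad service", "good service", "bad product",
                  "great product", "poor service", "great service", "poor product"])
cats = np.array([["a"], ["b"], ["a"], ["b"], ["a"], ["b"], ["a"], ["b"]])
nums = np.array([[1.0], [0.0], [2.0], [0.5], [3.0], [0.2], [2.5], [0.1]])  # non-negative
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Split row indices first: 2 rows for test, 2 for cv, 4 for train.
i_rest, i_test = train_test_split(np.arange(len(y)), test_size=2,
                                  stratify=y, random_state=0)
i_train, i_cv = train_test_split(i_rest, test_size=2,
                                 stratify=y[i_rest], random_state=0)

text_vec = CountVectorizer()
cat_enc = OneHotEncoder(handle_unknown="ignore")

def featurize(idx, fit=False):
    """Hypothetical helper: vectorize each feature group and stack them."""
    if fit:  # learn vocabulary and categories from these rows only
        text_part = text_vec.fit_transform(texts[idx])
        cat_part = cat_enc.fit_transform(cats[idx])
    else:    # reuse what was learned on the training rows
        text_part = text_vec.transform(texts[idx])
        cat_part = cat_enc.transform(cats[idx])
    return hstack([text_part, cat_part, csr_matrix(nums[idx])]).tocsr()

X_train = featurize(i_train, fit=True)
X_cv = featurize(i_cv)
X_test = featurize(i_test)

# Pick the smoothing parameter on the cv split, then score once on test.
best_alpha, best_score = None, -1.0
for alpha in (0.1, 1.0, 10.0):
    score = MultinomialNB(alpha=alpha).fit(X_train, y[i_train]).score(X_cv, y[i_cv])
    if score > best_score:
        best_alpha, best_score = alpha, score

test_score = MultinomialNB(alpha=best_alpha).fit(X_train, y[i_train]).score(X_test, y[i_test])
```

In this sketch the vectorizer and encoder only ever see the training rows during fitting, so the cv and test scores are not influenced by their own vocabulary.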

Please share your thoughts on this. I am new to data science.

Axe319
  • Welcome to SO, which is about *specific coding* questions; non-coding questions about machine learning theory & methodology are off-topic here, and should be posted at [Cross Validated](https://stats.stackexchange.com/help/on-topic) instead. Please notice the **NOTE** in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). – desertnaut Sep 30 '20 at 18:54
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML methodology. – desertnaut Sep 30 '20 at 18:55

2 Answers

If you are going to fit anything like an imputer or a `StandardScaler` to the data, I recommend doing that after the split, so that nothing from the test set leaks into your training set. However, formatting, simple transformations of the data, and one-hot encoding can be done safely on the entire dataset, which avoids some extra work.
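A minimal sketch of fitting an imputer and a scaler after the split, assuming scikit-learn's `SimpleImputer` and `StandardScaler` and a made-up single-column array `X`:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric column with missing values, invented for illustration.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0], [np.nan], [8.0]])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()

# Statistics (column mean, standard deviation) come from the training rows only.
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))

# The test rows are transformed with those training statistics, never refit.
X_test_prepared = scaler.transform(imputer.transform(X_test))
```

Note that standardized features can be negative, so this particular scaler would not pair with `MultinomialNB`; it is shown here only to illustrate the fit-after-split pattern.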

whege

I think you should go with the second option, i.e. `vectorizer.fit_transform` on the entire data. If you split the data first, some values that appear in the test set may not appear in the training set, and in that case those classes may remain unrecognised.
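To make the unseen-value case concrete, here is a small sketch of what scikit-learn does when the encoders are fit on the training rows only. The toy values are invented, and `handle_unknown="ignore"` is an assumption about how one might configure `OneHotEncoder`, not something the original post specifies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Categorical feature: "green" never appears in the training rows.
train_cats = [["red"], ["blue"]]
test_cats = [["green"]]

enc = OneHotEncoder(handle_unknown="ignore")  # without this, transform raises an error
enc.fit(train_cats)
unseen_row = enc.transform(test_cats).toarray()  # unseen category becomes an all-zero row

# Text feature: words outside the training vocabulary are skipped at transform time.
vec = CountVectorizer()
vec.fit(["red apple", "blue sky"])
oov_row = vec.transform(["green grass"]).toarray()  # no known words, so all zeros
```

So with these settings an unseen value does not break the transform step; it simply contributes no information to that row.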