I am trying to train a model which takes a mixture of numerical, categorical and text features. My question is which one of the following should I do for vectorizing my text and categorical features?
- Split my data into train, cv and test before vectorizing the features, i.e. use vectorizor.fit(train), then vectorizor.transform(cv) and vectorizor.transform(test)
- Use vectorizor.fit_transform on the entire data
My goal is to hstack all of the above features and apply Naive Bayes. I think I should split my data into train and test sets before this point, in order to find the optimal hyperparameter for NB.
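To make the first option concrete, here is a rough sketch of what I have in mind. The toy data, the encoder variable, and the choice of CountVectorizer, OneHotEncoder and MultinomialNB are only assumptions for illustration, not my real setup. I have left out the cv split for brevity; it would be transformed the same way as test.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one text, one categorical and one numerical feature.
train_text = ["good movie", "bad movie", "great plot", "terrible plot"]
train_cat = np.array([["a"], ["b"], ["a"], ["b"]])
train_num = np.array([[1.0], [0.0], [2.0], [0.5]])
train_y = np.array([1, 0, 1, 0])

test_text = ["good plot", "awful movie"]   # "awful" never appears in train
test_cat = np.array([["a"], ["c"]])        # category "c" never appears in train
test_num = np.array([[1.5], [0.2]])

vectorizor = CountVectorizer()
encoder = OneHotEncoder(handle_unknown="ignore")  # unseen categories become all zeros

# Fit the vocabulary and the category list on the training split only.
X_train = hstack([
    vectorizor.fit_transform(train_text),
    encoder.fit_transform(train_cat),
    csr_matrix(train_num),
]).tocsr()

# Reuse the fitted transformers on test; nothing is learned from test here.
X_test = hstack([
    vectorizor.transform(test_text),
    encoder.transform(test_cat),
    csr_matrix(test_num),
]).tocsr()

# MultinomialNB needs non-negative features, which holds for this toy data.
model = MultinomialNB(alpha=1.0)  # alpha is the hyperparameter I would tune on cv
model.fit(X_train, train_y)
pred = model.predict(X_test)
```

With this approach, words and categories that occur only in test are simply ignored at transform time, so train and test end up with the same number of columns.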
Please share your thoughts on this. I am new to data science.