
I am new to NLP and I am trying to do a text classification task. Before doing this, I know that we should do word embedding. My question is: should I train the word embeddings only on the training data (so that the testing data gets its vectors from the embedding model pre-trained on the training data), or on both the training data and the testing data?

Nils Cao

1 Answer


This is a very important question. What people in the NN community typically do is pick a frequency threshold on the training set (e.g. frequency <= 2) and replace every word at or below that threshold with an UNK token. Then at test time, any word that does not appear in the training vocabulary is replaced by UNK's representation.
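A minimal sketch of that idea (library-free Python; the function names, the `<UNK>` string, and the exact threshold rule are illustrative choices, not a fixed convention):

    from collections import Counter

    UNK_TOKEN = "<UNK>"

    def build_vocab(train_sentences, min_freq=2):
        """Build a word->index vocabulary from the training data only.
        Words occurring <= min_freq times are dropped and will map to UNK."""
        counts = Counter(word for sent in train_sentences for word in sent)
        vocab = {UNK_TOKEN: 0}
        for word, freq in counts.items():
            if freq > min_freq:
                vocab[word] = len(vocab)
        return vocab

    def encode(sentence, vocab):
        """Map words to indices; rare/unseen words fall back to UNK's index."""
        return [vocab.get(word, vocab[UNK_TOKEN]) for word in sentence]

    # Example usage
    train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
    test = [["the", "bird", "sat"]]          # "bird" never appears in training

    vocab = build_vocab(train, min_freq=1)   # keep only words seen more than once
    print(encode(test[0], vocab))            # "bird" maps to the UNK index

The embedding matrix is then sized to this training-set vocabulary (including the UNK entry), so test-time words never add new rows; unseen words simply reuse UNK's vector.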

user3639557