I'm taking my first steps in ML, specifically with classifiers for text sentiment analysis. My approach is the usual split: 80% of the dataset for training and 20% for testing. Once I have a trained model, what is the best way to proceed in a production environment when new features appear (words in new texts that were not present in the initial dataset)?
2 Answers
In a classification task, all features must be seen at training time; new features cannot be added later at the prediction phase. For your problem, you can use stemming or lemmatization, so that unseen inflected forms map to stems or lemmas the model already knows. Alternatively, you can use something like LDA or Word2Vec trained on a large number of documents.
This chapter could be useful: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
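
To illustrate why features are fixed at training time, here is a minimal sketch of a bag-of-words vectorizer in plain Python. The helper names `build_vocabulary` and `vectorize` and the toy sentences are hypothetical, made up for this example; a real pipeline would likely use a library vectorizer instead.

```python
from collections import Counter

def build_vocabulary(train_texts):
    """Map each distinct training word to a fixed column index."""
    words = sorted({w for text in train_texts for w in text.lower().split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocabulary):
    """Count words over the training vocabulary only; unseen words are dropped."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        index = vocabulary.get(word)
        if index is not None:  # word was seen at training time
            vector[index] = n
    return vector

# Hypothetical toy training data.
train_texts = ["good movie", "bad movie", "good plot"]
vocabulary = build_vocabulary(train_texts)

# "but" and "boring" were never seen, so they get no column.
new_vector = vectorize("good but boring movie", vocabulary)
```

The vector length always equals the size of the training vocabulary, which is why the classifier cannot receive a feature it was not trained on. Stemming or lemmatizing before building the vocabulary simply makes more new words land on existing columns.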

The problem that you are describing is generally known as the "out of vocabulary" (OOV) problem: words that appear in the test set but not in the training set. A traditional approach is to represent each OOV word with a special token, such as "UNKNOWN", and to make sure that token actually occurs in the training data (for example, by replacing rare training words with it), so that the model learns how to treat it. This approach is discussed more fully in Section 4.3 of "Speech and Language Processing" by Jurafsky and Martin.
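
A minimal sketch of the special-token idea, assuming whitespace tokenization and a rare-word threshold. The names `build_vocab`, `replace_oov`, the `min_count=2` cutoff, and the toy sentences are all hypothetical choices for illustration, not something prescribed by the book.

```python
from collections import Counter

UNK = "UNKNOWN"  # special token standing in for any out-of-vocabulary word

def build_vocab(train_texts, min_count=2):
    """Keep words seen at least min_count times; rarer words will become UNK."""
    counts = Counter(w for text in train_texts for w in text.lower().split())
    vocab = {UNK}
    vocab.update(w for w, c in counts.items() if c >= min_count)
    return vocab

def replace_oov(text, vocab):
    """Tokenize and replace every word outside the vocabulary with UNK."""
    return [w if w in vocab else UNK for w in text.lower().split()]

# Hypothetical toy training data.
train_texts = ["great movie", "terrible movie", "great acting", "dull plot"]
vocab = build_vocab(train_texts)

# Rare training words become UNK, so the model sees the token during training.
train_tokens = [replace_oov(t, vocab) for t in train_texts]

# At prediction time, unseen words map to the same token.
new_tokens = replace_oov("great but tedious movie", vocab)
```

Because rare training words are mapped to the token too, the classifier learns a weight for "UNKNOWN" and can handle new words in production without any change to its feature set.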
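A more modern approach is to use word embeddings such as Word2Vec, which represent each word as a dense vector learned by a shallow neural network. Embeddings pretrained on a large corpus cover many words that never appear in your own training set, although this is a more advanced topic.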
