I'm working with a dataset of email content that I want to transform with doc2vec. The dataset is labeled (spam/not-spam) and unbalanced (90-10 ratio). My question is: when tokenizing the email content, should I first oversample (using SMOTE), or is it OK to use the dataset as is?
I'm voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro & **NOTE** in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). - desertnaut Jan 07 '21 at 11:40
1 Answer
Try both and pick whichever works better.
(Separately: avoid using the known-labels as the document-identifiers in `Doc2Vec`, as in practice that turns the dataset into just two giant documents, far too few for training doc-vectors of any useful dimensionality, instead of the many varied documents that are needed for an interesting/useful high-dimensional doc-vector set.)
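
To illustrate the tagging point, here is a minimal sketch of giving every email its own tag while keeping the spam/not-spam labels in a separate list. The toy `emails` data, the `tag_emails` helper, the `email_{i}` tag format, and the whitespace tokenizer are all hypothetical; the `TaggedDocument` defined here is only a stand-in with the same `words` and `tags` fields as gensim's class, so the snippet runs without gensim installed.

```python
from collections import namedtuple

# Stand-in with the same fields (words, tags) as gensim's TaggedDocument.
# With gensim installed, gensim.models.doc2vec.TaggedDocument can be used instead.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Hypothetical toy data: (email text, label) pairs.
emails = [
    ("Win a free prize now", "spam"),
    ("Meeting moved to 3pm", "not-spam"),
    ("Lunch tomorrow?", "not-spam"),
]


def tag_emails(emails):
    """Give every email its own unique tag; keep labels in a parallel list."""
    docs, labels = [], []
    for i, (text, label) in enumerate(emails):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        # One tag per email, NOT the class label, so training sees many documents.
        docs.append(TaggedDocument(words=tokens, tags=[f"email_{i}"]))
        labels.append(label)
    return docs, labels


docs, labels = tag_emails(emails)
```

Documents shaped like this are what gensim's `Doc2Vec` expects as its training corpus. The resulting per-email vectors, paired with `labels`, can then feed a downstream classifier, which is also the stage where both variants (with and without oversampling) can be compared, since SMOTE interpolates between numeric feature vectors rather than raw text.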

gojomo