I'm working with a dataset of email contents that I want to transform with doc2vec. The dataset is labeled (spam/not-spam) and imbalanced (roughly a 90-10 ratio). My question is: when tokenizing the emails' content, should I oversample first (using SMOTE), or is it OK to use the dataset as is?

desertnaut
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro & **NOTE** in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). – desertnaut Jan 07 '21 at 11:40

1 Answer

Try both, and pick whichever works better.

(Separately: avoid using the known labels as the document identifiers, i.e. the tags, in Doc2Vec. In practice that turns the dataset into just two giant documents, far too few for training doc-vectors of any useful dimensionality, instead of the many varied documents needed for an interesting/useful high-dimensional doc-vector set.)
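As a rough illustration of that tagging advice, here is a minimal sketch that gives each email its own unique tag and keeps the spam/not-spam labels in a separate list. The toy `emails` data, the `tag_emails` helper, and the `email_<i>` tag format are all made up for the example, and the whitespace tokenizer is only a placeholder. It assumes gensim's `TaggedDocument` (a namedtuple with `words` and `tags` fields) and falls back to an equivalent stand-in if gensim is not installed.

```python
try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without gensim.
    from collections import namedtuple
    TaggedDocument = namedtuple("TaggedDocument", "words tags")

# Hypothetical toy data: (email text, label) pairs.
emails = [
    ("win a free prize now", "spam"),
    ("meeting moved to friday", "not-spam"),
    ("lunch tomorrow at noon", "not-spam"),
]


def tag_emails(emails):
    """Build one tagged document per email; labels are kept separately."""
    corpus, labels = [], []
    for i, (text, label) in enumerate(emails):
        words = text.lower().split()  # placeholder tokenizer
        # Unique per-document tag, NOT the spam/not-spam label.
        corpus.append(TaggedDocument(words=words, tags=[f"email_{i}"]))
        labels.append(label)
    return corpus, labels


corpus, labels = tag_emails(emails)
```

The resulting `corpus` is the kind of iterable that could be handed to Doc2Vec for training, while `labels` would only come into play later, when a classifier is trained on the learned doc-vectors.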

gojomo