Tokenization of unbalanced dataset

Question

I'm working with a dataset of emails' content which I want to transform with doc2vec. This is a labeled dataset (spam/not-spam) and it is unbalanced (90-10 ratio). My question is: when tokenizing the emails' content, should I first oversample (using SMOTE), or is it ok to use the dataset as is?

I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro & **NOTE** in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). — desertnaut, Jan 07 '21 at 11:40

score 0 · Answer 1 · answered Jan 07 '21 at 17:59

Try both, and pick whichever works better on your evaluation.
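A minimal sketch of how the oversampled side of that comparison could be set up, under some loudly stated assumptions: SMOTE interpolates between numeric feature vectors, so it would apply to the doc-vectors rather than to raw tokens, and it normally comes from a separate package. To stay dependency-free, the code below swaps in plain random oversampling (duplicating minority rows) as a stand-in, not SMOTE itself. The helper name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class. Stand-in for SMOTE."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Sample with replacement to fill the gap up to the target size
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy 90/10 training split of one-dimensional "vectors" (made-up data)
train_rows = [[float(i)] for i in range(100)]
train_labels = ["ham"] * 90 + ["spam"] * 10

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(Counter(balanced_labels))
```

Oversampling only the training split and scoring both variants on the same untouched held-out split keeps the comparison fair, since duplicated rows in the test data would inflate the scores.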

(Separately: avoid using the known-labels as the document-identifiers in Doc2Vec, as in practice that turns the dataset into just two giant documents – far too few for training doc-vectors of any useful dimensionality – instead of the many varied documents that are needed for an interesting/useful high-dimensional doc-vector set.)
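As a rough illustration of that tagging advice, the sketch below gives every email its own integer tag and keeps the spam/not-spam labels in a separate list for a later classifier. The helper names (`build_tagged_corpus`, `train_doc2vec`), the whitespace tokenizer, the sample emails, and the hyperparameters are all assumptions for illustration; the training function assumes gensim 4.x is installed and is not called here.

```python
def build_tagged_corpus(token_lists):
    """Pair each tokenized email with its own unique integer tag,
    rather than with its spam/not-spam label."""
    return [(tokens, [doc_id]) for doc_id, tokens in enumerate(token_lists)]

def train_doc2vec(corpus, vector_size=50, epochs=20):
    """Train a Doc2Vec model on (tokens, tags) pairs. Assumes gensim 4.x."""
    # Imported lazily so the corpus-building code runs without gensim
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    documents = [TaggedDocument(words=words, tags=tags) for words, tags in corpus]
    model = Doc2Vec(vector_size=vector_size, min_count=1, epochs=epochs)
    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
    return model

# Made-up sample data: (email text, label)
emails = [
    ("win a free prize now", "spam"),
    ("meeting moved to friday", "not-spam"),
    ("lunch tomorrow at noon", "not-spam"),
]

# Naive whitespace tokenization, purely for illustration
token_lists = [text.lower().split() for text, _ in emails]
labels = [label for _, label in emails]  # kept apart from the Doc2Vec tags

corpus = build_tagged_corpus(token_lists)
```

With a trained model, `model.dv[doc_id]` would return the vector for one email, and those vectors together with `labels` could then feed whichever classifier is being compared.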
