What is the correct order for training the ML model?

Question

I have a dataset with an imbalanced multiclass dependent variable. Which of the following is the correct order of steps for training the model?

1) Standardizing -> oversampling -> train/test split

2) Train/test split -> standardizing -> oversampling

3) Train/test split -> oversampling -> standardizing

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

Welcome aboard.

As for your question, a better order is probably:

preprocessing -> train test split -> normalizing -> over/undersampling

Data cleaning and preprocessing

This must be your first task. It includes removing errors from the data and joining all the data you need, which may be scattered across the company.

Train/test split

This comes next, for two reasons:

1) If you normalize the dataset before the split, the scaling statistics are computed partly from test rows, so test data information contaminates your model training. Models must be able to deal with unseen values.
2) Test data must be real-world data, as it is. If you apply any type of sampling to it, you change that reality.
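
To make the split step concrete, here is a minimal sketch using scikit-learn's `train_test_split`. The toy dataset, the 80/20 split ratio, and the use of `stratify=y` (which keeps class proportions similar in both parts) are my own assumptions for illustration, not something the question specifies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced 3-class dataset: 80 / 15 / 5 samples per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 80 + [1] * 15 + [2] * 5)

# Split first. stratify=y is an assumption here: it keeps the rare classes
# represented in both the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# From here on the test set stays as it is: no resampling, no refitting on it.
print(np.bincount(y_train), np.bincount(y_test))
```

With a heavily imbalanced target, a plain random split can leave a rare class out of the test set entirely, which is why stratifying is often worth considering.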

Normalizing

Normalizing your data before sampling is good practice, because some sampling methods use models to generate new data points, and feeding them normalized data leads to better generated samples. Fit the normalization on the training set only and reuse the fitted parameters on the test set.
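
A small sketch of what "normalize after the split" can look like, assuming scikit-learn's `StandardScaler` is the normalization you want (the question says standardizing, so that seems a fair guess). The arrays are made-up stand-ins for an already split dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test parts, as they would come out of the split step.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
# Mean and standard deviation are learned from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# The test rows are transformed with the training statistics, never refitted.
X_test_scaled = scaler.transform(X_test)
```

The scaled test set will usually not have exactly zero mean and unit variance, and that is expected: it is being treated as unseen data.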

Sampling

Finally, resample your data (training set only, for the reason given under the train/test split). I recommend evaluating different sampling methods and sampling ratios and comparing the results.
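
As one possible starting point, here is a sketch of plain random oversampling with NumPy, applied to the training part only. The helper `random_oversample` is a hypothetical function written for this answer, and balancing every class up to the majority count is just one of the sampling ratios you could try.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # Draw the missing rows with replacement from this class.
            keep.append(rng.choice(idx, size=target - count, replace=True))
    idx_all = np.concatenate(keep)
    return X[idx_all], y[idx_all]

# Hypothetical training split that has already been scaled.
X_train = np.random.default_rng(1).normal(size=(80, 4))
y_train = np.array([0] * 64 + [1] * 12 + [2] * 4)

X_res, y_res = random_oversample(X_train, y_train)
print(np.bincount(y_res))  # [64 64 64]
```

Random duplication is the simplest method. The imbalanced-learn package provides alternatives such as `RandomOverSampler` and `SMOTE` with a `sampling_strategy` parameter for the ratio, which may be worth comparing against this baseline.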
