Using SMOTE on training data

Question

I have an unbalanced dataset and I want to use SMOTE. I am working with Azure ML and have read many examples in the Microsoft documentation. I am wondering why SMOTE is placed before the Split Data function rather than after it, applied only to the 70% partition used for training. All the examples I have seen put it before Split Data. Is that the right usage of SMOTE?

Here is an example from Microsoft: https://imaginemedia.blob.core.windows.net/content/Lab%20PDF%20-%20Churn%20Prevention%20and%20Intervention-db9732e3e8c6.pdf

The provided link is broken. In any case, you're right: you should split your data before applying any kind of modification, or your evaluation will not be "real". Keep in mind that if you change the distribution of your test data, you are essentially testing your estimator on a different problem from the one you want to solve. - OSainz, May 26 '19 at 20:05
I have added the screenshot from the PDF. I don't know whether Microsoft is unaware of the right usage of SMOTE. That would be strange, because all of their documents and examples that I have seen use the same arrangement. What do you think is best to do? - Mutatos, May 27 '19 at 06:54
To me, applying SMOTE to both the train and test data makes no sense: you are actually changing your problem when you change the distribution of your test data. But I could be wrong; maybe someone else can give us their point of view. - OSainz, May 27 '19 at 10:09
Yes, that is right: you change the problem and the distribution of the test data. But if the trained model is not good, it may also fail to match the added rows. Maybe that is Microsoft's reasoning? - Mutatos, May 28 '19 at 14:01
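
To make the "split first, then oversample only the training part" order from the comments concrete, here is a minimal sketch in plain Python. It is not the Azure ML SMOTE module or any library implementation: `stratified_split` and `smote_like` are hypothetical helpers written for illustration, the toy data is made up, and the interpolation is a simplified SMOTE-style step for a two-class problem.

```python
import math
import random

def stratified_split(rows, test_fraction=0.3, seed=0):
    """Hypothetical helper: split (features, label) rows per class into train/test."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({y for _, y in rows}):
        group = [row for row in rows if row[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

def smote_like(train, minority_label, k=3, seed=0):
    """Hypothetical helper: simplified SMOTE-style oversampling for two classes.

    Each synthetic point lies on the segment between a minority point and one
    of its k nearest minority neighbours. Only the rows passed in are used.
    """
    rng = random.Random(seed)
    minority = [x for x, y in train if y == minority_label]
    n_majority = len(train) - len(minority)
    if len(minority) < 2:
        return list(train)  # nothing to interpolate between
    synthetic = []
    for _ in range(max(0, n_majority - len(minority))):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        point = tuple(b + gap * (o - b) for b, o in zip(base, other))
        synthetic.append((point, minority_label))
    return list(train) + synthetic

# Made-up toy data: 90 majority rows (label 0) and 10 minority rows (label 1).
data_rng = random.Random(42)
rows = [((data_rng.gauss(0, 1), data_rng.gauss(0, 1)), 0) for _ in range(90)]
rows += [((data_rng.gauss(3, 1), data_rng.gauss(3, 1)), 1) for _ in range(10)]

# 1) Split first, so the test set keeps the original class distribution.
train, test = stratified_split(rows, test_fraction=0.3)

# 2) Oversample the training partition only; the test set is never touched.
balanced = smote_like(train, minority_label=1)

print("train before:", len(train), "train after:", len(balanced), "test:", len(test))
```

With this order the test partition still contains only real rows in their original proportions, so the evaluation reflects the problem you actually want to solve. Running the oversampling before the split would let synthetic rows derived from training points end up in the test set.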
