A proper start for generating synthetic data for a classification problem in Python

Question

I have a dataset with 9 features and 1300 rows, and I am trying to generate synthetic data based on it. The output has two classes: 1 (yes) and 0 (no). The problem is that almost 1100 cases have output "0" and only 200 cases have output "1". I previously tried training a model on this data, but the results weren't good. My professor suggested that I generate synthetic data to increase the number of "1" cases, which should help in developing the machine learning model. I admit I have no idea about synthetic data and don't know where to start. Could anyone help with how to approach this type of problem? Any suggestion is appreciated, and any reference code would be useful for learning purposes. Thanks.

Matheus Torquato · Answer 1 · 2019-06-19T10:01:52.233

0

As I understand it, you need to use data augmentation.

Have a look at this and/or this.

You'll be able to drastically increase the size of your dataset and potentially improve your training accuracy.

Something similar to this:

edited Jun 19 '19 at 10:01

answered Jun 19 '19 at 09:45


I am familiar with data augmentation, but I'm not dealing with pictures. Mine is an imbalanced dataset, for which I'm thinking about SMOTE, but I haven't been able to find reference blogs where someone has implemented it in Python. But thanks. – gendry Jun 19 '19 at 10:40
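
For reference, the core idea behind SMOTE on tabular data can be sketched in plain Python: pick a minority-class row, pick one of its nearest minority-class neighbours, and create a new row somewhere on the line between them. This is only a minimal sketch under stated assumptions, not a reference implementation. The function name `smote_sketch`, the choice of `k=5`, and the randomly generated toy data (9 numeric features, 1100 rows of class 0, 200 rows of class 1) are all made up for illustration, and every feature is assumed to be numeric and continuous. In practice, the imbalanced-learn package provides a maintained `SMOTE` class (`from imblearn.over_sampling import SMOTE`, used via `fit_resample(X, y)`), which would normally be preferable to hand-written code.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a randomly
    chosen minority row and one of its k nearest minority neighbours.
    Hypothetical helper for illustration; assumes numeric features."""
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority rows to `base` (Euclidean distance), excluding itself
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy stand-in for the dataset in the question (values are random, not real data)
data_rng = random.Random(42)
majority = [[data_rng.gauss(0.0, 1.0) for _ in range(9)] for _ in range(1100)]
minority = [[data_rng.gauss(2.0, 1.0) for _ in range(9)] for _ in range(200)]

# Generate enough synthetic "1" rows to match the number of "0" rows
new_rows = smote_sketch(minority, n_new=len(majority) - len(minority))

X = majority + minority + new_rows
y = [0] * len(majority) + [1] * (len(minority) + len(new_rows))
print(y.count(0), y.count(1))  # prints: 1100 1100
```

Two things worth keeping in mind: oversampling should be applied to the training split only, so that the test set still contains nothing but real rows, and plain interpolation like this only makes sense for continuous features; categorical columns would need different handling.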
