
I have a dataframe consisting of 2000 rows and 5 columns, as follows:

    my_data:
            Id,   f1,   f2,  f3,   f4(target_value)
            u1    34     sd  43        1
            u1    30     fd   3        0
            u1    01     sd  2.4       0
            ..    ..     ..   ..      .. 
            u1    13     sd  23        1
            u2    23     fd  12        0
            u2    30     fd   3        1
            u2    15     sd  2.4       0
            ..    ..     ..   ..      .. 
            u2    18     xd  20        0
            u3    66     ss  43        1
            u3    30     fd  23        1
            u3    50     sd  21        0
            ..    ..     ..   ..      .. 
            u3    37     sd  28        1

In this dataframe there are only a few instances per Id (e.g., u1 or u2) — around 10 or 13, at most 15 samples. Since I want to do classification and prediction tasks for each individual Id, this amount of data is not enough for an ML task. Is there any way I can generate artificial data points for every Id (something like oversampling) that a machine learning task can statistically rely on?

asked by Spedo
  • Yes, there are libraries for these things, but this may be off-topic for this site. As a tip, search for libraries for oversampling and SMOTE. – Paritosh Singh May 18 '19 at 10:41
  • I have tried to implement this: https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but that approach doesn't work. – Spedo May 18 '19 at 12:36
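Along the lines of the oversampling suggested in the comments, here is a minimal sketch of one simple approach: per Id (and per target class, to preserve the class balance), resample existing rows and jitter the numeric features with Gaussian noise scaled by each column's spread. This is not SMOTE proper (for that, see the `imbalanced-learn` library); it is a basic parametric augmentation, and the column names (`Id`, `f1`–`f4`) are taken from the question's example.

```python
import numpy as np
import pandas as pd

def augment_per_id(df, id_col="Id", target_col="f4", n_new=20, noise=0.1, seed=0):
    """For each (Id, target) group, resample existing rows and add Gaussian
    noise (scaled by each numeric column's std) to the numeric features.
    Non-numeric columns are copied from the sampled row unchanged."""
    rng = np.random.default_rng(seed)
    num_cols = [c for c in df.select_dtypes(include="number").columns
                if c != target_col]
    synthetic = []
    for (_uid, _y), grp in df.groupby([id_col, target_col]):
        # pick source rows with replacement from this group
        idx = rng.integers(0, len(grp), size=n_new)
        new = grp.iloc[idx].copy()
        for c in num_cols:
            std = grp[c].std(ddof=0)  # 0.0 for single-row groups
            new[c] = new[c].to_numpy() + rng.normal(0.0, noise * std, size=n_new)
        synthetic.append(new)
    return pd.concat([df] + synthetic, ignore_index=True)

# toy frame mimicking the layout in the question
df = pd.DataFrame({
    "Id": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "f1": [34, 30, 1, 23, 30, 15],
    "f2": ["sd", "fd", "sd", "fd", "fd", "sd"],
    "f3": [43.0, 3.0, 2.4, 12.0, 3.0, 2.4],
    "f4": [1, 0, 0, 0, 1, 0],
})
aug = augment_per_id(df, n_new=10)
```

Whether the synthetic points are "statistically reliable" depends on the noise model matching the real data; a common sanity check is to hold out real samples and verify that a model trained on augmented data still generalizes to them.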

0 Answers