
I have a dataframe consisting of 2000 rows and 5 columns, as follows:

    my_data:
            Id,   f1,   f2,  f3,   f4(target_value)
            u1    34     sd  43        1
            u1    30     fd   3        0
            u1    01     sd  2.4       0
            ..    ..     ..   ..      .. 
            u1    13     sd  23        1
            u2    23     fd  12        0
            u2    30     fd   3        1
            u2    15     sd  2.4       0
            ..    ..     ..   ..      .. 
            u2    18     xd  20        0
            u3    66     ss  43        1
            u3    30     fd  23        1
            u3    50     sd  21        0
            ..    ..     ..   ..      .. 
            u3    37     sd  28        1

In this dataframe there are only a few instances per Id (e.g., u1 or u2) — around 10 or 13, at most 15 samples. Since I want to do classification and prediction tasks for each individual Id, this amount of data is not enough for an ML task. Is there any way I can generate artificial data points for every Id (something like oversampling) that a machine learning task can statistically rely on?

asked by Spedo
  • Yes, there are libraries for these things, but this may be off-topic for this site. As a tip, search for libraries for oversampling and SMOTE. – Paritosh Singh May 18 '19 at 10:41
  • I have tried to implement this: https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data, but that approach doesn't work. – Spedo May 18 '19 at 12:36
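Along the lines of the oversampling suggested in the comments, here is a minimal sketch of one simple approach: per Id (and per target class, to preserve the class balance), resample existing rows and jitter the numeric features with Gaussian noise scaled by each column's spread. This is not SMOTE proper (for that, see the `imbalanced-learn` library); it is a basic parametric augmentation, and the column names (`Id`, `f1`–`f4`) are taken from the question's example.

```python
import numpy as np
import pandas as pd

def augment_per_id(df, id_col="Id", target_col="f4", n_new=20, noise=0.1, seed=0):
    """For each (Id, target) group, resample existing rows and add Gaussian
    noise (scaled by each numeric column's std) to the numeric features.
    Non-numeric columns are copied from the sampled row unchanged."""
    rng = np.random.default_rng(seed)
    num_cols = [c for c in df.select_dtypes(include="number").columns
                if c != target_col]
    synthetic = []
    for (_uid, _y), grp in df.groupby([id_col, target_col]):
        # pick source rows with replacement from this group
        idx = rng.integers(0, len(grp), size=n_new)
        new = grp.iloc[idx].copy()
        for c in num_cols:
            std = grp[c].std(ddof=0)  # 0.0 for single-row groups
            new[c] = new[c].to_numpy() + rng.normal(0.0, noise * std, size=n_new)
        synthetic.append(new)
    return pd.concat([df] + synthetic, ignore_index=True)

# toy frame mimicking the layout in the question
df = pd.DataFrame({
    "Id": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "f1": [34, 30, 1, 23, 30, 15],
    "f2": ["sd", "fd", "sd", "fd", "fd", "sd"],
    "f3": [43.0, 3.0, 2.4, 12.0, 3.0, 2.4],
    "f4": [1, 0, 0, 0, 1, 0],
})
aug = augment_per_id(df, n_new=10)
```

Whether the synthetic points are "statistically reliable" depends on the noise model matching the real data; a common sanity check is to hold out real samples and verify that a model trained on augmented data still generalizes to them.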

0 Answers