
I'm trying to use pandas to oversample my ragged data (groups of rows with different lengths).

Given the following data samples:

import pandas as pd

x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],'f1':[11,12,13,22,22,33,34,35,36,44,55,66,66]})
y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,0,1,0,0,0]})

Data (groups are separated with --- for convenience):

    id  f1
0    1  11
1    1  12
2    1  13
-----------
3    2  22
4    2  22
-----------
5    3  33
6    3  34
7    3  35
8    3  36
-----------
9    4  44
-----------
10   5  55
-----------
11   6  66
12   6  66

Targets:

   id  target
0   1       1
1   2       0
2   3       1
3   4       0
4   5       0
5   6       0

I would like to balance the minority class. In the sample above, target 1 is the minority class with 2 samples, for ids 1 & 3.

I'm looking for a way to oversample the data so the results would be:

    id  f1
0    1  11
1    1  12
2    1  13
-----------
3    2  22
4    2  22
-----------
5    3  33
6    3  34
7    3  35
8    3  36
-----------
9    4  44
-----------
10   5  55
-----------
11   6  66
12   6  66
-----------
13   7  11
14   7  12   <- replica of id 1
15   7  13
-----------
16   8  33
17   8  34   <- replica of id 3
18   8  35
19   8  36

And the targets would be balanced:

   id  target
0   1       1
1   2       0
2   3       1
3   4       0
4   5       0
5   6       0
6   7       1
7   8       1

With exactly 4 positive and 4 negative samples.

Shlomi Schwartz
  • Does the solution need to be in `pandas`? – artemis Nov 01 '21 at 15:21
  • @wundermahn Not really, at the end I'm using ndarray for the training – Shlomi Schwartz Nov 02 '21 at 08:08
  • A few questions, since I have never worked with this kind of problem and am not sure I understand. Do you only need the same number of `0` and `1` values in `target`, so the first step is to add `1` or `0` rows with consecutive ids? And then, for each new `id`, add 3 new rows picked at random from an original `id`? Or is it more complicated? – jezrael Nov 02 '21 at 09:49
  • @jezrael allow me to elaborate: each id in the targets df represents n rows of the data df with the same id (each sequence can have a different length). I would like to have the same number of 0 and 1 targets, so I need to replicate the data whose id is associated with target 1, keeping the rows in the same order. In the example above there are 6 targets in total; to balance the data I need to add 2 more '1' targets, so I copy the data of ids '1' and '3', append it to the end of the data df under 2 new ids, '7' and '8', and add those ids to the targets df as well. – Shlomi Schwartz Nov 03 '21 at 08:05
  • Please check out the updated question – Shlomi Schwartz Nov 03 '21 at 08:11

1 Answer

You can use:

import numpy as np
import pandas as pd

x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],
                  'f1':[11,11,11,22,22,33,33,33,33,44,55,66,66]})

# more general sample: 5 negative targets vs. 2 positive ones
y = pd.DataFrame({'id':[1,2,3,4,5,6,7],'target':[1,0,1,0,0,0,0]})

# for each target value, count how many rows are missing to reach
# the size of the largest class, then repeat the value that many times
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()

# helper df with new consecutive ids for the added targets, appended to y
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
y1 = pd.DataFrame({'id':range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                   'target':new})
y2 = pd.concat([y, y1], ignore_index=True)
print(y2)

# rows of y that belong to the class being oversampled
add = y[y['target'].eq(new[0])]

# cycle through the minority ids with np.tile (np.repeat is also possible),
# attach the new ids from y1 as a helper column and merge with x
add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
          .head(len(new))
          .assign(new = y1['id'].tolist())
          .merge(x, on='id', how='left')
          .drop('id', axis=1)
          .rename(columns={'new':'id'}))

# append the replicated rows to x
x2 = pd.concat([x, add], ignore_index=True)
print(x2)
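
To make the counting and cycling steps concrete, here is a small trace on the sample `y` above. It is only an illustration: the names `counts`, `deficit`, `positions` and `source_ids` are made up for this sketch, and passing `.to_numpy()` to `repeat` is a cautious choice rather than something the approach requires.

```python
import numpy as np
import pandas as pd

y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                  'target': [1, 0, 1, 0, 0, 0, 0]})

# rows per target value: 0 -> 5, 1 -> 2
counts = y['target'].value_counts()

# rows missing per target value to reach the majority count: 0 -> 0, 1 -> 3
deficit = counts.rsub(counts.max())

# one entry per target row that has to be created
new = deficit.index.repeat(deficit.to_numpy()).tolist()
print(new)  # [1, 1, 1]

# rows of y that currently carry the minority target
minority = y[y['target'].eq(new[0])]

# tile the minority index often enough, then cut to the number of new rows,
# so the source ids cycle in their original order
positions = np.tile(minority.index, len(new) // len(minority) + 1)[:len(new)]
source_ids = y.loc[positions, 'id'].tolist()
print(source_ids)  # [1, 3, 1]
```

So three new ids are needed, and they are filled from ids 1, 3 and then 1 again.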

The solution above works only for unbalanced data, because `new[0]` fails when `new` is empty. If `y` can sometimes already be balanced, guard against that case:

# balanced sample (x as defined above)
y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,1,1,0,0,0]})

# for each target value, count how many rows are missing to reach
# the size of the largest class, then repeat the value that many times
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()

if len(new) > 0:

    # helper df with new consecutive ids for the added targets, appended to y
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    y1 = pd.DataFrame({'id':range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                       'target':new})
    y2 = pd.concat([y, y1], ignore_index=True)
    print(y2)

    # rows of y that belong to the class being oversampled
    add = y[y['target'].eq(new[0])]

    # cycle through the minority ids with np.tile (np.repeat is also possible),
    # attach the new ids from y1 as a helper column and merge with x
    add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
              .head(len(new))
              .assign(new = y1['id'].tolist())
              .merge(x, on='id', how='left')
              .drop('id', axis=1)
              .rename(columns={'new':'id'}))

    # append the replicated rows to x
    x2 = pd.concat([x, add], ignore_index=True)
    print(x2)

else:
    print('y is already balanced')
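
For reuse, the same idea can be wrapped in a function. What follows is only a sketch that assumes a binary target: `oversample_minority` and its parameter names are hypothetical, and it cycles through the minority ids with `np.resize` instead of the `np.tile` and `head` combination used above. The `f1` values are the ones shown in the question's table.

```python
import numpy as np
import pandas as pd


def oversample_minority(x, y, id_col='id', target_col='target'):
    """Replicate whole id groups of the minority class until both classes match.

    Assumes a binary target; returns new (x, y) frames and leaves the inputs intact.
    """
    counts = y[target_col].value_counts()
    n_new = int(counts.max() - counts.min())
    if n_new == 0:
        # already balanced, nothing to replicate
        return x.copy(), y.copy()

    minority = counts.idxmin()
    minority_ids = y.loc[y[target_col].eq(minority), id_col].to_numpy()

    # np.resize repeats the array cyclically until it has n_new entries
    source_ids = np.resize(minority_ids, n_new)

    # fresh consecutive ids that continue after the largest existing id
    first_new = y[id_col].max() + 1
    new_ids = np.arange(first_new, first_new + n_new)

    # copy each source group in its original row order under its new id
    replicas = [x[x[id_col].eq(src)].assign(**{id_col: new})
                for src, new in zip(source_ids, new_ids)]

    x_out = pd.concat([x, *replicas], ignore_index=True)
    y_new = pd.DataFrame({id_col: new_ids, target_col: minority})
    y_out = pd.concat([y, y_new], ignore_index=True)
    return x_out, y_out


x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 6],
                  'f1': [11, 12, 13, 22, 22, 33, 34, 35, 36, 44, 55, 66, 66]})
y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'target': [1, 0, 1, 0, 0, 0]})

x_out, y_out = oversample_minority(x, y)
print(y_out)
print(x_out)
```

On the question's data this adds ids 7 and 8 as replicas of ids 1 and 3, giving 4 positive and 4 negative targets.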
jezrael