I'm trying to use pandas to oversample my ragged data (groups with different numbers of rows per id).
Given the following data samples:
import pandas as pd
x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],'f1':[11,12,13,22,22,33,34,35,36,44,55,66,66]})
y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,0,1,0,0,0]})
Data (groups are separated with --- for convenience):
id f1
0 1 11
1 1 12
2 1 13
-----------
3 2 22
4 2 22
-----------
5 3 33
6 3 34
7 3 35
8 3 36
-----------
9 4 44
-----------
10 5 55
-----------
11 6 66
12 6 66
Targets:
id target
0 1 1
1 2 0
2 3 1
3 4 0
4 5 0
5 6 0
I would like to balance the classes by oversampling the minority class. In the sample above, target 1 is the minority class with 2 samples (ids 1 and 3), versus 4 samples with target 0.
I'm looking for a way to oversample the data so the results would be:
id f1
0 1 11
1 1 12
2 1 13
-----------
3 2 22
4 2 22
-----------
5 3 33
6 3 34
7 3 35
8 3 36
-----------
9 4 44
-----------
10 5 55
-----------
11 6 66
12 6 66
-----------------
13 7 11
14 7 12 Replica of id 1
15 7 13
-----------------
16 8 33
17 8 34 Replica of id 3
18 8 35
19 8 36
And the targets would be balanced:
id target
0 1 1
1 2 0
2 3 1
3 4 0
4 5 0
5 6 0
6 7 1
7 8 1
With exactly 4 positive and 4 negative samples.
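To make the desired behavior concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes replicas get new ids continuing from the current maximum id (as in the expected output above), that each replica copies every row of the original group, and that there are exactly two classes. The names `oversample_groups` and `random_state` are hypothetical and only for illustration.

```python
import pandas as pd

x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 6, 6],
                  'f1': [11, 12, 13, 22, 22, 33, 34, 35, 36, 44, 55, 66, 66]})
y = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6], 'target': [1, 0, 1, 0, 0, 0]})


def oversample_groups(x, y, random_state=0):
    """Replicate whole minority-class groups until both classes are equal in size."""
    counts = y['target'].value_counts()
    minority = counts.idxmin()
    n_extra = counts.max() - counts.min()  # number of replica groups needed

    # Pick which minority ids to replicate; sample with replacement only
    # when more replicas are needed than there are minority ids.
    minority_ids = y.loc[y['target'] == minority, 'id']
    picked = minority_ids.sample(n=n_extra,
                                 replace=n_extra > len(minority_ids),
                                 random_state=random_state).sort_values()

    next_id = y['id'].max() + 1  # assumption: new ids continue from the max id
    x_parts, y_parts = [x], [y]
    for offset, old_id in enumerate(picked):
        new_id = next_id + offset
        # Copy every row of the original group under the new id.
        x_parts.append(x[x['id'] == old_id].assign(id=new_id))
        y_parts.append(pd.DataFrame({'id': [new_id], 'target': [minority]}))

    return (pd.concat(x_parts, ignore_index=True),
            pd.concat(y_parts, ignore_index=True))


x_bal, y_bal = oversample_groups(x, y)
```

On the sample data this appends two replica groups (ids 7 and 8), leaving `x_bal` with 20 rows and `y_bal` with 4 rows per class. When the majority class outnumbers the minority by more than the number of minority ids, the sketch samples ids with replacement, so some groups would be replicated more than once.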