
I have nearly 100,000 data points with 15 features, with 'disease' and 'no disease' as the target.

However, my data is imbalanced: 97% of the records are 'no disease' and only 3% are 'disease'. To compensate, I manually oversampled the minority class by making multiple copies of the actual 'disease' rows (one copy per suffix letter) and merging them with the original data, using the code below.

import pandas as pd

# Select the rows where disease == 1 (the minority class).
ia = df[df['disease'] == 1]

# Create one copy per suffix letter, giving each copy a unique
# 'patient ID' by appending the letter to the original ID.
dup = pd.DataFrame()
for suffix in ['B', 'C', 'E', 'F', 'G', 'H']:
    copy = ia.copy()
    copy['dum'] = suffix
    copy['patient ID'] = copy['Employee Code'] + copy['dum']
    dup = pd.concat([dup, copy])

# Add the copies to the original data.
df = pd.concat([dup, df])
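
For reference, this is roughly how I check the result of the duplication (a minimal sketch; it assumes the disease column is coded 0/1 and that 'patient ID' should stay unique):

# Class distribution as fractions (originally about 97% no-disease vs. 3% disease).
print(df['disease'].value_counts(normalize=True))

# Sanity check: every row should still have a unique patient ID after adding the copies.
print(df['patient ID'].is_unique)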

Please let me know if this is the correct method for oversampling.
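
For comparison, here is a minimal sketch of what I understand to be a more standard way to do the same random duplication with sklearn.utils.resample (this is an assumption on my part, not something I have tried; the column names are the ones from my data):

from sklearn.utils import resample
import pandas as pd

minority = df[df['disease'] == 1]
majority = df[df['disease'] == 0]

# Randomly duplicate minority rows (with replacement) up to the majority class size.
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

df_balanced = pd.concat([majority, minority_upsampled])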
