I have a large, imbalanced classification problem and want to address it by oversampling the minority classes (N(class 1) = 8.5 million, N(class n) = 3,000).
To do that, I want to draw 100,000 samples for each of the n classes with:
    data_oversampled = []
    for data_class_filtered in data:
        data_oversampled.append(data_class_filtered.sample(n=20000, replace=True))
where data is a list of class-specific DataFrames with len(data) = 10; the full dataset has shape (9448788, 97).
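For reference, here is a minimal self-contained sketch of this setup on toy data. The column name `label`, the feature columns, the class sizes and `n_per_class` are all assumptions made up for illustration; the real frame is far larger and has 97 columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the real frame: 10 classes, heavily imbalanced.
# The column name "label" and all sizes here are hypothetical.
n_classes = 10
sizes = [5000] + [30] * (n_classes - 1)
labels = np.repeat(np.arange(n_classes), sizes)
df = pd.DataFrame(rng.normal(size=(labels.size, 4)), columns=list("abcd"))
df["label"] = labels

# One DataFrame per class, matching the list called `data` in the question.
data = [group for _, group in df.groupby("label")]

n_per_class = 200  # toy value, much smaller than in the real problem
data_oversampled = []
for data_class_filtered in data:
    # Sampling with replacement allows drawing more rows than the class has.
    data_oversampled.append(
        data_class_filtered.sample(n=n_per_class, replace=True, random_state=0)
    )

# Concatenate once at the end instead of growing a frame inside the loop.
balanced = pd.concat(data_oversampled, ignore_index=True)
```

After this, `balanced` holds the same number of rows for every class.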
That works as expected, but unfortunately it takes an extremely long time. Is there a more efficient way to do the same thing?