
I have a large imbalanced classification problem and want to address it by oversampling the minority classes (N(class 1) = 8.5 million, N(class n) = 3000).

For that purpose I want to draw 100,000 samples for each of the n classes by

data_oversampled = []
for data_class_filtered in data:
    # draw a fixed-size sample with replacement from each class-specific DataFrame
    data_oversampled.append(data_class_filtered.sample(n=20000, replace=True))

where `data` is a list of class-specific DataFrames with `len(data)=10`, and the full dataset has shape `(9448788, 97)`.

That works as expected but unfortunately takes literally forever. Is there a more efficient way to do the same thing?
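For reference, here is a minimal self-contained sketch of that setup, assuming the class-specific DataFrames come from splitting one big frame on a label column (the column name `label` and the synthetic data are illustrative assumptions, not from the question):

import numpy as np
import pandas as pd

# Illustrative stand-in for the real data: one label column plus a feature column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "label": rng.integers(0, 10, size=100_000),
    "x": rng.normal(size=100_000),
})

# Split into class-specific DataFrames, as described above.
data = [group for _, group in df.groupby("label")]

# Oversample each class with replacement to a fixed size, then combine.
data_oversampled = [
    data_class_filtered.sample(n=20000, replace=True)
    for data_class_filtered in data
]
balanced = pd.concat(data_oversampled, ignore_index=True)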

Quastiat
  • what do you mean by "literally forever"? what's `len(data)` and what shapes are the dataframes? from a stats PoV: replicating the same values ~33 times seems like it would quickly bias estimates. maybe you could use a model that handles this more directly? – Sam Mason Oct 11 '19 at 19:16
  • I added some information about the data: `len(data)=10`, `data.shape=(9448788,97)`. Do you know a rule of thumb for an advisable degree of oversampling? It takes hours at least. – Quastiat Oct 12 '19 at 07:27
  • that's a lot of data, are you sure your computer isn't swapping? 9.5M*97 `float64`s takes ~7GB, and other data types could be a lot more! swapping can make things a million times slower, and randomly sampling from a dataframe would tend to trigger it – Sam Mason Oct 12 '19 at 21:11
  • you've not given enough detail for me to comment on the stats. maybe post another question about it specifically, probably to https://stats.stackexchange.com – Sam Mason Oct 12 '19 at 21:23
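As a quick sanity check of the memory estimate in the comment above, a rough sketch assuming all 97 columns are stored as float64 (the question does not give the actual dtypes):

rows, cols = 9448788, 97
bytes_total = rows * cols * 8            # 8 bytes per float64 value
print(f"{bytes_total / 1e9:.1f} GB")     # ~7.3 GB, before any copies made by sampling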

0 Answers