Handling Imbalanced Data with Large Dataset

Asked Nov 18 '21 at 04:08

Active Nov 18 '21 at 04:08

Viewed 337 times

I have a dataset of 6M+ rows and about 300 columns that I am currently preprocessing with dask in Python. I'm building a classifier, and there is a severe class imbalance. Normally I would handle this with the sampling methods in imblearn (random oversampling, undersampling, SMOTE, etc.), but I've read that imblearn doesn't work with dask because it needs all of the data to be read into memory first. Is there an equivalent method that works with dask, or would I need to do this with other big data tools such as Spark?

jxo

It looks like handling imbalanced datasets is still [a work in progress on dask-ml](https://github.com/dask/dask-ml/issues/317) – scj13 Mar 18 '22 at 00:11
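Until something like that lands, one way to get random undersampling without loading everything into memory is to decide row by row whether to keep each record, so every partition can be processed on its own. The following is only a minimal sketch of that idea, not a dask-ml or imblearn feature: the function `undersample_partition`, the column name `y`, and the `keep_frac` mapping are hypothetical names chosen for illustration, and the demo uses a small in-memory pandas frame in place of a real dask partition.

```python
import numpy as np
import pandas as pd


def undersample_partition(part, label_col, keep_frac, seed=0):
    """Keep each row with the probability assigned to its class.

    Classes that do not appear in keep_frac are kept in full.
    """
    rng = np.random.default_rng(seed)
    # Per-row keep probability; unmapped classes become NaN and default to 1.0.
    probs = part[label_col].map(keep_frac).fillna(1.0).to_numpy()
    # rng.random draws from [0, 1), so a probability of 1.0 always keeps the row.
    mask = rng.random(len(part)) < probs
    return part[mask]


# Toy stand-in for one partition: 9,900 majority rows (0) and 100 minority rows (1).
df = pd.DataFrame({
    "x": np.arange(10_000),
    "y": np.r_[np.zeros(9_900, dtype=int), np.ones(100, dtype=int)],
})

# Keep majority rows at a rate that brings them down to roughly the minority count.
keep_frac = {0: 100 / 9_900}
balanced = undersample_partition(df, "y", keep_frac, seed=42)
counts = balanced["y"].value_counts()
```

With a dask DataFrame, a function of this shape could presumably be applied with `ddf.map_partitions(undersample_partition, "y", keep_frac)`, with the class counts needed for `keep_frac` taken from `ddf["y"].value_counts().compute()`. Note that a fixed seed gives every partition the same random stream, so a per-partition seed may be preferable. The result is only approximately balanced, and this does not cover SMOTE, which needs nearest-neighbour lookups across the data.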
